INFS3200 Advanced Database Systems - Individual Assignment: Database Schema, Data Warehousing and Data Integration of Book datasets
INFS3200 Advanced Database Systems
Individual Assignment (25%)
Semester 2, 2022
Introduction
In this assignment, you will complete four distinct tasks with a total of seven questions (totaling 25 marks) to demonstrate your understanding of multiple topics, including distributed databases, data warehousing, data integration, and data quality management. You will also be required to implement programs to demonstrate your problem-solving ability.
Warning: This is an individual assignment. You must not share your answers with anyone else. Sharing answers or plagiarism is unacceptable and will be penalized.
Preliminaries
Tips
- It is recommended that you complete Practical 3 before working on the coding part (that is, Part 4) of this assignment. Although the assignment is independent of the practicals, the code introduced in Practical 3 can be used as a starting framework for the tasks required in this assignment.
- The datasets used in this assignment contain thousands of records, so it can be difficult to manually check your answers record by record. We therefore recommend using tools such as a fast text editor, a spreadsheet program, or command-line tools like grep and awk to view, sort, and search the raw data as appropriate. Please be careful not to change any of the original data, as doing so may affect your results.
- You are free to implement your code in Java, Python, or directly as SQL scripts. The code must be readable and accompanied by appropriate comments, to facilitate easy marking.
Dataset Description
In this assignment, there are four datasets containing information pertaining to books. The data can be found on the LMS (Assessment > Individual Assignment > Assignment Specification and Data). These datasets are derived from different sources. Each dataset schema is outlined below:
Book1(id, title, authors, pubyear, pubmonth, pubday, edition, publisher, isbn13, language, series, pages)
Book2(id, book_title, authors, publication_year, publication_month, publication_day, edition, publisher_name, isbn13, language, series, pages)
Book3(ID, Title, Author1, Author2, Author3, Publisher, ISBN13, Date, Pages, ProductDimensions, SalesRank, RatingsCount, RatingValue, PaperbackPrice, HardcoverPrice, EbookPrice, AudiobookPrice)
Book4(ID, Title, UsedPrice, NewPrice, Author, ISBN10, ISBN13, Publisher, Publication_Date, Pages, Dimensions)
Part 1: Database Schema and Fragmentation [6 marks total]
Read the above schemas carefully and understand the meaning of the attributes. If you don't know the meaning of a certain attribute, inspect the data under it (within the data files) or look it up with a web search engine (especially for abbreviations such as ISBN).
Answer the following questions based on your understanding.
Question 1: [2 marks]
The four datasets are stored in one relational database as separate relations.
- Write an SQL query to solve the following problem: find the 5 books with the highest ratings and the 5 books with the lowest ratings; return their ranks (sorted in descending order), titles, publishers, and numbers of pages. (A generic sketch of this top-N-plus-bottom-N query pattern follows this question.)
- Which table schema(s) is/are used to answer the above query?
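The following is a minimal sketch (not the assignment answer) of the general "top-N plus bottom-N" query pattern, shown in Python with an in-memory SQLite database. The toy table and column names are illustrative only, not the Book3 schema.

import sqlite3

# Toy table with hypothetical columns, just to demonstrate the query shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (title TEXT, rating REAL)")
conn.executemany("INSERT INTO toy VALUES (?, ?)",
                 [("A", 4.9), ("B", 1.2), ("C", 3.7), ("D", 4.5), ("E", 2.0)])

# UNION ALL of two ordered, limited subqueries; the outer ORDER BY
# sorts the combined result in descending order.
query = """
SELECT * FROM (SELECT title, rating FROM toy ORDER BY rating DESC LIMIT 2)
UNION ALL
SELECT * FROM (SELECT title, rating FROM toy ORDER BY rating ASC LIMIT 2)
ORDER BY rating DESC
"""
for row in conn.execute(query):
    print(row)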
Question 2: [4 marks]
Suppose Book3 is stored in a distributed database A, and the two most frequent queries are:
· Find all books whose publisher name is [Some Name] (or is in a list of names); return their book titles and author info.
· Find all books that are published in a given year, and return their book IDs, languages, number of pages, HardcoverPrice, and EbookPrice.
Answer the following questions:
1. [2 marks] If the goal of database A is to handle each query via a dedicated local site (with no information needed from the other site(s)), which fragmentation strategy should be used to fragment the Book3 table? If only two fragments are generated, write their schemas (if vertically fragmented) or predicates (if horizontally fragmented), respectively. (Note: there are many valid fragmentation solutions; just provide one of them.)
2. [2 marks] Assume that we horizontally fragment the table into three fragments based on the following predicates:
◦ Fragment 1: RatingsCount ≤ 25
◦ Fragment 2: 25 < RatingsCount ≤ 120
◦ Fragment 3: RatingsCount > 125
Is this set of predicates valid?
· If so, please explain (using plain English) the insertion process if we want to insert a new record into Book3.
· If not, please generate a valid predicate set using minterm predicates (show the calculation process). Then, explain the insertion process for a new record after the valid predicate set is made. (A small validity-checking sketch follows this question.)
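As a starting point for reasoning about validity, here is a minimal sketch (assuming an integer-valued RatingsCount) that tests whether a set of horizontal fragmentation predicates is complete and disjoint over a range of sample values. It quotes the predicates from the question and leaves the conclusion to you.

# The three predicates quoted in the question, as Python functions.
predicates = {
    "F1": lambda c: c <= 25,
    "F2": lambda c: 25 < c <= 120,
    "F3": lambda c: c > 125,
}

for ratings_count in range(0, 201):      # sample RatingsCount values
    matches = [name for name, p in predicates.items() if p(ratings_count)]
    if len(matches) != 1:                # 0 matches => incomplete; >1 => overlapping
        print(f"RatingsCount={ratings_count} matched {matches}")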
Part 2: Data Warehousing [7 marks total]
In this section, we design a Data Warehouse on book sales based on the aforementioned Book1, Book2, Book3, and Book4 datasets. In particular, we need to use the data from the given datasets to create a Data Warehouse schema. The designed Data Warehouse will contain summary data, such as the total sales of each publisher, for each day and each language. The following table shows an example:
Question 3: Design a Data Warehouse Schema that can accommodate the above example, and then answer the following questions:
1. [1 mark] Show the schema and point out the dimensions and the fact table. Given that we have a dimension table for each dimension and there are 4000 records in the fact table: among all dimension tables and the fact table, which table has the most records? Explain your answer. (A sketch of one possible star schema follows this question.)
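For illustration only, here is a hedged sketch of one possible star schema for the summary data described above (total sales per publisher, day, and language). All table and column names are hypothetical, and many other valid designs exist.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_publisher (publisher_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_language  (language_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date      (date_id      INTEGER PRIMARY KEY, day  DATE);

-- Fact table: one row per (publisher, language, day) combination,
-- holding the summarized measure.
CREATE TABLE fact_sales (
    publisher_id INTEGER REFERENCES dim_publisher(publisher_id),
    language_id  INTEGER REFERENCES dim_language(language_id),
    date_id      INTEGER REFERENCES dim_date(date_id),
    total_sales  REAL
);
""")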
Question 4: Now we want to create bitmap indices for the given model:
- [2 marks] What are the advantages of building a bitmap index? Which type of column is not suitable for a bitmap index? Why?
- [2 marks] Suppose the Publisher column only contains four distinct values and Language only contains two, which are all shown in the above example. Please create bitmap indices for both Publisher and Language.
- [2 marks] Explain how to use the bitmap indices to find the total sales of Japanese books published by AAAI Press. (A toy illustration follows this list.)
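To make the mechanics concrete, here is a minimal sketch of building bitmap indices over two low-cardinality columns and intersecting them with a bitwise AND. The rows below are made up for illustration and are not the example table from the specification.

# Made-up rows: (publisher, language, sales)
rows = [
    ("AAAI Press", "Japanese", 120),
    ("MIT Press",  "English",   80),
    ("AAAI Press", "English",   50),
    ("AAAI Press", "Japanese",  30),
]

def bitmap(column_index, value):
    # One bit per row: 1 where the row's column equals the given value.
    return [1 if row[column_index] == value else 0 for row in rows]

pub_bits  = bitmap(0, "AAAI Press")
lang_bits = bitmap(1, "Japanese")
both = [p & l for p, l in zip(pub_bits, lang_bits)]     # intersect the bitmaps
total = sum(row[2] for row, bit in zip(rows, both) if bit)
print(both, total)                                      # [1, 0, 0, 1] 150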
Part 3: Data Integration [4 marks total]
Assuming that the data warehouse loads data from the above four sources (Book1, Book2, Book3, and Book4), you are asked to integrate their data and address various data quality issues. In this part, the database sources (i.e., their owners) only give you their schemas (shown in the Dataset Description section of the Preliminaries), and you are asked to design an integrated schema based on the given schemas. You must assume that the data records within tables Book1, Book2, Book3, and Book4 are not available to you at this stage; you only have access to the schemas.
Question 5: Define a global schema (using the Global-as-View approach) that can integrate data from all four sources. (A toy illustration of the GAV idea appears after this question.)
- [2 marks] Design a global schema that combines the common attributes of the four schemas. Your design should include any information that is represented in all four schemas. If an attribute cannot be found or derived in the given schemas, it should be left out of your global schema.
- [2 marks] Identify any structural or semantic heterogeneity issues that may occur during your integration. Using the data, provide a concrete example of each, if applicable, and outline a possible resolution.
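As a toy illustration of the Global-as-View idea (the global relation is defined as a view over the sources), here is a hedged Python/SQLite sketch. The two source tables and the global attribute list are simplified stand-ins, not a proposed answer.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src1 (id INT, title TEXT, isbn13 TEXT);       -- stands in for Book1
CREATE TABLE src2 (id INT, book_title TEXT, isbn13 TEXT);  -- stands in for Book2

-- GAV: the global schema is a view that maps each source onto the
-- global attribute names and unions the results.
CREATE VIEW global_book (source, id, title, isbn13) AS
    SELECT 'Book1', id, title,      isbn13 FROM src1
    UNION ALL
    SELECT 'Book2', id, book_title, isbn13 FROM src2;
""")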
Part 4: Data Quality [8 marks total]
Now assume you are provided with the actual data from each source, namely Book1.csv, Book2.csv, Book3.csv, and Book4.csv (see the Assignment-Data.zip file). Since it is very common for the same book to be recorded by different sources, it is crucial to identify redundant information by merging and eliminating duplicated records during the data integration process, which relies on data linkage techniques.
To assist you with your analysis, we provide a human-labelled gold-standard dataset (refer to Prac 3 Part 2.2 for more information about gold-standard data), named Book1and2_pair.csv, which lists all correct matches between Book1 and Book2. It will be used in the following tasks. Its schema is as follows:
Book1and2_pair(Book1_ID, Book2_ID)
All of the datasets are provided as CSV files and hence have their fields separated by commas. Thus, if two commas appear consecutively, the value of the corresponding field between them is NULL (absent). Furthermore, if an attribute value naturally contains a comma, the field is enclosed in double quotes (") to distinguish the commas inside the field from the comma separators outside it. An example record from Book2 is as follows:
1725,Informix Unleashed,"John McNally, Jose Fortuny, Jim Prajesh, Glenn Miller",97,6,28,1,Sams,9.78E+12,,Unleashed Series,1195
According to the Book2 schema, we can infer the following fields:
id                | 1725
book_title        | Informix Unleashed
authors           | John McNally, Jose Fortuny, Jim Prajesh, Glenn Miller
publication_year  | 97
publication_month | 6
publication_day   | 28
edition           | 1
publisher_name    | Sams
isbn13            | 9.78E+12
language          | NULL
series            | Unleashed Series
pages             | 1195
Here, since there are commas within the authors field, the whole field is enclosed in double quotes. Also, since there are two consecutive commas before Unleashed Series, the language field is NULL.
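Note that Python's csv module already implements the quoting rules described above, as the following minimal sketch shows using the example record:

import csv

line = ('1725,Informix Unleashed,"John McNally, Jose Fortuny, Jim Prajesh, '
        'Glenn Miller",97,6,28,1,Sams,9.78E+12,,Unleashed Series,1195')
fields = next(csv.reader([line]))
print(len(fields))        # 12 fields, matching the Book2 schema
print(fields[2])          # the quoted authors field, inner commas preserved
print(fields[9] == "")    # the empty language field is read as '' (NULL)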
In this part, you are asked to answer the following questions by writing code to complete the tasks (where "code required" is specified) and to provide answers based on the results of your programs. Make sure you keep a copy of your code, as you will be required to submit it along with the report; refer to the Submission Format section for specific instructions on packaging your code for submission.
Question 6: Sample records from Book3.csv to measure its data quality:
- [1 mark, code required] By sampling records whose id is a multiple of 100 (that is: 0, 100, 200, 300, ...), how many records are there in the sample set?
- [1 mark, code required] Among the samples found in the previous question (Question 6.1), how many fields containing NULL values are present?
- [2 marks] Calculate the defects per million opportunities (DPMO) according to your samples. You can assume that any NULL value is an error and that the remaining values are valid. Hint: you can sample the records manually to validate the correctness of your program results. (A sketch of one possible approach follows this question.)
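For orientation, here is a hedged sketch of one possible approach (not the reference solution): sample the rows whose ID is a multiple of 100, treat every field as an opportunity and every NULL (empty) field as a defect, and compute DPMO = defects / opportunities × 1,000,000. It assumes the first CSV column is the record ID and that the file has a header row; check both assumptions against the actual data.

import csv

defects = opportunities = 0
with open("Book3.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)                                # assumed header row
    for row in reader:
        if int(row[0]) % 100 == 0:              # sample: id is a multiple of 100
            opportunities += len(row)           # every field is an opportunity
            defects += sum(1 for v in row if v == "")

dpmo = defects / opportunities * 1_000_000
print(defects, opportunities, round(dpmo))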
Question 7: Perform data linkage on Book1 and Book2 using the methods shown in Practical 3. Given two author strings from Book1 and Book2 that refer to the same author list:
s1 = “Peter Rob, Carlos Coronel”
s2 = “Carlos Coronel;Steven Morris;Peter Rob;”
- [1 mark] Compute the edit distance between s1 and s2. What is the value?
- [1 mark] Compute the Jaccard distance between s1 and s2. What is the value?
- [2 marks, code required] Write a program to link the data between Book1 and Book2 using the Jaccard coefficient with 3-gram tokenization as the similarity measure, performing the comparison only on the book title field and using a matching threshold of 0.75 (that is, only return matches with a similarity of 0.75 or more). Compare your output with the gold-standard dataset and compute the precision, recall, and F-measure. (Sketches of the two similarity measures follow below.)
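The following are hedged sketches of the two measures named in this question. The exact preprocessing and tokenization used in Practical 3 may differ (for example, word tokens rather than character 3-grams for author strings), so treat these as starting points rather than the prescribed method.

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic programming over a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def jaccard_3gram(a: str, b: str) -> float:
    # Jaccard coefficient over sets of character 3-grams
    # (the Jaccard distance is 1 minus this value).
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(ga & gb) / len(ga | gb)

s1 = "Peter Rob, Carlos Coronel"
s2 = "Carlos Coronel;Steven Morris;Peter Rob;"
print(edit_distance(s1, s2), jaccard_3gram(s1, s2))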