Assignment 2 – Data Warehousing and Integration
Assessment Due Date
Weighting: 25%
Due Date: 23:59 Wed 5th of Jun
Group or individual assignment.
Submission
Submit a zip file to contain the following files
people.txt
mydblp.xml
report.docx
extract.xql
transform1.xql
Assignment Overview
This assignment aims for you to apply your knowledge of XML, Json, data warehousing and integration.
You complete the assignment in a group of 2 students whom you choose yourself. If you decide to do it individually, you must complete all parts. Please complete the people.txt file by adding the usernames, names, and the responsible parts of group members before you start the work. You can backup your work by sending the report periodically to an email run by a third-party company like outlook or gmail.
If you do the assignment in a group, only one of you submits.
The assignment is reasonably open. As a result, justification in decision points is important. Make sure that your writing is logical and cohesive so that your answers are not misunderstood.
Assignment Tasks
Task 1: Retrieve data. (1 mark)
Retrieve all publications of a major type and a minor type as defined below from dblp.xml:
Major types:
A: article
B: inproceedings
Minor types:
1: mastersthesis
2: phdthesis
3: www
4: proceedings
5: book
6: incollection
From the major type choose A if the last digit of your student ID (like 110123123) is an even number (including 0) and B if the last digit is an odd number.
From the minor type, choose n = (d % 6)+1 where d is the last digit of your student id and n is the number against the type you should choose. For example, if the last digit is 8, n=(8%6)+1=2+1=3, the student should use www as his/her minor type.
Your code to do this should be in extract.xql.
The retrieved data should be in mydblp.xml.
Task 2: Design a star schema for publication analyses (8 marks)
a) Publication data can be analyzed for different purposes and in different ways to quantitively measure the amount of research outputs of organizations and/or individuals, journal popularity and impact, hot research directions etc.
Design a star schema for this purpose. The analysis subject and its metric are given. The subject is publication performance, and the metric for the subject is the number of publications. You design dimension tables for the subject based on your mydblp.xml.
Describe the fact table and dimension tables in the following format.
facttable(SK1,Sk2,…,Skn [, kpi])
constraints like keys and foreign keys
dim1(SK1, … )
constraints
…
dimn (SKn,…)
constraints
In the above schema, ‘SK’ means surrogate key, ‘…’ means more, kpi means a metric of the number of publications. [,kpi] means kip is optional. When kpi is not appearing, you must use an aggregate function to derive a metric, and this definition must be listed between the fact table schema and its constraints. Please replace the place holders for table names and attributes with your own.
b) For each dimension table, describe how the dimension is helpful in an analysis.
Here are some notes about dimension tables: If a dimension table does not support any analysis, it is not needed. Time must be represented in one of the dimensions. A dimension should model an entity type and an entity type should be modelled in one dimension table only. For example, you should not put the author information and journal information in one dimension.
c) Create your fact table and dimension tables in an oracle database with all keys and foreign keys defined. Please note that your table and attribute names mut be consistent in the design and in the implementation. Add table creation statements to the report.
Display the schema of each table by ‘select * from tablename;’, take a screenshot of only the schema line as below and add it to the report after the table name as shown in the following.
tablename1:
tablename2:
Display the keys and foreign keys of all the tables.
By the state ‘select owner, table_name, constraint_type, constraint_name, status from user_constraints;’
Take a screenshot of the output including the schema line and data records and add the screenshot to the report.
Task 3: Transform XML data and load to the tables. (8 marks)
a) You transform the data in mydblp.xml and load it into the warehouse tables in the oracle database server. This part is up to your freedom. You can do it by generating csv and then uploading csv, or by a program transforming and directly inserting data to oracle.
Your code for this part should be named as ‘transform?.*’ where ‘?’ stands for a step number and ‘*’ is your code extension. An example of your code file can be tranform1.xql. The code files must be submitted as part of the zip.
In the report, add summaries for these transformation files and, if applicable, for all manual processes. For each file, the summary should include the aim, the input file, main processes, and the output. For a manual process, you describe the main steps.
Take a screenshot of top 5 records of each table from oracle and include them in the report.
b) Use two example publication entries, one per type, to justify that your processes,
preserve all values, and
preserve all one-to-many relationships.
Data from tables should be presented in screenshots, and the screenshots must be minimal.
c) Data for an analysis, often called a dataset, is a table from some joins of relevant dimensional tables and the fact table. This implies that the data in the warehouse is often modelled differently from the data for an analysis (e.g., multiple dimensional visualization).
Write an SQL query to generate the dataset for one of the analyses you describe in Task2.
Include the query and a screenshot of the schema and the first five lines of output in the report.
Task 4: Integrate more data to the warehouse. (7 marks)
For this task, you present a possible data collection and integration idea for the warehouse. The idea must be presented with logical details for practical scrutiny which will be defined in the section. The answer for this part is very open.
Data warehouse data often needs to be integrated from multiple sources. The data from dblp.xml does not support analyses against author institutions and journal impact data. We aim to find more data from other sources. The newly identified data can be additional columns to existing tables or a new dimension table.
This task requires you to find a data source that can be integrated into the warehouse to support more analyses. You can use any source such as the internet (such as http://scholar.google.com for paper citation and author search, https://www.sciencedirect.com/search for journal quality search), or organizational webpages. The data is often unstructured. Show the following items in the report.
a) An example of data from some source(s) for two data records, which can be presented as small screenshots. Data sources must be presented together with the example.
b) The data ready for integration from the source data after transformation, which would be a small table with two records. These records will be added to a table in the current star schema.
c) Assumptions and processes used for identifying and extracting the source data and for transforming the source data in Item (a) to data ready for integration in Item (b). This part is very important as it validates whether an idea is practical. You do not need to write code, by the description must be logical.
d) The purpose of new data. Does it extend an existing dimension or add a new dimension? Present the updated schema for the relevant table.
e) The way in which the new data records will be linked/matched to existing records.
f) The challenges for Item 3) to be automated.
The core of this part is two data records and processes for making them ready for integration. You do not need to write code, but your descriptions must be cohesive and tight in logical. If they are not logical, the reader will not be able to follow them to repeat your steps. If they are not cohesive, the reader will not get what you want to deliver.
Marking Criteria
All code submitted should run.
Design decisions must be justified.
Writing must be logical and cohesive.
Plagiarism
If your solution, or part of it, is not written by you (your idea and your own typing), you commit plagiarism. The investigation will involve an oral or written test. If you commit plagiarism, you will be penalised and a record will be kept against you.
Extensions
Extensions for assignments are available under the following conditions
· permanent or temporary disability, or
· compassionate grounds
In all cases, documentary evidence (e.g. valid medical certificate, road accident report, obituary) must be presented to the Course Coordinator. A medical certificate produced on or after the due date will not be accepted unless you are hospitalized.
If you apply for extension within 24 hours before the deadline, you must see the course coordinator in person unless you are in an emergency situation like being admitted in a hospital.
Late Penalties
Unless you have an extension, late submission will incur a penalty of 30% deduction per day (or part of it) of lateness.