CourseNana | COMP62421 Querying Data on the Web Coursework Manual

COMP62421 Querying Data on the Web Coursework Manual
CourseNana.COM

On Assigned Readings CourseNana.COM

These papers provide background information of relevance to the material in the unit. In addition, the contents of these papers may be examined in the quizzes. CourseNana.COM

Week 1
Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: CourseNana.COM

34-43 http://doi.acm.org/10.1145/275487.275492 CourseNana.COM

Week 2 CourseNana.COM

Craig Brown, Xavier Franc, Michael Paddon, Native XML Databases: Death or Coming of Age? XMLPrague 2015 Conference Proceedings, pp. 107-119. http://archive.xmlprague.cz/2015/files/xmlprague-2015-proceedings.pdf [pp. 107-119 only] CourseNana.COM

Week 3
Olaf Hartig, Andreas Langegger: A Database Perspective on Consuming Linked Data on the CourseNana.COM

Web. Datenbank-Spektrum 10(2): 57-66 (2010). http://dx.doi.org/10.1007/s13222-010-0021-7 Week 4 CourseNana.COM

Jeffrey D. Ullman: Designing Good MapReduce Algorithms. ACM Crossroads 19(1): 30-34 (2012) http://doi.acm.org/10.1145/2331042.2331053 CourseNana.COM

On Coursework CourseNana.COM

There are two types of coursework in COMP62421: quizzes and lab work. CourseNana.COM

On Quizzes CourseNana.COM

Quizzes are done synchronously, and as a result are timetabled. CourseNana.COM
The topics for a quiz on a given teaching week consist of the material in the assigned articles for reading in the previous weeks, as well as all the taught content thus far in the course unit (and therefore not just in that specific teaching week). CourseNana.COM
There are 4 quizzes, which run in Weeks 2 to 5. On Duration, Question Types, and Consultation CourseNana.COM

Quizzes last 25 minutes. CourseNana.COM
They consist of five multiple choice style questions. CourseNana.COM
Incorrect answers do not subtract from the total. CourseNana.COM

On Lab Work CourseNana.COM

Lab work is associated with timetabled lab sessions. CourseNana.COM
The topics for lab work consist of application and exploration of the taught content in the course unit. CourseNana.COM
All the lab work can be carried out on departmental computers or on your own machines, though the descriptions are generally targeted at and have been tested on unix. CourseNana.COM
The lab machines have sqlite3 installed; if you want to do the labs that use sqlite3 (RW, QO) on a laptop, you may have to install SQLite (https://www.sqlite.org/download.html). CourseNana.COM
The software used in the unit uses licenses that will be suitable for public use, so you can download and install software to your personal machine. However, be aware that we may struggle to support students with machine-specific issues. CourseNana.COM

Page 2 of 14 CourseNana.COM

COMP62421 Querying Data on the Web Coursework Manual CourseNana.COM

On Lab Work’s Contribution to the Course Unit Mark CourseNana.COM

There are three pieces of lab work, RQ (Week 1), QO (Weeks 2-3) and SP (Weeks 4-6). CourseNana.COM
The coursework mark is subdivided as follows: CourseNana.COM
- Quizzes: 10% CourseNana.COM
- Lab RQ: 0% (so this is a formative assessment – you will be given the solutions). CourseNana.COM
- Lab QO: 40% CourseNana.COM
- Lab SP: 50% CourseNana.COM
The coursework marks, as a whole, contribute 50/100 marks towards the final course unit mark. CourseNana.COM
The remaining 50/100 comes from the exam. CourseNana.COM

On Deadlines CourseNana.COM

The deadlines on the lab work are as follows: CourseNana.COM

QO: Week 4, Thursday at 18:00 CourseNana.COM
SP: Week 6, Friday at 18:00 CourseNana.COM

On What Your Submission Consists of and Where to Upload it CourseNana.COM

The submission is always a report in PDF format. This must be such that we can open it and read it to mark it. Anything that prevents us from doing so cannot be compensated. You should upload the report associated with each submission as a pdf document called: CourseNana.COM

<username>_COMP62421_LW_Report CourseNana.COM

On Preparing a Report File CourseNana.COM

If your assumptions are convincing (i.e., if they are both made in response to genuine uncertainty and incompleteness and are consistent with all the information explicitly given), we will take them into account in the marking. CourseNana.COM

Lab Work RQ
CourseNana.COM

There is material provided on Blackboard to support the execution of SQL (using SQLite) and Relational Algebra (using RA). All queries are to be written over the Mondial database, which is provided in the download. CourseNana.COM

Task 1: Write tuple-relational calculus (TRC) expressions that, upon evaluation, return the data characterized by each of the following English-language specifications: CourseNana.COM

(1) Return the name of any country that has a lake. CourseNana.COM
(2) Return all the available attributes on cities whose population is between 3 and 5 million inhabitants. CourseNana.COM
(3) Return the country code and the continent of every country not in Europe or in Australia/Oceania. CourseNana.COM
(4) Return the names of countries that also give their name to one of its own provinces. CourseNana.COM
(5) Return the names of countries that are not landlocked (i.e., have a sea coast). CourseNana.COM

Task 3: Write SQL expressions that, upon evaluation, returns the data characterized by each of the following English-language specifications. Use duplicate removal where appropriate (e.g., when a duplicate is not required in the intended answer). CourseNana.COM

(12) Return the names of up to 10 countries and the value corresponding to half the country’s population. CourseNana.COM
(13) Return all the information available in the City table about cities whose name is Manchester. CourseNana.COM
(14) Return the name of cities whose name starts with the substring 'Man’. CourseNana.COM
(15) Return the name of the country, and the names of the organizations of which the country is a member, for countries with Buddhist populations and organizations established after 1st December 1994. CourseNana.COM
(16) Return the name of each country with the number of islands in it. CourseNana.COM

Some Comments CourseNana.COM

In all the queries above, do not get bogged down by the complexity of the Mondial encoding of geographic properties. Mondial is not a spatial database and lacks spatial types (and corresponding spatial operations). CourseNana.COM

For Task 1, you are expected to submit the TRC expressions only. You are not required to run the query against the data as there is no easily available TRC evaluator to use, but if you want to ensure your expression computes the intended result, consider mapping the expression into an iterative computation in your favourite language. CourseNana.COM

Page 4 of 14 CourseNana.COM

COMP62421 Querying Data on the Web Coursework Manual CourseNana.COM

For Task 2, you should use the RA software1 to evaluate your expressions. We propose using the command line options to read the input from a file and to redirect the output to a file, as described in the download that is available on Blackboard. Note that RA is dependent on the SQLite DBMS. CourseNana.COM

For Task 3, you should use SQLite to evaluate your expressions. CourseNana.COM

Software/Data CourseNana.COM

A folder is provided in Blackboard that gives you access to relevant data and scripts. CourseNana.COM

On Scripting, Echoing and Logging CourseNana.COM

We note that familiarity with SQLite will be useful for Coursework QO. To give an example using SQLite, let's suppose that the query to be evaluated is: CourseNana.COM

Return all the information about any 4 countries CourseNana.COM

The answer to this is: CourseNana.COM

select * from country limit 4;
You would then compose a script, let's call it Atan_Luring_LW1.sql (note the suffix), that, in this simple case, CourseNana.COM

would be as follows: CourseNana.COM

-- Atan_Luring_LW1.sql -- CourseNana.COM

     -- set echoing on
     .echo on

-- set spooling out: note the suffix .output Atan_Luring_LW1.log CourseNana.COM

--
select * from country limit 4; CourseNana.COM

You can then start SQLite (using the sqliterc initialization file we provided you with to normalize columns width, etc., and run the script with the SQLite read <filename> command, where <filename> in the example above would be Atan_Luring_LW1.sql. CourseNana.COM

sqlite3 -init sqliterc mondial.db < Atan_Luring_LW1.sql
On successful completion, the file called (in this example) Atan_Luring_LW1.log would have the following CourseNana.COM

content (modulo formatting): CourseNana.COM

select * from country limit 4; CourseNana.COM

Name Code ---------- ---------- Albania AL
Greece GR Macedonia MK
Serbia SRB CourseNana.COM

Capital
----------
Tirana
Athina
Skopje
Beograd

Province
----------
Albania
Attikis
Macedonia
Serbia

Area Population ---------- ---------- 28750 2800138 131940 10816286 25333 2059794 77474 7120666 CourseNana.COM

In the case of the RA software:
java -jar ra.jar mondial.properties -i Atan_Luring_LW1.ra -o Atan_Luring_LW1.log CourseNana.COM

runs the RA executable (ra.jar), taking as input the query in the file Atan_Luring_LW1.ra and writing the output to Atan_Luring_LW1.log. The homepage of RA is: CourseNana.COM

https://users.cs.duke.edu/~junyang/ra2/ CourseNana.COM

1 https://users.cs.duke.edu/~junyang/ra2/ CourseNana.COM

Page 5 of 14 CourseNana.COM

COMP62421 Querying Data on the Web Coursework Manual CourseNana.COM

Mondial CourseNana.COM

You can find documentation about Mondial in its website2. CourseNana.COM

SQLite
For SQLite, the documentation on the website is comprehensive, but if you prefer learning from books, you CourseNana.COM

have free access (from an UoM IP address) to this one: CourseNana.COM

The Definitive Guide to SQLite
Grant Allen, Mike Owens
ISBN: 978-1-4302-3225-4
Apress, 2010 http://link.springer.com/content/pdf/10.1007%2F978-1-4302-3226-1.pdf CourseNana.COM

Lab Work QO
CourseNana.COM

This lab involves an experimental investigation into the SQLite query optimizer. Before embarking on the CourseNana.COM

experiments it will be useful to refer to the SQLite documentation: https://www.sqlite.org/docs.html CourseNana.COM

and in particular, this section: CourseNana.COM

https://www.sqlite.org/optoverview.html CourseNana.COM

Consider, also, studying this section in detail: CourseNana.COM

https://www.sqlite.org/queryplanner-ng.html CourseNana.COM

Task 1: Write three pairs of semantically-equivalent SQL queries against the Mondial database such that: CourseNana.COM

in each query pair, either the two queries are syntactically distinct (but semantically equivalent, i.e., they CourseNana.COM

return the same result set) and one query in the pair is optimizable by SQLite whereas the other is not; CourseNana.COM
or else the two queries are syntactically identical but are run under different conditions (e.g., once CourseNana.COM

without access to an index, and again with access to an index) CourseNana.COM

so that, as a result, we would expect the response times in evaluating the queries in the same pair to be different. Essentially, we want to know by how much, and why. CourseNana.COM

You must: CourseNana.COM

produce the query pair, CourseNana.COM
explain which optimization/evaluation strategy you have in mind for each query in the pair, CourseNana.COM
record the execution times for each query in each pair, CourseNana.COM
discuss any execution time difference. CourseNana.COM

You must use SQLite to evaluate your expressions and obtain results for your submission.
Task 2: Summarize your investigation with a plot, accompanied by interpretation and comment, that CourseNana.COM

evaluates the benefits/shortcomings of using the SQLite optimizer and evaluator. Software/Data
A folder is provided in Blackboard that gives you access to relevant data and scripts. Some Comments CourseNana.COM

We expect you to study the documentation, i.e., to find out about facts, notions, ideas, techniques, policies and heuristics yourself. We want to gauge how much you have learned about query optimization and execution. Some topics (like the use of indexes) haven’t been covered in detail in the course, but do some research. CourseNana.COM

By all means, ask questions in the lab session and in the discussion board, but we are asking you to think and reflect and ask yourself questions, propose some answers, and verify whether your proposed answers are correct or not. CourseNana.COM

Note, to start with, that the goal is not for you to try and create complex (in the sense of challenging, or even tricky) queries per se, but rather to study the SQLite optimizer and identify cases where it can make a difference in potentially speeding up a query and cases where it cannot. If you read them from this course unit’s viewpoint, the documentation on the SQLite optimizer (sometimes implicitly) describes conditions in which the optimizer can make an impact and conditions in which it cannot make an impact. CourseNana.COM

To get you started, study the conditions under which the optimizer may or may not be able to use an index. Start by considering whether there is a substantive difference in the amount of data that would be processed if the index were used or not. Also, in the case of indexes, note that you may need to create the index yourself and run the query, then drop the index and run the query again. This is the special case in Note, however, that if you were to produce three pairs of queries that are simply instances of the same hypotheses (e.g., '"The availability of indexes allows the optimizer to generate more efficient plans.") you would lose out on some marks. Contrast the case above with one in which you explore different optimization strategies that are based on indexing (e.g., one might be the case of complex predicates that prevent the optimizer from taking advantage of indexes, another the case where the availability of an index allows the optimizer to select a more efficient physical operator, etc.). In this example, the hypotheses are different, and the corresponding queries aim to trigger different behaviours in the optimizer, and hence merit more marks. CourseNana.COM

Your submission will be a report containing the pairs, the explanation of the optimizations you have in mind, and the plots with interpretation and comment. CourseNana.COM

Task 1: Query Analysis:
The report should be structured with a section for each pair of queries that includes: CourseNana.COM

The pair of queries. The queries should be in a form that can easily be copied and pasted into a script, in case we want to run it. CourseNana.COM
An explanation that describes the features of the SQLite optimizer and evaluator that are being investigated, and the impact of these on the times observed. CourseNana.COM
Evidence that has been collected that the pair of queries provide different execution times. Task 2: Reflection: CourseNana.COM

At least one plot that provides information on the approaches to evaluation pursued for Task 1. CourseNana.COM
A discussion of the plot, the information that was used in the experiment, what can and can’t be CourseNana.COM

observed from this information, and what other information might be useful. CourseNana.COM

Thus, the discussion in Task 1 relates to the specific queries, and the discussion in Task 2 seeks to draw more general lessons from your overall experience, in the light of the information in the plot.The marks are allocated 75% to Task 1 and 25% to Task 2. The total length of the report should be no longer than 2000 words. CourseNana.COM

On Plotting, Interpreting and Commenting on Experimental Results CourseNana.COM

You are doing a study in which you measure the response time of queries. So, what goes in the Y axis? And you're plotting the response time for various queries, which come in pairs. So, what goes in the X axis? In other words, think about this as an opportunity for you to exercise/acquire basic transferable skills in doing empirical explorations of system performance that might be useful in non-database settings. CourseNana.COM

To throw further light on what you’re expected to do when you’re asked (above) to explain the optimization you have in mind, the starting point for you is to think of these tasks as being grounded on the notion of hypothesis testing. CourseNana.COM

Your comments should avoid simply reading the values off the graphs. For example, don’t simply write In the first pair, the non-optimized query took 3.5ms. CourseNana.COM

the reader can do that without your help (unless you plot is poorly presented, which is another, serious, matter). CourseNana.COM

When reporting times, you should provide information about the setup of the experiment, including the database size, the software version, and information about the environment used. For example, here’s a pretty complete specification of a (hypothetical) underlying hardware/software environment: CourseNana.COM

These measurements were taken in a MacBook Air, 1.8 GHz Intel Core i7 with 4GB 1333 MHz DDR3 with 251 GB Flash Storage running OS X Yosemite V 10.10.5 and SQLite version 3.8.11.1 2015-07-29 20:00:57. CourseNana.COM

You may not have as much information, but provide as much as you can on the CPU, the primary memory, the secondary memory, the operating system and the DBMS you’re using. CourseNana.COM

Sometimes, a lot of time may be spent in printing results. For exercises in which the result is not of special interest, only the timings are, you should switch off the echoing in your script. CourseNana.COM

Furthermore, because of hot-cold effects (i.e., full v. empty buffers, on-the-fly indexing, etc.), when taking measurements of query response time, it is good practice to (close and re-)open the database anew before running each query, and then run each query four times at least, discarding the first and averaging the last three to obtain the value you plot. CourseNana.COM

If running on a public machine, to check whether there are other users logged into a machine, and thus potentially interfering with measured times, log into it and issue the who command on a shell: it lists the login names of every used logged into the machine. CourseNana.COM

When explaining how results have been obtained, it may be useful to use the SQLite explain query plan command4. The CourseNana.COM

.eqp on
command switches on explanation, and will provide information on how a query is to be evaluated. For example: CourseNana.COM

sqlite> .eqp on
sqlite> select * from country limit 4; QUERY PLAN
`--SCAN TABLE country CourseNana.COM

Lab Work SP CourseNana.COM

In relational databases, the content of every table must conform to an explicitly declared schema. In semi- structured databases, including RDF stores, however, this is not the case. CourseNana.COM

The overall, high-level goal of this lab work is for you to gain an insight into some of the consequences ensuing from the lack of schema constraints over RDF stores. In particular, if one is asked to write a set of SPARQL queries to achieve a certain goal, some questions arise, such as: CourseNana.COM

How does one go about understanding what data is there, and what can we ask of it? CourseNana.COM
How much time is spent finding out? CourseNana.COM
Does this make the task of writing queries harder? CourseNana.COM
Is it more likely that you will spend longer debugging the queries before they work? CourseNana.COM

In terms of information content, your goal is to create a set of SPARQL queries over DBpedia that, for a country in the relational version of the Mondial database, generates the same information (excluding the geographical aspects such as lakes, mountains, seas, rivers, etc., but including, if possible continent, country, city, province, organization, ethnic group, language and religion ) as is available in Mondial. CourseNana.COM

In other words, you will: CourseNana.COM

study again the relational version of Mondial, CourseNana.COM
focus on the non-geographical information that you can retrieve (basically using selection- CourseNana.COM

projection-join-aggregation), as shown in the mondial-abh.pdf referential dependency diagram, CourseNana.COM

then CourseNana.COM
set yourself the target of obtaining that information from DBpedia using SPARQL queries. CourseNana.COM

Note that we will not be strict about minor variations (e.g., the names given to countries, the populations of a city, and similar cases), minor inconsistencies (e.g., the province to which a city belongs, and similar cases), and minor examples of incompleteness (e.g, some missing cities or missing properties, and similar cases). CourseNana.COM

Note also that there is a version of Mondial in RDF, but you will not use it. You will try and create, from DBpedia and using SPARQL, the information content of the relational Mondial restricted to what you can obtain by querying country, City, etc, and ignoring geographical information (i.e., information about geographical features and concepts such as lakes, rivers and islands). CourseNana.COM

However, creating the set of queries is not your main task. Moreover, the task for which most marks are available is not to get the set of SPARQL queries absolutely right. CourseNana.COM

Your main task is to write a report about your experience of writing the set of SPARQL queries. In other words, the main goal, really, is for you to record, reflect and report on your experience of writing the SPARQL CourseNana.COM

To what extent did you succeed in writing a set of SPARQL queries that gathers from DBpedia the same information that is available through relational Mondial? CourseNana.COM

Note that we are not expecting you to be 100% effective: you may not be able to retrieve all the information, your set of SPARQL queries may be incomplete. CourseNana.COM

Also note that, here, you don’t need to worry about single tuples, single values, single pieces of information. We’re only interested in questions such as: Can we find the provinces of a country using SPARQL against DBpedia? Note that we don’t necessarily mean all the provinces of all the countries. CourseNana.COM

By CourseNana.COM

we mean an answer to the following questions: CourseNana.COM

• How much time did it take you to explore of the information content of the DBpedia using SPARQL queries that try and gauge what is available in that resource? CourseNana.COM

• How much time did it take you to discover how to write the set of SPARQL queries you have succeeded in writing? CourseNana.COM

Again, note that we are not asking you to be extremely efficient: it may take you quite long to even begin to write the first SPARQL query (most probably, one that tells you which are the countries in DBpedia). CourseNana.COM

You don’t lose marks if it takes you very long. You gain marks for explaining what the problem was that prevented you, be it a problem with DBpedia, be it a problem of complexity (e.g., you ran out of time before the submission deadline), be it a problem of inadequacy of SPARQL, be it the lack of a schema, or yet something else. CourseNana.COM

Also note that, here, you don’t need to worry about very detailed timekeeping. Just keep a timesheet of the amount of hours and minutes (down to 5 minutes accuracy) that you spend in this lab. CourseNana.COM

For example, you may have several rows saying something (slightly more specific perhaps than) CourseNana.COM

Task 1: Write a set of SPARQL queries for the task described above, compiling a timesheet as you do, and writing comments on the efficiency and effectiveness of your work on the task.
CourseNana.COM

The queries will be typed, tested and executed in the landing page for the DBpedia SPARQL endpoint: CourseNana.COM

http://dbpedia.org/sparql CourseNana.COM

You will find it useful to also consult the DBpedia Ontology directly: CourseNana.COM

http://mappings.dbpedia.org/server/ontology/classes/ CourseNana.COM

We expect that you will use SPARQL queries for two related, but distinct, purposes: one purpose is to explore how the information content of Mondial is (or is not) made available in DBpedia (call these exploratory queries), the other is to actually retrieve data from DBpedia that corresponds to data you might have retrieved from Mondial using SQL or XQuery (call these retrieval queries). Note that, in the absence of a schema, one needs to use the former kind of SPARQL query to obtain the information needed to write the latter kind of SPARQL query. CourseNana.COM

Task 2: Write a report that includes the set of queries produced in Task 1, and, using the timesheet and comments compiled during Task 1, assesses the impact that the schema-less nature of SPARQL/RDF had on the effectiveness and efficiency of your work on Task 1. CourseNana.COM

Some Comments and Tips Using exploratory queries CourseNana.COM

By exploratory query is meant one that essentially is trying to explore (a sample of) what subjects, predicates and objects there are. One query that might get you started is the following: CourseNana.COM

SELECT DISTINCT ?C
WHERE {?C a dbo:Country} CourseNana.COM

Try it. You would expect this to be a list of countries, right? But while indeed resources representing countries are returned, e.g. CourseNana.COM

http://dbpedia.org/page/Brazil CourseNana.COM

we also get information that is not about a country per se, e.g. http://dbpedia.org/page/Captaincies_of_Brazil CourseNana.COM

(These are just illustrations! You don’t need to stop and understand the resources the links above point you to. It is just so that you begin to get in thinking mode!) CourseNana.COM

We expect that a significant part of your time (but prove us wrong!) will be spent exploring the content of DBpedia this way. You are asked to capture and report on the process of exploring DBpedia. CourseNana.COM

Being aware of the namespace prefixes you can use CourseNana.COM

In the landing page, the Tables drop down menu links to the predefined Namespace Prefixes; you should explore this. CourseNana.COM

Using 'Results Format' to help exploration CourseNana.COM

Remember that if you use HTML as 'Results Format’, you are given links that you can click on and explore more. For example, the query above returns a table with HTML links. If you click on CourseNana.COM

http://dbpedia.org/ontology/Place CourseNana.COM

you will be taken to a description of the concept Place in the DBpedia Ontology. This may tell you more about the data and how to use it to achieve your goal. There are other options than simple HTML. You should explore your options. CourseNana.COM

Using LIMIT
Be careful to use LIMIT otherwise you’ll get a flood of results. On the other hand, you may need to increase CourseNana.COM

LIMIT or remove it altogether to find what you’re looking for. CourseNana.COM

COMP62421 Querying Data on the Web Coursework Manual

Get in Touch with Our Experts