COMP62421
Querying Data on the Web
Coursework Manual
On Assigned Readings
These papers provide background information of relevance to the material in the unit. In addition, the contents of these papers may be examined in the quizzes.
Week 1
Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998:
34-43 http://doi.acm.org/10.1145/275487.275492
Week 2
Craig Brown, Xavier Franc, Michael Paddon, Native XML Databases: Death or Coming of Age? XMLPrague 2015 Conference Proceedings, pp. 107-119. http://archive.xmlprague.cz/2015/files/xmlprague-2015-proceedings.pdf [pp. 107-119 only]
Week 3
Olaf Hartig, Andreas Langegger: A Database Perspective on Consuming Linked Data on the
Web. Datenbank-Spektrum 10(2): 57-66 (2010). http://dx.doi.org/10.1007/s13222-010-0021-7 Week 4
Jeffrey D. Ullman: Designing Good MapReduce Algorithms. ACM Crossroads 19(1): 30-34 (2012) http://doi.acm.org/10.1145/2331042.2331053
On Coursework
There are two types of coursework in COMP62421: quizzes and lab work.
On Quizzes
-
Quizzes are done synchronously, and as a result are timetabled.
-
The topics for a quiz on a given teaching week consist of the material in the assigned articles for reading in the previous weeks, as well as all the taught content thus far in the course unit (and therefore not just in that specific teaching week).
-
There are 4 quizzes, which run in Weeks 2 to 5. On Duration, Question Types, and Consultation
-
Quizzes last 25 minutes.
-
They consist of five multiple choice style questions.
-
Incorrect answers do not subtract from the total.
On Lab Work
-
Lab work is associated with timetabled lab sessions.
-
The topics for lab work consist of application and exploration of the taught content in the course unit.
-
All the lab work can be carried out on departmental computers or on your own machines, though the descriptions are generally targeted at and have been tested on unix.
-
The lab machines have sqlite3 installed; if you want to do the labs that use sqlite3 (RW, QO) on a laptop, you may have to install SQLite (https://www.sqlite.org/download.html).
-
The software used in the unit uses licenses that will be suitable for public use, so you can download and install software to your personal machine. However, be aware that we may struggle to support students with machine-specific issues.
Page 2 of 14
COMP62421 Querying Data on the Web Coursework Manual
On Lab Work’s Contribution to the Course Unit Mark
-
There are three pieces of lab work, RQ (Week 1), QO (Weeks 2-3) and SP (Weeks 4-6).
-
The coursework mark is subdivided as follows:
-
Quizzes: 10%
-
Lab RQ: 0% (so this is a formative assessment – you will be given the solutions).
-
Lab QO: 40%
-
Lab SP: 50%
-
-
The coursework marks, as a whole, contribute 50/100 marks towards the final course unit mark.
-
The remaining 50/100 comes from the exam.
On Deadlines
The deadlines on the lab work are as follows:
-
QO: Week 4, Thursday at 18:00
-
SP: Week 6, Friday at 18:00
On What Your Submission Consists of and Where to Upload it
The submission is always a report in PDF format. This must be such that we can open it and read it to mark it. Anything that prevents us from doing so cannot be compensated. You should upload the report associated with each submission as a pdf document called:
<username>_COMP62421_LW_Report
On Preparing a Report File
If your assumptions are convincing (i.e., if they are both made in response to genuine uncertainty and incompleteness and are consistent with all the information explicitly given), we will take them into account in the marking.
Lab Work RQ
There is material provided on Blackboard to support the execution of SQL (using SQLite) and Relational Algebra (using RA). All queries are to be written over the Mondial database, which is provided in the download.
Task 1: Write tuple-relational calculus (TRC) expressions that, upon evaluation, return the data characterized by each of the following English-language specifications:
-
(1) Return the name of any country that has a lake.
-
(2) Return all the available attributes on cities whose population is between 3 and 5 million inhabitants.
-
(3) Return the country code and the continent of every country not in Europe or in Australia/Oceania.
-
(4) Return the names of countries that also give their name to one of its own provinces.
-
(5) Return the names of countries that are not landlocked (i.e., have a sea coast).
Task 3: Write SQL expressions that, upon evaluation, returns the data characterized by each of the following English-language specifications. Use duplicate removal where appropriate (e.g., when a duplicate is not required in the intended answer).
-
(12) Return the names of up to 10 countries and the value corresponding to half the country’s population.
-
(13) Return all the information available in the City table about cities whose name is Manchester.
-
(14) Return the name of cities whose name starts with the substring 'Man’.
-
(15) Return the name of the country, and the names of the organizations of which the country is a member, for countries with Buddhist populations and organizations established after 1st December 1994.
-
(16) Return the name of each country with the number of islands in it.
Some Comments
In all the queries above, do not get bogged down by the complexity of the Mondial encoding of geographic properties. Mondial is not a spatial database and lacks spatial types (and corresponding spatial operations).
For Task 1, you are expected to submit the TRC expressions only. You are not required to run the query against the data as there is no easily available TRC evaluator to use, but if you want to ensure your expression computes the intended result, consider mapping the expression into an iterative computation in your favourite language.
Page 4 of 14
COMP62421 Querying Data on the Web Coursework Manual
For Task 2, you should use the RA software1 to evaluate your expressions. We propose using the command line options to read the input from a file and to redirect the output to a file, as described in the download that is available on Blackboard. Note that RA is dependent on the SQLite DBMS.
For Task 3, you should use SQLite to evaluate your expressions.
Software/Data
A folder is provided in Blackboard that gives you access to relevant data and scripts.
On Scripting, Echoing and Logging
We note that familiarity with SQLite will be useful for Coursework QO. To give an example using SQLite, let's suppose that the query to be evaluated is:
Return all the information about any 4 countries
The answer to this is:
select * from country limit 4;
You would then compose a script, let's call it Atan_Luring_LW1.sql (note the suffix), that, in this simple case,
would be as follows:
-- Atan_Luring_LW1.sql --
-- set echoing on
.echo on
-- set spooling out: note the suffix .output Atan_Luring_LW1.log
--
select * from country limit 4;
You can then start SQLite (using the sqliterc initialization file we provided you with to normalize columns width, etc., and run the script with the SQLite read <filename> command, where <filename> in the example above would be Atan_Luring_LW1.sql.
sqlite3 -init sqliterc mondial.db < Atan_Luring_LW1.sql
On successful completion, the file called (in this example) Atan_Luring_LW1.log would have the following
content (modulo formatting):
select * from country limit 4;
Name Code
---------- ----------
Albania AL
Greece GR
Macedonia MK
Serbia SRB
Capital
----------
Tirana
Athina
Skopje
Beograd
Province
----------
Albania
Attikis
Macedonia
Serbia
Area Population ---------- ---------- 28750 2800138 131940 10816286 25333 2059794 77474 7120666
In the case of the RA software:
java -jar ra.jar mondial.properties -i Atan_Luring_LW1.ra -o Atan_Luring_LW1.log
runs the RA executable (ra.jar), taking as input the query in the file Atan_Luring_LW1.ra and writing the output to Atan_Luring_LW1.log. The homepage of RA is:
https://users.cs.duke.edu/~junyang/ra2/
1 https://users.cs.duke.edu/~junyang/ra2/
Page 5 of 14
COMP62421 Querying Data on the Web Coursework Manual
Mondial
You can find documentation about Mondial in its website2.
SQLite
For SQLite, the documentation on the website is comprehensive, but if you prefer learning from books, you
have free access (from an UoM IP address) to this one:
The Definitive Guide to SQLite
Grant Allen, Mike Owens
ISBN: 978-1-4302-3225-4
Apress, 2010
http://link.springer.com/content/pdf/10.1007%2F978-1-4302-3226-1.pdf
Lab Work QO
This lab involves an experimental investigation into the SQLite query optimizer. Before embarking on the
experiments it will be useful to refer to the SQLite documentation: https://www.sqlite.org/docs.html
and in particular, this section:
https://www.sqlite.org/optoverview.html
Consider, also, studying this section in detail:
https://www.sqlite.org/queryplanner-ng.html
Task 1: Write three pairs of semantically-equivalent SQL queries against the Mondial database such that:
-
in each query pair, either the two queries are syntactically distinct (but semantically equivalent, i.e., they
return the same result set) and one query in the pair is optimizable by SQLite whereas the other is not;
-
or else the two queries are syntactically identical but are run under different conditions (e.g., once
without access to an index, and again with access to an index)
so that, as a result, we would expect the response times in evaluating the queries in the same pair to be different. Essentially, we want to know by how much, and why.
You must:
-
produce the query pair,
-
explain which optimization/evaluation strategy you have in mind for each query in the pair,
-
record the execution times for each query in each pair,
-
discuss any execution time difference.
You must use SQLite to evaluate your expressions and obtain results for your submission.
Task 2: Summarize your investigation with a plot, accompanied by interpretation and comment, thatevaluates the benefits/shortcomings of using the SQLite optimizer and evaluator. Software/Data
A folder is provided in Blackboard that gives you access to relevant data and scripts. Some CommentsWe expect you to study the documentation, i.e., to find out about facts, notions, ideas, techniques, policies and heuristics yourself. We want to gauge how much you have learned about query optimization and execution. Some topics (like the use of indexes) haven’t been covered in detail in the course, but do some research.
By all means, ask questions in the lab session and in the discussion board, but we are asking you to think and reflect and ask yourself questions, propose some answers, and verify whether your proposed answers are correct or not.
Note, to start with, that the goal is not for you to try and create complex (in the sense of challenging, or even tricky) queries per se, but rather to study the SQLite optimizer and identify cases where it can make a difference in potentially speeding up a query and cases where it cannot. If you read them from this course unit’s viewpoint, the documentation on the SQLite optimizer (sometimes implicitly) describes conditions in which the optimizer can make an impact and conditions in which it cannot make an impact.
To get you started, study the conditions under which the optimizer may or may not be able to use an index. Start by considering whether there is a substantive difference in the amount of data that would be processed if the index were used or not. Also, in the case of indexes, note that you may need to create the index yourself and run the query, then drop the index and run the query again. This is the special case in Note, however, that if you were to produce three pairs of queries that are simply instances of the same hypotheses (e.g., '"The availability of indexes allows the optimizer to generate more efficient plans.") you would lose out on some marks. Contrast the case above with one in which you explore different optimization strategies that are based on indexing (e.g., one might be the case of complex predicates that prevent the optimizer from taking advantage of indexes, another the case where the availability of an index allows the optimizer to select a more efficient physical operator, etc.). In this example, the hypotheses are different, and the corresponding queries aim to trigger different behaviours in the optimizer, and hence merit more marks.
Your submission will be a report containing the pairs, the explanation of the optimizations you have in mind, and the plots with interpretation and comment.
Task 1: Query Analysis:
The report should be structured with a section for each pair of queries that includes:
-
The pair of queries. The queries should be in a form that can easily be copied and pasted into a script, in case we want to run it.
-
An explanation that describes the features of the SQLite optimizer and evaluator that are being investigated, and the impact of these on the times observed.
-
Evidence that has been collected that the pair of queries provide different execution times. Task 2: Reflection:
-
At least one plot that provides information on the approaches to evaluation pursued for Task 1.
-
A discussion of the plot, the information that was used in the experiment, what can and can’t be
observed from this information, and what other information might be useful.
Thus, the discussion in Task 1 relates to the specific queries, and the discussion in Task 2 seeks to draw more general lessons from your overall experience, in the light of the information in the plot.The marks are allocated 75% to Task 1 and 25% to Task 2. The total length of the report should be no longer than 2000 words.
On Plotting, Interpreting and Commenting on Experimental Results
You are doing a study in which you measure the response time of queries. So, what goes in the Y axis? And you're plotting the response time for various queries, which come in pairs. So, what goes in the X axis? In other words, think about this as an opportunity for you to exercise/acquire basic transferable skills in doing empirical explorations of system performance that might be useful in non-database settings.
To throw further light on what you’re expected to do when you’re asked (above) to explain the optimization you have in mind, the starting point for you is to think of these tasks as being grounded on the notion of hypothesis testing.
Your comments should avoid simply reading the values off the graphs. For example, don’t simply write In the first pair, the non-optimized query took 3.5ms.
the reader can do that without your help (unless you plot is poorly presented, which is another, serious, matter).
When reporting times, you should provide information about the setup of the experiment, including the database size, the software version, and information about the environment used. For example, here’s a pretty complete specification of a (hypothetical) underlying hardware/software environment:
These measurements were taken in a MacBook Air, 1.8 GHz Intel Core i7 with 4GB 1333 MHz DDR3 with 251 GB Flash Storage running OS X Yosemite V 10.10.5 and SQLite version 3.8.11.1 2015-07-29 20:00:57.
You may not have as much information, but provide as much as you can on the CPU, the primary memory, the secondary memory, the operating system and the DBMS you’re using.
Sometimes, a lot of time may be spent in printing results. For exercises in which the result is not of special interest, only the timings are, you should switch off the echoing in your script.
Furthermore, because of hot-cold effects (i.e., full v. empty buffers, on-the-fly indexing, etc.), when taking measurements of query response time, it is good practice to (close and re-)open the database anew before running each query, and then run each query four times at least, discarding the first and averaging the last three to obtain the value you plot.
If running on a public machine, to check whether there are other users logged into a machine, and thus potentially interfering with measured times, log into it and issue the who command on a shell: it lists the login names of every used logged into the machine.
When explaining how results have been obtained, it may be useful to use the SQLite explain query plan command4. The
.eqp on
command switches on explanation, and will provide information on how a query is to be evaluated. For
example:
sqlite> .eqp on
sqlite> select * from country limit 4;
QUERY PLAN
`--SCAN TABLE country
Lab Work SP
In relational databases, the content of every table must conform to an explicitly declared schema. In semi- structured databases, including RDF stores, however, this is not the case.
The overall, high-level goal of this lab work is for you to gain an insight into some of the consequences ensuing from the lack of schema constraints over RDF stores. In particular, if one is asked to write a set of SPARQL queries to achieve a certain goal, some questions arise, such as:
-
How does one go about understanding what data is there, and what can we ask of it?
-
How much time is spent finding out?
-
Does this make the task of writing queries harder?
-
Is it more likely that you will spend longer debugging the queries before they work?
In terms of information content, your goal is to create a set of SPARQL queries over DBpedia that, for a country in the relational version of the Mondial database, generates the same information (excluding the geographical aspects such as lakes, mountains, seas, rivers, etc., but including, if possible continent, country, city, province, organization, ethnic group, language and religion ) as is available in Mondial.
In other words, you will:
-
study again the relational version of Mondial,
-
focus on the non-geographical information that you can retrieve (basically using selection-
projection-join-aggregation), as shown in the mondial-abh.pdf referential dependency diagram,
then
-
set yourself the target of obtaining that information from DBpedia using SPARQL queries.
Note that we will not be strict about minor variations (e.g., the names given to countries, the populations of a city, and similar cases), minor inconsistencies (e.g., the province to which a city belongs, and similar cases), and minor examples of incompleteness (e.g, some missing cities or missing properties, and similar cases).
Note also that there is a version of Mondial in RDF, but you will not use it. You will try and create, from DBpedia and using SPARQL, the information content of the relational Mondial restricted to what you can obtain by querying country, City, etc, and ignoring geographical information (i.e., information about geographical features and concepts such as lakes, rivers and islands).
However, creating the set of queries is not your main task. Moreover, the task for which most marks are available is not to get the set of SPARQL queries absolutely right.
Your main task is to write a report about your experience of writing the set of SPARQL queries. In other words, the main goal, really, is for you to record, reflect and report on your experience of writing the SPARQL
To what extent did you succeed in writing a set of SPARQL queries that gathers from DBpedia the same information that is available through relational Mondial?
Note that we are not expecting you to be 100% effective: you may not be able to retrieve all the information, your set of SPARQL queries may be incomplete.
Also note that, here, you don’t need to worry about single tuples, single values, single pieces of information. We’re only interested in questions such as: Can we find the provinces of a country using SPARQL against DBpedia? Note that we don’t necessarily mean all the provinces of all the countries.
By
we mean an answer to the following questions:
• How much time did it take you to explore of the information content of the DBpedia using SPARQL queries that try and gauge what is available in that resource?
• How much time did it take you to discover how to write the set of SPARQL queries you have succeeded in writing?
Again, note that we are not asking you to be extremely efficient: it may take you quite long to even begin to write the first SPARQL query (most probably, one that tells you which are the countries in DBpedia).
You don’t lose marks if it takes you very long. You gain marks for explaining what the problem was that prevented you, be it a problem with DBpedia, be it a problem of complexity (e.g., you ran out of time before the submission deadline), be it a problem of inadequacy of SPARQL, be it the lack of a schema, or yet something else.
Also note that, here, you don’t need to worry about very detailed timekeeping. Just keep a timesheet of the amount of hours and minutes (down to 5 minutes accuracy) that you spend in this lab.
For example, you may have several rows saying something (slightly more specific perhaps than)
Task 1: Write a set of SPARQL queries for the task described above, compiling a timesheet as you do, and
writing comments on the efficiency and effectiveness of your work on the task.
The queries will be typed, tested and executed in the landing page for the DBpedia SPARQL endpoint:
http://dbpedia.org/sparql
You will find it useful to also consult the DBpedia Ontology directly:
http://mappings.dbpedia.org/server/ontology/classes/
We expect that you will use SPARQL queries for two related, but distinct, purposes: one purpose is to explore how the information content of Mondial is (or is not) made available in DBpedia (call these exploratory queries), the other is to actually retrieve data from DBpedia that corresponds to data you might have retrieved from Mondial using SQL or XQuery (call these retrieval queries). Note that, in the absence of a schema, one needs to use the former kind of SPARQL query to obtain the information needed to write the latter kind of SPARQL query.
Task 2: Write a report that includes the set of queries produced in Task 1, and, using the timesheet and comments compiled during Task 1, assesses the impact that the schema-less nature of SPARQL/RDF had on the effectiveness and efficiency of your work on Task 1.
Some Comments and Tips Using exploratory queries
By exploratory query is meant one that essentially is trying to explore (a sample of) what subjects, predicates and objects there are. One query that might get you started is the following:
SELECT DISTINCT ?C
WHERE {?C a dbo:Country}
Try it. You would expect this to be a list of countries, right? But while indeed resources representing countries are returned, e.g.
http://dbpedia.org/page/Brazil
we also get information that is not about a country per se, e.g. http://dbpedia.org/page/Captaincies_of_Brazil
(These are just illustrations! You don’t need to stop and understand the resources the links above point you to. It is just so that you begin to get in thinking mode!)
We expect that a significant part of your time (but prove us wrong!) will be spent exploring the content of DBpedia this way. You are asked to capture and report on the process of exploring DBpedia.
Being aware of the namespace prefixes you can use
In the landing page, the Tables drop down menu links to the predefined Namespace Prefixes; you should explore this.
Using 'Results Format' to help exploration
Remember that if you use HTML as 'Results Format’, you are given links that you can click on and explore more. For example, the query above returns a table with HTML links. If you click on
http://dbpedia.org/ontology/Place
you will be taken to a description of the concept Place in the DBpedia Ontology. This may tell you more about the data and how to use it to achieve your goal. There are other options than simple HTML. You should explore your options.
Using LIMIT
Be careful to use LIMIT otherwise you’ll get a flood of results. On the other hand, you may need to increase
LIMIT or remove it altogether to find what you’re looking for.