Problem Set 2
Part I due by 11:59 p.m. on Tuesday, October 8, 2024.
Part II due by 11:59 p.m. on Tuesday, October 15, 2024.
Part I
40 points total
Creating the necessary folder
Create a subfolder called ps2
within your cs460
folder, and put all of the files for this assignment in that folder.
Creating the necessary file
This part of the assignment will all be completed in a single PDF file. To create it, you should do the following:
Access the template that we have created by clicking on this link and signing into your Google account as needed.
When asked, click on the Make a copy button, which will save a copy of the template file to your Google Drive.
Select File->Rename, and change the name of the file to
ps2_partI
.Add your work for the problems from Part I to this file.
Once you have completed all of these problems, choose File->Download->PDF document, and save the PDF file in your
ps2
folder. The resulting PDF file (ps2_partI.pdf
) is the one that you will submit. See the submission guidelines at the end of Part I.
Problem 1: Fixed-length and variable-length records
19 points total
Recall the Movie
table from our movie database in Problem Set 1. Assume that we are using a simplified version of that table with the following schema:
Movie(id CHAR(5), name VARCHAR(20), year INTEGER, runtime INTEGER, rating VARCHAR(5))
Consider the following tuple from that table:
('14310', 'Deadpool', 2016, 108, 'R')
(3 points) What would this tuple look like if we stored it in a fixed-length record? In the 1.1 and 1.2 section of
ps2_partI
(see above), put your answer in the table labeled record contents.You should observe the following conventions:
Give each data value its own cell of the table. Adjust the widths of the cells as needed to better fit the sizes of the values, and delete any cells that are not needed.
Use a number sign (
'#'
) as a delimiter when it is necessary to record the end of a variable-length field’s value.Use hyphens (
'-'
) for any “wasted” bytes (i.e, bytes that are part of the record’s representation but are not actually storing useful data or metadata).
To illustrate these conventions, imagine that we were working with the
Enrolled
table in our university database, which has the following schema:Enrolled(student_id CHAR(9), course_name VARCHAR(10), credit_status VARCHAR(10));
If we wanted to show what the tuple
('U00000006', 'CS 460', 'ugrad')
would look like using a fixed-length record, we would fill in the table as follows:
(2 points) What is the length in bytes of the record from part 1? Assume that we are using:
four-byte integer field values
one-byte characters – including any digit characters that are part of a
CHAR
orVARCHAR
.
Put your final answer in the box labeled length in bytes, and show your work in the box below the answer.
(3 points) What would this tuple look like if we stored it in a variable-length record in which each field is preceded by its length?
In the 1.3 and 1.4 section of
ps2_partI
, put your answer in the table labeled record contents.In addition to the conventions that we specified for part 1, you should also give each metadata value its own cell of the table. Change the background color of cells containing metadata to distinguish them from cells containing actual data values. You can do so by using the icon that looks like a paint can in the menu bar at the top of Google Docs.
In addition to the assumptions about the sizes of characters and integers that we gave you in part 2, you should assume that integers used for metadata are two bytes long (not four bytes).
(2 points) What is the length in bytes of the record from part 3? Make the same assumptions stated in parts 2 and 3. Put your final answer in the box labeled length in bytes, and show your work in the box below the answer.
(4 points) What would this tuple look like if we stored it in a variable-length record that begins with a header of offsets?
In the 1.5 and 1.6 section of
ps2_partI
, put your answer in the table labeled record contents. Use the same conventions that we specified for parts 1 and 3, and use the same assumptions about the sizes of characters, integer field values, and integer metadata that we gave you in parts 2 and 3.(2 points) What is the length in bytes of the record from part 5? Put your final answer in the box labeled length in bytes, and show your work in the box below the answer.
(3 points) Now consider the following
Movie
tuple:('12624', 'Wicked', 2024, NULL, 'PG')
This tuple is for the film adaptation of musical Wicked, which is coming to movie theatres this November. The
NULL
value for runtime reflects the fact that the movie’s runtime has not yet been officially announced.What would this tuple look like if we stored it in a variable-length record that begins with a header of offsets?
In the 1.7 section of
ps2_partI
, put your answer in the table labeled record contents. You should use:the approach to
NULL
values that we took in lecturethe same conventions that we specified for parts 1 and 3
the same assumptions about the sizes of characters, integer field values, and integer metadata that we gave you in parts 2 and 3.
There is no separate length-computation question for this record.
Problem 2: Index structures
21 points total; 7 points each part
Let’s say that you want to insert items with the following sequence of keys into a collection of records that uses some form of indexing:
10, 11, 12, 4, 9, 14, 1, 2, 7, 19, 18, 5, 3
Insert this key sequence into an initially empty B-tree of order 2. In section 2.1 of
ps2_partI
, show the tree after each insertion that causes a split of one or more nodes, and the final tree.We have given you a sample diagram that includes nodes of different sizes. Make copies of the diagram so that you can use separate diagrams for the results of each insertion that causes a split, and for the final tree. Note that you do not need to keep the shape of the tree that we have given you. Rather, you should edit it as needed: deleting or adding nodes and edges, replacing the Xs with keys, adding or removing keys, and making whatever other changes are needed.
Insert this same key sequence into an initially empty B+tree (note the +) of order 2. In section 2.2 of
ps2_partI
, show the tree after each insertion that causes a split of one or more nodes, and the final tree. Here again, you should make copies of the diagram that we have given you and edit them as needed.Insert this same key sequence into a hash table that uses linear hashing. In section 2.3 of
ps2_partI
, use the tables that we have provided to show the state of the table before and after each increase in the number of buckets, as well as the final state of the table.Important details:
The table should use the hash function h(x) = x, and it should start out with two empty buckets.
Within a given bucket, please list the keys in the order in which they were inserted.
Assume that a bucket is added whenever the number of items in the table exceeds three times the number of buckets.
An item that causes the table to grow should appear in both the before and after tables for that increase. See the PS 2 FAQ for more details.
Submitting your work for Part I
Login to Gradescope by clicking the link in the left-hand navigation bar. Once you are in logged in, click on the box for CS 460.
Submit your ps2_partI.pdf
file using these steps:
If you still need to create a PDF file, open your file on Google Drive, choose File->Download->PDF document, and save the PDF file in your
ps2
folder.Click on PS 2: Part I in the list of assignments on Gradescope. You should see a pop-up window labeled Submit Assignment. (If you don’t see it, click the Submit or Resubmit button at the bottom of the page.)
Choose the Submit PDF option, and then click the Select PDF button and find the
ps1_partI.pdf
that you created in step 1. Then click the Upload PDF button.You should see an outline of the problems along with thumbnails of the pages from your uploaded PDF. For each problem in the outline:
- Click the title of the problem.
- Click the page(s) on which your work for that problem can be found.
As you do so, click on the magnifying glass icon for each page and doublecheck that the pages that you see contain the work that you want us to grade.
Once you have assigned pages to all of the problems in the question outline, click the Submit button in the lower-right corner of the window.
You should see a box saying that your submission was successful. Click the
(x)
button to close that box.You can use the Resubmit button at the bottom of the page to resubmit your work as many times as needed before the final deadline.
Part II
60 points total
Problem 3: Converting the Oscar table to XML
25 points; pair-optional
This is the only problem of the assignment that you may complete with a partner. See the rules for working with a partner on pair-optional problems for details about how this type of collaboration must be structured.
In this problem, you will write a series of methods that can be used to create an XML version of the Oscar
table from Problem Set 1. Your methods will be part of a larger program that uses the JDBC framework to connect to the SQLite database that you used in PS 1 and to execute the SQL queries needed to extract the necessary data.
Obtaining the necessary files
You should begin by downloading the following zip file:
problem3.zip
Unzip/extract the contents of the file.
Depending on your system, after extracting the contents you will either have:
a folder named
problem3
that contains all of the files that you need for this probleman outer folder called
problem3
that contains an inner folder namedproblem3
that contains all of the Java files that you need for this problem.
Take the problem3
folder that actually contains the necessary files and drag it into your ps2
folder so that you can easily find and open it from within VS Code.
Getting started
Read through our overview of the JDBC framework.
Launch VS Code on your laptop.
In VS Code, select the File->Open Folder or File->Open menu option, and use the resulting dialog box to find and open the
problem3
folder that you created above – the one that contains the provided files. (Note: You must open the folder; it is not sufficient to simply open one of the Java files in the folder.)The name of the folder should appear in the Explorer pane on the left-hand side of the VS Code window, along with a list of all of its contents.
Click on the name
XMLforOscars.java
in the Explorer pane, which will open the file that you need to modify.Review all of the code that we’ve provided before you start writing any new code. See below for some additional information on what we’ve given you.
The provided code
The class that you will be completing is called XMLforOscars
. In this class we’ve given you:
the constructor for the class, which takes the name of a SQLite file that should contain a relational database with the schema outlined in Problem Set 1; the constructor establishes a connection to the SQLite database, and it stores the resulting
Connection
object in a field calleddb
.a helper method called
simpleElem()
, which takes as inputs the name and value of a simple XML element and returns a string of the form"<name>value</name>"
. This method should only be used to form simple elements – ones that do not have any attributes or child elements.a helper method called
resultsFor()
, which takes as input a string presenting a SQL query for the movie database and returns aResultSet
object that can be used to process the results of that query. You are welcome to use this method in the code that you write, although doing so is not required.a method called
personIdFor()
, which takes as input the name of a person and performs a query to find and return the person’s id; this method will be useful when testing the methods that you write.a method called
movieIdFor()
, which takes as input the name of a movie and performs a query to find and return the movie’s id; this method will be also useful when testing the methods that you write. (Note: Although movie names are not unique, we can safely ignore that fact for the purposes of this method.)a method called
createFile()
, which performs a query to obtain all distinct years in theOscar
table, and which processes them one at a timethe
main
method, which runs the full program.
Note that many of these methods – and all of the methods that you will implement – include a throws
clause in their method header. This clause is needed because the code included in these methods may throw the exception(s) mentioned in the throws
clause, and instead of catching them, we are simply declaring that they may be thrown.
The methods you will write
You will implement a number of non-static methods of the XMLforOscars
class.
Here are your tasks:
Implement the method called
movieElemFor()
whose header we have provided. It takes a string representing the id number of a movie, and it should return a string containing an XML element of typemovie
that has anid
attribute whose value is the id that was passed in and whose text value is the name of the movie.If there is no movie with the specified id (including id values of null), the method should simply return the empty string. If there is a movie with the specified id, the string returned by the method should end with a single newline character (
\n
).For example, if you run the following test code (adding it to the
main
method inTestDriver.java
):XMLforOscars xml = new XMLforOscars("movie.sqlite"); System.out.println(xml.movieElemFor(xml.movieIdFor("Black Panther"))); System.out.println(xml.movieElemFor("1234567")); // no movie with that id System.out.println(xml.movieElemFor(xml.movieIdFor("Barbie")));
you should see:
<movie id=”1825683”>Black Panther</movie> <movie id=”1517268”>Barbie</movie>
Notes:
You should see an extra blank line when you print the results of any call that produces a
movie
element (including the blank line after the results for Barbie shown above), because the string returned by the method should end with a newline, and theprintln
method adds its own newline.The second
println
statement prints only a blank line because there is no movie whose id is 1234567, and thus the callxml.movieElemFor("1234567")
returns an empty string. As a result, there are two blank lines after the results for Black Panther: one after its element, and one from the printing of the empty string.
Important guidelines:
You must begin by performing the appropriate SQL query. Use the
personIdFor()
andmovieIdFor()
methods as models for what you should do.When processing the results, make sure to follow the approach given in our JDBC overview.
Because the
movie
element that you are forming has an attribute, you should not use thesimpleElem()
method to create it. Rather, you should construct it yourself using string concatenation.The start tag of the returned
movie
element should be preceded by exactly six spaces.Don’t forget to include the movie’s id as an attribute within the start tag for
movie
, as shown above. In order to include the quotes around the id, you will need to use the escape sequence"\""
for each double-quote character.There should be a single newline character (
\n
) and no extra spaces after the end tag of the element.
Implement the method called
personElemFor()
whose header we have provided. It takes a string representing the id number of a person, and it should return a string containing the XML for a single complex element of typeperson
that includes nested child elements for:the name of the person (which you may assume is never null)
the dob of the person (if it is non-null)
In addition, the returned
person
element must have an attribute namedid
for the person’s id number.If there is no person with the specified id (including id values of
null
), the method should simply return the empty string. If there is a person with the specified id, the string returned by the method should end with a single newline character (\n
).For example, if you run the following test code (adding it to the
main
method inTestDriver.java
):XMLforOscars xml = new XMLforOscars("movie.sqlite"); System.out.println(xml.personElemFor(xml.personIdFor("Julianne Moore"))); System.out.println(xml.personElemFor("1234567")); System.out.println(xml.personElemFor(xml.personIdFor("Chris Buck")));
you should see:
<person id=”0000194”> <name>Julianne Moore</name> <dob>1960-12-03</dob> </person> <person id=”0118333”> <name>Chris Buck</name> </person>
Notes:
Chris Buck has a
dob
value ofnull
in our database, so hisperson
element does not include a nesteddob
element.Here again, you should see an extra blank line when you print the results of any call that produces a
person
element (including the blank line after the results for Chris Buck shown above), because the string returned by the method should end with a newline, and theprintln
method adds its own newline.The second
println
statement prints only a blank line because there is no person whose id is 1234567, and thus the callxml.personElemFor("1234567")
returns an empty string. As a result, there are two blank lines after the results for Julianne Moore: one after her element, and one from the printing of the empty string.
Important guidelines:
Here again, you must begin by performing the appropriate SQL query and processing the results following the approach given in our JDBC overview.
The outer start and end tags of the returned
person
element each be on their own line preceded by exactly six spaces.Don’t forget to include the person’s id as an attribute within the start tag for
person
, as shown above. In order to include the quotes around the id, you will need to use the escape sequence"\""
for each double-quote character.You must use the provided
simpleElem()
method to form thename
anddob
child elements.Each child element should be on its own line, and its start tag should be preceded by exactly eight spaces.
There should be no extra spaces at the end of any line of the returned string.
Implement the method called
awardElemFor()
whose header we have provided. It takes three strings representing the type, person ID and movie ID of a single Oscar award, and it should return a string containing the XML for a complex element of typeaward
that includes nested child elements for:the award’s type (use
simpleElem()
to get this)the person (if any) associated with the award (use
personElemFor()
to get this); see below for more detailsthe movie associated with the award (use
movieElemFor()
to get this).
The string returned by the method should end with a single newline character (
\n
).For example, if you run the following test code:
XMLforOscars xml = new XMLforOscars("movie.sqlite"); String movieId = xml.movieIdFor("Oppenheimer"); String personId = xml.personIdFor("Cillian Murphy"); System.out.println(xml.awardElemFor("BEST-PICTURE", null, movieId)); System.out.println(xml.awardElemFor("BEST-ACTOR", personId, movieId));
you should see:
<award> <type>BEST-PICTURE</type> <movie id=”1539877”>Oppenheimer</movie> </award> <award> <type>BEST-ACTOR</type> <person id=”0614165”> <name>Cillian Murphy</name> <dob>1976-05-25</dob> </person> <movie id=”1539877”>Oppenheimer</movie> </award>
Here again, you should see an extra blank line when you print the results of a call to
awardElemFor
(including the blank lines after each of theaward
elements shown above), because the string returned by the method should end with a newline, and theprintln
method adds its own newline.
Once you have completed and tested all of your methods, running the XMLforOscars
program should create a file named oscars.xml
that represents the entire Oscar
table in XML!
Problem 4: Querying an XML database
30 points total
This problem asks you to construct XPath and XQuery queries for an XML version of our entire movie database. The schema of this XML database is described here.
Installing the software
To allow you to check your work, we’ll make use of a freely available XML DBMS called BaseX. You should begin by following the instructions for installing and using it that are available here.
Performing queries in BaseX
As outlined in our instructions, you can perform queries by taking the following steps:
Start up the BaseX GUI by double-clicking on the JAR file that you downloaded.
Select the Database->New menu option, click the Browse button, and use the resulting dialog box to find the
imdb.xml
file that you downloaded above.Click Open to select the file, and click OK to create the database.
To execute a query, enter it in the Editor pane in BaseX, and click the green play button to execute it. (You can also use Ctrl+Enter or Ctrl+Return for this purpose.)
The results (if any) will be displayed in the Result pane.
If you have trouble getting BaseX to work on your machine, see the troubleshooting tips on our BaseX page.
Important guidelines
If you’re using a Mac, you should disable smart quotes, because they may lead to errors in BaseX and in our testing. There are instructions for doing so here.
ps2_queries.py
is a Python file, so you could use a Python IDE to edit it, but a regular text editor like TextEdit or Notepad++ would also be fine. However, if you use a text editor, you must ensure that you save it as a plain-text file.Construct the XQuery commands needed to solve the problems given below. Test each command in BaseX to make sure that it works.
Once you have finalized the XQuery command for a given problem, copy the command into your
ps2_queries.py
file, putting it between the triple quotes provided for that query’s variable. We have included a sample query to show you what the format of your answers should look like.Each of the problems must be solved by means of a single query. Unless the problem specifies otherwise, you may use either a standalone XPath expression or an XQuery FLWOR expression.
The only place that you may use a subquery (i.e., a nested FLWOR expression) is in the
results
clause of an outer FLWOR expression. You should NOT have a nested FLWOR expression in afor
clause or alet
clause.The order of the clauses in each query/subquery must follow the FLWOR acronym: a
for
clause (F), followed optionally by alet
clause (L), followed optionally by awhere
clause (W), followed optionally by anorder by
clause (O), followed by areturn
clause (R). You should not put the clauses in a different order – e.g.,for
, followed bylet
, followed by anotherfor
, etc. BaseX may allow you to do this, but it is never necessary to do so, and such a query will often fail to run to completion in the Autograder.Your queries should only use information provided in the problem itself. In addition, they should work for any XML database that follows the schema that we have specified.
When the results of a query include nested child elements, those child elements must be in the specified order with respect to each other. See the example results that are provided for each such problem.
You do not need to worry about indenting and line breaks in the results of your queries.
The query problems
Make sure to read and follow the guidelines given above.
In Problem Set 1, you wrote a SQL query to find information about two movies set in the Boston area: The Holdovers and Spotlight. Write a standalone XPath expression (not a FLWOR expression) to find the
id
attributes of these two movies. The results of the query should be two lines that each have the following form:id="idval"
where
idval
is the value of the movie’sid
attribute.In Problem Set 1, you wrote a SQL query to find the names of all movies in our database in which Robert Downey Jr. has acted. Write a FLWOR expression to find several pieces of information about those movies. The results of the query should be new elements of type
downey_movie
that each include three nested child elements:- the
name
element of a movie that Downey acted in - the
year
element of the movie - the
rating
element of the movie.
For example:
<downey_movie> <name>Oppenheimer</name> <year>2023</year> <runtime>180</runtime> </downey_movie>
Note: Remember that you do not need to worry about indenting or line breaks in the results of your queries.
Hints:
You should use the
contains
function to perform substring matching, as discussed in lecture and lab.You will need to use curly braces and commas as part of your
return
clause. In lecture and lab, we’ve seen examples of queries that use these delimiters to transform the elements selected by a query into new types of elements.
- the
In Problem Set 1, you wrote a SQL query to find, for every movie rating that occurs in the top-grossing movies, the number of top-grossing movies with that rating, and the earnings rank of the most profitable movie with that rating. Write a FLWOR expression to solve the same problem. The results of the query should be new elements of type
profitable_rating
that include:a nested child element for the rating; see below for more details about this element
a new child element of type
num_top_grossers
that has as its value the number of top-grossing movies in the database that have that rating; see below for more detailsa new child element of type
best_rank
that has as its value the earnings rank of the most profitable movie with that rating. Don’t forget that the smaller a movie’s earnings rank, the more profitable it is.
For example:
<profitable_rating> <rating>G</rating> <num_top_grossers>12</num_top_grossers> <best_rank>32</best_rank> </profitable_rating>
Order the results by rating.
Hints:
As was the case in PS 1, top-grossing movies are ones with an earnings rank that is less than or equal to 200. You can also take advantage of the fact that our XML database only includes a movie’s earnings rank if the movie is a top-grossing movie.
To ensure that you only consider a given rating once, you should use the
distinct-values
function. For example, to iterate over distinct Oscar types, you could do the following:for $t in distinct-values(//oscar/type)
distinct-values
gives you the text values of the corresponding elements, not the elements themselves. As a result, yourreturn
clause will need to construct newrating
elements by adding back in the begin and end tags that were removed bydistinct-values
.You will need to use two of the following built-in aggregate functions:
count()
,sum()
,avg()
,min()
andmax()
. In lecture and lab, we’ve seen examples of queries that illustrate how to use this type of function.
In Problem Set 1, you wrote a SQL query to find the number of Oscar winners in our database who were born in Massachusetts. Write a FLWOR expression that solves a related problem: finding information about all Oscar winners from Massachusetts. The results of the query should be new elements of type
mass_oscar_winner
that include the following child elements:- the
name
element of the person - the
pob
element of the person - for each Oscar won by the person, a new child element called
oscar
that has as its text value the type and year of the Oscar, separated by a hyphen (-
).
For example:
<mass_oscar_winner> <name>Bette Davis</name> <pob>Lowell, Massachusetts, USA</pob> <oscar>BEST-ACTRESS - 1936</oscar> <oscar>BEST-ACTRESS - 1939</oscar> </mass_oscar_winner>
Order the results by the name of the person. Within a given
mass_oscar_winner
element, order theoscar
child elements by year.Hints:
You may assume that all people born in Massachusetts have a
pob
value that includes the string"Massachusetts, USA"
.You will need to use the
string()
function to obtain the values of some elements, without their begin and end tags.Your
return
clause will need to include the string literal"-"
. Note that you should not need to include any spaces in this string literal, because XQuery will automatically add a space between string values that are separated by a comma in thereturn
clause.You will need to use a subquery (i.e., a nested FLWOR expression).
There are different approaches to obtaining the necessary people. However, you will need to ensure that a given person appears at most once in the results. Among other things, this means that your outer query’s
for
clause should have only one XPath expression. In other words, you should not attempt to use the outerfor
clause to joinperson
elements withoscar
elements.
- the
Write a FLWOR expression to find all people who have acted in at least 10 of the movies in our database. The results of the query should be new elements of type
prolific_actor
that include the following child elements:the
name
element of the acctora new child element called
first_appeared
that has as its value the year of the first (i.e., oldest) movie in our database that the person acted ina new child element called
last_appeared
that has as its value the year of the last (i.e., most recent) movie in our database that the person acted infor each movie that the person has acted in, a new child element called
movie
that has as its text value the name and year of the movie, separated by a hyphen (-
).
For example:
<prolific_actor> <name>Harrison Ford</name> <first-appeared>1977</first-appeared> <last-appeared>2015</last-appeared> <movie>Star Wars: Episode IV - A New Hope - 1977</movie> <movie>Star Wars: Episode V - The Empire Strikes Back - 1980</movie> <movie>Raiders of the Lost Ark - 1981</movie> <movie>Star Wars: Episode VI - Return of the Jedi - 1983</movie> <movie>Indiana Jones and the Temple of Doom - 1984</movie> <movie>Indiana Jones and the Last Crusade - 1989</movie> <movie>The Fugitive - 1993</movie> <movie>Air Force One - 1997</movie> <movie>What Lies Beneath - 2000</movie> <movie>Indiana Jones and the Kingdom of the Crystal Skull - 2008</movie> <movie>Star Wars: The Force Awakens - 2015</movie> </prolific_actor>
Order the results by the name of the actor. Within a given
prolific_actor
element, order themovie
child elements by year.