CS109A Introduction to Data Science - Homework 0 - Part 2: Python Poetry Challenge

Part 2: Python Poetry Challenge

CS109A assumes students are already familiar with the basics of Python programming.

To assess your level of Python preparedness, you'll be creating a random poem generator from a collection of Emily Dickinson poems made available through Project Gutenberg.

Core Python skills used in completing this challenge include:

File I/O
Data structures
- strings
- lists & slicing
- dictionaries
Logical operators
Iteration
Functions
Classes
Debugging
Reading documentation
Interpreting code

Q2.1 - Read in the file

The document we'll be working with is a digital transcription assembling three early Emily Dickinson poetry collections for which the U.S. copyright has expired.

We've included this document with HW0 as a plain text file. The filename is pg12242.txt and it's located in the data subdirectory.

Read the contents of pg12242.txt into a variable named text. The arcane file name is a stipulation of The Project Gutenberg License included in the file.

Take the contents of the file and save the substring containing the first 10,000 characters as the variable head and the substring containing the final 25,000 characters as tail.
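As a refresher on the two skills this question relies on (reading a whole file into a string and slicing from either end), here is a minimal sketch on a small stand-in file. The file name sample.txt, its contents, and the UTF-8 encoding are assumptions made for illustration and say nothing about the actual data file.

```python
import os
import tempfile

# Hypothetical stand-in file so the sketch runs anywhere.
sample = "HEADER " + "x" * 50 + " FOOTER"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# The context manager closes the file even if reading fails.
with open(path, encoding="utf-8") as f:
    contents = f.read()

first_six = contents[:6]   # the first 6 characters
last_six = contents[-6:]   # the final 6 characters
```

A negative start index counts from the end of the string, so no length arithmetic is needed to take a trailing slice.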

In [ ]:

...
    text = ...

head = ...
tail = ...

In [ ]:

# We can inspect the beginning of the content
print(head)

In [ ]:

# final 25,000 characters 
print(tail)

In [ ]:

# entire contents
print(text)

Q2.2 - Remove non-poem content

Keeping in mind that this is a compilation of three publications (or 'series') and inspecting the head and tail sections of the text we printed above, we can identify some useful structure in the document:

Project Gutenberg: Header - Metadata
Series 1: Preface
Series 1: Poems
Series 2: Preface
Series 2: Poems
Series 3: Preface
Series 3: Poems
Project Gutenberg: Footer - License

Recognizing and making use of structure is a large part of data science.

Create three variables, series1, series2, and series3, each consisting of one of the three sections of text dedicated to the poem content, excluding prefaces and the Project Gutenberg metadata and license. Be sure to strip any trailing or leading whitespace from the final strings.

Hints:

Each preface section begins with "POEMS\n"
Each poem section after a preface starts with a numbered subsection beginning with "I."
The Project Gutenberg License section begins with "End of Project Gutenberg"
You may import standard library modules such as re to help here, but this task can also be done using only Python built-ins with no additional imports.
Review indexing and slicing if needed
The string .index() method returns the index of the first occurrence of a target string in a source string.
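The marker-and-slice idea behind these hints can be sketched on a toy string. The toy text and variable names below are invented for illustration. The sketch also relies on the optional second argument of .index(), which starts the search at a given position, and it assumes the markers do not occur anywhere else in the text. Check that assumption before applying the same idea to the real document.

```python
# Toy document with the same shape: preface marker, poem marker, end marker.
toy = (
    "header\n"
    "POEMS\npreface A\n"
    "I.\npoem a1\npoem a2\n\n"
    "POEMS\npreface B\n"
    "I.\npoem b1\n\n"
    "End of toy text"
)

# First poem section: from the first "I." after the first preface marker
# up to the next preface marker.
preface_a = toy.index("POEMS\n")
a_start = toy.index("I.", preface_a)
a_end = toy.index("POEMS\n", a_start)
part_a = toy[a_start:a_end].strip()

# Second poem section: from the next "I." up to the end marker.
b_start = toy.index("I.", a_end)
b_end = toy.index("End of", b_start)
part_b = toy[b_start:b_end].strip()
```

Unlike .find(), .index() raises a ValueError when the target is missing, which makes a wrong marker fail loudly instead of silently slicing from position -1.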

In [ ]:

series1 = ...
series2 = ...
series3 = ...

We will then join the three series into a single string using join. For the delimiter we use six newline characters. The reason for this choice will soon become clear.

In [ ]:

# join series into single string connected by "\n\n\n\n\n\n"
poems_text = ('\n'*6).join([series1.strip(),
                            series2.strip(),
                            series3.strip()])

We can then check that only the poem content remains.

In [ ]:

print(poems_text[:445])
print('\n---\n')
print(poems_text[-275:])

Q2.3 - Remove annotation

The first poem contains some editor annotation in square brackets.

In [ ]:

print(poems_text[:200], '...')

This is the only annotation of this kind in the text: the opening and closing square brackets each appear exactly once in the string.

In [ ]:

# Confirm that brackets occur only once
assert (poems_text.count("[") == 1) and (poems_text.count("]") == 1)

Create poems_text_clean which has this annotation (and the additional blank line below it) removed.

Hint: The string method split could be put to use here.
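To illustrate the hint, here is a small sketch of removing a single bracketed note with split. The toy string is invented, and how much surrounding whitespace to drop in the real text depends on its layout, so treat the lstrip call as an assumption about this toy example only.

```python
toy = "TITLE.\n\n[An editor's note.]\n\nFirst line of verse,\nSecond line of verse."

# Each bracket occurs once, so each split yields exactly two pieces.
before, rest = toy.split("[")
_, after = rest.split("]")

# Drop the newlines that followed the note, then stitch the pieces together.
cleaned = before + after.lstrip("\n")
```

Unpacking into exactly two names doubles as a sanity check: if a bracket appeared more than once, the assignment would raise a ValueError.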

In [ ]:

poems_text_clean = ...

In [ ]:

print(poems_text_clean[:100], '...')

Q2.4 - Identify sections and numbering

Now let's remove some of the other content that was not the work of the original author: the section titles and poem numbers. It is helpful to notice that all such lines begin with roman numerals.

Implement the startswith_rn function which we can use to identify lines that serve as section titles or poem numbering.

Note: The function does not need to be sophisticated enough to reject 'invalid' roman numerals such as 'IIIV.' We are assuming (correctly) that these sorts of strings do not occur and so the function only needs to be powerful enough to recognize those numerals that do occur at the start of headers in the text while excluding the titles and the content of any of the poems.
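One way to think about the matching problem is as a pattern: a run of roman-numeral letters followed by a period at the very start of the line. The sketch below uses the standard re module and a hypothetical helper name, looks_like_rn_header. It is not a vetted solution: a pattern this loose would also accept an all-caps title spelled only with those letters, such as "MIX.", so whether it is strict enough for the real text is something to verify.

```python
import re

# One or more roman-numeral letters, then a period, anchored at the start
# of the string (re.match only matches at the beginning).
RN_HEADER = re.compile(r"[IVXLCDM]+\.")

def looks_like_rn_header(line: str) -> bool:
    """Hypothetical helper: True if `line` starts like 'XIV.' or 'III. NATURE.'"""
    return RN_HEADER.match(line) is not None
```

Requiring the period is what separates "I. LIFE." from a poem line such as "I'm nobody! Who are you?", where the letter I is followed by an apostrophe.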

In [ ]:

def startswith_rn(s: str) -> bool:
    '''
    Returns: True if `s` starts with a roman numeral and False otherwise
    Ex: 
        startswith_rn("III. NATURE.") -> True
        startswith_rn("I'm nobody! Who are you?") -> False
    '''
    ...

In [ ]:

grader.check("q2.4")

Q2.5 - Remove sections and numbering

Use your new startswith_rn function to help you remove the lines starting with roman numerals from poems_text_clean. Strip any whitespace from the ends of the final result and store the string in the variable poems_text_nonum.
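The general pattern here is split into lines, filter with a predicate, and join again. A minimal sketch on toy data follows. The helper name drop_lines and the "#" predicate are made up for the example and stand in for whatever test you apply to each line.

```python
def drop_lines(text, should_drop):
    """Return `text` without the lines for which should_drop(line) is True."""
    kept = [line for line in text.split("\n") if not should_drop(line)]
    return "\n".join(kept).strip()

toy = "# heading\nfirst\n\n# another\nsecond\n"
result = drop_lines(toy, lambda line: line.startswith("#"))
```

Splitting on "\n" keeps blank lines as empty strings, so they survive the round trip through join. That matters whenever later processing depends on runs of consecutive newlines.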

In [ ]:

poems_text_nonum = ...

In [ ]:

print(poems_text_nonum[:500])

Q2.6 - Create a list of poems

Next, we'd like to split the text up into a list of poems.

To investigate how we might do this, it is important to view all the characters in the text. The pprint function from the standard library lets us display the newline characters '\n' explicitly while still breaking the output across lines, which keeps it easy to read.
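The difference between print and pprint is easiest to see side by side on a short invented string. pformat, from the same module, returns the text that pprint would print, which makes the behavior easy to inspect.

```python
from pprint import pformat, pprint

sample = "A first line\nA second line\n\n\nAfter a gap"

print(sample)    # newlines are rendered as actual line breaks
pprint(sample)   # the repr is shown, so each newline appears as \n

shown = pformat(sample)  # the same text pprint prints, as a string
```

For strings too long to fit on one output line, pprint additionally breaks the display into several quoted pieces, which is what keeps a long text readable.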

In [ ]:

from pprint import pprint

In [ ]:

pprint(poems_text_nonum[:2000])

Notice how the poems are separated by multiple newline characters. Further exploration shows the minimum number of newlines between poems to be 6.

Use this information to create poem_list where the elements of the list are the poems in the order they occur in poems_text_nonum. For now, we will include the title if the editors provided one as part of the poem. Each poem should be stripped of leading and trailing whitespace.

Hint: Be sure to exclude any empty-string elements from your final poem_list.
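The hint about empty strings is easier to appreciate with a toy example. When the gap between two items is at least twice as long as the separator, split produces an empty piece between them. The strings below are invented for illustration.

```python
sep = "\n" * 6

# The second gap is 13 newlines long, i.e. two separators plus one extra.
toy = "poem one\nline two" + sep + "poem two" + "\n" * 13 + "poem three\n"

pieces = [chunk.strip() for chunk in toy.split(sep)]
nonempty = [chunk for chunk in pieces if chunk]
```

Stripping before filtering matters: a leftover piece made only of newlines becomes an empty string after strip and is then dropped by the truthiness test.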

In [ ]:

poem_list = ...

In [ ]:

print(poem_list[109])

In [ ]:

print(f"There are a total of {len(poem_list)} poems.")

Q2.7 - Create a poem dictionary

We now have all the poems separated, but we might want some means of accessing specific poems without having to remember their arbitrary position in the list. This is where the key-value pair structure of dictionaries will help.

We will create a dictionary, d, where the keys are the poem titles and the values are the poems themselves.

If the editors did not provide a title, we'll use the first line of the poem as the key.

Recall that some poems were given identical titles, and dictionary keys must be unique. Our approach will be to use incrementing numerical labels to denote poems with duplicate names beyond the first encountered. The titles should then look like: "ETERNITY.", "ETERNITY. (2)", "ETERNITY. (3)", etc.
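To make the numbering scheme concrete, here is a sketch that produces keys of this form by tracking how many times each title has been seen. It is a different technique from the helper-based implementation in this assignment, and the names add_with_suffix, titles, and seen are hypothetical.

```python
def add_with_suffix(d, counts, key, value):
    """Insert value under key, appending ' (n)' for the n-th repeat of key."""
    counts[key] = counts.get(key, 0) + 1
    n = counts[key]
    d[key if n == 1 else f"{key} ({n})"] = value

titles = {}   # final mapping with unique keys
seen = {}     # how many times each raw title has occurred so far
for title, body in [("ETERNITY.", "a"), ("HOPE.", "b"),
                    ("ETERNITY.", "c"), ("ETERNITY.", "d")]:
    add_with_suffix(titles, seen, title, body)
```

Keeping the counts in a separate dictionary avoids parsing the number back out of an existing key, at the cost of carrying a second piece of state around.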

The two cells below contain one possible implementation that is nearly complete. Only one line needs to be filled in.

You are also welcome to create your own solution from scratch.

In [ ]:

# helper functions
# Answers: what number should I increment to having now seen a duplicate?
# (reads a single digit, so it assumes fewer than 10 poems share a title)
next_title_num = lambda x: int(x[-2]) + 1 if x[-2].isnumeric() else 2 
# Answers: what is the most recently added key for this title?
prev_title = lambda d, k: sorted([t for t in d.keys() if t.startswith(k)])[-1]

def update(d: dict, k: str, v: str) -> None:
    '''
    Adds key-value pair 'k' & 'v' to dictionary 'd'
    Uses helper functions to increment key string if key already exists
    Dictionary is changed inplace; Returns None.
    '''
    if d.get(k):
        k = f'{k} ({next_title_num(prev_title(d,k))})'
    d[k] = v

In [ ]:

is_editor_title = lambda x: x.endswith('.') and x.isupper()
has_editor_title = lambda x: is_editor_title(x.split('\n')[0]) # check 1st line for editor title

d = {}
for p in poem_list:
    # first line will always be the key 
    k = p[:p.index('\n')] 
    # default value is the whole poem (covers poems with no editor title)
    v = p
    
    if has_editor_title(p):
        # find string that should be the value (poem minus title)
        # YOUR CODE HERE
        v = ...
    # add new pair to dictionary
    # update function handles altering the key if necessary
    # (i.e., incrementing the numerical suffix of the title)
    update(d, k, v)

In [ ]:

print(d['HOPE.'])

Q2.8 - Most frequent long words

We can also ask statistical questions of the data such as, "what are the most common words?"

Find the 10 most frequently used words longer than 10 characters. Ignore case and any punctuation at the end of the word when counting. Consider only the poem text and not the editor-provided titles. Store the top 10 words in a list called top_words in order of decreasing frequency.

You may find the punc variable below useful. It is the set of the punctuation characters occurring at the end of a word in the text. You can assume that no words end with multiple punctuation characters.
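Counting of this kind is often done with collections.Counter from the standard library. The sketch below runs on two invented sentences with a hand-written punctuation set, and it assumes that word length is measured after the trailing punctuation has been removed.

```python
from collections import Counter

toy_poems = [
    "Extraordinary days, extraordinary nights!",
    "An EXTRAORDINARY, unforgettable sea; unforgettable.",
]
toy_punc = {",", "!", ";", "."}
min_len = 10  # count only words longer than this many characters

counts = Counter()
for poem in toy_poems:
    for w in poem.lower().split():
        if w[-1] in toy_punc:
            w = w[:-1]          # drop a single trailing punctuation mark
        if len(w) > min_len:
            counts[w] += 1

most_common = [w for w, _ in counts.most_common(2)]
```

most_common(n) returns (word, count) pairs in decreasing order of count, with ties kept in the order the words were first counted.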

In [ ]:

# punctuation characters to remove when counting words
punc = {w[-1] for p in d.values() for w in p.lower().split() if not w[-1].isalpha()}
punc

In [ ]:

    
top_words = ...

In [ ]:

top_words

Q2.9 - Random Poem Generator

We're now ready to create our random poem generator.

Complete the PoetryCollection class below by defining an __init__ method. It should take the author's full name and your poem dictionary as arguments and define the instance attributes author, collection, and size as described in the docstring.

Then instantiate a PoetryCollection using our Emily Dickinson data and call it poems.
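As a reminder of how __init__ stores arguments as instance attributes, here is an unrelated toy class. The Playlist class and all of its names are invented, and it uses the standard library's random.Random for seeded selection as a stand-in, not the NumPy generator used in this assignment.

```python
import random

class Playlist:
    def __init__(self, owner: str, tracks: dict):
        self.owner = owner
        self.tracks = tracks
        self.size = len(tracks)   # derived once, at construction time

    def random_track(self, seed=None) -> str:
        rng = random.Random(seed)  # same seed gives the same sequence
        return rng.choice(list(self.tracks.values()))

mix = Playlist("A. Listener", {"one": "la", "two": "di", "three": "da"})
same_pick = mix.random_track(seed=1) == mix.random_track(seed=1)
```

Because size is computed once in __init__, it would go stale if the dictionary were modified afterwards. That is acceptable for a collection that is built once and only read from.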

In [ ]:

import numpy as np  # required by random_poem below

class PoetryCollection():
    """
    Attributes
    ----------
    author : str
        full name of author
    collection : dict
        dictionary of (title, poem) key-value pairs
    size : int
        number of poems in collection
    
    Methods
    -------
    random_poem(seed: int = None) -> str
        returns random poem; use seed for reproducibility (default seed=None)
    """
    ...
        self.author = ...
        self.collection = ...
        self.size = ...
    
    def random_poem(self, seed : int = None) -> str:
        rng = np.random.default_rng(seed) 
        return str(rng.choice(list(self.collection.values())))

In [ ]:

poems = ...

In [ ]:

print('Author:', poems.author)

In [ ]:

print('Number of poems in collection:', poems.size)

In [ ]:

print(poems.random_poem(seed=109))
