1.4 Plotting the classics


This page is largely derived from Plotting_the_Classics of the UC Berkeley course - see the license file on the main website.

In this example, we will explore statistics for Pride and Prejudice by Jane Austen. The text of any book can be read by a computer at great speed. In the United States, books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

This example is meant to illustrate some of the broad themes of this text. Don’t worry if the details of the program don’t yet make sense. Instead, focus on interpreting the images generated below. Later sections of the text will describe most of the features of the Python programming language used below.

First, we read the text of the book into the memory of the computer.

# Get the text for Pride and Prejudice
book_url = 'http://www.gutenberg.org/ebooks/42671.txt.utf-8'
book_text = read_url(book_url)

On the last line, Python gets the text of the book (read_url(book_url)) and gives it a name (book_text). In Python, a name cannot contain any spaces, so we will often use an underscore _ to stand in for a space. The = gives a name (on the left) to the result of some computation described on the right.

A uniform resource locator or URL is an address on the Internet for some content; in this case, the text of a book. The # symbol starts a comment, which is ignored by the computer but helpful for people reading the code.
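The code above uses read_url without showing where it comes from. As a rough sketch only, here is one way such a helper could be written with Python's standard library. This is an assumption for illustration, not necessarily the definition used for this page; in particular, the choice of UTF-8 decoding is ours.

```python
from urllib.request import urlopen


def read_url(url):
    """Fetch the content at `url` and return it as text.

    Hypothetical helper: assumes the content is encoded as UTF-8.
    """
    # Open the address and read the raw bytes the server sends back
    with urlopen(url) as response:
        raw_bytes = response.read()
    # Convert the bytes into a Python string
    return raw_bytes.decode('utf-8')
```

With a helper like this, read_url(book_url) would return the whole book as one long string.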

Now that we have the text attached to the name book_text, we can ask Python to show us how the text starts:

# Show the first 500 characters of the text
print(book_text[:500])
The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited
by R. W. (Robert William) Chapman


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org





Title: Pride and Prejudice


Author: Jane Austen

Editor: R. W. (Robert William) Chapman

Release Date: May 9, 2013 

You might want to check that this is the same as the text you see when you open the URL in your browser: http://www.gutenberg.org/ebooks/42671.txt.utf-8
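The [:500] after book_text asks for the part of the text before position 500, that is, the first 500 characters. Here is a small sketch of the same idea on a short stand-in string (sample_text is a made-up name, not part of the book code), so that the result is easy to check by eye.

```python
# A short stand-in string, so the result is easy to check
sample_text = 'It is a truth universally acknowledged'

# [:5] keeps the characters before position 5, that is, the first five
first_five = sample_text[:5]
print(first_five)

# Asking for more characters than the text contains just gives the whole text
whole_text = sample_text[:500]
print(whole_text)
```

The first print shows It is, and the second shows the whole sentence.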

Now that we have the text in memory, we can start to analyze it. First we break the text into chapters. Don’t worry about the details of the code; we will cover these in the rest of the course.

# Break the text into Chapters
book_chapters = book_text.split('CHAPTER ')
# Drop the first "Chapter" - it's the Project Gutenberg header
book_chapters = book_chapters[1:]
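To see what these two steps do, here is a small sketch on a made-up string rather than the book itself. The names sample_text, pieces and chapters are ours, chosen for illustration.

```python
# A made-up miniature "book" with a header and two chapters
sample_text = 'Header text CHAPTER I. First words CHAPTER II. More words'

# split cuts the text wherever 'CHAPTER ' appears, and drops the separator
pieces = sample_text.split('CHAPTER ')
print(pieces)

# The first piece is whatever came before the first 'CHAPTER ';
# [1:] keeps everything after that first piece
chapters = pieces[1:]
print(chapters)
```

Each remaining piece starts with the chapter number, because split removes the word CHAPTER and the space after it.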

We can show the first half-line or so for each chapter by putting the chapters into a table. You will see these tables, called data frames, many times during this course.

# Show the first few words of each chapter in a table.
import pandas as pd  # pandas provides data frames
pd.DataFrame(book_chapters, columns=['Chapters'])
Chapters
0 I.\r\n\r\n\r\nIt is a truth universally acknow...
1 II.\r\n\r\n\r\nMr. Bennet was among the earlie...
2 III.\r\n\r\n\r\nNot all that Mrs. Bennet, howe...
3 IV.\r\n\r\n\r\nWhen Jane and Elizabeth were al...
4 V.\r\n\r\n\r\nWithin a short walk of Longbourn...
5 VI.\r\n\r\n\r\nThe ladies of Longbourn soon wa...
6 VII.\r\n\r\n\r\nMr. Bennet's property consiste...
7 VIII.\r\n\r\n\r\nAt five o'clock the two ladie...
8 IX.\r\n\r\n\r\nElizabeth passed the chief of t...
9 X.\r\n\r\n\r\nThe day passed much as the day b...
10 XI.\r\n\r\n\r\nWhen the ladies removed after d...
11 XII.\r\n\r\n\r\nIn consequence of an agreement...
12 XIII.\r\n\r\n\r\n"I hope, my dear," said Mr. B...
13 XIV.\r\n\r\n\r\nDuring dinner, Mr. Bennet scar...
14 XV.\r\n\r\n\r\nMr. Collins was not a sensible ...
15 XVI.\r\n\r\n\r\nAs no objection was made to th...
16 XVII.\r\n\r\n\r\nElizabeth related to Jane the...
17 XVIII.\r\n\r\n\r\nTill Elizabeth entered the d...
18 XIX.\r\n\r\n\r\nThe next day opened a new scen...
19 XX.\r\n\r\n\r\nMr. Collins was not left long t...
20 XXI.\r\n\r\n\r\nThe discussion of Mr. Collins'...
21 XXII.\r\n\r\n\r\nThe Bennets were engaged to d...
22 XXIII.\r\n\r\n\r\nElizabeth was sitting with h...
23 I.\r\n\r\n\r\nMiss Bingley's letter arrived, a...
24 II.\r\n\r\n\r\nAfter a week spent in professio...
25 III.\r\n\r\n\r\nMrs. Gardiner's caution to Eli...
26 IV.\r\n\r\n\r\nWith no greater events than the...
27 V.\r\n\r\n\r\nEvery object in the next day's j...
28 VI.\r\n\r\n\r\nMr. Collins's triumph in conseq...
29 VII.\r\n\r\n\r\nSir William staid only a week ...
... ...
31 IX.\r\n\r\n\r\nElizabeth was sitting by hersel...
32 X.\r\n\r\n\r\nMore than once did Elizabeth in ...
33 XI.\r\n\r\n\r\nWhen they were gone, Elizabeth,...
34 XII.\r\n\r\n\r\nElizabeth awoke the next morni...
35 XIII.\r\n\r\n\r\nIf Elizabeth, when Mr. Darcy ...
36 XIV.\r\n\r\n\r\nThe two gentlemen left Rosings...
37 XV.\r\n\r\n\r\nOn Saturday morning Elizabeth a...
38 XVI.\r\n\r\n\r\nIt was the second week in May,...
39 XVII.\r\n\r\n\r\nElizabeth's impatience to acq...
40 XVIII.\r\n\r\n\r\nThe first week of their retu...
41 XIX.\r\n\r\n\r\nHad Elizabeth's opinion been a...
42 I.\r\n\r\n\r\nElizabeth, as they drove along, ...
43 II.\r\n\r\n\r\nElizabeth had settled it that M...
44 III.\r\n\r\n\r\nConvinced as Elizabeth now was...
45 IV.\r\n\r\n\r\nElizabeth had been a good deal ...
46 V.\r\n\r\n\r\n"I have been thinking it over ag...
47 VI.\r\n\r\n\r\nThe whole party were in hopes o...
48 VII.\r\n\r\n\r\nTwo days after Mr. Bennet's re...
49 VIII.\r\n\r\n\r\nMr. Bennet had very often wis...
50 IX.\r\n\r\n\r\nTheir sister's wedding day arri...
51 X.\r\n\r\n\r\nElizabeth had the satisfaction o...
52 XI.\r\n\r\n\r\nMr. Wickham was so perfectly sa...
53 XII.\r\n\r\n\r\nAs soon as they were gone, Eli...
54 XIII.\r\n\r\n\r\nA few days after this visit, ...
55 XIV.\r\n\r\n\r\nOne morning, about a week afte...
56 XV.\r\n\r\n\r\nThe discomposure of spirits, wh...
57 XVI.\r\n\r\n\r\nInstead of receiving any such ...
58 XVII.\r\n\r\n\r\n"My dear Lizzy, where can you...
59 XVIII.\r\n\r\n\r\nElizabeth's spirits soon ris...
60 XIX.\r\n\r\n\r\nHappy for all her maternal fee...

61 rows × 1 columns

This is your first view of a data frame. Ignore the first column for now - it is just a row number. The second column shows the first few characters of the text in the chapter. The text starts with the chapter number in Roman numerals. You might want to check the text from the link above to reassure yourself that this comes from the text we downloaded. Next, you see some odd characters with backslashes, such as \r and \n. These are how Python displays the carriage-return and newline characters that mark line and paragraph breaks in the text. Last, you will see the beginning of the first sentence of the chapter.
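If the backslash characters look puzzling, this small sketch may help. It uses a short made-up string shaped like the start of a chapter (chapter_start is our name, for illustration) and shows it in two ways.

```python
# A short string shaped like the start of a chapter
chapter_start = 'I.\r\n\r\n\r\nIt is a truth'

# repr shows the special characters in their backslash form,
# much as the table above does
print(repr(chapter_start))

# print shows the text as it would appear on the page,
# with the line breaks acted on rather than displayed
print(chapter_start)

# splitlines cuts the text at each line break
lines = chapter_start.splitlines()
print(lines)
```

The list from splitlines has the chapter number first, then two empty lines, then the start of the sentence.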

This page has content from the Plotting_the_Classics notebook from the UC Berkeley course. See the Berkeley course section of the license.