{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# HIDDEN\n", "# The standard set of libraries we need\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Make plots look a little bit more fancy\n", "plt.style.use('fivethirtyeight')\n", "\n", "# The standard library for data in tables\n", "import pandas as pd\n", "\n", "# A tiny function to read a file directly from a URL\n", "from urllib.request import urlopen\n", "\n", "def read_url(url):\n", " return urlopen(url).read().decode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This page is largely derived from `Another_Kind_Of_Character` of the UC\n", "Berkeley course \\- see the license file on the main website." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# HIDDEN\n", "# Read the text of Pride and Prejudice, split into chapters.\n", "book_url = 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt'\n", "book_text = read_url(book_url)\n", "# Break the text into Chapters\n", "book_chapters = book_text.split('CHAPTER ')\n", "# Drop the first \"Chapter\" - it's the Project Gutenberg header\n", "book_chapters = book_chapters[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some situations, the relationships between quantities allow us to make\n", "predictions. This text will explore how to make accurate predictions based on\n", "incomplete information and develop methods for combining multiple sources of\n", "uncertain information to make decisions.\n", "\n", "As an example of visualizing information derived from multiple sources, let us\n", "first use the computer to get some information that would be tedious to\n", "acquire by hand. 
In the context of novels, the word \"character\" has a second\n", "meaning: a printed symbol such as a letter or number or punctuation symbol.\n", "Here, we ask the computer to count the number of characters and the number of\n", "periods in each chapter of *Pride and Prejudice*." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# In each chapter, count the number of all characters;\n", "# Also count the number of periods.\n", "chars_periods = pd.DataFrame.from_dict({\n", " 'Number of chars in chapter': [len(s) for s in book_chapters],\n", " 'Number of periods': np.char.count(book_chapters, '.')\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the data. Each row of the table corresponds to one chapter of the\n", "novel and displays the number of characters as well as the number of periods\n", "in the chapter. Not surprisingly, chapters with fewer characters also tend to\n", "have fewer periods, in general – the shorter the chapter, the fewer sentences\n", "there tend to be, and vice versa. The relation is not entirely predictable,\n", "however, as sentences are of varying lengths and can involve other punctuation\n", "such as question marks." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|  | Number of chars in chapter | Number of periods |\n",
"|---|---|---|\n",
"| 0 | 4613 | 59 |\n",
"| 1 | 4420 | 63 |\n",
"| 2 | 9746 | 106 |\n",
"| 3 | 6045 | 54 |\n",
"| 4 | 5390 | 61 |\n",
"| 5 | 13287 | 114 |\n",
"| 6 | 11492 | 111 |\n",
"| 7 | 11274 | 109 |\n",
"| 8 | 9971 | 119 |\n",
"| 9 | 12798 | 126 |\n",
"| 10 | 9160 | 99 |\n",
"| 11 | 4020 | 29 |\n",
"| 12 | 9654 | 93 |\n",
"| 13 | 6576 | 58 |\n",
"| 14 | 9956 | 82 |\n",
"| 15 | 19475 | 185 |\n",
"| 16 | 7436 | 69 |\n",
"| 17 | 29704 | 270 |\n",
"| 18 | 10845 | 88 |\n",
"| 19 | 9374 | 108 |\n",
"| 20 | 11521 | 86 |\n",
"| 21 | 10181 | 75 |\n",
"| 22 | 9807 | 85 |\n",
"| 23 | 11066 | 113 |\n",
"| 24 | 8774 | 90 |\n",
"| 25 | 13227 | 126 |\n",
"| 26 | 7374 | 61 |\n",
"| 27 | 8390 | 70 |\n",
"| 28 | 14010 | 127 |\n",
"| 29 | 7187 | 46 |\n",
"| ... | ... | ... |\n",
"| 31 | 8652 | 78 |\n",
"| 32 | 10602 | 99 |\n",
"| 33 | 12251 | 92 |\n",
"| 34 | 18487 | 130 |\n",
"| 35 | 12033 | 83 |\n",
"| 36 | 7958 | 70 |\n",
"| 37 | 6080 | 47 |\n",
"| 38 | 8962 | 74 |\n",
"| 39 | 9400 | 104 |\n",
"| 40 | 13260 | 110 |\n",
"| 41 | 11172 | 87 |\n",
"| 42 | 28068 | 237 |\n",
"| 43 | 13622 | 90 |\n",
"| 44 | 10296 | 68 |\n",
"| 45 | 17642 | 153 |\n",
"| 46 | 22976 | 180 |\n",
"| 47 | 12995 | 110 |\n",
"| 48 | 12730 | 124 |\n",
"| 49 | 12922 | 111 |\n",
"| 50 | 11463 | 106 |\n",
"| 51 | 17676 | 168 |\n",
"| 52 | 16463 | 185 |\n",
"| 53 | 9104 | 81 |\n",
"| 54 | 13287 | 125 |\n",
"| 55 | 15712 | 163 |\n",
"| 56 | 9672 | 72 |\n",
"| 57 | 13976 | 131 |\n",
"| 58 | 13952 | 149 |\n",
"| 59 | 9020 | 87 |\n",
"| 60 | 26470 | 269 |\n",
"\n",
"61 rows × 2 columns\n",
"