{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# HIDDEN\n", "# The standard set of libraries we need\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Make plots look a little bit more fancy\n", "plt.style.use('fivethirtyeight')\n", "\n", "# The standard library for data in tables\n", "import pandas as pd\n", "\n", "# A tiny function to read a file directly from a URL\n", "from urllib.request import urlopen\n", "\n", "def read_url(url):\n", " return urlopen(url).read().decode()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This page is largely derived from `Another_Kind_Of_Character` of the UC\n", "Berkeley course \\- see the license file on the main website." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# HIDDEN\n", "# Read the text of Pride and Prejudice, split into chapters.\n", "book_url = 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt'\n", "book_text = read_url(book_url)\n", "# Break the text into Chapters\n", "book_chapters = book_text.split('CHAPTER ')\n", "# Drop the first \"Chapter\" - it's the Project Gutenberg header\n", "book_chapters = book_chapters[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some situations, the relationships between quantities allow us to make\n", "predictions. This text will explore how to make accurate predictions based on\n", "incomplete information and develop methods for combining multiple sources of\n", "uncertain information to make decisions.\n", "\n", "As an example of visualizing information derived from multiple sources, let us\n", "first use the computer to get some information that would be tedious to\n", "acquire by hand. 
In the context of novels, the word \"character\" has a second\n", "meaning: a printed symbol such as a letter or number or punctuation symbol.\n", "Here, we ask the computer to count the number of characters and the number of\n", "periods in each chapter of *Pride and Prejudice*." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# In each chapter, count the number of all characters;\n", "# Also count the number of periods.\n", "chars_periods = pd.DataFrame.from_dict({\n", " 'Number of chars in chapter': [len(s) for s in book_chapters],\n", " 'Number of periods': np.char.count(book_chapters, '.')\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the data. Each row of the table corresponds to one chapter of the\n", "novel and displays the number of characters as well as the number of periods\n", "in the chapter. Not surprisingly, chapters with fewer characters also tend to\n", "have fewer periods, in general – the shorter the chapter, the fewer sentences\n", "there tend to be, and vice versa. The relation is not entirely predictable,\n", "however, as sentences are of varying lengths and can involve other punctuation\n", "such as question marks." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Number of chars in chapter</th><th>Number of periods</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>4613</td><td>59</td></tr>\n",
"    <tr><th>1</th><td>4420</td><td>63</td></tr>\n",
"    <tr><th>2</th><td>9746</td><td>106</td></tr>\n",
"    <tr><th>3</th><td>6045</td><td>54</td></tr>\n",
"    <tr><th>4</th><td>5390</td><td>61</td></tr>\n",
"    <tr><th>5</th><td>13287</td><td>114</td></tr>\n",
"    <tr><th>6</th><td>11492</td><td>111</td></tr>\n",
"    <tr><th>7</th><td>11274</td><td>109</td></tr>\n",
"    <tr><th>8</th><td>9971</td><td>119</td></tr>\n",
"    <tr><th>9</th><td>12798</td><td>126</td></tr>\n",
"    <tr><th>10</th><td>9160</td><td>99</td></tr>\n",
"    <tr><th>11</th><td>4020</td><td>29</td></tr>\n",
"    <tr><th>12</th><td>9654</td><td>93</td></tr>\n",
"    <tr><th>13</th><td>6576</td><td>58</td></tr>\n",
"    <tr><th>14</th><td>9956</td><td>82</td></tr>\n",
"    <tr><th>15</th><td>19475</td><td>185</td></tr>\n",
"    <tr><th>16</th><td>7436</td><td>69</td></tr>\n",
"    <tr><th>17</th><td>29704</td><td>270</td></tr>\n",
"    <tr><th>18</th><td>10845</td><td>88</td></tr>\n",
"    <tr><th>19</th><td>9374</td><td>108</td></tr>\n",
"    <tr><th>20</th><td>11521</td><td>86</td></tr>\n",
"    <tr><th>21</th><td>10181</td><td>75</td></tr>\n",
"    <tr><th>22</th><td>9807</td><td>85</td></tr>\n",
"    <tr><th>23</th><td>11066</td><td>113</td></tr>\n",
"    <tr><th>24</th><td>8774</td><td>90</td></tr>\n",
"    <tr><th>25</th><td>13227</td><td>126</td></tr>\n",
"    <tr><th>26</th><td>7374</td><td>61</td></tr>\n",
"    <tr><th>27</th><td>8390</td><td>70</td></tr>\n",
"    <tr><th>28</th><td>14010</td><td>127</td></tr>\n",
"    <tr><th>29</th><td>7187</td><td>46</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>31</th><td>8652</td><td>78</td></tr>\n",
"    <tr><th>32</th><td>10602</td><td>99</td></tr>\n",
"    <tr><th>33</th><td>12251</td><td>92</td></tr>\n",
"    <tr><th>34</th><td>18487</td><td>130</td></tr>\n",
"    <tr><th>35</th><td>12033</td><td>83</td></tr>\n",
"    <tr><th>36</th><td>7958</td><td>70</td></tr>\n",
"    <tr><th>37</th><td>6080</td><td>47</td></tr>\n",
"    <tr><th>38</th><td>8962</td><td>74</td></tr>\n",
"    <tr><th>39</th><td>9400</td><td>104</td></tr>\n",
"    <tr><th>40</th><td>13260</td><td>110</td></tr>\n",
"    <tr><th>41</th><td>11172</td><td>87</td></tr>\n",
"    <tr><th>42</th><td>28068</td><td>237</td></tr>\n",
"    <tr><th>43</th><td>13622</td><td>90</td></tr>\n",
"    <tr><th>44</th><td>10296</td><td>68</td></tr>\n",
"    <tr><th>45</th><td>17642</td><td>153</td></tr>\n",
"    <tr><th>46</th><td>22976</td><td>180</td></tr>\n",
"    <tr><th>47</th><td>12995</td><td>110</td></tr>\n",
"    <tr><th>48</th><td>12730</td><td>124</td></tr>\n",
"    <tr><th>49</th><td>12922</td><td>111</td></tr>\n",
"    <tr><th>50</th><td>11463</td><td>106</td></tr>\n",
"    <tr><th>51</th><td>17676</td><td>168</td></tr>\n",
"    <tr><th>52</th><td>16463</td><td>185</td></tr>\n",
"    <tr><th>53</th><td>9104</td><td>81</td></tr>\n",
"    <tr><th>54</th><td>13287</td><td>125</td></tr>\n",
"    <tr><th>55</th><td>15712</td><td>163</td></tr>\n",
"    <tr><th>56</th><td>9672</td><td>72</td></tr>\n",
"    <tr><th>57</th><td>13976</td><td>131</td></tr>\n",
"    <tr><th>58</th><td>13952</td><td>149</td></tr>\n",
"    <tr><th>59</th><td>9020</td><td>87</td></tr>\n",
"    <tr><th>60</th><td>26470</td><td>269</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>61 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " Number of chars in chapter Number of periods\n", "0 4613 59\n", "1 4420 63\n", "2 9746 106\n", "3 6045 54\n", "4 5390 61\n", "5 13287 114\n", "6 11492 111\n", "7 11274 109\n", "8 9971 119\n", "9 12798 126\n", "10 9160 99\n", "11 4020 29\n", "12 9654 93\n", "13 6576 58\n", "14 9956 82\n", "15 19475 185\n", "16 7436 69\n", "17 29704 270\n", "18 10845 88\n", "19 9374 108\n", "20 11521 86\n", "21 10181 75\n", "22 9807 85\n", "23 11066 113\n", "24 8774 90\n", "25 13227 126\n", "26 7374 61\n", "27 8390 70\n", "28 14010 127\n", "29 7187 46\n", ".. ... ...\n", "31 8652 78\n", "32 10602 99\n", "33 12251 92\n", "34 18487 130\n", "35 12033 83\n", "36 7958 70\n", "37 6080 47\n", "38 8962 74\n", "39 9400 104\n", "40 13260 110\n", "41 11172 87\n", "42 28068 237\n", "43 13622 90\n", "44 10296 68\n", "45 17642 153\n", "46 22976 180\n", "47 12995 110\n", "48 12730 124\n", "49 12922 111\n", "50 11463 106\n", "51 17676 168\n", "52 16463 185\n", "53 9104 81\n", "54 13287 125\n", "55 15712 163\n", "56 9672 72\n", "57 13976 131\n", "58 13952 149\n", "59 9020 87\n", "60 26470 269\n", "\n", "[61 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chars_periods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the plot below, there is a dot for each chapter in the book. The horizontal\n", "axis represents the number of periods and the vertical axis represents the\n", "number of characters." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(6, 6))\n", "plt.scatter(chars_periods['Number of periods'],\n", " chars_periods['Number of chars in chapter'],\n", " color='darkblue')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the blue points are roughly clustered around a straight line.\n", "\n", "Now look at all the chapters that contain about 100 periods. The plot shows\n", "that those chapters contain about 10,000 characters to about 15,000\n", "characters, roughly. That's about 90 to 150 characters per period.\n", "\n", "Indeed, it appears from looking at the plot that the chapters tend to have\n", "somewhere between 100 and 150 characters between periods, as a very rough\n", "estimate. Perhaps Jane Austen was announcing something familiar to us now: the\n", "original 140-character limit of Twitter.\n", "\n", "{% data8page Another_Kind_Of_Character %}" ] } ], "metadata": { "jupytext": { "formats": "Rmd:rmarkdown,ipynb", "text_representation": { "extension": ".Rmd", "format_name": "rmarkdown", "format_version": "1.0" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }