Data frames, Series and arrays¶
We have now come across three important structures that Python uses to store and access data:
arrays
data frames
series
Here we stop to go back over the differences between these structures, and how to convert between them.
Data frames¶
We start by loading a data frame from a Comma Separated Value file (CSV file).
The data file we will load is a table with average https://ratemyprofessors.com scores across all professors teaching a particular academic discipline.
See the array indexing page for more detail.
Each row in this table corresponds to one discipline. Each column corresponds to a different rating.
If you are running on your laptop, you should download
the rate_my_course.csv
file to the same directory as this notebook.
# Load the Numpy library, rename to "np"
import numpy as np
# Load the Pandas data science library, rename to "pd"
import pandas as pd
# Turn on a setting to use Pandas more safely.
# We will discuss this setting later.
pd.set_option('mode.chained_assignment', 'raise')
# Read the file as a data frame.
courses = pd.read_csv('rate_my_course.csv')
# Show the first five rows.
courses.head()
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
The pd.read_csv
function returned this table in a structure called a data frame.
type(courses)
pandas.core.frame.DataFrame
The data frame is a two-dimensional structure. It has rows, and columns. We can see the number of rows and columns with:
courses.shape
(75, 6)
This means there are 75 rows. In this case, each row corresponds to one discipline.
There are 6 columns. In this case, each column corresponds to a different student rating.
Passing the data frame to the Python len
function shows us the number of rows:
len(courses)
75
Indexing into data frames¶
There are two simple ways of indexing into data frames.
We index into a data frame to get a subset of of the data.
To index into anything, we can give the name of thing - in this case courses
- followed by an opening square bracket [
, followed by something to specify which subset of the data we want, followed by a closing square bracket ]
.
The two simple ways of indexing into a data frame are:
Indexing with a string to get a column.
Indexing with a Boolean sequence to get a subset of the rows.
When we index with a string, the string should be a column name:
easiness = courses['Easiness']
The result is a series:
type(easiness)
pandas.core.series.Series
The Series is a structure that holds the data for a single column.
easiness
0 3.162754
1 3.063322
2 2.710459
3 3.316210
4 3.053803
...
70 2.863504
71 3.106727
72 3.309636
73 2.799135
74 3.109118
Name: Easiness, Length: 75, dtype: float64
We will come back to the Series soon.
Notice that, if your string specifying the column name does not match a column name exactly, you will get a long error. This gives you some practice in reading long error messages - skip to the end first, you will often see the most helpful information there.
# The exact column name starts with capital E
courses['easiness']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'easiness'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-10-cd3826a0b407> in <module>
1 # The exact column name starts with capital E
----> 2 courses['easiness']
~/Library/Python/3.8/lib/python/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
~/Library/Python/3.8/lib/python/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 'easiness'
You have just seen indexing into the data frame with a string to get the data for one column.
The other simple way of indexing into a data frame is with a Boolean sequence.
A Boolean sequence is a sequence of values, all of which are either True or False. Examples of sequences are series and arrays.
For example, imagine we only wanted to look at courses with an easiness rating of greater than 3.25.
We first make the Boolean sequence, by asking the question > 3.25
of the values in the “Easiness” column, like this:
is_easy = easiness > 3.25
This is a series that has True and False values:
type(is_easy)
pandas.core.series.Series
is_easy
0 False
1 False
2 False
3 True
4 False
...
70 False
71 False
72 True
73 False
74 False
Name: Easiness, Length: 75, dtype: bool
It has True values where the corresponding row had an “Easiness” score greater than 3.25, and False values where the corresponding row had an “Easiness” score of less than or equal to 3.25.
We can index into the data frame with this Boolean series.
When we do this, we ask the data frame to give us a new version of itself, that only has the rows where there was a True value in the Boolean series:
easy_courses = courses[is_easy]
The result is a data frame:
type(easy_courses)
pandas.core.frame.DataFrame
The data frame contains only the rows where the “Easiness” score is greater than 3.25:
easy_courses
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
11 | Sociology | 4839 | 3.740980 | 3.748169 | 3.746962 | 3.395819 |
14 | Languages | 3867 | 3.772780 | 3.917949 | 3.846951 | 3.277406 |
18 | Education | 2544 | 3.707429 | 3.806128 | 3.758211 | 3.430916 |
19 | Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
21 | Health | 1937 | 3.891177 | 3.884729 | 3.891213 | 3.468012 |
22 | Humanities | 1897 | 3.806969 | 3.816299 | 3.813569 | 3.344138 |
24 | Criminal Justice | 1786 | 4.056685 | 4.033779 | 4.046702 | 3.469440 |
27 | Social Science | 1412 | 3.683555 | 3.691133 | 3.690262 | 3.338846 |
31 | Theater | 1078 | 3.876633 | 3.821503 | 3.851837 | 3.584508 |
36 | Physical Education | 991 | 4.078698 | 4.030797 | 4.057719 | 3.832250 |
42 | Spanish | 584 | 3.854932 | 3.993716 | 3.928750 | 3.349743 |
43 | Art | 574 | 3.797073 | 3.799216 | 3.800157 | 3.268293 |
44 | Ethnic Studies | 481 | 3.533597 | 3.578337 | 3.556923 | 3.387630 |
45 | Film | 450 | 3.788933 | 3.764689 | 3.778689 | 3.324156 |
46 | Women's Studies | 393 | 3.639949 | 3.652087 | 3.646743 | 3.276412 |
48 | Government | 374 | 3.957540 | 3.916070 | 3.938663 | 3.423021 |
49 | Speech | 351 | 4.133191 | 4.101197 | 4.119345 | 3.674701 |
50 | Social Work | 317 | 3.770726 | 3.834700 | 3.806246 | 3.314322 |
54 | Hospitality | 267 | 3.697228 | 3.744607 | 3.719476 | 3.471498 |
58 | Foreign Language | 209 | 3.914641 | 4.057751 | 3.991675 | 3.323158 |
59 | Religious Studies | 193 | 3.932746 | 3.985751 | 3.962280 | 3.365544 |
61 | Child Development | 171 | 3.950585 | 4.008070 | 3.979766 | 3.606082 |
63 | French | 162 | 3.775370 | 3.943086 | 3.860679 | 3.267099 |
64 | Reading | 148 | 4.159392 | 4.188919 | 4.177500 | 3.882635 |
65 | Modern Languages | 147 | 3.858503 | 3.971293 | 3.916259 | 3.275238 |
66 | Agriculture | 146 | 3.767466 | 3.863219 | 3.820068 | 3.402397 |
67 | Kinesiology | 132 | 3.995000 | 3.972879 | 3.988712 | 3.541439 |
68 | Nutrition | 120 | 3.815167 | 3.844333 | 3.831500 | 3.468333 |
72 | Linguistics | 110 | 3.749000 | 3.834545 | 3.798182 | 3.309636 |
The way this works can be easier to see when we use a smaller data frame.
Here we take the first eight rows from the data frame, by using the head
method.
The head
method can take an argument, which is the number of rows we want.
first_8 = courses.head(8)
The result is a new data frame:
type(first_8)
pandas.core.frame.DataFrame
first_8
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
We index into the new data frame with a string, to get the “Easiness” column:
easiness_first_8 = first_8["Easiness"]
easiness_first_8
0 3.162754
1 3.063322
2 2.710459
3 3.316210
4 3.053803
5 2.652054
6 3.379829
7 3.172033
Name: Easiness, dtype: float64
This Boolean series has True where the “Easiness” score is greater than 3.25, and False otherwise:
is_easy_first_8 = easiness_first_8 > 3.25
is_easy_first_8
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 False
Name: Easiness, dtype: bool
We index into the first_8
data frame with this Boolean series, to select the rows where is_easy_first_8
has True, and throw away the rows where it has False.
easy_first_8 = first_8[is_easy_first_8]
easy_first_8
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
Oh dear, Psychology looks pretty easy.
Series and array¶
The series, as you have seen, is the structure that Pandas uses to store the data from a column:
first_8
Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness | |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
easiness_first_8 = first_8["Easiness"]
easiness_first_8
0 3.162754
1 3.063322
2 2.710459
3 3.316210
4 3.053803
5 2.652054
6 3.379829
7 3.172033
Name: Easiness, dtype: float64
You can index into a series, but this indexing is powerful and sophisticated, so we will not use that for now.
For now, you can convert the series to an array, like this:
easi_8 = np.array(easiness_first_8)
easi_8
array([3.16275414, 3.06332232, 2.71045949, 3.31620986, 3.0538026 ,
2.65205418, 3.37982853, 3.17203268])
Then you can use the usual array indexing to get the values you want:
# The first value
easi_8[0]
3.16275414471149
# The first five values
easi_8[:5]
array([3.16275414, 3.06332232, 2.71045949, 3.31620986, 3.0538026 ])
You can think of a data frame as sequence of columns, where each column is series.
Here I take two columns from the data frame, as series:
disciplines = first_8['Discipline']
disciplines
0 English
1 Mathematics
2 Biology
3 Psychology
4 History
5 Chemistry
6 Communications
7 Business
Name: Discipline, dtype: object
clarity = first_8['Clarity']
clarity
0 3.756147
1 3.487379
2 3.608331
3 3.909520
4 3.788818
5 3.387174
6 3.867349
7 3.640327
Name: Clarity, dtype: float64
I can make a new data frame by inserting these two columns:
# A new data frame
thinner_courses = pd.DataFrame()
# We will come to the need for the `copy` later.
thinner_courses['Discipline'] = disciplines.copy()
thinner_courses['Clarity'] = clarity.copy()
thinner_courses
Discipline | Clarity | |
---|---|---|
0 | English | 3.756147 |
1 | Mathematics | 3.487379 |
2 | Biology | 3.608331 |
3 | Psychology | 3.909520 |
4 | History | 3.788818 |
5 | Chemistry | 3.387174 |
6 | Communications | 3.867349 |
7 | Business | 3.640327 |