3.14 Selecting in arrays
Selecting values from an array
import numpy as np
We use arrays all the time in data science.
One of the most common tasks we have to do on arrays is to select values.
We do this with array slicing.
We do array slicing when we follow the array variable name with [ (open square bracket), followed by something to specify which elements we want, followed by ] (close square bracket).
The simplest case is when we want a single element from a one-dimensional array. In that case the thing between the [ and the ] is the index of the value that we want.
The index of the first value is 0, the index of the second value is 1, and so on.
We start by loading data from a Comma Separated Value file (CSV file).
This is summary data that Andrew Rosen downloaded from https://www.ratemyprofessors.com for an analysis he reported in a 2018 paper. The data file here is from the supplementary material; it has the average rating across academic discipline for all professors in a particular discipline who have more than 20 ratings.
See the dataset page for more detail.
If you are running on your laptop, you should download the rate_my_course.csv file to the same directory as this notebook.
Don’t worry about the next cell. It just loads the data we need from a data file. We will come on to the machinery to do this very soon.
# This code we have not covered yet. We will soon.
# It loads the data file, and makes some arrays.
# Load the library for reading data files.
import pandas as pd
# Read the file.
courses = pd.read_csv('rate_my_course.csv')
# Sort the courses by number of rated teachers, most first.
courses_by_n = courses.sort_values(
'Number of Professors', ascending=False)
# Select the eight courses with the most rated professors.
big_courses = courses_by_n.head(8)
big_courses
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |
# Again - don't worry about this code - we will cover it later.
# Put the columns into arrays
disciplines = big_courses['Discipline'].values
easiness = big_courses['Easiness'].values
quality = big_courses['Overall Quality'].values
We now have the names of the disciplines with the largest number of professors.
disciplines
array(['English', 'Mathematics', 'Biology', 'Psychology', 'History',
'Chemistry', 'Communications', 'Business'], dtype=object)
Here we get the first value. This value is at index 0.
# Get the first value (at index position 0)
disciplines[0]
'English'
Notice that this is another way of writing:
disciplines.item(0)
'English'
# Get the second value (at index position 1)
disciplines[1]
'Mathematics'
# Get the third value (at index position 2)
disciplines[2]
'Biology'
It takes some time to get used to the idea that the first value is at index position 0. There are good reasons for this, and many programming languages have this convention, but it does take a while to get this habit of mind.
Notice too that we use square brackets for indexing.
We have seen square brackets before, when we create lists. For example, we can create a list with my_list = [1, 2, 3]. (We usually then create an array with something like my_array = np.array(my_list).) This is a different use of square brackets. When we create a list, the square brackets tell Python that the elements between the brackets should be the elements of the list. This use is called a list literal expression. The square brackets follow an equal sign, or some other operator, or start the line.
When we use square brackets for indexing, the square brackets always follow an expression that returns an array. For example, consider this line:
disciplines[2]
'Biology'
disciplines is an expression that returns an array. The open square bracket follows this expression, and so Python can tell that this use of square brackets means: select an element or elements from the array.
So we have seen different uses of square brackets:
- Creating a list (a list literal), often used in making arrays.
- Indexing into an array.
Python can tell which of these two we mean, from the context.
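As a small sketch of the two uses side by side, here is an example with made-up values. The names third_value and second_value are only for illustration.

```python
import numpy as np

# Square brackets as a list literal: they build a new list.
my_list = [10, 20, 30]
# The list is then often turned into an array.
my_array = np.array(my_list)

# Square brackets for indexing: they follow an expression that
# returns an array, and select a value from it.
third_value = my_array[2]
print(third_value)

# Both uses can appear in one line. The first pair of brackets makes
# a list; the second pair follows the np.array(...) expression, so it
# indexes into the array made from that list.
second_value = np.array([10, 20, 30])[1]
print(second_value)
```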
Index with negative numbers
If we know how many elements the array has, then we can get the last element by using the number of elements, minus one (why?).
Here the number of elements is:
len(disciplines)
8
So, the last element of the array is at index position 7:
disciplines[7]
'Business'
In fact, there is a short cut for getting elements at the end of the array, and that is to use an offset with a minus in front. The number is then the offset from one past the last item. For example, here is another way to get the last element:
disciplines[-1]
'Business'
The last but one element:
disciplines[-2]
'Communications'
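To see how negative indices relate to the ordinary ones, here is a short sketch using a made-up array called letters (not part of the course data). It checks that index -k selects the same element as index n - k, where n is the number of elements.

```python
import numpy as np

# Hypothetical example array with five elements.
letters = np.array(['a', 'b', 'c', 'd', 'e'])
n = len(letters)

# The last element is at index n - 1, because counting starts at 0.
last_by_count = letters[n - 1]
# A negative index counts back from the end of the array.
last_by_negative = letters[-1]
print(last_by_count, last_by_negative)

# For k from 1 up to n, letters[-k] is the same element as letters[n - k].
for k in range(1, n + 1):
    assert letters[-k] == letters[n - k]
```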
Index with slices
Sometimes we want more than one element from the array. For example, we might want the first 4 elements from the array. We can get these using an array slice. It looks like this:
# All the elements from offset zero up to (not including) 4
disciplines[0:4]
array(['English', 'Mathematics', 'Biology', 'Psychology'], dtype=object)
# All the elements from offset 4 up to (not including) 8
disciplines[4:8]
array(['History', 'Chemistry', 'Communications', 'Business'], dtype=object)
You can omit the first number, if you mean to start at offset 0:
disciplines[:4]
array(['English', 'Mathematics', 'Biology', 'Psychology'], dtype=object)
You can omit the last number if you mean to end at the last element of the array:
disciplines[4:]
array(['History', 'Chemistry', 'Communications', 'Business'], dtype=object)
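The slicing rules above can be checked on a small made-up array. This is only a sketch; the array values and the names first_four, last_four and rejoined are assumptions for illustration.

```python
import numpy as np

# Hypothetical example array with eight elements.
values = np.array([10, 20, 30, 40, 50, 60, 70, 80])

first_four = values[0:4]  # Offsets 0, 1, 2, 3.
last_four = values[4:8]   # Offsets 4, 5, 6, 7.

# The stop offset is not included, so the number of elements in the
# slice is the stop offset minus the start offset.
print(len(first_four))

# Omitting the start means "from offset 0".
print(np.array_equal(values[:4], first_four))
# Omitting the stop means "up to and including the last element".
print(np.array_equal(values[4:], last_four))

# Because the stop is excluded, values[:4] and values[4:] do not
# overlap, and together they give back the whole array.
rejoined = np.concatenate([values[:4], values[4:]])
print(np.array_equal(rejoined, values))
```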
Indexing with Boolean arrays
We often want to select several elements from an array according to some criterion.
The most common way to do this is to put a Boolean array between the square brackets.
It can be easier to understand this by example than by description.
Here are the “Overall Quality” scores for the eight largest courses:
quality
array([3.79136358, 3.56686702, 3.65764056, 3.9009491 , 3.77374607,
3.46548462, 3.87501873, 3.6633317 ])
These are the quality ratings corresponding to the disciplines we saw earlier. The top (largest) discipline is:
disciplines[0]
'English'
The Overall Quality rating for that course is:
quality[0]
3.7913635779463704
You have already come across Boolean arrays.
These are arrays of True and False values.
Here is a Boolean array, created from applying a comparison to an array:
greater_than_3 = easiness > 3
greater_than_3
array([ True, True, False, True, True, False, True, True])
This has a True value at the positions of elements > 3, and False otherwise.
As you have already seen, we can do things like count the number of True values in the Boolean array:
np.count_nonzero(greater_than_3)
6
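Here is a self-contained sketch of the same count, with the Easiness values typed in by hand from this page rather than loaded from the file. It also uses np.sum as a cross-check, which works because True counts as 1 and False as 0 when summed. The names n_true and n_true_by_sum are only for illustration.

```python
import numpy as np

# Easiness values copied by hand from the table on this page.
easiness = np.array([3.16275414, 3.06332232, 2.71045949, 3.31620986,
                     3.0538026, 2.65205418, 3.37982853, 3.17203268])

# The comparison gives one True or False value per element.
greater_than_3 = easiness > 3

# Count the True values.
n_true = np.count_nonzero(greater_than_3)
# Summing gives the same count, because True is treated as 1
# and False as 0.
n_true_by_sum = np.sum(greater_than_3)
print(n_true, n_true_by_sum)
```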
Now let us say that we wanted to get the elements from easiness that are greater than 3. That is, we want to get the elements in easiness for which the corresponding element in greater_than_3 is True.
We can do this with Boolean array indexing. The Boolean array goes between the square brackets, like this:
easiness
array([3.16275414, 3.06332232, 2.71045949, 3.31620986, 3.0538026 ,
2.65205418, 3.37982853, 3.17203268])
greater_than_3
array([ True, True, False, True, True, False, True, True])
easiness[greater_than_3]
array([3.16275414, 3.06332232, 3.31620986, 3.0538026 , 3.37982853,
3.17203268])
We have selected the numbers in easiness that are greater than 3.
We can use this same Boolean array to index into another array. For example, here we show the discipline names corresponding to the courses with Easiness scores greater than 3:
disciplines[greater_than_3]
array(['English', 'Mathematics', 'Psychology', 'History',
'Communications', 'Business'], dtype=object)
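The same pattern works for any criterion. As a sketch, here the arrays are typed in by hand from the values shown on this page, and we select the disciplines with an Overall Quality score above 3.75. The threshold 3.75 and the names high_quality and high_quality_names are assumptions for illustration.

```python
import numpy as np

# Values copied by hand from the table on this page.
disciplines = np.array(['English', 'Mathematics', 'Biology', 'Psychology',
                        'History', 'Chemistry', 'Communications', 'Business'])
quality = np.array([3.79136358, 3.56686702, 3.65764056, 3.9009491,
                    3.77374607, 3.46548462, 3.87501873, 3.6633317])

# Boolean array: True where the quality score is above the threshold.
high_quality = quality > 3.75

# Use the Boolean array to index into the array it was made from ...
print(quality[high_quality])
# ... and into another array with the same number of elements.
high_quality_names = disciplines[high_quality]
print(high_quality_names)

# The comparison can also go directly between the square brackets.
print(disciplines[quality > 3.75])
```

The Boolean array must have the same number of elements as the array it indexes, so that each True or False value lines up with one element.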
Just to confirm that these are the disciplines with “Easiness” ratings greater than 3, here is the table again. It has only eight rows, so asking for the first ten rows shows them all. These are the rows corresponding to the values in the arrays.
big_courses.head(10)
 | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
---|---|---|---|---|---|---|
0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
2 | Biology | 11774 | 3.608331 | 3.701530 | 3.657641 | 2.710459 |
3 | Psychology | 11179 | 3.909520 | 3.887536 | 3.900949 | 3.316210 |
4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
5 | Chemistry | 7346 | 3.387174 | 3.538980 | 3.465485 | 2.652054 |
6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |