1. Coding for data

About this textbook

This is the textbook for a course teaching data science: coding for data.

We go into some detail about what we mean by “data science” in the next section, but here is the one-line summary:

Data science is an approach to data analysis with a foundation in code and algorithms.

The textbook aims to teach you this approach.

We designed the textbook so you can learn from it without doing the course. If you read this textbook carefully, and do the exercises, you will have a solid foundation for learning more about data science.

The background you need

You do not need any previous experience in programming to use this book. We aim to teach you the programming you need as we go.

The book deliberately uses very little formal mathematics. Instead, we show how the procedures work with code. We hope you agree that this can be much easier to understand, especially for those without much background in mathematics.

A summary of the book

We start by asking some questions about the real world that force us to think about the effects of chance. Then we use some simple code to ask the computer to simulate the effects of chance. We find that this allows us to draw important conclusions about events in the real world.
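To give a flavor of what "asking the computer to simulate the effects of chance" can look like, here is a minimal sketch. It is not taken from a later chapter: the question (how often do 10 tosses of a fair coin give 8 or more heads?) and the function names `simulate_heads` and `proportion_at_least` are assumptions chosen for illustration, and the sketch uses only Python's standard `random` module.

```python
import random


def simulate_heads(n_tosses, rng):
    """Return the number of heads in n_tosses tosses of a fair coin."""
    # rng.random() gives a number in [0, 1); below 0.5 counts as heads.
    return sum(rng.random() < 0.5 for _ in range(n_tosses))


def proportion_at_least(n_heads, n_tosses, n_trials, seed=0):
    """Estimate the chance of n_heads or more heads in n_tosses tosses.

    We repeat the experiment n_trials times and report the proportion
    of trials in which we saw at least n_heads heads.
    """
    rng = random.Random(seed)  # Seeded so the run is repeatable.
    count = 0
    for _ in range(n_trials):
        if simulate_heads(n_tosses, rng) >= n_heads:
            count += 1
    return count / n_trials


# Hypothetical question: how often do 10 tosses give 8 or more heads?
proportion = proportion_at_least(8, 10, 10_000)
print(proportion)
```

The printed proportion is an estimate of how surprising 8 or more heads would be if the coin were fair. Running more trials gives a steadier estimate.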

We continue with this approach throughout the textbook. First we look at some data from the real world, then we think about the procedures we would need to draw conclusions from these data. Next we go through the code that you need. Finally we implement the procedures in code and draw our conclusions.

By the end of the course, we will have covered simulation of the real world using random numbers, drawing conclusions about differences between groups using random permutation, looking for straight line relationships using regression, and predicting categories of things using classification with machine learning.
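As one illustration of the topics listed above, here is a rough sketch of the idea behind random permutation. It is an assumption about how such a procedure might be written, not the book's own code. The function name `permutation_p_value` and the two small groups of numbers are made up for this example. The idea: if group labels do not matter, shuffling the pooled values should often give a difference in means as large as the one we observed.

```python
import random


def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Proportion of shuffles with a mean difference at least as large
    (in absolute value) as the observed difference between the groups."""
    rng = random.Random(seed)  # Seeded so the run is repeatable.
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # Randomly reassign values to the groups.
        fake_difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if fake_difference >= observed:
            count += 1
    return count / n_permutations


# Made-up toy values, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [101, 102, 103, 104, 105]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

A small proportion suggests that a difference this large would be unusual if the group labels were assigned by chance.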

Based on the Berkeley textbook

This textbook draws heavily on the excellent Computational and Inferential Thinking textbook for the data science course at the University of California, Berkeley: The foundations of data science. Many thanks to the main authors, Ani Adhikari and John DeNero, as well as David Wagner, who wrote the beautiful chapter on machine learning and classification.

Many sections are interactive

You can interact with many of the sections in this textbook using the “Interact” button at the top of the page. This will take you to a free online service that allows you to execute the code in the section and generate its tables and figures. We encourage you to play with these interactive sections by changing the code and running it. These sections also have a “Download notebook” button. If you click on this button, you will get a copy of the interactive document on your own computer, in the form of a Jupyter notebook. See the tools page for the free software you will need.