Project

Teams

  • Teams of size 3 or 4.
  • If you are from Maths, Physics, Computer Science or Engineering, you are a Theory Person.
  • If you are from Geography, Biosciences, Sports / Exercise, Psychology, History, English, Music or Business, you are a Data Person.
  • If you’re in neither group, ask me (Matthew).
  • Each team must have at least one Theory person and one Data person.
  • See below for data rules.
  • When you have a team you’re happy with, choose one member to email me (Matthew), with the information below.

I will consider smaller team sizes, and even team sizes of 1, but your email should give convincing justification (see below). In particular, you need to convince me you have done the best you can to form a larger team.

Team request

Email me (Matthew) with your request to form a team.

One team member should email, with a Cc to the other members.

Your email should have:

  • a list of the team members;
  • a note of the “data” team member(s) and their subject(s). If you don’t have a “data” team member, give good grounds for not having one;
  • a note of the “theory” team member(s) and their subject(s). If you don’t have a “theory” team member, give good grounds for not having one;
  • a pointer to the data you are going to use. You must point to some initial data. If the data is not public, give me some way to verify that it exists, and is suitable;
  • a sketch of your plan for the initial analysis, and of the final results that you hope for.

Data

See data for projects

Workload

This term is 10 credits of a 20-credit course. Each credit corresponds to 10 hours of work, of which one hour is the lecture. Unlike other courses, there is no exam to revise for, so the rest of the time is for the project.

You will find that, at the beginning of the project, this amount of work can seem daunting. Please don’t worry; if you work steadily, you will find things fall into place. On the other hand, you must plan to work steadily.

This is a write-up of a previous data science course we ran: (Millman, Brett, Barnowski, & Poline, 2018).

In answer to “What advice would you give to another student who is considering taking this course?”:

Unlike most group projects (which last for maybe a few weeks tops or could conceivably be pulled off by one very dedicated person), this one will dominate the entire semester. . . . Try to stay organized for the project and create lots of little goals and checkpoints. You should always be working on something for the project, whether that’s coding, reviewing, writing, etc. Ask lots of questions and ask them early!

Getting help

We (your instructors) are very happy to help with advice on your project. We can’t write any project code for you, but we will give you advice. Please do not wait to ask us; if you are stuck, we need to know as soon as possible.

Scope

We only expect you to use the techniques that we have shown you in the lectures. You should not use any techniques that you do not understand. We would far prefer that you do simple, clear analyses using basic techniques than complex analyses that you do not fully understand. Your job as a data scientist is to draw clear conclusions from data. Often this will just involve selecting and plotting relevant data, and making an argument from the results.

Suggested structure

See the rubric for the requirements of your project files. This specifies that you must have a README file in some format, as the main instructions to reproduce your analysis.

We also recommend that you:

  • Download the data you are working on, and save it with your project files. Leave instructions on how we (your instructors) can get the original data that you downloaded.
  • Consider using a setup file, such as a Notebook, that runs once, to set up the project. For example, it might install any libraries that you will use. It may download the data from a URL and save it to your project directory.
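
As an illustration of the setup step described above, here is a minimal sketch of a download helper that you might put in a setup Notebook. The function name `fetch_data`, the example URL, and the `data/` directory are hypothetical choices for illustration, not course requirements. The helper skips the download if the file is already there, so it is safe to re-run.

```python
from pathlib import Path
from urllib.request import urlopen
import shutil


def fetch_data(url, out_path):
    """Download `url` to `out_path`, unless `out_path` already exists.

    Returns the path to the saved file as a `Path` object.
    """
    out_path = Path(out_path)
    if out_path.exists():
        # Already downloaded; keep the saved copy so results stay stable.
        return out_path
    # Create the containing directory (e.g. "data/") if it is missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk rather than reading it all into memory.
    with urlopen(url) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return out_path


# Example use in a setup Notebook (hypothetical URL and filename):
# fetch_data("https://example.org/my_dataset.csv", "data/my_dataset.csv")
```

If you use something like this, record the original URL in your README as well, so that we can get the original data ourselves.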

Process

In summary:

  • Analysis and collaboration will be public (using the CoCalc or GitHub service).
  • Analysis should be reproducible.
  • Final report should be in the form of a Notebook, or similar executable document.

See: project workflow

Plagiarism

See: plagiarism rules

Using Python libraries

You can use any part of the NumPy, Pandas, and Matplotlib libraries with no further explanation.

If you use other libraries, you should explain in your write-up why you are using the library, rather than building the analysis yourself. You must persuade us, in your write-up, that you fully understand the parts of the library that you are using. If in doubt, speak to me (Matthew), or one of the TAs.

Marking

See the full grading rubric.

In summary:

  • Project marked for clarity, depth, validity and reproducibility (72% of project grade);
  • Each member of the project submits a summary of their own contribution, with evidence from the public record of collaboration (28% of project grade).

Dates

Week  Date    Class
1     18 Jan  Start to form teams
2     25 Jan  Data pitches 1
3     1 Feb   Data pitches 2
4     8 Feb   Teams finalized
5     15 Feb  Workflow
6     22 Feb
7     1 Mar
8     8 Mar   Progress reports
9     15 Mar
10    22 Mar  Presentations
11    29 Mar
11    30 Mar  Projects due

Progress report

See: progress report

Presentations

See: project presentations