Project
Teams
- Teams of size 3 or 4.
- If you are from Maths, Physics, Computer Science or Engineering, you are a Theory Person.
- If you are from Geography, Biosciences, Sports / Exercise, Psychology, History, English, Music or Business, you are a Data Person.
- If you’re in neither group, ask me (Matthew).
- Each team must have at least one Theory person and one Data person.
- See below for data rules.
- When you have a team you’re happy with, choose one member to email me (Matthew), with the information below.
I will consider smaller team sizes, and even team sizes of 1, but your email should give convincing justification (see below). In particular, you need to convince me you have done the best you can to form a larger team.
Team request
Email me (Matthew) with your request to form a team.
One team member should email, with a Cc to the other members.
Your email should have:
- a list of the team members;
- note the “data” team member(s) and their subject(s). If you don’t have a “data” team member, give good grounds for not having one;
- note the “theory” team member(s) and their subject(s). If you don’t have one, give good grounds for not having one;
- point to the data you are going to use. You must point to some initial data. If the data is not public, give me some way to verify that it exists, and is suitable;
- sketch your plan for the initial analysis, and the final results that you hope for.
Data
Workload
This term is 10 credits of a 20-credit course. Each credit corresponds to 10 hours of work, of which one hour is the lecture. Unlike other courses, there is no exam to revise for, so the rest of the time is for the project.
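As a rough check on what this means for your planning, the arithmetic can be sketched as below. This is only an illustration: it assumes that the "one hour" of lecture applies to each credit, and that you spread the project work evenly over 10 weeks. Neither assumption is stated above, so adjust the numbers to your own situation.

```python
# Figures taken from the paragraph above.
credits_this_term = 10
hours_per_credit = 10

# Assumption: the one hour of lecture applies to each credit.
lecture_hours_per_credit = 1

total_hours = credits_this_term * hours_per_credit
lecture_hours = credits_this_term * lecture_hours_per_credit
project_hours = total_hours - lecture_hours

# Assumption: project work is spread evenly over 10 weeks.
working_weeks = 10
project_hours_per_week = project_hours / working_weeks

print("Total hours:", total_hours)
print("Project hours:", project_hours)
print("Project hours per week:", project_hours_per_week)
```

Under these assumptions, the project takes up most of the hours for the term, which is why steady work matters.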
You will find that, at the beginning of the project, this amount of work can seem daunting. Please don’t worry; if you work steadily, you will find things fall into place. On the other hand, you must plan to work steadily.
This is a write-up of a previous data science course we ran: (Millman, Brett, Barnowski, & Poline, 2018).
In answer to “What advice would you give to another student who is considering taking this course?”:
Unlike most group projects (which last for maybe a few weeks tops or could conceivably be pulled off by one very dedicated person), this one will dominate the entire semester. . . . Try to stay organized for the project and create lots of little goals and checkpoints. You should always be working on something for the project, whether that’s coding, reviewing, writing, etc. Ask lots of questions and ask them early!
Getting help
We (your instructors) are very happy to help with advice on your project. We can’t write any project code for you, but we will give you advice. Please do not wait to ask us; if you are stuck, we need to know as soon as possible.
Scope
We only expect you to use the techniques that we have shown you in the lectures. You should not use any techniques that you do not understand. We would far prefer that you do simple, clear analyses using basic techniques than complex analyses that you do not fully understand. Your job as a data scientist is to draw clear conclusions from data. Often this will just involve selecting and plotting relevant data, and making an argument from the results.
Suggested structure
See the rubric for the requirements of your project files. This specifies that you must have a README file in some format, as the main instructions to reproduce your analysis.
We also recommend that you:
- Download the data you are working on, and save it with your project files. Leave instructions on how we (your instructors) can get the original data that you downloaded.
- Consider using a setup file, such as a Notebook, that runs once, to set up the project. For example, it might install any libraries that you will use. It may download the data from a URL and save it to your project directory.
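Here is a minimal sketch of the download step of such a setup file. It is one possible approach, not a required one. The names `DATA_URL`, `DATA_DIR` and `fetch_data` are hypothetical, and the URL is a placeholder, so replace them with values for your own project. The sketch uses only the Python standard library.

```python
from pathlib import Path
from urllib.request import urlopen

# Hypothetical values: replace with your own data source and directory.
DATA_URL = "https://example.com/my_data.csv"
DATA_DIR = Path("data")


def fetch_data(url, out_path):
    """Download `url` to `out_path`, unless the file is already there."""
    out_path = Path(out_path)
    if out_path.exists():
        # Keep the saved copy, so rerunning setup does not change the data.
        return out_path
    # Create the data directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        out_path.write_bytes(response.read())
    return out_path
```

In a Notebook cell you might then call `fetch_data(DATA_URL, DATA_DIR / "my_data.csv")`. Skipping the download when the file already exists means the saved copy stays the one your analysis uses, which helps with reproducibility.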
Process
In summary:
- Analysis and collaboration will be public (using the CoCalc or GitHub service).
- Analysis should be reproducible.
- Final report should be in the form of a Notebook, or similar executable document.
See: project workflow
Plagiarism
See: plagiarism rules
Using Python libraries
You can use any part of the NumPy, Pandas, and Matplotlib libraries with no further explanation.
If you use other libraries, you should explain in your write-up why you are using the library, rather than building the analysis yourself. You must persuade us, in your write-up, that you fully understand the parts of the library that you are using. If in doubt, speak to me (Matthew), or one of the TAs.
Marking
In summary:
- Project marked for clarity, depth, validity and reproducibility (72% of project grade);
- Each member of the project submits a summary of their own contribution, with evidence from the public record of collaboration (28% of project grade).
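To see how the two weights might combine, here is a small sketch. It assumes the project grade is a simple weighted sum of the two components, each marked out of 100. That formula is an assumption, and the marks in the example are made up for illustration. The rubric has the actual rules.

```python
# Weights from the summary above.
TEAM_WEIGHT = 0.72
INDIVIDUAL_WEIGHT = 0.28


def project_grade(team_mark, individual_mark):
    """Weighted sum of two marks out of 100 (an assumed formula)."""
    return TEAM_WEIGHT * team_mark + INDIVIDUAL_WEIGHT * individual_mark


# Hypothetical marks, for illustration only.
example = project_grade(team_mark=65, individual_mark=70)
print(round(example, 1))
```

Under this assumption, two members of the same team can end up with different project grades, depending on their individual contribution summaries.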
Dates
Week | Date | Class |
---|---|---|
1 | 18 Jan | Start to form teams |
2 | 25 Jan | Data pitches 1 |
3 | 1 Feb | Data pitches 2 |
4 | 8 Feb | Teams finalized |
5 | 15 Feb | Workflow |
6 | 22 Feb | |
7 | 1 Mar | |
8 | 8 Mar | Progress reports |
9 | 15 Mar | |
10 | 22 Mar | Presentations |
11 | 29 Mar | |
11 | 30 Mar | Projects due |
Progress report
See: progress report