Course: Math 446, Data Science with Python, Fall 2024
Prerequisite: MATH 408 and 1 from (MATH 225 or MATH 235) and 1 from (ITP 115 or ITP 116).
Course Content: Python implementations of: data collection, data wrangling, exploratory data analysis, dimensionality reduction, unsupervised / supervised learning, clustering, classification, common predictive statistical / machine learning algorithms, model validation.
Last update: May 30 2024
Instructor: Steven Heilman, stevenmheilman(@-symbol)gmail.com
Office Hours: Mondays 11AM-1PM, KAP 406G
Lecture Meeting Time/Location: Mondays, Wednesdays, and Fridays, 1:00PM-1:50PM, VHE 206
TA: Inga Girshfeld, girshfel(@-symbol)usc.edu
TA Office Hours: Held in the Math Center, with schedule provided at that link.
Textbook: There is no required textbook. The first course resource is a freely available book:
Python for Data Analysis, 3E, by Wes McKinney, available online at: this link.
Some other textbooks that might be helpful are:
A Hands-On Introduction to Data Science, by Chirag Shah
An Introduction to Statistical Learning, with Applications in Python by James, Witten, Hastie and Tibshirani. this link.
Project Abstract Due: Thursday, September 26, 4PM PST (via brightspace)
Exam 1: Wednesday, October 2, 1:00PM-1:50PM PST
Progress Report Due: Thursday, October 31, 4PM PST (via brightspace)
Exam 2: Friday, November 8, 1:00PM-1:50PM PST
Final Project Presentation Video: Last two or three weeks of class, schedule TBD
Final Report Due: Wednesday, December 18, 11AM PST (via brightspace)
Email Policy:
Exam Procedures: This course has two midterm exams and no final exam. The final project is a replacement for the final exam. Students must bring their USCID cards to the midterms. Phones must be turned off. Cheating on an exam results in a score of zero on that exam. Exams can be regraded at most 15 days after the date of the exam. This policy extends to homeworks as well. All students are expected to be familiar with the USC Student Handbook Policies on Plagiarism, Cheating, Academic Integrity, etc. (See also here.)
Accessibility Services: If you are registered with accessibility services, I would be happy to discuss this at the beginning of the course. Any student requesting accommodations based on a disability is required to register with Accessibility Services (OSAS) each semester. A letter of verification for approved accommodations can be obtained from OSAS. Please be sure the letter is delivered to me as early in the semester as possible. OSAS is located in 301 STU and is open 8:30am-5:00pm, Monday through Friday.
https://osas.usc.edu/
213-740-0776 (phone)
213-740-6948 (TDD only)
213-740-8216 (fax)
OSASFrontDesk@usc.edu
Discrimination, sexual assault, and harassment are not tolerated by the university. You are encouraged to report any incidents to the Office of Equity and Diversity http://equity.usc.edu/ or to the Department of Public Safety http://capsnet.usc.edu/department/department-public-safety/online-forms/contact-us. This is important for the safety of the whole USC community. Another member of the university community - such as a friend, classmate, advisor, or faculty member - can help initiate the report, or can initiate the report on behalf of another person. The Center for Women and Men http://www.usc.edu/student-affairs/cwm/ provides 24/7 confidential support, and the sexual assault resource center webpage sarc@usc.edu describes reporting options and other resources.
Exam Resources: Here is a practice exam with solution. Here is a page containing some practice exams for a course that is somewhat similar to our course.
Final Project Guidelines:
The final project is an opportunity to work with a data set of your choice, apply some of the techniques we have discussed in class, and perhaps learn some new things we did not cover in class. A project could begin with an interesting question or a well-known problem, and perhaps lead to investigating or implementing various algorithms, conducting an empirical analysis, etc.
Along the way, you will review relevant literature, identify appropriate data sources, select appropriate means of evaluation, and either develop novel methodology for your problem or deploy and comprehensively evaluate existing methodology for your new application.
The goal is to say something interesting about a problem in data science, broadly construed. You could perhaps develop new methodology for an existing problem or application that has no fully satisfactory solution. You could alternatively tackle a new problem or application with existing methodology; in this case, you should identify one or more questions without satisfactory answers in your chosen domain and explore how the methodology can help you answer those questions. You may draw inspiration from particular data sets, but your focus should rest not on the data itself but rather on the questions about the world that you can answer with that data.
While a substantial theoretical component is not required for this project, it could be beneficial if your project is supported by some theoretical results.
You may work alone or in a group of two; the standards for a group project will be twice as high. In certain cases I might approve a group of three, but this is unlikely.
We strongly encourage you to come to office hours to discuss your project ideas, progress, and difficulties.
Final Project Milestones: Submission format TBD, but will probably be LaTeX.
I: Project Proposal. By this first milestone, you should have selected a question or problem of interest, identified relevant data sources, begun exploring the literature surrounding the question, and discussed your ideas with the course staff. Your project proposal deliverable is a 1/2 - 1 page report (single spaced) describing the question or problem you intend to tackle, why this question is important or interesting, prior work on this problem, what data you intend to use in your analyses, and the principal challenges that you anticipate.
If you would like to receive feedback about particular aspects of your proposal, please indicate this in your submission.
I can try to help in problem selection. Ideally, the problem should be something you are very interested in. As such, it might be helpful to first tell me about your interests (maybe after class or in office hours), and we can try to think of something to work on. Selecting problems to work on is a difficult skill that takes years to develop, so it would be nice if you find a project idea on your own, but I expect everyone will need at least a little help in their choice. I know some things about some fields but I don't know everything about every field, so I might not be so helpful with certain projects outside my own background, but I can learn a bit myself to help you along if your interests are outside my knowledge.
II: Progress Report. By this second milestone, you should have some initial results to share; for example, you may have implemented and evaluated the performance of existing algorithms on your dataset and task of interest, or you may have conducted an initial study with simulated data to better understand the properties of certain methods, or you were able to prove some preliminary result about some question of interest, etc.
Your progress report deliverable is a write-up of no more than 2 pages (single spaced) (not including references) describing what you have accomplished so far and, briefly, what you intend to do in the remainder of the term. You should be able to reuse at least part of the text of this milestone in your final report.
III: Pre-recorded presentation. You will present your work in a pre-recorded video (it is easy to record a presentation using Zoom, but you don't have to use Zoom). Depending on enrollment numbers, we might watch videos in class. The length of the presentation will vary according to course enrollment, but each person should expect to speak for about 5-10 minutes. Since the talk will be short, you should consider practicing (and timing) your talk before recording it. You can practice part of it with me in office hours if you want. If we view the talks in class, expect around 3-5 minutes of questions from me and your fellow students. Unlike other times in the class, attendance is mandatory during the presentations, and I will be taking attendance during them. Once I set the time limits, they will be strict. Going over time will result in severe penalties.
You will be graded on your presentation skills, e.g. voice volume, screen/board usage, pacing of material, choice of material, etc. Minor technical problems will not be penalized, but major technical problems will be penalized. If you use a computer, make sure to have a backup plan in case of a technical problem to avoid such a penalty. Giving a short version of your presentation in office hours before the actual presentation, so that I can give you feedback, might be a good idea.
IV: Final Report. Your final project report (not including acknowledgements and references) should be around 5-8 pages in length (single spaced) (using at most 12 point font and maximum 1 inch margins) and should follow a typical scientific style (with abstract, introduction, etc.). The write-up should clearly define your problem or question of interest, review relevant past work, and introduce and detail your approach. A comprehensive empirical evaluation should follow, along with an interpretation of your results. Any elucidation of the theoretical properties of an empirical method under consideration is also welcome.
If this work was done in collaboration with someone outside of the class (e.g., a professor), please describe their contributions in an acknowledgements section.
The final report PDF file should be submitted on brightspace. No hardcopy is needed.
Some Project Ideas: (to browse your own ideas, you could e.g. look through the proceedings of recent conferences such as ICLR, ICML, NeurIPS, COLT, STOC, FOCS, etc.) (The resources below skew towards my own interests a bit, so don't take this list as suggestive of what you should or should not study.)
Reinforcement Learning
Large Language Models and Transformers
Adam Optimization Method
Embeddings and the "kernel trick"
Some LLM/Machine Learning Leaderboards and Datasets
HPC Resources:
If your project will need high performance computing resources, then I would encourage you to register (early in the semester) to use part of the HPC allocation for this course. Please email me about this if you are interested, and I can add you to the allocation. See also the USC HPC resource pages, such as this link and also this link.
Homework Policy:
Grading Policy:
Tentative Schedule: (This schedule may change slightly during the course.)
Week | Monday | Tuesday | Wednesday | Thursday | Friday |
1 | Aug 26: Intro to Jupyter notebook | Aug 27 | Aug 28: Review of Python | Aug 29 | Aug 30: Python data structures, functions |
2 | Sep 2: No class | Sep 3 | Sep 4: Intro to NumPy, Matplotlib | Sep 5: Homework 1 due | Sep 6: NumPy and floating-point arithmetic |
3 | Sep 9: NumPy | Sep 10 | Sep 11: Review of linear algebra | Sep 12: Homework 2 due | Sep 13: Review of linear algebra |
4 | Sep 16: Least squares minimization | Sep 17 | Sep 18: Singular value decomposition | Sep 19: Homework 3 due | Sep 20: Principal component analysis |
5 | Sep 23: Principal component analysis | Sep 24 | Sep 25: k-means clustering | Sep 26: Project proposal due | Sep 27: k-means clustering |
6 | Sep 30: Dimension reduction | Oct 1 | Oct 2: Exam 1 | Oct 3: No homework due | Oct 4: Intro to pandas |
7 | Oct 7: Data loading, storage, file formats | Oct 8 | Oct 9: pandas: Series, DataFrame | Oct 10: Homework 4 due. No class | Oct 11: No class |
8 | Oct 14: pandas: Series, DataFrame | Oct 15 | Oct 16: pandas: Reindexing, Deletion | Oct 17: Homework 5 due | Oct 18: pandas: Selection, Filtering |
9 | Oct 21: Data cleaning | Oct 22 | Oct 23: Data cleaning | Oct 24: Homework 6 due | Oct 25: Web scraping: Zillow |
10 | Oct 28: Web scraping: Google Finance, eBay | Oct 29 | Oct 30: Linear Regression | Oct 31: Progress report due | Nov 1: Logistic Regression |
11 | Nov 4: Multiclass classification: MNIST | Nov 5 | Nov 6: Multiclass classification: MNIST | Nov 7: No homework due | Nov 8: Exam 2 |
12 | Nov 11: No class | Nov 12 | Nov 13: Deep Learning, Keras | Nov 14: Homework 7 due | Nov 15: Deep Learning, Keras |
13 | Nov 18: Deep Learning, Keras | Nov 19 | Nov 20: Large Language Models | Nov 21: Homework 8 due | Nov 22: Large Language Models |
14 | Nov 25: Leeway | Nov 26 | Nov 27: No class | Nov 28: No class | Nov 29: No class |
15 | Dec 2: Student Presentations | Dec 3 | Dec 4: Student Presentations | Dec 5: Homework 9 due | Dec 6: Student Presentations [Dec 18: Final Report Due] |
Advice on succeeding in a math class:
Homework
Homework .tex files
Exam Solutions
Supplementary Notes