Assessment
Grading Breakdown
| Component | Points | Details |
|---|---|---|
| Class participation | 5 | Engagement in class discussions |
| Online polls | 5 | Participation in online polls |
| Quiz 1 | 15 | In-class quiz (Oct 23) |
| Quiz 2 | 15 | In-class quiz (Dec 4) |
| Data challenge | 30 | Oct 23 - Nov 6 (ends 18:00) |
| Final project | 30 | Presentation on Dec 11 |
| Extra credits | 12 max | RouterSense 60-day participation |
Grades will be assigned based on this NYU policy.
Class participation
Active participation in class is highly encouraged and will be assessed based on the quality of your engagement. This includes asking insightful questions that demonstrate your understanding of the material, as well as providing thoughtful answers during discussions. When you contribute meaningfully, whether by posing a good question or offering a helpful response, please record your participation immediately using the provided Google Form. Danny will typically acknowledge strong contributions in class by saying "Good question" or "Good response." Consistent, high-quality participation will positively impact your grade and help foster a collaborative learning environment. Each record in the form, once verified, earns you one point (maximum one point per class).
Online polls
From time to time, Danny will conduct Mentimeter polls during class. To receive credit for online poll participation, make sure to record your NetID when prompted in these polls. Your responses—regardless of which options or choices you select—will count toward participation; correctness is not checked. The key requirement is to participate in these synchronous online polls whenever they are conducted.
Midterm quizzes
Each quiz will be administered on GradeScope. You may use your computer to access resources such as Google and LLMs, but you must not communicate with other humans. Each quiz consists of 30 multiple-choice questions (shuffled) to be completed in 30 minutes. Danny will present a scenario on the projector screen, and the questions will relate to that scenario.
In addition to the scenarios Danny will present on the board or screen during the quiz, you should also be familiar with scenarios and examples discussed in class. These will be clear and straightforward—nothing obscure or tricky. If you pay attention in class and review the recordings, you’ll be well prepared. You don’t need to memorize every detail, but make sure you understand the big picture and main ideas behind each scenario. The quizzes are designed to reinforce your understanding of real examples covered in class, so following along and grasping the context is key. Remember, the goal is to ensure you can follow and apply the concepts discussed in class, which an AI or large language model won’t have access to.
There will be two quizzes:
Oct 23: Quiz 1 will cover the fundamentals of big data analytics.
Dec 4: Quiz 2 will cover advanced topics.
Quiz logistics and recommendations: I strongly recommend taking the quiz in person during class. Many questions may be phrased ambiguously, and being present allows you to ask for clarifications on the spot—sometimes I’ll even provide hints in response to good questions. While you can participate via Zoom, my responses to clarification requests may be delayed since I prioritize in-class questions. You must take the quiz synchronously at the scheduled time, whether in person or on Zoom. If you miss the quiz, you cannot make it up, as I will explain the answers immediately after. However, if you have a legitimate reason for missing the quiz, I can offer a brief (about 10 minutes) phone interview at the end of the semester, where you’ll answer questions in an interview format to make up the grade.
Example scenario: Suppose you have log files for website visits and want to analyze the data to understand who visits your site and when.
Sample questions:
- I want to look up an IP address really fast. Which SQL operation should I use on the `ip_address` column?
  - `GROUP BY`
  - `JOIN`
  - `CREATE INDEX` (correct answer)
  - None of the above
- I have another dataset mapping IP addresses to countries. How should I merge my current dataset with this one?
  - `GROUP BY`
  - `JOIN` (correct answer)
  - `CREATE INDEX`
  - None of the above
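To make the two sample answers concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Everything except the `ip_address` column is an assumption made up for illustration: the table names (`visits`, `ip_country`), the `visited_at` and `country` columns, the index name, and the sample rows. It is not the dataset used in the quiz.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: one for visit logs, one mapping IPs to countries.
cur.execute("CREATE TABLE visits (ip_address TEXT, visited_at TEXT)")
cur.execute("CREATE TABLE ip_country (ip_address TEXT, country TEXT)")
cur.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("203.0.113.5", "2024-10-01 09:00:00"),
    ("198.51.100.7", "2024-10-01 09:05:00"),
    ("203.0.113.5", "2024-10-02 14:30:00"),
])
cur.executemany("INSERT INTO ip_country VALUES (?, ?)", [
    ("203.0.113.5", "US"),
    ("198.51.100.7", "DE"),
])

# Question 1: an index on ip_address lets a lookup avoid a full table scan.
cur.execute("CREATE INDEX idx_visits_ip ON visits (ip_address)")
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM visits WHERE ip_address = ?",
    ("203.0.113.5",),
).fetchall()
# The last field of each plan row is a human-readable description.
uses_index = any("idx_visits_ip" in row[-1] for row in plan)

# Question 2: JOIN merges the visit log with the IP-to-country mapping.
visits_by_country = cur.execute(
    """
    SELECT c.country, COUNT(*) AS n
    FROM visits AS v
    JOIN ip_country AS c ON v.ip_address = c.ip_address
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
conn.close()

print(uses_index, visits_by_country)
```

Note that `GROUP BY` shows up in the second query only to summarize the merged rows; the merging itself is done by `JOIN`.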
Big Data Challenge
The Big Data Challenge is a two-week competition starting Nov 6 and ending at 18:00 on Dec 4. Throughout this period, you’ll tackle a series of tasks available on the course portal. For each challenge, you’ll submit Python code directly to the portal, where it will be executed in the cloud.
You’ll be evaluated on two main criteria:
- Accuracy: Measures the correctness of your results. This score is not curved.
- Performance: Assesses how efficiently your code loads, reads, writes, and analyzes datasets. Performance scores will be curved, but the curving will be relaxed.
You’ll have unlimited opportunities to modify and resubmit your code. Since resubmissions may be queued—especially near the deadline—submit early to avoid delays. A sample dataset will be provided for local testing, but final evaluation will use a larger, unseen dataset on the portal. Optimize your code for scalability and generalization.
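As one illustration of what "optimize for scalability" can mean, the sketch below counts visits per IP address by streaming a CSV file row by row instead of loading it all into memory. The column names and the sample data are hypothetical, and the actual challenge tasks and file formats on the portal may look quite different.

```python
import csv
import io
from collections import Counter

def count_visits_per_ip(lines):
    """Stream CSV rows and count visits per IP without holding the file in memory."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["ip_address"]] += 1  # hypothetical column name
    return counts

# Tiny in-memory stand-in for a much larger log file.
sample = io.StringIO(
    "ip_address,visited_at\n"
    "203.0.113.5,2024-10-01 09:00:00\n"
    "198.51.100.7,2024-10-01 09:05:00\n"
    "203.0.113.5,2024-10-02 14:30:00\n"
)
top = count_visits_per_ip(sample).most_common(1)
print(top)
```

For a real file, you would pass an open file handle, e.g. `open(path, newline="")`, in place of `sample`. Memory use then grows with the number of distinct IP addresses, not with the number of rows, which is the kind of property that helps code survive the jump from the sample dataset to the larger one.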
If you use large-language models (LLMs) at any stage, be especially critical of their outputs and ensure your solutions are robust.
You may work solo or in groups of up to four people; four is a strict maximum. All group members receive the same grade.
Final Project
Timeline
Oct 16: Start forming groups. Before the start of class, each group should briefly write about its project ideas in the public channel of the class Slack. Please use this invite link to join the class Slack. Danny will go over preliminary ideas from these groups in class.
Oct 23: Groups should be confirmed by this date. Please enter your group info and proposed topic in this spreadsheet.
Oct 30: Each project group should briefly present their plans and timeline at the beginning of class.
Nov 6: Each project group should briefly present their progress at the beginning of class.
Dec 4: Each project group should briefly present their progress at the beginning of class.
Dec 11: Final project presentation.
Final Project Presentation
Students are expected to present their final projects in person on Dec 11. If special circumstances arise, presenting over Zoom is acceptable—just let Danny know in advance. In general, presenting in person is much easier and more effective for telling your story.
If you choose to present over Zoom, make sure your setup keeps you as the focus of the presentation, not your slides. Position yourself as the center of the screen, with slides visible only as a small prop (e.g., in a corner). This applies to all presentations: whether in person or on Zoom, the speaker should be the main attention, and slides should support your story, not dominate it.
Each group will have about 10 minutes to present, depending on the number of groups. Presentations are meant to be interactive, with questions throughout—expect interruptions from Danny and classmates. Typical questions include:
- What are the major challenges in managing your data?
- What makes your project “big data”?
- What is the technical or social significance of your project?
Danny will evaluate presentations on the spot. At the end, send Danny a link to your presentation slides—no write-up is needed. The presentation itself is your final deliverable.
Get into teams of at most 4 people. If you want to do the project alone, that's fine with me. The number of people on the team should reflect the challenges in the project.
Criteria for grading
Although the interim progress reports are not graded, please treat them seriously because Danny will be providing feedback.
Danny will be grading the final presentation based on the following criteria:
- Storytelling (20 points):
  - Explain the problem/motivation (very important)
  - Compare with existing work
  - Method, results, limitations, implications
- Technical contributions (10 points):
  - Obtaining, pre-processing, analyzing, evaluating, and visualizing data
Presentation Requirements
- Total time: Each group has approximately 10 minutes; expect interruptions for questions.
- Slides: Keep them simple.
- Delivery: Do not read directly from your slides; otherwise, Danny will interrupt.
- Evaluation criteria: Make sure you address all bullet points in the evaluation criteria. For example, be prepared to answer questions like:
  - How big is your dataset?
  - What challenges did you face in obtaining, pre-processing, or analyzing the large dataset?
- Objectivity: Be upfront about limitations; do not oversell your work. This is a science class—be objective, not promotional.
The presentation and its slides are all I ask for as the deliverable. There is no need to submit a written report. That being said, for projects that I identify as super interesting, I am more than happy to work with you separately on a blog post; I'll ask the NYU Tandon Media team to promote it in the official channels.
Potential Topics
I do not want students to simply take existing datasets (e.g., from Kaggle or public APIs) and do basic analysis or apply standard machine learning models. That's boring. The most important aspect of your project should be obtaining an interesting, large-scale dataset yourself. Projects using easily accessible or small datasets are not acceptable unless you can demonstrate that you obtained the data yourself.
A recommended direction is personal health analytics—collecting detailed data about yourself, such as web browsing history, phone usage, accelerometer readings, location data, or other sensor data over at least one week. This is interesting because it involves real challenges: continuous data collection, privacy, battery consumption, and handling noisy data (e.g., GPS inaccuracies). Think about how to cross-validate between different sources (GPS, accelerometer, network traffic) and what meaningful inferences you can draw.
There are tons of challenges not only in obtaining this data, but also in making sure it's not noisy. We're talking about potentially per-second sensor data, which can be a lot to deal with. Consider how to handle large volumes, clean the data, and ask interesting questions.
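To give a flavor of what cleaning noisy sensor data can look like, here is a minimal sketch that drops GPS fixes implying an impossible speed relative to the last accepted fix. The `(unix_seconds, lat, lon)` tuple format, the function names, and the 70 m/s threshold are all assumptions chosen for illustration, not a prescribed method. Real projects will likely need something more careful, since this approach trusts the first fix blindly.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def drop_implausible_fixes(fixes, max_speed_mps=70.0):
    """Keep fixes whose implied speed from the last kept fix is plausible.

    fixes: list of (unix_seconds, lat, lon) tuples, sorted by time.
    max_speed_mps: hypothetical cutoff; tune it to how you actually travel.
    """
    kept = []
    for t, lat, lon in fixes:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt = t - t0
            if dt <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_m(lat0, lon0, lat, lon) / dt > max_speed_mps:
                continue  # jump too large to be real movement
        kept.append((t, lat, lon))
    return kept

fixes = [
    (0, 40.6942, -73.9866),
    (1, 40.6943, -73.9866),  # roughly 11 m from the previous fix
    (2, 40.7942, -73.9866),  # roughly 11 km jump in one second
    (3, 40.6944, -73.9866),
]
cleaned = drop_implausible_fixes(fixes)
print(len(cleaned))
```

With per-second data, a single pass like this stays cheap, but you should still check how many points it removes before trusting anything downstream.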
You are encouraged to use multiple data sources and combine them for richer analysis. For example, use the iOS App Privacy Report, RouterSense for network-level data, Apple Watch health metrics, or apps that record sound, location, or other sensor data. Projects should address privacy and informed consent, especially if collecting sensitive data.
If you choose personal health analytics, aim for at least one week of continuous data collection and combine RouterSense data with other sources for deeper insights. Always combine RouterSense with something else—Apple Watch Health data, sleep data, or other continuous sensor data. You are not limited to this topic—propose your own ideas, but ensure your dataset is substantial and not trivially obtained. If your idea is too simple or relies on readily available data, it will be rejected.
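Combining sources usually starts with putting them on a common time axis. The sketch below averages two streams into one-minute buckets and keeps only the minutes present in both. The `(unix_seconds, value)` sample format, the function names, and the example numbers are hypothetical; the actual export formats of RouterSense, Apple Health, or any other tool are not described here and will need their own parsing.

```python
from collections import defaultdict
from statistics import mean

def bucket_mean(samples, bucket_s=60):
    """Average (unix_seconds, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for t, value in samples:
        buckets[t - t % bucket_s].append(value)
    return {start: mean(vals) for start, vals in buckets.items()}

def align(source_a, source_b, bucket_s=60):
    """Return {bucket_start: (mean_a, mean_b)} for buckets present in both sources."""
    a = bucket_mean(source_a, bucket_s)
    b = bucket_mean(source_b, bucket_s)
    return {start: (a[start], b[start]) for start in sorted(a.keys() & b.keys())}

# Made-up samples standing in for two independently collected streams.
heart_rate = [(0, 60), (30, 62), (60, 90), (90, 94), (120, 70)]
network_bytes = [(10, 1000), (50, 3000), (70, 500), (200, 100)]

aligned = align(heart_rate, network_bytes)
print(aligned)
```

Buckets that appear in only one source are silently dropped here, so it is worth counting them: large gaps in one stream are themselves a finding about your data collection.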
Some example apps for continuous data collection include:
- RouterSense (network activity)
- iOS App Privacy Report
- Apple Health / Apple Watch
- Sensor logging apps (accelerometer, GPS, sound recording). Examples:
  - Gaia GPS: iOS and Android
  - Sensor Logger: iOS and Android; read about it on its website
The key is to obtain and analyze large, challenging datasets, ideally from multiple sources, and to ask interesting questions that go beyond basic analysis. Think about the challenges of continuous data collection, privacy, battery consumption, and noise. What kind of inferences can you get out of this analysis? How do you cross-validate between different data sources? These are the kinds of projects I want to see.