Assessment


Grading Breakdown

Component             Points    Details
Class participation   5         Engagement in class discussions
Online polls          5         Participation in online polls
Quiz 1                15        In class quiz (Oct 23)
Quiz 2                15        In class quiz (Dec 4)
Data challenge        30        Oct 23 - Nov 6 (ends 18:00)
Final project         30        Presentation on Dec 11
Extra credits         12 max    RouterSense 60-day participation

Grades will be assigned based on this NYU policy.


Class participation

Active participation in class is highly encouraged and will be assessed based on the quality of your engagement. This includes asking insightful questions that demonstrate your understanding of the material, as well as providing thoughtful answers during discussions. When you contribute meaningfully, whether by posing a good question or offering a helpful response, please record your participation immediately using the provided Google Form. Danny will typically acknowledge strong contributions in class by saying "Good question" or "Good response." Consistent, high-quality participation will positively impact your grade and help foster a collaborative learning environment. Each record in the form, once verified, earns you one point (maximum one point per class).


Online polls

From time to time, Danny will conduct Mentimeter polls during class. To receive credit for online poll participation, make sure to record your NetID when prompted in these polls. Your responses—regardless of which options or choices you select—will count toward participation; correctness is not checked. The key requirement is to participate in these synchronous online polls whenever they are conducted.


Midterm quizzes

Each quiz will be administered on GradeScope and consists of 30 multiple-choice questions (shuffled) to be completed in 30 minutes. You may use your computer to access resources such as Google and LLMs, but you must not communicate with other humans. Danny will present a scenario on the projector screen, and the questions will relate to that scenario.

In addition to the scenarios Danny will present on the board or screen during the quiz, you should also be familiar with scenarios and examples discussed in class. These will be clear and straightforward—nothing obscure or tricky. If you pay attention in class and review the recordings, you’ll be well prepared. You don’t need to memorize every detail, but make sure you understand the big picture and main ideas behind each scenario. The quizzes are designed to reinforce your understanding of real examples covered in class, so following along and grasping the context is key. Remember, the goal is to ensure you can follow and apply the concepts discussed in class, which an AI or large language model won’t have access to.

There will be two quizzes:

Oct 23: Quiz 1 will cover the fundamentals of big data analytics.

Dec 4: Quiz 2 will cover advanced topics.

Quiz logistics and recommendations: I strongly recommend taking the quiz in person during class. Many questions may be phrased ambiguously, and being present allows you to ask for clarifications on the spot—sometimes I’ll even provide hints in response to good questions. While you can participate via Zoom, my responses to clarification requests may be delayed since I prioritize in-class questions. You must take the quiz synchronously at the scheduled time, whether in person or on Zoom. If you miss the quiz, you cannot make it up, as I will explain the answers immediately after. However, if you have a legitimate reason for missing the quiz, I can offer a brief (about 10 minutes) phone interview at the end of the semester, where you’ll answer questions in an interview format to make up the grade.

Example scenario: Suppose you have log files for website visits and want to analyze the data to understand who visits your site and when.
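As a rough illustration of how one might start on a scenario like this, the Python sketch below counts visits per client and per hour of day. It is not taken from the quiz. It assumes the logs follow the Common Log Format, the sample lines and addresses are made up, and `summarize_visits` is a hypothetical helper name.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: each line follows the Common Log Format; real logs may differ.
LOG_PATTERN = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')


def summarize_visits(lines):
    """Count visits per client IP ("who") and per hour of day ("when")."""
    visits_by_ip = Counter()
    visits_by_hour = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        visits_by_ip[match.group("ip")] += 1
        visits_by_hour[when.hour] += 1
    return visits_by_ip, visits_by_hour


# Made-up sample lines using documentation-only IP addresses.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2024:09:15:02 -0500] "GET / HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2024:09:47:30 -0500] "GET /about HTTP/1.1" 200 1024',
    '198.51.100.7 - - [01/Jan/2024:21:03:11 -0500] "GET / HTTP/1.1" 200 512',
    'not a log line',
]

by_ip, by_hour = summarize_visits(SAMPLE_LINES)
print(by_ip.most_common())
print(sorted(by_hour.items()))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a large log never has to be loaded into memory at once.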

Sample questions:


Big Data Challenge

The Big Data Challenge is a two-week competition starting Oct 23 and ending at 18:00 on Nov 6. Throughout this period, you'll tackle a series of tasks available on the course portal. For each challenge, you'll submit Python code directly to the portal, where it will be executed in the cloud.

You’ll be evaluated on two main criteria:

You’ll have unlimited opportunities to modify and resubmit your code. Since resubmissions may be queued—especially near the deadline—submit early to avoid delays. A sample dataset will be provided for local testing, but final evaluation will use a larger, unseen dataset on the portal. Optimize your code for scalability and generalization.
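To make the scalability and generalization advice concrete, here is a minimal Python sketch of one common approach: streaming over the rows once and keeping only small running totals, so memory use does not grow with the size of the dataset. The column names (`user`, `bytes`), the sample data, and the helper `mean_by_key` are all hypothetical; the actual challenge tasks and data formats are defined on the course portal.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(rows, key_field, value_field):
    """Stream rows once, keeping only a running sum and count per key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        try:
            value = float(row[value_field])
        except (KeyError, TypeError, ValueError):
            continue  # tolerate malformed rows that unseen data may contain
        key = row.get(key_field)
        if key is None:
            continue
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample data; the last row is deliberately malformed.
SAMPLE_CSV = "user,bytes\nalice,100\nbob,50\nalice,300\nbob,not_a_number\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
print(mean_by_key(reader, "user", "bytes"))
```

Skipping malformed rows instead of crashing is one way to prepare for a larger dataset that may contain cases absent from the local sample. Whether skipping is the right behavior depends on the task.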

If you use large-language models (LLMs) at any stage, be especially critical of their outputs and ensure your solutions are robust.

You may work solo or in groups of up to four people; the four-person maximum is strict. All group members receive the same grade.


Final Project

Timeline

Oct 16: Start forming groups. Before the start of class, each group should briefly describe its project ideas in the public channel of the class Slack. Please use this invite link to join the class Slack. Danny will go over preliminary ideas from these groups in class.

Oct 23: Groups should be confirmed by this date. Please enter your group info and proposed topic on this spreadsheet.

Oct 30: Each project group should briefly present their plans and timeline at the beginning of class.

Nov 6: Each project group should briefly present their progress at the beginning of class.

Dec 4: Each project group should briefly present their progress at the beginning of class.

Dec 11: Final project presentation.

Final Project Presentation

Students are expected to present their final projects in person on Dec 11. If special circumstances arise, presenting over Zoom is acceptable—just let Danny know in advance. In general, presenting in person is much easier and more effective for telling your story.

If you choose to present over Zoom, make sure your setup keeps you as the focus of the presentation, not your slides. Position yourself at the center of the screen, with slides visible only as a small prop (e.g., in a corner). This applies to all presentations: whether in person or on Zoom, the speaker should be the main focus, and slides should support your story, not dominate it.

Each group will have about 10 minutes to present, depending on the number of groups. Presentations are meant to be interactive, with questions throughout—expect interruptions from Danny and classmates. Typical questions include:

Danny will evaluate presentations on the spot. At the end, send Danny a link to your presentation slides—no write-up is needed. The presentation itself is your final deliverable.

Get into teams of at most 4 people. If you want to do the project alone, that's fine with me. The number of people on the team should reflect the challenges in the project.

Criteria for grading

Although the interim progress reports are not graded, please treat them seriously because Danny will be providing feedback.

Danny will be grading the final presentation based on the following criteria:

Presentation Requirements

The presentation and its slides are the only deliverable I ask for. There is no need to submit a written report. That said, for projects that I identify as especially interesting, I am more than happy to work with you separately on a blog post; I'll ask the NYU Tandon Media team to promote it through the official channels.

Potential Topics

I do not want students to simply take existing datasets (e.g., from Kaggle or public APIs) and do basic analysis or apply standard machine learning models. That's boring. The most important aspect of your project should be obtaining an interesting, large-scale dataset yourself. Projects using easily accessible or small datasets are not acceptable unless you can demonstrate that you obtained the data yourself.

A recommended direction is personal health analytics—collecting detailed data about yourself, such as web browsing history, phone usage, accelerometer readings, location data, or other sensor data over at least one week. This is interesting because it involves real challenges: continuous data collection, privacy, battery consumption, and handling noisy data (e.g., GPS inaccuracies). Think about how to cross-validate between different sources (GPS, accelerometer, network traffic) and what meaningful inferences you can draw.

There are tons of challenges not only in obtaining this data, but also in making sure it's not noisy. We're talking about potentially per-second sensor data, which can be a lot to deal with. Consider how to handle large volumes, clean the data, and ask interesting questions.
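As one possible starting point for taming per-second sensor data, the Python sketch below collapses each minute of readings into a median (which suppresses isolated spikes such as a GPS jump) and then checks how often two sources agree on whether you were moving. Everything here is an assumption for illustration: the `(unix_second, value)` sample layout, the thresholds, the synthetic data, and the helper names `per_minute_median` and `agreement`.

```python
from collections import defaultdict
from statistics import median


def per_minute_median(samples):
    """Collapse (unix_second, value) samples into one median per minute."""
    buckets = defaultdict(list)
    for second, value in samples:
        buckets[second // 60].append(value)
    return {minute: median(values) for minute, values in buckets.items()}


def agreement(gps_speed, accel_magnitude, speed_threshold=0.5, accel_threshold=0.2):
    """Fraction of shared minutes where both sources agree on moving vs. still.

    The thresholds are hypothetical and would need tuning on real data.
    """
    gps = per_minute_median(gps_speed)
    accel = per_minute_median(accel_magnitude)
    shared = sorted(set(gps) & set(accel))
    if not shared:
        return None  # no overlapping minutes to compare
    matches = sum(
        (gps[m] > speed_threshold) == (accel[m] > accel_threshold) for m in shared
    )
    return matches / len(shared)


# Synthetic data: one still minute with a single GPS spike, then one moving minute.
gps_speed = [(s, 30.0 if s == 10 else 0.0) for s in range(60)]
gps_speed += [(s, 1.4) for s in range(60, 120)]
accel_magnitude = [(s, 0.05) for s in range(60)]
accel_magnitude += [(s, 0.6) for s in range(60, 180)]

print(per_minute_median(gps_speed))
print(agreement(gps_speed, accel_magnitude))
```

Minutes where the two sources disagree are often the interesting ones to inspect, since they may point to sensor noise, clock drift, or a gap in collection.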

You are encouraged to use multiple data sources and combine them for richer analysis. For example, use the iOS App Privacy Report, RouterSense for network-level data, Apple Watch health metrics, or apps that record sound, location, or other sensor data. Projects should address privacy and informed consent, especially if collecting sensitive data.

If you choose personal health analytics, aim for at least one week of continuous data collection and combine RouterSense data with other sources for deeper insights. Always combine RouterSense with something else—Apple Watch Health data, sleep data, or other continuous sensor data. You are not limited to this topic—propose your own ideas, but ensure your dataset is substantial and not trivially obtained. If your idea is too simple or relies on readily available data, it will be rejected.

Some example apps for continuous data collection include:

The key is to obtain and analyze large, challenging datasets, ideally from multiple sources, and to ask interesting questions that go beyond basic analysis. Think about the challenges of continuous data collection, privacy, battery consumption, and noise. What kind of inferences can you get out of this analysis? How do you cross-validate between different data sources? These are the kinds of projects I want to see.