Assessment

Grading Breakdown

Component	Points	Details
Class participation	5	Engagement in class discussions
Online polls	5	Participation in online polls
Quiz 1	15	In-class quiz (Oct 23)
Quiz 2	15	In-class quiz (Dec 4)
Data challenge	30	Oct 23 - Nov 6 (ends 18:00)
Final project	30	Presentation on Dec 11
Extra credit	12 max	RouterSense 60-day participation

Grades will be assigned based on this NYU policy.

Class participation

Active participation in class is highly encouraged and will be assessed on the quality of your engagement. This includes asking insightful questions that demonstrate your understanding of the material and providing thoughtful answers during discussions. When you contribute meaningfully, whether by posing a good question or offering a helpful response, please record your participation immediately using the provided Google Form. Danny will typically acknowledge strong contributions in class by saying "Good question" or "Good response." Consistent, high-quality participation will positively impact your grade and help foster a collaborative learning environment. Each record in the form, once verified, earns you one point (maximum one point per class).

Online polls

From time to time, Danny will conduct Mentimeter polls during class. To receive credit for online poll participation, make sure to record your NetID when prompted in these polls. Your responses—regardless of which options or choices you select—will count toward participation; correctness is not checked. The key requirement is to participate in these synchronous online polls whenever they are conducted.

Midterm quizzes

Each quiz will be administered on GradeScope and consists of 30 multiple-choice questions (shuffled) to be completed in 30 minutes. You may use your computer to access resources such as Google and LLMs, but you must not communicate with other people. Danny will present a scenario on the projector screen, and the questions will relate to that scenario.

In addition to the scenarios Danny will present on the board or screen during the quiz, you should also be familiar with scenarios and examples discussed in class. These will be clear and straightforward—nothing obscure or tricky. If you pay attention in class and review the recordings, you’ll be well prepared. You don’t need to memorize every detail, but make sure you understand the big picture and main ideas behind each scenario. The quizzes are designed to reinforce your understanding of real examples covered in class, so following along and grasping the context is key. Remember, the goal is to ensure you can follow and apply the concepts discussed in class, which an AI or large language model won’t have access to.

There will be two quizzes:

Oct 23: Quiz 1 will cover the fundamentals of big data analytics.

Dec 4: Quiz 2 will cover advanced topics.

Quiz logistics and recommendations: I strongly recommend taking the quiz in person during class. Many questions may be phrased ambiguously, and being present allows you to ask for clarifications on the spot—sometimes I’ll even provide hints in response to good questions. While you can participate via Zoom, my responses to clarification requests may be delayed since I prioritize in-class questions. You must take the quiz synchronously at the scheduled time, whether in person or on Zoom. If you miss the quiz, you cannot make it up, as I will explain the answers immediately after. However, if you have a legitimate reason for missing the quiz, I can offer a brief (about 10 minutes) phone interview at the end of the semester, where you’ll answer questions in an interview format to make up the grade.

Example scenario: Suppose you have log files for website visits and want to analyze the data to understand who visits your site and when.

Sample questions:

I want to look up an IP address really fast. Which SQL operation should I use on the ip_address column?
- GROUP BY
- JOIN
- CREATE INDEX (correct answer)
- None of the above
I have another dataset mapping IP addresses to countries. How should I merge my current dataset with this one?
- GROUP BY
- JOIN (correct answer)
- CREATE INDEX
- None of the above
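
Both sample questions can be tried out on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module. The table names (visits, ip_countries), the visited_at column, the index name, and the sample rows are all assumptions made up for illustration; they are not part of the quiz.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE visits (ip_address TEXT, visited_at TEXT)")
cur.execute("CREATE TABLE ip_countries (ip_address TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("203.0.113.5", "2024-10-01T09:00:00"),
        ("198.51.100.7", "2024-10-01T09:05:00"),
        ("203.0.113.5", "2024-10-01T10:30:00"),
    ],
)
cur.executemany(
    "INSERT INTO ip_countries VALUES (?, ?)",
    [("203.0.113.5", "US"), ("198.51.100.7", "DE")],
)

# Question 1: an index lets the database find an IP without scanning every row.
cur.execute("CREATE INDEX idx_visits_ip ON visits (ip_address)")
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM visits WHERE ip_address = ?",
    ("203.0.113.5",),
).fetchall()
# The last field of each plan row is a human-readable description of the step.
uses_index = any("idx_visits_ip" in row[-1] for row in plan)

# Question 2: JOIN merges the visit log with the IP-to-country mapping.
rows = cur.execute(
    """
    SELECT v.ip_address, v.visited_at, c.country
    FROM visits AS v
    JOIN ip_countries AS c ON v.ip_address = c.ip_address
    ORDER BY v.visited_at
    """
).fetchall()

print("lookup uses index:", uses_index)
for row in rows:
    print(row)
```

GROUP BY would instead be the right tool for a question such as "how many visits came from each IP address", since it aggregates rows rather than looking one up or merging tables.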

Big Data Challenge

The Big Data Challenge is a two-week competition starting Dec 5 and ending at 23:59 on Dec 19 (last day of the Finals period). Throughout this period, you’ll tackle a series of tasks available on the course portal. For each challenge, you’ll submit Python code directly to the portal, where it will be executed in the cloud.

You’ll be evaluated on two main criteria:

Accuracy: Measures the correctness of your results. This score is not curved.
Performance: Assesses how efficiently your code loads, reads, writes, and analyzes datasets. Performance scores will be curved, but the curving will be relaxed.

You’ll have unlimited opportunities to modify and resubmit your code. Since resubmissions may be queued—especially near the deadline—submit early to avoid delays. A sample dataset will be provided for local testing, but final evaluation will use a larger, unseen dataset on the portal. Optimize your code for scalability and generalization.
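
One common way to keep code scalable is to stream a file row by row instead of loading it all into memory. The sketch below illustrates that idea only; the actual challenge tasks, file formats, and how the portal measures performance are not described here, and the column names and sample data are hypothetical.

```python
import csv
import io
import time
from collections import Counter


def count_visits_per_ip(lines):
    """Count rows per IP address, reading one row at a time.

    Memory use grows with the number of distinct IPs, not the number of rows.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["ip_address"]] += 1
    return counts


# Tiny in-memory stand-in for a large log file (hypothetical data).
sample = io.StringIO(
    "ip_address,visited_at\n"
    "203.0.113.5,2024-10-01T09:00:00\n"
    "198.51.100.7,2024-10-01T09:05:00\n"
    "203.0.113.5,2024-10-01T10:30:00\n"
)

start = time.perf_counter()
counts = count_visits_per_ip(sample)
elapsed = time.perf_counter() - start

print(counts.most_common())
print(f"elapsed: {elapsed:.6f} s")
```

With a real file, the same function accepts an open file object, for example `with open(path, newline="") as f: count_visits_per_ip(f)`. Timing on the small sample says little about the larger unseen dataset, so it is worth testing on inputs of increasing size.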

If you use large language models (LLMs) at any stage, be especially critical of their outputs and ensure your solutions are robust.

You may work solo or in groups of up to four people; four is a strict maximum. All group members receive the same grade.

Final Project

Timeline

Oct 16: Start forming groups. Before the start of class, each group should briefly write about their project ideas in the public channel of the class Slack. Please use this invite link to join the class Slack. Danny will go over preliminary ideas from these groups in class.

Oct 23: Groups should be confirmed by this date. Please enter your group info and proposed topic in this spreadsheet.

Oct 30: Each project group should briefly present their plans and timeline at the beginning of class.

Nov 6: Each project group should briefly present their progress at the beginning of class.

Dec 4: Each project group should briefly present their progress at the beginning of class.

Dec 11: Final project presentation.

Final Project Presentation

Students are expected to present their final projects in person on Dec 11. If special circumstances arise, presenting over Zoom is acceptable—just let Danny know in advance. In general, presenting in person is much easier and more effective for telling your story.

If you choose to present over Zoom, you must turn on the camera. Danny will configure the screen such that your face will take up 50% of the screen, while the remaining 50% is for your slide deck.

Whether you present in person (preferred) or on Zoom (less preferred, since you might be slightly disadvantaged), the presenter should be the star of the show. The slide deck is only a prop: the speaker should be the center of attention, and slides should support your story, not dominate it. Keep the text on each slide to a minimum and use plenty of graphics and visualizations. Make the text big, especially on Zoom, because I'll be showing your deck at 50% size.

I'd prefer a single speaker throughout each group's presentation (whether in person or on Zoom) for a consistent experience. This is a recommendation rather than a strict requirement, and I encourage everyone to chime in during the Q&A.

Each group will present for at most 7 minutes, followed by 3 minutes of Q&A. During the Q&A, the next group should set up and be ready.

If you present in person, please join the Zoom session and share your screen. I'll project your slides to the class via my computer.

Presentations are meant to be interactive, with questions throughout—expect interruptions from Danny and classmates. Typical questions include:

What are the major challenges in managing your data?
What makes your project “big data”?
What is the technical or social significance of your project?

Danny will evaluate presentations on the spot. At the end, send Danny a link to your presentation slides—no write-up is needed. The presentation itself is your final deliverable.

Get into teams of at most 4 people. If you want to do the project alone, that's fine with me. The number of people on the team should reflect the challenges in the project.

Criteria for grading

Although the interim progress reports are not graded, please treat them seriously because Danny will be providing feedback.

Danny will be grading the final presentation based on the following criteria:

Storytelling (20 points):
- Explain the problem/motivation (very important)
- Compare with existing work
- Method, results, limitations, implications
Technical contributions (10 points):
- Obtaining, pre-processing, analyzing, evaluating, and visualizing data

(Please see the December 4 class recording for details on the above criteria.)

For each of the criteria above, Danny will grade on the following scale, with a maximum of 5 points:

Exceeding expectation (5)
Meeting expectation (3)
Below expectation (1)
Not done or missing (0)

Presentation Requirements

Total time: Each group has 7 minutes max to present, followed by 3 minutes of Q&A. Expect interruptions for questions.
Slides: Keep them simple. Make the fonts big. Use lots of graphics and visualizations.
Delivery: Do not read directly from your slides; otherwise, Danny will interrupt.
Evaluation criteria: Make sure you address all bullet points in the evaluation criteria. For example, be prepared to answer questions like:
- How big is your dataset?
- What challenges did you face in obtaining, pre-processing, or analyzing the large dataset?
Objectivity: Be upfront about limitations; do not oversell your work. This is a science class—be objective, not promotional.

The presentation and its slides are the only deliverable I ask for. There is no need to submit a written report. That said, for projects that I find especially interesting, I am more than happy to work with you separately on a blog post; I'll ask the NYU Tandon Media team to promote it through official channels.

Potential Topics

I do not want students to simply take existing datasets (e.g., from Kaggle or public APIs) and do basic analysis or apply standard machine learning models. That's boring. The most important aspect of your project should be obtaining an interesting, large-scale dataset yourself. Projects using easily accessible or small datasets are not acceptable unless you can demonstrate that you obtained the data yourself.

A recommended direction is personal health analytics—collecting detailed data about yourself, such as web browsing history, phone usage, accelerometer readings, location data, or other sensor data over at least one week. This is interesting because it involves real challenges: continuous data collection, privacy, battery consumption, and handling noisy data (e.g., GPS inaccuracies). Think about how to cross-validate between different sources (GPS, accelerometer, network traffic) and what meaningful inferences you can draw.
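
As one possible way to cross-validate two sources, the sketch below compares GPS-derived speed against a motion flag derived from the accelerometer and reports the intervals where they disagree. The data layout, the 0.5 m/s threshold, and the sample readings are assumptions chosen for illustration; real sensor exports will look different and need their own thresholds.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def flag_disagreements(gps, accel_moving, speed_threshold=0.5):
    """Return (timestamp, gps_speed) for intervals where the two sensors disagree.

    gps: list of (t_seconds, lat, lon) with strictly increasing timestamps.
    accel_moving: dict mapping the end timestamp of each interval to True/False.
    speed_threshold: hypothetical cutoff in m/s above which GPS says "moving".
    """
    flagged = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(gps, gps[1:]):
        speed = haversine_m(lat0, lon0, lat1, lon1) / (t1 - t0)
        gps_moving = speed > speed_threshold
        if gps_moving != accel_moving[t1]:
            flagged.append((t1, speed))
    return flagged


# Hypothetical readings: the GPS position jumps about 100 m in the second
# interval while the accelerometer reports no movement at all.
gps = [(0, 40.6942, -73.9866), (10, 40.6942, -73.9866), (20, 40.6951, -73.9866)]
accel_moving = {10: False, 20: False}

flagged = flag_disagreements(gps, accel_moving)
for t, speed in flagged:
    print(f"t={t}s: GPS speed {speed:.1f} m/s but accelerometer says stationary")
```

A disagreement like this does not say which sensor is wrong. It only marks the interval as worth inspecting, for example as a likely GPS inaccuracy when the phone was indoors.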

There are tons of challenges not only in obtaining this data, but also in making sure it's not noisy. We're talking about potentially per-second sensor data, which can be a lot to deal with. Consider how to handle large volumes, clean the data, and ask interesting questions.
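
One simple way to tame per-second data is to downsample it into per-minute medians, which shrinks the volume and dampens isolated spikes. This is a minimal sketch under assumed inputs (integer Unix timestamps paired with a single numeric reading); it is only one of many possible cleaning strategies.

```python
from collections import defaultdict
from statistics import median


def downsample_to_minutes(samples):
    """Collapse per-second readings into one median value per minute.

    samples: iterable of (unix_seconds, value) pairs with integer timestamps.
    Returns a dict mapping the start of each minute to the median reading.
    """
    buckets = defaultdict(list)
    for t, value in samples:
        buckets[t - t % 60].append(value)  # round down to the start of the minute
    return {minute: median(values) for minute, values in sorted(buckets.items())}


# Hypothetical data: two minutes of steady readings with one noisy spike.
samples = [(t, 1.0) for t in range(120)]
samples[30] = (30, 50.0)

per_minute = downsample_to_minutes(samples)
print(per_minute)
```

The median is used here rather than the mean so that a single outlier does not shift the result for its minute. Keeping the raw data alongside the downsampled version makes it possible to revisit that choice later.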

You are encouraged to use multiple data sources and combine them for richer analysis. For example, use the iOS App Privacy Report, RouterSense for network-level data, Apple Watch health metrics, or apps that record sound, location, or other sensor data. Projects should address privacy and informed consent, especially if collecting sensitive data.

If you choose personal health analytics, aim for at least one week of continuous data collection and combine RouterSense data with other sources for deeper insights. Always combine RouterSense with something else—Apple Watch Health data, sleep data, or other continuous sensor data. You are not limited to this topic—propose your own ideas, but ensure your dataset is substantial and not trivially obtained. If your idea is too simple or relies on readily available data, it will be rejected.

Some example apps for continuous data collection include:

RouterSense (network activity)
iOS App Privacy Report
Apple Health / Apple Watch
Sensor logging apps (accelerometer, GPS, sound recording). Examples:
- Gaia GPS: iOS and Android
- Sensor Logger: iOS and Android; read about it on its website

The key is to obtain and analyze large, challenging datasets, ideally from multiple sources, and to ask interesting questions that go beyond basic analysis. Think about the challenges of continuous data collection, privacy, battery consumption, and noise. What kind of inferences can you get out of this analysis? How do you cross-validate between different data sources? These are the kinds of projects I want to see.
