Syllabus
We meet 18:00 - 20:30 every Thursday. Feel free to join us...
- in person (2 MetroTech Center Room 817) or
- on Zoom (NYU sign-in required). I livestream every class and record it, but since participation matters, I encourage you to join synchronously rather than rely on the recording.
September 4: Introduction
Everyone should read this paper: "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
September 11: End-to-end example
Class focused on exploring real-world big data analysis using a live demo of a fictitious e-commerce website. Danny invited students to interact with the site, generating user activity logs for analysis. He emphasized starting from high-level business questions—such as identifying popular products, user locations, and peak activity times—before diving into technical details.
Danny demonstrated downloading raw server logs, inspecting their JSON structure, and using Jupyter notebooks with Python and Pandas for data parsing. He showed how to extract product names from request URIs, filter out irrelevant log entries, and discussed the importance of understanding data schemas and ground truth. The session highlighted the use of AI coding assistants (like GitHub Copilot) for rapid prototyping, but stressed the need for human oversight and careful prompt engineering.
Students learned to build a simple data pipeline: loading logs, cleaning data, extracting product views, and aggregating results to identify the most popular items. Danny discussed common pitfalls, such as misinterpreting log fields and the impact of repeated page refreshes on popularity metrics. The class ended with a discussion on defining “popularity” (by frequency vs. unique users) and the importance of clear business logic in analytics.
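The pipeline described above can be sketched in a few lines of standard-library Python. The log fields (`user_id`, `uri`) and URI layout here are illustrative assumptions, not the exact schema from the demo, and the class used Pandas rather than `collections.Counter`:

```python
import json
from collections import Counter

# Hypothetical log lines in a JSON shape similar to the one inspected in class.
raw_lines = [
    '{"user_id": "u1", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u1", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u2", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u2", "uri": "/cart", "status": 200}',
    '{"user_id": "u3", "uri": "/product/tea-pot", "status": 200}',
]

def product_from_uri(uri):
    """Extract the product name from a request URI, or None for non-product pages."""
    parts = uri.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "product":
        return parts[1]
    return None

events = [json.loads(line) for line in raw_lines]
views = [(e["user_id"], product_from_uri(e["uri"])) for e in events]
views = [(u, p) for u, p in views if p is not None]  # filter out irrelevant entries

# The two definitions of "popularity" discussed in class can disagree,
# e.g. when one user refreshes a page repeatedly:
by_frequency = Counter(p for _, p in views)
by_unique_users = Counter(p for u, p in {(u, p) for u, p in views})

print(by_frequency.most_common(2))   # red-mug counted 3 times by frequency...
print(by_unique_users["red-mug"])    # ...but only 2 unique users viewed it
```

Here repeated refreshes by `u1` inflate the frequency metric but not the unique-user metric, which is exactly why the business logic must pin down which definition it wants.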
September 18: End-to-end example
Class focused on building a real-time dashboard to analyze user activity on a fictitious shopping website. Danny recapped key lessons from previous sessions: the importance of understanding data sources, documentation challenges, and the concept of “data cascades”—how poor data quality propagates through models. He emphasized using AI coding assistants (like GitHub Copilot) for rapid prototyping, but stressed the need for human oversight and clear design documentation.
Danny demonstrated setting up a Python environment on a remote server via SSH, installing Pandas for data manipulation and Streamlit for interactive dashboards. He showed how to access and parse server log files, extracting fields such as timestamps, user IDs, and product IDs. The demo included writing Python scripts to filter logs for recent activity, aggregate product views, and count unique visitors in the last minute.
Using Streamlit, Danny built a dashboard with live-updating charts:
- A bar chart showing the most-viewed products in real time
- A counter for active users in the past minute
- A table listing recent user actions
He explained Streamlit’s auto-refresh feature for real-time updates, and walked through debugging issues like missing log fields and handling malformed entries. Danny discussed optimizing performance by batching log reads and using Pandas’ efficient data structures. Students saw how to deploy the dashboard for remote access and explored customizing the UI for clarity and usability.
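The active-user counter boils down to a sliding-window computation that the dashboard re-runs on every refresh. A minimal standard-library sketch, with made-up timestamps and user IDs rather than the actual log schema:

```python
from datetime import datetime, timedelta

# Illustrative (timestamp, user_id) pairs parsed from the log.
now = datetime(2025, 9, 18, 18, 30, 0)
events = [
    (now - timedelta(seconds=10), "u1"),
    (now - timedelta(seconds=40), "u2"),
    (now - timedelta(seconds=55), "u1"),
    (now - timedelta(seconds=90), "u3"),  # outside the one-minute window
]

# Keep only events from the last minute, then count distinct users.
window_start = now - timedelta(minutes=1)
active_users = {user for ts, user in events if ts >= window_start}
print(len(active_users))  # 2 -- u1 and u2; u3's last event is too old
```

In a Streamlit app this computation would sit inside the script body, so each auto-refresh re-reads the latest log batch and recomputes the set.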
The session ended with Q&A on dashboard design, handling large log files, and strategies for scaling real-time analytics in production environments.
September 25: NoSQL
Class focused on privacy-preserving data collection and database design. Danny demonstrated a research study using a VPN app (WireGuard) to collect phone activity data, emphasizing the importance of saving your participation ID, as it is never stored by researchers. The study dashboard allows participants to view their own device activity, export data, and collaborate with peers, while maintaining privacy—no personally identifiable information (PII) is collected.
Danny explained the benefits of the system, including ad blocking and detailed usage heatmaps, and discussed weekly surveys correlating phone activity with psychological measures. The session transitioned to technical topics: streaming log data into databases, contrasting SQLite (schema-based, fast with indexes) and MongoDB (schema-less, flexible for evolving data needs).
Danny provided a live demo of MongoDB, showing how to set up a local instance, create databases and collections, and insert JSON-like documents. He demonstrated querying with filters, projections, and aggregations, highlighting MongoDB’s ability to handle nested and heterogeneous data structures. The class discussed how MongoDB’s flexible schema supports rapid prototyping and evolving requirements, and how indexes can be created to optimize query performance. Danny also covered best practices for scaling MongoDB horizontally using sharding, and explained replica sets for high availability. The session ended with Q&A on database scaling, indexing, and strategies for managing large, distributed datasets in NoSQL environments.
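To see the shape of MongoDB-style queries without a running server, here is a toy in-memory matcher over Python dicts. It supports only top-level equality filters and field-list projections, a small slice of what real MongoDB accepts; the collection contents are invented:

```python
# MongoDB filters and projections are themselves documents (dicts/lists).
# This toy matcher mimics that shape for simple equality filters only.
docs = [
    {"name": "Ada", "city": "NYC", "tags": ["vpn", "study"]},
    {"name": "Ben", "city": "SF"},
    {"name": "Cal", "city": "NYC", "extra": {"nested": True}},
]

def find(collection, filt, projection=None):
    """Yield documents matching every top-level key in `filt`."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filt.items()):
            if projection:
                yield {k: doc[k] for k in projection if k in doc}
            else:
                yield doc

results = list(find(docs, {"city": "NYC"}, projection=["name"]))
print(results)  # [{'name': 'Ada'}, {'name': 'Cal'}]
```

Against a real instance, the equivalent pymongo call is `db.users.find({"city": "NYC"}, ["name"])`; note that heterogeneous documents (Ben has no `tags`, Cal has a nested `extra`) coexist in one collection with no schema change.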
October 2: Web scraping
Class discussion focused on the differences between structured and unstructured data, privacy in data collection, and hands-on data scraping. Danny demonstrated how personal phone activity data is collected for a research study, highlighting privacy concerns and stressing the importance of not sharing participation IDs. The class explored techniques for analyzing structured data using SQLite, Pandas, and MongoDB, then transitioned to unstructured data, with a focus on text.
Danny introduced web scraping as a practical way to collect large volumes of text data, using job postings as a case study. The session featured a live demo of inspecting job sites (LinkedIn, Greenhouse, Ashby, Indeed, Monster), discussing challenges such as JavaScript-rendered content, login requirements, and user-agent spoofing. Students learned to spot-check downloaded HTML for accuracy and to verify data quality before coding. The importance of understanding site structure and legal/ethical considerations was emphasized. The next steps will cover parsing HTML and extracting relevant information using Python and Jupyter notebooks, preparing for hands-on exercises in data extraction.
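Once HTML is downloaded and spot-checked, extraction can start with the standard library alone. The markup below is a hypothetical job-board fragment (real sites use different class names and often render listings with JavaScript, which this approach cannot see):

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a downloaded job-board page.
SAMPLE_HTML = """
<div class="posting"><a class="job-title" href="/jobs/1">Data Engineer</a></div>
<div class="posting"><a class="job-title" href="/jobs/2">ML Researcher</a></div>
"""

class JobTitleParser(HTMLParser):
    """Collect the text of every <a class="job-title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "job-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

parser = JobTitleParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)  # ['Data Engineer', 'ML Researcher']
```

In practice libraries like BeautifulSoup are less brittle, but the verification step is the same: eyeball a few extracted records against the rendered page before trusting the pipeline.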
October 9: Data Storage Trade-offs and Scraping Challenges
Class focused on practical trade-offs in storing and analyzing large text datasets, marking a transition from web scraping to hands-on work with the historical Yelp dataset. Danny began by revisiting the concept of “data cascades” from the assigned paper, emphasizing how early errors or bias in data labeling can propagate through downstream analytics and machine learning models.
The session included interactive polls and open-ended questions about data pipelines, accuracy, and storage formats. Danny compared CSV, JSON, SQL, NoSQL, and Pandas DataFrames, outlining their strengths and weaknesses:
- CSV/JSON: Simple, portable, but limited for complex queries and large-scale analytics.
- SQL/NoSQL databases: Persistent, scalable, support indexing and complex queries; SQL is schema-based, NoSQL is flexible for evolving data.
- Pandas: Fast for in-memory analysis, ideal for prototyping and small datasets, but not durable or scalable for production.
Danny explained that persistent storage (disk-based databases or files) is preferred for batch analytics, while in-memory solutions (like Pandas or Redis) are suitable for real-time, short-term analysis but have limitations in durability and scalability.
Students discussed bottlenecks in big data workflows:
- CPU-bound: Intensive computations (e.g., parsing, aggregations)
- Memory-bound: Large datasets exceeding RAM
- I/O-bound: Disk speed, often the slowest part of the pipeline
Danny clarified the role of indexes: they slow down writes but dramatically speed up reads, and are essential for fast querying in large datasets.
The class then brainstormed challenges in scraping Amazon reviews, identifying issues such as:
- Dynamic content and JavaScript rendering
- Anti-scraping measures (CAPTCHAs, rate limits)
- Legal/copyright concerns (CFAA risks)
- Prevalence of fake reviews and products
Danny demonstrated scraping data behind login screens using browser cookies and curl, showing how to capture authenticated requests. He stressed the importance of verifying data authenticity, noting that modern datasets may be polluted by AI-generated or misleading content.
The session concluded with logistics for participation and a preview of upcoming hands-on analysis using the Yelp dataset, including data exploration, cleaning, and building scalable pipelines for text analytics.
October 16: Yelp analysis and benchmarking
Class focused on final project planning and practical big data analysis using the Yelp dataset. Danny outlined the final project requirements: each group must propose an original data-driven project, emphasizing the importance of collecting or scraping raw data rather than relying on pre-cleaned datasets (e.g., Kaggle). Projects should demonstrate creativity, tackle real-world challenges, and ideally incorporate heterogeneous data sources, personal sensor data, or complex scraping tasks. Students are expected to document their data collection process, address potential biases, and validate the integrity of their datasets.
Danny reviewed project proposals live, providing feedback such as:
- Encouraging teams to clarify their business questions and hypotheses before coding.
- Advising against using only public APIs without exploring data limitations or missing fields.
- Suggesting ways to combine multiple data sources for richer analysis.
- Recommending students focus on data curation, cleaning, and benchmarking, as these are the most challenging and valuable aspects of big data work.
He explained the mechanics of the Big Data Challenge, where students submit code and receive real-time feedback on performance, stressing the need for efficient queries and optimization. Quizzes will be open book but context-dependent, rewarding students who engage actively in class.
Technical topics included importing and analyzing Yelp data using SQLite, Pandas, DuckDB, and MongoDB. Danny emphasized the importance of understanding the strengths and limitations of each tool, including their performance characteristics and suitability for different types of analysis. Students practiced select, filter, group by, and join operations, and learned how to build materialized views for benchmarking.
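A minimal SQLite version of those operations, using toy tables that only mimic the spirit of the Yelp schema (names and fields simplified):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE businesses (business_id TEXT PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE reviews (review_id TEXT PRIMARY KEY, business_id TEXT, stars REAL);
""")
con.executemany("INSERT INTO businesses VALUES (?, ?, ?)", [
    ("b1", "Joe's Pizza", "NYC"),
    ("b2", "Taco Spot", "Austin"),
])
con.executemany("INSERT INTO reviews VALUES (?, ?, ?)", [
    ("r1", "b1", 5.0), ("r2", "b1", 4.0), ("r3", "b2", 3.0),
])

# Select + filter + group by + join in one query:
rows = con.execute("""
    SELECT b.name, COUNT(*) AS n_reviews, AVG(r.stars) AS avg_stars
    FROM reviews r JOIN businesses b USING (business_id)
    WHERE b.city = 'NYC'
    GROUP BY b.business_id
""").fetchall()
print(rows)  # [("Joe's Pizza", 2, 4.5)]

# SQLite has no materialized views; CREATE TABLE AS precomputes the same
# aggregate, which is the pattern used when benchmarking repeated queries.
con.execute("CREATE TABLE business_stats AS "
            "SELECT business_id, COUNT(*) AS n_reviews, AVG(stars) AS avg_stars "
            "FROM reviews GROUP BY business_id")
```

The same logical query can then be re-expressed in Pandas, DuckDB, or MongoDB and timed against each backend, which is the benchmarking exercise described above.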
Danny reiterated that successful projects require careful data management, understanding noise, and identifying biases. The session ended with Q&A on project ideas, data sources, and strategies for collecting meaningful, original datasets. Danny encouraged students to iterate on their proposals, use Slack for feedback, and focus on the unique challenges of data validation and management.
October 23: Quiz 1 and Yelp analysis + benchmarking
Quiz 1 on fundamentals of big data analytics. The quiz will take place in class or on Zoom synchronously at 18:15 - 18:45. Students are expected to take the quiz online on Gradescope. The quiz is open book, open Internet. Students can use any tools (e.g., LLMs), but they may not communicate with another human being during the quiz.
At the beginning of class, Danny will discuss the final projects:
- Groups should already be confirmed. Please enter your group and proposed topic on this spreadsheet.
- Before the start of class, each group should post its ideas again in the public channel on the class Slack.
- Danny will go over these ideas. Ideas should be finalized today.
Class focused on foundational big data processing concepts, emphasizing Spark, MapReduce, and lazy evaluation. Danny began by explaining the MapReduce paradigm: the map step transforms input data into key-value pairs, the shuffle redistributes data across nodes, and the reduce step aggregates results. He illustrated how traditional MapReduce (e.g., Hadoop) writes intermediate data to disk, which can be slow and resource-intensive.
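The three phases can be sketched in plain Python with a word count, the canonical MapReduce example (the input documents are invented):

```python
from collections import defaultdict

docs = ["big data class", "big big wins"]

# Map: transform input into (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group values by key. In Hadoop this step crosses the network
# and, for classic MapReduce, is written to disk between stages.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: aggregate each key's values.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'big': 3, 'data': 1, 'class': 1, 'wins': 1}
```

In a real cluster each phase runs in parallel across many nodes, but the data flow is exactly this: map, shuffle by key, reduce.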
Danny then introduced Apache Spark, highlighting its use of in-memory computation and Resilient Distributed Datasets (RDDs). Spark’s transformations (e.g., map, filter, groupByKey) are lazy—they build a logical execution plan but do not compute results until an action (e.g., collect, count, saveAsTextFile) is triggered. This allows Spark to optimize the pipeline, minimize data shuffling, and avoid unnecessary computation.
Technical details included:
- Spark’s DAG (Directed Acyclic Graph) scheduler for optimizing execution plans.
- Fault tolerance via RDD lineage: lost partitions can be recomputed from original data.
- Example code snippets showing Spark transformations (parsing each JSON line into a record before filtering on the review text):

rdd = sc.textFile("yelp_reviews.json")
reviews = rdd.map(json.loads)
filtered = reviews.filter(lambda r: "pizza" in r["text"])
counts = filtered.map(lambda r: (r["business_id"], 1)).reduceByKey(lambda a, b: a + b)

- Comparison of Spark's reduceByKey (which combines results locally before shuffling) vs. groupByKey (which shuffles all data).
Danny discussed trade-offs: Spark excels at iterative algorithms and interactive analytics, while MapReduce is better for simple, batch-oriented jobs. Students asked about cluster setup, memory management, and Spark’s support for SQL and DataFrames.
The session ended with reminders about project logistics and a preview of distributed computation topics, including sharding and benchmarking.
October 30: Spark
Danny provided detailed feedback on class projects, emphasizing several key points for success:
- Original Data Collection: Projects must go beyond using pre-cleaned datasets; students should collect, scrape, or generate their own raw data. Danny cautioned against relying solely on public APIs or Kaggle datasets, urging teams to demonstrate creativity and initiative in sourcing data.
- Documentation and Transparency: Every step of the data pipeline—from collection and cleaning to transformation and analysis—should be clearly documented. Danny stressed that reproducibility and transparency are essential, and teams should be able to explain their choices and methods.
- Data Validation and Bias Awareness: Danny highlighted the importance of validating data integrity and being aware of potential biases, especially when merging heterogeneous sources or working with sensor data. He encouraged students to critically assess data provenance and address missing or noisy values.
- Benchmarking and Performance: Teams should benchmark their analytics, optimize queries, and be mindful of scalability. Danny recommended profiling Spark jobs, identifying bottlenecks, and iterating on pipeline efficiency.
- Clear Business Questions: Projects should start with well-defined business or research questions. Danny advised teams to clarify their hypotheses and goals before diving into technical implementation.
- Iterative Feedback: Danny encouraged ongoing feedback and iteration, using Slack and class discussions to refine project ideas and approaches.
Danny also discussed Spark in detail, sharing best practices and common challenges:
- Best Practices: Write Spark code using transformations like map, filter, and reduceByKey to leverage lazy evaluation and minimize unnecessary computation. Use reduceByKey instead of groupByKey for aggregations to reduce data shuffling and improve performance. Profile Spark jobs to identify bottlenecks, and optimize memory usage by caching RDDs or DataFrames when needed.
- Challenges: Be aware of cluster setup complexities, memory management issues, and the impact of shuffling large datasets. Debugging distributed jobs can be difficult, so Danny recommended starting with small samples locally before scaling up. He emphasized the importance of understanding Spark's execution plan (DAG) and using built-in tools for monitoring and troubleshooting.
- General Advice: Document Spark pipelines thoroughly, validate intermediate results, and iterate on pipeline design to ensure scalability and correctness.
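Why reduceByKey shuffles less than groupByKey can be simulated without a cluster. Below, two invented partitions hold (key, 1) pairs; groupByKey would ship every pair across the network, while reduceByKey first combines locally within each partition and ships only one partial sum per key:

```python
from collections import Counter

# Simulated partitions of (key, 1) pairs (counts are made up).
partitions = [
    [("pizza", 1)] * 500 + [("tacos", 1)] * 100,
    [("pizza", 1)] * 300,
]

# groupByKey: every record crosses the network.
records_shuffled_groupByKey = sum(len(p) for p in partitions)

# reduceByKey: combine locally first, then shuffle one partial sum
# per (key, partition) pair.
locally_combined = [Counter(k for k, _ in p) for p in partitions]
records_shuffled_reduceByKey = sum(len(c) for c in locally_combined)

print(records_shuffled_groupByKey)   # 900 records shuffled
print(records_shuffled_reduceByKey)  # 3 partial sums shuffled

# Merging the partial sums gives the same final answer either way:
final = Counter()
for c in locally_combined:
    final.update(c)
print(dict(final))  # {'pizza': 800, 'tacos': 100}
```

The skew matters: with few distinct keys and many records per partition, local combining shrinks the shuffle by orders of magnitude, which is exactly Spark's map-side combine in reduceByKey.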
He reminded students that successful projects require careful planning, rigorous validation, and a willingness to tackle real-world data challenges. Attention to detail, documentation, and responsiveness to feedback will be critical for achieving strong results.
November 6: Text data analysis and distributed computation
Big Data Challenge Kickoff
Danny announced the start of the Big Data Challenge, a multi-week competition where students will analyze a shared dataset and submit code through a Google Form for testing accuracy and performance. He explained that the challenge will use a dashboard to display each submission's results in real time against the rest of the class, with multiple submission attempts allowed. He also outlined upcoming topics on textual data and distributed data, noting that guest lecturers will cover large language models in the next two weeks, and that the final exam will include high-level questions on these topics.
Large Language Models in Search
Danny discussed the use of large language models for search and ranking, mentioning YOLO for object detection and RAG for search applications. He expressed hesitation about the reliability of large language models due to hallucination issues. Danny explained that evaluation methods for these models include benchmarking and human-in-the-loop approaches, using Google Health AI as an example. He also mentioned that he would cover RAG in more detail next week and would show some proprietary Google Health AI slides during the in-person session.
AI Model Evaluation Strategies
Danny discussed the evaluation of high-stake AI models, particularly in healthcare, which involves both automatic and human evaluation methods. He explained that while automatic evaluation is straightforward, human evaluation is crucial for ensuring quality and often involves expert raters assessing models on various axes. Danny also touched on the trade-offs of using large language models for retrieval tasks, noting the difficulty of automatic evaluation and the need for human involvement, which can be costly and not very scalable. He mentioned that sampling is often used in academic settings due to resource constraints. Danny further addressed Rui's question about converting natural language queries into SQL or Python code for accessing local databases, explaining that this process is now feasible with advanced language models like GPT-4o, which can generate appropriate code based on context. He also briefly introduced the concept of homomorphic encryption as a privacy-preserving method for processing data in the cloud without exposing the actual data.
Search Algorithm Manipulation and Risks
Danny discussed the manipulation of search algorithms and language models, highlighting how they can be influenced to rank harmful content highly. He explained that while these models are not "hallucinatory" in the sense of creating false information, they can be trained on misleading data, leading to incorrect outputs. Danny also mentioned an ongoing research project with JPMorgan Chase to develop a chatbot that helps consumers navigate cybersecurity risks, noting that current language models often provide inaccurate advice on such topics.
Risks in Language Model Applications
Danny discussed the potential risks of large language models, particularly in therapeutic and safety contexts. He highlighted a case where a chatbot provided a correct but unsafe response to a question about handling potential online tracking by a partner, emphasizing the need for human expertise in assessing safety. Danny also mentioned ongoing work on using language models to assist victims of domestic abuse and the challenges in benchmarking these models for safety and accuracy.
Distributed Data Storage and AI
Danny discussed the use of distributed data storage and replication techniques to improve performance and availability, using examples like YouTube and Netflix. He explained concepts such as sharding, partitioning, and the trade-offs between eventual consistency and strong consistency models. Danny also touched on the potential of AI language models for emotional support and therapy, citing a New York Times article, and emphasized the need for guardrails to handle dangerous situations.
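The core of sharding is a deterministic routing function from key to shard. A minimal sketch, with a made-up key scheme and shard count (real systems prefer consistent hashing so that changing the shard count does not remap almost every key):

```python
import hashlib

N_SHARDS = 4

def shard_for(key: str) -> int:
    """Route a key to one of N_SHARDS shards.

    Uses a stable hash (md5) rather than Python's built-in hash(),
    which is randomized per process and would break routing.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# The same key always lands on the same shard, across machines and restarts:
assert shard_for("user:42") == shard_for("user:42")

placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
print(placement)  # each key maps to a shard id in 0..3
```

Replication then sits on top of this: each shard is copied to several nodes, and the consistency model governs how reads and writes coordinate across those copies.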
Distributed Systems and Database Fundamentals
Danny explained key concepts in distributed systems and database management, focusing on replication models, consistency, and the CAP theorem. He emphasized that for many small to medium-sized enterprises, simpler centralized systems with powerful machines and replication tools like RQLite or Litestream can be more effective than complex distributed systems. Danny also briefly touched on blockchain technology, noting its relevance in finance but cautioning that it is not typically used in the same way as traditional distributed databases.
Bitcoin and Distributed Database Principles
Danny explained the fundamental principles of Bitcoin as a decentralized distributed database, describing how transactions are processed and confirmed through a consensus mechanism called mining. He highlighted that while Bitcoin's proof-of-work system is energy-intensive, other distributed databases often use proof-of-stake approaches for better efficiency. Danny also mentioned that banks are exploring similar blockchain technologies to achieve faster transaction confirmations without relying on central clearinghouses. He concluded by noting that while Bitcoin remains a significant application of distributed databases, the fintech industry has evolved and may now use different implementations of these principles.
Database and Health Data Analysis
Danny demonstrated how to benchmark database performance using Python, comparing Pandas and SQLite for loading and inserting data. He showed that Pandas was faster for in-memory operations, but emphasized that benchmarking is crucial for choosing the right technology for big data challenges. Julio discussed his project on analyzing health data from Apple Health and RouterSense, focusing on the correlation between phone usage and stress levels. Danny suggested aligning the data temporally and considering factors like website types and traffic volumes. He advised Julio to post questions in Slack and reach out to Ramin for further guidance.
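The benchmarking pattern itself is small: time each loading strategy with `time.perf_counter` and compare. This sketch contrasts two SQLite insert strategies rather than the Pandas-vs-SQLite comparison from the demo; table names and row counts are arbitrary:

```python
import sqlite3
import time

rows = [(i, f"user{i}") for i in range(20_000)]

def bench(label, fn):
    """Run fn once and print its wall-clock duration."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

con = sqlite3.connect(":memory:")

con.execute("CREATE TABLE a (id INTEGER, name TEXT)")
def row_by_row():
    for r in rows:
        con.execute("INSERT INTO a VALUES (?, ?)", r)
bench("row-by-row inserts", row_by_row)

con.execute("CREATE TABLE b (id INTEGER, name TEXT)")
bench("executemany", lambda: con.executemany("INSERT INTO b VALUES (?, ?)", rows))

# Both strategies must load the same data, or the timing comparison is meaningless.
assert con.execute("SELECT COUNT(*) FROM a").fetchone()[0] == len(rows)
assert con.execute("SELECT COUNT(*) FROM b").fetchone()[0] == len(rows)
```

The point of the demo stands regardless of the backend: measure on your own data and hardware before committing to a technology, and always verify that the faster path produced identical results.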
November 13: LLMs and Urban Analytics
Guest lectures:
Please attend in person. The class will not be livestreamed or recorded. Quiz 2 will cover contents of this class.
November 20: AI/ML for Big Data
Guest lectures: AI/ML for Big Data (Rameen Mahmood): slides
Please attend in person. The class will not be livestreamed or recorded. Quiz 2 will cover contents of this class.
November 27: No class
No class - Thanksgiving
December 4: Quiz 2 and latest big data trends
Quiz 2 on topics covered so far. Same format as Quiz 1.
Danny will also ask individual project groups to briefly present their progress.
During class, Danny will discuss the latest technological trends:
- streaming
- time-series databases
- Kafka, Hive, BigQuery
December 11: Final project presentation
Schedule of presentations: see this sheet
Please edit the sheet to include a public Google Drive link to your slide deck. Acceptable formats include Google Slides, PowerPoint, or PDF. No Keynote decks, please.
See this document for details on the presentation.