Syllabus
We meet 18:00 - 20:30 every Thursday. Feel free to join us...
- in person (2 MetroTech Center Room 817) or
- on Zoom (NYU sign-in required). I livestream every class and record it, but since participation matters, I encourage you to join synchronously rather than rely on the recording.
September 4: Introduction
Everyone should read this paper: "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
September 11: End-to-end example
Class focused on exploring real-world big data analysis using a live demo of a fictitious e-commerce website. Danny invited students to interact with the site, generating user activity logs for analysis. He emphasized starting from high-level business questions—such as identifying popular products, user locations, and peak activity times—before diving into technical details.
Danny demonstrated downloading raw server logs, inspecting their JSON structure, and using Jupyter notebooks with Python and Pandas for data parsing. He showed how to extract product names from request URIs, filter out irrelevant log entries, and discussed the importance of understanding data schemas and ground truth. The session highlighted the use of AI coding assistants (like GitHub Copilot) for rapid prototyping, but stressed the need for human oversight and careful prompt engineering.
Students learned to build a simple data pipeline: loading logs, cleaning data, extracting product views, and aggregating results to identify the most popular items. Danny discussed common pitfalls, such as misinterpreting log fields and the impact of repeated page refreshes on popularity metrics. The class ended with a discussion on defining “popularity” (by frequency vs. unique users) and the importance of clear business logic in analytics.
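The pipeline described above can be sketched in a few lines of standard-library Python. The log fields (`user_id`, `uri`) and URI layout here are illustrative assumptions, not the exact schema from the demo, and the class used Pandas rather than `collections.Counter`:

```python
import json
from collections import Counter

# Hypothetical log lines in a JSON shape similar to the one inspected in class.
raw_lines = [
    '{"user_id": "u1", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u1", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u2", "uri": "/product/red-mug", "status": 200}',
    '{"user_id": "u2", "uri": "/cart", "status": 200}',
    '{"user_id": "u3", "uri": "/product/tea-pot", "status": 200}',
]

def product_from_uri(uri):
    """Extract the product name from a request URI, or None for non-product pages."""
    parts = uri.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "product":
        return parts[1]
    return None

events = [json.loads(line) for line in raw_lines]
views = [(e["user_id"], product_from_uri(e["uri"])) for e in events]
views = [(u, p) for u, p in views if p is not None]  # filter out irrelevant entries

# The two definitions of "popularity" discussed in class can disagree,
# e.g. when one user refreshes a page repeatedly:
by_frequency = Counter(p for _, p in views)
by_unique_users = Counter(p for u, p in {(u, p) for u, p in views})

print(by_frequency.most_common(2))   # red-mug counted 3 times by frequency...
print(by_unique_users["red-mug"])    # ...but only 2 unique users viewed it
```

Here repeated refreshes by `u1` inflate the frequency metric but not the unique-user metric, which is exactly why the business logic must pin down which definition it wants.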
September 18: End-to-end example
Class focused on building a real-time dashboard to analyze user activity on a fictitious shopping website. Danny recapped key lessons from previous sessions: the importance of understanding data sources, documentation challenges, and the concept of “data cascades”—how poor data quality propagates through models. He emphasized using AI coding assistants (like GitHub Copilot) for rapid prototyping, but stressed the need for human oversight and clear design documentation.
Danny demonstrated setting up a Python environment on a remote server via SSH, installing Pandas for data manipulation and Streamlit for interactive dashboards. He showed how to access and parse server log files, extracting fields such as timestamps, user IDs, and product IDs. The demo included writing Python scripts to filter logs for recent activity, aggregate product views, and count unique visitors in the last minute.
Using Streamlit, Danny built a dashboard with live-updating charts:
- A bar chart showing the most-viewed products in real time
- A counter for active users in the past minute
- A table listing recent user actions
He explained Streamlit’s auto-refresh feature for real-time updates, and walked through debugging issues like missing log fields and handling malformed entries. Danny discussed optimizing performance by batching log reads and using Pandas’ efficient data structures. Students saw how to deploy the dashboard for remote access and explored customizing the UI for clarity and usability.
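The active-user counter boils down to a sliding-window computation that the dashboard re-runs on every refresh. A minimal standard-library sketch, with made-up timestamps and user IDs rather than the actual log schema:

```python
from datetime import datetime, timedelta

# Illustrative (timestamp, user_id) pairs parsed from the log.
now = datetime(2025, 9, 18, 18, 30, 0)
events = [
    (now - timedelta(seconds=10), "u1"),
    (now - timedelta(seconds=40), "u2"),
    (now - timedelta(seconds=55), "u1"),
    (now - timedelta(seconds=90), "u3"),  # outside the one-minute window
]

# Keep only events from the last minute, then count distinct users.
window_start = now - timedelta(minutes=1)
active_users = {user for ts, user in events if ts >= window_start}
print(len(active_users))  # 2 -- u1 and u2; u3's last event is too old
```

In a Streamlit app this computation would sit inside the script body, so each auto-refresh re-reads the latest log batch and recomputes the set.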
The session ended with Q&A on dashboard design, handling large log files, and strategies for scaling real-time analytics in production environments.
September 25: NoSQL
Class focused on privacy-preserving data collection and database design. Danny demonstrated a research study using a VPN app (WireGuard) to collect phone activity data, emphasizing the importance of saving your participation ID, as it is never stored by researchers. The study dashboard allows participants to view their own device activity, export data, and collaborate with peers, while maintaining privacy—no personally identifiable information (PII) is collected.
Danny explained the benefits of the system, including ad blocking and detailed usage heatmaps, and discussed weekly surveys correlating phone activity with psychological measures. The session transitioned to technical topics: streaming log data into databases, contrasting SQLite (schema-based, fast with indexes) and MongoDB (schema-less, flexible for evolving data needs).
Danny provided a live demo of MongoDB, showing how to set up a local instance, create databases and collections, and insert JSON-like documents. He demonstrated querying with filters, projections, and aggregations, highlighting MongoDB’s ability to handle nested and heterogeneous data structures. The class discussed how MongoDB’s flexible schema supports rapid prototyping and evolving requirements, and how indexes can be created to optimize query performance. Danny also covered best practices for scaling MongoDB horizontally using sharding, and explained replica sets for high availability. The session ended with Q&A on database scaling, indexing, and strategies for managing large, distributed datasets in NoSQL environments.
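To see the shape of MongoDB-style queries without a running server, here is a toy in-memory matcher over Python dicts. It supports only top-level equality filters and field-list projections, a small slice of what real MongoDB accepts; the collection contents are invented:

```python
# MongoDB filters and projections are themselves documents (dicts/lists).
# This toy matcher mimics that shape for simple equality filters only.
docs = [
    {"name": "Ada", "city": "NYC", "tags": ["vpn", "study"]},
    {"name": "Ben", "city": "SF"},
    {"name": "Cal", "city": "NYC", "extra": {"nested": True}},
]

def find(collection, filt, projection=None):
    """Yield documents matching every top-level key in `filt`."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filt.items()):
            if projection:
                yield {k: doc[k] for k in projection if k in doc}
            else:
                yield doc

results = list(find(docs, {"city": "NYC"}, projection=["name"]))
print(results)  # [{'name': 'Ada'}, {'name': 'Cal'}]
```

Against a real instance, the equivalent pymongo call is `db.users.find({"city": "NYC"}, ["name"])`; note that heterogeneous documents (Ben has no `tags`, Cal has a nested `extra`) coexist in one collection with no schema change.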
October 2: Web scraping
Class discussion focused on the differences between structured and unstructured data, privacy in data collection, and hands-on data scraping. Danny demonstrated how personal phone activity data is collected for a research study, highlighting privacy concerns and stressing the importance of not sharing participation IDs. The class explored techniques for analyzing structured data using SQLite, Pandas, and MongoDB, then transitioned to unstructured data, with a focus on text.
Danny introduced web scraping as a practical way to collect large volumes of text data, using job postings as a case study. The session featured a live demo of inspecting job sites (LinkedIn, Greenhouse, Ashby, Indeed, Monster), discussing challenges such as JavaScript-rendered content, login requirements, and user-agent spoofing. Students learned to spot-check downloaded HTML for accuracy and to verify data quality before coding. The importance of understanding site structure and legal/ethical considerations was emphasized. The next steps will cover parsing HTML and extracting relevant information using Python and Jupyter notebooks, preparing for hands-on exercises in data extraction.
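Once HTML is downloaded and spot-checked, extraction can start with the standard library alone. The markup below is a hypothetical job-board fragment (real sites use different class names and often render listings with JavaScript, which this approach cannot see):

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a downloaded job-board page.
SAMPLE_HTML = """
<div class="posting"><a class="job-title" href="/jobs/1">Data Engineer</a></div>
<div class="posting"><a class="job-title" href="/jobs/2">ML Researcher</a></div>
"""

class JobTitleParser(HTMLParser):
    """Collect the text of every <a class="job-title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "job-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

parser = JobTitleParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)  # ['Data Engineer', 'ML Researcher']
```

In practice libraries like BeautifulSoup are less brittle, but the verification step is the same: eyeball a few extracted records against the rendered page before trusting the pipeline.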
October 9: Data Storage Trade-offs and Scraping Challenges
Class focused on practical trade-offs in storing and analyzing large text datasets, marking a transition from web scraping to hands-on work with the historical Yelp dataset. Danny began by revisiting the concept of “data cascades” from the assigned paper, emphasizing how early errors or bias in data labeling can propagate through downstream analytics and machine learning models.
The session included interactive polls and open-ended questions about data pipelines, accuracy, and storage formats. Danny compared CSV, JSON, SQL, NoSQL, and Pandas DataFrames, outlining their strengths and weaknesses:
- CSV/JSON: Simple, portable, but limited for complex queries and large-scale analytics.
- SQL/NoSQL databases: Persistent, scalable, support indexing and complex queries; SQL is schema-based, NoSQL is flexible for evolving data.
- Pandas: Fast for in-memory analysis, ideal for prototyping and small datasets, but not durable or scalable for production.
Danny explained that persistent storage (disk-based databases or files) is preferred for batch analytics, while in-memory solutions (like Pandas or Redis) are suitable for real-time, short-term analysis but have limitations in durability and scalability.
Students discussed bottlenecks in big data workflows:
- CPU-bound: Intensive computations (e.g., parsing, aggregations)
- Memory-bound: Large datasets exceeding RAM
- I/O-bound: Disk speed, often the slowest part of the pipeline
Danny clarified the role of indexes: they slow down writes but dramatically speed up reads, and are essential for fast querying in large datasets.
The class then brainstormed challenges in scraping Amazon reviews, identifying issues such as:
- Dynamic content and JavaScript rendering
- Anti-scraping measures (CAPTCHAs, rate limits)
- Legal/copyright concerns (CFAA risks)
- Prevalence of fake reviews and products
Danny demonstrated scraping data behind login screens using browser cookies and curl, showing how to capture authenticated requests. He stressed the importance of verifying data authenticity, noting that modern datasets may be polluted by AI-generated or misleading content.
The session concluded with logistics for participation and a preview of upcoming hands-on analysis using the Yelp dataset, including data exploration, cleaning, and building scalable pipelines for text analytics.
October 16: Yelp analysis and benchmarking
Class focused on final project planning and practical big data analysis using the Yelp dataset. Danny outlined the final project requirements: each group must propose an original data-driven project, emphasizing the importance of collecting or scraping raw data rather than relying on pre-cleaned datasets (e.g., Kaggle). Projects should demonstrate creativity, tackle real-world challenges, and ideally incorporate heterogeneous data sources, personal sensor data, or complex scraping tasks. Students are expected to document their data collection process, address potential biases, and validate the integrity of their datasets.
Danny reviewed project proposals live, providing feedback such as:
- Encouraging teams to clarify their business questions and hypotheses before coding.
- Advising against using only public APIs without exploring data limitations or missing fields.
- Suggesting ways to combine multiple data sources for richer analysis.
- Recommending students focus on data curation, cleaning, and benchmarking, as these are the most challenging and valuable aspects of big data work.
He explained the mechanics of the Big Data Challenge, where students submit code and receive real-time feedback on performance, stressing the need for efficient queries and optimization. Quizzes will be open book but context-dependent, rewarding students who engage actively in class.
Technical topics included importing and analyzing Yelp data using SQLite, Pandas, DuckDB, and MongoDB. Danny emphasized the importance of understanding the strengths and limitations of each tool, including their performance characteristics and suitability for different types of analysis. Students practiced select, filter, group by, and join operations, and learned how to build materialized views for benchmarking.
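A minimal SQLite version of those operations, using toy tables that only mimic the spirit of the Yelp schema (names and fields simplified):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE businesses (business_id TEXT PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE reviews (review_id TEXT PRIMARY KEY, business_id TEXT, stars REAL);
""")
con.executemany("INSERT INTO businesses VALUES (?, ?, ?)", [
    ("b1", "Joe's Pizza", "NYC"),
    ("b2", "Taco Spot", "Austin"),
])
con.executemany("INSERT INTO reviews VALUES (?, ?, ?)", [
    ("r1", "b1", 5.0), ("r2", "b1", 4.0), ("r3", "b2", 3.0),
])

# Select + filter + group by + join in one query:
rows = con.execute("""
    SELECT b.name, COUNT(*) AS n_reviews, AVG(r.stars) AS avg_stars
    FROM reviews r JOIN businesses b USING (business_id)
    WHERE b.city = 'NYC'
    GROUP BY b.business_id
""").fetchall()
print(rows)  # [("Joe's Pizza", 2, 4.5)]

# SQLite has no materialized views; CREATE TABLE AS precomputes the same
# aggregate, which is the pattern used when benchmarking repeated queries.
con.execute("CREATE TABLE business_stats AS "
            "SELECT business_id, COUNT(*) AS n_reviews, AVG(stars) AS avg_stars "
            "FROM reviews GROUP BY business_id")
```

The same logical query can then be re-expressed in Pandas, DuckDB, or MongoDB and timed against each backend, which is the benchmarking exercise described above.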
Danny reiterated that successful projects require careful data management, understanding noise, and identifying biases. The session ended with Q&A on project ideas, data sources, and strategies for collecting meaningful, original datasets. Danny encouraged students to iterate on their proposals, use Slack for feedback, and focus on the unique challenges of data validation and management.
October 23: Quiz 1 and Yelp analysis + benchmarking
Quiz 1 on fundamentals of big data analytics. The quiz will take place in class or on Zoom synchronously at 18:15 - 18:45. Students are expected to take the quiz online on Gradescope. The quiz is open book, open Internet. Students can use any tools (e.g., LLMs), but they may not communicate with another human being during the quiz.
At the beginning of class, Danny will discuss the final projects:
- Groups should already be confirmed. Please enter your group and proposed topic on this spreadsheet.
- Before the start of class, each group should post its ideas again in the public channel on the class Slack.
- Danny will go over these ideas. Ideas should be finalized today.
Class focused on foundational big data processing concepts, emphasizing Spark, MapReduce, and lazy evaluation. Danny began by explaining the MapReduce paradigm: the map step transforms input data into key-value pairs, the shuffle redistributes data across nodes, and the reduce step aggregates results. He illustrated how traditional MapReduce (e.g., Hadoop) writes intermediate data to disk, which can be slow and resource-intensive.
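The three phases can be sketched in plain Python with a word count, the canonical MapReduce example (the input documents are invented):

```python
from collections import defaultdict

docs = ["big data class", "big big wins"]

# Map: transform input into (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group values by key. In Hadoop this step crosses the network
# and, for classic MapReduce, is written to disk between stages.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: aggregate each key's values.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'big': 3, 'data': 1, 'class': 1, 'wins': 1}
```

In a real cluster each phase runs in parallel across many nodes, but the data flow is exactly this: map, shuffle by key, reduce.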
Danny then introduced Apache Spark, highlighting its use of in-memory computation and Resilient Distributed Datasets (RDDs). Spark’s transformations (e.g., map, filter, groupByKey) are lazy—they build a logical execution plan but do not compute results until an action (e.g., collect, count, saveAsTextFile) is triggered. This allows Spark to optimize the pipeline, minimize data shuffling, and avoid unnecessary computation.
Technical details included:
- Spark’s DAG (Directed Acyclic Graph) scheduler for optimizing execution plans.
- Fault tolerance via RDD lineage: lost partitions can be recomputed from original data.
- Example code snippets showing Spark transformations (parsing each JSON line into a record before filtering on the review text):

rdd = sc.textFile("yelp_reviews.json")
reviews = rdd.map(json.loads)
filtered = reviews.filter(lambda r: "pizza" in r["text"])
counts = filtered.map(lambda r: (r["business_id"], 1)).reduceByKey(lambda a, b: a + b)

- Comparison of Spark's reduceByKey (which combines results locally before shuffling) vs. groupByKey (which shuffles all data).
Danny discussed trade-offs: Spark excels at iterative algorithms and interactive analytics, while MapReduce is better for simple, batch-oriented jobs. Students asked about cluster setup, memory management, and Spark’s support for SQL and DataFrames.
The session ended with reminders about project logistics and a preview of distributed computation topics, including sharding and benchmarking.
October 30: Spark
Danny provided detailed feedback on class projects, emphasizing several key points for success:
- Original Data Collection: Projects must go beyond using pre-cleaned datasets; students should collect, scrape, or generate their own raw data. Danny cautioned against relying solely on public APIs or Kaggle datasets, urging teams to demonstrate creativity and initiative in sourcing data.
- Documentation and Transparency: Every step of the data pipeline—from collection and cleaning to transformation and analysis—should be clearly documented. Danny stressed that reproducibility and transparency are essential, and teams should be able to explain their choices and methods.
- Data Validation and Bias Awareness: Danny highlighted the importance of validating data integrity and being aware of potential biases, especially when merging heterogeneous sources or working with sensor data. He encouraged students to critically assess data provenance and address missing or noisy values.
- Benchmarking and Performance: Teams should benchmark their analytics, optimize queries, and be mindful of scalability. Danny recommended profiling Spark jobs, identifying bottlenecks, and iterating on pipeline efficiency.
- Clear Business Questions: Projects should start with well-defined business or research questions. Danny advised teams to clarify their hypotheses and goals before diving into technical implementation.
- Iterative Feedback: Danny encouraged ongoing feedback and iteration, using Slack and class discussions to refine project ideas and approaches.
Danny also discussed Spark in detail, sharing best practices and common challenges:
- Best Practices: Write Spark code using transformations like map, filter, and reduceByKey to leverage lazy evaluation and minimize unnecessary computation. Use reduceByKey instead of groupByKey for aggregations to reduce data shuffling and improve performance. Profile Spark jobs to identify bottlenecks, and optimize memory usage by caching RDDs or DataFrames when needed.
- Challenges: Be aware of cluster setup complexities, memory management issues, and the impact of shuffling large datasets. Debugging distributed jobs can be difficult, so Danny recommended starting with small samples locally before scaling up. He emphasized the importance of understanding Spark's execution plan (DAG) and using built-in tools for monitoring and troubleshooting.
- General Advice: Document Spark pipelines thoroughly, validate intermediate results, and iterate on pipeline design to ensure scalability and correctness.
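Why reduceByKey shuffles less than groupByKey can be simulated without a cluster. Below, two invented partitions hold (key, 1) pairs; groupByKey would ship every pair across the network, while reduceByKey first combines locally within each partition and ships only one partial sum per key:

```python
from collections import Counter

# Simulated partitions of (key, 1) pairs (counts are made up).
partitions = [
    [("pizza", 1)] * 500 + [("tacos", 1)] * 100,
    [("pizza", 1)] * 300,
]

# groupByKey: every record crosses the network.
records_shuffled_groupByKey = sum(len(p) for p in partitions)

# reduceByKey: combine locally first, then shuffle one partial sum
# per (key, partition) pair.
locally_combined = [Counter(k for k, _ in p) for p in partitions]
records_shuffled_reduceByKey = sum(len(c) for c in locally_combined)

print(records_shuffled_groupByKey)   # 900 records shuffled
print(records_shuffled_reduceByKey)  # 3 partial sums shuffled

# Merging the partial sums gives the same final answer either way:
final = Counter()
for c in locally_combined:
    final.update(c)
print(dict(final))  # {'pizza': 800, 'tacos': 100}
```

The skew matters: with few distinct keys and many records per partition, local combining shrinks the shuffle by orders of magnitude, which is exactly Spark's map-side combine in reduceByKey.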
He reminded students that successful projects require careful planning, rigorous validation, and a willingness to tackle real-world data challenges. Attention to detail, documentation, and responsiveness to feedback will be critical for achieving strong results.
November 6: Text data analysis and distributed computation
Big Data Challenge Kickoff
Danny announced the start of the Big Data Challenge, a multi-week competition where students will analyze a shared dataset and submit code through a Google Form for testing accuracy and performance. He explained that the challenge will use a dashboard to display each submission's results in real time against the rest of the class, with multiple submission attempts allowed. He also outlined upcoming topics on textual data and distributed data, noting that guest lecturers will cover large language models in the next two weeks, and that the final exam will include high-level questions on these topics.
Large Language Models in Search
Danny discussed the use of large language models for search and ranking, mentioning YOLO for object detection and RAG for search applications. He expressed hesitation about the reliability of large language models due to hallucination issues. Danny explained that evaluation methods for these models include benchmarking and human-in-the-loop approaches, using Google Health AI as an example. He also mentioned that he would cover RAG in more detail next week and would show some proprietary Google Health AI slides during the in-person session.
AI Model Evaluation Strategies
Danny discussed the evaluation of high-stake AI models, particularly in healthcare, which involves both automatic and human evaluation methods. He explained that while automatic evaluation is straightforward, human evaluation is crucial for ensuring quality and often involves expert raters assessing models on various axes. Danny also touched on the trade-offs of using large language models for retrieval tasks, noting the difficulty of automatic evaluation and the need for human involvement, which can be costly and not very scalable. He mentioned that sampling is often used in academic settings due to resource constraints. Danny further addressed Rui's question about converting natural language queries into SQL or Python code for accessing local databases, explaining that this process is now feasible with advanced language models like GPT-4o, which can generate appropriate code based on context. He also briefly introduced the concept of homomorphic encryption as a privacy-preserving method for processing data in the cloud without exposing the actual data.
Search Algorithm Manipulation and Risks
Danny discussed the manipulation of search algorithms and language models, highlighting how they can be influenced to rank harmful content highly. He explained that while these models are not "hallucinatory" in the sense of creating false information, they can be trained on misleading data, leading to incorrect outputs. Danny also mentioned an ongoing research project with JPMorgan Chase to develop a chatbot that helps consumers navigate cybersecurity risks, noting that current language models often provide inaccurate advice on such topics.
Risks in Language Model Applications
Danny discussed the potential risks of large language models, particularly in therapeutic and safety contexts. He highlighted a case where a chatbot provided a correct but unsafe response to a question about handling potential online tracking by a partner, emphasizing the need for human expertise in assessing safety. Danny also mentioned ongoing work on using language models to assist victims of domestic abuse and the challenges in benchmarking these models for safety and accuracy.
Distributed Data Storage and AI
Danny discussed the use of distributed data storage and replication techniques to improve performance and availability, using examples like YouTube and Netflix. He explained concepts such as sharding, partitioning, and the trade-offs between eventual consistency and strong consistency models. Danny also touched on the potential of AI language models for emotional support and therapy, citing a New York Times article, and emphasized the need for guardrails to handle dangerous situations.
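The core of sharding is a deterministic routing function from key to shard. A minimal sketch, with a made-up key scheme and shard count (real systems prefer consistent hashing so that changing the shard count does not remap almost every key):

```python
import hashlib

N_SHARDS = 4

def shard_for(key: str) -> int:
    """Route a key to one of N_SHARDS shards.

    Uses a stable hash (md5) rather than Python's built-in hash(),
    which is randomized per process and would break routing.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# The same key always lands on the same shard, across machines and restarts:
assert shard_for("user:42") == shard_for("user:42")

placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
print(placement)  # each key maps to a shard id in 0..3
```

Replication then sits on top of this: each shard is copied to several nodes, and the consistency model governs how reads and writes coordinate across those copies.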
Distributed Systems and Database Fundamentals
Danny explained key concepts in distributed systems and database management, focusing on replication models, consistency, and the CAP theorem. He emphasized that for many small to medium-sized enterprises, simpler centralized systems with powerful machines and replication tools like RQLite or Litestream can be more effective than complex distributed systems. Danny also briefly touched on blockchain technology, noting its relevance in finance but cautioning that it is not typically used in the same way as traditional distributed databases.
Bitcoin and Distributed Database Principles
Danny explained the fundamental principles of Bitcoin as a decentralized distributed database, describing how transactions are processed and confirmed through a consensus mechanism called mining. He highlighted that while Bitcoin's proof-of-work system is energy-intensive, other distributed databases often use proof-of-stake approaches for better efficiency. Danny also mentioned that banks are exploring similar blockchain technologies to achieve faster transaction confirmations without relying on central clearinghouses. He concluded by noting that while Bitcoin remains a significant application of distributed databases, the fintech industry has evolved and may now use different implementations of these principles.
Database and Health Data Analysis
Danny demonstrated how to benchmark database performance using Python, comparing Pandas and SQLite for loading and inserting data. He showed that Pandas was faster for in-memory operations, but emphasized that benchmarking is crucial for choosing the right technology for big data challenges. Julio discussed his project on analyzing health data from Apple Health and RouterSense, focusing on the correlation between phone usage and stress levels. Danny suggested aligning the data temporally and considering factors like website types and traffic volumes. He advised Julio to post questions in Slack and reach out to Ramin for further guidance.
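The benchmarking pattern itself is small: time each loading strategy with `time.perf_counter` and compare. This sketch contrasts two SQLite insert strategies rather than the Pandas-vs-SQLite comparison from the demo; table names and row counts are arbitrary:

```python
import sqlite3
import time

rows = [(i, f"user{i}") for i in range(20_000)]

def bench(label, fn):
    """Run fn once and print its wall-clock duration."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

con = sqlite3.connect(":memory:")

con.execute("CREATE TABLE a (id INTEGER, name TEXT)")
def row_by_row():
    for r in rows:
        con.execute("INSERT INTO a VALUES (?, ?)", r)
bench("row-by-row inserts", row_by_row)

con.execute("CREATE TABLE b (id INTEGER, name TEXT)")
bench("executemany", lambda: con.executemany("INSERT INTO b VALUES (?, ?)", rows))

# Both strategies must load the same data, or the timing comparison is meaningless.
assert con.execute("SELECT COUNT(*) FROM a").fetchone()[0] == len(rows)
assert con.execute("SELECT COUNT(*) FROM b").fetchone()[0] == len(rows)
```

The point of the demo stands regardless of the backend: measure on your own data and hardware before committing to a technology, and always verify that the faster path produced identical results.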
November 13: LLMs and Urban Analytics
Guest lectures:
Please attend in person. The class will not be livestreamed or recorded. Quiz 2 will cover contents of this class.
November 20: AI/ML for Big Data
Guest lectures: AI/ML for Big Data (Rameen Mahmood): slides
Please attend in person. The class will not be livestreamed or recorded. Quiz 2 will cover contents of this class.
November 27: No class
No class - Thanksgiving
December 4: Quiz 2 and latest big data trends
Quiz 2 on topics covered so far. Same format as Quiz 1.
Danny will also ask individual project groups to briefly present their progress.
During class, Danny will discuss the latest technological trends:
- streaming
- time-series databases
- Kafka, Hive, BigQuery
December 11: Final project presentation
Schedule of presentations: see this sheet
Please edit the sheet to include a public Google Drive link to your slide deck. Acceptable formats include Google Slides, PowerPoint, or PDF. No Keynote decks, please.
See this document for details on the presentation.