Quantifying the World

Presenter Notes

Course Goals

  1. Learn how to collect data to answer questions quantitatively
  2. Learn techniques of analysis to yield your own insights
  3. Learn how to facilitate analysis for others
  4. Learn how to summarize the findings of your analysis

Course Goals: Skills to Learn

  1. Learn how to collect data to answer questions quantitatively
    • Python, MongoDB
    • Using APIs to collect data
  2. Learn techniques of analysis to yield your own insights
    • R
  3. Learn how to facilitate analysis for others
    • JavaScript, HTML
    • Google Chart Tools API
  4. Learn how to summarize the findings of your analysis
    • Lots of practice on technical writing

Goal 1: Collect data to answer questions quantitatively

  1. Formulating questions data can answer
    • What types of questions can data answer?
    • Understanding available data
    • Sources of bias
    • Collecting control and treatment data
    • Combining disparate data sources
  2. Write scripts to automate collection
    • Methodologies for data collection
    • Direct data set downloads
    • Web page scraping
    • API access
  3. Understand how to store and query data
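
As a rough illustration of the web page scraping item above, here is a minimal sketch using only Python's standard library. The page content, the `LinkCollector` class, and the `extract_links` helper are hypothetical names made up for this example; a real collection script would first download the page and would likely need to handle messier HTML.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in a string of HTML, in page order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Hypothetical page content; a real script would download this text first.
SAMPLE_PAGE = """
<html><body>
  <a href="/data/2011.csv">2011 data</a>
  <a href="/data/2012.csv">2012 data</a>
  <a name="top">an anchor without a link target</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_PAGE):
        print(link)
```

The same pattern (fetch, parse, keep only the fields of interest) applies to API access, where the parsing step is usually a JSON decode rather than an HTML parse.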

Goal 2: Analysis techniques to yield insights

  1. Using R to study data
    • Techniques for accessing data
    • Data aggregation
    • Transforming data for analysis
  2. Data exploration
    • Summary statistics, box plots, histograms, CDFs
    • Comparing values split by categorical variables
  3. Analyzing a single variable
    • Summarizing variables
    • Part-to-whole analysis
    • Deviation analysis
  4. Comparing multiple variables
    • Response variables, explanatory variables
    • Regression techniques (linear, logistic, survival)
    • Visualizing more than two variables
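
The exploration steps above (summary statistics, CDFs, and comparing values split by a categorical variable) can be sketched in a few lines. The course uses R for this work; the sketch below is in Python only to keep one language across these notes, and the function names and sample records are hypothetical.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def empirical_cdf(values):
    """Return sorted (value, fraction of observations <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    cdf = {}
    for rank, value in enumerate(ordered, start=1):
        # For repeated values, the last (largest) rank wins.
        cdf[value] = rank / n
    return sorted(cdf.items())


def split_by(records, key, field):
    """Group the values of `field` by the categorical variable `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[field])
    return dict(groups)


# Hypothetical observations, invented purely for illustration.
RECORDS = [
    {"site": "a", "load_ms": 120},
    {"site": "a", "load_ms": 180},
    {"site": "b", "load_ms": 90},
    {"site": "b", "load_ms": 110},
]

if __name__ == "__main__":
    for site, values in split_by(RECORDS, "site", "load_ms").items():
        print(site, summarize(values))
```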

Goal 3: Facilitating analysis for others

  1. Architecture for making data available online
    • Python CGI scripting
    • Storing data on a web server
    • Presenting a subset of available data to users
    • Using JavaScript APIs to present data
  2. Designing a web interface to data
    • Selecting the subset of interest to the user (optional)
    • Trade-off between presenting known insights and letting users find their own
  3. Google Charts API
    • Selecting appropriate charts
    • User-driven filters
    • Presenting data to Google Charts API
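
One way the server side of this architecture might look: a Python script selects a subset of records and returns it as JSON in a header-row-plus-data-rows layout, which is the array-of-arrays shape that `google.visualization.arrayToDataTable` accepts on the JavaScript side. This is a minimal sketch, not the course's actual setup; the function names, field names, and sample records are hypothetical.

```python
import json


def filter_records(records, field, value):
    """Keep only the records whose `field` equals `value` (a user-driven filter)."""
    return [record for record in records if record[field] == value]


def to_chart_table(records, columns):
    """Build a header row followed by one data row per record."""
    table = [list(columns)]
    for record in records:
        table.append([record[column] for column in columns])
    return table


def cgi_response(records, columns):
    """Return the text a CGI script would write to standard output."""
    body = json.dumps(to_chart_table(records, columns))
    # A CGI response is the headers, a blank line, then the body.
    return "Content-Type: application/json\n\n" + body


# Hypothetical records, invented purely for illustration.
RECORDS = [
    {"state": "PA", "year": "2010", "amount": 10},
    {"state": "PA", "year": "2012", "amount": 25},
    {"state": "NY", "year": "2012", "amount": 40},
]

if __name__ == "__main__":
    subset = filter_records(RECORDS, "state", "PA")
    print(cgi_response(subset, ["year", "amount"]))
```

Keeping the filtering on the server means the browser only receives the subset the user asked for, which matters once the stored data is larger than a page can comfortably load.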

Goal 4: Summarizing the findings of your analysis

Unlike many programming tasks, your code is not necessarily the primary deliverable. Instead, the coding is a means to an end: delivering a better understanding of a question that can be answered with collected data. We will discuss and practice techniques for:

  • Articulating data collection methodology
  • Describing conclusions that have been found
  • Making data available to others for replication

How this course differs from many CS courses

  • We will learn script-based programming
    • Goal is to write code that solves tasks quickly
    • Aim is to minimize overhead on the programmer
  • Re-usable code that solves a general problem is not our goal
  • Writing code that helps answer data-driven questions is our goal

What's not covered

  1. Network analysis
  2. Data mining techniques using artificial intelligence and machine learning
  3. Analysis of huge datasets (petascale and beyond)
    • MongoDB is capable of scaling to very large datasets, so you're off to a good start

Coding resources for projects and assignments

  • Linux workstations in micro-focus run Python, R, and a MongoDB client
    • Can log in to these machines from anywhere on-campus
  • You will have access to a data store for the project and assignments
    • More details soon

Data Analysis Process

  1. Formulating questions for investigation (Goal 1)
  2. Data collection design and execution (Goal 1)
  3. Exploratory analysis (Goal 2)
  4. Focused analysis (Goal 2)
  5. Communicating results (Goal 4)
  6. Design interactive interface (Goal 3)

Course Assignments & Data Analysis Process

  1. Formulating questions for investigation
  2. Data collection design and execution
    • H0: Web scraper
    • H1: API collector
  3. Exploratory analysis
    • H2: MongoDB queries
    • H3: R data exploration
  4. Focused analysis
    • H4: R multivariate analysis
  5. Communicating results
  6. Design interactive interface

Course Project

  • Goal is for you to apply new skills in the context of a real-world topic, from beginning to end
  • Work in teams of three
  • You can choose the topic
    • Start with an interesting question that can be answered by gathering and examining data
    • See the resources page for pointers to exemplary projects and potential data sources
  • One good strategy: combine two or more different data sources in an unexpected way
    • Example: campaign donations broken down by geography and other demographics
  • Another good strategy: track behavior over time

There are a few broad categories of topics: transparency in government data, security topics, and studying Internet usage.

Project Milestones and Schedule

  • P0: Project proposal (Feb 17)
  • P1: Data collection (Mar 16)
  • P2: Data analysis (Apr 20)
  • P3: Presentation (Apr 30 and May 3)
  • P4: Web interface (May 4)
  • P5: Project report (May 15)

Project Milestones & Data Analysis Process

  1. Formulating questions for investigation
    • P0: Project proposal
  2. Data collection design and execution
    • P0: Project proposal
    • P1: Data collection
  3. Exploratory analysis
    • P2: Data analysis
  4. Focused analysis
    • P2: Data analysis
  5. Communicating results
    • P3: Presentation
    • P5: Project report
  6. Design interactive interface
    • P4: Web interface

Blog Posts

  • Another primary goal of the course is to improve your skills in summarizing the findings of your analysis

    • Communicating technical topics clearly and succinctly is hard!
    • To get more practice, you will maintain a blog during the semester
    • You will write six separate blog posts, mostly in the context of the project
  • Goals for the blog

    1. Improve writing skills
    2. Foster collaboration between students
    3. Make writing the project report easier

Blog Posts & Data Analysis Process

  1. Formulating questions for investigation
    • B0: Blog post on data source
    • BP0: Blog post on P0
  2. Data collection design and execution
    • BP0: Blog post on P0
    • BP1: Blog post on P1
    • B1: Blog post on research paper
  3. Exploratory analysis
    • BP2: Blog post on P2
  4. Focused analysis
    • B1: Blog post on research paper
    • BP2: Blog post on P2
  5. Communicating results
    • B1: Blog post on research paper
    • BP2.5: Blog post on replicating P2
  6. Design interactive interface

Logistics for Blog Posts

  • Typically due Monday after Friday project milestones
  • Individually written
  • Will assign random subset of blogs for everyone to read
  • Will include brief discussion in following class
  • Will give feedback on writing to help you improve over the semester

Grade Distribution

  • Assignments (35%)
  • Project (40%)
  • Blog (10%)
  • Midterm Exam (15%)

Questions

  • Syllabus
  • Project
  • Schedule

  • For next time

    • Fill out the introductory questionnaire by Sunday
    • Post a link on Piazza to a web article that describes a cool use of data
    • Read the lecture notes for Monday (will be posted tomorrow)
