H1: API Collector

Overview and Goal of the Assignment

For this homework assignment we will continue the theme of collecting data on political contributions. However, this time we will collect data using two different APIs:

  1. Influence Explorer Campaign Contributions API from Sunlight Labs: API interface to curated FEC data on individual contributions to political candidates. Unlike the Super PACs from last week, these are the traditional contributions directly in support of candidates, and so are subject to contribution limits.
  2. USA Today Census API: API interface to US 2010 Census data including population and racial demographics

Your task is to write a program that fetches the contributions to the presidential campaigns of Barack Obama, Newt Gingrich, and Mitt Romney. You will also fetch state-level population and racial demographics from the Census API. Next, you will combine these data sources with the Super PAC contributions to these candidates obtained in h1, matching the state-level demographics to each contribution. Finally, you will create a CSV file listing each contribution alongside its matching demographics.

Once we begin using R later in the course, this file can be used to answer questions about the composition of candidate contributions and identify any differences between regular and Super PAC funding.

You will need to sign up for an API key from both the Sunlight Foundation and USA Today.

Step 1: Fetch the Election Contributions

Write a function called getOfficialContributions, which queries the Influence Explorer Campaign Contributions API for all contributions in support of Barack Obama, Newt Gingrich, and Mitt Romney made during the 2012 campaign cycle. Please only obtain contributions made on or before January 31, 2012. Here is the API specification detailing the requests and responses. Note that this API tracks contributions made both for and against candidates; we are only interested in contributions made in support of a candidate. Finally, be sure to request the results in JSON format.

(Here's a hint about constructing the queries: you only need to make 3 queries, one for each candidate.)

After obtaining the results, convert the raw JSON results to a Python object. You should create a dictionary with three keys: "barack obama", "newt gingrich", and "mitt romney". The value for each key should be the corresponding Python object converted from JSON. Once you have created the dictionary, save it to your data directory in a pickle file named sunlightRecords.pkl.

Write the getOfficialContributions function so that it returns the dictionary to the caller. Also, check whether the pickle file already exists before querying the API, so that you avoid issuing unnecessary requests.
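
In case it helps to see the overall shape, here is a minimal sketch of such a function using Python 3's standard library. The endpoint URL, the query parameters (apikey, recipient_ft, cycle, for_against), and the assumption that the response parses to a list of records with a "date" field all come from my reading of the spec rather than from this assignment, so verify every one of them against the API specification before relying on them.

    import json
    import os
    import pickle
    import urllib.parse
    import urllib.request

    # Placeholder endpoint, key, and parameter names -- check all of these
    # against the API specification.
    SUNLIGHT_URL = "http://transparencydata.com/api/1.0/contributions.json"
    SUNLIGHT_KEY = "your-sunlight-api-key"
    CANDIDATES = ["barack obama", "newt gingrich", "mitt romney"]
    SUNLIGHT_PICKLE = os.path.join("data", "sunlightRecords.pkl")

    def getOfficialContributions():
        """Return a dict mapping candidate name to the parsed JSON contribution records."""
        # Reuse the cached results if the pickle already exists.
        if os.path.exists(SUNLIGHT_PICKLE):
            with open(SUNLIGHT_PICKLE, "rb") as f:
                return pickle.load(f)

        records = {}
        for candidate in CANDIDATES:
            params = {
                "apikey": SUNLIGHT_KEY,
                "recipient_ft": candidate,   # full-text match on the recipient's name
                "cycle": "2012",             # 2012 campaign cycle
                "for_against": "for",        # only contributions in support of the candidate
                # Check the spec for paging and date-filter parameters as well.
            }
            url = SUNLIGHT_URL + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                parsed = json.loads(response.read().decode("utf-8"))
            # Keep only contributions made on or before January 31, 2012; this
            # assumes the response is a list of records whose dates are
            # YYYY-MM-DD strings, so plain string comparison works.
            records[candidate] = [r for r in parsed if r.get("date", "") <= "2012-01-31"]

        with open(SUNLIGHT_PICKLE, "wb") as f:
            pickle.dump(records, f)
        return records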

Step 2: Fetch the Census Information

Write a function called getStateDemographics, which queries the USA Today Census API for demographic information on US states. Here is the API specification. (Hint: use the "Locations" base URL when constructing your queries.)

Two-letter state codes can be obtained from this tab-delimited file. Please process this file using Python to construct a list of state codes.
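
For example, assuming the file is saved as data/state_codes.txt, has a header row, and keeps the code in a column named "abbreviation" (all three are assumptions -- adjust them to the actual file), the list could be built like this:

    import csv

    def loadStateCodes(path="data/state_codes.txt"):
        """Return a list of two-letter state codes read from the tab-delimited file.

        The filename and column name here are assumptions; adjust them to match
        the file you actually downloaded, and skip DictReader's header handling
        if the file has no header row.
        """
        with open(path) as f:
            reader = csv.DictReader(f, delimiter="\t")
            return [row["abbreviation"].strip() for row in reader]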

After obtaining the results, convert the raw JSON results to a Python object. You should create a dictionary with keys for each of the states (e.g., "MA", "NY", etc.). The value for each key should be the corresponding Python object converted from JSON. Once you have created the dictionary, save it to your data directory in a pickle file named censusRecords.pkl.

Write the getStateDemographics function so that it returns the dictionary to the caller. As in Step 1, check whether the pickle file already exists before querying the API, so that you avoid issuing unnecessary requests.
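
The shape mirrors Step 1. In the sketch below, the base URL, the keypat and sumlevid parameters, and the idea of issuing one request per state are my guesses from the specification, so treat them as placeholders:

    import json
    import os
    import pickle
    import urllib.parse
    import urllib.request

    # Placeholder "Locations" endpoint, key, and parameter names -- confirm all
    # of these against the USA Today Census API specification.
    CENSUS_URL = "http://api.usatoday.com/open/census/loc"
    CENSUS_KEY = "your-usatoday-api-key"
    CENSUS_PICKLE = os.path.join("data", "censusRecords.pkl")

    def getStateDemographics(stateCodes):
        """Return a dict mapping each two-letter state code to its parsed census record."""
        if os.path.exists(CENSUS_PICKLE):
            with open(CENSUS_PICKLE, "rb") as f:
                return pickle.load(f)

        records = {}
        for code in stateCodes:
            # "keypat" and "sumlevid" are guesses at the query parameters; the
            # spec is the authority on how to select a single state's record.
            params = {"keypat": code, "sumlevid": "3", "api_key": CENSUS_KEY}
            url = CENSUS_URL + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                records[code] = json.loads(response.read().decode("utf-8"))

        with open(CENSUS_PICKLE, "wb") as f:
            pickle.dump(records, f)
        return records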

Step 3: Join up the Census Information and Election Contributions

Your final task is to write a function called makeCombinedCSV that will join up the contributions and census data. The CSV file should be called regSuperCensus.csv and be placed in the h1/data directory. Here is the structure of the fields in the CSV, along with some example entries:

Candidate,Regular Or Super PAC,Committee Number,Committee Name,Contribution Amount,Date,State,ZIP Code,Individual or Organization,Population of State,Fraction White (Non-Hispanic),Fraction Black,USA Today Diversity Index
"barack obama","Regular","C00431445","Obama for America",35800.0,"20110525","AR","72207","IND",2915918.0,0.745381,0.154289,0.417706
"barack obama","Regular","C00431445","Obama for America",2500.0,"20110610","PR","00907","IND","NA","NA","NA","NA"
"mitt romney","Regular","C00431171","Romney for President",5000.0,"20110622","CA","95762","IND",37253956.0,0.401468,0.061713,0.728676
"mitt romney","Regular","NA","NA",5000.0,"20110506","NA","NA","PAC","NA","NA","NA","NA"
"newt gingrich","Super","C00495861","Priorities USA Action",50000.0,"20111108","CA","94037","IND",37253956.0,0.401468,0.061713,0.728676
"newt gingrich","Super","C00495861","Priorities USA Action",15000.0,"20111012","NY","10022","IND",19378102.0,0.583352,0.158622,0.616678

Here are the steps you need to take to create the CSV:

  1. Extract the values matching "Committee Number, Committee Name, Contribution Amount, Date, State, ZIP Code, and Individual or Organization" from the contributions dictionary.
  2. Notice that the Date follows a format of YYYY-MM-DD, whereas in the Super PAC data it follows a YYYYMMDD format. Store the date as a string in the YYYYMMDD format.
  3. The "Individual or Organization" field corresponds to "contributor_type" in the dictionary. The possible values are "I" (meaning the contribution comes from an individual) and "C" (meaning it comes from a committee/PAC). Store these values as "IND" and "PAC", respectively, so that they match up with the Super PAC data.
  4. The Census fields you need to include are called "Pop", "PctNonHispWhite", "PctBlack" and "USATDiversityIndex".
  5. Missing values are unavoidable in data collection. To be consistent, encode all missing values as NA. This happens in two circumstances. First, some contributions come from territories (such as Puerto Rico) that do not have entries in the USA Today API; when this occurs, enter NA for all of the demographic fields. Second, some entries are missing in the contributions data, particularly when the contribution comes from a committee; in this case, swap out the empty string for NA. There's an elegant way to do this using list comprehensions that I haven't shown you before: suppose you have a list called thelist that may contain empty strings. To swap in NAs, type thelist = ['NA' if x == '' else x for x in thelist].
  6. After you have finished with the regular contributions, move on to the Super PAC data from h1. You can download my copy of the Super PAC data (which will be online after the h1 deadline has passed). Only include contributions to the Super PACs associated with Romney, Obama, and Gingrich. The Romney Super PAC is C00490045 (Restore Our Future). The Obama Super PAC is C00495861 (Priorities USA Action). The Gingrich Super PAC is C00507525 (Winning Our Future). Columns 2-8 in allcont.csv correspond directly to Columns 3-9 in the CSV file you are constructing; your job is to supply the missing columns. (A sketch of helpers for steps 2, 3, 5, and this filtering follows the list.)
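
Here is a minimal sketch of helpers for steps 2, 3, 5, and 6. The committee-to-candidate mapping comes straight from step 6; the exact column positions in allcont.csv (whether it has a header row, whether columns are counted from 1) and the internal shape of the census records are assumptions you should check against your own files.

    import csv

    # Committee IDs of the three Super PACs, as listed in step 6.
    SUPER_PACS = {
        "C00490045": "mitt romney",    # Restore Our Future
        "C00495861": "barack obama",   # Priorities USA Action
        "C00507525": "newt gingrich",  # Winning Our Future
    }

    def reformatDate(date_str):
        """Convert 'YYYY-MM-DD' (contributions API) to 'YYYYMMDD' (Super PAC format)."""
        return date_str.replace("-", "")

    def recodeContributorType(code):
        """Map the API's contributor_type codes onto the Super PAC labels."""
        return {"I": "IND", "C": "PAC"}.get(code, "NA")

    def fillNA(values):
        """Replace empty strings (or None) with 'NA', leaving everything else alone."""
        return ["NA" if v in ("", None) else v for v in values]

    def censusFields(demographics, state):
        """Return [Pop, PctNonHispWhite, PctBlack, USATDiversityIndex] for a state,
        or four 'NA's for territories missing from the census dictionary."""
        if state not in demographics:
            return ["NA", "NA", "NA", "NA"]
        rec = demographics[state]  # drill down as needed; the API response is probably nested
        return [rec["Pop"], rec["PctNonHispWhite"], rec["PctBlack"], rec["USATDiversityIndex"]]

    def superPACRows(allcont_path, demographics):
        """Yield output rows for Super PAC contributions to the three candidates.

        Per step 6, allcont.csv columns 2-8 map onto output columns 3-9, so
        column 2 is the committee number and column 6 is the state. Adjust the
        indices if your h1 file has a header row or is laid out differently."""
        with open(allcont_path) as f:
            for row in csv.reader(f):
                committee_id = row[1]            # column 2: committee number
                if committee_id not in SUPER_PACS:
                    continue
                middle = fillNA(row[1:8])        # columns 2-8 -> output columns 3-9
                state = row[5]                   # column 6: state
                yield [SUPER_PACS[committee_id], "Super"] + middle + censusFields(demographics, state)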

After you've done all this, you'll want to do some spot checking on the generated CSV to make sure that it contains the fields you are expecting. You don't have to include code for this, but be aware that I will be inspecting the CSV file to make sure that the fields contain accurate data.

What to turn in

Create a directory inside ~/qtw called h1, with data and code directories as before. Please name your Python script h1.py, and name the functions and data files as specified in this document. Be sure to put both of your names at the top of h1.py, and include an estimate of how many hours the assignment took you to complete.

Your code should be well-documented. Additionally, you should generate HTML documentation using the pydoc command with the -w parameter set.
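
For example, running pydoc as a module from the directory containing h1.py writes an h1.html file into the current directory, which you can then move wherever you keep your documentation:

    # Run from the directory that contains h1.py; this writes h1.html
    # into the current directory.
    python -m pydoc -w h1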

I will be able to read any files or directories placed in the qtw directory. This means there is nothing for you to do to submit the homework, aside from putting it in the right place.

Please do not modify h1.py after the submission deadline. I will check the modification timestamp to verify an on-time submission.