Leveraging APIs to Collect Data

But first, some Python coding tips

  • Normal coding process in Python
    • Write some half-working code in an editor, saved in foo.py
    • Copy code from the editor into the interpreter to fix the bugs
    • Once everything's working, put the working code in a function called bar
    • From the interpreter, type import foo, then foo.bar()
    • You then realize that you need to add a new feature to bar(), so you modify foo.py
    • If you type import foo again, nothing happens: the module is already loaded, so Python does not re-read the file
    • Instead, you must type reload(foo)
    • But what happens when there is an error in your new-and-improved bar() function?
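
The import/reload cycle described above can be tried end to end. The sketch below rests on a few assumptions: it is written for Python 3, where reload() lives in importlib (in the Python 2 used on these slides, reload is a builtin), and it writes a throwaway foo.py into a temporary directory rather than using a real project file.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical scratch directory standing in for wherever foo.py lives.
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
sys.dont_write_bytecode = True  # keep the demo free of cached bytecode
path = os.path.join(workdir, "foo.py")

def write_module(source):
    """Overwrite foo.py with the given source text."""
    with open(path, "w") as f:
        f.write(source)

write_module("def bar():\n    return 'old bar'\n")
import foo
first = foo.bar()

# Edit the file, then import again: the already-loaded module is reused.
write_module("def bar():\n    return 'new and improved bar'\n")
import foo
second = foo.bar()

# reload() re-executes foo.py, so the new definition of bar() takes effect.
importlib.reload(foo)
third = foo.bar()

print(first, "|", second, "|", third)
```

Only the call after reload() sees the edited function; the second import is a no-op.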

Enter pdb, the Python debugger

Python debugging example

Here's the code in python_ex_wrong.py:

 1 #Tyler Moore
 2 #Almost working solution to example 1 and 2 from Python notes
 3 #2 Feb 2012
 4 
 5 import urllib2
 6 
 7 def readCSV():
 8     state2PeaceYrs={}
 9     for line in urllib2.urlopen("http://cs.wellesley.edu/~qtw/data/peaceIndexNoHeader.csv"):
10         bits=line.split(',')
11         state2PeaceYrs[bits[0]]=[float(b) for b in bits[1:]]
12     state2Peace={}
13     for state in state2PeaceYrs:
14         state2Peace[state]=state2PeaceYrs[state][-1]
15     return (state2Peace,state2PeaceYrs)
16 
17 if __name__ == '__main__':
18     (state2Peace,state2PeaceYrs)=readCSV()
19     #OK let's debug to make sure this makes sense
20     len(state2PeaceYrs)  #should be 50
21     [len(state2PeaceYrs[s]) for s in state2PeaceYrs] # should be a 50-element list, all of length 19

Meanwhile, back at the interpreter

Here's how we find our error:

>>> import pdb
>>> import python_ex_wrong
>>> (d1,d2)=python_ex_wrong.readCSV()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python_ex_wrong.py", line 14, in readCSV
    state2Peace[state]=state2PeaceYrs[state][-1]
IndexError: list index out of range
>>> pdb.pm()
> /home/qtw/public_html/code/python_ex_wrong.py(14)readCSV()
-> state2Peace[state]=state2PeaceYrs[state][-1]
(Pdb) state
'\n'
(Pdb) len(state2PeaceYrs.keys())
51
(Pdb) state2PeaceYrs[state]
[]
(Pdb) state2PeaceYrs[state][-1]
*** IndexError: list index out of range
(Pdb)
>>>   #type ctrl-D to exit the debugger, fix the code and then reload
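
The debugger session shows a 51st key, '\n', whose value is an empty list: a blank line in the input was split on commas, which yields a key but no numbers. One possible fix (a sketch, not necessarily the one used in class) is to skip blank lines before splitting. The function name parse_peace_lines and the sample rows are made up so that the example runs without the network; it is written for Python 3.

```python
def parse_peace_lines(lines):
    """Build the two dictionaries from CSV lines, ignoring blank lines."""
    state2PeaceYrs = {}
    for line in lines:
        line = line.strip()
        if not line:          # a bare '\n' would otherwise become a key
            continue
        bits = line.split(',')
        state2PeaceYrs[bits[0]] = [float(b) for b in bits[1:]]
    # Most recent value for each state is the last element of its list.
    state2Peace = {s: yrs[-1] for s, yrs in state2PeaceYrs.items()}
    return state2Peace, state2PeaceYrs

# Hypothetical sample rows, ending with the kind of blank line that
# triggered the IndexError in the session above.
sample = ["Maine,1.5,1.4,1.3\n", "Vermont,1.6,1.5,1.2\n", "\n"]
state2Peace, state2PeaceYrs = parse_peace_lines(sample)
print(len(state2PeaceYrs), state2Peace)
```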

On to APIs

URL structure

  • URLs follow a common structure
  • URLs can't contain spaces or certain other special characters, so these are percent-encoded: a % symbol followed by the character's hexadecimal code (a space becomes %20, or + in a query string)
    • In Python, urllib.quote_plus() encodes special characters, while urllib.unquote_plus() undoes the process
  • However, when dealing with APIs, we are mostly concerned with encoding just the query part of the URL
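
To make the structure and the encoding concrete, here is a small sketch. It assumes Python 3, where the quoting functions named above have moved to urllib.parse; the shortened Bing URL and the sample strings are chosen for illustration.

```python
from urllib.parse import urlsplit, quote_plus, unquote_plus

# Split a URL into its standard parts; the query is everything after '?'.
parts = urlsplit("http://www.bing.com/search?q=tyler+moore&form=QBLH")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Encode a value for use in a query string, then undo the encoding.
encoded = quote_plus("tyler moore & co")
decoded = unquote_plus(encoded)
print(encoded, "->", decoded)

# Encoding an already-encoded value escapes the % sign itself,
# which is how a sequence like %2520 ends up in a URL.
double = quote_plus("tyler%20moore")
print(double)
```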

URL query structure

  • Fake example: http://example.com/foo?att1=val1&att2=val2&att3=val3
  • Real example: http://www.bing.com/search?q=tyler+moore&go=&qs=n&form=QBLH&pq=tyler%2520moore&sc=8-10&sp=-1&sk=
  • So we've got a series of attribute-value pairs. Sounds like a dictionary!

urllib.urlencode() takes a dictionary and creates a query string:

>>> import urllib
>>> qs={"att1":"val1","att2":"val2 with spaces","att3":"val3%"}
>>> urllib.urlencode(qs)
'att3=val3%25&att2=val2+with+spaces&att1=val1'
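
For readers on Python 3, a sketch of the same idea: urlencode lives in urllib.parse there, and parse_qsl reverses it. The dictionary is the one from the example; the round trip is added only to show that no information is lost.

```python
from urllib.parse import urlencode, parse_qsl

qs = {"att1": "val1", "att2": "val2 with spaces", "att3": "val3%"}

# Dictionary -> query string (pairs appear in the dictionary's order).
query_string = urlencode(qs)
print(query_string)

# Query string -> list of (attribute, value) pairs -> dictionary again.
round_trip = dict(parse_qsl(query_string))
print(round_trip == qs)
```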

Crafting API queries

  • Most APIs we will be working with are RESTful, which means that we embed the query in a URL, request the URL and the response comes back directly
  • Today's working example: http://developer.nytimes.com/docs/article_search_api
  • From the docs, we see that the URL structure specified is: http://api.nytimes.com/svc/search/v1/article?query=(field:)keywords (facet:[value])(&params)&api-key=your-API-key
  • In any documentation, you are looking for three types of information
    1. Required and optional parameters for issuing queries
    2. Structure and nature of response
    3. Working examples!
  • Let's look at this in the New York Times Article Search API

Example query: articles mentioning Wellesley

  • Having examined the API documentation, we see that the relevant attribute is called query. The URL should consist of the base URL + the query + the API key:
    • http://api.nytimes.com/svc/search/v1/article?query=wellesley&api-key

So let's build this URL bit by bit:

import urllib

apikey="put in your API key here"
baseurl="http://api.nytimes.com/svc/search/v1/article?"
q={"query":"wellesley","api-key":apikey}   #attribute-value pairs for the query string
url2check=baseurl+urllib.urlencode(q)      #base URL + encoded query string
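
If you build many such URLs, the steps can be wrapped in a small helper. This is only a sketch in Python 3: build_query_url is a hypothetical name, "YOUR-KEY" is a placeholder, and the code constructs the URL without contacting the API.

```python
from urllib.parse import urlencode

BASEURL = "http://api.nytimes.com/svc/search/v1/article?"

def build_query_url(query, apikey):
    """Return the base URL followed by the encoded query and API key."""
    params = {"query": query, "api-key": apikey}
    return BASEURL + urlencode(params)

url2check = build_query_url("wellesley", "YOUR-KEY")
print(url2check)

# Spaces in the keywords are encoded automatically.
url_with_space = build_query_url("wellesley college", "YOUR-KEY")
print(url_with_space)
```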

JSON

  • CSVs are nice when you've got nicely-behaved records
  • But what if you've got nested data?
  • JSON is a standard object notation that corresponds nicely to Python dictionaries and lists

Continuing our earlier example of querying for Wellesley (the stored JSON file can be downloaded from http://cs.wellesley.edu/~qtw/data/wellesleyNYT.json):

import urllib2

result=urllib2.urlopen(url2check).read()   #fetch the raw JSON text
f=open('wellesleyNYT.json','w')
f.write(result)                            #save a local copy
f.close()

Exploring results

>>> import json
>>> resd=json.loads(result)
>>> for r in resd:
...     print r, resd[r]
...
tokens [u'wellesley']
total 4409
results [{u'body': u'PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who', u'date': u'20120212', u'byline': u'By MOTOKO RICH', u'url': u'http://www.nytimes.com/2012/02/12/business/economics-of-family-life-as-taught-by-a-power-couple.html', u'title': u'Economics of Family Life, as Taught by a Power Couple'}, {u'body': u"LIPPINCOTT--Rosemond, 97, on January 16, 2012, at Mayflower Place Nursing Center, West Yarmouth, MA. Born Summit, NJ, to Dr. Henry M. and Mary O'Reilly. Kent Place School, '32; Wellesley College, '36. Predeceased by husband Job H. Lippincott. Resided Chatham, NJ, 1937-77; Nantucket, MA, 1977-85; and thereafter on Cape Cod. She was a generous", u'date': u'20120129', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9800E2DA133AF93AA15752C0A9649D8B63', u'title': u'Paid Notice: Deaths LIPPINCOTT, ROSEMOND'}, {u'body': u'HOWARD--Barnaby J. The son of a British lord, who grew up to be a pilot with the British and United States Navies during World War II and later a farmer in Southern Rhodesia (now Zimbabwe) before returning to America to set up a successful investment company (CAIMS), died December 18 at home in Orange Park, FL at age 86 after a courageous battle', u'date': u'20120129', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9803E3DA133AF93AA15752C0A9649D8B63', u'title': u'Paid Notice: Deaths HOWARD, BARNABY J'}, {u'body': u"Two hundred fifty-two consecutive matches won over 13 years. Thirteen national titles. The longest winning streak in college sports. Trinity College has been a squash dynasty under Coach Paul Assaiante. But two weeks ago in New Haven, Yale overthrew that dynasty in a 5-4 victory. 
Yale's coach, David Talbott, called it ''a long time coming.'' The", u'date': u'20120129', u'byline': u'By MING TSAI', u'url': u'http://www.nytimes.com/2012/01/29/sports/chef-ming-tsai-devoted-player-and-cooker-of-squash.html', u'title': u'Squash, a Growing Sport, And Nutritious, Too'}, {u'body': u"To the Editor: Hendrik Hartog has it right in ''Bargaining for a Child's Love'' (Sunday Review, Jan. 15). That Republicans disparage entitlement programs astounds me. I don't know of any who have refused Social Security or Medicare for themselves or their parents or grandparents. My mother, born in 1918, often said that it was President Franklin D.", u'date': u'20120124', u'url': u'http://www.nytimes.com/2012/01/24/opinion/benefits-for-the-elderly.html', u'title': u'LETTER; Benefits for the Elderly'}, {u'body': u"CRAWFORD--John Charlton, composer, pianist, professor, beloved father and husband, died on January 5, 2012, at age 80 in his 23rd year of Parkinson's disease in Cambridge, MA. Born the son of academic parents in 1931 in Philadelphia, he was gifted in music and languages. He graduated from Germantown Friends School and the Yale School of Music, and", u'date': u'20120122', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9C00E2DE133AF931A15752C0A9649D8B63', u'title': u'Paid Notice: Deaths CRAWFORD, JOHN CHARLTON'}, {u'body': u"IT'S show time for Anne M. Finucane. Her co-star on this day, Bill Clinton, is waiting offstage. The audience shifts in its seats. The spotlight goes up and ... action! It's a Thursday in early December, at a conference center near Orlando, and Ms. Finucane is busy shaping an image. Or, rather, trying to reshape one. 
This choreographed interview", u'date': u'20120115', u'byline': u'By LOUISE STORY and GRETCHEN MORGENSON', u'url': u'http://www.nytimes.com/2012/01/15/business/at-bank-of-america-the-image-officer-has-a-lot-to-fix.html', u'title': u'The Image Officer With a Lot to Fix'}, {u'body': u'KNEUBUHL--James Pritchard of Southbury, CT, formerly of New Canaan, CT and San Marino, CA, died December 30, 2011, at the age of 95. Husband of the late Margaret Woodard Kneubuhl, Jim leaves his daughters, Janet Schloat of Pound Ridge, NY and Barbara Kneubuhl of Wellesley, MA; three grandsons, David, Benjamin, and Michael Schloat and their wives;', u'date': u'20120112', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9404E5D8123AF931A25752C0A9649D8B63', u'title': u'Paid Notice: Deaths KNEUBUHL, JAMES PRITCHARD OF SOUTHBURY'}, {u'body': u'EDELMAN--Eleanor L. died peacefully in her sleep at her home in Bronxville, New York on January 7, 2012. For 53 years, she was the wife of Albert I. Edelman, an attorney who predeceased her. She was born Eleanor Louise Weisman in 1924 in St. Louis, Missouri and was known to her friends as Elly. Along with her beloved sisters, Beryl and Nanette, she', u'date': u'20120112', u'url': u'http://query.nytimes.com/gst/fullpage.html?res=9E03E6D8123AF931A25752C0A9649D8B63', u'title': u'Paid Notice: Deaths EDELMAN, ELEANOR L'}, {u'body': u'Nina Bich-Phuong Xuan Ha and Stephen Michael Girasuolo were married Friday evening at the Harvard Club of New York. Marylin G. Diamond, a retired acting justice of State Supreme Court in New York, officiated. On Thursday, the Rev. Thich Nguyen Hanh, a Buddhist priest, performed a ceremony that incorporated Vietnamese traditions at the Unitarian', u'date': u'20120108', u'url': u'http://www.nytimes.com/2012/01/08/fashion/weddings/nina-ha-stephen-girasuolo-weddings.html', u'title': u'Nina Ha, Stephen Girasuolo'}]
offset 0
>>> #so most of the data comes in resd["results"]
>>> len(resd["results"])
10
>>> for k in resd['results'][0]:
...     print k, resd['results'][0][k]
...
body PHILADELPHIA BETSEY STEVENSON and Justin Wolfers might sound like almost any upscale couple. They have impressive degrees and serious careers and the social markers that go with them. They have one child, but there are two strollers, a Bugaboo and a Bob baby jogger, parked in the front hall of their stylish home here. Their daughter, Matilda, who
date 20120212
byline By MOTOKO RICH
url http://www.nytimes.com/2012/02/12/business/economics-of-family-life-as-taught-by-a-power-couple.html
title Economics of Family Life, as Taught by a Power Couple
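
One detail visible in the output above is that not every result carries the same keys: the paid death notices have no byline. A sketch of how to cope with that, using two records abridged from the output (body and url omitted) and Python 3 syntax; the "(no byline)" placeholder is an arbitrary choice.

```python
import datetime

# Two records abridged from the results shown above.
results = [
    {"title": "Economics of Family Life, as Taught by a Power Couple",
     "date": "20120212", "byline": "By MOTOKO RICH"},
    {"title": "Paid Notice: Deaths LIPPINCOTT, ROSEMOND",
     "date": "20120129"},
]

# r["byline"] would raise KeyError for the second record;
# dict.get() supplies a default instead.
bylines = [r.get("byline", "(no byline)") for r in results]
print(bylines)

# The date field is a YYYYMMDD string; convert it to a date object.
dates = [datetime.datetime.strptime(r["date"], "%Y%m%d").date() for r in results]
print(dates)
```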

Exercise: crafting queries and printing results

  • Download the file http://cs.wellesley.edu/~qtw/code/apiex.py.
  • Your task is to create a dictionary q that will include the appropriate parameters to answer the following query: get articles written by David Pogue in 2011 that mention "iphone" and "android".
  • Use a list comprehension to extract just the titles of the articles from the results.
  • Bonus task if you finish the rest of the exercise early: find out how many articles written by David Pogue in 2011 don't mention iPhone or Android, and compare this to the number that do.

Storing results

  • Since you may issue lots of queries, storing every JSON file separately is inefficient
  • Instead, you should construct a dictionary of the results and store the serialized object using the cPickle module

Creating a dictionary of JSON-processed objects

Suppose we want articles whose titles mention "Obama", "Romney", "Santorum", "Gingrich" or "Paul". We can issue the queries and store the results in a dictionary:

import datetime, time
rightnow=datetime.datetime.now()
queries=[{"query":"title:"+politician,"api-key":apikey} for politician in ["Obama", "Romney", "Santorum", "Gingrich","Paul"]]
apiResults={}
for q in queries:
    #these 3 lines are just the same as before, just encoding and grabbing the URL
    url2check=baseurl+urllib.urlencode(q)
    result=urllib2.urlopen(url2check).read()
    resd=json.loads(result)
    #OK now store the json result in the apiResults dictionary
    apiResults[(url2check,rightnow)]=resd
    time.sleep(1)  #wait one second between requests
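
The same loop can be packaged as a function so it is easy to test without a network connection. The sketch below is one way to do that in Python 3 and is not part of the original code: collect, fake_fetch and the pause parameter are hypothetical names, and fake_fetch returns canned JSON where a real fetcher would request the URL.

```python
import datetime
import json
import time
from urllib.parse import urlencode

def collect(queries, baseurl, fetch, pause=0):
    """Run each query through fetch(url); key parsed JSON by (url, time)."""
    rightnow = datetime.datetime.now()
    results = {}
    for q in queries:
        url = baseurl + urlencode(q)
        results[(url, rightnow)] = json.loads(fetch(url))
        time.sleep(pause)   # e.g. pause=1 when calling a real API
    return results

def fake_fetch(url):
    """Stand-in for a real HTTP request: returns canned JSON text."""
    return json.dumps({"total": 0, "results": [], "requested": url})

baseurl = "http://api.nytimes.com/svc/search/v1/article?"
queries = [{"query": "title:" + name, "api-key": "YOUR-KEY"}
           for name in ["Obama", "Romney"]]
apiResults = collect(queries, baseurl, fake_fetch)
for (url, when) in apiResults:
    print(url)
```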

What did we just create?

Now apiResults is a dictionary whose keys are 2-element tuples of the URL requested plus the time of the search:

>>> apiResults.keys()
[('http://api.nytimes.com/svc/search/v1/article?query=title%3AGingrich&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3AObama&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3APaul&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3ASantorum&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)), ('http://api.nytimes.com/svc/search/v1/article?query=title%3ARomney&api-key=[removed API key for security reasons]', datetime.datetime(2012, 2, 14, 15, 3, 39, 685928))]
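
Because each key is a (URL, time) tuple, it can be unpacked while looping, and the search term can be recovered from the URL itself. A sketch in Python 3 with a stand-in dictionary: the totals and the KEY placeholder are invented for illustration.

```python
import datetime
from urllib.parse import urlsplit, parse_qs

when = datetime.datetime(2012, 2, 14, 15, 3, 39, 685928)
base = "http://api.nytimes.com/svc/search/v1/article?"

# Stand-in for apiResults with made-up totals and a placeholder key.
apiResults = {
    (base + "query=title%3AObama&api-key=KEY", when): {"total": 3, "results": []},
    (base + "query=title%3ARomney&api-key=KEY", when): {"total": 1, "results": []},
}

totals = {}
for (url, searched_at), resd in apiResults.items():
    # parse_qs decodes the query string; each value comes back as a list.
    term = parse_qs(urlsplit(url).query)["query"][0]
    totals[term] = resd["total"]

print(totals)
```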

But how do we access the dictionary after we end our Python session?

Python has built-in "object serialization" via the "pickle" module:

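import os
import cPickle as pickle #we use cPickle, a C implementation of the pickle module that runs faster
#open() does not expand "~", so expand it to the home directory first
pf=open(os.path.expanduser("~/qtw/inclass/data/apiex.pkl"),"wb")  #wb= write to a binary file
pickle.dump(apiResults,pf,pickle.HIGHEST_PROTOCOL)  #use the binary pickle format
pf.close()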

Next time we need to access the dictionary

Fire up the interpreter (or put the code in your own module) and type:

import os
import cPickle as pickle
pf=open(os.path.expanduser("~/qtw/inclass/data/apiex.pkl"),"rb")  #rb= read from a binary file
apiRes=pickle.load(pf)
pf.close()
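
The full save-and-restore cycle can be checked in one go. This sketch assumes Python 3, where the plain pickle module already uses the C implementation when available, and it writes to a temporary directory instead of the course path; the dictionary contents are invented.

```python
import datetime
import os
import pickle
import tempfile

# Invented stand-in for apiResults: a (URL, time) key mapped to parsed JSON.
apiResults = {
    ("http://example.com/foo?att1=val1",
     datetime.datetime(2012, 2, 14, 15, 3, 39)): {
        "total": 1, "results": [{"title": "Example"}]},
}

path = os.path.join(tempfile.mkdtemp(), "apiex.pkl")

# "with" closes the file even if dump() or load() raises an error.
with open(path, "wb") as pf:
    pickle.dump(apiResults, pf, pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as pf:
    apiRes = pickle.load(pf)

print(apiRes == apiResults)
```

Tuple keys and datetime objects survive the round trip unchanged, which is why pickling suits this dictionary better than writing it back out as JSON.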

Project

  • Timeline
  • Topic selection
    • If you are working with a "traditional" data source (e.g., indicators of literacy rates and economic development collected from the World Bank), then you must complement the study with an additional explanatory variable built from data collected using "new" sources
    • If you are collecting data from a "new" source, then collecting data from another source to link with it is still strongly encouraged (see me if you think this will be infeasible or will not make sense)

H1

  • H1 is due next Friday, Feb 24
  • This week, pair programming is mandatory
  • Here's the current draft
  • It will be finalized shortly; look for an email
