Community Data Science Workshops (Fall 2014)/Day 2 lecture

Lecture Slides

 * Slides (PDF) — For viewing
 * Slides (ODP Libreoffice Slides Format) — For editing and modification

Resources

 * Encoding:
 * Pragmatic Unicode
 * Official Python Unicode documentation

Lecture Outline

 * Introduction and context


 * You can write some tools in Python now. Congratulations!
 * Today we'll learn how to find/create data sets
 * Next week we'll get into data science (asking and answering questions)


 * Outline:


 * What did we learn in Session 1?
 * What is an API?
 * How do we use one to fetch interesting datasets?
 * How do we write programs that use the internet?
 * How can we use the placekitten API to fetch kitten pictures?
 * Introduction to structured data (JSON)
 * How do we use APIs in general?


 * What is a (web) API?


 * API: a structured way for programs to talk to each other (aka an interface for programs)
 * Web APIs: like a website your programs can visit (you:a website::your program:a web API)


 * How do we use an API to fetch datasets?

Basic idea: your program sends a request, the API sends data back
 * Where do you direct your request? The site's API endpoint.
 * For example: Wikipedia's web API endpoint is http://en.wikipedia.org/w/api.php
 * How do I write my request? Put together a URL; it will be different for different web APIs.
 * Check the documentation, look for code samples
 * How do you send a request?
 * Python has modules you can use, like requests (they make HTTP requests)
 * What do you get back?
 * Structured data (usually in the JSON format)
 * How do you understand (i.e. parse) the data?
 * There's a module for that!
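
The request and response steps above can be sketched without touching the network. The endpoint is the Wikipedia one named above; the query parameters and the sample response body are assumptions chosen for illustration, so check the API documentation for the real options.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_request_url(endpoint, parameters):
    """Join an endpoint and a dictionary of parameters into one URL."""
    return endpoint + "?" + urlencode(parameters)


# Illustrative parameters; the API documentation lists the real options.
url = build_request_url(
    ENDPOINT, {"action": "query", "titles": "Kitten", "format": "json"}
)
print(url)

# A made-up response body standing in for what an API might send back.
sample_response = '{"query": {"pages": {"1": {"title": "Kitten"}}}}'

# json.loads turns the JSON text into ordinary dictionaries and lists.
data = json.loads(sample_response)
for page in data["query"]["pages"].values():
    print(page["title"])
```

In a real program the sample text would be replaced by the body of the HTTP response.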


 * How do we write Python programs that make web requests?

To use APIs to build a dataset we will need:
 * all our tools from last session: variables, etc
 * the ability to open URLs on the web
 * the ability to create custom URLs
 * the ability to save to files
 * the ability to understand (i.e., parse) JSON data that APIs usually give us


 * Session 1 review


 * Navigating in the terminal and using it to run programs
 * Writing Python:
 * using variables to manipulate data
 * types of data: strings, integers, lists, dictionaries
 * if statements
 * for loops
 * printing
 * importing modules, so you can use code other people have written for you!


 * New programming concepts:


 * interpolate variables into a string using % and %s
 * requests
 * open files and write to them
 * parsing a string (turning the string into a data structure we can manipulate)
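
A minimal sketch of three of these concepts: string interpolation, writing to a file, and parsing a string. The file name and the comma-separated sample line are made up for illustration.

```python
import os
import tempfile

size = 200
# Each %s is replaced, in order, by a value from the tuple after the % operator.
url = "http://placekitten.com/%s/%s" % (size, size)

# Open a file for writing ("w") and write one line to it.
# The temporary directory only keeps the example from cluttering your folder.
path = os.path.join(tempfile.gettempdir(), "cdsw_example.txt")
with open(path, "w") as output_file:
    output_file.write(url + "\n")

# Parse a string: turn comma-separated text into a list we can loop over.
line = "tabby,calico,tuxedo"
cats = line.split(",")
for cat in cats:
    print(cat)
```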


 * How do we use an API to fetch kitten pictures?

placekitten.com
 * API that takes specially crafted URLs and returns appropriately sized pictures of kittens
 * Exploring placekitten in a browser:
 * visit the API documentation
 * kittens of different sizes
 * kittens in greyscale or color
 * Now we write a small program that asks for a size on standard input and grabs a square kitten picture of that size from placekitten: placekitten_raw_input.py
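
A rough sketch of what such a program might look like. It is not the workshop's placekitten_raw_input.py. It assumes the site's URL pattern is /width/height for color and /g/width/height for greyscale, and it uses the standard library's urllib.request so that it runs in Python 3 without installing anything. The function names are hypothetical.

```python
from urllib.request import urlopen


def kitten_url(size, grey=False):
    """Build the URL for a square kitten picture that is `size` pixels wide."""
    if grey:
        # Assumed pattern for greyscale pictures.
        return "http://placekitten.com/g/%s/%s" % (size, size)
    return "http://placekitten.com/%s/%s" % (size, size)


def save_kitten(size, filename):
    """Download a square kitten picture and save it to `filename`."""
    with urlopen(kitten_url(size)) as response:
        picture = response.read()
    # Pictures are binary data, so the file is opened in "wb" mode.
    with open(filename, "wb") as output_file:
        output_file.write(picture)
```

To ask for the size on standard input, call save_kitten(int(input("Size? ")), "kitten.jpg"). In Python 3, input() plays the role that raw_input() plays in Python 2.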


 * Introduction to structured data (JSON, JavaScript Object Notation)


 * What is JSON? A text format useful for more structured data
 * Parse it in Python with import json; json.loads
 * Syntax looks like Python (except that strings must use double quotes, not single quotes)
 * simple lists, dictionaries
 * can reflect more complicated data structures
 * Example file at http://mako.cc/cdsw.json
 * download it and parse it: parse_cdswjson.py
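
A small sketch of parsing JSON with json.loads. The JSON text here is made up for illustration and is not the contents of the example file above.

```python
import json

# Made-up JSON text. Note the double quotes, and that JSON writes
# true and null where Python writes True and None.
text = '{"workshop": "CDSW", "sessions": [1, 2, 3], "open": true, "room": null}'

# json.loads parses a string and returns Python dictionaries and lists.
data = json.loads(text)
print(data["workshop"])
print(data["sessions"][0])
print(data["open"], data["room"])

# json.dumps goes the other way: Python data in, JSON text out.
print(json.dumps({"kittens": ["tabby", "calico"]}))
```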


 * Using other APIs


 * every API is different, so read the documentation!
 * If the documentation isn't helpful, search online
 * for popular APIs, there are Python modules that help you make requests and parse JSON

Possible issues:
 * rate limiting
 * authentication
 * text encoding issues
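
Two of these issues can be sketched briefly. Pausing between requests is one simple way to stay under a rate limit, and decoding bytes explicitly avoids many encoding surprises. The function names, the one-second default delay, and the choice of UTF-8 are assumptions; each API's documentation states its real limits and encoding.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


def to_text(raw_bytes):
    """Decode bytes received from the network, assuming they are UTF-8."""
    return raw_bytes.decode("utf-8")
```

Here fetch stands for whatever function makes the actual request, so the pacing logic can be tried out with a stand-in function before pointing it at a real API.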