Matplotlib

Revision as of 18:45, 26 July 2012 by imported>Jesstess

Project

Learn how to plot data with the matplotlib plotting library. Ditch Excel forever!

Goals

  • practice reading data from a file
  • practice using the matplotlib Python plotting library to analyze data and generate graphs

Project setup

Mac OS X users only

If you do not already have a C compiler installed, you'll need one to install matplotlib. You have several options depending on your situation:

  1. Download and install Xcode (1.5 GB) from https://developer.apple.com/xcode/
  2. Download and install Command Line Tools for Xcode (175 MB) from https://developer.apple.com/downloads/index.action. This requires an Apple Developer account (free, but you have to sign up).
  3. Download and install kennethreitz's gcc installer (requires 10.6 or 10.7) from https://github.com/kennethreitz/osx-gcc-installer/

Please wave over a staff member and we'll help you pick which option is best for your computer.

Install the project dependencies

Please follow the official matplotlib installation instructions at http://matplotlib.sourceforge.net/users/installing.html

The dependencies vary across operating systems. http://matplotlib.sourceforge.net/users/installing.html#build-requirements summarizes what you'll need for your operating system.

A universal dependency is the NumPy scientific computing library. NumPy has download and installation instructions at http://numpy.scipy.org/

Installing matplotlib and its dependencies is somewhat involved; please ask for help if you get stuck or don't know where to start!


Download and un-archive the Jeopardy database project skeleton code

Un-archiving will produce a JeopardyDatabase folder containing 3 Python files and one SQL database dump.

Create a SQLite database from the database dump

Inside JeopardyDatabase is a file called jeopardy.dump which contains a SQL database dump. We need to turn that database dump into a SQLite database.

Once you have SQLite installed, you can create a database from jeopardy.dump with:

sqlite3 jeopardy.db < jeopardy.dump

This creates a SQLite database file called jeopardy.db in the current directory.
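If the sqlite3 command-line tool is not available, roughly the same result can be sketched with Python's built-in sqlite3 module. This is only an illustration, assuming jeopardy.dump is a plain-text file of SQL statements; build_database is a hypothetical helper name, not part of the project skeleton.

```python
import sqlite3


def build_database(dump_path, db_path):
    """Create a SQLite database at db_path from the SQL statements in dump_path.

    Hypothetical helper for illustration; assumes the dump is plain SQL text.
    """
    with open(dump_path) as dump_file:
        sql = dump_file.read()
    connection = sqlite3.connect(db_path)
    try:
        # executescript runs every statement in the dump, one after another.
        connection.executescript(sql)
        connection.commit()
    finally:
        connection.close()
```

Calling build_database("jeopardy.dump", "jeopardy.db") from inside the JeopardyDatabase folder should then play the same role as the shell command above.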

Test your setup

At a command prompt, start sqlite3 using the jeopardy.db database by running:

sqlite3 jeopardy.db

That should start a sqlite prompt that looks like this:

SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

At that sqlite prompt, type .tables and hit enter. That should display a list of the tables in this database:

sqlite> .tables
category  clue    
sqlite>

From a command prompt, navigate to the JeopardyDatabase directory and run

python jeopardy_categories.py

You should see a list of 10 Jeopardy categories printed to the screen. If you don't, let a staff member know so you can debug this together.

Project steps

1. Create a basic plot

  1. Run python basic_plot.py. This will pop up a window with a dot plot of some data.
  2. Open basic_plot.py. Read through the code in this file. The meat of the file is in one line:
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    In this example, the first argument to pyplot.plot is the list of y values, and the second argument describes how to plot the data. If two lists had been supplied, pyplot.plot would consider the first list to be the x values and the second list to be the y values.

  3. Change the plot to display lines between the data points by changing
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    to

    pyplot.plot([0, 2, 4, 8, 16, 32], "o-")
  4. Add x-values to the data by changing pyplot.plot([0, 2, 4, 8, 16, 32], "o-") to
    x_values = [0, 4, 7, 20, 22, 25]
    y_values = [0, 2, 4, 8, 16, 32]
    pyplot.plot(x_values, y_values, "o-")

    Note how matplotlib automatically resizes the graph to fit all of the points in the figure for you.

  5. Read about how to generate random integers on http://docs.python.org/library/random.html#random.randint. Then, instead of hard-coding x values and y values in basic_plot.py, generate a list of random y values.
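One possible shape for the last step is sketched below. It is not the contents of basic_plot.py; the function names random_y_values and plot_values, and the default range of 0 to 100, are assumptions made for illustration.

```python
import random


def random_y_values(count, low=0, high=100):
    """Return `count` random integers between low and high, inclusive."""
    return [random.randint(low, high) for _ in range(count)]


def plot_values(y_values, style="o-"):
    """Plot y_values with the given format string and show the window."""
    # Imported here so random_y_values can be tried without matplotlib.
    from matplotlib import pyplot

    pyplot.plot(y_values, style)
    pyplot.show()
```

For example, plot_values(random_y_values(10)) should pop up a window with ten connected points, and the picture changes on every run because the values are random.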
     

Read these short documents:

Check your understanding:

  • What does matplotlib pick as the x values if you don't supply them yourself?
  • What options would you pass to pyplot.plot to generate a plot with red triangles and dotted lines?


2. Plotting the world population over time

  1. Run python world_population.py. This will pop up a window with a dot plot of the world population over the last 10,000 years.
  2. Open world_population.py. Read through the code in this file. In this example, we read our data from a file. Open the data file world_population.txt and examine the format of the file.
  3. Find the documentation on http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot for customizing the linewidth of plots. Then change the world population plot to use a magenta, down-triangle marker and a linewidth of 2.
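Reading plot data from a text file usually comes down to splitting each line into fields and converting them to numbers. The sketch below assumes each line holds two whitespace-separated numbers, a year followed by a population; check world_population.txt to see whether that matches. parse_points is a hypothetical helper, not code taken from world_population.py.

```python
def parse_points(lines):
    """Turn lines like "1950 2500" into two lists: years and populations.

    Assumes two whitespace-separated numbers per line (an illustration only).
    """
    years = []
    populations = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        years.append(float(fields[0]))
        populations.append(float(fields[1]))
    return years, populations
```

The two lists can then be passed to pyplot.plot as the x values and y values.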

World population resources:

Check your understanding:

  • In world_population.py, what does file("world_population.txt", "r").readlines() return?
  • In world_population.py, what does point.split() return?


3. Make database queries from Python

Examine the code in jeopardy_categories.py. To make a database query from Python, you need to:

  1. Import a Python library for making database connections
  2. Establish a connection to the desired database
  3. Get a cursor from the database for making queries
  4. Execute the database query using the standard SQL syntax
  5. Retrieve the list of results from the database cursor
  6. Do something useful with the results, like print them
  7. Close the database connection

Match up each of these steps with lines of code in the file.
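For comparison, here is a self-contained sketch of the same seven steps against a throwaway in-memory database, so it runs without jeopardy.db. The name column and the sample rows are assumptions made for illustration; run .schema category at a sqlite prompt to see the real column names.

```python
import sqlite3  # 1. import a library for making database connections


def list_category_names(connection, limit=10):
    """Return up to `limit` category names from the given connection."""
    cursor = connection.cursor()  # 3. get a cursor for making queries
    # 4. execute the query; the "name" column is a hypothetical example
    cursor.execute("SELECT name FROM category LIMIT ?", (limit,))
    rows = cursor.fetchall()  # 5. retrieve the list of results
    return [row[0] for row in rows]


# 2. establish a connection; ":memory:" is a temporary demo database
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, game INTEGER)"
)
connection.executemany(
    "INSERT INTO category (name, game) VALUES (?, ?)",
    [("METALS", 136), ("RUNNING MATES", 136)],
)

names = list_category_names(connection)
for name in names:  # 6. do something useful with the results
    print(name)

connection.close()  # 7. close the database connection
```

To run the same function against the project data, the connection would instead be opened with sqlite3.connect("jeopardy.db"), adjusting the column name to match the real schema.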


4. Tweak the existing Jeopardy scripts

1. Modify jeopardy_categories.py to print both the category and game number

tip: Remind yourself of the category table's schema by running .schema category at a sqlite prompt.


Example output:

Example categories:

DETECTIVE FICTION (game #1)
THE OLD TESTAMENT (game #2)
ASIAN HISTORY (game #4)
RIVER SOURCES (game #5)
WORLD RELIGION (game #3)
SEAN SONG (game #2)
ANIMATED MOVIES (game #1)
NEW YORK CITY (game #6)
AFRICAN WILDLIFE (game #7)
LITTLE RED RIDING HOOD (game #8)

2. Modify jeopardy_clues.py to only print clues with an $800 value.

A good way to achieve this is by adding a WHERE clause to the SQL query in jeopardy_clues.py.
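As a rough illustration of how a WHERE clause filters rows, the sketch below builds a tiny in-memory clue table and selects only the rows with a given value. The column names text, answer, and value and the sample rows are assumptions for the demo, not the real jeopardy.db schema.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical, simplified clue table used only for this demo.
connection.execute("CREATE TABLE clue (text TEXT, answer TEXT, value INTEGER)")
connection.executemany(
    "INSERT INTO clue VALUES (?, ?, ?)",
    [
        ("Clue A", "Answer A", 200),
        ("Clue B", "Answer B", 800),
        ("Clue C", "Answer C", 800),
    ],
)

cursor = connection.cursor()
# The "?" placeholder is filled in from the tuple passed as the second argument.
cursor.execute("SELECT text, answer FROM clue WHERE value = ?", (800,))
matching = cursor.fetchall()
connection.close()

for text, answer in matching:
    print("[$800]", text, "/", answer)
```

Only the two rows whose value is 800 come back; the $200 row is filtered out by the database before Python ever sees it.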

Read about WHERE clauses in this short document:


Example output:

Example clues:

[$800]
A: She also created the detectives Tuppence & Tommy Beresford
Q: What is 'Agatha Christie'

[$800]
A: According to this Old Testament book, this "swords into plowshares" prophet walked naked for 3 years
Q: What is 'Isaiah'
...


5. Daily Doubles

Write a script that prints 10 daily doubles and their responses.


tip: The clue table has an isDD field.


Example output:

Category: NEW YORK CITY
Question: The heart of Little Italy is this street also found in a Dr. Seuss book title
Answer: Mulberry Street
===
Category: RIVER SOURCES
Question: This Mideastern boundary river rises on the slopes of Mount Hermon
Answer: the Jordan
===
Category: ROOM
Question: The Titanic has 3 rooms for this--only men were allowed there, as women weren't supposed to do it in public
Answer: smoking
...

Bonus exercises

1. Random category clues

Write a script that randomly chooses a category and prints clues from that category.


tip: SQLite supports an "ORDER BY RANDOM()" clause that will return rows in a random order. For example, to randomly pick 1 category id you could use:

SELECT id FROM category ORDER BY RANDOM() LIMIT 1

You can also use ORDER BY to sort the clues by value.
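From Python, the random-pick query above might be wrapped as in the sketch below. pick_random_category_id is a hypothetical helper, and the in-memory table with three sample rows stands in for jeopardy.db so the example runs on its own.

```python
import sqlite3


def pick_random_category_id(connection):
    """Return one randomly chosen category id, or None if the table is empty."""
    row = connection.execute(
        "SELECT id FROM category ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return None if row is None else row[0]


# Demo data only; the "name" column is an assumed column name.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO category (id, name) VALUES (?, ?)",
    [(1, "METALS"), (2, "RUNNING MATES"), (3, "ANIMATED CATS")],
)
picked = pick_random_category_id(connection)
print(picked)
connection.close()
```

The chosen id can then be used in a second query with a WHERE clause to fetch that category's clues.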


Example output:

5 GUYS NAMED MOE
[$200] Last name of Moe of the Three Stooges
[$400] Moe Strauss founded this auto parts chain along with Manny Rosenfield & Jack Jackson
[$600] Major league catcher Moe Berg was also a WWII spy for this agency, precursor of the CIA
[$800] Term for the type of country music Moe Bandy plays, the clubs where he began, or the "Queen" he sang of in 1981
[$1000] This "Kool" rapper's album "How Ya Like Me Now" began a rivalry with LL Cool J

Exercises resources:


2. Random game categories

Write a script to randomly choose a game number and print the categories from that game.

tip: the category table has game and round fields. round 0 is the Jeopardy round, round 1 is the Double Jeopardy round, and round 2 is Final Jeopardy.


Example output:

Categories for game #136:
0 WELCOME TO MY COUNTRY
0 METALS
0 GO GO GAUGUIN
0 FILE UNDER "M"
0 ANIMATED CATS
0 MAO MAO MAO MAO
1 SHAKESPEARE'S OPENING LINES
1 HEY, MARIO!
1 BRIDGE ON THE RIVER....
1 RUNNING MATES
1 13-LETTER WORDS
1 TONY BENNETT'S SONGBOOK
2 IN THE NEWS 2000


3. Top 20 Jeopardy categories

Read about the GROUP BY clause and write a script using it to print the 20 most common Jeopardy categories.

An example of using GROUP BY and ORDER BY to produce an ordered list of counts on a hypothetical foo field is:

SELECT foo, COUNT(foo) AS count FROM my_table GROUP BY foo ORDER BY count
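To see what that query returns, the sketch below runs it unchanged against a small in-memory table. The table my_table and the field foo are the same hypothetical names used in the query above, and the sample values are made up for the demo.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (foo TEXT)")
connection.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)],
)

# One output row per distinct foo value, with how many times it appears.
rows = connection.execute(
    "SELECT foo, COUNT(foo) AS count FROM my_table GROUP BY foo ORDER BY count"
).fetchall()
connection.close()

for foo, count in rows:
    print(count, foo)
```

The rows come back with the smallest count first. Appending DESC to the ORDER BY clause reverses that order, and a LIMIT clause caps how many rows are returned.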

Example output:

81 LITERATURE
79 BEFORE & AFTER
73 WORD ORIGINS
71 SCIENCE
64 BUSINESS & INDUSTRY
63 AMERICAN HISTORY
...

Congratulations!

You've learned about SQL and making database queries from within Python. Keep practicing!