Matplotlib

From OpenHatch wiki
Revision as of 19:58, 26 July 2012 by imported>Jesstess

Project

Learn how to plot data with the matplotlib plotting library. Ditch Excel forever!

Goals

  • practice reading data from a file
  • practice using the matplotlib Python plotting library to analyze data and generate graphs

Project setup

Mac OS X users only

If you do not already have a C compiler installed, you'll need one to install matplotlib. You have several options depending on your situation:

  1. Download and install Xcode (1.5 GB) from https://developer.apple.com/xcode/
  2. Download and install Command Line Tools for Xcode (175 MB) from https://developer.apple.com/downloads/index.action. This requires an Apple Developer account (free, but you have to sign up).
  3. Download and install kennethreitz's gcc installer (requires 10.6 or 10.7) from https://github.com/kennethreitz/osx-gcc-installer/

Please wave over a staff member and we'll help you pick which option is best for your computer.

Install the project dependencies

Please follow the official matplotlib installation instructions at http://matplotlib.sourceforge.net/users/installing.html

The dependencies vary across operating systems. http://matplotlib.sourceforge.net/users/installing.html#build-requirements summarizes what you'll need for your operating system.

A universal dependency is the NumPy scientific computing library. NumPy has download and installation instructions at http://numpy.scipy.org/

Installing matplotlib and its dependencies is somewhat involved; please ask for help if you get stuck or don't know where to start!


Download and un-archive the Jeopardy database project skeleton code

Un-archiving will produce a JeopardyDatabase folder containing 3 Python files and one SQL database dump.

Create a SQLite database from the database dump

Inside JeopardyDatabase is a file called jeopardy.dump which contains a SQL database dump. We need to turn that database dump into a SQLite database.

Once you have SQLite installed, you can create a database from jeopardy.dump with:

sqlite3 jeopardy.db < jeopardy.dump

This creates a sqlite3 database called jeopardy.db

Test your setup

At a command prompt, start sqlite3 using the jeopardy.db database by running:

sqlite3 jeopardy.db

That should start a sqlite prompt that looks like this:

SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

At that sqlite prompt, type .tables and hit enter. That should display a list of the tables in this database:

sqlite> .tables
category  clue    
sqlite>
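
The same check can be made from Python with the standard-library sqlite3 module. The sketch below is only an illustration: it builds a throwaway in-memory database with two tables named like the ones above (the single id column is a placeholder, not the real schema) and lists the table names by querying sqlite_master.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Stand-in database so the example runs anywhere; the "id" column is a
# placeholder, not the real Jeopardy schema.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE category (id INTEGER PRIMARY KEY)")
connection.execute("CREATE TABLE clue (id INTEGER PRIMARY KEY)")

tables = list_tables(connection)
print(tables)
```

To inspect the real database, connect with sqlite3.connect("jeopardy.db") instead of ":memory:" and skip the CREATE TABLE lines.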

From a command prompt, navigate to the JeopardyDatabase directory and run

python jeopardy_categories.py

You should see a list of 10 Jeopardy categories printed to the screen. If you don't, let a staff member know so you can debug this together.

Project steps

1. Create a basic plot

  1. Run python basic_plot.py. This will pop up a window with a dot plot of some data.
  2. Open basic_plot.py. Read through the code in this file. The meat of the file is in one line:
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    In this example, the first argument to pyplot.plot is the list of y values, and the second argument describes how to plot the data. If two lists had been supplied, pyplot.plot would consider the first list to be the x values and the second list to be the y values.

  3. Change the plot to display lines between the data points by changing
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    to

    pyplot.plot([0, 2, 4, 8, 16, 32], "o-")
  4. Add x-values to the data by changing pyplot.plot([0, 2, 4, 8, 16, 32], "o-") to
    x_values = [0, 4, 7, 20, 22, 25]
    y_values = [0, 2, 4, 8, 16, 32]
    pyplot.plot(x_values, y_values, "o-")

    Note how matplotlib automatically resizes the graph to fit all of the points in the figure for you.

  5. Read about how to generate random integers on http://docs.python.org/library/random.html#random.randint. Then, instead of hard-coding x values and y values in basic_plot.py, generate a list of random y values. An example plot using random y values might look like this:
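
For step 5, one possible way to build the list is sketched below. It is an illustration, not the workshop's reference solution: the helper names and the range 0 to 100 are arbitrary choices made for this example. random.randint(a, b) returns an integer N with a <= N <= b, both endpoints included.

```python
import random

def random_y_values(count, low=0, high=100):
    """Return `count` random integers, each between low and high inclusive."""
    return [random.randint(low, high) for _ in range(count)]

def plot_values(y_values):
    """Plot y_values as dots joined by lines and show the window."""
    # Imported here so random_y_values can be tried without matplotlib.
    from matplotlib import pyplot
    pyplot.plot(y_values, "o-")
    pyplot.show()

y_values = random_y_values(10)
print(y_values)
```

Calling plot_values(y_values) opens the plot window. Because the values are random, each run draws a different graph.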

Read these short documents:

Check your understanding:

  • What does matplotlib pick as the x values if you don't supply them yourself?
  • What options would you pass to pyplot.plot to generate a plot with red triangles and dotted lines?


2. Plotting the world population over time

  1. Run python world_population.py. This will pop up a window with a dot plot of the world population over the last 10,000 years.
  2. Open world_population.py. Read through the code in this file. In this example, we read our data from a file. Open the data file world_population.txt and examine the format of the file.
  3. Find the documentation on http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot for customizing the linewidth of plots. Then change the world population plot to use a magenta, down-triangle marker and a linewidth of 2.
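
The reading step can be tried without the real data file. The example below is a sketch under assumptions: it supposes each line holds a year and a population separated by whitespace, and the two sample lines are placeholders rather than values from world_population.txt. Check the layout of the real file before relying on it.

```python
import io

# Placeholder lines standing in for the data file (not real figures).
sample_file = io.StringIO("1000 10\n2000 20\n")

years = []
populations = []
for line in sample_file.readlines():
    fields = line.split()            # break the line apart on whitespace
    years.append(int(fields[0]))     # first field: assumed to be the year
    populations.append(int(fields[1]))  # second field: assumed population

print(years, populations)
```

With the real file, replace sample_file with open("world_population.txt"). The two lists can then be passed to pyplot.plot as the x values and y values.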

World population resources:

Check your understanding:

  • In world_population.py, what does file("world_population.txt", "r").readlines() return?
  • In world_population.py, what does point.split() return?


3. Plotting life expectancy over time

In a new file, write code to plot the data in life_expectancies_usa.txt. The format in this file is <year>,<male life expectancy>,<female life expectancy>.
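
A minimal parsing sketch for that format, using the standard-library csv module. The function name is hypothetical, and the two sample rows are placeholders invented for the example, not values from life_expectancies_usa.txt.

```python
import csv
import io

def read_life_expectancies(lines):
    """Parse <year>,<male>,<female> rows into three parallel lists."""
    years, male, female = [], [], []
    for row in csv.reader(lines):
        if not row:                  # skip blank lines
            continue
        years.append(int(row[0]))
        male.append(float(row[1]))
        female.append(float(row[2]))
    return years, male, female

# Placeholder rows so the example runs without the data file.
sample = io.StringIO("1900,40.5,42.5\n1901,41.0,43.0\n")
years, male, female = read_life_expectancies(sample)
print(years, male, female)
```

With the real data, pass an open file instead: read_life_expectancies(open("life_expectancies_usa.txt")).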

You can call pyplot.plot multiple times to draw multiple lines on the same figure. For example:

pyplot.plot(my_data_1, "mo-", label="my data 1")
pyplot.plot(my_data_2, "bo-", label="my data 2")

will plot my_data_1 in magenta and my_data_2 in blue on the same figure.

Supply labels for your plots, like above. Then use pyplot.legend to give your graph a legend.

Your graph should look something like this:

To save your graph to a file instead of or in addition to displaying it, call pyplot.savefig.
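
Putting the pieces of this section together, here is a hedged sketch: two labelled lines, a legend, and a saved image. The function name, the sample numbers and the output filename are made up for illustration; only pyplot.plot, pyplot.legend and pyplot.savefig come from the text above.

```python
def plot_two_series(x_values, series_1, series_2, output_path):
    """Draw two labelled lines on one figure and save it to output_path."""
    # Imported inside the function so the file can be loaded without matplotlib.
    from matplotlib import pyplot

    pyplot.plot(x_values, series_1, "mo-", label="my data 1")  # magenta
    pyplot.plot(x_values, series_2, "bo-", label="my data 2")  # blue
    pyplot.legend()               # builds the legend from the label= values
    pyplot.savefig(output_path)   # file type is inferred from the extension
    pyplot.close()                # close the figure so later plots start fresh

# Placeholder data, not taken from any of the project files.
x_values = [1, 2, 3]
my_data_1 = [10, 20, 30]
my_data_2 = [15, 25, 35]
```

Calling plot_two_series(x_values, my_data_1, my_data_2, "my_plot.png") writes the image to my_plot.png in the current directory.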

Life expectancy resources:

Bonus exercises

1. Random category clues

Write a script that randomly chooses a category and prints clues from that category.


Tip: SQLite supports an "ORDER BY RANDOM()" clause that returns rows in a random order. For example, to randomly pick 1 category id you could use:

SELECT id FROM category ORDER BY RANDOM() LIMIT 1

You can also use ORDER BY to sort the clues by value.
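
A sketch of how the two queries might fit together from Python. The schema here is an assumption made for the example: the real category and clue tables may use different column names (name, category, value and text below are guesses), so inspect them with .schema at the sqlite prompt first. The rows are placeholders.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Assumed schema and placeholder rows; the real tables may differ.
connection.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE clue (id INTEGER PRIMARY KEY, category INTEGER,
                       value INTEGER, text TEXT);
    INSERT INTO category (id, name) VALUES (1, 'SAMPLE CATEGORY');
    INSERT INTO clue (category, value, text) VALUES (1, 400, 'second clue');
    INSERT INTO clue (category, value, text) VALUES (1, 200, 'first clue');
""")

# Pick one category at random.
category_id, category_name = connection.execute(
    "SELECT id, name FROM category ORDER BY RANDOM() LIMIT 1"
).fetchone()

# Fetch that category's clues, lowest value first. The ? placeholder
# lets sqlite3 substitute the id safely.
clues = connection.execute(
    "SELECT value, text FROM clue WHERE category = ? ORDER BY value",
    (category_id,),
).fetchall()

print(category_name)
for value, text in clues:
    print("[$%d] %s" % (value, text))
```

Against the real database, connect to jeopardy.db and drop the executescript call.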


Example output:

5 GUYS NAMED MOE
[$200] Last name of Moe of the Three Stooges
[$400] Moe Strauss founded this auto parts chain along with Manny Rosenfield & Jack Jackson
[$600] Major league catcher Moe Berg was also a WWII spy for this agency, precursor of the CIA
[$800] Term for the type of country music Moe Bandy plays, the clubs where he began, or the "Queen" he sang of in 1981
[$1000] This "Kool" rapper's album "How Ya Like Me Now" began a rivalry with LL Cool J

Exercises resources:


2. Random game categories

Write a script to randomly choose a game number and print the categories from that game.

Tip: the category table has game and round fields. Round 0 is the Jeopardy round, round 1 is the Double Jeopardy round, and round 2 is Final Jeopardy.
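
One way this might look, as a sketch rather than a reference solution. The game and round fields come from the tip above; the name column and all of the rows are assumptions made so the example can run on its own.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Assumed schema and placeholder rows; only game and round are known fields.
connection.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT,
                           game INTEGER, round INTEGER);
    INSERT INTO category (name, game, round) VALUES ('SAMPLE FINAL', 7, 2);
    INSERT INTO category (name, game, round) VALUES ('SAMPLE FIRST', 7, 0);
    INSERT INTO category (name, game, round) VALUES ('SAMPLE DOUBLE', 7, 1);
""")

# Pick a game number by choosing one random category row.
(game,) = connection.execute(
    "SELECT game FROM category ORDER BY RANDOM() LIMIT 1"
).fetchone()

# List that game's categories, grouped by round via ORDER BY.
rows = connection.execute(
    "SELECT round, name FROM category WHERE game = ? ORDER BY round",
    (game,),
).fetchall()

print("Categories for game #%d:" % game)
for round_number, name in rows:
    print(round_number, name)
```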


Example output:

Categories for game #136:
0 WELCOME TO MY COUNTRY
0 METALS
0 GO GO GAUGUIN
0 FILE UNDER "M"
0 ANIMATED CATS
0 MAO MAO MAO MAO
1 SHAKESPEARE'S OPENING LINES
1 HEY, MARIO!
1 BRIDGE ON THE RIVER....
1 RUNNING MATES
1 13-LETTER WORDS
1 TONY BENNETT'S SONGBOOK
2 IN THE NEWS 2000


3. Top 20 Jeopardy categories

Read about the GROUP BY clause and write a script using it to print the 20 most common Jeopardy categories.

An example of using GROUP BY and ORDER BY to produce an ordered list of counts on a hypothetical foo field is:

SELECT foo, COUNT(foo) AS count FROM my_table GROUP BY foo ORDER BY count
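
A runnable sketch of that pattern on a throwaway table. The table my_table and the field foo are the hypothetical names from the query above, and the rows are placeholders. Two additions are not in that query: ORDER BY sorts in ascending order by default, so DESC is added here to put the largest counts first, and LIMIT caps the number of rows returned.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (foo TEXT)")
connection.executemany(
    "INSERT INTO my_table (foo) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("a",), ("b",)],
)

# Count how often each foo value occurs and keep the two most common.
top = connection.execute(
    "SELECT foo, COUNT(foo) AS count FROM my_table "
    "GROUP BY foo ORDER BY count DESC LIMIT 2"
).fetchall()

for foo, count in top:
    print(count, foo)
```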

Example output:

81 LITERATURE
79 BEFORE & AFTER
73 WORD ORIGINS
71 SCIENCE
64 BUSINESS & INDUSTRY
63 AMERICAN HISTORY
...

Congratulations!

You've learned about SQL and making database queries from within Python. Keep practicing!