Matplotlib: Difference between revisions

From OpenHatch wiki
imported>Jesstess

Revision as of 19:00, 26 July 2012


Project

Learn how to plot data with the matplotlib plotting library. Ditch Excel forever!

Goals

  • practice reading data from a file
  • practice using the matplotlib Python plotting library to analyze data and generate graphs

Project setup

Mac OS X users only

If you do not already have a C compiler installed, you'll need one to install matplotlib. You have several options depending on your situation:

  1. Download and install Xcode (1.5 GB) from https://developer.apple.com/xcode/
  2. Download and install Command Line Tools for Xcode (175 MB) from https://developer.apple.com/downloads/index.action. This requires an Apple Developer account (free, but you have to sign up).
  3. Download and install kennethreitz's gcc installer (requires 10.6 or 10.7) from https://github.com/kennethreitz/osx-gcc-installer/

Please wave over a staff member and we'll help you pick the option that is best for your computer.

Install the project dependencies

Please follow the official matplotlib installation instructions at http://matplotlib.sourceforge.net/users/installing.html

The dependencies vary across operating systems. http://matplotlib.sourceforge.net/users/installing.html#build-requirements summarizes what you'll need for your operating system.

A universal dependency is the NumPy scientific computing library. NumPy has download and installation instructions at http://numpy.scipy.org/

Installing matplotlib and its dependencies is somewhat involved; please ask for help if you get stuck or don't know where to start!


Download and un-archive the Jeopardy database project skeleton code

Un-archiving will produce a JeopardyDatabase folder containing 3 Python files and one SQL database dump.

Create a SQLite database from the database dump

Inside JeopardyDatabase is a file called jeopardy.dump which contains a SQL database dump. We need to turn that database dump into a SQLite database.

Once you have SQLite installed, you can create a database from jeopardy.dump with:

sqlite3 jeopardy.db < jeopardy.dump

This creates a sqlite3 database called jeopardy.db

Test your setup

At a command prompt, start sqlite3 using the jeopardy.db database by running:

sqlite3 jeopardy.db

That should start a sqlite prompt that looks like this:

SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

At that sqlite prompt, type .tables and hit enter. That should display a list of the tables in this database:

sqlite> .tables
category  clue    
sqlite>
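If the sqlite3 command-line tool is not handy, the same setup and check can be sketched from Python with the standard sqlite3 module. This is only a rough equivalent: SAMPLE_DUMP, load_dump, and list_tables are made-up names for illustration, and the two-table dump below is a tiny stand-in for the real jeopardy.dump, whose schema may differ.

```python
import sqlite3

# Tiny stand-in for jeopardy.dump; the real dump is larger and its columns may differ.
SAMPLE_DUMP = """
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE clue (id INTEGER PRIMARY KEY, text TEXT);
"""


def load_dump(dump_sql, database_path=":memory:"):
    """Create a SQLite database by running every statement in a SQL dump."""
    connection = sqlite3.connect(database_path)
    connection.executescript(dump_sql)
    connection.commit()
    return connection


def list_tables(connection):
    """Return the table names, similar to typing .tables at the sqlite prompt."""
    cursor = connection.cursor()
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor.fetchall()]


connection = load_dump(SAMPLE_DUMP)
print(list_tables(connection))  # prints ['category', 'clue']
```

To try this on the real file, you could pass open("jeopardy.dump").read() as dump_sql and "jeopardy.db" as database_path.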

From a command prompt, navigate to the JeopardyDatabase directory and run

python jeopardy_categories.py

You should see a list of 10 jeopardy categories printed to the screen. If you don't, let a staff member know so you can debug this together.

Project steps

1. Create a basic plot

  1. Run python basic_plot.py. This will pop up a window with a dot plot of some data.
  2. Open basic_plot.py. Read through the code in this file. The meat of the file is in one line:
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    In this example, the first argument to pyplot.plot is the list of y values, and the second argument describes how to plot the data. If two lists had been supplied, pyplot.plot would consider the first list to be the x values and the second list to be the y values.

  3. Change the plot to display lines between the data points by changing
    pyplot.plot([0, 2, 4, 8, 16, 32], "o")

    to

    pyplot.plot([0, 2, 4, 8, 16, 32], "o-")
  4. Add x-values to the data by changing pyplot.plot([0, 2, 4, 8, 16, 32], "o-") to
    x_values = [0, 4, 7, 20, 22, 25]
    y_values = [0, 2, 4, 8, 16, 32]
    pyplot.plot(x_values, y_values, "o-")

    Note how matplotlib automatically resizes the graph to fit all of the points in the figure for you.

  5. Read about how to generate random integers on http://docs.python.org/library/random.html#random.randint. Then, instead of hard-coding x values and y values in basic_plot.py, generate a list of random y values. An example plot using random y values might look like this:
    Basic plot.png
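One possible way to approach the last step is sketched below. The helper names random_y_values and plot_random_points are invented for this example, and the range 0 to 100 is an arbitrary choice; basic_plot.py itself may be organized differently.

```python
import random


def random_y_values(count, low=0, high=100):
    """Return `count` random integers between low and high, inclusive."""
    return [random.randint(low, high) for _ in range(count)]


def plot_random_points(count=10):
    """Plot `count` random y values; matplotlib picks the x values."""
    # Imported here so random_y_values can be tried without matplotlib installed.
    from matplotlib import pyplot

    pyplot.plot(random_y_values(count), "o-")
    pyplot.show()
```

Calling plot_random_points() should pop up a window with a different plot on every run.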

Read these short documents:

Check your understanding:

  • What does matplotlib pick as the x values if you don't supply them yourself?
  • What options would you pass to pyplot.plot to generate a plot with red triangles and dotted lines?


2. Plotting the world population over time

  1. Run python world_population.py. This will pop up a window with a dot plot of the world population over the last 10,000 years.
  2. Open world_population.py. Read through the code in this file. In this example, we read our data from a file. Open the data file world_population.txt and examine the format of the file.
  3. Find the documentation on http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot for customizing the linewidth of plots. Then change the world population plot to use a magenta, down-triangle marker and a linewidth of 2.
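The sketch below shows one way the reading and plotting steps could fit together. It assumes each line of world_population.txt holds two whitespace-separated numbers (a year and a population); check the actual file before relying on that. parse_points and plot_population are hypothetical names, not functions from world_population.py.

```python
def parse_points(lines):
    """Split each line into an (x, y) pair and collect the two columns."""
    x_values, y_values = [], []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        x_values.append(float(fields[0]))
        y_values.append(float(fields[1]))
    return x_values, y_values


def plot_population(path="world_population.txt"):
    """Plot the data with magenta down-triangle markers and a linewidth of 2."""
    # Imported here so parse_points can be tried without matplotlib installed.
    from matplotlib import pyplot

    with open(path) as data_file:
        years, population = parse_points(data_file.readlines())
    # "m" is magenta, "v" is the down-triangle marker, "-" joins points with lines.
    pyplot.plot(years, population, "mv-", linewidth=2)
    pyplot.show()
```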

World population resources:

Check your understanding:

  • In world_population.py, what does file("world_population.txt", "r").readlines() return?
  • In world_population.py, what does point.split() return?


3. Plotting life expectancy over time

In a new file, write code to plot the data in life_expectancies_usa.txt. The format in this file is <year>,<male life expectancy>,<female life expectancy>.
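As a starting point, the lines of the file can be turned into three lists with str.split. The sketch below assumes there is no header line and that every line has exactly three comma-separated fields; parse_life_expectancies is a made-up name and the sample values are placeholders, not real data.

```python
def parse_life_expectancies(lines):
    """Turn '<year>,<male>,<female>' lines into three parallel lists."""
    years, male, female = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        year, male_value, female_value = line.split(",")
        years.append(int(year))
        male.append(float(male_value))
        female.append(float(female_value))
    return years, male, female


# Placeholder lines in the documented format, for illustration only.
sample_lines = ["2000,70.0,75.0\n", "2001,70.5,75.5\n"]
years, male, female = parse_life_expectancies(sample_lines)
print(years)  # prints [2000, 2001]
```

With the real file, you could pass open("life_expectancies_usa.txt").readlines() instead of sample_lines.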

You can call pyplot.plot multiple times to draw multiple lines on the same figure. For example:

pyplot.plot(my_data_1, "mo-", label="my data 1")
pyplot.plot(my_data_2, "bo-", label="my data 2")

will plot my_data_1 in magenta and my_data_2 in blue on the same figure.

Supply labels for your plots, like above. Then use pyplot.legend to give your graph a legend.

Your graph should look something like this:

Life expectancies.png

To save your graph to a file instead of or in addition to displaying it, call pyplot.savefig.
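Putting the labels, the legend, and the save step together might look like the sketch below. The function draw_life_expectancies, the "male" and "female" labels, the output file name, and the sample numbers are all assumptions made for this example. The pyplot module is passed in as an argument only so the drawing logic can be exercised without opening a window.

```python
def draw_life_expectancies(pyplot, years, male, female, output_path=None):
    """Draw both series on one figure, add a legend, and optionally save it."""
    pyplot.plot(years, male, "mo-", label="male")
    pyplot.plot(years, female, "bo-", label="female")
    pyplot.legend()  # uses the label= values given above
    if output_path is not None:
        pyplot.savefig(output_path)


def main():
    from matplotlib import pyplot

    # Placeholder values; replace them with the data read from the file.
    draw_life_expectancies(
        pyplot, [2000, 2001], [70.0, 70.5], [75.0, 75.5], "life_expectancies.png"
    )
    pyplot.show()
```

Running main() saves the figure before displaying it. Saving first is the safer order, since with some backends the figure is cleared once the window is closed.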

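Life expectancy resources:

  • File input and output: http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files
  • Splitting strings into parts based on a delimiter: http://www.hacksparrow.com/python-split-string-method-and-examples.html
  • Examples of legends:
  • Ways to configure your legend: http://matplotlib.sourceforge.net/api/legend_api.html
  • Saving your graph to a file: http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.savefig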

4. Tweak the existing Jeopardy scripts

1. Modify jeopardy_categories.py to print both the category and game number

tip: Remind yourself of the categories schema by running .schema category at a sqlite prompt.


Example output:

Example categories:

DETECTIVE FICTION (game #1)
THE OLD TESTAMENT (game #2)
ASIAN HISTORY (game #4)
RIVER SOURCES (game #5)
WORLD RELIGION (game #3)
SEAN SONG (game #2)
ANIMATED MOVIES (game #1)
NEW YORK CITY (game #6)
AFRICAN WILDLIFE (game #7)
LITTLE RED RIDING HOOD (game #8)

2. Modify jeopardy_clues.py to only print clues with an $800 value.

A good way to achieve this is by adding a WHERE clause to the SQL query in jeopardy_clues.py.
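A WHERE clause can be tried out on a small in-memory database before touching jeopardy_clues.py. The sketch below invents a clue table with text, answer, and value columns; the real table in jeopardy.db may use different column names, so treat the schema and the placeholder rows as assumptions.

```python
import sqlite3

# Hypothetical stand-in for the clue table; the real schema may differ.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE clue (text TEXT, answer TEXT, value INTEGER)")
cursor.executemany(
    "INSERT INTO clue (text, answer, value) VALUES (?, ?, ?)",
    [
        ("She also created the detectives Tuppence & Tommy Beresford",
         "Agatha Christie", 800),
        ("Placeholder clue A", "Placeholder answer A", 200),
        ("Placeholder clue B", "Placeholder answer B", 800),
    ],
)
connection.commit()


def clues_with_value(cursor, value):
    """Return (text, answer) rows whose value matches, using a WHERE clause."""
    # The ? placeholder lets sqlite3 fill in the value safely.
    cursor.execute(
        "SELECT text, answer FROM clue WHERE value = ? ORDER BY text", (value,)
    )
    return cursor.fetchall()


for text, answer in clues_with_value(cursor, 800):
    print("[$800]")
    print("A: " + text)
    print("Q: What is '" + answer + "'")
```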

Read about WHERE clauses in this short document:


Example output:

Example clues:

[$800]
A: She also created the detectives Tuppence & Tommy Beresford
Q: What is 'Agatha Christie'

[$800]
A: According to this Old Testament book, this "swords into plowshares" prophet walked naked for 3 years
Q: What is 'Isaiah'
...


5. Daily Doubles

Write a script that prints 10 daily doubles and their responses.


tip: The clue table has an isDD field.


Example output:

Category: NEW YORK CITY
Question: The heart of Little Italy is this street also found in a Dr. Seuss book title
Answer: Mulberry Street
===
Category: RIVER SOURCES
Question: This Mideastern boundary river rises on the slopes of Mount Hermon
Answer: the Jordan
===
Category: ROOM
Question: The Titanic has 3 rooms for this--only men were allowed there, as women weren't supposed to do it in public
Answer: smoking
...

Bonus exercises

1. Random category clues

Write a script that randomly chooses a category and prints clues from that category.


tip: SQL supports an "ORDER BY RANDOM()" clause that will return rows in a random order. For example, to randomly pick 1 category id you could use:

SELECT id FROM category ORDER BY RANDOM() LIMIT 1

You can also use ORDER BY to sort the clues by value.
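The two ideas in this tip can be combined as in the sketch below. It runs against a made-up in-memory schema (a category table with id and name, and a clue table with category, value, and text), which is only a guess at how the real tables relate; random_category_clues is likewise a hypothetical helper.

```python
import sqlite3

# Hypothetical stand-in tables; check the real schema with .schema first.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT)")
cursor.execute("CREATE TABLE clue (category INTEGER, value INTEGER, text TEXT)")
cursor.executemany(
    "INSERT INTO category (id, name) VALUES (?, ?)",
    [(1, "5 GUYS NAMED MOE"), (2, "METALS")],
)
cursor.executemany(
    "INSERT INTO clue (category, value, text) VALUES (?, ?, ?)",
    [
        (1, 400, "Moe Strauss founded this auto parts chain along with "
                 "Manny Rosenfield & Jack Jackson"),
        (1, 200, "Last name of Moe of the Three Stooges"),
        (2, 400, "Placeholder clue B"),
        (2, 200, "Placeholder clue A"),
    ],
)
connection.commit()


def random_category_clues(cursor):
    """Pick one category at random and return its clues sorted by value."""
    cursor.execute("SELECT id, name FROM category ORDER BY RANDOM() LIMIT 1")
    category_id, name = cursor.fetchone()
    cursor.execute(
        "SELECT value, text FROM clue WHERE category = ? ORDER BY value",
        (category_id,),
    )
    return name, cursor.fetchall()


name, clues = random_category_clues(cursor)
print(name)
for value, text in clues:
    print("[$%d] %s" % (value, text))
```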


Example output:

5 GUYS NAMED MOE
[$200] Last name of Moe of the Three Stooges
[$400] Moe Strauss founded this auto parts chain along with Manny Rosenfield & Jack Jackson
[$600] Major league catcher Moe Berg was also a WWII spy for this agency, precursor of the CIA
[$800] Term for the type of country music Moe Bandy plays, the clubs where he began, or the "Queen" he sang of in 1981
[$1000] This "Kool" rapper's album "How Ya Like Me Now" began a rivalry with LL Cool J

Exercises resources:


2. Random game categories

Write a script to randomly choose a game number and print the categories from that game.

tip: the category table has game and round fields. round 0 is the Jeopardy round, round 1 is the Double Jeopardy round, and round 2 is Final Jeopardy.


Example output:

Categories for game #136:
0 WELCOME TO MY COUNTRY
0 METALS
0 GO GO GAUGUIN
0 FILE UNDER "M"
0 ANIMATED CATS
0 MAO MAO MAO MAO
1 SHAKESPEARE'S OPENING LINES
1 HEY, MARIO!
1 BRIDGE ON THE RIVER....
1 RUNNING MATES
1 13-LETTER WORDS
1 TONY BENNETT'S SONGBOOK
2 IN THE NEWS 2000


3. Top 20 Jeopardy categories

Read about the GROUP BY clause and write a script using it to print the 20 most common Jeopardy categories.

An example of using GROUP BY and ORDER BY to produce an ordered list of counts on a hypothetical foo field is:

SELECT foo, COUNT(foo) AS count FROM my_table GROUP BY foo ORDER BY count
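To list the most common values first, the ordering needs to be descending and the result needs a limit. The sketch below shows that pattern on a small in-memory table; the category table with name and game columns is a simplified assumption, and the rows are invented so the counts are easy to check.

```python
import sqlite3

# Simplified, hypothetical category table with invented rows.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE category (name TEXT, game INTEGER)")
cursor.executemany(
    "INSERT INTO category (name, game) VALUES (?, ?)",
    [
        ("SCIENCE", 1), ("LITERATURE", 1),
        ("SCIENCE", 2), ("LITERATURE", 2),
        ("LITERATURE", 3), ("METALS", 3),
    ],
)
connection.commit()


def most_common_categories(cursor, limit):
    """Return (name, count) pairs, most frequent first, at most `limit` rows."""
    cursor.execute(
        "SELECT name, COUNT(name) AS total FROM category "
        "GROUP BY name ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


print(most_common_categories(cursor, 2))  # prints [('LITERATURE', 3), ('SCIENCE', 2)]
```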

Example output:

81 LITERATURE
79 BEFORE & AFTER
73 WORD ORIGINS
71 SCIENCE
64 BUSINESS & INDUSTRY
63 AMERICAN HISTORY
...

Congratulations!

You've learned about SQL and making database queries from within Python. Keep practicing!
