Community Data Science Workshops (Spring 2014)/Reflections: Difference between revisions

Revision as of 22:53, 10 July 2014

Over three weekends in Spring 2014, a group of volunteers organized the Community Data Science Workshops (CDSW): a series of four sessions designed to introduce absolute beginners to some of the basic tools of programming and of analyzing data from online communities. The CDSW were held between April 4th and May 31st, 2014, at the University of Washington, Seattle.

This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW. This includes future versions of ourselves.

In feedback, both the mentors and the students suggested that the workshops were a huge success. Students reported that they learned an enormous amount and benefited greatly. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the projects.

Structure

We had four sessions:

Session 0 (Friday April 4th): Setup and Programming Practice
Session 1 (Saturday April 5th): Introduction to Python
Session 2 (Saturday May 3rd): Building data sets using web APIs
Session 3 (Saturday May 31st): Data analysis and visualization

Our organization and the curriculum for Sessions 0 and 1 were borrowed from the Boston Python Workshop (BPW). Session 0 was a three-hour evening session to install software. The other sessions were all day-long sessions (10am to 4pm) broken up into the following schedule:

Morning, 10am-noon: A 2-hour lecture
Lunch, noon-1pm
Afternoon, 1pm-3:30pm: Practice working on projects in 3 breakout sessions
Wrap-up, 3:30pm-4pm: Wrap-up, next steps, and upcoming opportunities

We did not take roll or even track how many people were present. Our feeling was that nearly every student who came to the first weekend (Sessions 0 and 1) came to Session 2. Retention between the last two sessions was much worse, with perhaps only 60% of the full group returning for Session 3. We attribute this both to poor timing (the weekend before finals at UW) and to the long gap between the sessions.

Morning Lectures

Benjamin Mako Hill gave all three of the two-hour lectures. All of the lectures involved the teacher working through material in an interactive Python interpreter with students following along on their own computers. In general, the lectures were well received by students.

Concerns with the lectures included the feeling that:

Two hours of straight lecture on difficult material was too long.
If students got lost, it could be very hard to catch up given how the interactive session tended to build on earlier steps.
There were often more mentors than needed in the morning sessions, meaning that many mentors were idle.
As the lectures progressed and the tasks became more complex, working in the interactive interpreter became increasingly difficult, particularly for very long programs.

To address these concerns, we've suggested the following changes:

Break up the lecture into at least two parts. Between those parts, include a short (10-15 minute) exercise. This will break things up, allow mentors to be of more help, and give students who fell behind a chance to catch up. It will also allow students to grab coffee and such.
Record the lectures so that students can catch up after the fact.
Arrange for some mentors to arrive after noon if they'd prefer.
Upload not only the outline but also examples of all of the code that we will run interactively.
Switch to writing code in files and running those files much earlier, perhaps as soon as we hit more than 2-3 lines in a for loop in Session 1. This might make writing these loops more useful, in that they can be reused by students, and will introduce the idea of writing and running code in a file (as opposed to a REPL environment) much earlier.
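
As a rough illustration of the last suggestion, a loop that has grown past a few lines could be saved in a file and run from the system shell instead of being retyped in the interpreter. The file name `count_words.py`, the word list, and the function name are all made up for this sketch; they are not taken from the workshop material.

```python
# count_words.py (hypothetical file name)
# Run from the system shell with: python count_words.py

def count_long_words(words, min_length=5):
    """Return how many words have at least min_length letters."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count = count + 1
    return count


# A tiny made-up word list; a real exercise would load a larger one.
words = ["data", "science", "workshop", "python", "api"]
print(count_long_words(words))
```

Because the loop lives in a file, a student who makes a mistake can edit one line and run the file again, rather than retyping the whole loop.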

Projects

In the afternoon, we broke into small groups to work on projects. In each session we tried to have two projects on different topics for learners with different interests and a third project which was self-directed.

In Sessions 1 and 2, the self-directed projects were based on working through examples from Codecademy that we had put together and aggregated from material already online. In the Codecademy room, students could work at their own pace and there were mentors on hand to work with them. In Session 3, we did not use Codecademy but instead had a room devoted to students working with mentors on data science projects of their choice. In this case, because of issues with the student-to-mentor ratio, we asked that students only participate in this session if they felt they could be self-sufficient and were willing to work on their own 70-80% of the time, with mentor help the rest of the time.

In all other breakout sessions, students would download a prepared example in the form of a zip or tar.gz file. In each case, these projects would include:

All of the libraries necessary to run the examples (e.g., Tweepy for the Twitter example).
All of the data necessary to run the example programs (e.g., a full English wordlist).
Any other necessary code or libraries we had written for the example.
A series of small numbered example programs (~5-10 examples). Each tried to be sparse, well documented, and not more than 10-15 lines of Python. Each program would do something concrete but also provide an example for learners to modify.

On average, about 1/3 of each session was interactive lecture, in which the lead mentor would walk through one or more of the examples and explain the code in detail.

For most of each session, however, the lead mentor would present a list of increasingly difficult challenges for the entire group (often as comments in the source code of an example project).

Learners would work on these challenges at their own pace, with mentors on hand for help. If the group was stuck on a concept or tool, the lead mentor would bring the group back together to walk through the concept with everyone, using the project.

In some cases, more advanced students could "jump ahead" and begin working on their own challenges or changing the code to work in different ways. This was welcome and encouraged.

Session 0: Python Setup

Challenges:

Users on Windows struggled to get Python set up.
Users had different (and often older) versions of Python, which became a bigger issue when we began using URL parsing libraries.
Mac users struggled with, and generally did not like, Smultron.

Proposed changes:

Use Anaconda to install Python, as SWC does.
Use a different text editor for MacOS. TextWrangler was suggested.
http://repl.it looks intriguing but is perhaps neither ready enough nor "real" enough.
Emphasize more strongly that Windows users need to come to Session 0.
Change the Codecademy lessons to remove or change the HTML example. Users who knew HTML already were often confused because printing "<b>foo</b>" did not result in actually bolded text. This was just the wrong choice for a simple string concatenation example.
Add some text to emphasize the difference between the Python shell and the system shell. Students were confused about this until the end.
Add a new check-off step that includes the following: create a file, save it, and run it.
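
One possible shape for that check-off step is sketched below, assuming a file called `checkoff.py` (a made-up name). Running it also reports the Python version, which could help mentors spot the version differences mentioned above. The command in the comment is typed at the system shell, not at the Python `>>>` prompt.

```python
# checkoff.py (hypothetical file name)
# Save this file, then run it from the system shell (not the Python shell):
#     python checkoff.py
import sys


def describe_python():
    """Return a short string naming the running Python version."""
    major, minor = sys.version_info[0], sys.version_info[1]
    return "Running Python {}.{}".format(major, minor)


print(describe_python())
```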

Session 1: Introduction to Python

The curriculum for BPW has been used many times, is well tested, and worked well for us too. That said, there are several things we will change when we do the material again:

If possible, we would have liked to do introductions up front (i.e., a simple "your name, where you are from, and what you want to do"), which would have been useful even in a big group.
The BPW examples were not focused on data and were more classic computer science projects. In the future, we would like to choose some examples that are a little more data focused.

In terms of the afternoon sessions, we felt that the Colorwall example was way too complicated. It introduced many features and concepts that nobody had seen up front.

The Wordplay example was much better in this regard. In particular, what we liked about Wordplay was that it was broken up into a series of small example programs that each did one small thing.

This provided us with an opportunity to walk through the example and then pose challenges to students to do something concrete. Students could look through their example programs and build up from there. We felt that this was much more useful than in Colorwall where there were several large conceptual hurdles.

In the future, we want to build more data-focused examples as well. Our current thought is to build a little example, not entirely unlike Colorwall, that involves parsing and searching through the complete works of Shakespeare.

Session 2: Learning APIs

Mentors and students felt that this session was more successful and effective than any other session, including, surprisingly, the widely tested BPW-based session.

Morning Lecture

The morning lecture was well received, if delivered too quickly by Benjamin Mako Hill. Unsurprisingly, the example of PlaceKitten as an API was an enormous hit.

Generally speaking, explaining what APIs are is difficult. In particular, it's useful to explicitly say that we are focused on web APIs and that APIs are protocols or languages. Learners frequently wanted to ask questions like, "Where in the program is the API?" The API, of course, is the protocol that describes what a client can ask for and what they can expect to receive back. Preparing a concise answer to this question ahead of time is worthwhile.
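
One way to make the "where is the API?" answer concrete is to show that the API is just the agreed-upon shape of the request. The sketch below assumes PlaceKitten's width-and-height URL pattern, as we remember it from the service's public description. The function name is made up, and the code only builds the URL without fetching anything.

```python
# The "API" here is the URL convention itself: a width and a height in the path.
# Nothing in this file "contains" the API; the code just follows the convention.

def placekitten_url(width, height):
    """Build a PlaceKitten image URL for the given size in pixels."""
    return "http://placekitten.com/{}/{}".format(width, height)


print(placekitten_url(200, 300))
```

The client asks by following the URL convention, and the service answers with an image of that size. That convention, not any particular line of Python, is the API.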

Although there was some debate among the mentors, if there is one thing we might remove from the curriculum for a future session, it might be JSON. The reason it seemed less useful is that most of the APIs that most learners plan to use (e.g., Twitter) already have Python interfaces in the form of modules. In this sense, spending 1/4 of a lecture learning how to parse JSON objects seems like a poor use of time. On the other hand, spending time looking at JSON objects provides practice thinking about more complex data structures (e.g., nested lists and dictionaries), which is something that is necessary and that students will otherwise not be prepared for.
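
To illustrate the kind of nested structure at issue, here is a minimal sketch using Python's standard `json` module. The response text is invented for this example and is not taken from any real API.

```python
import json

# Made-up response text, loosely shaped like what a web API might return.
response_text = """
{
  "query": "seattle",
  "results": [
    {"title": "Seattle", "categories": ["Cities in Washington", "Port cities"]},
    {"title": "Space Needle", "categories": ["Towers", "Landmarks in Seattle"]}
  ]
}
"""

# json.loads turns the text into ordinary Python dictionaries and lists.
data = json.loads(response_text)

# A dictionary containing a list of dictionaries, each containing a list.
first_result = data["results"][0]
print(first_result["title"])

for result in data["results"]:
    print(result["title"], len(result["categories"]))
```

Working out that `data["results"][0]["categories"]` is a list inside a dictionary inside a list inside a dictionary is exactly the practice the paragraph above describes.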

Afternoon Sessions

In our session, more than 2/3 of students were interested in learning the Twitter API, and that session was heavily attended.

In the Twitter session, discoverability on the Tweepy objects was a challenge. Users will have an object, but it's not easy to introspect those objects and see what's there in the same way you can with a JSON object. This came as a surprise to us and required some real-time consultation with the Tweepy documentation.
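
One thing that might have helped is showing learners Python's built-in `dir()` and `vars()` functions. The sketch below uses a stand-in class, `FakeStatus`, which is invented for illustration and is not a real Tweepy class. Whether `vars()` works on a given library object depends on how that object is implemented.

```python
class FakeStatus(object):
    """Stand-in for an object returned by an API wrapper (not a real Tweepy class)."""

    def __init__(self, text, screen_name):
        self.text = text
        self.screen_name = screen_name


def public_names(obj):
    """List attribute names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


status = FakeStatus("hello world", "example_user")

# dir() lists the attribute names; vars() shows names with their values
# for objects that store their attributes in a __dict__.
print(public_names(status))
print(vars(status))
```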

The Wikipedia session ended up spending very little time working with the example code we had prepared. Instead, we worked directly from the examples from the morning and wrote code almost from scratch while looking directly at the API.

Our session focused on building a version of the game Catfishing. Essentially, we set out to write a program that would get a list of categories for a set of articles, randomly select an article, and then show its categories back to the user to have them "guess" the article. We modified the program to not include obvious giveaways (e.g., to remove categories that include the answer itself as a substring).
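
The giveaway filter can be sketched roughly as follows. This is not the code written in the session: the function names and the small dictionary of articles are made up, and the step that fetches categories from the Wikipedia API is left out so the example runs offline.

```python
import random


def remove_giveaways(title, categories):
    """Drop categories that contain the article title (case-insensitive)."""
    title_lower = title.lower()
    return [c for c in categories if title_lower not in c.lower()]


def pick_article(articles, rng=random):
    """Pick a random title and return it with its filtered categories."""
    title = rng.choice(sorted(articles))
    return title, remove_giveaways(title, articles[title])


# Made-up data standing in for categories fetched from the API.
articles = {
    "Seattle": ["Cities in Washington (state)", "History of Seattle", "Port cities"],
    "Python": ["Programming languages", "Python implementations"],
}

title, clues = pick_article(articles)
print("Guess the article from these categories:")
for clue in clues:
    print(" -", clue)
```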

Both sessions worked well and received good feedback.

In future sessions, we might like to focus on other APIs including, perhaps, APIs that do not have Python modules, which would provide a stronger non-pedagogical reason to focus on reading and learning JSON.

SimpleAPIs might have been a good example of something we could do as a small group exercise between parts of the lecture.

Session 3: Data analysis and visualization

We covered basic data manipulation.

Afternoon Sessions

Matplotlib was very tough to install.

Maybe Anaconda would help?
Heatmaps were a hit. They are hard to do in other software, but that worked out well.
Focus more on things that folks can't do with their spreadsheet.

Seeing the Harry Potter graph was a complete hit.

Final Thoughts

We want to focus on getting people more toward independence.

People didn't quite make it all the way.

Our final session seemed to let out on a bit of a low point in the class.

One suggestion is to have a final session with no lecture or curriculum. People can come, and mentors will be there to work with them on projects.

Of course, we want everybody to come, so we should have a set of "random" projects for folks who don't have one already.

Logistical Observations

Budget

For lunch we spent $400 (pizza), $360 (less pizza), and $600 (fancy Indian food at the last session). This was for 50 students and 18 mentors, but we assumed about 60 people would actually be there. We also spent $50 in the mornings for coffee.

Most mentors could not make the after-session, so we spent about $100 per session on mentor dinners. If more people had shown up, it would have been closer to $200-250 per mentor dinner.

The rooms were free.

If you had a total budget on the order of $2000-2500, I think you could easily run a similar series of 3.5 day-long sessions.

Things we should do differently

The spacing between sessions was too long. Every other week?

Breaks for lunch were a bit too long. 45 minutes should be enough. Folks were interested in getting back into action. Food was simple and always there on time, so we could have just run with this.

The general structure of the entire series was not as clear as it could have been. This was at least in part because the details of what we would teach in the later sessions were not done when we started.

Maybe include a spot where we can talk for 10-15 minutes about how to use documentation.

More Windows-experienced mentors.

Getting to the right directory was a challenge, as was understanding the path and the idea that files and datasets need to be local to the place the script is run. That was unclear.
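
One way to sidestep part of this confusion, offered only as a sketch, is to have example scripts look for their data next to the script file rather than in whatever directory the shell happens to be in. The helper name and the file names below are made up.

```python
import os


def path_next_to(script_file, filename):
    """Return the full path of filename in the same directory as script_file."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)


# The current working directory is where the shell was when Python started.
# Relative file names are looked up here, which is what tripped people up.
print(os.getcwd())

# In a real script, pass __file__ so the data file is found no matter
# which directory the script is launched from:
#     wordlist_path = path_next_to(__file__, "words.txt")
print(path_next_to("examples/wordplay.py", "words.txt"))
```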

Sticky notes worked super well but gained less value as we went along.

Things to teach:

Debugging
Reading documentation
Troubleshooting and looking at StackExchange
