Community Data Science Workshops (Spring 2014)/Reflections

{{CDSW Moved}}
Over three weekends in Spring 2014, a group of volunteers organized the [[Community Data Science Workshops (Spring 2014)]] (CDSW), the first series of four sessions designed to introduce some of the basic tools of programming and analysis of data from online communities to absolute beginners. This version of the [[CDSW]] was held between April 4th and May 31st in 2014 at the University of Washington in Seattle.
 
This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW — including the authors!
 
In general, mentors and students suggested that the workshops were a huge success. Students reported that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which are detailed below.
 
If you have any questions or issues, you can contact [[Benjamin Mako Hill]] directly or can email the whole group of mentors at cdsw-sp2014-mentors@uw.edu.
 
== Structure ==
 
The [[Community Data Science Workshops (Spring 2014)]] consisted of [[Community Data Science Workshops (Spring 2014)#Schedule|four sessions]]:

* '''Session 0 (Friday April 4th)''': [[Community Data Science Workshops (Spring 2014)#Session 0 (Friday April 4th Evening 6-9pm)|Setup and Programming Practice]]
* '''Session 1 (Saturday April 5th)''': [[Community Data Science Workshops (Spring 2014)#Session 1 (Saturday April 5th)|Introduction to Python]]
* '''Session 2 (Saturday May 3rd)''': [[Community Data Science Workshops (Spring 2014)#Session 2 (Saturday May 3rd)|Building data sets using web APIs]]
* '''Session 3 (Saturday May 31st)''': [[Community Data Science Workshops (Spring 2014)#Session 3 (Saturday May 31st)|Data analysis and visualization]]
 
Our organization and the curriculum for Sessions 0 and 1 were borrowed from the [http://bostonpythonworkshop.com/ Boston Python Workshop] (BPW). Session 0 was a three-hour evening session to install software. The other sessions were all day-long sessions (10am to 4pm) broken up into the following schedule:
=== Projects ===
 
In the afternoons, we broke into small groups to work on "projects". Each afternoon, we tried to have three project tracks: two projects on different substantive topics for learners with different interests, and a third project that was much more self-directed.
 
In Sessions 1 and 2, the self-directed projects were based on working through examples from [http://www.codecademy.com/ Codecademy] that we had put together from material already online on the website. In the self-directed track, students could work at their own pace with mentors on hand to work with them when they became stuck.
 
* All of the libraries necessary to run the examples (e.g., [http://www.tweepy.org/ Tweepy] for the Session 2 Twitter track).
* All of the data necessary to run the example programs (e.g., a full English word list for the Wordplay example).
* Any other necessary code or libraries we had written for the example.
* A series of small numbered example programs (~5-10 examples). Each example program attempted to be sparse, well documented, and no more than 10-15 lines of Python code. Each program tried both to do something concrete and to provide an example for learners to modify. Although it was not always possible, the example programs tried to use only Python concepts we had covered in class.
 
On average, the non-self-directed afternoon tracks consisted of about 30% impromptu lecture, in which a designated lead mentor would walk through one or more of the examples, explaining the code and concepts in detail and answering questions.

Afterwards, the lead mentor would present a list of increasingly difficult challenges for the entire group to work on sequentially. These were usually written on a whiteboard or projected and were often added to dynamically based on student feedback and interest.
 
Learners would work on these challenges at their own pace, asking mentors for help as needed. If the group was stuck on a concept or tool, the lead mentor would bring the group back together to walk through the concept with the full group, using the project as the example.
== Session 0: Python Setup ==
 
The goal of this session was to get users set up with Python and starting to learn some of the basics. The setup curriculum was adapted from BPW. We ran into the following challenges:

* Users on Windows struggled to get Python set up and added to their path.
== Session 1: Introduction to Python ==
 
The goal of this session was to teach the basics of programming in Python. The curriculum for BPW has been used many times and is well tested. Unsurprisingly, it worked well for us as well.

That said, there are several things we will change when we teach the material again:

* If possible, we would have liked to do introductions up front (i.e., a simple "your name, where you are from, and what you want to do"), even in a big group. This seems more important in a multi-day event and would have been useful for the mentors.
* The BPW projects were not focused on data and were more like classic computer science class projects. In the future, we would like to choose some examples that are a little more data focused.
 
=== Afternoon sessions ===
 
In terms of the afternoon sessions, we felt that the [[ColorWall]] example was ''way'' too complicated. It introduced many features and concepts that nobody had seen before and many users were flustered.

The [[Wordplay]] project was much better in this regard. In particular, we liked that Wordplay was broken up into a series of small example projects that each did one small thing. This provided us with an opportunity to walk through the example and then pose challenges to students to make changes to the code.
 
In the future, we will replace [[ColorWall]] with another, more data-focused example. Our current thought is to build a little example that involves iterating through a pre-parsed version of the complete works of Shakespeare.
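As a rough idea of what such an example might look like, here is a minimal sketch that counts word frequencies across lines of text. The <code>lines</code> list is a hypothetical stand-in for a pre-parsed corpus and the function name is made up for illustration; this is not an actual workshop example.

```python
from collections import Counter


def count_words(lines):
    """Count how often each lowercase word appears across a list of lines."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            # Strip common punctuation from both ends of the word.
            word = word.strip(".,;:!?'\"")
            if word:
                counts[word] += 1
    return counts


# Hypothetical stand-in for a pre-parsed corpus: one string per line of text.
lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
]

counts = count_words(lines)
print(counts.most_common(3))
```

Learners could then be asked to modify the program, for example to ignore very common words or to search for a word of their choice.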
 
== Session 2: Learning APIs ==
 
The goal of this session was to describe what web APIs are, how they work (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.
 
Mentors and students felt that this was the most successful and effective session of all, surprisingly even compared with the much more widely tested BPW session.
 
=== Morning lecture ===
 
The morning lecture was well received, if delivered too quickly by Benjamin Mako Hill. Unsurprisingly, the example of [http://placekitten.com/ PlaceKitten] as an API was an enormous hit: informative ''and'' cute.
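Part of the appeal of PlaceKitten as a first API is that the entire request fits in the URL. The sketch below assumes the <code>http://placekitten.com/WIDTH/HEIGHT</code> URL scheme the site documented at the time; the helper name is made up for illustration.

```python
def placekitten_url(width, height):
    """Build the URL for a placeholder kitten image of the given size in pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return "http://placekitten.com/{}/{}".format(width, height)


# Pasting the printed URL into a browser is the "request";
# the image that comes back is the "response".
print(placekitten_url(200, 300))
```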
 
Defining APIs was difficult. In particular, the general ambiguity around the use of the term, and the difference between APIs in general and web APIs, should be foregrounded. Learners frequently wanted to ask questions like, "Where in this Python program is the API?" It was difficult for some to grasp that the API is the ''protocol'' that describes what a client can ask for and what they can expect to receive back. Preparing a concise answer to this question ahead of time would have been worthwhile. We spent too much time on this in the session.
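One way to make the "protocol" idea concrete is to show that a request is just a URL with named parameters, which exists independently of any Python code. The sketch below builds, but does not send, a request to the English Wikipedia API. The choice of parameters (asking for the categories of one article) and the function name are illustrative assumptions, not the lecture's actual example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_category_query(title):
    """Return the URL that asks the API for the categories of one article."""
    params = {
        "action": "query",     # what kind of request this is
        "prop": "categories",  # which property of the page we want
        "titles": title,       # which page we are asking about
        "format": "json",      # how the answer should be encoded
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_category_query("Harry Potter")
print(url)
```

The Python program only assembles the question; the API is the agreement about which parameters are understood and what comes back.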
 
Although there was some debate among the mentors, if there is one thing we might remove from the curriculum for a future session, it would probably be JSON. The reason it seemed less useful is that most of the APIs that most learners plan to use (e.g., Twitter and Wikipedia) already have Python interfaces in the form of modules. In this sense, spending 30 minutes of a lecture to learn how to parse JSON objects seems like a poor use of time.
 
On the other hand, time spent looking at JSON objects provides practice thinking about more complex data structures (e.g., nested lists and dictionaries), which is something that is necessary and that students will otherwise not be prepared for. We were undecided as a group.
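To illustrate the kind of nesting involved, here is a minimal sketch that parses a JSON string and walks through dictionaries inside dictionaries and a list of dictionaries. The sample response is hand-written for illustration and only loosely shaped like a web API reply; it is not real API output.

```python
import json

# Hand-written sample, loosely shaped like a reply from a web API.
# This is not real API output.
sample_response = """
{
  "query": {
    "pages": {
      "12345": {
        "title": "Example article",
        "categories": [
          {"title": "Category:Fantasy novels"},
          {"title": "Category:British novels"}
        ]
      }
    }
  }
}
"""

# JSON objects become Python dictionaries and JSON arrays become lists.
data = json.loads(sample_response)

category_titles = []
for page in data["query"]["pages"].values():   # a dictionary inside a dictionary
    for category in page["categories"]:        # a list of dictionaries
        category_titles.append(category["title"])

print(category_titles)
```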
 
=== Afternoon sessions ===
 
In our session, more than 60% of students were interested in learning Twitter and that track was heavily attended.
 
In the Twitter track, discoverability of the structure of [http://www.tweepy.org/ Tweepy] objects was a challenge. Users would create an object but it was not easy to introspect those objects and see what is there in the same way we had discussed with JSON objects. This came as a surprise to us and required some real-time consultation with the [http://tweepy.readthedocs.org/en/v2.3.0/ Tweepy module documentation].
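For objects that are not plain dictionaries, Python's built-in <code>dir()</code> and <code>vars()</code> offer a partial substitute for printing a JSON object. The sketch below uses a made-up <code>Status</code> class as a stand-in; it is not a real Tweepy class, and how much these built-ins reveal about a given library's objects is an assumption to check against that library's documentation.

```python
class Status:
    """Made-up stand-in for an object returned by an API wrapper module."""

    def __init__(self, text, author):
        self.text = text
        self.author = author


def public_attributes(obj):
    """Return the names of an object's attributes that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


status = Status("hello world", "example_user")
print(public_attributes(status))  # attribute names only
print(vars(status))               # names and values stored on the instance
```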
 
The Wikipedia session ended up spending very little time working with the example code we had prepared. Instead, we worked directly from examples in the morning and wrote code almost entirely from scratch while looking directly at the output from the API.
 
Our session focused on building a version of the [http://kevan.org/catfishing.php game Catfishing]. Essentially, we set out to write a program that would get a list of categories for a set of articles, randomly select one of those articles, and then show the categories associated with that article back to the user to have them "guess" the article. We modified the program to not include obvious giveaways (e.g., to remove categories that include the answer itself as a substring).
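The core logic of such a game can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code written in the session: the <code>articles</code> dictionary is hypothetical hand-written data standing in for categories fetched from the API, and the function names are made up for illustration.

```python
import random


def clue_categories(answer, categories):
    """Drop categories that contain the answer itself as a substring."""
    answer = answer.lower()
    return [c for c in categories if answer not in c.lower()]


def pick_round(articles, rng=random):
    """Pick one article at random and return (answer, clues)."""
    answer = rng.choice(sorted(articles))
    return answer, clue_categories(answer, articles[answer])


# Hypothetical data: article title -> list of category names.
articles = {
    "Seattle": [
        "Cities in Washington (state)",
        "Seattle metropolitan area",
        "Port cities",
    ],
    "Python": [
        "Programming languages",
        "Python (programming language)",
    ],
}

answer, clues = pick_round(articles)
print("Which article has these categories?")
for clue in clues:
    print(" -", clue)
```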
 
Both sessions worked well and received positive feedback.
 
In future sessions, we might like to focus on other APIs including, perhaps, APIs that do not have Python modules. This would provide a stronger non-pedagogical reason to focus on reading and learning JSON. Working with simple APIs might have been a good example of something we could do as a small group exercise between parts of the lecture.
 
== Session 3: Data Analysis and Visualization ==
 
The goal of this session was to get users to the point where they could take data from a web API and ask and answer basic data science questions by using Python to manipulate data and by creating simple visualizations.
 
Our philosophy in Session 3 was to teach users to get data into tools they already know and use. We thought this would be a better use of their time and help make users independent earlier.
 
Based on feedback from the application, we know that almost every user who attended our sessions had at least basic experience with spreadsheets and using spreadsheets to create simple charts. We tried to help users process data using Python into formats that they could load up in existing tools like ''LibreOffice'', ''Microsoft Excel'', or ''Google Docs''.
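One format that all of these tools can open is CSV. The following is a minimal sketch using Python's standard <code>csv</code> module; the rows and field names are hypothetical, and the session may have used a different format or approach.

```python
import csv
import io

# Hypothetical rows: one dictionary per record.
rows = [
    {"title": "Example article", "user": "ExampleUser", "minor": True},
    {"title": "Another article", "user": "AnotherUser", "minor": False},
]


def write_csv(rows, fieldnames, outfile):
    """Write a list of dictionaries as CSV with a header row."""
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer is used here so the result can be printed.
# To write a real file, pass open("edits.csv", "w", newline="") instead.
buffer = io.StringIO()
write_csv(rows, ["title", "user", "minor"], buffer)
print(buffer.getvalue())
```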
 
=== Lecture ===
 
Because much of our analysis was going to take place outside of Python, the lecture focused on review and on new concepts for data manipulation. The lecture began with a detailed walk-through of code [[User:Mako|Mako]] wrote to build a new dataset of metadata for all revisions to articles about [https://en.wikipedia.org/wiki/Harry_Potter Harry Potter] on English Wikipedia.

After this review, we focused on counting, binning, and grouping data in order to ask and answer simple questions like:
 
* What proportion of edits to Wikipedia ''Harry Potter'' articles are minor?
* What proportion of edits to Wikipedia ''Harry Potter'' articles are made by "anonymous" contributors?
* What are the most edited ''Harry Potter'' articles?
* Who are the most active editors on ''Harry Potter'' articles?
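Questions like these reduce to counting and grouping. The following is a minimal sketch, assuming each revision is a dictionary; the field names and the data are hypothetical and are not taken from the dataset used in the lecture.

```python
from collections import Counter

# Hypothetical revision metadata: one dictionary per edit.
# The field names are assumptions made for this sketch.
revisions = [
    {"title": "Harry Potter", "user": "Alice", "minor": True, "anon": False},
    {"title": "Harry Potter", "user": "192.0.2.1", "minor": False, "anon": True},
    {"title": "Hermione Granger", "user": "Alice", "minor": False, "anon": False},
    {"title": "Harry Potter", "user": "Bob", "minor": True, "anon": False},
]


def proportion(revisions, field):
    """Fraction of revisions for which the given boolean field is true."""
    return sum(1 for rev in revisions if rev[field]) / len(revisions)


# Grouping by a key and counting is exactly what Counter does.
edits_per_article = Counter(rev["title"] for rev in revisions)
edits_per_user = Counter(rev["user"] for rev in revisions)

print("minor:", proportion(revisions, "minor"))
print("anonymous:", proportion(revisions, "anon"))
print("most edited:", edits_per_article.most_common(1))
print("most active:", edits_per_user.most_common(1))
```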
 
Because it did not require installation of software and because it ran on every platform, we did sorting and visualization in [http://docs.google.com Google Docs].
 
=== Projects ===
 
In the afternoon projects, one group continued to work on the ''Harry Potter'' dataset from English Wikipedia. In this case, the group focused on building a time series dataset. We were able to bin edits by day and to graph the resulting time series of edits. Users could easily see the release of the ''Harry Potter'' books and movies in the time series, and this was a major ''aha'' moment for many of the participants.
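Binning by day can be done with very little code when timestamps are ISO 8601 strings, because the first ten characters are the date. This is a minimal sketch with hypothetical timestamps, not the code the group wrote.

```python
from collections import Counter

# Hypothetical ISO 8601 timestamps, one per edit.
timestamps = [
    "2007-07-21T00:05:12Z",
    "2007-07-21T13:40:00Z",
    "2007-07-22T09:15:30Z",
]


def edits_per_day(timestamps):
    """Count edits per calendar day, returned as (date, count) pairs sorted by date."""
    counts = Counter(ts[:10] for ts in timestamps)  # the "YYYY-MM-DD" prefix
    return sorted(counts.items())


for day, count in edits_per_day(timestamps):
    print(day, count)
```

Note that days with no edits do not appear in the output at all, which matters when graphing the series in a spreadsheet.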
 
A second project focused on [http://matplotlib.org/ Matplotlib] and generated heatmaps of contributions to articles about men and women in Wikipedia based on time in Wikipedia's lifetime and time in the subject's lifetime. The heatmaps were popular with participants and were something that could not be easily done with spreadsheets.
 
[[File:Matplotlib-hist2d.png|400px]]
 
The challenge with ''Matplotlib'' was mostly around installation, which took an enormous amount of time when several learners ran into trouble. In the future, we will use [https://store.continuum.io/cshop/anaconda/ Anaconda], which we hope will address these issues because ''Anaconda'' includes ''Matplotlib''.
 
== General Feedback ==
 
One important goal was to help get learners as close to independence as possible. We felt that most learners did not make it all the way. In a sense, our final session ended at a bit of a low point in the class: many users had learned enough to start venturing out on their own, but not enough to avoid struggling enormously in the process.

One suggestion to try to address this is to add an additional half-day session with no lecture or planned projects. Learners could come and mentors would be on hand to help them work on ''their'' projects. Of course, we want everybody to be able to come, so we should also create a set of "random" projects for folks that don't have projects yet.
 
 
* The spacing between sessions was too large. In part, this was due to the fact that we were creating curriculum as we went. Next time, we will try to do the sessions every other week (e.g., 4 sessions in 5 weeks).
* The breaks for lunch were a bit too long. We took hour-long breaks but 45 minutes would have been enough for everybody. Learners were interested in getting back to work!
* The general structure of the entire curriculum was not as clear as it might have been, which led to some confusion. This was, at least in part, because the details of what we would teach in the later sessions were not decided when we began. In the future, we should present the entire session plan clearly up front.
* We did not have enough mentors with experience using Python in Windows. We had many skilled GNU/Linux users and ''zero'' students running GNU/Linux. Most of the mentors used Mac OS X and most of the learners ran Windows.
* Although we did not use it as a recruitment or selection criterion, a majority of the participants in the session were women. Although we had a mix of men and women mentors, the fact that most of our mentors were male and most of our learners were female was something we would have liked to avoid. If we expect to have a similar ratio in the future, we should try to recruit female mentors and, in particular, to attract women to lead the afternoon sessions (all of the afternoon lead mentors were male).
* The SWC-style sticky notes worked extremely well but were used less, and seemed to have less value, as we progressed.
 
In the future, we might also want to devote more time explicitly to teaching:
 
* Debugging code
* Finding and reading documentation
* Troubleshooting and looking at StackExchange for answers to programming questions.
 
=== Budget ===
 
For lunch, we spent $400 (pizza), $360 (a few fewer pizzas), and $600 (fancy Indian food). This was for 50 students and ~15 mentors but we assumed about 60 people would actually be there at each session. We also spent ~$50 in the mornings for coffee.
 
Most mentors could not make the follow-up sessions, so we spent about $100 per session on mentor dinners. If more people had shown up, it would have been closer to $200-250 per mentor dinner.
 
All of our food was generously supported by the [http://escience.washington.edu/ eScience Institute at UW]. The rooms were free because they were provided by the [http://www.com.washington.edu UW Department of Communication].
 
With a total budget on the order of $2000-2500, I think you could easily do a similar 3.5-day set of workshops. If we had a little more, we could do better than pizza for lunch.
 