Community Data Science Workshops (Spring 2014)/Reflections: Difference between revisions

{{CDSW Moved}}
Over three weekends in Spring 2014, a group of volunteers organized the [[Community Data Science Workshops (Spring 2014)]] (CDSW), the first series of four sessions designed to introduce absolute beginners to some of the basic tools of programming and of analyzing data from online communities. This version of the [[CDSW]] was held between April 4th and May 31st, 2014 at the University of Washington in Seattle.

This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW, including the authors!

In general, the mentors and students suggested that the workshops were a huge success. Students said that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which are detailed below.

If you have any questions or issues, you can contact [[Benjamin Mako Hill]] directly or can email the whole group of mentors at cdsw-sp2014-mentors@uw.edu.


== Structure ==

The [[Community Data Science Workshops (Spring 2014)]] consisted of [[Community Data Science Workshops (Spring 2014)#Schedule|four sessions]]:

* '''Session 0 (Friday April 4th)''': [[Community Data Science Workshops (Spring 2014)#Session 0 (Friday April 4th Evening 6-9pm)|Setup and Programming Practice]]
* '''Session 1 (Saturday April 5th)''': [[Community Data Science Workshops (Spring 2014)#Session 1 (Saturday April 5th)|Introduction to Python]]
* '''Session 2 (Saturday May 3rd)''': [[Community Data Science Workshops (Spring 2014)#Session 2 (Saturday May 3rd)|Building data sets using web APIs]]
* '''Session 3 (Saturday May 31st)''': [[Community Data Science Workshops (Spring 2014)# Session 3 (Saturday May 31st)|Data analysis and visualization]]


Our organization and the curriculum for Sessions 0 and 1 were borrowed from the [http://bostonpythonworkshop.com/ Boston Python Workshop] (BPW). Session 0 was a three-hour evening session to install software. The other sessions were all day-long sessions (10am to 4pm) broken up into the following schedule:

* '''Morning, 10am-noon''': A two-hour lecture
* '''Lunch, noon-1pm'''
* '''Afternoon, 1pm-3:30pm''': Practice working on projects in 3 breakout sessions
* '''Wrap-up, 3:30pm-4pm''': Wrap-up, next steps, and upcoming opportunities


We had 12 mentors volunteer initially, although more joined as the event progressed.

We had about 150 participants apply to attend the sessions. We selected on programming skill (to ensure that all attendees were complete beginners), on enthusiasm, and then randomly, to maintain a learner-to-mentor ratio of between 4 and 5. We admitted just over 50 participants.

We did not take roll or otherwise track how many people were present. Our feeling was that nearly every student who came to the first weekend (Sessions 0 and 1) came to Session 2. Retention between the last two sessions was much worse, with perhaps only 60% of the full group returning for Session 3. We attribute this drop to poor timing (the weekend before finals at UW, which affected many students) and to the long gap between the sessions.

We collected detailed feedback from users at three points using the following Google forms (these are copies):

* [https://docs.google.com/forms/d/1gPmgZvOxfE0KVRkb_ySgTqNvCaa4Rl8PYUY9u-NVwTE/viewform Application to the workshop]
* [https://docs.google.com/forms/d/1FGASnZLA3V13JTuJg5LF0fVvrUX9quKYc95yeEATzHY/viewform After Session 1]
* [https://docs.google.com/forms/d/1UhEU3aWKSuLpfBgR8CZcW8JrdgNRDj6FuT8yAqFCFmE/viewform After Session 2]

We used this feedback to both evaluate what worked well and what did not and to get a sense of what students wanted to learn in the next session and which afternoon sessions they might find interesting. We did not collect feedback after the final session but we should have.


=== Morning Lectures ===

[[Benjamin Mako Hill]] gave all three of the two-hour morning lectures. All of the lectures involved the teacher working through material in an interactive Python interpreter shown on a projector, with students following along in a Python interpreter on their own computers. In general, the lectures were well received by students.

Concerns with the lectures included feedback that:

* Two hours of straight lecture on difficult material was long and difficult to sit through.
* If students became lost, it could be very hard to catch up, given how the interactive Python session tended to build on earlier steps and assume the presence of variables or particular states.
* There were often more mentors than needed in the morning sessions, meaning that many mentors were idle.
* As the lectures progressed and the tasks became more complex, working in the interactive interpreter became increasingly difficult, particularly for long <code>for</code> loops or deeply nested blocks of code.

To address these concerns, we are planning the following changes to how we run these sessions in the future:

* Break up the lecture into at least two parts. Between those parts, we will try to include a short (~10-15 minute) exercise. This will break things up, allow mentors to be of more help during the sessions, and give students who fell behind a chance to catch up. It will also allow students to grab coffee or use the bathroom if they need to.
* Record the lectures so that students can catch up after the sessions. We did not do this but should have.
* Arrange for some mentors to arrive after noon if they would prefer.
* Upload not only the outline but also examples of all of the code we use as part of the lectures.
* Switch to writing code in separate files and running those files much earlier, perhaps as soon as we hit more than 2-3 lines in a <code>for</code> loop in Session 1. This would let students reuse the loops they write and would introduce the idea of writing and running code in a file (as opposed to a REPL environment) much earlier.


=== Projects ===

In the afternoons, we broke into small groups to work on "projects". Each afternoon we tried to have three project tracks: two projects on different substantive topics for learners with different interests, and a third track that was much more self-directed.

In Sessions 1 and 2, the self-directed projects were based on working through examples from [http://www.codecademy.com/ Codecademy] that we had put together from material already online on the website. In the self-directed track, students could work at their own pace with mentors on hand to work with them when they became stuck.

In Session 3, we did not use Codecademy but instead devoted the self-directed room to students working with mentors on data science projects of their choice. Because of issues with the student-to-mentor ratio, we asked that students only participate in the self-directed track if they felt confident they could be self-sufficient, working on their own 70-80% of the time.

In all other tracks, students would download a prepared example in the form of a <code>zip</code> file or <code>tar.gz</code> file. In each case, these projects would include:

* All of the libraries necessary to run the examples (e.g., [http://www.tweepy.org/ Tweepy] for the Session 2 Twitter track).
* All of the data necessary to run the example programs (e.g., a full English word list for the Wordplay example).
* Any other necessary code or libraries we had written for the example.
* A series of small numbered example programs (~5-10 examples). Each example program tried to be sparse, well documented, and not more than 10-15 lines of Python code. Each program tried both to do something concrete and to provide an example for learners to modify. Although it was not always possible, the example programs tried to use only Python concepts we had covered in class.

On average, the non-self-directed afternoon tracks consisted of about 30% impromptu lecture, in which a designated lead mentor would walk through one or more of the examples, explaining the code and concepts in detail and answering questions.

Afterwards, the lead mentor would present a list of increasingly difficult challenges for the entire group to work on sequentially. These were usually written on a whiteboard or projected and were often added to dynamically based on student feedback and interest.

Learners would work on these challenges at their own pace, working with mentors for help. If the group was stuck on a concept or tool, the lead mentor would bring the group back together to walk through the concept using the project in the full group.

In some cases, more advanced students could "jump ahead" and begin working on their own challenges or changing the code to work in different ways. This was welcome and encouraged.

In all cases, we gave students red sticky notes they could use to signal that they needed help (a tool borrowed from [http://software-carpentry.org/ SWC]).


== Session 0: Python Setup ==

The goal of this session was to get users set up with Python and starting to learn some of the basics. The setup curriculum was adapted from BPW. We ran into the following challenges:

* Users on Windows struggled to get Python set up and added to their path.
* Users had different (and often older) versions of Python, which became a bigger issue when we began using web libraries.
* Mac users struggled with, and generally did not like, the Smultron text editor that we recommended.

We proposed the following changes:

* Use [https://store.continuum.io/cshop/anaconda/ Anaconda] to get Python installed (following SWC).
* Use a different text editor for Mac OS. [http://www.textwrangler.com/ TextWrangler] was suggested.
* In-browser Python (e.g., http://repl.it) is intriguing but perhaps either not ready enough or not "real" enough.
* Emphasize more strongly that Windows users ''need'' to come to Session 0 to set up.
* Change the Codecademy lessons to remove or change the HTML example. Users that knew HTML already were often confused because printing "&lt;b&gt;foo&lt;/b&gt;" did not result in actually bolded text. This was just the wrong choice for a simple string concatenation example.
* Add some text to emphasize the difference between the Python shell and the system shell. Students were confused about this through the very end.
* Add a new check-off step that includes the following: create a file, save it, run it.


== Session 1: Introduction to Python ==

The goal of this session was to teach the basics of programming in Python. The curriculum for BPW has been used many times and is well tested. Unsurprisingly, it worked well for us as well.

That said, there are several things we will change when we teach the material again:

* If possible, we would have liked to do introductions up front (i.e., a simple "your name, where you are from, and what you want to do"), even in a big group. This seems more important in a multi-day event and would have been useful for the mentors.
* The BPW projects were not focused on data and were more like classic computer science class projects. In the future, we would like to choose some examples that are a little more data focused.

=== Afternoon sessions ===

In terms of the afternoon sessions, we felt that the [[ColorWall]] example was ''way'' too complicated. It introduced many features and concepts that nobody had seen, and many users were flustered.

The [[Wordplay]] project was much better in this regard. In particular, we liked that Wordplay was broken up into a series of small example projects that each did one small thing. This provided us with an opportunity to walk through the example and then pose challenges to students to make changes to the code. Students could look through their example programs and build up from there.

In the future, we will replace [[ColorWall]] with another more data-focused example. Our current thought is to build a little example that involves iterating through a pre-parsed version of the complete works of Shakespeare.


== Session 2: Learning APIs ==

The goal of this session was to describe what web APIs are, how they work (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.

Mentors and students felt that this session was the most successful and effective session.

=== Morning lecture ===

The morning lecture was well received, if delivered too quickly. Unsurprisingly, the example of [http://placekitten.com/ PlaceKitten] as an API was an enormous hit: informative ''and'' cute.

Defining APIs was difficult. First, the general ambiguity around the term, and the difference between APIs in general and web APIs, should be foregrounded. Learners frequently wanted to ask questions like, "Where in this Python program is the API?" It was difficult for some to grasp that the API is the ''protocol'' that describes what a client can ask for and what they can expect to receive back. Preparing a concise answer to this question ahead of time would have been worthwhile. We spent too much time on this in the session.
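
One way to make the "API as protocol" idea concrete is to show that a request is just a URL built according to documented rules. The following is a minimal sketch, not the code used in the lecture. It assumes the PlaceKitten URL pattern <code>http://placekitten.com/WIDTH/HEIGHT</code>, and the function name <code>kitten_url</code> is hypothetical.

```python
# Sketch: the "API" is the rule for how a request URL is formed,
# not any particular line of Python.
# Assumption: PlaceKitten serves images at /<width>/<height>.

def kitten_url(width, height):
    """Build the URL that asks for a kitten image of the given size."""
    return "http://placekitten.com/{}/{}".format(width, height)


url = kitten_url(200, 300)
print(url)
```

The Python here only assembles the request. What counts as a valid request, and what comes back, is decided by the service, which is the sense in which the API is a protocol rather than a piece of the program.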


Although there was some debate among the mentors, if there is one thing we might remove from the curriculum for a future session, it would probably be JSON. It seemed less useful because the APIs that most learners plan to use (e.g., Twitter and Wikipedia) already have Python interfaces in the form of modules. In this sense, spending 30 minutes of a lecture learning how to parse JSON objects seems like a poor use of time.

On the other hand, time spent looking at JSON objects provides practice thinking about more complex data structures (e.g., nested lists and dictionaries), which is necessary and which students will otherwise not be prepared for. We were undecided as a group.
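
To illustrate the kind of nested structure involved, here is a small sketch using Python's standard <code>json</code> module. The response text is invented for illustration and is not the actual format returned by the Wikipedia or Twitter APIs.

```python
import json

# Hypothetical response text, loosely shaped like a web API reply.
raw = '''
{"query": {"pages": [
    {"title": "Harry Potter", "categories": ["Fantasy novels", "British novels"]},
    {"title": "Seattle", "categories": ["Cities in Washington (state)"]}
]}}
'''

# json.loads turns the text into ordinary dictionaries and lists.
data = json.loads(raw)

# Walk down the nesting: dictionary -> dictionary -> list of dictionaries.
pages = data["query"]["pages"]
for page in pages:
    print(page["title"], len(page["categories"]))

# Indexing mixes dictionary keys and list positions.
first_category = pages[0]["categories"][0]
print(first_category)
```

Once parsed, nothing here is specific to JSON. It is the same practice with nested lists and dictionaries that learners need for any structured data.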


=== Afternoon sessions ===

In our session, more than 60% of students were interested in learning Twitter and that track was heavily attended.

In the Twitter track, discoverability of the structure of [http://www.tweepy.org/ Tweepy] objects was a challenge. Users would create an object, but it was not easy to introspect those objects and see what was there in the way we had discussed with JSON objects. This came as a surprise to us and required some real-time consultation with the [http://tweepy.readthedocs.org/en/v2.3.0/ Tweepy module documentation].
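
Python's built-in <code>vars()</code> and <code>dir()</code> offer one general way to look inside an unfamiliar object. The sketch below uses a hypothetical stand-in class, <code>FakeStatus</code>, rather than a real Tweepy object, so the attribute names are assumptions and real module objects may expose their data differently.

```python
class FakeStatus(object):
    """Hypothetical stand-in for an object returned by an API module."""

    def __init__(self, text, retweet_count):
        self.text = text
        self.retweet_count = retweet_count


status = FakeStatus("hello world", 3)

# vars() returns the instance attributes as a dictionary,
# which can be explored much like a parsed JSON object.
attributes = vars(status)
print(sorted(attributes))

# dir() lists every name on the object, including methods;
# filtering out underscore names leaves the public ones.
public_names = [name for name in dir(status) if not name.startswith("_")]
print(public_names)
```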


The Wikipedia session ended up spending very little time working with the example code we had prepared. Instead, we worked directly from examples in the morning and wrote code almost entirely from scratch while looking directly at the output from the API.

Our session focused on building a version of the game [http://kevan.org/catfishing.php Catfishing]. Essentially, we set out to write a program that would get a list of categories for a set of articles, randomly select one of those articles, and then show the categories associated with that article to the user to have them "guess" the article. We modified the program to not include obvious giveaways (e.g., to remove categories that include the answer itself as a substring).
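
The core of the game can be sketched in a few lines. This is an illustrative reconstruction, not the code written in the session. It works on a hard-coded dictionary instead of live API output, and the article data and the function names <code>hide_giveaways</code> and <code>make_puzzle</code> are hypothetical.

```python
import random


def hide_giveaways(title, categories):
    """Return only the categories that do not contain the title itself."""
    needle = title.lower()
    return [c for c in categories if needle not in c.lower()]


def make_puzzle(articles):
    """Pick one article at random and return (answer, clue categories)."""
    title = random.choice(sorted(articles))
    return title, hide_giveaways(title, articles[title])


# Hypothetical data: article titles mapped to their category names.
articles = {
    "Seattle": ["Seattle", "Culture of Seattle",
                "Port cities in Washington",
                "Populated places established in 1851"],
    "Mount Rainier": ["Volcanoes of Washington",
                      "Mount Rainier National Park",
                      "Stratovolcanoes"],
}

answer, clues = make_puzzle(articles)
print("Which article has these categories?")
for clue in clues:
    print(" -", clue)
```

The substring check is case-insensitive so that a category such as "Culture of Seattle" is hidden when the answer is "Seattle".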


Both sessions worked well and received positive feedback.

In future sessions, we might like to focus on other APIs including, perhaps, APIs that do not have Python modules. This would provide a stronger non-pedagogical reason to focus on reading and learning JSON. Working with simple APIs might also be a good example of something we could do as a small group exercise between parts of the lecture.


== Session 3: Data Analysis and Visualization ==

The goal of this session was to get users to the point where they could take data from a web API and ask and answer basic data science questions by using Python to manipulate data and by creating simple visualizations.

Our philosophy in Session 3 was to teach users to get data into tools they already know and use. We thought this would be a better use of their time and help make users independent earlier.

Based on feedback from the application, we know that almost every user who attended our sessions had at least basic experience with spreadsheets and with using spreadsheets to create simple charts. We tried to help users use Python to process data into formats that they could load into existing tools like ''LibreOffice'', ''Microsoft Excel'', or ''Google Docs''.
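
A common spreadsheet-friendly format is CSV, which Python's standard <code>csv</code> module can write. The sketch below is an assumption about what such a step might look like, not the workshop's actual code. The records and field names are invented, and it writes to an in-memory buffer so that it runs without creating files.

```python
import csv
import io

# Hypothetical revision records; the field names are assumptions.
revisions = [
    {"title": "Harry Potter", "user": "ExampleUser", "minor": True},
    {"title": "Hermione Granger", "user": "192.0.2.1", "minor": False},
]

# Write to an in-memory buffer here; with a real file, use
# open("revisions.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "user", "minor"])
writer.writeheader()
writer.writerows(revisions)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text has one header row and one row per record, which is the layout that spreadsheet import dialogs expect.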


=== Lecture ===

Because much of our analysis was going to take place outside of Python, the lecture focused on review and on new concepts for data manipulation. The lecture began with a detailed walk-through of code [[User:Mako|Mako]] wrote to build a dataset of metadata for all revisions to articles about [https://en.wikipedia.org/wiki/Harry_Potter Harry Potter] on English Wikipedia.

After this review, we focused on counting, binning, and grouping data in order to ask and answer simple questions like:

* What proportion of edits to ''Harry Potter'' articles are minor?
* What proportion of edits to ''Harry Potter'' articles are made by "anonymous" contributors?
* What are the most edited ''Harry Potter'' articles?
* Who are the most active editors on ''Harry Potter'' articles?
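
Questions like these reduce to counting and grouping. The following sketch shows one way to answer them with plain Python and <code>collections.Counter</code>. The four sample records and their field names (<code>title</code>, <code>user</code>, <code>minor</code>, <code>anon</code>) are made up for illustration and do not reflect the real dataset.

```python
from collections import Counter

# Hypothetical revision metadata; real records would come from the API.
edits = [
    {"title": "Harry Potter", "user": "Alice", "minor": True, "anon": False},
    {"title": "Harry Potter", "user": "192.0.2.7", "minor": False, "anon": True},
    {"title": "Hermione Granger", "user": "Alice", "minor": False, "anon": False},
    {"title": "Harry Potter", "user": "Bob", "minor": True, "anon": False},
]

total = float(len(edits))

# Proportions: count the records matching a condition, divide by the total.
minor_share = sum(1 for e in edits if e["minor"]) / total
anon_share = sum(1 for e in edits if e["anon"]) / total

# Grouping: Counter tallies how often each title or user appears.
edits_per_article = Counter(e["title"] for e in edits)
edits_per_user = Counter(e["user"] for e in edits)

print("minor:", minor_share, "anonymous:", anon_share)
print("most edited:", edits_per_article.most_common(1))
print("most active:", edits_per_user.most_common(1))
```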


Because it did not require installation of software and because it ran on every platform, we did sorting and visualization in [http://docs.google.com Google Docs].

=== Projects ===

In the afternoon projects, one group continued with work on the ''Harry Potter'' dataset from English Wikipedia. In this case, the group focused on building a time series dataset. We were able to bin edits by day and to graph the time series of edits to English Wikipedia over time. Users could easily see the release of the ''Harry Potter'' books and movies in the time series, and this was a major "aha" moment for many of the participants.
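
Binning by day amounts to truncating each timestamp to its date and counting. The sketch below assumes ISO 8601 timestamps of the form <code>2007-07-21T08:15:00Z</code>. The sample values and the helper name <code>day_of</code> are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical edit timestamps; the format is an assumption.
timestamps = [
    "2007-07-21T08:15:00Z",
    "2007-07-21T19:40:00Z",
    "2007-07-22T01:05:00Z",
]


def day_of(timestamp):
    """Parse a timestamp and return just its date as YYYY-MM-DD."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.date().isoformat()


# Count how many edits fall on each day.
edits_per_day = Counter(day_of(ts) for ts in timestamps)

# Print one "day count" row per day, in chronological order.
for day in sorted(edits_per_day):
    print(day, edits_per_day[day])
```

Rows of day and count like these can be pasted or imported into a spreadsheet and charted as a time series.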


A second project focused on [http://matplotlib.org/ Matplotlib] and generated heatmaps of contributions to articles about men and women in Wikipedia based on time in Wikipedia's lifetime and time in the subject's lifetime. The heatmaps were popular with participants and were something that could not be easily done with spreadsheets.

[[File:Matplotlib-hist2d.png|400px]]

The challenge with ''Matplotlib'' was mostly around installation, which took an enormous amount of time when several learners ran into trouble. In the future, we will use [https://store.continuum.io/cshop/anaconda/ Anaconda], which we hope will address these issues because ''Anaconda'' includes ''Matplotlib''.


== General Feedback ==

One important goal was to help get learners as close to independence as possible. We felt that most learners did not make it all the way. In a sense, our final session seemed to let out on a bit of a low point in the class: many users had learned enough that they were able to start venturing out on their own, but not enough to keep from struggling enormously in the process.

One suggestion to address this is to add an additional half-day session with no lecture or planned projects. Learners could come and mentors would be there to help them work on ''their'' projects. Of course, we want everybody to be able to come, so we should also create a set of "random" projects for folks that don't have projects yet.

Other things we should do differently:

* The spacing between sessions was too large. In part, this was due to the fact that we were creating curriculum as we went. Next time, we will try to do the sessions every other week (e.g., 4 sessions in 5 weeks).
* The breaks for lunch were a bit too long. We took hour-long breaks but 45 minutes would have been enough. Food was simple and always there on time, and learners were interested in getting back to work!
* The general structure of the entire curriculum was not as clear as it might have been, which led to some confusion. This was, at least in part, because the details of what we would teach in the later sessions were not decided when we began. In the future, we should present the entire session plan clearly up front.
* We did not have enough mentors with experience using Python in Windows. We had many skilled GNU/Linux users and ''zero'' students running GNU/Linux. Most of the mentors used Mac OS X and most of the learners ran Windows.
* Although we did not use it as a recruitment or selection criterion, a majority of the participants in the sessions were women. Although we had a mix of men and women mentors, the fact that most of our mentors were male and most of our learners were female was something we would have liked to avoid. If we expect to have a similar ratio in the future, we should try to recruit female mentors and, in particular, to attract women to lead the afternoon sessions (all of the afternoon session lead mentors were male).
* The SWC-style sticky notes worked extremely well but were used less, and seemed to have less value, as we progressed.

In the future, we might also want to devote more time explicitly to teaching:

* Debugging code
* Finding and reading documentation (perhaps in a dedicated 10-15 minute segment)
* Troubleshooting and looking at StackExchange for answers to programming questions
* Working in the right directory, including understanding the path and the idea that files and datasets need to be local to the place where the script is run

=== Budget ===

For lunch we spent between $360 and $600 per session: $400 (pizza), $360 (a few fewer pizzas), and $600 (fancy Indian food). This was for 50 students and ~15 mentors, but we assumed about 60 people would actually be there at each session. We also spent ~$50 in the mornings for coffee.

Most mentors could not make the follow-up sessions so we spent about $100 per session on mentor dinners. If more people had shown up, it would have been closer to $200-250 per mentor dinner.

All of our food was generously supported by the [http://escience.washington.edu/ eScience Institute at UW]. The rooms were free because they were provided by the [http://www.com.washington.edu UW Department of Communication].

With a total budget on the order of $2000-2500, I think you could easily run a similar set of workshops spanning 3.5 days. If we had a little more, we could do better than pizza for lunch.
things to teach:


<!-- LocalWords: CDSW BPW JSON
- debugging
-->
- reading documentation
- troubleshooting and looking at stackexchange

Latest revision as of 22:12, 15 March 2015

Page Moved
All material related to the Community Data Science Workshops has been moved from the OpenHatch wiki to a new dedicated wiki, and this page is no longer being updated here. Please visit the new version of the page on the Community Data Science Collective wiki.

Over three weekends in Spring 2014, a group of volunteers organized the Community Data Science Workshops (Spring 2014) (CDSW), the first series of four sessions designed to introduce absolute beginners to some of the basic tools of programming and of analyzing data from online communities. This version of the CDSW was held between April 4th and May 31st, 2014, at the University of Washington in Seattle.

This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW — including the authors!

In general, the mentors and students suggested that the workshops were a huge success. Students suggested that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which are detailed below.

If you have any questions or issues, you can contact Benjamin Mako Hill directly or can email the whole group of mentors at cdsw-sp2014-mentors@uw.edu.

Structure

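The Community Data Science Workshops (Spring 2014) consisted of four sessions:

  • Session 0: Python Setup
  • Session 1: Introduction to Python
  • Session 2: Learning APIs
  • Session 3: Data Analysis and Visualization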

Our organization and the curriculum for Sessions 0 and 1 were borrowed from the Boston Python Workshop (BPW). Session 0 was a three-hour evening session to install software. The other sessions were all day-long sessions (10am to 4pm) broken up into the following schedule:

  • Morning, 10am-noon: A 2 hour lecture
  • Lunch, noon-1pm
  • Afternoon, 1pm-3:30pm: Practice working on projects in 3 breakout sessions
  • Wrap-up, 3:30pm-4pm: Wrap-up, next steps, and upcoming opportunities

We had 12 mentors volunteer initially although more joined as the event progressed.

We had about 150 participants apply to attend the sessions. We selected on programming skill (to ensure that all attendees were complete beginners) and on enthusiasm, and then at random, to maintain a learner-to-mentor ratio of between 4 and 5. We admitted just over 50 participants.

Our feeling was that nearly every student who came to the first week (Sessions 0 and 1) came to Session 2. Retention between the last two sessions was much worse, with perhaps only 60% of the full group returning for Session 3. We attribute this drop in retention to poor timing (the weekend before finals at UW, which affected many students) and to the long gap between the sessions.

We collected detailed feedback from users at three points using the following Google forms (these are copies):

We used this feedback to both evaluate what worked well and what did not and to get a sense of what students wanted to learn in the next session and which afternoon sessions they might find interesting. We did not collect feedback after the final session but we should have.

Morning Lectures

Benjamin Mako Hill gave all three of the two-hour morning lectures. All of the lectures involved the teacher working through material in the interactive Python interpreter shown on a projector, with students following along in a Python interpreter on their own computers. In general, the lectures were well received by students.

Concerns about the lectures included feedback that:

  • Two hours of straight lecture on difficult material was long and hard to sit through.
  • If students became lost, it could be very hard to catch up, given how the interactive Python session tended to build on earlier steps and assume the presence of variables or particular states.
  • There were often more mentors than really needed in the morning sessions, meaning that many mentors were often idle.
  • As the lectures progressed and the tasks became more complex, working in the interactive interpreter became increasingly difficult, particularly for long for loops or deeply nested blocks of code.

To address these concerns, we are planning the following changes to how we run these sessions in the future:

  • Break up the lecture into at least two parts. Between those parts, we will try to include a small (~10-15 minute) exercise. This will break things up, allow mentors to be of more help during the sessions, and give students who fell behind a chance to catch up. It will also allow students to grab coffee or use the bathroom if they need to.
  • Record the lectures so that students can catch up after the sessions. We did not do this but should have.
  • Arrange for some mentors to arrive after noon if they would prefer.
  • Upload not only the outline, but examples of all of the code we'll use as part of the lectures.
  • Switch into writing code in separate files and running those files much earlier — perhaps as soon as we hit more than 2-3 lines in a for loop in Session 1.

Projects

In the afternoons, we broke into small groups to work on "projects". In each afternoon we tried to have three afternoon project tracks: Two projects on different substantive topics for learners with different interests and a third project that was much more self-directed.

In Sessions 1 and 2, the self-directed projects were based on working through examples from Codecademy that we had put together from material already online on the website. In the self-directed track, students could work at their own pace with mentors on hand to work with them when they became stuck.

In Session 3, we did not use Codecademy but instead devoted the self-directed room to students working with mentors on data science projects of their choice. Because of issues with the student-to-mentor ratio, we asked that students only participate in the self-directed track if they felt confident they could be self-sufficient, working on their own 70-80% of the time.

In all other tracks, students would download a prepared example in the form of a zip file or tar.gz file. In each case, these projects would include:

  • All of the libraries necessary to run the examples (e.g., Tweepy for the Session 2 Twitter track).
  • All of the data necessary to run the example programs (e.g., a full English word list for the Wordplay example).
  • Any other necessary code or libraries we had written for the example.
  • A series of small numbered example programs (~5-10 examples). Each example program tried to be sparse, well documented, and not more than 10-15 lines of Python code. Each program tried both to do something concrete and to provide an example for learners to modify. Although it was not always possible, the example programs tried to use only Python concepts we had covered in class.

On average, the non-self-directed afternoon tracks consisted of about 30% impromptu lecture, in which a designated lead mentor would walk through one or more of the examples, explaining the code and concepts in detail and answering questions.

Afterwards, the lead mentor would present a list of increasingly difficult challenges for the entire group to work on sequentially. These were usually written on a whiteboard or projected, and were often added to dynamically based on student feedback and interest.

Learners would work on these challenges at their own pace working with mentors for help. If the group was stuck on a concept or tool, the lead mentor would bring the group back together to walk through the concept using the project in the full group.

In some cases, more advanced students could "jump ahead" and begin working on their own challenges or changing the code to work in different ways. This was welcome and encouraged.

In all cases, we gave students red sticky notes they could use to signal that they needed help (a tool borrowed from SWC).

Session 0: Python Setup

The goal of this session was to get users set up with Python and starting to learn some of the basics. The setup curriculum was adapted from BPW. We ran into the following challenges:

  • Users on Windows struggled to get Python set up and added to their path.
  • Users had different (and often older) versions of Python, which became a bigger issue when we began using web libraries.
  • Mac users struggled with, and generally did not like, the Smultron text editor that we recommended.

We proposed the following changes:

  • Use Anaconda for getting Python installed (following SWC)
  • Use a different text editor for Mac OS X. TextWrangler was suggested.
  • In-browser Python (e.g., http://repl.it) is intriguing but perhaps neither ready enough nor "real" enough.
  • Emphasize more strongly that Windows users need to come to Session 0 to set up.
  • Change the Codecademy lessons to remove or replace the HTML example. Users who already knew HTML were often confused because printing "<b>foo</b>" did not result in actual bold text. This was simply the wrong choice for a simple string concatenation example.
  • Add some text to emphasize the difference between the Python shell and the system shell. Students were confused about this through the very end.
  • Add a new check-off step that includes the following: create a file, save it, run it.
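
The proposed check-off step (create a file, save it, run it) could be as small as the sketch below. The file name `checkoff.py` and the function name `python_version` are hypothetical and are not taken from the BPW or CDSW materials. Reporting the interpreter version is an assumption about what would be useful, given that mismatched Python versions caused trouble later.

```python
# checkoff.py -- hypothetical check-off script for Session 0.
# Save this file, then run it from the *system* shell (not the >>> prompt):
#     python checkoff.py
import sys


def python_version():
    """Return the running interpreter's version as a string such as '3.11.4'."""
    return ".".join(str(part) for part in sys.version_info[:3])


if __name__ == "__main__":
    # Printing the version lets a mentor spot an outdated install at a glance.
    print("It works! This is Python", python_version())
```

The comment at the top doubles as a reminder of the shell distinction: `python checkoff.py` is typed at the system shell, while anything shown after a `>>>` prompt is typed inside the Python shell.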

Session 1: Introduction to Python

The goal of this session was to teach the basics of programming in Python. The curriculum for BPW has been used many times and is well tested. Unsurprisingly, it worked well for us as well.

That said, there are several things we will change when we teach the material again:

  • If possible, we would have liked to do introductions (i.e., a simple "your name, where you are from, and what you want to do") up front, even in a big group. This seems more important in a multi-day event and would have been useful for the mentors.
  • The BPW projects were not focused on data and were more like classic computer science class projects. In the future, we would like to choose some examples that are a little more data focused.

Afternoon sessions

In terms of the afternoon sessions, we felt that the ColorWall example was way too complicated. It introduced many features and concepts that nobody had seen and many users were flustered.

The Wordplay project was much better in this regard. In particular, we liked that Wordplay was broken up into a series of small example projects that each did one small thing. This provided us with an opportunity to walk through the example and then pose challenges to students to make changes to the code.

In the future, we will replace ColorWall with another, more data-focused example. Our current thought is to build a little example that involves iterating through a pre-parsed version of the complete works of Shakespeare.
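
As a rough illustration of what such an example might look like, the sketch below counts words across a list of lines. The three-line `lines` list is a hypothetical stand-in for the pre-parsed corpus, and `count_words` is an illustrative name. It uses only loops, strings, and dictionaries, in keeping with the goal of staying close to concepts covered in class.

```python
# Hypothetical stand-in for a pre-parsed corpus: one string per line of text.
lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
]


def count_words(lines):
    """Count how often each lower-cased word appears across all lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            # Strip surrounding punctuation so "be," and "be" count together.
            word = word.strip(".,:;!?'\"").lower()
            if word:
                counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words(lines)
print(counts["to"], counts["be"], counts["the"])
```

A natural challenge to pose afterwards would be to print the ten most frequent words, or to ignore very short words.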

Session 2: Learning APIs

The goal of this session was to describe what web APIs are, how they work (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.

Mentors and students felt that this session was the most successful and effective session.

Morning lecture

The morning lecture was well received — if delivered too quickly. Unsurprisingly, the example of PlaceKitten as an API was an enormous hit: informative and cute.

Defining APIs was difficult. First, general ambiguity around the use of the term and the difference between APIs in general and web APIs should be foregrounded. Learners frequently wanted to ask questions like, "Where in this Python program is the API?" It was difficult for some to grasp that the API is the protocol that describes what a client can ask for and what they can expect to receive back. Preparing a concise answer to this question ahead of time would have been worthwhile. We spent too much time on this in the session.
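
One possible way to make the "where is the API?" answer concrete is to show that the client's half of the protocol is just a carefully constructed URL. The sketch below builds a request for an article's categories against the MediaWiki API endpoint on English Wikipedia. The function name `category_request_url` is hypothetical, Python 3 is assumed, and no request is actually sent.

```python
from urllib.parse import urlencode

# The endpoint is the address the API listens on; the parameters are the
# vocabulary the API documents for what a client may ask.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_request_url(title):
    """Build (but do not send) a URL asking for one article's categories."""
    params = {
        "action": "query",      # ask a question about existing pages
        "prop": "categories",   # the property we want back
        "titles": title,        # which article we are asking about
        "format": "json",       # the format we want the answer in
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = category_request_url("Harry Potter")
print(url)
```

In this framing, the API is not a line in the program. It is the agreement that a URL of this shape will produce a response of a documented shape.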

Although there was some debate among the mentors, if there is one thing we might remove from the curriculum for a future session, it would probably be JSON. The reason it seemed less useful is that the APIs most learners plan to use (e.g., Twitter and Wikipedia) already have Python interfaces in the form of modules. In this sense, spending 30 minutes of a lecture learning how to parse JSON objects seems like a poor use of time.

On the other hand, time spent looking at JSON objects provides practice thinking about more complex data structures (e.g., nested lists and dictionaries), which is necessary and which students will otherwise not be prepared for. We were undecided as a group.
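
The kind of practice at stake can be sketched as follows. The `raw` string is a small hand-written response, shaped like a category query result but not copied from any real API output, and `category_titles` is a hypothetical helper. Walking it requires moving through dictionaries inside dictionaries and then a list of dictionaries.

```python
import json

# Hand-written, hypothetical response text: dictionaries nested three deep,
# ending in a list of dictionaries.
raw = """
{"query": {"pages": {"1": {"title": "Example article",
    "categories": [{"ns": 14, "title": "Category:Examples"},
                   {"ns": 14, "title": "Category:Placeholders"}]}}}}
"""


def category_titles(raw_json):
    """Parse JSON text and collect every category title it contains."""
    data = json.loads(raw_json)          # text -> nested dicts and lists
    titles = []
    for page in data["query"]["pages"].values():
        # A page with no categories simply has no "categories" key.
        for category in page.get("categories", []):
            titles.append(category["title"])
    return titles


print(category_titles(raw))
```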

Afternoon sessions

In our session, more than 60% of students were interested in learning Twitter and that track was heavily attended.

In the Twitter track, discoverability of the structure of Tweepy objects was a challenge. Users would create an object, but it was not easy to introspect those objects and see what was there in the way we had discussed with JSON objects. This came as a surprise to us and required some real-time consultation with the Tweepy module documentation.
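
For what it is worth, Python's built-in `dir()` and `vars()` offer a generic way to peek inside an unfamiliar object. The sketch below uses a hypothetical `FakeStatus` class as a stand-in. It is not Tweepy, and real Tweepy objects may expose their data differently, so this is only an illustration of the general technique.

```python
class FakeStatus:
    """Hypothetical stand-in for an object handed back by an API module."""

    def __init__(self, text, author_name):
        self.text = text
        self.author_name = author_name


def public_attributes(obj):
    """List attribute names that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


status = FakeStatus("hello world", "example_user")
print(public_attributes(status))   # names we can access with a dot
print(vars(status))                # the instance's own data as a dictionary
```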

The Wikipedia session ended up spending very little time working with the example code we had prepared. Instead, we worked directly from examples in the morning and wrote code almost entirely from scratch while looking directly at the output from the API.

Our session focused on building a version of the game Catfishing. Essentially, we set out to write a program that would get a list of categories for a set of articles, randomly select one of those articles, and then show the categories associated with that article back to the user to have them "guess" the article. We modified the program to exclude obvious giveaways (e.g., to remove categories that include the answer itself as a substring).
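
The core of the game can be sketched without touching the network. In the sketch below, the `articles` dictionary is hypothetical sample data standing in for what the API returned, and the function names are illustrative rather than those used in the session.

```python
import random

# Hypothetical sample data: article title -> list of category names.
articles = {
    "Seattle": [
        "Cities in Washington (state)",
        "Seattle",
        "Port cities in the United States",
        "History of Seattle",
    ],
    "Python (programming language)": [
        "Programming languages",
        "Python (programming language)",
        "Cross-platform software",
    ],
}


def clue_categories(title, categories):
    """Drop categories that contain the article title as a substring."""
    lowered = title.lower()
    return [c for c in categories if lowered not in c.lower()]


def pick_puzzle(articles, rng=random):
    """Choose a random article and return (answer, list of clues)."""
    title = rng.choice(sorted(articles))
    return title, clue_categories(title, articles[title])


answer, clues = pick_puzzle(articles)
print("Which article has these categories?", clues)
```

Comparing lower-cased strings is a design choice made here so that a category like "History of Seattle" is also filtered out.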

Both sessions worked well and received positive feedback.

In future sessions, we might like to focus on other APIs including, perhaps, APIs that do not have Python modules. This would provide a stronger non-pedagogical reason to focus on reading and learning JSON. Working with simple APIs might also be a good candidate for a small group exercise between parts of the lecture.

Session 3: Data Analysis and Visualization

The goal of this session was to get users to the point where they could take data from a web API and ask and answer basic data science questions by using Python to manipulate data and by creating simple visualizations.

Our philosophy in Session 3 was to teach users to get data into tools they already know and use. We thought this would be a better use of their time and help make users independent earlier.

Based on feedback from the application, we know that almost every user who attended our sessions had at least basic experience with spreadsheets and with using spreadsheets to create simple charts. We tried to help users process data with Python into formats that they could load into existing tools like LibreOffice, Microsoft Excel, or Google Docs.
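
A common format that all of these tools can open is CSV. The following is a minimal sketch, assuming Python 3 and the standard library `csv` module. The `rows` data and the `write_csv` helper are hypothetical and only illustrate the idea of handing data off to a spreadsheet.

```python
import csv
import io


def write_csv(rows, fieldnames, out):
    """Write a list of dictionaries to an open file as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical data: one dictionary per spreadsheet row.
rows = [
    {"title": "Example A", "edits": 12},
    {"title": "Example B", "edits": 7},
]

# An in-memory buffer stands in for a real file so the sketch has no side effects.
buffer = io.StringIO()
write_csv(rows, ["title", "edits"], buffer)
print(buffer.getvalue())
```

To produce a real file, the same function could be called inside `with open("edits.csv", "w", newline="") as f:`, where the file name is again just an example.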

Lecture

Because much of our analysis was going to take place outside of Python, the lecture focused on review and on new concepts for data manipulation. The lecture began with a detailed walk-through of code Mako wrote to build a dataset of metadata for all revisions to articles about Harry Potter on English Wikipedia.

After this review, we focused on counting, binning, and grouping data in order to ask and answer simple questions like:

  • What proportion of edits to Harry Potter articles are minor?
  • What proportion of edits to Harry Potter articles are made by "anonymous" contributors?
  • What are the most edited Harry Potter articles?
  • Who are the most active editors on Harry Potter articles?
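
Questions like these reduce to counting and grouping. The sketch below shows one way this might look. The four `revisions` records and their field names (`title`, `user`, `minor`, `anon`) are hypothetical and are not the actual structure of the Harry Potter dataset.

```python
from collections import Counter

# Hypothetical revision records; the real dataset's fields may differ.
revisions = [
    {"title": "Article A", "user": "Alice", "minor": True, "anon": False},
    {"title": "Article A", "user": "10.0.0.1", "minor": False, "anon": True},
    {"title": "Article B", "user": "Alice", "minor": False, "anon": False},
    {"title": "Article A", "user": "Bob", "minor": True, "anon": False},
]


def proportion(revisions, field):
    """Fraction of revisions for which the given true/false field is true."""
    if not revisions:
        return 0.0
    matching = sum(1 for rev in revisions if rev[field])
    return matching / len(revisions)


def most_common(revisions, field, n=3):
    """The n most frequent values of a field, as (value, count) pairs."""
    return Counter(rev[field] for rev in revisions).most_common(n)


print("minor:", proportion(revisions, "minor"))
print("anonymous:", proportion(revisions, "anon"))
print("most edited:", most_common(revisions, "title"))
print("most active:", most_common(revisions, "user", n=1))
```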

Because it did not require installing software and because it ran on every platform, we did sorting and visualization in Google Docs.

Projects

In the afternoon projects, one group continued to work on the Harry Potter dataset from English Wikipedia. In this case, the group focused on building a time series dataset. We were able to bin edits by day and to graph the time series of edits to English Wikipedia over time. Users could easily see the release of the Harry Potter books and movies in the time series, and this was a major "aha" moment for many of the participants.
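
Binning by day can be sketched in a few lines. The example assumes each edit carries an ISO 8601 timestamp string such as "2007-07-21T01:02:03Z", so that the first ten characters are the date. The sample `timestamps` and the function name `edits_per_day` are hypothetical.

```python
from collections import Counter

# Hypothetical timestamps in ISO 8601 form (an assumption about the data).
timestamps = [
    "2007-07-21T01:02:03Z",
    "2007-07-21T18:30:00Z",
    "2007-07-22T09:15:42Z",
    "2007-07-20T23:59:59Z",
]


def edits_per_day(timestamps):
    """Count edits per calendar day, returned as sorted (day, count) pairs."""
    # The first ten characters of an ISO 8601 timestamp are YYYY-MM-DD.
    counts = Counter(ts[:10] for ts in timestamps)
    return sorted(counts.items())


for day, count in edits_per_day(timestamps):
    print(day, count)
```

Days with no edits do not appear in the output, which is worth keeping in mind before charting the result as a continuous time series in a spreadsheet.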

A second project focused on Matplotlib and generated heatmaps of contributions to articles about men and women in Wikipedia, plotted by time in Wikipedia's lifetime against time in the subject's lifetime. The heatmaps were popular with participants and were something that could not be easily done with spreadsheets.
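
Underneath a heatmap is a two-dimensional grid of counts. The sketch below builds such a grid in plain Python. The `points` data and the meaning attached to rows and columns are hypothetical and not taken from the project's code. A grid like this could then be handed to Matplotlib's `imshow` function for display, which is left out here so that the sketch runs without Matplotlib installed.

```python
def build_grid(points, n_rows, n_cols):
    """Count how many (row, col) points fall into each cell of a grid."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for row, col in points:
        grid[row][col] += 1
    return grid


# Hypothetical binned data: each pair is (subject-lifetime bin, wiki-year bin).
points = [(0, 0), (0, 0), (1, 2), (2, 1), (1, 2), (1, 2)]

grid = build_grid(points, n_rows=3, n_cols=3)
for row in grid:
    print(row)
```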

The challenge with Matplotlib was mostly around installation, which took an enormous amount of time when several learners ran into trouble. In the future, we will use Anaconda, which we hope will address these issues because Anaconda includes Matplotlib.

General Feedback

One important goal was to help get learners as close to independence as possible. We felt that most learners did not make it all the way. In a sense, our final session seemed to let out at a bit of a low point in the class: many users had learned enough to start venturing out on their own, but not enough to avoid struggling enormously in the process.

One suggestion to address this is to add an additional half-day session with no lecture or planned projects. Learners could come and mentors would be on hand to help them work on their projects. Of course, we want everybody to be able to come, so we should also create a set of "random" projects for folks who don't have projects yet.

  • The spacing between sessions was too large. In part, this was because we were creating the curriculum as we went. Next time, we will try to hold the sessions every other week (e.g., 4 sessions in 5 weeks).
  • The breaks for lunch were a bit too long. We took hour-long breaks but 45 minutes would have been enough. Learners were interested in getting back to work!
  • The general structure of the entire curriculum was not as clear as it might have been which led to some confusion. This was, at least in part, because the details of what we would teach in the later sessions were not decided when we began. In the future, we should present the entire session plan clearly up front.
  • We did not have enough mentors with experience using Python on Windows. We had many skilled GNU/Linux users and zero students running GNU/Linux. Most of the mentors used Mac OS X and most of the learners ran Windows.
  • Although we did not use it as a recruitment or selection criterion, a majority of the participants in the sessions were women. Although we had a mix of men and women mentors, most of our mentors were male and most of our learners were female, an imbalance we would have liked to avoid. If we expect a similar ratio in the future, we should try to recruit more female mentors and, in particular, to attract women to lead the afternoon sessions (all of the afternoon session lead mentors were male).
  • The SWC-style sticky notes worked extremely well but were used less, and seemed to have less value, as we progressed.

In the future, we might also want to devote more time explicitly to teaching:

  • Debugging code
  • Finding and reading documentation
  • Troubleshooting and looking at StackExchange for answers to programming questions

Budget

For lunch we spent $400 (pizza), $360 (slightly fewer pizzas), and $600 (fancy Indian food). This was for 50 students and ~15 mentors, but we assumed about 60 people would actually be there at each session. We also spent ~$50 in the mornings for coffee.

Most mentors could not make the follow-up sessions, so we spent about $100 per session on mentor dinners. If more people had shown up, it would have been closer to $200-250 per mentor dinner.

All of our food was generously supported by the eScience Institute at UW. The rooms were free because they were provided by the UW Department of Communication.

With a total budget on the order of $2000-2500, I think you could easily run a similar 3.5-day set of workshops. If we had a little more, we could do better than pizza for lunch.