Community Data Science Workshops (Fall 2014)/Reflections


 * If you're interested in putting on your own CDSW, you should also see our reflections from Spring 2014.

Over three weekends in Fall 2014, a group of volunteers organized the Community Data Science Workshops (Fall 2014), the latest in a series of four-session workshops designed to introduce absolute beginners to some of the basic tools of programming and analysis of data from online communities. The Fall 2014 events were held between November 7th and 22nd, 2014 at the University of Washington in Seattle.

This page hosts reflections on organization and curriculum and is written for anybody interested in organizing their own CDSW — including the authors!

In general, the mentors and students suggested that the workshops were a huge success. Students reported that they learned an enormous amount and benefited enormously. Mentors were also generally very excited about running similar projects in the future. That said, we all felt there were many ways to improve on the sessions, which we have detailed below.

If you have any questions or issues, you can contact Benjamin Mako Hill directly or can email the whole group of mentors at cdsw-au2014-mentors@uw.edu.

Structure
The Community Data Science Workshops (Fall 2014) consisted of four sessions:


 * Session 0 (Friday November 7th): Setup and Programming Practice
 * Session 1 (Saturday November 8th): Introduction to Python
 * Session 2 (Saturday November 15th): Building data sets using web APIs
 * Session 3 (Saturday November 22nd):  Data analysis and visualization

Our organization and the curriculum for Sessions 0 and 1 were originally borrowed from the Boston Python Workshop (BPW) although our curriculum has diverged quite a bit as we've improved it and tailored it to the specific learning goals in our sessions.

Session 0 was a three-hour evening session to install software. The other three sessions were day-long (10am to 4pm) and followed this schedule:


 * Morning, 10am-12:20: A 2 hour lecture
 * Lunch, 12:20-1pm
 * Afternoon, 1pm-3:30pm: Practice working on projects in 3 breakout sessions
 * Wrap-up, 3:30pm-4pm: Wrap-up, next steps, and upcoming opportunities

We collected detailed feedback from users at three points using the following Google forms (these are copies):


 * Application to the workshop
 * After Session 1
 * After Session 2
 * After Session 3
 * After Session 3 (Unretained) — Unsurprisingly, perhaps, not a single person filled this out so we will not bother with this in the future.

We used this feedback to both evaluate what worked well and what did not and to get a sense of what students wanted to learn in the next session and which afternoon sessions they might find interesting.

Participants
We had 30 mentors who attended at least one of the sessions and at least 20 mentors at each session. Many of our mentors were UW students in more technical departments like Computer Science and Engineering and Human Centered Design & Engineering. Perhaps half of them worked outside of the university as software developers.

We had about 150 participants apply to attend the sessions. We selected on programming skill (to ensure that all attendees were complete beginners) and enthusiasm, and then randomly, to maintain a learner-to-mentor ratio of between 4 and 5. We admitted 80 participants, 58 of whom listed a UW affiliation. Affiliations listed by at least three people include the following:

We had two people each who listed their affiliations as Bio- and Health Informatics, the Foster School of Management, Microsoft, and Wikipedia.

We also had people from Psychology, the City of Seattle, the Low Income Housing Project, Seattle Meshnet, Biochemical Engineering, Bio Physical, Chemical Engineering, Game Studies, Linguistics, College of the Environment, Oceanography, the School of Public Health, UW Bothell, Central Washington University, and many people who did not specify an affiliation.

Retention between sessions 0 and 1 was nearly 100%. Retention between sessions 1 and 2 and between sessions 2 and 3 was roughly 75% each, leaving us with perhaps 55-60% retention at the end of session 3.
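As a rough check on these figures, the per-session retention rates can be multiplied together. The sketch below assumes the approximate rates quoted above (treating "nearly 100%" as 1.0 and "roughly 75%" as 0.75). The variable names are ours, and the numbers are estimates rather than attendance counts.

```python
# Approximate retention between consecutive sessions, as estimated above.
# These are rounded figures, not exact attendance counts.
step_retention = {
    "session 0 -> 1": 1.00,  # "nearly 100%", rounded up
    "session 1 -> 2": 0.75,
    "session 2 -> 3": 0.75,
}

cumulative = 1.0
for step, rate in step_retention.items():
    # Each step keeps only a fraction of whoever was still attending.
    cumulative *= rate
    print(f"{step}: {rate:.0%} retained, {cumulative:.1%} of the starting group remains")
```

Under these assumptions the product is about 56%, which is consistent with the 55-60% estimate.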

Anecdotally, there is a sense that those who are dropping are those who had more trouble but didn’t struggle visibly.

Although our participant pool in CDSW (Spring 2014) was overwhelmingly female, this time there was close to gender balance among both students and mentors. Roughly 2/3 of mentees were from UW; the rest came from a wide range of places, including someone who works for the City of Seattle. A number of Wikipedians attended as well. We continue to think that it's cool that people who are not doing research but are part of online communities were in the mix with the researchers.

Once again, quite a large number of the people who applied were already skilled programmers. We're still not exactly sure why these people are applying, because we think the fact that the workshops are for absolute beginners is very clear. Perhaps people just want more exposure to data science?

Once again, the constraint on scaling the workshop is the number of mentors. Every mentor means that the workshop can accommodate four more mentees.

One suggestion was to allow mentees who have some programming skills, especially for the second and third workshops (given predictable rates of retention). There was no consensus among the organizers and mentors on this approach; some preferred recruiting more newbies and investing more in them.

Morning Lectures
Mako gave the lectures in Sessions 1 and 3. Frances Hocutt gave the lecture in Session 2 and, generally, this was seen as an important success. An important future goal is getting other people to give more of the lectures. Tommy is an obvious choice to do one next time. Different faces, perspectives, and backgrounds are useful to communicate the breadth of interest here. Mako does not want to be the only one giving these lectures.

Our biggest challenge with growing the workshops was physical space for the lectures. Basically, rooms that can hold more than 100 people at UW are almost exclusively lecture halls, which make it almost impossible for mentors to reach students.

We reserved a lecture hall that fit 200 people and filled it with 100 students in alternating rows to make it at least possible to reach each person. Projects are done in breakout sessions which can be split.

People continue to want a record of lectures. At the very minimum, we should make sure that we turn on console logging so that we can post this after the lectures.

Afternoon Sessions
Mining research interests/goals. Could we help match up people with similar interests?

Next time, maybe mine the registrations for a list of research questions.

How can we support self-directed projects?

Can we give mentees more guidance to support their project interests? It’s easier to do that if people are pre-clustered.

Bring up people’s ideas at the end.

The size of the breakout workshops varied and that means different degrees of engagement were feasible.

The BIG feedback from the first series of workshops: Bring people back together more often. Bringing people together in the end was effective this time. We need a go between for each session to remind people to reconvene. An emcee.

- Post examples of the code used in the lectures.

Showcase what students have accomplished and point out places where people can change things and do things differently.

e.g., the Ferguson thing with the example from Harry Potter

- Public health and epi data session

We would love to create a session on basic statistical analysis in Python and at least ten mentees would have been enthusiastic to take it.

Show and tell at the end was very effective.

We need a designated MC who can go between rooms.

Idiomatic Python

Talk to Chris to try to fix those things.

Session 0: Python Setup
The goal of this session was to get users set up with Python and starting to learn some of the basics. We changed the curriculum enormously to use Continuum's Anaconda instead of Python directly from python.org. The result was staggering. Not a single person reported "many problems with set-up" (i.e., respondents reported either "no problems" or a "few problems").

Anaconda was key to the smoothness of setup compared to the first workshop series and addressed most of our setup and path issues. That said, we had several major concerns:


 * Anaconda is not free software or open source
 * Anaconda does not support Python 3 which we'd like to move to
 * One student had a home directory named in Chinese, which caused the Anaconda installation to fail at a very late stage. This was eventually fixed by a mentor who changed the path.

Additionally, we moved the Windows curriculum away from  to using Powershell. This was a huge benefit because it meant that  works and the rest of the curriculum can converge. The only concerns were:


 * Powershell is not installed on Windows XP although not a single student had Windows XP

Changes for next time include:

 * When mentors can circulate easily, things are better for mentees.
 * Because setup went so smoothly, we can deemphasize recruiting mentors for the Friday night session.
 * Because Powershell was successful, we're going to try to create a single consolidated set of installation instructions for Windows, Mac OS X, and Linux!
 * We will make it clear to mentors whether participants should self-report that they've completed the steps or whether the mentor should verify that the steps were all taken. In the future, we should email mentors ahead of time to let them know.
 * We need to do a better job of modelling the use of sticky notes during lectures early on.
 * The sticky notes we bought were small and of an ambiguous color. We should get bright red sticky notes next time.
 * Set up/arrange/select the space to facilitate better circulation of mentors.
 * We are going to try writing installation instructions that do not rely on Anaconda so people have a fully open source option.
 * Once again, not a single person outside of mentors ran GNU/Linux. We should strongly consider how much effort we want to put into maintaining this part of the curriculum.
 * We should move to Python 3 to try to address lingering Unicode issues. We should try to do this for the next session.
 * Not everybody loves the checkout step. Maybe there's a way we can make it more fun?

We also had a bunch of general feedback on how we could improve mentorship that is particularly relevant to the earlier sessions.

Session 1: Introduction to Python
The goal of this session was to teach the basics of programming in Python. The basic curriculum was originally built off the Boston Python Workshop curriculum, which has been used many times and is well tested. Unsurprisingly, it worked well for us as well. We made several major changes. The biggest is that we retained only the Wordplay project and created a new project, Baby Names, that uses Social Security Administration data on the frequency of baby names.

Afternoon sessions
We felt that the new Baby Names project was excellent and feedback was overwhelmingly positive. Because it includes both lists of names and numbers, it can do everything that Wordplay can, but it has a much stronger feel of science to it and a higher ceiling. Wordplay felt relatively boring.

Suggestions based on feedback include:


 * Do a better job of bringing folks back together to walk through potential solutions to the questions posed in the project rooms.
 * Consider simply having two smaller rooms doing Baby Names and perhaps have one that emphasizes more numeric and math operations.
 * Prepare questions beforehand, list them all up front, and let folks choose what to work on.

Session 2: Learning APIs
The goal of this session was to describe what web APIs are, how they work (making HTTP requests and receiving data back), how to understand JSON data, and how to use common web APIs from Wikipedia and Twitter.
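To make the request-and-response cycle concrete, here is a minimal sketch using only Python's standard library. It builds a query URL for the MediaWiki API endpoint on English Wikipedia and then parses a JSON response. The response shown is a hand-written, simplified stand-in (the page ID and length are made up), and `build_query_url` is a hypothetical helper name, not part of the workshop materials.

```python
import json
from urllib.parse import urlencode

# Endpoint of the MediaWiki API on English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Return the URL for a basic page-info query (hypothetical helper)."""
    params = {
        "action": "query",  # ask the API to look something up
        "titles": title,    # which page we are asking about
        "prop": "info",     # request basic page information
        "format": "json",   # ask for the answer as JSON
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_query_url("Seattle")
print(url)

# A simplified, hand-written stand-in for the JSON text a server sends back.
# The values are invented for illustration only.
sample_response = """
{"query": {"pages": {"1": {"pageid": 1, "title": "Seattle", "length": 1000}}}}
"""

# json.loads turns the JSON text into nested Python dictionaries and lists.
data = json.loads(sample_response)
titles = [page["title"] for page in data["query"]["pages"].values()]
print(titles)
```

Sending the URL over HTTP (for example with `urllib.request.urlopen`) is deliberately left out so that the sketch runs without a network connection.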

Morning lecture
The morning lecture was given by Frances Hocutt and it was well received, if delivered too slowly for a significant minority of attendees. Unsurprisingly, the example of PlaceKitten as an API was an enormous hit: informative and cute.

Frances used excellent slides which are shared on the wiki page and which we will reuse. About half found Frances’s lecture either too fast or too slow and about half found the lecture to be just right.

Since many people felt the lecture was on the slower side, we want to use this time to introduce function definition up front. Then, functions can be reinforced in the week 2 workshops.

Afternoon sessions
There were three parallel afternoon sessions on Twitter, Wikipedia API and SQL. We plan to do some version of all three sessions next round:

Twitter:


 * Once again, the session had too many people and we should consider splitting it if we have mentors who are comfortable splitting it.
 * Next time, we should be careful to make sure that the advance notice asks everybody to download the project zip file ahead of time. If we're going to do this in class, we should set up a short URL of some sort to help streamline the process without everyone having to navigate the wiki.
 * A bunch of people found the Twitter session too fast.
 * Tweepy is not well documented.

The opaqueness of Tweepy was a problem. One option is to create a version of Tweepy that just gives you JSON.

Ask Miku or Michael for details on how to do that.

Dharma might be able to do this.

Wikipedia workshop:


 * The teacher explained things very clearly. That was frustrating for those who didn’t need it, but super great for people that wanted/needed a lot of explanation.
 * Graduated challenges in a workshop that go from less challenging to more and more challenging help with the fact that there is a range of learning levels.

SQL workshop:


 * Generally, this was very successful. It seemed to work really well and did a good job of giving people an overview of data science and a way to hook themselves into it.
 * Next session, also do a workshop that closes the loop between SQL and Python.
 * Can we host an open SQL database somewhere?

- Maybe split this into two sessions next time

- Merge in some more Python this time


 * 1) Intro to SQL
 * 2) Using Python to grab data and bring in Python and pandas
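One way a session could close the loop between SQL and Python is to run a query from Python and work with the results as ordinary Python data. The following is a minimal sketch using the standard library's `sqlite3` module with an in-memory database. The `edits` table and its rows are invented purely for illustration.

```python
import sqlite3

# An in-memory database, so nothing needs to be installed or hosted.
conn = sqlite3.connect(":memory:")

# A tiny, made-up table of (user, page) edit records.
conn.execute("CREATE TABLE edits (user TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO edits VALUES (?, ?)",
    [("alice", "Seattle"), ("bob", "Seattle"), ("alice", "Python")],
)

# The counting happens in SQL; Python receives a list of (user, n) tuples.
rows = conn.execute(
    "SELECT user, COUNT(*) AS n FROM edits GROUP BY user ORDER BY n DESC, user"
).fetchall()
conn.close()

# From here on it is plain Python: turn the rows into a dictionary.
edits_per_user = dict(rows)
for user, n in edits_per_user.items():
    print(user, n)
```

The same rows could be handed to pandas instead of a dictionary; that step is omitted here to keep the sketch dependency-free.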

Session 3: Data Analysis and Visualization
The goal of the lecture was to walk people through the actual mess of writing code.

Afternoon sessions
Afternoon of Session 3:

The spreadsheets session. People were modifying the code to build their own dataset and did their own visualizations. At least a few people. That was cool!

The Matplotlib session. Most people in the session were deeply lost. The mentors who taught it had not been at any of the other sessions and therefore didn't go in with a good sense of where the mentees were at. Several people left and went to other rooms. In the future, we should help mentors succeed by keeping them better in the loop on where the mentees are at. Next time, consider encouraging new mentors to do a practice session with some friendly folks before we let them loose. Also, next session, consider using Seaborn instead of Matplotlib.

Matplotlib

- Maybe replace it with Seaborn? Tommy will teach it.

General Feedback

 * Generally, there was a sense that we should stop creating pages in the wiki by copying and pasting old stuff. When we archive the old version, we can use MediaWiki to create links to the old versions of the pages (we can install templates from English Wikipedia) to make this easier.
 * We should try to schedule the workshop not as close to the end of the quarter. The beginning or middle of the quarter should be better for UW students.
 * Mentors should post the code generated in the break-outs. Encourage them to capture the code created in examples and to post these afterward systematically.
 * There was general interest in pair programming or more team-based exercises.


 * There was a need for several on-the-fly corrections of the instructions and files on the wiki during the workshop. Better planning and testing for this will be very useful.

Mentorship
Last time through, most of our observations were focused on improving the experience of attendees and we think we didn't spend as much time on helping mentors have a great experience and helping them prepare effectively. We had a series of pieces of feedback on how to improve this.

We had many new mentors this round. One general concern was the relative lack of mentor training, especially before the first sessions. We felt that we should:


 * Arrange a mentors meeting (perhaps a day or two before, to go over material), maybe at a bar or other social environment with beer and pizza. We could use this to cover norms, best practices, goals, planning, etc.
 * Perhaps meet 15-20 minutes early to get to know each other and go over things.
 * Create some easier way to distinguish mentors from students (e.g., t-shirts, buttons, paper them head to foot in sticky notes).
 * Send out detailed instructions and emails to mentors, or create pages in this wiki with detail on how to do this better.
 * Talk to mentors about how much they should help (e.g., some, but be careful not to just give away the answer or to focus too much on elegance or technical correctness, and be careful not to overwhelm the learners).
 * Explicitly encourage mentors to reach out to students and ask them how things are going by walking around to every single person to ask, “How are you doing? What are you working on? Show me what you’re doing.”

More Projects or Better Projects
We had certain afternoon project sessions that were much more effective than others. One thing we were conflicted about was whether we wanted more break-out sessions or whether we should just use the best of the break-out sessions (perhaps in two rooms).

Arguments for smaller groups of the best break-out session include:


 * Focus on a known good thing.
 * Precanned sessions make it easier for new mentors to feel confident and be successful.

Arguments against include:


 * Diversity of projects inspires people to do the kinds of things that people can do with this new knowledge.

Other ways to encourage generativity might include giving mentees creative, flexible moments within sessions and lectures, which might be empowering. Perhaps we could also call out mentees who are doing creative things?