Coursera is a pure-play online education provider distributing classes in a wide variety of subjects, from The Music of the Beatles to Analyzing Global Trends for Business and Society. Many courses are offered in the native languages of those who developed them, such as Peking University’s Methodologies in Social Research, while others have been translated for more widespread use – see Yale’s Financial Markets instructed by Prof. Robert Shiller (I’m a fan).
Part of the Massive Open Online Course (or “MOOC”) site’s push is linking courses developed by accredited higher-learning institutions into specializations: series of classes designed to develop skills in a particular field. There are seven specializations as of this writing, and the one I dove into is called Data Science.
Made up of nine segments created by Johns Hopkins University’s Bloomberg School of Public Health – taught by Professor Brian Caffo and Assistant Professors Roger Peng and Jeff Leek – it’s a half-year commitment for Energizer bunnies with a math/programming bent, and probably twelve to eighteen months if right this moment you’re distracted by your Instagram feed. Ok, maybe twenty-four to thirty-six.
Yours truly took the fast track, doubling and tripling up on classes at the outset and leaving the purportedly hard stuff for the windup to winter solstice. What follows is a summary of each class, including a comparison to what was “sold”, tips for getting the most out of them (i.e., scoring well, and then some), and supplementary materials I discovered that proved worthwhile. It’s the truth from a dedicated student’s point of view, and a long road, so feel free to skip to the conclusions; just don’t make fun of my grades.
1) The Data Scientist’s Toolbox (link)
Estimated Workload: 3-4 hours/week
Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account.
If you know your way around a laptop and the internet (and I don’t mean Facebook), you can complete this class in half a quiet weekend day. It took me about six total hours to rip through, and that was while also taking R Programming. It may seem easy to sign up for a cloud service and install some software, but in later courses the forums were littered with GitHub, RStudio, and other menial problems that this class had already covered. I am glad I paid attention, regardless of the difficulty level.
2) R Programming (link)
Estimated Workload: 3-5 hours/week
The course will cover the following material each week: i) Overview of R, R data types and objects, reading and writing data, ii) Control structures, functions, scoping rules, dates and times, iii) Loop functions, debugging tools, and iv) Simulation, code profiling
If you know most any modern programming language, you’ll both enjoy the concepts presented within and not have too hard a time. However, the instructors are already throwing tricky problems at you (read: adjust your thinking cap), so the true investment is more like 5-7 hours per week. Caveat: that’s coming from someone who previously did all their data analysis in either Excel or relational databases. Further, I used for-each loops instead of the ‘apply’ family of functions for my project work; that voodoo kinda spooked me, and old school got the job done without having to reinvent the mental wheel in haste. Finally, run with swirl, the embedded tutorial package the course relies on – it’s a winner for working through R programming concepts, and you might need the extra credit.
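For the curious, here is roughly what that trade-off looks like – an explicit loop versus sapply computing the same column means. The tiny data frame is made up purely for illustration:

```r
# Column means two ways: explicit loop vs. the 'apply' family.
# The data frame here is hypothetical, purely for illustration.
df <- data.frame(a = 1:4, b = c(2, 4, 6, 8))

# Old school: preallocate a result vector, then loop
means_loop <- numeric(ncol(df))
for (i in seq_along(df)) {
  means_loop[i] <- mean(df[[i]])
}

# The sapply equivalent -- one line, same answer
means_apply <- sapply(df, mean)

means_loop            # 2.5 5.0
unname(means_apply)   # 2.5 5.0
```

Both get the job done; the apply family just trades the loop bookkeeping for a function argument.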
3) Getting and Cleaning Data (link)
Estimated Workload: 3-5 hours/week
Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation.
This course felt harder than the sell, but I was also taking Exploratory Data Analysis and Reproducible Research along with it (btw, not the best idea). Fully wrapped around the R language, if you are a rookie you’ll be very glad you took R Programming beforehand, as this segment dove into data manipulation with nary a spreadsheet crutch to lean on. There were few zingers in the quizzes, but the project rubrics were cupcakes sprinkled with obfuscation; patient due diligence was required to discern what was being asked. But once you did, it was pretty easy going. This was also the point where I realized peers were grading the projects, and to keep presentations simple/easy to understand and follow.
All in, I figure I invested 8-10 hours a week on the class, but as a result have since dumped Excel for most raw data manipulation. If you do any amount of number tumbling in your day job, this course will really sell you on adopting R.
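To give a flavor of what replacing the spreadsheet looks like, here is a minimal base-R cleanup – the sales data is invented for the example:

```r
# Hypothetical messy data: one missing value to weed out
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(100, NA, 250, 175)
)

# Drop incomplete rows, then total by group -- no Excel required
clean  <- sales[complete.cases(sales), ]
totals <- tapply(clean$amount, clean$region, sum)
totals
# east west
#  350  175
```

Three lines of code instead of filters, pivot tables, and copy-paste – and unlike the spreadsheet, the steps are recorded.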
4) Exploratory Data Analysis (link)
Estimated Workload: 3-5 hours/week
After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques.
Graphs, graphs, and more graphs – the upper range of the workload estimate is in step. I learned how to create lots of fancy graphics to illustrate analysis, most of which were forgotten soon thereafter. Lesson learned … hold onto that code! This student wound up reusing chunks for Reproducible Research (which I took simultaneously with this class) and even integrated some of it into a few work-related reports. R’s graphics functions are nifty, and certainly more flexible than the canned stuff you might be used to. Moderately worth the effort, and while the quizzes were cake, I did find peers turned exceptionally picky during project grading; this might be the only course where it’s worth getting more creative with a project rather than less. In addition, you should be pretty handy with GitHub by this point; if not, you’re in trouble.
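As a small taste of the base plotting system (one of the three the course covers), the following writes a scatterplot with a trend line to a file, using the mtcars dataset that ships with R; the filename is my own:

```r
# Exploratory scatterplot with the base graphics system.
# mtcars ships with R; png() writes to a file instead of the screen.
png("mpg_vs_wt.png", width = 600, height = 400)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Heavier cars burn more gas")
abline(lm(mpg ~ wt, data = mtcars), col = "red")  # quick trend line
dev.off()
```

Swap plot() for ggplot2 or lattice calls and the workflow is the same – which is exactly why saving the code pays off later.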
5) Reproducible Research (link)
Estimated Workload: 3-5 hours/week
In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others.
I am 100% sold on the concepts presented within; if you can’t hand me the dataset and your code, and I can’t easily replicate your study with it, I now consider your work not worth the paper (or PDF) it is printed on. Some of the real-life examples exposed, where renowned academics completely screwed the pooch on otherwise world-changing studies, put even more gasoline on the belief fire. I probably spent ten-plus hours a week on this one, but was enthralled after lecture #1 and have since used R markdown and related technologies everywhere I can. The quizzes aren’t too bad, nor was the first project. The final project was not difficult, but it was time consuming, taking nearly two full days on its own even while struggling NOT to overdo it. Probably the most utilitarian course in the series.
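For reference, a complete R Markdown document really can be this small – knitr recomputes the inline R expression on every compile, so the reported number can never drift out of sync with the data (the title here is made up):

````
---
title: "A reproducible sketch"
output: html_document
---

The mean mpg in `mtcars` is `r mean(mtcars$mpg)` -- computed
fresh from the data every time the document is knit.
````

Hand someone that file plus the dataset and they can rebuild your entire report with one command.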
6) Statistical Inference (link)
Estimated Workload: 3-5 hours/week
In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use the skills developed as a roadmap for more complex inferential challenges.
If you don’t have a solid background in linear algebra, are loath to watch mathematicians draw Greek letters on chalkboards, and/or generally fall asleep anytime someone brings up the topic of particle physics, you might want to watch the lectures once … twice … at least three times over. Then watch them again. The only truly positive note here is the instructors warned ahead of time how difficult it is. The estimates are a joke; I spent more time “getting” this material than I did on the previous five classes. Combined. Further, I had to reference quite a few outside resources; the lectures were steeped in theory so far over my practical nature that I debated dropping the rest of the series most every time I looked at one.
I needed every attempt on the quizzes, and the projects were no picnic either. In particular, one project was entirely not what it seemed at first; the forums were quickly littered with unbridled speculation about requirements, and I took all postings with a grain of salt thereafter (more on that later). Got through it all after putting on blinders. Finally, I took this course in conjunction with Developing Data Products, which was also a mistake.
7) Regression Models (link)
Estimated Workload: 3-5 hours/week
In this course students will learn how to fit regression models, how to interpret coefficients, and how to investigate residuals and variability. Students will further learn special cases of regression models, including the use of dummy variables and multivariable adjustment. Extensions to generalized linear models, especially Poisson and logistic regression, will be reviewed.
Much like Statistical Inference, course developers flagged this one as a pain in the ass. But after the former, it’ll seem like welcome relief (for about ten lousy seconds). The instructor starts off by noting that the majority of students likely come from a computing background, then proceeds by tearing into alphas and sigmas and gammas without mercy. But at least there are frequent notices saying you can “skip this bit” if you’re not interested in proofs.
I immediately applied my “rip through the material as fast as you can and get through the quizzes by hook or by crook” approach (noted below), then went searching for supplementary material that might provide some insight into completing the project. I found it; meanwhile, said project had strict limits on length, which precluded overthinking. Having the skills taught in Reproducible Research is a must here, as are general research, writing, and referencing capabilities. If you’ve got those, can type lm(blah ~ blah, data=blah) into R, remember what a confidence interval is, and actually try (versus cry), you’ll pass. Additionally, my external research into regression methodologies supplanted most everything taken in via the primary material, but it was worth the time invested. I had the course essentially completed just as Week 2 was ending, then fiddled with my project presentation until the peer-evaluation window opened up.
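That bar, made concrete – fitting and reading a simple linear model on R’s built-in mtcars data:

```r
# Fit a simple linear model on R's built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)   # miles per gallon vs. weight

coef(fit)      # intercept and slope (slope is negative: heavier = thirstier)
confint(fit)   # and there's that confidence interval, 95% by default
```

If you can explain what those two outputs say about the relationship between weight and fuel economy, you are most of the way to passing the project.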
8) Practical Machine Learning (link)
Estimated Workload: 4-6 hours/week
Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools. You will also learn to apply these tools to build and evaluate predictors on real data.
This is the pièce de résistance of the Data Science Specialization, the chance to learn how to tell someone something they don’t already know using data and R. I drank like a fish and worked like a dog, non-stop for nearly two weeks straight, just so I could carve out some alone time with Practical Machine Learning. Broad strokes: the title is fitting, with the instructor cooking useful code examples while the introduction lecture is still warm; the quizzes are NOT to be trifled with – you are expected to know how to make all those functions you’ve just been shown work – and the project was no cake walk either.
Like Statistical Inference and Regression Models it’s a lot to take in, but at least there were fewer than a handful of mathematical equations presented throughout. It isn’t just button pushing though; if you don’t know how to use Github, can’t code R at minimum like an enthusiastic rookie, don’t understand basic statistical concepts such as resampling and variance, or think knitting is something done with a needle and yarn, you’ll be lost in the blink of a keystroke. If, however, you make it through the lectures on random forest and gradient boosting without being completely befuddled, consider yourself a prize-winner; those methodologies are the laser beams of the supervised learning world. Disappointing there aren’t any sharks involved though.
9) Developing Data Products (link)
Estimated Workload: 3-5 hours/week
Students will learn how to communicate using statistics and statistical products. Emphasis will be placed on communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products.
While this is officially the last class in the specialization, I took it earlier in conjunction with Statistical Inference. There is almost nothing related to math here, i.e., it’s easy, particularly if you know anything about HTML, CSS, client- and server-side coding, or just loathe PowerPoint. If you fit that bill, and have already taken Reproducible Research (which introduces you to markdown), you can knock off all the lectures and quizzes in a few sittings, then complete the projects while sipping whiskey one evening during a cold snap. Like this guy did.
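For the curious, a “simple Shiny web application” really does fit in one screenful. This skeleton is my own sketch (the input names are invented); it assumes the shiny package the course is built around, and runs via shiny::runApp() or by sourcing it in RStudio:

```r
library(shiny)  # the web-app framework this course teaches

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n))
  })
}

shinyApp(ui, server)   # serves the app when run interactively
```

Drag the slider and the histogram redraws itself – which is the whole reactive-programming pitch in a nutshell.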
Tips for Surviving the Data Science Specialization
1) Notes: Download all the lecture notes at the beginning of the class – you’ll find them useful reference both for quizzes and for having example code handy when it’s project time.
2) Quizzes: Print out all the quizzes, review each, and take notes on them as you follow the lectures. Read the quiz questions very carefully, as many of them are designed to trick. And take advantage of all your quiz attempts – don’t leave an 80% score sitting on the first try when you have two more open windows available for perfection. Use your quiz printouts to mark right/wrong answers on attempts, then scour your lecture notes for information related to that question. When in doubt, make an educated guess!
3) Projects: Print out the project rubric(s) and grading criteria, early. Keep them handy throughout the lectures, and by the time the project comes around you should know exactly what to do. When working on the projects, think elegant, not encyclopedic. I realized early on that because many projects would be peer graded, going overboard on them would serve to confuse and subsequently hurt scores. You will be better off with succinct, well-written explanations, proper spelling, and ordered labeling than you will dumping a pile of code into a PDF then praying your fellow students can figure it all out. The projects are the culmination of all previous efforts, where you’ll realize you learned more than you thought you did.
4) Timing: When this student got to Statistical Inference, he realized he had his hands full, and that didn’t change for Regression Models or Practical Machine Learning either. So a new course strategy was employed, whereby I cranked through the lectures with vigor, knocking off all the quiz questions I could (easily, on paper) as quickly as possible; in most cases I wound up with between 60% and 80% of the quiz questions down stone cold. Then I ran back through the lectures, week by week, filling in the mental blanks and completing the quizzes. By late in week two I could concentrate on the project requirements. Reasoning: some of the course material in the last week was actually needed to produce a salient project, particularly when the rubric seemed vague. Don’t try whipping together a project at the last minute – you will wind up getting caught short.
5) Statistical Inference: By far the most difficult course in the series, it is really an entire semester of statistical theory crammed into four weeks. My suggestion … take notes, lots of notes. Write down every goddamn equation, figure, and truism that comes out of the lectures. I used sticky notes for this, and plastered them across the wall over my desk. Then I rearranged them according to subject matter. After that I sorted them again, on paper, according to what I perceived the instructor was really trying to convey, while scribbling additional thoughts on the back of each sheet. By the time the projects came around, I was thinking big picture instead of minutiae.
And finally …
6) The Forums: I am all for joint interaction and open communication, but sometimes the forums were just too much. I am not suggesting ignoring them altogether, instead offering up a pointer to be diligent regarding who and what you pay attention to within. Alarm bells should ring when any of the following occur: A) a TA states that one solution is determinately better than another, referencing only their own project results from last term; B) a select few participants hijack every forum thread with long-winded explanations to seemingly straightforward problems; C) someone starts a thread with a question, then two posts later is conveying their spontaneous expertise to anyone attempting to assist; and D) a student loudly criticizes course requirements, exclaiming changes should be made to suit their need, particularly when said need is based on their admittedly ignoring prerequisites and/or published instructions.
The forums would be much more useful if the staff was more aggressive with moderation, including quashing duplicate inquiries, designating posts resolved, and otherwise managing the venue versus just doing the Q&A dance.
Supplementary Online Material
This student made use of the following resources; the first was suggested by course developers at the outset, while the rest I sought out on my own:
StackOverflow – a solid resource, but be forewarned that a lot of folks post complex inquiries there, so you often have to parse, test and tweak solutions to “dumb down” for your particular issue.
Rdocumentation – a comprehensive resource for R functionality, easier to read than R’s internal help files.
Solid Reference Material
The materials below are (legally) free, although they may require a little effort to find. I’m making it easy on you, and they are now all permanent residents of my digital library:
OpenIntro Statistics 2nd Edition; Diez, Barr, Cetinkaya-Rundel – https://www.openintro.org (PDF)
R function reference cards – “cheat sheets” from a variety of authors – let me Google that for you
And some more advanced stuff …
An Introduction to Statistical Learning, with Applications in R; James, Witten, Hastie, Tibshirani – http://www-bcf.usc.edu/~gareth/ISL/ (PDF)
Elements of Statistical Learning, 2nd Edition; Hastie, Tibshirani, Friedman – http://statweb.stanford.edu/~tibs/ElemStatLearn/ (PDF)
Mining Massive Datasets, 2nd Edition; Leskovec, Rajaraman, Ullman – http://www.mmds.org (PDF)
Expectations and Conclusions
The Data Science Specialization is not for the undisciplined, the intellectually lazy or those with an entitlement mentality. If you do not have significant evening time to spare or are not willing to forego spending 20 hours a day on “social media”, you should not bother; the classes will, in several cases, entail significantly more time than the course developers estimate. If you spent your formal education expecting your teachers to hand you answers – like they did in grade school because your mother was captain of the PTA – not only are these courses not for you but the subject matter probably isn’t either. Go take a class in The Social Structure of Pre-Cambrian Basket Weavers if you seek validation; data analysis, statistics, and machine learning are for those who not only enjoy getting their asses kicked but will beg for more after every beatdown.
Pay for the courses, particularly if that’s what it takes to keep you motivated. I went the free route, figuring a “verified certificate” from a non-accredited institution was no better than an unverified one, and certainly a pittance compared to the knowledge gained. If your end-game is pasting “Johns Hopkins University” in the education section of your LinkedIn profile, Idiocracy is a must watch; after that review the disclaimers, paying particular attention to the part about these courses being entirely invalid as degree requirements go. Not paying also meant forgoing the opportunity to participate in the capstone project, but after the sometimes hellacious course-load I was in no mood to do any free work. If you apply what you’ve learned along the way, you may very well feel the same.
Will you be a bona fide benjamin-printing mofo after the Data Science Specialization? I don’t think so. Consider it instead an introduction to data analysis and machine learning, more of a primer if you’re hoping to get into the field. If you are already involved somehow, whether it be as a financial analyst or lab biologist, taking the series on has significant merits; it might even be worth a pay raise. However, if you are a CIO, CMO, or otherwise tasked with managing those that do the prescribed work, you should consider the education not only suitable for consumption but required training. For your own self.
About midway through Reproducible Research I started using R for various work-related projects, including website traffic and online ad spending analyses, as well as sorting through the historical asset and liability positions of an irrevocable trust. Practice makes perfect, and money buys extravagant fly-fishing trips.
Intriguing subject matter, and I’m somewhat convinced the big data movement’s supposed failure to find real answers stems not from shortage of knowhow, but from a dearth of perspective. Too many heads twirling too much code over too much data, and not enough stepping back to take in the forest. I witnessed many course participants, seemingly bright and articulate, get completely wound up in “proper” R usage when perusing a two-paragraph dataset description, reviewing some original survey methodology, or even doing a web search on “the primary parts of an automobile” would have made their lives so much easier.
MG signing off (to close with scores from this insanity – click here)
Editor’s note: A very special thanks goes out to Johns Hopkins University and the brilliant geniuses within – Brian Caffo, Roger Peng, and Jeff Leek. The sheer volume of effort required to produce the course materials must have been astounding, let alone what it took to keep the quality as top notch as it was. Additional gratitude goes to Coursera and their staff – nice work folks.
UPDATE (7/3/19): I’ve fielded questions over the years as to why I don’t just use Python for this data analysis stuff and/or review a course on the matter. I do (albeit sparingly) and did (at least on how to get started with Python).