Supervised Learning in a nutshell

I recently created a little notebook that describes popular supervised learning algorithms. It can serve as a cheat sheet for remembering what these algorithms do. I embedded the notebook here. If you want the fullscreen version, head over to GitHub and open the Gist.

Continue reading “Supervised Learning in a nutshell”


How I created a SNL dataset with Scrapy
Not long ago Kaggle introduced its new dataset feature: every member of the community can now upload their own datasets for others to play with. This is a very cool thing, and there are lots of interesting datasets out there. You can also use Kaggle to promote your own dataset. I was thinking about a dataset I could provide, and while reading through the LiveFromNewYork subreddit I got the idea: what about a Saturday Night Live dataset? I searched around the web and found snlarchives.net, which has a very comprehensive database. I contacted the creator but got no answer. Still, I didn’t want to stop the project before it really began, so I decided to scrape the data from the website myself. This blog post shows how I did that and what we can learn from over 40 seasons of hilarious data.

Continue reading “How I created a SNL dataset with Scrapy”

Goblins, Ghosts and Ghouls
Yesterday a very fun competition over at kaggle.com finished: Goblins, Ghosts and Ghouls was this year’s Halloween competition. It was targeted at beginners and therefore right up my alley. The task was to classify three types of monsters: goblins, ghosts and ghouls. In this blog post I will talk about how I went about predicting the type of each monster.
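The task boils down to a standard three-class classification problem. As a sketch (using synthetic stand-in data instead of the actual competition files, so the features here are placeholders), a simple baseline could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the real competition provided a handful of
# numeric features per creature; four random columns serve as placeholders.
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = rng.choice(["Ghoul", "Goblin", "Ghost"], size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A random forest is a reasonable first baseline for tabular data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```

With the real data you would load the competition CSV into `X` and `y` instead and evaluate with a holdout set or cross-validation.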

Continue reading “Goblins, Ghosts and Ghouls”

Look into your ML black box with LIME
While listening to the Data Skeptic podcast today I learned about a new Python framework called LIME (Local Interpretable Model-agnostic Explanations). It tackles a common problem in the field of machine learning: the black box.
One thing that always raises eyebrows when I talk to people without a data science background is that it is often impossible to explain how an algorithm arrives at its output.

Why would we invest in/work with/trust in machines we do not understand?

The developers of LIME actually have some great examples where even the data scientist can benefit immensely from LIME. Their main argument is that without looking into the black box it is impossible to use our own powerful tool: human intuition. For example, if you classify newsgroup posts and LIME tells you that your model classifies by looking at stop words, you instantly know something is wrong and can go back to tweaking the model.

Another, more frightening example is a model that returns a probability indicating whether a patient will survive if they are not treated immediately. Imagine that this algorithm bases its probability on the patient’s name (or some other irrelevant feature).

LIME aims to tackle this problem and is already available on GitHub. I also want to recommend the Data Skeptic episode “Trusted machine learning models with LIME”. Read more about LIME here.