As a passionate fan of Borussia Dortmund I have a bit of a weak spot for soccer data. I constantly find myself clicking through Wikipedia or the German soccer data site transfermarkt. Last month a user on Kaggle uploaded an unbelievably huge soccer database. You can find it here.
The Kaggle community has already jumped on this dataset and created some very interesting analyses. For example, are you interested in the different stats of the top 20 players? Or in how the players moved between the leagues? You can also learn about the predictability of the different leagues.
While listening to the Data Skeptic podcast today I learned about a new Python framework called LIME (Local Interpretable Model-agnostic Explanations). It tackles a common problem in the field of machine learning: the black box.
One thing that always raises some eyebrows when I talk to people without a data science background is that it is often impossible to explain how an algorithm arrives at its output.
Why would we invest in/work with/trust in machines we do not understand?
The developer of LIME actually has some great examples where even the data scientist can profit immensely from LIME. His main argument is that without looking into the black box it is impossible to use our own most powerful tool: human intuition. For example, if you classify newsgroup posts and LIME tells you that the model decides by looking at stop words, you instantly know something is wrong and you can go back to tweaking your model.
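To make the stop-word example a bit more concrete, here is a minimal sketch of the idea behind LIME in plain Python. It is not the LIME library and does not use its API. Everything in it is an assumption for illustration: `black_box` is a hypothetical toy classifier that only reacts to the stop words "the" and "of", and `explain`, `solve`, the example sentence, and the kernel width are made up. The sketch removes words from a post, asks the black box for a score each time, weights the perturbed posts by how close they are to the original, and fits a weighted linear model whose coefficients act as the explanation. Because the sentence is short it tries every combination of removed words, whereas a real implementation would sample.

```python
import itertools
import math


def black_box(words):
    """Hypothetical flawed classifier: its score depends only on stop words."""
    return 0.1 + 0.5 * ("the" in words) + 0.3 * ("of" in words)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain(text, predict, kernel_width=0.75):
    """Fit a weighted linear surrogate around one text and rank its words."""
    words = text.split()
    n = len(words)
    rows, targets, weights = [], [], []
    # Each mask says which words are kept (1) or removed (0).
    for mask in itertools.product([0, 1], repeat=n):
        kept = [w for w, keep in zip(words, mask) if keep]
        removed = 1 - sum(mask) / n  # fraction of words removed
        rows.append([1.0] + [float(bit) for bit in mask])  # leading 1.0 = intercept
        targets.append(predict(kept))
        # Perturbed posts closer to the original get a larger weight.
        weights.append(math.exp(-(removed ** 2) / kernel_width ** 2))
    # Weighted least squares: solve (X^T W X) beta = X^T W y.
    k = n + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(k)]
    beta = solve(xtwx, xtwy)
    # Skip the intercept and sort words by the size of their influence.
    return sorted(zip(words, beta[1:]), key=lambda p: abs(p[1]), reverse=True)


explanation = explain("the history of hockey playoffs", black_box)
for word, weight in explanation:
    print(f"{word:10s} {weight:+.3f}")
```

In this toy setup "the" and "of" end up at the top of the list while the words that actually carry meaning get weights of practically zero, which is exactly the kind of warning sign described above.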
Another, more frightening example is a model that returns the probability that a patient is going to survive if he doesn’t get treated immediately. Imagine that this algorithm decides on the probability by looking at the name (or some other irrelevant feature).
LIME aims to tackle this problem and is already available on GitHub. I also want to recommend the Data Skeptic episode “Trusted machine learning models with LIME”. Read more about LIME here.