Not long ago, Kaggle introduced its dataset feature. Every member of the community can now upload their own datasets for others to play with. This is a very cool thing, and there are lots of interesting datasets out there. You can also use Kaggle to promote your own dataset. I was thinking about a dataset I could provide, and while reading through the LiveFromNewYork subreddit I got the idea: what about a Saturday Night Live dataset? I searched the web and found snlarchives.net, which has a very comprehensive database. I contacted the creator but got no answer. I didn’t want to stop my project before it had really begun, so I decided to try scraping the data from the website. This blog post shows how I did that and what we can learn from over 40 seasons of hilarious data.
Continue reading “How I created a SNL dataset with Scrapy”