
Introduction: Scraping a news website

##Introduction

The first project that I ever wanted to build when I was learning to program for mobile was a news app for my university paper, Leeds Student. I got started, but I soon realised that an app like that requires more than Objective-C know-how.

###Motivation

I wanted to experiment with a new programming language to solve a problem I had. In this series, I will experiment with creating my own backend service that scrapes data from a news site using Python, stores the structured data in a MongoDB instance and serves the results from an API written on top of a Python framework.

This tutorial is not just about scraping websites; it’s my journey into the world of big data. Creating a news app is the most basic example I could think of, but the content collected can also be analysed for sentiment and trends, and used by researchers and PR/media companies for a wide range of reasons. Is this all legal? Probably, but don’t take my word for it. Lest we forget, Google is one big scraper as well! It all depends on what you do with the information.

###Background

I have an app out there that currently gets news articles from an RSS feed on Nation Media. The app is called Habari and it’s available on the App Store. It is not a very well written piece of software. The backend, just like the app itself, is not very polished; I like to think of it as a work in progress.

This series will be divided into three parts. There’s no guarantee that I will finish writing all the parts. If anyone is reading and wants some help, I would be more than glad if you shot me an email or found me on Twitter.

This series will include:

  1. Building the Scraper and Crawler
  2. Saving our structured data using MongoDB
  3. Implementing the API using Flask
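To give a rough feel for how the three parts fit together, here is a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: the sample HTML, the `headline` class name, and the `scrape`, `save` and `serve` functions are hypothetical, a plain dict stands in for a MongoDB collection, and a JSON string stands in for what a Flask route would return. The real posts may end up looking quite different.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML instead.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">First story</a></h2>
  <h2 class="headline"><a href="/news/2">Second story</a></h2>
  <p>Not a headline</p>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collects the link and title of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._in_headline = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "headline":
            self._in_headline = True
            self._current = {"url": None, "title": ""}
        elif tag == "a" and self._in_headline:
            self._current["url"] = attrs.get("href")

    def handle_data(self, data):
        # Only text inside a headline contributes to the title.
        if self._in_headline:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_headline:
            self.articles.append(self._current)
            self._in_headline = False
            self._current = None


def scrape(html):
    """Part 1: turn raw HTML into a list of structured article dicts."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.articles


def save(store, articles):
    """Part 2: upsert articles by URL (the dict stands in for a database)."""
    for article in articles:
        store[article["url"]] = article
    return store


def serve(store):
    """Part 3: build the JSON body an API endpoint might send back."""
    return json.dumps(sorted(store.values(), key=lambda a: a["url"]))


store = save({}, scrape(SAMPLE_HTML))
print(serve(store))
```

Keying the store by URL means that scraping the same page twice does not create duplicate articles, which is roughly what an upsert would do in a real database.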

Ultimately, this post’s raison d’être is to keep track of my learning activities. There will be plenty of bugs.

Welcome to my blog!

Welcome to my blog. Here I will be writing whatever I wish; after all, it is my blog.

Posts will be about tech and programming, but on the off chance I am a bit moody, there could be insane rants ending in tears and regrets.

I want to keep an online journal of whatever I am learning in the hopes that this will provide me with the extra motivation to continue.

I am an iOS developer at RedAnt. Before that I worked with some awesome people at The Other Media. I studied Electronic and Computer Engineering at the University of Leeds, although I enjoy learning on my own, mainly by breaking stuff.

I mess around in Python, Ruby on Rails and Android. I am currently deploying most of my stuff on Rackspace, but I am slowly warming up to Amazon EC2 (this topic deserves a blog post). I hate all the recruiter spam on LinkedIn, I hate emails, and I also hate the London tube during rush hour. And my surname is BOSIRE (BO-SEE-RAY), not BO-SA-YAH and definitely not BO-ZEE-YEH.

Some iOS apps I have worked on (in no particular order; this list is not complete):

####Personal

  • Habari - News reading app powered by a RoR backend hosted on Rackspace

####RedAnt

####The Other Media

####Dead in the water projects

  • XKCD Reader.
  • World Travel Advice.
  • Leeds University Union.

“I don’t know the key to success, but the key to failure is trying to please everybody.”