Scraping bot

I just recently discovered the joys of scraping. Yes, the “technique” is a bit underhanded, but it sure is fun. I guess as long as you don’t profit off of it (and it doesn’t affect the site you’re scraping), it’s OK. If you scraped a site just to present it in a different design (like scraping all the news off CNN, making a site called BNN or something, and then charging users a smaller fee), then yeah, that’s just plain dirty.

I scrape for fun and that’s it. Anyway, there I was, scraping movie schedules off the only movie schedule site that I know of (clickthecity.com), and suddenly I had an idea. What if I just made an RSS feed for a movie house that I liked? It would be so cool, since clickthecity (or CTC) doesn’t have an RSS feed. So I started researching how to make an RSS feed, and while I was doing so, Topher brought up the idea of making a bot. He showed me how jabberbot works, and from there I went on and played around with it.

To make the scraper, I used the Rubyful-soup plugin for Ruby on Rails (along with Mechanize). Using it was easy; the main problem was actually how to traverse the atrocious table layout of clickthecity. Once I got over that hurdle, though, getting the info I wanted was easy. And what did I need? Just the date of the showtimes (so I’d know whether it was the latest update), the location of the cinemas (the mall where the cinemas are located), the cinema name (or number), the movies being shown at each movie house, and the showtimes of each. These were saved into a database I made. (I had a bit of trouble putting them in, especially since there were movie houses showing two different movies on the same day.)
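To illustrate the “two movies in one movie house on the same day” problem, here is a minimal sketch in plain Ruby. It is not the schema the bot actually uses: the `Screening` struct, its field names, the `group_by_cinema` helper, and all the sample data are made up for this example. The idea is simply to key each record on date, location, and cinema, and let that key hold several movies.

```ruby
# Hypothetical record for one scraped screening; the field names are assumptions.
Screening = Struct.new(:date, :location, :cinema, :movie, :showtimes)

# Group flat scraped rows by (date, location, cinema) so that a single cinema
# on a single day can hold more than one movie, each with its own showtimes.
def group_by_cinema(screenings)
  screenings
    .group_by { |s| [s.date, s.location, s.cinema] }
    .transform_values { |rows| rows.map { |r| [r.movie, r.showtimes] }.to_h }
end

# Invented sample data: Cinema 1 shows two different movies on the same day.
rows = [
  Screening.new("2009-01-15", "Example Mall", "Cinema 1", "Movie A", ["12:30", "15:00"]),
  Screening.new("2009-01-15", "Example Mall", "Cinema 1", "Movie B", ["18:00", "20:30"]),
  Screening.new("2009-01-15", "Example Mall", "Cinema 2", "Movie C", ["13:00"]),
]

group_by_cinema(rows).each do |(date, location, cinema), movies|
  puts "#{date} #{location} #{cinema}"
  movies.each { |title, times| puts "  #{title}: #{times.join(', ')}" }
end
```

In a real database this would probably become a separate table of screenings that belongs to a cinema, rather than one row per cinema per day.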

Next was the bot itself. Like I said, I used jabberbot, which was also a handy plugin for Ruby on Rails. The syntax was simple, and this was the easiest part. I made a new Google account just for the bot, and now it is up. I have yet to make it work on Yahoo Messenger (Topher tried it and it didn’t work, but I haven’t tried it myself yet), so I’ll just have to settle for Gtalk.

The flow of the program is simple: scrape the data from clickthecity.com and then save it to the database (done using a rake task I made). I have yet to automate these tasks, along with the connection of the bot (I still have to start it manually using another rake task). Anyway, after manually calling the rake tasks for scraping the site and starting the bot, movieschedule-bot is ready to go.
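The scrape-then-save flow can be sketched roughly like this. It is only an illustration, not the actual rake task: `ScheduleStore` is an in-memory stand-in for the database, `scrape_schedules` returns canned data instead of hitting the site, and every name and value here is hypothetical. The one idea it shows is using the showtime date to decide whether a scraped schedule is newer than what is already saved.

```ruby
require "date"

# In-memory stand-in for the database (hypothetical; the real bot uses a
# database filled by a rake task). Maps a cinema name to its latest schedule.
class ScheduleStore
  def initialize
    @latest = {}
  end

  # Saves the schedule only when it is newer than the stored one.
  # Returns true if something was written, false otherwise.
  def save_if_newer(cinema, date, movies)
    current = @latest[cinema]
    return false if current && current[:date] >= date

    @latest[cinema] = { date: date, movies: movies }
    true
  end

  def fetch(cinema)
    @latest[cinema]
  end
end

# Placeholder for the scraping step: returns invented data, no network access.
def scrape_schedules
  [
    { cinema: "Example Mall Cinema 1", date: Date.new(2009, 1, 15),
      movies: { "Movie A" => ["12:30", "15:00"] } },
  ]
end

# Runs one update pass and returns how many schedules were actually saved.
def update_schedules(store, scraped = scrape_schedules)
  scraped.count { |row| store.save_if_newer(row[:cinema], row[:date], row[:movies]) }
end

store = ScheduleStore.new
puts "first run saved #{update_schedules(store)} schedule(s)"
puts "second run saved #{update_schedules(store)} schedule(s)"
```

Because a second pass with the same date saves nothing, a task like this could be run on a schedule without piling up duplicate rows.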

So far, I have only added a select few movie houses that my friends and I usually go to. You can actually add movieschedules (yes, Google Mail) to your Gtalk/Gmail contact list and start using it. Don’t worry about downtime; that just means I am currently updating or restarting it, so try again after five minutes if it doesn’t work. Don’t try to spam it, though, as right now it is running from my local machine. I have yet to upload it to my free Heroku account and make the bot run forever.

So there you have it. A scraper bot. There are a LOT of other uses for this, like a thesaurus bot, a Wikipedia bot, an IMDb bot or something, but that’s for another discussion. As you can see, bots can be useful too! Right now, movieschedules is looking for more friends. Be kind to it!

To start using movieschedules, just add it to your Google contact list, then type “help” and send it. It will return a list of the commands that are currently available. Have fun!
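For the curious, the “help” behaviour can be sketched as a simple command table in plain Ruby. This is not jabberbot’s API and not the bot’s actual code: `COMMANDS`, `reply_to`, the `cinemas` command, and its reply text are all made up for illustration. Only the “help” command itself comes from the description above.

```ruby
# Hypothetical command table. Only "help" is mentioned in the post; the
# "cinemas" command and its canned reply are invented for this sketch.
COMMANDS = {
  "help" => {
    description: "list the available commands",
    handler: ->(_args) { COMMANDS.map { |name, c| "#{name} - #{c[:description]}" }.join("\n") },
  },
  "cinemas" => {
    description: "list the cinemas the bot knows about",
    handler: ->(_args) { "Example Mall Cinema 1, Example Mall Cinema 2" },
  },
}.freeze

# Turns an incoming chat message into a reply string.
# The first word picks the command; the rest are passed along as arguments.
def reply_to(message)
  name, *args = message.strip.downcase.split
  command = COMMANDS[name]
  return %(Unknown command. Type "help" for a list.) unless command

  command[:handler].call(args)
end

puts reply_to("help")
```

A table like this keeps the help text in sync with the commands automatically, since “help” just walks the same hash the dispatcher uses.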