posted by lboje  10/15/2008

We started web scraping as part of the Mail Yeti project. We wanted to expand its functionality by checking our incoming emails against other sites on the web, so that we could begin building a useful database of hoax emails. When users submit a suspicious email, we want our response to give them some idea of how likely it is that the email is a hoax. To that end, we have been using Hpricot to find similar hoax emails and return links to them.

Hpricot is a powerful Ruby gem for parsing HTML. It lets you search the elements of a web page for specific data using CSS selectors or XPath expressions. Learning how to use it has proven very useful, and it has helped us get up to speed quickly on one of our current client’s projects. With so many sites ignoring current web standards, getting the data our client requires has been a challenge. Digging through pages built entirely out of tables, with no consistency in how they display information, has had me racking my brain, but it has expanded my knowledge of Hpricot and what it can do a great deal over the past few weeks.

Another great web scraping gem I am beginning to look at is WWW::Mechanize. It gives your code a virtual user that can follow links, fill in forms, and click buttons on a page.

With all of this Hpricot experience, I put together a cheat sheet for my fellow developers at Sagebit, and I am working on putting it up here as a PDF. It should be available soon.

Update: The Hpricot Cheat Sheet is now available!

