18/02/2010 — A guide to the scraper writers
Categories: ruby, scraper, architecture
In the last 6 months I have written almost 60 scrapers. That's right, something like 10 scrapers per month, all of them in Ruby, for a mash-up that I was working on and another parallel project. 9,000 lines later I have some tips to give. I hope they help you write better scrapers, or at least keep you from banging your head against the wall as much as I did :-)
- Avoid scrapers writing directly to the database.
I don't think this rule is specific to Ruby; it is general and applies to any kind of scraping project. Be realistic: sooner or later your scraper is going to break. It will hang in the middle of a scraping session and it will f** up your database. If you have one scraper, that's OK. If you have 80 scrapers, it is a problem. Besides that, you would need to keep the results table locked for long periods while doing the inserts as you scrape a website, and that time is not constant: some sites are slower than others, and some just stop answering until your scraper times out and aborts. The best approach is to have each scraper generate a CSV file as its result and let a well-tested piece of software import it for you. The import time (the time it takes to push your data into your database) is linear in the number of records. You can measure it and be confident that it won't change much (only if the number of records varies a lot). If you are using ActiveRecord, do the import via a rake task. If not, import it with, for example, the mysql client.
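As a rough illustration of the scraper side of this rule, here is a minimal sketch using Ruby's standard csv library. The helper name write_results and the ".part" suffix are my own assumptions, not part of any library. The idea is to write to a temporary file and rename it only when the scrape has finished, so the importer never sees a half-written file.

```ruby
require 'csv'
require 'fileutils'

# Hypothetical helper: write all scraped rows to "<path>.part" and move the
# file to its final name only after every row has been written. If the
# scraper crashes midway, the final path is never created.
def write_results(path, headers, rows)
  tmp_path = "#{path}.part"
  CSV.open(tmp_path, 'w') do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
  FileUtils.mv(tmp_path, path)
  path
end
```

A scraper would call something like write_results('widgets.csv', %w[name price], rows) at the end of its run, and the rake task or mysql client would pick up only the finished widgets.csv.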
- Use a well-tested library to do the dirty job.
I can write an HTTP library, and you can write an HTTP library too. Why reinvent the wheel? For Ruby coders there is mechanize, a great lib that handles HTTP exceptions, redirects, cookies and so on. I know scrapIT and others, but they are too much for me. I just wrote my own lib on top of mechanize and it was perfect for me.
- UTF-8 handling
Unless you live in America and scrape only American websites, you will have to deal with different charsets. The best way to handle this is with iconv, and specifically for Ruby, the Iconv class. I wrote some functions for my lib that do the necessary steps: try to get the page encoding, then run iconv over the page content. In the function that gets the page encoding, I basically search the page's body for the content-type meta tag and then search the matched text for the charset. It would be something like this:
# The parameter is just the page content as a string, something like page.parser.to_s
def get_encoding(content)
  encoding = nil
  if meta = content.match(/(<meta\s*([^>]*)http-equiv=['"]?content-type['"]?([^>]*))/i)
    if meta = meta[0].match(/charset=([\w-]*)/i)
      encoding = meta[1]
    end
  end
  encoding
end
Then we can do something like:
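require 'iconv'
require 'mechanize'

WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
agent = WWW::Mechanize.new
page = agent.get(url)
html = page.parser.to_s
# get_encoding returns nil when no charset is declared; falling back to UTF-8 is an assumption
@encoding = get_encoding(html) || 'utf-8'
@document = Iconv.conv('utf-8//IGNORE', @encoding, html)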
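Iconv was removed from Ruby's standard library in Ruby 2.0, so on a newer Ruby the same conversion step could be sketched with String#encode instead. This is only a sketch under my own assumptions: the helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices, and the charset argument is expected to be whatever get_encoding returned.

```ruby
# Hypothetical helper: convert raw page bytes to UTF-8 given the charset name
# found in the page. Unknown or missing charsets use the fallback encoding
# (ISO-8859-1 here, which is an assumption). Bytes that cannot be converted
# are dropped, similar in spirit to iconv's //IGNORE.
def to_utf8(raw, charset, fallback = Encoding::ISO_8859_1)
  source = begin
    charset.to_s.empty? ? fallback : Encoding.find(charset)
  rescue ArgumentError
    fallback
  end
  text = raw.dup.force_encoding(source)
  # encode is a no-op for UTF-8 to UTF-8, so invalid bytes need scrub instead
  return text.scrub('') if source == Encoding::UTF_8
  text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end
```

With this, the last step above would become something like to_utf8(page_body, get_encoding(page_body)), where page_body holds the raw response bytes.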
- Sitemap
A common problem with scrapers is that any change in the page layout (HTML or CSS) will cause your scraper to break. So the more you rely on menus and submenus, the more you are exposed to problems. To avoid this, I suggest that you first try to find the sitemap.xml and use it to get the URLs to be crawled. If that is possible, you will only have to parse the product/article/whatever page itself, and not tons of menus and submenus until you reach it. Good ways to find the sitemap are checking robots.txt and brute-forcing sitemap.xml, sitemap.xml.gz, sitemap_index.xml, and so on. You can find more information in the sitemap protocol spec.
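The discovery step could look roughly like the sketch below. It only builds the list of sitemap URLs worth trying; fetching them is left to whatever HTTP lib you already use. The function names and the COMMON_SITEMAP_PATHS constant are hypothetical, and the "Sitemap:" line in robots.txt is the directive described in the sitemap protocol.

```ruby
require 'uri'

# Well-known locations to guess when robots.txt does not help.
COMMON_SITEMAP_PATHS = %w[/sitemap.xml /sitemap.xml.gz /sitemap_index.xml].freeze

# Extract every "Sitemap: <url>" entry from the text of a robots.txt file.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.map do |line|
    match = line.match(/\A\s*sitemap\s*:\s*(\S+)/i)
    match && match[1]
  end.compact
end

# Candidate sitemap URLs for a site: the ones declared in robots.txt first,
# then the brute-force guesses at the site root.
def sitemap_candidates(base_url, robots_txt = '')
  guesses = COMMON_SITEMAP_PATHS.map { |path| URI.join(base_url, path).to_s }
  (sitemaps_from_robots(robots_txt) + guesses).uniq
end
```

The scraper would then try each candidate in order and stop at the first one that answers with a valid sitemap.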
- Fakeweb
To avoid hitting the web server that you are scraping 1000 times, the best thing is to stub some URLs using fakeweb. Download some HTML with curl, bind it to fakeweb, and require it in your scraper; as soon as your scraper tries to reach the URLs that you have bound with fakeweb, they will be served locally, using the HTML files that you downloaded. It makes your tests faster, since you remove the network delay, and it allows you to work offline as well.
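To make the idea concrete without depending on any gem, here is a stdlib-only sketch of the same principle. To be clear, this is not fakeweb's API: fakeweb stubs the HTTP layer itself, while this hypothetical FixtureFetcher class is simply handed to the scraper in place of the real HTTP client during tests. All names in it are my own.

```ruby
# Hypothetical stand-in for an HTTP client in tests: serves saved HTML files
# for registered URLs and refuses everything else, so a test run can never
# hit the real site by accident.
class FixtureFetcher
  def initialize
    @fixtures = {}
  end

  # Bind a URL to a local file, for example one saved earlier with curl.
  def register(url, file_path)
    @fixtures[url] = file_path
    self
  end

  # Return the saved HTML for a registered URL, or raise for unknown ones.
  def get(url)
    path = @fixtures.fetch(url) do
      raise ArgumentError, "no fixture registered for #{url}"
    end
    File.read(path)
  end
end
```

In a test you would register each downloaded page once and let the parsing code call get exactly as it would against the live site.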
A book that was a good reference, and that I think will help people who want to write serious crawlers and scrapers, is Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL. The book is written using PHP, but the ideas can easily be adapted to Ruby.