(Originally published December 3, 2005)
Some of the gory details on building rubystuff.com, a Ruby-built site for Ruby stuff.
No claims are made as to whether this is the best, or even right, way to do this. This is simply how I did it.
rubystuff.com is a storefront site for a variety of Ruby paraphernalia, such as T-shirts, coffee mugs, and tote bags. The reality is that the different items are actually sold through a number of CafePress sites; rubystuff.com makes it easy to find and see all the choices in one place.
CafePress makes it snake-simple to create a free on-line shop. There is the restriction, though, that each shop offers only one version of each item. In order to have a full range of Ruby products, the choices were either to pay CafePress for a more sophisticated store site or to have a separate site for each design. I chose the latter.
The last thing I wanted to do for rubystuff was hand-craft the myriad product pages, so I thought about how best to dynamically construct the site from the CafePress content. Poking about the HTML source for different CafePress product pages, I saw that the markup was predictable enough to facilitate screen scraping.
Always mindful of the goal of being the laziest person in Rubyville, I cast about for tools to do most of the heavy lifting for me. I had previously spent some time with Michael Neumann's WWW::Mechanize, so I thought it would make a good first tool.
I was right, and here are the results.
WWW::Mechanize is available from the Wee project page on RubyForge. You can also install it as a gem. It depends on the htmltools lib shipped with the NARF library.
Mechanize simplifies the retrieval, tidying, and access of HTML pages. You tell a Mechanize agent to get a page from a URL, and it provides a nice OO model of assorted page objects, such as form fields and links.
It does so while parsing the cooked HTML, creating special node sets based on certain element names. For example, when Mechanize encounters an anchor element, it uses the corresponding REXML node to add a new item to the links set:
@links << Link.new(node)
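From the calling side, all that plumbing stays out of sight. Here is a minimal sketch of a fetch-and-walk-the-links session (the URL is just a placeholder, and the exact Link accessors may vary a little between Mechanize versions):

require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'            # one of the built-in agent aliases

page = agent.get( 'http://www.example.com/' )    # placeholder URL
page.links.each do |link|
  puts "#{link.text} -> #{link.href}"            # accessor names may differ across versions
end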
Mechanize has such a nice architecture that shortly after I first poked around in it I found I could easily add my own hooks for creating custom node sets. After a few E-mails with Herr Neumann and some code patches, this became a part of Mechanize. Clearly, I'm biased, but I think this is one of the features that makes Mechanize so sweet. I hand it a URL, and I get back a set of custom business objects built from HTML nodes.
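The contract for these hooks is tiny: anything registered against an element name just needs a constructor that accepts the REXML node Mechanize hands it. A toy sketch (AltText is made up purely for illustration; the real use shows up in fetch_site_products below):

class AltText
  attr_reader :text
  def initialize( node )                        # node is the REXML element for the img tag
    @text = node.attributes[ 'alt' ].to_s
  end
end

agent = WWW::Mechanize.new
agent.watch_for_set = { 'img' => AltText }      # build an AltText object for every img element
page = agent.get( 'http://www.example.com/' )   # placeholder URL
page.watches[ 'img' ].each { |a| puts a.text }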
The code needs to require the Mechanize lib, of course, and you may want to log stuff when you're developing to see what's going on if you aren't getting the results you expect. (Side note: These comments were added after the code was already complete. The final code was the result of playing around and experimenting and refactoring; no unit tests were harmed in its creation. If I thought the code was going to get much larger I would probably start over and use TDD, but I don't always start out that way, preferring to test the waters, so to speak, with sketch code that often just evolves into Good Enough production code.)
require 'mechanize'
require 'logger'
CP_LOGGER = Logger.new( 'cp.log' )
CP_LOGGER.level = Logger::INFO
We’re also going to need some code to encapsulate the business objects generated from the HTML parsing. More on that later.
require 'product'
This is the main class. It handles fetching the HTML and outputting XML. The whole process of creating the rubystuff.com site is split into multiple steps. Ideally, we should be able to just grab a bunch of remote HTML pages, extract the data needed, and spit out a nicely formatted set of pages. The reality is that many things can go wrong, such as network timeouts or goofy typos. If you are doing any site scraping you need to be considerate of your actions. If you write your app to execute every step, end-to-end, all at once, then you may find yourself making countless Web requests while trying to work out some final-step rendering detail. This is bad. It's rude to keep banging on someone's server, and even if you are acting within whatever terms of service apply, you run the risk of getting blacklisted for excessive page hits.
To build the rubystuff.com pages, there is a batch script that first grabs the HTML and writes a reformatted version to disk. A separate step takes those disk files and munges them up to create the final site. (If I knew of a way to get Mechanize to read from a disk file rather than having to make an actual HTTP request, I would first save the raw HTML to disk and then work from those files. Suggestions welcome.)
class CafeFetcher
Some Web sites will reject your scripted HTTP requests with a 403 Forbidden if they don't have a clear idea what sort of user agent is calling. I couldn't decide what user-agent string to pass in the request headers, so I decided to pick one at random from a static list:
UA = [
'Windows IE 6' ,
'Windows Mozilla',
'Mac Safari' ,
'Mac Mozilla' ,
'Linux Mozilla',
'Linux Konqueror' ]
An instance gets created with a few options. You need to pass in a list of site URLs to hit, and a hash that maps product category keys to regular expressions. It turns out that the CafePress pages use img elements that contain all the data needed. The image URL itself has a product code; the alt attribute has a product description. Nice. But to categorize each item, the code needs a regex to identify the content.
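To make that concrete, here is an invented example (not actual CafePress markup) of the sort of img element involved, and how the alt text drives categorization:

# Hypothetical markup:
#   <img src="http://images.cafepress.com/product/12345678_store.jpg"
#        alt="Ruby Logo Golf Shirt" />
#
# Categorization boils down to matching the alt text against a regex:
alt = 'Ruby Logo Golf Shirt'
puts 'Goes in :shirts' if alt =~ /shirt|Tee|Tank/i   # matches, so the item lands in the :shirts bucket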
The code will loop over the set of URLs. Rather than hit the site bang! bang! bang!, I have a pause value built in. There's some set-up code here as well, for storing results.
def initialize( opts = {} )
@stores = opts[ :stores ]
@product_matches = opts[ :product_matches ]
@pause = opts[ :pause ] || 5
@prods = {}
@product_matches.keys.each { |pt| @prods[ pt ] = {} }
end
The real goal of all our parsing is to produce two types of XML files. One is a product description. It's short, and simply describes the important features (e.g. name, price, item ID) of an item. This method gets that data from each product object and writes it to disk.
def write_product_xml( data_dir )
@product_matches.keys.each do |pt|
@prods[ pt ].values.sort.each do |prod|
File.open( "#{data_dir}/#{pt}.#{prod.pid}.xml", "wb" ) do |file|
file.puts( prod.to_xml )
end
end
end
end
We also want to write out an RSS file that lists all products of the same type across the set of URLs. Another application will re-use this to build the final Web site.
def rss1( prod_type )
items_list_re = /<\?(channel)\s+(items).\?>/
full_items_re = /<\?(items)\s+(full).\?>/
rss_template = "<?xml version='1.0'?>
<rdf:RDF
xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:image='http://purl.org/rss/1.0/modules/image/'
xmlns='http://purl.org/rss/1.0/' >
<channel rdf:about='http://rss.rubystuff.com/rss/#{prod_type.to_s}'>
<title>Ruby Stuff Store: #{prod_type.to_s.capitalize}</title>
<link>http://rss.rubystuff.com/rss/#{prod_type.to_s}</link>
<description>#{prod_type.to_s.capitalize}</description>
<items>
<rdf:Seq>
<?channel items ?>
</rdf:Seq>
</items>
</channel>
<?items full ?>
</rdf:RDF>"
item_list = @prods[ prod_type ].keys.sort.map{ |k| "<rdf:li resource='#{k}' />" }.join( "\n" )
rss_template.gsub!( items_list_re, item_list )
item_set = @prods[ prod_type ].values.sort
items = item_set.map{ |item|
item.to_rss1_item
}.join( "\n" )
rss_template.gsub( full_items_re, items )
end
The action starts when we go fetch each page. This method takes a single site URL, uses Mechanize to extract a set of products, and assigns assorted details; the get method further down loops over the full list of stores.
def fetch_site_products( url )
agent = WWW::Mechanize.new {|a| a.log = Logger.new(STDERR) }
agent.user_agent_alias = UA[ rand( UA.size ) ]
This is an important part: We tell Mechanize that when it comes across an element with the name ‘img’ it should use the corresponding parse-tree node to create a Product instance and save it in its watch list. The code is cleaner because once we’re done parsing HTML we can focus on (albeit simple) business objects, not HTML chunks.
agent.watch_for_set = { 'img' => Product }
page = agent.get( url )
products = page.watches[ 'img' ].select{ |i| i.src =~ /_store/i }
products.each do |i|
i.set_full_link url
assign_product_category i
end
end
As images are converted into product objects, the code needs to track what sort of products we have so it can write them to disk later.
def assign_product_category product
@product_matches.each do |key, re|
@prods[ key ][ product.src ] = product if product.alt =~ re
end
end
Helper stuff:
def product_types
@product_matches.keys
end
def products
@prods
end
Each URL is fetched and processed. Because we’re dealing with network calls, the code will retry the HTTP request a few times if there is an error before it gives up.
Note that this method returns a reference to self so that we can chain method calls later on. It’s a neat hack I stole from someplace.
def get
@stores.each do |store|
redo_cnt = 0
max_redo = 1
STDERR.puts( "Get #{store}")
begin
fetch_site_products( 'http://www.cafepress.com/' + store.to_s.strip )
rescue Exception
redo_cnt += 1
CP_LOGGER.warn "Error fetching #{store.to_s.strip}: #{$!}"
CP_LOGGER.warn "retry #{redo_cnt} ..."
retry unless redo_cnt > max_redo
redo_cnt = 0
end
print( "\n Sleep ...\n\n\n" )
sleep( @pause )
end
self
end
end
That’s the core code. The next step is to define our parameters and grab the sites.
An output directory is defined for the resulting XML files, and a list of site URLs is defined. The list shown here is much shorter than what I actually use; there are about 10 or so CafePress shops that are bundled into rubystuff.com.
if __FILE__ == $0
data_dir = '../data'
stores = %w{
speedmetalruby
rubyonrailsshop
}
This hash is used to partition product image data into product categories. CafePress has many more categories than I care to split out on rubystuff.com, so many things get lumped together.
product_matches = {
:mugs => /stein|mug/i ,
:shirts => /Tracksuit|shirt|Camisole|Tank|Tee|Ringer|apron|Raglan|Jersey|bib|creeper/i ,
:hats => /hat|black cap/i ,
:mousepads => /mouse/i ,
:stickers => /sticker/i ,
:assorted => /clock|sticker|teddy bear|pillow|frame|journal|light switch|tile|calendar|cards/i ,
:buttons_and_magnets => /button|magnet/i ,
:bags => /bag/i ,
:stamps => /postage/i ,
:posters => /poster/i ,
:undies => /boxer|thong/i ,
:media => /data cd|book|audio/i
}
An instance of CafeFetcher is created with our spiffy parameters, and set to work grabbing and writing:
f = CafeFetcher.new( :stores => stores, :product_matches => product_matches )
f.get.write_product_xml( data_dir )
f.product_types.each do |pt|
File.open( "#{data_dir}/#{pt}.rss", "wb" ) do|file|
file.puts(f.rss1( pt ) )
end
end
The Web site has a page listing all the underlying CafePress sites, and that gets built here using assorted string-munging calls:
shoplist_tmpl = IO.read( 'shoplist.html')
shop_tmpl = "
<div class='item'>
<a href='http://www.cafepress.com/SHOP'>
<img src='/images/logos/SHOP.png' alt='SHOP' />
</a>
</div>
"
shop_list = ""
stores.each do |store|
shop_list << shop_tmpl.gsub( 'SHOP', store )
end
shoplist_tmpl.gsub!( 'SHOPS' , shop_list )
File.open( '../public/shoplist.html', 'wb'){ |f| f.print shoplist_tmpl }
end
That's pretty much it. I've omitted the next part of the build process, which slurps in the RSS files and uses Ruby to perform XML transformations. I have an article in the January 2006 issue of Dr. Dobb's Journal that explains one way of doing this if you want to see how it might work.
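Just to give a flavor of what that step involves, here is a rough sketch (not the actual build code) that reads one of the generated RSS files with REXML and spits out HTML snippets. It assumes each item carries the usual RSS 1.0 title/link/description elements, which is roughly what Product#to_rss1_item presumably emits.

require 'rexml/document'

RSS_NS = 'http://purl.org/rss/1.0/'

doc = REXML::Document.new( File.read( '../data/shirts.rss' ) )

REXML::XPath.each( doc, '//rss:item', 'rss' => RSS_NS ) do |item|
  title = REXML::XPath.first( item, 'rss:title', 'rss' => RSS_NS ).text
  link  = REXML::XPath.first( item, 'rss:link',  'rss' => RSS_NS ).text
  puts "<div class='item'><a href='#{link}'>#{title}</a></div>"
end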
There is, however, one more detail that must be explained: The Product class.
You can read about that here.
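For the curious, here is a hypothetical reconstruction of roughly what Product might look like, inferred purely from how it gets used above. The accessor names (src, alt, pid, set_full_link, to_xml, to_rss1_item) are dictated by the calling code; everything else, including the pid extraction and the XML layout, is a guess, and the real class surely captures more (price, for instance).

class Product
  include Comparable

  attr_reader :src, :alt, :pid

  # Mechanize hands us the REXML node for each watched img element.
  def initialize( node )
    @src = node.attributes[ 'src' ]
    @alt = node.attributes[ 'alt' ].to_s
    @pid = @src[ /(\d+)_store/i, 1 ]   # guess: pull a product code out of the image URL
  end

  # Remember which CafePress shop the image came from.
  def set_full_link( url )
    @full_link = url
  end

  # Sorting by description is a guess; the calling code only requires that products sort somehow.
  def <=>( other )
    alt <=> other.alt
  end

  def to_xml
    "<product id='#{pid}'>\n" +
    "  <name>#{alt}</name>\n" +
    "  <image>#{src}</image>\n" +
    "  <link>#{@full_link}</link>\n" +
    "</product>"
  end

  def to_rss1_item
    "<item rdf:about='#{src}'>\n" +
    "  <title>#{alt}</title>\n" +
    "  <link>#{@full_link}</link>\n" +
    "  <description>#{alt}</description>\n" +
    "</item>"
  end
end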
Copyright © 2005 James Britt
Thanks to Stephen Waits for pointing out a bug where I was calling rand with UA.size-1, now fixed.