-- Search Engine Spider and User Agent Identification
---- Stopping scrapers from the get-go
encyclo - 2:03 am on Feb 28, 2011 (gmt 0)
Going back to the original post:
I'm looking to stop the scraping/copying/bots from the outset and I need bandwidth kept to a minimum
A lot of posts have discussed the first part, but not much has been said about the second. I'd like to hear from others how you reconcile keeping bandwidth and latency to a minimum on a large, static dataset with the legitimate concerns over content-scraping.
If you are looking for speed (and you should be, as speed is a critical factor), there is nothing faster than static content. For evergreen content such as the original post implies, static HTML and aggressive caching rules can make a huge difference in server load and page-load speed for the end user, and can significantly cut bandwidth. You can tell Apache to send a Cache-Control max-age header with a long expiry time for text/html content; then, when a user-agent such as Googlebot re-requests a page conditionally, the server simply replies with a 304 Not Modified instead of resending the full page.
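To make that concrete, here is a minimal sketch of the kind of Apache configuration I mean (it assumes mod_expires and mod_headers are enabled; the one-week expiry is just an example value, not a recommendation):

  # Sketch only: long client-side caching for static HTML.
  <IfModule mod_expires.c>
      ExpiresActive On
      # Long max-age for HTML so clients and crawlers revalidate
      # instead of re-downloading the full page each visit.
      ExpiresByType text/html "access plus 1 week"
  </IfModule>

  <IfModule mod_headers.c>
      # Let shared caches store the pages; use "private" if that worries you.
      Header append Cache-Control "public"
  </IfModule>

For static files Apache already sends Last-Modified and ETag headers, so once a cached copy expires, a conditional request from a crawler like Googlebot gets back a 304 with no body, which is where the bandwidth saving comes from.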
Does anyone have any evidence of scraping via ISP or other public caches? If not (and I'm not aware of such a problem ever being discussed), then static HTML is the way to go in my opinion.