Thought I'd do a little tech briefing to bring up to speed anyone who isn't aware of the rapidly changing server-side landscape, including support for JavaScript thanks to Node.js.
What used to be true about bots and JavaScript: For years I've been preaching that you could easily tell the difference between humans and bots behind a browser user agent based on whether or not the browser executed JavaScript. I usually lumped CSS into that blanket generalization, which was never 100% true but good enough for 99% of the crawlers using browser user agents. The other common reason for a browser user agent was an actual server-side browser taking screen shots.
The current state of bots and JavaScript: Well ladies and gents, the playing field has completely changed and gone topsy-turvy these days, as not only are bots using JavaScript, the crawlers themselves are actually WRITTEN in JavaScript!
They may also be taking screen shots but that's just icing on the crawling cake.
The technology making all this happen is called Node.js [ nodejs.org...], a very powerful platform built on Chrome's JavaScript runtime.
Here's a list of crawlers you can get to deploy on Node.js: [ nodejsmodules.org...]
Now, add to this PhantomJS [ phantomjs.org...], a scriptable headless WebKit with a JavaScript API. What that means is you basically have a full-blown browser running on a server, and it doesn't need X Windows or any GUI installed. This is the toolkit you use to scrape web pages and do some serious data mining. Other possibilities include scripts that locate and "click the ads" to perform click-fraud attacks.
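To give a flavor of it, here's roughly what a PhantomJS script looks like: load a page, run code inside the page context, save a screenshot. This is a sketch against the PhantomJS API (page.open / page.evaluate / page.render); the URL is a placeholder, and it needs the phantomjs binary to run, not plain node:

```javascript
// PhantomJS sketch: load a page headlessly, read its title, take a
// screenshot. Run with: phantomjs script.js
var page = require('webpage').create();

// Look like a normal desktop browser user agent.
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36';

page.open('http://example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
    return;
  }
  // evaluate() runs in the page's own context, just like a real browser tab.
  var title = page.evaluate(function () {
    return document.title;
  });
  console.log('Title: ' + title);
  page.render('screenshot.png'); // rendered screenshot, no GUI needed
  phantom.exit();
});
```

Notice there's nothing a web server can see here that distinguishes it from a human's browser.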
The technology is all there: crawling, scraping, and data mining.
Here's a post on how to make screen shots using PhantomJS: [ skookum.com...]
As mentioned above, here's a script to scrape AdSense ads! [ garysieling.com...]
How hard do you think it would be to CLICK those ads being scraped?
Anyway, those browsers aren't browsers, so block those data centers, as these are NOT people out there, and this new code can probably respond to some rudimentary captchas. One captcha I used to deploy would detect whether there was actual typing at a keyboard, and these new APIs may be able to easily simulate actual key presses; not sure, as I'm just digging into the APIs.
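As a rough illustration of that typing check, the idea is: collect keystroke timestamps on the client, then flag inter-key intervals that are too uniform to be a human. This is an illustrative sketch (function names and the threshold are mine, not the captcha I actually ran), and it's exactly the kind of thing a scripted event generator could defeat by adding random jitter:

```javascript
// Sketch: humans type with jittery inter-key intervals; naive bots fire
// events at near-constant rates. Flag suspiciously low variance.
function intervalVariance(timestamps) {
  var deltas = [];
  for (var i = 1; i < timestamps.length; i++) {
    deltas.push(timestamps[i] - timestamps[i - 1]);
  }
  var mean = deltas.reduce(function (a, b) { return a + b; }, 0) / deltas.length;
  return deltas.reduce(function (a, d) {
    return a + (d - mean) * (d - mean);
  }, 0) / deltas.length;
}

function looksLikeBotTyping(timestamps) {
  // Fewer than 3 keystrokes isn't enough signal to judge.
  if (timestamps.length < 3) return false;
  // Threshold is a guess; you'd tune it against real traffic.
  return intervalVariance(timestamps) < 5;
}
```

Metronome-perfect 100ms keystrokes trip it; a human's ragged timing doesn't.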
Everything we used to know about how to detect and stop bots is out the window now that the scrapers themselves are headless browsers.
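For example, one tell that's already widely known, so I'm giving nothing away: PhantomJS leaks objects like window.callPhantom and window._phantom into every page, which real browsers don't have. A sketch of that probe, written over a window-like object so it can be tested standalone (the function name is mine):

```javascript
// Sketch: check a window-like object for properties that headless
// toolkits are known to leak. callPhantom and _phantom are real
// PhantomJS globals; a patched build could hide them.
function looksHeadless(win) {
  if (typeof win.callPhantom === 'function') return true; // PhantomJS hook
  if ('_phantom' in win) return true;                     // PhantomJS marker
  // No tells found; could still be a cleaned-up headless browser.
  return false;
}

// In a real page you'd run looksHeadless(window) client-side and
// report the result back to the server with the request.
```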
Total game changer. Obviously I'll be experimenting a lot more and testing for exploitable tells, but since the scrapers and the browsers are now the same thing, publicly discussing the differences would just let them patch up the remaining holes in detectability.
Truthfully, I didn't expect it to get this easy for scrapers to go completely unnoticed for a few more years, but it appears the timetable has accelerated substantially thanks to Google, Chrome, and Node.js.
There's more, and maybe I'll dive into some other stuff later, but this is the basic information, so anyone wanting to deploy and test it can easily make it happen.
Enjoy.