They may also be taking screen shots, but that's just icing on the crawling cake.
The technology making all this happen is called Node.js [nodejs.org].
Here's a list of crawlers you can deploy on Node.js:
Now, add to this PhantomJS [phantomjs.org], a headless WebKit browser you script with JavaScript.
The technology is all there: crawling, scraping, data mining.
Here's a post on how to make screen shots using PhantomJS:
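The linked post aside, the screen shot itself is only a few lines of PhantomJS. This is my own sketch (the URL and output filename are placeholders), and it's guarded so it does nothing if you accidentally run it under plain Node.js, since the `webpage` module and `phantom` global only exist inside PhantomJS:

```javascript
// screenshot.js -- minimal PhantomJS screen shot sketch.
// Run with: phantomjs screenshot.js
// Guarded: the `webpage` module and `phantom` global exist only in PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.viewportSize = { width: 1280, height: 800 };     // emulate a desktop window
  page.open('http://example.com/', function (status) {  // placeholder URL
    if (status === 'success') {
      page.render('screenshot.png');                    // write the screen shot to disk
    }
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```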
As mentioned above, here's a script to scrape AdSense ads!
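I won't reproduce that script here, but the general shape of that kind of scrape is worth seeing: `page.evaluate` runs your function inside the rendered page, so you can read whatever the browser actually drew, ads included. The iframe selector below is my guess at how rendered ads generally appear, not taken from the linked script, and the URL is a placeholder:

```javascript
// adscan.js -- sketch of scraping rendered ad elements with PhantomJS.
// Run with: phantomjs adscan.js
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://example.com/', function (status) {  // placeholder URL
    if (status !== 'success') { phantom.exit(1); return; }
    // page.evaluate executes inside the page context and can only
    // return serializable data back to the controlling script.
    var ads = page.evaluate(function () {
      var frames = document.querySelectorAll('iframe');  // ads typically render in iframes
      var found = [];
      for (var i = 0; i < frames.length; i++) {
        found.push(frames[i].src);
      }
      return found;
    });
    console.log(JSON.stringify(ads));
    phantom.exit(0);
  });
}
```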
How hard do you think it would be to CLICK those ads being scraped?
Anyway, those browsers aren't really browsers, so block those data centers: these are NOT people out there. This new code can probably respond to some rudimentary CAPTCHAs, too. One CAPTCHA I used to deploy would detect whether there was actual typing at a keyboard, and these new APIs may be able to simulate actual key presses; I'm not sure, as I'm just digging into the APIs.
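For anyone testing their own defenses against this: PhantomJS has a `page.sendEvent` call that dispatches keyboard and mouse events through WebKit itself, so in-page keydown/keypress listeners fire just as they would for a human. That's exactly why a typing-detection CAPTCHA may no longer hold up. A sketch of mine (URL and text are placeholders), again guarded so it's inert under plain Node.js:

```javascript
// keysim.js -- sketch of simulated typing with PhantomJS.
// Run with: phantomjs keysim.js

// Random inter-keystroke delay, roughly human-paced (80-199 ms).
function humanDelay() {
  return 80 + Math.floor(Math.random() * 120);
}

if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://example.com/form', function (status) {  // placeholder URL
    if (status !== 'success') { phantom.exit(1); return; }
    var text = 'hello';  // placeholder text to "type"
    var i = 0;
    // Send one character at a time, paced like a person typing.
    function typeNext() {
      if (i >= text.length) { phantom.exit(0); return; }
      page.sendEvent('keypress', text.charAt(i));
      i += 1;
      setTimeout(typeNext, humanDelay());
    }
    typeNext();
  });
}
```

The detection takeaway: event timing, not event existence, is the only signal left, and even the timing can be faked with a delay function like the one above.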
Everything we used to know about how to detect and stop bots is out the window now that scrapers are written in headless browsers. Total Game Changer.
Obviously I'll be experimenting a lot more and testing for exploitable tells, but since the scrapers and the browsers are now the same thing, publicly discussing the differences would only let them patch up the remaining holes in detectability.
Truthfully, I didn't expect scrapers to be able to go completely unnoticed this easily for a few more years, but it appears the timetable accelerated substantially thanks to Google, Chrome and Node.js.
There's more, and maybe I'll dive into some other stuff later, but this is the basic information, so anyone wanting to deploy it and test it can easily make it happen.