-- Search Engine Spider and User Agent Identification
incrediBILL - 12:13 am on Oct 31, 2013 (gmt 0)
In all cases, it will (eventually) come down to page access speed
A scraper can be multi-threaded, so it can pull down a LOT of pages per second if instructed, or your pages could just as easily be randomized into a massive queue of millions of pages and not appear to be requested very frequently at all.
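To put rough numbers on why access speed alone is a weak signal, here's a minimal sketch of a sliding-window rate check. The class name, the thresholds and the sample addresses are all made up for illustration; it isn't how any particular firewall does it.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flags a client once it makes more than max_hits requests within window seconds."""

    def __init__(self, max_hits=10, window=60.0):
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)  # client key -> timestamps of recent hits

    def record(self, client, now):
        """Record one request at time `now`; return True if the client looks too fast."""
        q = self.hits[client]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits


watcher = RateWatcher(max_hits=10, window=60.0)

# A burst: 20 requests in about 2 seconds from one address trips the check.
burst_flagged = any(watcher.record("203.0.113.5", t * 0.1) for t in range(20))

# The same 20 requests spread out, one every 3 minutes, never trip it.
slow_flagged = any(watcher.record("203.0.113.9", t * 180.0) for t in range(20))
```

The burst gets caught, but the slow, randomized pattern sails right under the threshold, which is the whole problem with relying on speed.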
Additionally, a good scraper may hide its presence with a common technique: a rotating list of IPs that can be almost virtual and indiscernible to the average observer. A hardcore scraper might even enlist the IPs of a botnet, so just blocking data centers only gets the junk that's easy to spot. Blocking a botnet is much harder because it enlists the residential machines of idiots who clicked on attachments to spam.
There are tells that I can't mention that stop them dead in their tracks, often on the first access, and IPs really aren't it. Eventually they'll figure out how I'm identifying them and they'll fix that too, so it's just a short matter of time now.
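As a rough illustration of why data center blocking only catches the easy stuff, here's a sketch using Python's standard `ipaddress` module. The ranges are reserved documentation networks standing in for hosting providers, and the third address is a stand-in for a residential botnet machine; none of these are real block lists.

```python
import ipaddress

# Hypothetical block list. These are reserved documentation ranges (RFC 5737),
# used here as stand-ins for real hosting provider networks.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter(ip):
    """Return True if the address falls inside any listed data center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)


# Two requests from "hosting" ranges and one from an address outside them,
# standing in for an infected home machine.
requests_seen = ["192.0.2.17", "198.51.100.200", "203.0.113.44"]

blocked = [ip for ip in requests_seen if is_datacenter(ip)]
passed = [ip for ip in requests_seen if not is_datacenter(ip)]
```

The first two land in `blocked`, but the third walks right through because nothing about its address says "server".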
Basically, the crawling of the page and the headless browser aren't 100% the same, but they can be. It's a mix-and-match world using those tools and you can pretty much do whatever you want.
FWIW, even if you were to make a screen shot, I can first download the page, scrape it, and then pass it to the headless browser that needs to take the screen shot, which requires the other files. So the scraping can be going at breakneck speed while other activities, handled by other child tasks, move at a more leisurely pace.
It's all the same tools and it's not an either/or situation: they can be used to do super fast scraping and things like screen shots, but these tools are also real handy for getting into a page and grabbing the AJAX or JSON data.
Basically what we're talking about is something capable of doing fully automated testing of websites, and obviously you have to do scraping to build test tools, so the issue is what the author of the code intends to do with it, whether it's friendly or not.
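The split between the fast path and the slow path can be sketched with a plain queue and one worker thread. Everything here is a stand-in: `scrape` just strips whitespace, `render_worker` only pretends to take a screen shot, and the `pages` dict is hypothetical already-downloaded content, so no real browser or network is involved.

```python
import queue
import threading


def scrape(html):
    """Stand-in for fast text extraction: here it only strips whitespace."""
    return html.strip()


def render_worker(jobs, rendered):
    """Stand-in for a headless browser child task that takes screen shots."""
    while True:
        url = jobs.get()
        if url is None:  # sentinel: no more work
            break
        rendered.append(url + ".png")


# Hypothetical pages that have already been downloaded.
pages = {"/a": " <p>one</p> ", "/b": " <p>two</p> "}

jobs = queue.Queue()
rendered = []
worker = threading.Thread(target=render_worker, args=(jobs, rendered))
worker.start()

scraped = {}
for url, html in pages.items():
    scraped[url] = scrape(html)  # fast path: done immediately
    jobs.put(url)                # slow path: handed off to the child task

jobs.put(None)  # tell the worker to stop once the queue drains
worker.join()
```

The main loop never waits on the rendering, so the scraping finishes at its own pace no matter how slow the screen shots are.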
It sure explains the sheer volume of real browser user agents being used to scrape as I always assumed they were trying to fly under the radar and hide, which used to be the case, but today it just happens to be the user agent of the tool itself which sucks big time.
Might as well be "larbin", "curl" or "wget" from the old days but fast forward 10 years and it's the webkit itself instead.
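For what it's worth, spotting the default tool user agents is the easy part. Here's a minimal sketch; the token lists and the sample strings are illustrative assumptions, not a complete or authoritative list, and anything that bothers to change its user agent walks right past this.

```python
# Illustrative token lists only, matched case-insensitively as substrings.
OLD_SCHOOL_TOOLS = ("larbin", "curl", "wget")
HEADLESS_TOKENS = ("phantomjs", "headlesschrome")


def classify_user_agent(ua):
    """Return a rough label for a User-Agent string based on known tokens."""
    s = ua.lower()
    if any(tok in s for tok in OLD_SCHOOL_TOOLS):
        return "old-school tool"
    if any(tok in s for tok in HEADLESS_TOKENS):
        return "headless browser"
    if "applewebkit" in s:
        # Could be a real browser or a tool reporting a browser string.
        return "webkit, cannot tell"
    return "other"


# Sample strings for illustration.
samples = [
    "Wget/1.14 (linux-gnu)",
    "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 "
    "(KHTML, like Gecko) PhantomJS/1.9.2 Safari/534.34",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
]

labels = [classify_user_agent(ua) for ua in samples]
```

The third sample is the catch: a plain WebKit string looks the same whether a person or a script is behind it.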