
Search Engine Spider and User Agent Identification Forum

TECH UPDATE: JavaScript Makes Dumb Bots Obsolete
JavaScript Must Be Crawled For Scrapers To Survive

 8:59 am on Dec 15, 2013 (gmt 0)

This is a follow-up to the previous post, TECH UPDATE: Bots as Browsers Using JavaScript [webmasterworld.com], regarding the rise of bots that run JavaScript.

Based on my analysis of the number of sites using jQuery and AngularJS, the current reign of dumb scrapers is about to end.

Some applications that actually display lots of data on your screen look like this to your average bot:

<!doctype html>
<html lang="en" ng-app="phonecatApp">
<head>
<meta charset="utf-8">
<title>Google Phone Gallery</title>
<link rel="stylesheet" href="css/app.css">
<link rel="stylesheet" href="css/bootstrap.css">
<link rel="stylesheet" href="css/animations.css">

<script src="lib/jquery/jquery-1.10.2.js"></script>
<script src="lib/angular/angular.js"></script>
<script src="lib/angular/angular-animate.js"></script>
<script src="lib/angular/angular-resource.js"></script>
<script src="lib/angular/angular-route.js"></script>

<script src="js/app.js"></script>
<script src="js/animations.js"></script>
<script src="js/controllers.js"></script>
<script src="js/filters.js"></script>
<script src="js/services.js"></script>
</head>
<body>
<div class="view-container">
<div ng-view class="view-frame"></div>
</div>
</body>
</html>


Notice there's nothing of any value here for the spider to find unless it can actually run JavaScript, at which point the view-frame suddenly displays a wealth of information generated by the scripts.
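To make that concrete, here is a rough sketch of what a scraper that does not run JavaScript would pull out of a page like this. It uses Python's standard html.parser on a shortened copy of the markup; the names RAW_PAGE, VisibleText and scrape_without_js are made up for this illustration and are not from any real scraper.

```python
from html.parser import HTMLParser

# Shortened copy of the sample page; a dumb bot only ever sees this raw markup.
RAW_PAGE = """<!doctype html>
<html lang="en" ng-app="phonecatApp">
<meta charset="utf-8">
<title>Google Phone Gallery</title>
<link rel="stylesheet" href="css/app.css">
<script src="lib/angular/angular.js"></script>
<script src="js/app.js"></script>
<div class="view-container">
<div ng-view class="view-frame"></div>
</div>
</html>"""


class VisibleText(HTMLParser):
    """Collects body text and script URLs without executing anything."""

    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []   # text a visitor would actually read
        self.scripts = []  # script files the page depends on

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.scripts.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def scrape_without_js(html):
    """Return (visible text chunks, script URLs) found in the raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks, parser.scripts


text, scripts = scrape_without_js(RAW_PAGE)
print("visible text:", text)
print("scripts the bot would have to run:", scripts)
```

The text list comes back empty because everything a visitor reads is injected into the ng-view element after the scripts run.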

What this means for scrapers is that necessity is going to push them, if it hasn't already, to use tools like PhantomJS.

What this means for website owners is that looking in your log files for clues, like whether the JavaScript or CSS files were loaded, is becoming meaningless. Bots and browsers alike will be doing the same thing, and many bots already are.
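For anyone who hasn't used that log check, this is roughly what it looks like: a minimal sketch that assumes Apache/NGINX style common or combined log lines and flags clients that requested pages but never fetched a .js or .css file. The function name and the sample log lines (using documentation IP addresses) are invented for the example.

```python
import re
from collections import defaultdict

# Captures the client IP and the requested path from a common/combined log line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')


def clients_skipping_assets(log_lines):
    """Return the sorted IPs that never requested a .js or .css file."""
    fetched_assets = defaultdict(bool)
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # ignore lines that are not in the assumed format
        ip, path = match.groups()
        path = path.split("?", 1)[0]  # drop any query string
        is_asset = path.endswith((".js", ".css"))
        fetched_assets[ip] = fetched_assets[ip] or is_asset
    return sorted(ip for ip, got in fetched_assets.items() if not got)


# Hypothetical log excerpt for illustration only.
sample_log = [
    '192.0.2.10 - - [15/Dec/2013:08:59:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.0.2.10 - - [15/Dec/2013:08:59:01 +0000] "GET /js/app.js HTTP/1.1" 200 2048',
    '192.0.2.20 - - [15/Dec/2013:08:59:02 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(clients_skipping_assets(sample_log))
```

A scraper driving a headless browser requests the same script and stylesheet files a real browser does, so it never shows up in this list, which is exactly why the check is losing its value.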

Some of the only clues website owners will have are:

* Is the bot using the default user agent string?

* Is the bot hosted in a data center IP range?

* Is the bot hosted on Linux?
Linux will be less of a clue as more end users dump Windows.
There is a TrifleJS headless browser for Windows, so expect to see Windows servers involved with JavaScript-based scrapers just as easily as the Linux crowd.

* Speed or duration of page requests to the site.
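The clues above could be combined into a simple checker along these lines. Treat it as a sketch under stated assumptions: the user agent marker list, the data center range (a reserved documentation block standing in for a real list) and the rate threshold are all placeholders, and bot_clues is a made-up name.

```python
import ipaddress

# Placeholder values for illustration; substitute your own lists and limits.
DEFAULT_UA_MARKERS = ("PhantomJS",)  # PhantomJS names itself in its default user agent
DATA_CENTER_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # stand-in range
MAX_REQUESTS_PER_MINUTE = 60  # assumed threshold, tune per site


def peak_requests_per_window(timestamps, window=60.0):
    """Largest number of requests seen within any `window` seconds."""
    times = sorted(timestamps)
    peak = 0
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans less than `window`.
        while t - times[start] >= window:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


def bot_clues(ip, user_agent, timestamps):
    """Return the list of clues that apply to one client."""
    clues = []
    if any(marker in user_agent for marker in DEFAULT_UA_MARKERS):
        clues.append("default user agent")
    address = ipaddress.ip_address(ip)
    if any(address in network for network in DATA_CENTER_RANGES):
        clues.append("data center IP")
    if "Linux" in user_agent:
        clues.append("Linux host")  # weak clue, plenty of real users run Linux
    if peak_requests_per_window(timestamps) > MAX_REQUESTS_PER_MINUTE:
        clues.append("high request rate")
    return clues


ua = ("Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 "
      "(KHTML, like Gecko) PhantomJS/1.9.2 Safari/534.34")
print(bot_clues("203.0.113.7", ua, [i * 0.5 for i in range(100)]))
```

None of these clues is proof on its own, so it is safer to score several together than to block on any single one.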

I'm predicting that soon the data center IP range will be about the only clue left, unless it's a greedy high-speed scraper. Take advantage of the low-hanging fruit while it lasts, because the evolving web is going to force scrapers to evolve to meet the challenges of current website technology.



 9:35 am on Dec 15, 2013 (gmt 0)

Do we have a list of scraper IPs we could block by default?


 10:12 pm on Dec 15, 2013 (gmt 0)

See the "server farms" thread in this same subforum. It runs for many, many pages, with restarts every few months.

