


TECH UPDATE: Javascript Makes Dumb Bots Obsolete

Javascript Must Be Crawled For Scrapers To Survive

   
8:59 am on Dec 15, 2013 (gmt 0)

incredibill (WebmasterWorld Administrator)



This is a follow-up to the previous post, TECH UPDATE: Bots as Browsers Using JavaScript [webmasterworld.com], on the rise of bots that run JavaScript.

Based on my analysis of the number of sites using jQuery and AngularJS, the current reign of dumb scrapers is about to end.

Some applications that actually display lots of data on your screen look like this to your average bot:

<!doctype html>
<html lang="en" ng-app="phonecatApp">
<head>
<meta charset="utf-8">
<title>Google Phone Gallery</title>
<link rel="stylesheet" href="css/app.css">
<link rel="stylesheet" href="css/bootstrap.css">
<link rel="stylesheet" href="css/animations.css">

<script src="lib/jquery/jquery-1.10.2.js"></script>
<script src="lib/angular/angular.js"></script>
<script src="lib/angular/angular-animate.js"></script>
<script src="lib/angular/angular-resource.js"></script>
<script src="lib/angular/angular-route.js"></script>

<script src="js/app.js"></script>
<script src="js/animations.js"></script>
<script src="js/controllers.js"></script>
<script src="js/filters.js"></script>
<script src="js/services.js"></script>
</head>
<body>

<div class="view-container">
<div ng-view class="view-frame"></div>
</div>

</body>
</html>


Notice there's nothing of any value here for the spider to find unless it can actually run JavaScript, at which point the view-frame displays a wealth of information generated by the scripts.
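
To make the point concrete, here is a minimal sketch of what a scraper that does not run JavaScript can pull out of that page shell. It uses Python's standard html.parser module. The TextOnlyBot class and the trimmed PAGE string are hypothetical, written only for illustration; real scrapers vary widely.

```python
from html.parser import HTMLParser

# Trimmed copy of the page shell shown above (illustration only).
PAGE = """<!doctype html>
<html lang="en" ng-app="phonecatApp">
<head>
<meta charset="utf-8">
<title>Google Phone Gallery</title>
<script src="lib/angular/angular.js"></script>
<script src="js/app.js"></script>
</head>
<body>
<div class="view-container">
<div ng-view class="view-frame"></div>
</div>
</body>
</html>"""


class TextOnlyBot(HTMLParser):
    """Hypothetical dumb scraper: reads markup, never executes scripts."""

    def __init__(self):
        super().__init__()
        self.text = []       # text fragments found in the markup
        self.scripts = []    # script URLs the bot sees but does not run
        self._skip = False   # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.scripts.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


bot = TextOnlyBot()
bot.feed(PAGE)
print(bot.text)     # ['Google Phone Gallery']
print(bot.scripts)  # ['lib/angular/angular.js', 'js/app.js']
```

The only text recovered is the page title. Everything that would fill the view-frame exists only after the listed scripts have run.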

What this means for scrapers is that necessity is going to push them, if it hasn't already, to use tools like PhantomJS.

What this means for website owners is that looking in your log files for clues, such as whether the JavaScript or CSS files were loaded, becomes meaningless, because bots and browsers alike will be doing the same thing, and many bots already are.
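
As a rough sketch of the log check in question: the snippet below flags clients that fetched pages but never a .js or .css file. The clients_without_assets function, the (ip, path) tuple layout, and the sample addresses (taken from the reserved documentation ranges) are all assumptions for illustration, not anyone's actual log format.

```python
from collections import defaultdict

ASSET_SUFFIXES = (".js", ".css")


def clients_without_assets(requests):
    """Return sorted IPs that never requested a .js or .css file.

    `requests` is an iterable of (ip, path) pairs, a hypothetical
    simplification of an access log.
    """
    fetched_asset = defaultdict(bool)
    for ip, path in requests:
        path_only = path.split("?", 1)[0]  # ignore any query string
        fetched_asset[ip] |= path_only.endswith(ASSET_SUFFIXES)
    return sorted(ip for ip, ok in fetched_asset.items() if not ok)


sample = [
    # ordinary browser
    ("192.0.2.10", "/index.html"),
    ("192.0.2.10", "/lib/angular/angular.js"),
    ("192.0.2.10", "/css/app.css"),
    # dumb scraper: page only
    ("198.51.100.7", "/index.html"),
    # headless-browser scraper: loads everything a browser does
    ("203.0.113.5", "/index.html"),
    ("203.0.113.5", "/lib/angular/angular.js"),
    ("203.0.113.5", "/css/app.css"),
]

print(clients_without_assets(sample))  # ['198.51.100.7']
```

Only the dumb scraper is flagged. The headless-browser scraper is indistinguishable from the ordinary browser by this test, which is why the check stops being useful.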

Some of the only clues website owners will have are:

* Is the bot using the default user agent string?

* Is the bot hosted in a data center IP range?

* Is the bot hosted on Linux?
Linux will become less of a clue as more end users dump Windows.
There is also a TrifleJS headless browser for Windows, so expect to see Windows servers involved with JavaScript-based scrapers just as easily as the Linux crowd.

* Speed or duration of page requests to the site.
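
The clues in the list above could be combined along these lines. This is only a sketch under stated assumptions: bot_clues is a hypothetical helper, the "data center" range is a placeholder taken from the reserved documentation block (a real list would have to come from somewhere like the server farms threads), the "PhantomJS" marker assumes the tool's default user agent string is left unchanged, and the 60 requests per minute threshold is arbitrary.

```python
import ipaddress

# Placeholder range (reserved for documentation), NOT a real data center list.
DATA_CENTER_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
# Assumed marker of an unmodified headless-browser user agent.
DEFAULT_UA_MARKERS = ("PhantomJS",)
# Arbitrary threshold chosen for illustration.
MAX_REQUESTS_PER_MINUTE = 60


def bot_clues(ip, user_agent, timestamps):
    """Return the list of clues that apply to one client.

    `timestamps` holds request times in seconds for that client.
    """
    clues = []
    if any(marker in user_agent for marker in DEFAULT_UA_MARKERS):
        clues.append("default user agent")
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATA_CENTER_RANGES):
        clues.append("data center IP")
    # Weak clue: Android user agents also contain "Linux", so exclude them.
    if "Linux" in user_agent and "Android" not in user_agent:
        clues.append("Linux host")
    if len(timestamps) >= 2:
        span = max(max(timestamps) - min(timestamps), 1.0)
        if len(timestamps) * 60.0 / span > MAX_REQUESTS_PER_MINUTE:
            clues.append("high request rate")
    return clues


scraper = bot_clues(
    "203.0.113.9",
    "Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 "
    "(KHTML, like Gecko) PhantomJS/1.9.2 Safari/534.34",
    [0.0, 0.5, 1.0, 1.5, 2.0],
)
visitor = bot_clues(
    "192.0.2.10",
    "Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0",
    [0.0, 45.0],
)
print(scraper)
print(visitor)
```

A scraper that changes its user agent, runs on Windows, and throttles itself would trip only the IP range check here.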

I'm predicting that soon the data center IP range will be about the only clue left, unless it's a greedy high-speed scraper. Take advantage of the low-hanging fruit while it lasts, because the evolving web is going to force scrapers to evolve to meet the challenges of current website technology.
9:35 am on Dec 15, 2013 (gmt 0)

zeus (WebmasterWorld Senior Member)



Do we have a list of scraper IPs we could block by default?
10:12 pm on Dec 15, 2013 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



See the "server farms" thread in this same subforum. It runs for many, many pages, with restarts every few months.
 
