
Search Engine Spider and User Agent Identification Forum

TECH UPDATE: Bots as Browsers Using JavaScript
The Bot Detection Game Has Changed
incrediBILL




msg:4619882
 9:19 pm on Oct 29, 2013 (gmt 0)

Thought I'd do a little tech briefing to bring up to speed those who aren't aware of the rapidly changing server-side landscape, including support for JavaScript thanks to Node.js.

What used to be true about bots and JavaScript:
For years I've been preaching that you could easily tell the difference between humans and bots behind a browser user agent based on whether or not the browser executed JavaScript. I usually lumped CSS into that blanket generalization, which was never 100% true but good enough for 99% of the crawlers using browser user agents. The other common reason for a browser user agent was an actual server-side browser taking screen shots.

The current state of bots and JavaScript
Well ladies and gents, the playing field has completely changed and is topsy-turvy these days: not only are bots using JavaScript, the crawlers themselves are actually WRITTEN in JavaScript!

They may also be taking screen shots but that's just icing on the crawling cake.

The technology making all this happen is called Node.js [nodejs.org...] which is a very powerful platform built on Chrome's JavaScript runtime.

Here's a list of crawlers you can deploy on Node.js:
https://nodejsmodules.org/tags/spider
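
Just to show how little code it takes, here's a bare-bones Node.js fetch using only the core http module (the URL is a placeholder); the crawler modules on that list add queuing, link extraction and HTML parsing on top of something like this.

var http = require('http');

http.get('http://www.example.com/', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    console.log('Status: ' + res.statusCode);
    console.log('Bytes fetched: ' + body.length);
    // a real crawler would parse body here and queue up the discovered links
  });
}).on('error', function (err) {
  console.error('Request failed: ' + err.message);
});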

Now, add to this PhantomJS [phantomjs.org...], a scriptable headless WebKit with a JavaScript API. What that means is you basically have a full-blown browser running on a server, and it doesn't need X Windows or any GUI installed. This is the toolkit you use to scrape web pages and do some serious data mining. Other possibilities include scripts that locate and "click the ads" to perform click fraud attacks.
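
For illustration, here's a minimal PhantomJS sketch along those lines (the URL and output filename are placeholders, not anything from a real crawler): it loads a page, reads the title out of the rendered DOM, and saves a screen shot.

// run with: phantomjs shot.js
var page = require('webpage').create();

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  } else {
    // evaluate() runs inside the page context, so AJAX-built DOM is visible here
    var title = page.evaluate(function () {
      return document.title;
    });
    console.log('Title: ' + title);
    page.render('screenshot.png'); // the screen shot is just icing on the cake
    phantom.exit();
  }
});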

The technology is all there: crawling, scraping, data mining.

Here's a post on how to make screen shots using PhantomJS:
[skookum.com...]

As mentioned above, here's a script to scrape AdSense ads!
[garysieling.com...]

How hard do you think it would be to CLICK those ads being scraped?

Anyway, those browsers aren't browsers, so block those data centers because these are NOT people out there, and this new code can probably respond to some rudimentary captchas. One captcha that I used to deploy would detect whether there was actual typing at a keyboard, and these new APIs may be able to easily simulate actual key presses; I'm not sure, as I'm just digging into the APIs.

Everything we used to know about how to detect and stop bots is out the window now that scrapers are written in headless browsers.

Total Game Changer.

Obviously I'll be experimenting a lot more and testing for exploitable tells, but since the scrapers and the browsers are now the same thing, publicly discussing the differences will allow them to easily patch up the remaining holes in detectability.

Truthfully, I didn't expect it to get quite this easy for scrapers to go completely unnoticed for a few more years, but it appears the timetable has accelerated substantially thanks to Google, Chrome and Node.js.

There's more, and maybe I'll dive into some other stuff later, but this is the basic information, so anyone wanting to deploy it and test it can easily make it happen.

Enjoy.

 

wilderness




msg:4619909
 11:53 pm on Oct 29, 2013 (gmt 0)

It seems there will be some just reward for a few that spent endless hours accumulating IP lists ;)

SevenCubed




msg:4619910
 12:24 am on Oct 30, 2013 (gmt 0)

I had to create a new path of subfolders in my bookmarks:

Web Development References > Webmaster World > University of incrediBILL > Rainy Day Mischief

jmccormac




msg:4620062
 3:01 pm on Oct 30, 2013 (gmt 0)

Bot detection now probably requires a good IP database along with some statistically based behavior rules. Otherwise these might go relatively unnoticed. How immediate is the JS risk in comparison to the existing scrapers?

Regards...jmcc

incrediBILL




msg:4620088
 4:29 pm on Oct 30, 2013 (gmt 0)

How immediate is the JS risk in comparison to the existing scrapers?


I don't think it's one type vs. the other; all scrapers are in high demand.

I think the full-blown JavaScript scrapers in use right now are in search of "big data", because there's quite a bit of money in that data and you have to run JavaScript to properly scrape AJAX and JSON data feeds.

jmccormac




msg:4620092
 5:02 pm on Oct 30, 2013 (gmt 0)

A lot of existing high-volume scrapers tend to be distinguishable from human users in that they don't download graphics or don't have a randomised downloading sequence. Perhaps it is more a site-specific issue, in that the kind of data and the requirements of the scraper vary?

@wilderness A prescient or accidental waste of time. Hundreds of millions of IPs and so many data centres. :)

Regards...jmcc

brotherhood of LAN




msg:4620094
 5:14 pm on Oct 30, 2013 (gmt 0)

I do a bit of browser automation and have to agree it's not as easy to block.

One thing that is worth measuring is the browser dimensions of the visitor. Another is to check whether an element is visible on the screen when 'something happens', i.e. if a submit button is pressed it should be visible on the screen.
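
A rough client-side sketch of those two checks (the beacon endpoint is made up for the example): report the viewport size, and verify that a submit button is actually inside the visible viewport when it's clicked.

function isInViewport(el) {
  var rect = el.getBoundingClientRect();
  return rect.bottom > 0 && rect.right > 0 &&
         rect.top < window.innerHeight && rect.left < window.innerWidth;
}

document.addEventListener('click', function (e) {
  if (e.target.type === 'submit') {
    var img = new Image(); // beacon the measurements back to the server
    img.src = '/visitor-metrics?w=' + window.innerWidth +
              '&h=' + window.innerHeight +
              '&visible=' + (isInViewport(e.target) ? 1 : 0);
  }
});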

Some of the bigger sites out there no doubt have experience of this, as the obvious IDs and classes in their DOM layout are obfuscated and randomised per page load, which makes it harder to grab hold of something on the page as a frame of reference for where everything is.

tangor




msg:4620117
 7:29 pm on Oct 30, 2013 (gmt 0)

In all cases, it will (eventually) come down to page access speed... as some of these new buggers download images too... and why not? Given system abilities these days, anything is possible.

But bots work on "how many per hour" and that's a giveaway.

lucy24




msg:4620159
 9:29 pm on Oct 30, 2013 (gmt 0)

Huh. I've never looked at javascript. Except, occasionally, to tell robots to keep the ### out of piwik.

My starting criterion was always the favicon-- or, nowadays, any version of the apple-touch-icon. There exist robots that request only the page and the favicon, but you can pretty well count them on the fingers of one hand.

I'm currently studying the plainclothes bingbot (different thread). It executes javascript without a hiccup-- including returning appropriate values on feature-detection tests-- but it never, ever looks at the favicon.

incrediBILL




msg:4620193
 12:13 am on Oct 31, 2013 (gmt 0)

In all cases, it will (eventually) come down to page access speed


Multi-threaded, so it can pull down a LOT of pages per second if instructed, or your pages could easily be randomized in a massive queue of millions of pages and not appear to be requested very frequently.

Additionally, a good scraper may use a common technique to hide its presence: a rotating list of IPs that can be almost virtual and indiscernible to the average observer. A hard-core scraper might even enlist the IPs of a botnet, so just blocking data centers only gets the junk that's easy to spot. Blocking a botnet is much harder because it enlists the residential machines of idiots who clicked on attachments in spam.

There are tells that I can't mention that stop them dead in their tracks, often on the first access, and IPs really aren't it. Eventually they'll figure out how I'm identifying them and fix that too, so it's just a matter of time now.

Basically, the crawling of the page and the headless browser aren't 100% the same, but they can be. It's a mix-and-match world using those tools and you can pretty much do whatever you want.

FWIW, even if you were to make a screen shot, I can first download the page, scrape it, and then pass it to the headless browser that needs to make the screen shot (which requires the other files), so the scraping can be going at breakneck speed while other activities, handled by other child tasks, move at a more leisurely pace.

All the same tools; it's not an either/or situation. They can be used to do super-fast scraping and things like screen shots, but these tools are really handy for getting into a page and grabbing the AJAX or JSON data.

Basically what we're talking about is something capable of doing fully automated testing of websites, and obviously you have to do scraping to build test tools, so the issue is what the author of the code intends to do with it, whether it's friendly or not.

It sure explains the sheer volume of real browser user agents being used to scrape. I always assumed they were trying to fly under the radar and hide, which used to be the case, but today it just happens to be the user agent of the tool itself, which sucks big time.

Might as well be "larbin", "curl" or "wget" from the old days, but fast forward 10 years and it's WebKit itself instead.

This is going to be loads of fun.

lorax




msg:4620306
 1:07 pm on Oct 31, 2013 (gmt 0)

Thanks for bringing me up to speed iBill.

Q: I assume you can use these tools to download pages AND process the data/content, or you can simply use them to scrape and then process with another script. Is there a trade-off or advantage of one method versus the other?

jojy




msg:4620332
 2:48 pm on Oct 31, 2013 (gmt 0)

Being a programmer, I think it's possible to block such bots if they are heavily consuming server resources. Here is how:

1. Detect mouse movement using JavaScript, fire an event on exit, and send it to the server if the user has not made any mouse movement.

2. Record this behavior in memory (i.e. memcached, APC etc.) for a couple of pages.

3. If there is no movement over a couple of pages then you know what you have to do!

This is just a thought and it can be refined to work better.
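
Something like the following could be the client-side half of that idea (the /activity endpoint is made up; the per-visitor counter in memcached/APC lives on the server):

var sawMouseMove = false;

document.addEventListener('mousemove', function () {
  sawMouseMove = true;
});

window.addEventListener('beforeunload', function () {
  // beacon the flag back on exit; the server tallies it per visitor and
  // flags anyone with no mouse movement across a couple of pages
  var img = new Image();
  img.src = '/activity?mouse=' + (sawMouseMove ? 1 : 0);
});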

WesleyC




msg:4620334
 2:49 pm on Oct 31, 2013 (gmt 0)

While this is more of an anti-form-spam technique, there are a couple of server-side detection techniques I've been using to find and block badly-behaved 'bots that attempt form submissions.

The first is an encrypted (blowfish) timestamp embedded into a hidden form field. This lets the server know what time the form was generated--if it's too old, throw the submission out. This prevents 'bots that "capture" the form once then re-submit it multiple times from operating for more than a few hours (or whatever you set the timeout to).

The second method I use is a form field with a juicy name attribute (say, name="comment") that's hidden via unorthodox CSS methods. A simple display: none is too obvious, and many 'bots will ignore it. What you want is something much more insidious, that would require the 'bot to actually render the full page, notice that this particular form field isn't visible, then decide not to fill it in. I prefer using strange margins with an overflow: hidden container, odd z-indexing, and absolute or fixed positioning, so that it's not at all apparent from simple parsing of the CSS that the field is not visible. Then, if the field is filled in, throw the submission out as spam.

A third technique is a simple form field with a randomly-chosen string in it, and a matching value saved in the session (not in any value the user has access to). If the session value doesn't match what's in the form, throw the submission out as spam. While cookie-enabled 'bots will walk right past this particular trap, a surprising number of spambots still get caught by it.
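
To illustrate the first two checks, here's a rough Node.js sketch (using an HMAC-signed timestamp rather than Blowfish encryption; the field names, secret and 2-hour window are just examples):

var crypto = require('crypto');
var SECRET = 'change-me'; // load from configuration in practice

function signTimestamp(ts) {
  var mac = crypto.createHmac('sha256', SECRET).update(String(ts)).digest('hex');
  return ts + '.' + mac;
}

function isSubmissionOk(fields) {
  // 1. reject forms with a tampered or too-old timestamp
  var parts = String(fields.form_ts || '').split('.');
  if (signTimestamp(Number(parts[0])) !== fields.form_ts) return false;
  if (Date.now() - Number(parts[0]) > 2 * 60 * 60 * 1000) return false;

  // 2. reject anything that filled in the CSS-hidden "comment" honeypot
  if (fields.comment) return false;

  return true;
}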

rish3




msg:4620339
 3:20 pm on Oct 31, 2013 (gmt 0)

Since Phantom.js is one of the larger threats, it might be worth specifically detecting it.

This won't catch the more sophisticated users of PJS, but it's a simple check...

// window.callPhantom is a hook PhantomJS exposes inside the page context,
// so its presence is a tell
if (typeof window.callPhantom === 'function') {
    // insert your js here
}

If you want the next level of the arms race, I assume there are known bugs in PJS that you could exploit and/or detect without affecting normal users.

It brings up an interesting point in that you are uniquely able to trigger potentially harmful code on the scraper's server. Might be interesting to see if, for example, logging 2TB of data via console.log for detected PJS scrapers works.

Edit: Seems they got wise to this in their 1.2 release, so the only part of what I posted that's useful anymore is triggering specific bugs.

goodoldweb




msg:4620418
 8:50 pm on Oct 31, 2013 (gmt 0)

It's been going on for a long time now. The msn bot is a good example.

Detection is fairly easy: check IP addresses and screen size as well as interactivity (mouse movement, questions answered etc.).

I wrote an advanced "whosonline" script that provides me a window onto traffic on my sites in real time (I had to get to the bottom of the "zombies" phenomenon). Mobile phones and JavaScript bots galore!

physics




msg:4620459
 10:51 pm on Oct 31, 2013 (gmt 0)

Re: detecting mouse movement. What about mobile?

brotherhood of LAN




msg:4620462
 11:11 pm on Oct 31, 2013 (gmt 0)

Fair point RE: mobile and mouse, since there's no pointer.

Facebook tracking cursor, mouse. [webmasterworld.com]

The linked article does mention that mobile isn't an option there, but the visibility of elements on the page is somewhat comparable, I guess, since there's more scrolling and less screen estate available.

In general, it's quite safe to say that a bot could easily appear human, at least for a while. The most important thing to detect is how they are using the site, i.e. repetitious behaviour, to achieve whatever goal they have.

incrediBILL




msg:4620477
 12:38 am on Nov 1, 2013 (gmt 0)

3. If there is no movement on couple of pages then you know what you have to do!


Since these tools are built to do automated testing of websites they can replicate all mouse and keyboard events so testing for those won't help in the long run as they can easily write code to fake a mouse move.

Here's the PhantomJS mouse and keyboard event code:
[github.com...]
sendEvent(mouseEventType[, mouseX, mouseY, button='left'])
sendEvent(keyboardEventType, keyOrKeys, [null, null, modifier])

If all you do is check for mouse moves or key clicks to detect bots, it's trivial to script random mouse moves, or to write a function that moves the mouse to the button or link you want to click, or that provides the keyboard events to activate the link.
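
For example, a quick PhantomJS sketch of how easily that activity can be faked (the URL and coordinates are placeholders):

var page = require('webpage').create();

page.open('http://www.example.com/form', function (status) {
  // wander the pointer around a bit so mousemove handlers fire...
  page.sendEvent('mousemove', 100, 120);
  page.sendEvent('mousemove', 240, 310);
  // ...then click whatever sits at the target coordinates and type something
  page.sendEvent('click', 240, 310, 'left');
  page.sendEvent('keypress', 'hello');
  phantom.exit();
});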

However, PhantomJS still can't solve a captcha by itself unless they use a blow-through technique where humans elsewhere answer the captchas. JavaScript mouse moves and key presses used to be one of my favorite 'captchas' but alas, no more. Still useful techniques, but not fully reliable.

I've noticed escalating webkit user agents scraping for the last few years but didn't know exactly what was going on and now I know.

Knowing is half the battle :)

goodoldweb




msg:4620666
 9:17 pm on Nov 1, 2013 (gmt 0)

MSN bot in action, captured in real time. Executing JavaScript fully, including page referrer information, page "url" and page "title":


131.253.24.100 [msnbot-131-253-24-100.search.msn.com] --[Microsoft Internet Explorer]--[Screen Size: 800x600]--[Color Depth: 16 colors]

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)

lucy24




msg:4620710
 1:01 am on Nov 2, 2013 (gmt 0)

MSN bot in action captured in real time.

Coincidentally I've been tracking it for the past month. The 800x600 MSIE7 is one of its UA families; the other is a 1024x768 MSIE9. Longer post next week, probably. It seems to be changing its behavior before my eyes.

I've noticed escalating webkit user agents scraping for the last few years

Bing Preview is currently webkit. (This is hard to say with a straight face.) It executes javascript with humanoid results.

incrediBILL




msg:4620714
 1:32 am on Nov 2, 2013 (gmt 0)

FYI, here's a PhantomJS User Agent string:
"Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.2 Safari/534.34"

Out of the box it actually identifies itself as PhantomJS, so unless that user agent string gets changed (failing to change the default UA is the Achilles' heel of many bot writers), we're good.

If you didn't know what the PhantomJS part meant previously, now you do, it means NOT HUMAN!
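
A trivial server-side sketch of acting on that, shown as a bare Node.js handler purely for illustration (the same idea is a one-line BrowserMatchNoCase/deny rule in Apache):

var http = require('http');

http.createServer(function (req, res) {
  var ua = req.headers['user-agent'] || '';
  if (/PhantomJS/i.test(ua)) {
    res.writeHead(403);
    res.end('Forbidden');
    return;
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello, human (probably)');
}).listen(8080);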

dataguy




msg:4620785
 5:55 pm on Nov 2, 2013 (gmt 0)

I've spent years writing server-side code to detect and block bots. It's been obvious for a while now that this can't be done with server-side techniques alone, since many bots are closely mimicking real users using real web browsers.

I'm glad I'm not working on this any more, but the trick which jojy has suggested is a great idea, though it has to be tweaked for touch.

lucy24




msg:4620813
 9:23 pm on Nov 2, 2013 (gmt 0)

Calling yourself Safari 1.x.x might be considered a bit of a giveaway as well. And that's speaking as someone who only recently and with extreme reluctance blocked Firefox 2. Cursory detour to raw logs turns up one PhantomJS who acted on js and one that didn't. Both claimed to be Safari version 1.something.

What does "Unknown" mean? Can it ever occur in a human UA?

blend27




msg:4620859
 1:50 pm on Nov 3, 2013 (gmt 0)

What does "Unknown" mean?

We don't know what we don't know...:)


--------------------------------------------
inetnum: 89.107.225.0 - 89.107.225.255
netname: BERILTECH (look it up)
descr: BERIL TECHNOLOGY
route: 89.107.225.0/24

Comes to us from Turkey and until 09/12/12 was identified as businessdbbot/v1.2 (http ://www.businessdb.com/bot.php)

also here: [webmasterworld.com...]

Then sometime on 10/04/12:

UA: Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.6.1 Safari/534.34

Same story 11 months later, on 09/04/13, but with an upgraded PhantomJS:

UA: Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34

It also tried to download an image that was referenced on the page, 1 second later.
----------------------------------------------------------------------------------

There are a bunch of other bot runners that have been using it for a while now, as early as Feb of 2012.

inetnum: 178.238.234.0 - 178.238.234.255
netname: CONTABO(server farm)
country: DE

178.238.234.113 (mail.artao.cz) - Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.3 (KHTML, like Gecko) PhantomJS/1.4.1 Safari/534.3

It executed a JS function from an external JS file and was able to download an image as a result of a document.write in it.

netmeg




msg:4621148
 8:17 pm on Nov 4, 2013 (gmt 0)

So ok. What does an ordinary Jane Webmaster do?

brotherhood of LAN




msg:4621155
 8:53 pm on Nov 4, 2013 (gmt 0)

netmeg, I think basically the battle has moved from strictly command-line bots to browser bots as well, so the battle is on the browser/client side too, but essentially the war is won by banning on the server side. It's really a case of deciding when to ban, based on how you decide whether it's a human or not.

As mentioned above, you'd want to track user behaviour and decide for yourself whether the browser has a human or a program behind it. I suspect there'll be scripts written to meet demand from people who require such a thing. Right now it does seem like there are a lot of issues coming up from browser-based bots but not much recognition of it.

As this evolves over time I think it'll come down to heat maps and what other visitors to the site do, as essentially everything can be automated and every action can be mimicked. Hopefully we won't see another captcha-style problem-solving task where "user must click on the red triangle" to continue using the site (something that, if circumvented, could lead to an IP ban).


Another issue here is that you can disable javascript in the browser but still execute javascript in the browser... which makes it a tad more difficult to detect.

Something that is symptomatic of bots is randomising UA strings (as well as, hopefully for them, having a lot of IPs). Sometimes you can test for which OS the person is really on, so someone doing automation on a Linux box but with a Windows UA could/should be flagged. The same goes for the reported browser.
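
A rough client-side consistency check along those lines (the flagging endpoint is made up): compare the OS the user agent claims against navigator.platform.

function osMismatch() {
  var ua = navigator.userAgent;
  var platform = navigator.platform || '';
  if (/Windows/i.test(ua) && !/Win/i.test(platform)) return true;
  if (/Mac OS X/i.test(ua) && !/Mac/i.test(platform)) return true;
  if (/Linux|X11/i.test(platform) && !/Linux|Android|X11/i.test(ua)) return true;
  return false;
}

if (osMismatch()) {
  var img = new Image();
  img.src = '/flag-visitor?reason=ua-os-mismatch';
}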

Evaluating request HTTP headers can also reveal a lot, though that's not unique to browser-based bots. The site "how unique and trackable is your browser" [panopticlick.eff.org] was discussed a few years back on this forum.

dstiles




msg:4621159
 9:21 pm on Nov 4, 2013 (gmt 0)

This may go beyond server-blocking. There are a lot of general "attacks" coming from compromised DSL-based machines or from DSL-based clever-clogs trying it on. And bad bots that switch UAs are not in the majority, in my experience.

Another point: few of my sites' pages have javascript: they do not need it. So trapping bots that try to access JS is not always an option.

There are ways to trap bots (or at least chancer humans) using header fields, but does the subject of this thread send headers? The title "headless browser" suggests not, but I think that's an incorrect reading on my part. If header fields are still sent AND are screwy then we stand a chance. Once they start putting believable header field combinations into bots, then we're stuffed.

Meanwhile, perhaps a few more in-page traps leading to IP trapping?

brotherhood of LAN




msg:4621161
 9:28 pm on Nov 4, 2013 (gmt 0)

>send headers

Yes, they absolutely must. "headless browser" refers to the browser not being visible on a screen.
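
Given that, a rough sketch of the kind of header-field trap dstiles mentions (shown as a plain Node.js-style function; which combinations count as "screwy" is up to each site):

function headersLookScrewy(headers) {
  // real browsers virtually always send these; many bot kits don't bother
  if (!headers['accept']) return true;
  if (!headers['accept-language']) return true;
  // and a default PhantomJS install admits what it is
  if (/PhantomJS/i.test(headers['user-agent'] || '')) return true;
  return false;
}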

wilderness




msg:4621168
 10:07 pm on Nov 4, 2013 (gmt 0)

So ok. What does an ordinary Jane Webmaster do?


Comes to us from Turkey and until 09/12/12 was identified as businessdbbot/v1.2
inetnum: 89.107.225.0 - 89.107.225.255


There is a bunch of other bot runners using it for a while now: as early as Feb of 2012.

inetnum: 178.238.234.0 - 178.238.234.255


despite everybody proclaiming overkill, in this particular instance a simple solution is

deny from 178.
deny from 89.

brotherhood of LAN




msg:4621170
 10:16 pm on Nov 4, 2013 (gmt 0)

What overkill? That's just an isolated example coming from a class C using the same UA.

How about several hundred ranges, random UAs and 'polite' scraping via a browser... you wouldn't notice.

I guess it depends on how important you feel it is to ban non-human non-useful visitors.
