TECH UPDATE: Bots as Browsers Using JavaScript
The Bot Detection Game Has Changed
incrediBILL

Msg#: 4619880 posted 9:19 pm on Oct 29, 2013 (gmt 0)

Thought I'd do a little tech briefing to bring up to speed those who aren't aware of the rapidly changing server-side landscape, including server-side JavaScript support thanks to Node.js.

What used to be true about bots and JavaScript:
For years I've been preaching that you could easily tell the difference between humans and bots hiding behind a browser user agent based on whether or not the visitor actually loaded JavaScript. I usually lumped CSS into that blanket generalization, which was never 100% true but was good enough for 99% of the crawlers using browser user agents. The other common explanation for a browser user agent that never touched JavaScript was an actual server-side browser taking screen shots.
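
For the record, the old check was nothing fancy: drop a reference to a JavaScript resource into every page and see whether the "browser" ever comes back for it. Something along these lines (a sketch only; the /beacon.gif endpoint and window.SESSION_ID token are made-up names):

// Inline in every HTML page: request a tiny beacon tied to the visitor's
// session. Real browsers execute the script and the beacon shows up in the
// logs; most fake "browsers" never fetch it.
// (Sketch only -- /beacon.gif and window.SESSION_ID are placeholders.)
(function () {
  var img = new Image();
  img.src = '/beacon.gif?sid=' + encodeURIComponent(window.SESSION_ID) +
            '&t=' + new Date().getTime();   // cache-buster
})();

Sessions that fetched the HTML but never hit the beacon got flagged as probable bots, and that is exactly the tell the new headless crowd defeats.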

The current state of bots and JavaScript
Well, ladies and gents, the playing field has completely changed and is topsy-turvy these days: not only are bots using JavaScript, the crawlers themselves are actually WRITTEN in JavaScript!

They may also be taking screen shots but that's just icing on the crawling cake.

The technology making all this happen is called Node.js [nodejs.org...] which is a very powerful platform built on Chrome's JavaScript runtime.

Here's a list of crawlers you can deploy on Node.js:
https://nodejsmodules.org/tags/spider
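
To give a flavor of how little code one of these takes, a bare-bones fetch-and-extract script using the popular request and cheerio modules looks something like this (an illustration only, not one of the listed spiders):

// npm install request cheerio
var request = require('request');
var cheerio = require('cheerio');

// Fetch a page with a browser-looking UA and pull out every link.
request({
  url: 'http://www.example.com/',
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36' }
}, function (err, res, body) {
  if (err || res.statusCode !== 200) { return; }
  var $ = cheerio.load(body);            // jQuery-style parsing, server side
  $('a[href]').each(function () {
    console.log($(this).attr('href'));   // feed these into a crawl queue
  });
});

Wrap that in a queue and a politeness delay and you have a working crawler in an afternoon.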

Now, add to this PhantomJS [phantomjs.org...], which is a scriptable headless WebKit with a JavaScript API. What that means is you basically have a full-blown browser running on a server, and it doesn't need X Windows or any GUI installed. This is the toolkit you use to scrape web pages and do some serious data mining. Other possibilities include scripts that locate and "click the ads" to perform click fraud attacks.
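
To make the scraping and screen shot side concrete, an entire PhantomJS job is only a few lines. A sketch pointed at example.com (real scrapers obviously aim it elsewhere):

// run with: phantomjs scrape.js
var page = require('webpage').create();

// Look like a normal desktop browser.
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36';

page.open('http://www.example.com/', function (status) {
  if (status !== 'success') { phantom.exit(1); }

  // JavaScript and CSS have already executed inside the headless WebKit,
  // so the DOM we scrape is what a human's browser would have rendered.
  var data = page.evaluate(function () {
    return {
      title: document.title,
      links: [].map.call(document.querySelectorAll('a'), function (a) {
        return a.href;
      })
    };
  });
  console.log(JSON.stringify(data));

  page.render('screenshot.png');   // the screen shot "icing on the cake"
  phantom.exit();
});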

The technology is all there: crawling, scraping, data mining.

Here's a post on how to make screen shots using PhantomJS:
[skookum.com...]

As mentioned above, here's a script to scrape AdSense ads!
[garysieling.com...]

How hard do you think it would be to CLICK those ads being scraped?

Anyway, those browsers aren't browsers, so block those data centers, because these are NOT people out there, and this new code can probably respond to some rudimentary captchas. One captcha that I used to deploy would detect whether there was actual typing at a keyboard, and these new APIs may be able to easily simulate actual key clicks; not sure, as I'm just digging into the APIs.
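
A typing check like that boils down to timing the keystrokes. A rough client-side sketch of the general idea (not the captcha mentioned above; the 'comment' field id and the 30 ms threshold are arbitrary):

// Record inter-keystroke gaps in a visible form field. Humans produce
// irregular gaps; scripted key events tend to be instantaneous or
// perfectly uniform. (Sketch only -- 'comment' and 30ms are examples.)
var gaps = [];
var last = 0;
document.getElementById('comment').addEventListener('keydown', function () {
  var now = Date.now();
  if (last) { gaps.push(now - last); }
  last = now;
});

function looksLikeHumanTyping() {
  if (gaps.length < 5) { return false; }     // not enough typing yet
  var min = Math.min.apply(null, gaps);
  var max = Math.max.apply(null, gaps);
  return (max - min) > 30;                   // expect some natural variation
}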

Everything we used to know about how to detect and stop bots is out the window now that scrapers are written in headless browsers.

Total Game Changer.

Obviously I'll be experimenting a lot more and testing for exploitable tells, but since the scrapers and the browsers are now the same thing, publicly discussing the differences will allow them to easily patch up the remaining holes in detectability.

Truthfully, I didn't expect it to get quite this easy for scrapers to go completely unnoticed for a few more years, but it appears the timetable has accelerated substantially thanks to Google, Chrome and Node.js.

There's more, and maybe I'll dive into some other stuff later, but this is the basic information, so anyone wanting to deploy and test it can easily make it happen.

Enjoy.

 

wilderness

Msg#: 4619880 posted 11:13 pm on Nov 4, 2013 (gmt 0)

What overkill? That's just an isolated example coming from a class C using the same UA.


I provided examples of denying Class A's. NO C's.

How about several hundred ranges, random UAs and 'polite' scraping via a browser... you wouldn't notice.


When the day comes that I don't notice, I'll remove my sites from the internet.

I guess it depends on how important you feel it is to ban non-human non-useful visitors.


I'm sure my list of UA's and IP's that are denied exceeds yours by leaps and bounds.

brotherhood of LAN

Msg#: 4619880 posted 11:39 pm on Nov 4, 2013 (gmt 0)

It sounds like you have it sussed out then. Given your terse replies, do you think this thread is a bit pointless?

When the day comes that I don't notice, I'll remove my sites from the internet.


Very assured. I'm also fairly certain that I could have your site into a database before you notice, but TBH it sounds like you already know everything you need to.

netmeg

Msg#: 4619880 posted 2:54 am on Nov 5, 2013 (gmt 0)

Doesn't sound like something the average Joe or Jane is likely going to be able to handle. Most people don't have developers on staff (fortunately I do). All this is WAY above their technical ability. Sucks, but there it is.

wilderness

Msg#: 4619880 posted 3:47 am on Nov 5, 2013 (gmt 0)

Doesn't sound like something the average Joe or Jane is likely going to be able to handle.


netmeg,
You shouldn't be discouraged by the over-simplification expressed in this thread.

If Apache and htaccess have proven anything over time, it's there are multiple methods and/or approaches that accomplish similar tasks.

wilderness

Msg#: 4619880 posted 3:51 am on Nov 5, 2013 (gmt 0)

I'm also fairly certain that I could have your site into a database before you notice,


Kinda doubt that, however you're welcome to waste your time trying.


but TBH it sounds like you already know everything you need to.


Within another two years, my widget commitments will be completed and I could care less if both widgets and the internet are relocated to Mars.
All I need to know is enough to get me through those same two years.

brotherhood of LAN

Msg#: 4619880 posted 4:56 am on Nov 5, 2013 (gmt 0)

Clarke's Three Laws spring to mind, but clearly you spend a lot of effort keeping out bad bots so it's not like you'd be the easiest target for such a thing. But feel free to sticky me your site.

IanKelley

Msg#: 4619880 posted 5:21 am on Nov 5, 2013 (gmt 0)

Some great ideas in this thread.

There are a couple of different kinds of automated traffic being talked about here, and the difference is important where detection is concerned.

The first is traffic from real users that is being hijacked in one of various ways. In those cases, testing for things like focus, interaction and visibility works really well.
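
Those signals are all exposed to page JavaScript, so a bare-bones version of the test looks something like this (a sketch; the /human-ping beacon URL is made up):

// Collect "is anyone actually there?" signals and report them once.
var signals = { focus: false, moved: false, visible: false };

window.addEventListener('focus', function () { signals.focus = true; });
document.addEventListener('mousemove', function () { signals.moved = true; });
document.addEventListener('visibilitychange', function () {
  if (!document.hidden) { signals.visible = true; }
});

// After a few seconds, phone home with whatever we saw.
setTimeout(function () {
  signals.focus = signals.focus || document.hasFocus();
  signals.visible = signals.visible || !document.hidden;
  new Image().src = '/human-ping?f=' + (+signals.focus) +
                    '&m=' + (+signals.moved) + '&v=' + (+signals.visible);
}, 5000);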

The second is completely automated traffic with the ability to interpret pretty much anything a real browser can. Right now this isn't all that common but as the OP points out, it's going to become more common. The big search engines have been doing it for quite a while but they behave (usually) so there's no need to worry about them.

In a malicious case, it's possible to fake almost everything except the IP address. You can use proxies to change the IP but that slows you down dramatically. Still, it's pretty common. In any case, with this kind of traffic the IP itself is the best way to detect it. There are all kinds of interesting things you can do with an IP.
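
As a small example of the kind of thing you can do with the IP alone, a Node.js sketch that reverse-resolves the visitor and flags obvious datacenter hostnames (the hint list is illustrative, not exhaustive):

var dns = require('dns');

// Hostname fragments that say "server farm", not "living room".
var FARM_HINTS = ['amazonaws', 'googleusercontent', 'linode',
                  'rackspace', 'hetzner', 'ovh'];

function checkVisitorIp(ip, callback) {
  dns.reverse(ip, function (err, hostnames) {
    if (err || !hostnames || !hostnames.length) {
      return callback(false, null);      // no rDNS at all -- note it and move on
    }
    var host = hostnames[0].toLowerCase();
    var isFarm = FARM_HINTS.some(function (hint) {
      return host.indexOf(hint) !== -1;
    });
    callback(isFarm, host);
  });
}

// Usage: checkVisitorIp('203.0.113.5', function (isFarm, host) { ... });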

This does leave out traffic running through a botnet where it's both completely automated, with full browser capabilities and also running through a legitimate machine; however this kind of traffic is much less common than you'd think.

I'd like to contribute some detection ideas of my own to this thread, but traffic fraud vs. fraud detection is an arms race. An entertaining one.

incrediBILL

Msg#: 4619880 posted 7:51 am on Nov 5, 2013 (gmt 0)

This does leave out traffic running through a botnet where it's both completely automated, with full browser capabilities and also running through a legitimate machine; however this kind of traffic is much less common than you'd think.


80legs does that, and they claim to have 50K+ machines.

That's an awfully big bunch of residential IPs to track, if you can catch them in the first place. Then an ISP like Comcast does a periodic IP shuffle, which has happened to me 3 times in the last couple of weeks, and suddenly you're blocking innocent people.

As we escalate our war on bots, the bot operators will probably start offering people a small stipend to become part of a distributed botnet. People who can't really afford to spend money to be online are the most likely to let distributed botnets subsidize their internet access.

Heck, all it took for 80legs to get 50K+ machines was to offer some shiny trinket for people to install, an IM client if I remember correctly.

See:
[webmasterworld.com...]
[whatis.riskyinternet.com...]

Think about it.

If you had a big lucrative crawler operating for a wide variety of SEO and analysis purposes, and it was raking in a bunch of bucks, then paying people $5 per PC to share their IP address for scraping a few pages would probably pay for itself in no time, especially if you rented it out to other people needing similar services. Adding a few thousand IPs to the harem would simply be the cost of doing business and would easily get covered by client work.

How people are naive enough to let 80legs do it for free I'll never understand.

blend27

Msg#: 4619880 posted 1:24 pm on Nov 5, 2013 (gmt 0)

Despite everybody proclaiming overkill, in this particular instance a simple solution is:

deny from 178.
deny from 89.

But then one misses all the other JUICY info that one could learn from, i.e. one is not using methods of deduction while observing.
Detection is, or ought to be, an exact science and should be treated in the same cold and unemotional manner.
Sherlock Holmes, The Sign of Four.

wilderness

Msg#: 4619880 posted 4:46 pm on Nov 5, 2013 (gmt 0)

But then one misses all the other JUICY info that one could learn from, e.g not using methods of deduction while observing.


blend,
IMO, the most interesting portion of this entire thread is that an inquiry for assistance was generated by a non-forum-regular, and the thread participants (many of whom are also non-forum-regulars) could care less about assisting with that inquiry.

Don

IanKelley

Msg#: 4619880 posted 10:59 pm on Nov 5, 2013 (gmt 0)

@incrediBILL: has 80legs stopped identifying itself via UA?

jskrewson

Msg#: 4619880 posted 1:53 am on Nov 11, 2013 (gmt 0)

So, this was a timely post. I'm getting hit by a browser-based bot right now and have been for about three days. The problem is, the darn thing is behind thousands of broadband residential computers that I assume have been compromised. Today I banned 1,000 IP addresses and tomorrow I can do the same. Do they ever get tired and "go away"? I wonder how many flipping IP addresses this "proxy net" actually has? They are using an obviously illegal proxy net, but seem to have a lot of spending power.

Some other ideas I've had... Send them bad information. Send them p*or*n. Try various applet mechanisms for getting their true IP address. Send a virus.

I think I like the bad info the best. They are scraping my website for the info, which I hope they are making business decisions with. Would be hilarious to send data that makes them lose money...
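
Roughly how the bad-data idea might look, sketched against a hypothetical Express app (flaggedIps and getProduct are placeholders for whatever detection and catalog code is already in place):

// Flagged visitors get prices with random noise; everyone else gets the
// real catalog. (Sketch only -- flaggedIps/getProduct are placeholders.)
var express = require('express');
var app = express();

var flaggedIps = {};                      // filled by the existing detection
function getProduct(id) {                 // stand-in for the real lookup
  return { id: id, name: 'Widget ' + id, price: 19.99 };
}

app.get('/product/:id', function (req, res) {
  var product = getProduct(req.params.id);
  var price = product.price;
  if (flaggedIps[req.ip]) {
    // +/- up to 20% so the scraped "competitive pricing" is worthless
    price = +(price * (1 + (Math.random() * 0.4 - 0.2))).toFixed(2);
  }
  res.json({ id: product.id, name: product.name, price: price });
});

app.listen(8080);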

lucy24

Msg#: 4619880 posted 2:51 am on Nov 11, 2013 (gmt 0)

They are scraping my website for the info, which I hope they are making business decisions with.

Yah, but suppose the info gets turned around and sent to paying customers who wanted to know who links to them, and the robot reports back that it's an egregious ### site.

I think of this every time I get visited by a robot with an element like "seo" in its name.* At the far end of that robot is a human who has paid money to get information. Sure you can lie to it or slam the door in its face. But ultimately you're not hurting the robot; you're denying information to a human. Possibly a human from hereabouts who has been advised to try such-and-such service.


* This detail narrows it down to about 200 currently active robots, give or take.

jskrewson

Msg#: 4619880 posted 3:22 am on Nov 11, 2013 (gmt 0)

The bot hitting my site is for a very specialized purpose. They are doing competitive pricing of retail items.

IanKelley

Msg#: 4619880 posted 3:23 am on Nov 11, 2013 (gmt 0)

Do they ever get tired and "go away"?

Maybe but don't count on it.

Don't waste time banning IPs manually... in my experience there's always some common behavior in a particular bot that you can use to identify them automatically.

Then you can automate IP banning, or serving alternate content, pretty much anything. I like serving legitimate HTTP errors.
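
For anyone starting from scratch, automating the first pass can be as simple as this Node.js sketch: count hits per IP in the access log and emit deny lines for anything over a threshold (the log path and the 600-hit cutoff are just examples for your own setup):

// Reads a combined-format access log and prints "deny from x.x.x.x" for
// any IP over a hit threshold. (Path and threshold are examples only.)
var fs = require('fs');

var LOG_FILE = '/var/log/apache2/access.log';
var THRESHOLD = 600;                    // hits per log window

var counts = {};
fs.readFileSync(LOG_FILE, 'utf8').split('\n').forEach(function (line) {
  var ip = line.split(' ')[0];          // first field in combined log format
  if (ip) { counts[ip] = (counts[ip] || 0) + 1; }
});

Object.keys(counts).forEach(function (ip) {
  if (counts[ip] > THRESHOLD) {
    console.log('deny from ' + ip);     // paste into .htaccess or feed a firewall
  }
});

Swap the raw hit count for whatever behavioral tell the bot actually has (request rate, URL pattern, missing headers) once you've identified it.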

keyplyr

Msg#: 4619880 posted 10:11 am on Nov 11, 2013 (gmt 0)

@incrediBILL: has 80legs stopped identifying itself via UA?

Well, I'm not Bill, but just an FYI: as of yesterday at least one build still carries the UA:

***.**.**.* - - [10/Nov/2013:10:36:23 -0800] "GET /robots.txt HTTP/1.1" 200 4185 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html;) Gecko/2008032620"

Also, it has been my experience that they all obey robots.txt, at least as far as the Disallow: / directive.

dstiles

Msg#: 4619880 posted 9:09 pm on Nov 11, 2013 (gmt 0)

jskrewson - Botnets are cheap! You can rent one for a few dollars if you know where/how. The cost goes up if you want to use it for virus implantation, of course. As IanKelley says (and most other people hereabouts): find a common feature of the bot and block that. There is always something even if it's not the user-agent (which it often isn't).

There is no point in trying to send back bad data. It's expensive (in terms of server power) and serves no purpose. Just return a 403 code with a blank page.

You will not get a lot of actual blocking information here because any bot driver could pick up the information and counteract it. I suspect many people here have different header fields and other techniques they block on. With a bit of hard graft it's possible to block most bots. Start by creating a log of all header fields for each access and parse them for common traits.
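
As a starting point for that header log, something along these lines would do (a sketch using Express; the log file name is arbitrary):

// Append every request's full header set as one JSON line, so the file
// can be parsed later for traits only the bots share (odd Accept-Language
// values, missing Accept-Encoding, and so on).
var fs = require('fs');
var express = require('express');
var app = express();

app.use(function (req, res, next) {
  fs.appendFile('headers.log', JSON.stringify({
    time: new Date().toISOString(),
    ip: req.ip,
    method: req.method,
    url: req.url,
    headers: req.headers
  }) + '\n', function () { /* ignore write errors in this sketch */ });
  next();
});

app.get('/', function (req, res) { res.send('ok'); });
app.listen(8080);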

I would stress, though, that a LOT of bad bots come from server farms and clouds. Go back through this forum for a year or two and create an IP blocklist AS WELL AS creating a behaviour block.

keyplyr - I agree the real bot obeys robots.txt. There are (or used to be) several rogues that don't, though, so it's worth trapping for the UA.

jskrewson

Msg#: 4619880 posted 9:18 pm on Nov 11, 2013 (gmt 0)

I have automated the detection and banning. I do have to press a button though, so it runs once a day. I have a set of servers, so it's a bit difficult to gather data across all of them. I needed a control computer to do that, so I'm using my dev box.

Hmm, I rented a proxy net once and it was over $1 per IP address, per month. I didn't know that hacked computers were so much cheaper.

Bad data is pretty easy and inexpensive for me to send back. Much faster and cheaper than real data...
