This 48 message thread spans 2 pages.
The Bot Detection Game Has Changed
They may also be taking screenshots, but that's just icing on the crawling cake.
Here's a list of crawlers you can deploy on Node.js.
The technology is all there: crawling, scraping, data mining.
Here's a post on how to make screenshots using PhantomJS:
As mentioned above, here's a script to scrape AdSense ads!
How hard do you think it would be to CLICK those ads being scraped?
Anyway, those browsers aren't browsers, so block those data centers: these are NOT people out there, and this new code can probably respond to some rudimentary captchas. One captcha that I used to deploy would detect whether there was actual typing at a keyboard, and these new APIs may be able to easily simulate actual key presses. Not sure, as I'm just digging into the APIs.
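For what it's worth, that typing check can be done server-side if the page records the gaps between keydown events and posts them with the form. A minimal sketch in Python; the 15 ms threshold, the function name and the posted-intervals format are all assumptions for illustration, not a tested rule:

```python
import statistics

def looks_synthetic(key_intervals_ms, min_stdev_ms=15.0):
    """Flag keystroke timing that is suspiciously uniform.

    Real typing has irregular gaps between key events; scripted key
    events tend to arrive at near-constant intervals. `key_intervals_ms`
    is a list of gaps (in ms) between keydown events, collected
    client-side and posted with the form (hypothetical field).
    """
    if len(key_intervals_ms) < 5:
        # Too little typing to judge (and stdev needs >= 2 points);
        # treat as suspect.
        return True
    return statistics.stdev(key_intervals_ms) < min_stdev_ms

# A scripted sender firing keys every 50 ms exactly:
looks_synthetic([50, 50, 50, 50, 50, 50])     # True
# Human-ish irregular typing:
looks_synthetic([120, 80, 310, 95, 140, 60])  # False
```

Of course, as the post says, a bot that replays recorded human timings would pass this, so it only catches the lazy ones.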
Everything we used to know about how to detect and stop bots is out the window now that scrapers are written in headless browsers.
Total Game Changer.
Obviously I'll be experimenting a lot more and testing for exploitable tells, but since the scrapers and the browsers are now the same thing publicly discussing the differences will allow them to easily patch up the remaining holes in detectability.
Truthfully, I didn't expect it to get quite this easy for scrapers to go completely unnoticed for another few years, but it appears the timetable escalated substantially thanks to Google, Chrome and Node.js.
There's more, and maybe I'll dive into some other stuff later, but this is the basic information, so anyone wanting to deploy it and test it can easily make it happen.
|What overkill? That's just an isolated example coming from a class C using the same UA. |
I provided examples of denying Class A's. NO C's.
|How about several hundred ranges, random UAs and 'polite' scraping via a browser... you wouldn't notice. |
When the day comes that I don't notice, I'll remove my sites from the internet.
|I guess it depends on how important you feel it is to ban non-human non-useful visitors. |
I'm sure my list of UA's and IP's that are denied exceeds yours by leaps and bounds.
|brotherhood of LAN|
It sounds like you have it sussed out then, given your terse replies do you think this thread is a bit pointless?
|When the day comes that I don't notice, I'll remove my sites from the internet. |
Very assured. I'm also fairly certain that I could have your site in a database before you noticed, but TBH it sounds like you already know everything you need to.
Doesn't sound like something the average Joe or Jane is likely going to be able to handle. Most people don't have developers on staff (fortunately I do). All this is WAY above their technical ability. Sucks, but there it is.
|Doesn't sound like something the average Joe or Jane is likely going to be able to handle. |
You shouldn't be discouraged by the over-simplification expressed in this thread.
If Apache and htaccess have proven anything over time, it's that there are multiple methods and/or approaches that accomplish similar tasks.
|I'm also fairly certain that I could have your site into a database before you notice, |
Kinda doubt that, however you're welcome to waste your time trying.
|but TBH it sounds like you already know everything you need to. |
Within another two years, my widget commitments will be completed and I couldn't care less if both widgets and the internet are relocated to Mars.
All I need to know is enough to get me through those same two years.
|brotherhood of LAN|
Clarke's Three Laws spring to mind, but clearly you spend a lot of effort keeping out bad bots so it's not like you'd be the easiest target for such a thing. But feel free to sticky me your site.
Some great ideas in this thread.
There are a couple of different kinds of automated traffic being talked about, the difference is important where detection is concerned.
The first is traffic from real users that is being hijacked in one of various ways. In those cases testing for things like focus, interaction and visibility work really well.
The second is completely automated traffic with the ability to interpret pretty much anything a real browser can. Right now this isn't all that common but as the OP points out, it's going to become more common. The big search engines have been doing it for quite a while but they behave (usually) so there's no need to worry about them.
In a malicious case, it's possible to fake almost everything except the IP address. You can use proxies to change the IP but that slows you down dramatically. Still, it's pretty common. In any case, with this kind of traffic the IP itself is the best way to detect it. There are all kinds of interesting things you can do with an IP.
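One of the simpler "interesting things you can do with an IP" is checking it against known hosting/datacenter ranges before it ever reaches content. A sketch using Python's stdlib `ipaddress` module; the two CIDR blocks here are documentation ranges standing in for a real list you'd build from published cloud and hosting provider allocations:

```python
import ipaddress

# Hypothetical datacenter ranges; in practice this list would come
# from published cloud/hosting provider allocations, not two entries.
DATACENTER_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # documentation range
    ipaddress.ip_network("198.51.100.0/24"),  # documentation range
]

def from_datacenter(ip_str):
    """True if the client IP falls inside a known hosting range."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in DATACENTER_NETS)

from_datacenter("203.0.113.42")  # True
from_datacenter("192.0.2.7")     # False
```

Since a real browser on a residential line almost never originates from a hosting range, hits from those ranges are a strong (though not perfect, think corporate VPNs) bot signal.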
This does leave out traffic running through a botnet where it's both completely automated, with full browser capabilities and also running through a legitimate machine; however this kind of traffic is much less common than you'd think.
I'd like to contribute to the detection ideas in this thread with some of my own, but traffic fraud vs. fraud detection is an arms race. An entertaining one.
|This does leave out traffic running through a botnet where it's both completely automated, with full browser capabilities and also running through a legitimate machine; however this kind of traffic is much less common than you'd think. |
80legs does that, and they claim to have 50K+ machines.
That's an awfully big bunch of residential IPs to track, if you can catch them in the first place. Then the ISP, like Comcast, does a periodic IP shuffle (it happened to me 3 times in the last couple of weeks) and you're blocking innocent people.
As we escalate our war on bots the bot operators will probably start offering people a small stipend to become part of a distributed botnet. The poor that can't really afford to spend money to be online are most likely to let distributed botnets subsidize their internet access.
Heck, all it took for 80legs to get 50K+ machines was offer some shiny trinkets for people to install, an IM client if I remember correctly.
Think about it.
If you had a big, lucrative crawler operating for a wide variety of SEO and analysis purposes, and it was raking in a bunch of bucks, then paying people $5 per PC to share their IP address for scraping a few pages would probably pay for itself in no time, especially if you rented it out to other people needing similar services. Adding a few thousand IPs to the harem would simply be the cost of doing business and would easily be covered by client work.
How people are naive enough to let 80legs do it for free I'll never understand.
|despite everybody proclaiming overkill, in this particular instance a simple solution is |
deny from 178.
deny from 89.
But then one misses all the other JUICY info that one could learn from, e.g. by observing and applying methods of deduction.
Detection is, or ought to be, an exact science and should be treated in the same cold and unemotional manner.
Sherlock Holmes, The Sign of Four.
|But then one misses all the other JUICY info that one could learn from, e.g not using methods of deduction while observing. |
IMO, the most interesting portion of this entire thread is that an inquiry was posted by a non-forum-regular asking for assistance, and the thread participants (many of whom are also non-forum-regulars) couldn't care less about assisting with the inquiry.
@incrediBILL: has 80legs stopped identifying itself via UA?
So, this was a timely post. I'm getting hit by a browser-based bot right now and have been for about three days. The problem is, the darn thing is behind thousands of broadband residential computers that I assume have been compromised. Today I banned 1,000 IP addresses and tomorrow I can do the same. Do they ever get tired and "go away"? I wonder how many flipping IP addresses this "proxy net" actually has? They are using an obviously illegal proxy net, but they seem to have a lot of spending power.
Some other ideas I've had... Send them bad information. Send them p*or*n. Try various applet mechanisms for getting their true IP address. Send a virus.
I think I like the bad info the best. They are scraping my website for the info, which I hope they are making business decisions with. Would be hilarious to send data that makes them lose money...
|They are scraping my website for the info, which I hope they are making business decisions with. |
Yah, but suppose the info gets turned around and sent to paying customers who wanted to know who links to them, and the robot reports back that it's an egregious ### site.
I think of this every time I get visited by a robot with an element like "seo" in its name.* At the far end of that robot is a human who has paid money to get information. Sure you can lie to it or slam the door in its face. But ultimately you're not hurting the robot; you're denying information to a human. Possibly a human from hereabouts who has been advised to try such-and-such service.
* This detail narrows it down to about 200 currently active robots, give or take.
The bot hitting my site is for a very specialized purpose. They are doing competitive pricing of retail items.
|Do they ever get tired and "go away"? |
Maybe but don't count on it.
Don't waste time banning IPs manually... in my experience there's always some common behavior in a particular bot that you can use to identify them automatically.
Then you can automate IP banning, or serving alternate content, pretty much anything. I like serving legitimate HTTP errors.
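The "common behavior" that is easiest to automate on is usually request rate: no human pulls dozens of pages in a few seconds. A rough sliding-window sketch in Python; the window length, threshold, and function names are all assumptions to tune for your own traffic, not recommended values:

```python
import time
from collections import defaultdict, deque

WINDOW_S = 10   # look-back window in seconds (assumed)
MAX_HITS = 30   # more hits than this per window looks automated (assumed)

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def should_block(ip, now=None):
    """Rate-based check, called once per request.

    Records the hit, drops timestamps older than the window, and
    returns True when the IP should get e.g. a 403 instead of content.
    """
    now = time.time() if now is None else now
    q = hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW_S:
        q.popleft()
    return len(q) > MAX_HITS
```

In a real deployment this state would live in something shared like a database or cache rather than process memory, and the response on a hit could be a 403, alternate content, or a tarpit.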
|@incrediBILL: has 80legs stopped identifying itself via UA? |
Well, I'm not Bill, but just an FYI: as of yesterday at least one build still carries the UA:
***.**.**.* - - [10/Nov/2013:10:36:23 -0800] "GET /robots.txt HTTP/1.1" 200 4185 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html;) Gecko/2008032620
Also, it has been my experience that they all obey robots.txt, at least as far as the Disallow: / directive.
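A quick way to scan logs for that UA is to pull the final quoted field from each combined-format line, which is the user-agent, and match on it. A sketch, assuming the standard Apache combined log format; the IP below is a documentation address, not the one from the masked line above:

```python
import re

# The user-agent is the last quoted field in a combined-format line.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def user_agent(log_line):
    m = UA_RE.search(log_line)
    return m.group(1) if m else ""

line = ('203.0.113.9 - - [10/Nov/2013:10:36:23 -0800] '
        '"GET /robots.txt HTTP/1.1" 200 4185 "-" '
        '"Mozilla/5.0 (compatible; 008/0.83; '
        'http://www.80legs.com/webcrawler.html;) Gecko/2008032620"')

is_80legs = "80legs.com" in user_agent(line)  # True
```

Matching on the URL in the UA string rather than the "008" token avoids false hits, since "008" is short enough to appear elsewhere.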
jskrewson - Botnets are cheap! You can rent one for a few dollars if you know where/how. The cost goes up if you want to use it for virus implantation, of course. As IanKelley says (and most other people hereabouts): find a common feature of the bot and block that. There is always something even if it's not the user-agent (which it often isn't).
There is no point in trying to send back bad data. It's expensive (in terms of server power) and serves no purpose. Just return a 403 code with a blank page.
You will not get a lot of actual blocking information here because any bot driver could pick up the information and counteract it. I suspect many people here have different header fields and other techniques they block on. With a bit of hard graft it's possible to block most bots. Start by creating a log of all header fields for each access and parse them for common traits.
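One way to start that header-field logging and parsing: fingerprint each request by which header names it sent (and in what order), then count how often each fingerprint appears. Real browsers of a given family send a stable set of headers; bots often send an odd subset and cluster together. A sketch with made-up captured requests:

```python
from collections import Counter

def fingerprint(headers):
    """Order-sensitive fingerprint of the header *names* a client sent."""
    return "|".join(h.lower() for h in headers)

# Hypothetical captured requests: just the header names, in wire order.
requests = [
    ["Host", "User-Agent", "Accept", "Accept-Language", "Accept-Encoding"],
    ["Host", "User-Agent", "Accept", "Accept-Language", "Accept-Encoding"],
    ["Host", "User-Agent"],  # stripped-down client: likely a bot
]

counts = Counter(fingerprint(r) for r in requests)
# counts["host|user-agent"] == 1  -> the outlier worth a closer look
```

Rare fingerprints are not proof of a bot by themselves, but they tell you where to point the rest of your checks.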
I would stress, though, that a LOT of bad bots come from server farms and clouds. Go back through this forum for a year or two and create an IP blocklist AS WELL AS creating a behaviour block.
keyplr - I agree the real bot obeys robots.txt. There are (or used to be) several rogues that don't, though, so it's worth trapping for the UA.
I have automated the detection and banning. I do have to press a button, though, so once a day. I have a set of servers, so it's a bit difficult to gather data across all of them. Needed a control computer to do that, so I'm using my dev box.
Hmm, I rented a proxy net once and it was over $1 per IP address, per month. I didn't know that hacked computers were so much cheaper.
Bad data is pretty easy and inexpensive for me to send back. Much faster and cheaper than real data...