
Search Engine Spider and User Agent Identification Forum

    
Quick primer on identifying bot activity,
and a how-to guide to slowing and stopping scraping
Ocean10000




msg:3614000
 7:29 am on Mar 29, 2008 (gmt 0)

The following items can be used to identify bots and to slow down and stop most unwanted traffic, if applied with due care.

  1. Check for signs that some proxy servers send in the supplied request headers. The goals here are to try to detect where a browser is really coming from, and to note that some additional proxy-specific checks should be run later on. Proxy servers will often include the browser's original IP address in the "X-Forwarded-For" header by default. If this header is present, save its value so it can be checked later.

    The reason to note this is that you do not want, say, Googlebot to crawl your entire site through a proxy and get hit with a duplicate-content penalty, or to have someone else earn money by inserting their own Google AdSense ads.

  2. Check your business plan or site focus to see if it allows you to exclude foreign countries where you cannot or will not do business; this includes visitors who came in via a proxy server that supplied an "X-Forwarded-For" IP address. If you can exclude these countries, use geolocation software to do one of the following: (A) redirect them to a nicely worded page stating the reason why they cannot order from or view the site, or (B) outright block them from accessing the site. Also remember to leave proper openings for your allowed search engine crawlers, just in case they come from some of these ranges. What this does is allow you to focus your attention on a smaller group of web browsers and crawlers.

  3. If they read robots.txt, log the IP address and User-Agent and note whether, by the rules outlined in robots.txt, they should be banned or not. I usually assume anything that reads the robots.txt file is a bot, or someone snooping around who is up to no good.

    Make sure robots.txt only allows the bots you wish to have crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.

  4. Check to see if the IP address or User-Agent has previously been banned for disobeying the robots.txt file, and take action if they are not allowed to access the site, with measures that are appropriate for the site. Generally I prefer to send the user a 403 status code with no further content, so as not to waste valuable bandwidth on bad bots, and not to supply the bot owners with information on how to sneak around the anti-scraping measures put in place on the website.

  5. Check if the IP or User-Agent has previously been given a captcha check and has not answered it; if so, send another captcha check and make a note of how many captcha checks it has triggered.

  6. Check if the IP has previously been banned, and if it has been, give it the proper message it so well deserves.

  7. Check whether the "From" header is present. This header should only be supplied by bots, so if it is found the request can be marked as a bot even when the User-Agent is a non-crawler and not identifiable as a bot by other methods. This header usually takes the form of an email address to which you can report problems in response to the bot's activities on the website.

  8. Check to see if the User-Agent contains one of the following terms so it can be flagged as a possible bot. These may catch some malformed User-Agents, but at this point it is only being flagged as a possible bot; it is not yet known for sure whether it is one (see the sketch after this list).

    • "Crawler"
    • "Bot"
    • "Spider"
    • "Ask Jeeves"
    • "Search"
    • "Indexer"
    • "Archiver"
    • "Larbin" <-- Email scraper
    • "Nutch" <-- Open source web crawler which is abused.
    • "Libwww" <-- Used by a lot of scrapers
    • "User-Agent" <-- Badly formed User-Agent

  9. If it was previously flagged as a possible bot, perform analysis of the supplied User-Agent. The purpose is to weed out bots that are not on a white list of allowed bots, or bots that have been explicitly banned from accessing the site.
    • This is done by seeing if the User-Agent matches a known string which an allowed bot uses, and letting it continue on through further checks.
    • If the User-Agent matches a known disallowed bot, mark the IP as banned and give it a proper message.
    • If it does not match any known bot User-Agent that has been coded as disallowed or allowed, the proper thing to do here is show it a captcha page where a human may continue but a bot would get stuck. Mark the IP and User-Agent as having been given a captcha check and note whether they answer it properly.

  10. Check if the User-Agent [u]is identified as a bot[/u] and whether the "From" header was supplied. Check the "From" header against the known valid "From" headers of the white-listed bots to see if it matches and is present when it is expected to be. If the white-listed bot's "From" header does not match what is expected, mark the IP as banned and give it a proper message.

  11. Check the allowed bots that have made it this far against the list of bots that support DNS checks to validate them (a minimal sketch follows this list).

    The following checks will also stop major search engines which are unknowingly crawling through a transparent proxy server, sparing the website duplicate-content penalties as a side benefit.

    (A) Reverse DNS check: look up the IP to get the hostname, then check the resolved hostname against the known patterns for the search engine in question. If they do not match, mark the IP as banned and give it a proper message.

    (B) Then do a forward lookup on the hostname to see if it resolves back to a list of IP addresses that contains the IP you started with.

    Something to watch out for: some fake bots will have their IP address resolve to a hostname that is simply the IP address itself, and thus would pass the test, so this must be explicitly tested for and rejected by default. For example, the IP "10.0.0.1" would resolve to the hostname "10.0.0.1".

    MSN, Yahoo, Google, and Ask Jeeves all currently support this functionality; others may as well. The purpose of this check is to prevent others from spoofing well-known crawlers by setting up their DNS records to resolve their IPs to a well-known search engine hostname; since they do not control the forward resolution of that hostname back to an IP, they will get caught by this check.

  12. [More to come at a later date]
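
For anyone coding this up, here is a minimal Python sketch of the User-Agent flagging in item 8 and the reverse/forward DNS validation in item 11. Treat it as an outline under stated assumptions: the hostname suffixes, the term list, and the function names are illustrative choices rather than an authoritative list, and a real implementation would cache the DNS results.

import socket

# Terms from item 8; any UA containing one of these gets flagged as a possible bot.
BOT_TERMS = ("crawler", "bot", "spider", "ask jeeves", "search",
             "indexer", "archiver", "larbin", "nutch", "libwww", "user-agent")

# Assumed hostname suffixes for the white-listed crawlers (illustrative only).
ALLOWED_HOST_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net",
                         ".search.msn.com", ".ask.com")

def looks_like_bot(user_agent: str) -> bool:
    """Item 8: flag the request as a *possible* bot."""
    ua = user_agent.lower()
    return any(term in ua for term in BOT_TERMS)

def validates_by_dns(ip: str) -> bool:
    """Item 11: reverse the IP to a hostname, match it against known crawler
    domains, then resolve that hostname forward and require the original IP
    to appear in the results."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)              # (A) reverse lookup
    except OSError:
        return False
    if hostname == ip or not hostname.endswith(ALLOWED_HOST_SUFFIXES):
        return False    # rejects the fake "hostname is just the IP" trick and unknown hosts
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # (B) forward lookup
    except OSError:
        return False
    return ip in forward_ips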

 

blend27




msg:3614096
 12:24 pm on Mar 29, 2008 (gmt 0)

Thanks for storming it up Ocean.

One thing we do, at the early stage, is check against the IP blocks of known datacenters/hosting ranges/colos. From real-life experience, most of the scrapers come from those (a rough sketch of the check is below).

These most often are: ThePlanet (EV1), nLayer, SoftLayer, GNAC, AboveNet, ISPRIME, Bluehost, NOC4HOSTS, SCHLUND (1&1), KEYWEB, OVH (the French have a sense of humor), and my recent favorite of all, Netdirekt.
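
A rough Python sketch of that early check, assuming you maintain your own list of datacenter CIDR blocks; the ranges below are placeholders, not the real allocations for the hosts named.

import ipaddress

# Placeholder CIDR blocks; a real list would be built from WHOIS/ASN data.
DATACENTER_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",      # placeholder
    "198.51.100.0/24",   # placeholder
)]

def from_datacenter(ip: str) -> bool:
    # True if the visitor's IP falls inside any known hosting/colo range.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)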

That cuts down on a lot of overhead. But then again, I remember someone (think it was Martinibuster) mentioned something that stuck in my head: .. Give your site to scrapers, they provide backlinks....

We do but custom.

There are also different kinds of scrapers. Chances are there are only a few (hundred) that are targeting your niche. So tracking where they scrape your stuff from and where they host it gets to be a lot of fun. OK, I'll stop with the ranting (Slayer stuff :) ), but more to come at a later date.

Blend27

wilderness




msg:3614131
 1:10 pm on Mar 29, 2008 (gmt 0)

Case errors.

Although most everybody names their pages and directories in lower case, a proven method of identifying lame bots is to establish pages and/or directories which use upper-case names. These lame bots will attempt to grab the pages in lower case (a quick sketch below).
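
A quick Python sketch of the idea, with a hypothetical trap path; any real page published under a mixed-case name works the same way.

# Pages that really exist only under their mixed-case names.
TRAP_PATHS = {"/Articles/Widgets.html"}               # hypothetical example
LOWERCASED_TRAPS = {p.lower() for p in TRAP_PATHS}

def is_lame_bot_request(path: str) -> bool:
    # The client asked for the lowercased form of a trap page, which a
    # normal visitor following real links would never do.
    return path in LOWERCASED_TRAPS and path not in TRAP_PATHS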

incrediBILL




msg:3614283
 6:00 pm on Mar 29, 2008 (gmt 0)

Give your site to scrapers, they provide backlinks....

While Martini's thesis was correct at some level, it only covered a few scrapers: the "legit" scrapers trying to build actual resource sites, and they are few and far between. It's still what we call a "BAD IDEA".

Giving your site to the scrapers normally results in them competing with you for your own content, trying to hijack your SERPs, or associating your site with bad keywords, nasty neighborhoods, and worse.

When it comes to scrapers, just say 403.

Ocean10000




msg:3614340
 7:52 pm on Mar 29, 2008 (gmt 0)

  1. Check to see if the User-Agent contains one of the following terms so it can be flagged as a possible mobile phone browser.
    • "Windows CE"
    • "PalmOS"
    • "MIDP-"
    • "Portalmmm"
    • "Symbian OS"
    • "UP.Browser"

  2. Check to see if the "x-wap-profile" header is present; this is another header which is often sent by mobile phone browsers. If present, it is usually safe to flag the browser as a mobile browser of some type.

    The value of this header is usually a URL pointing to an XML file describing the supported features of the browser and phone.

  3. Check if the "Accept" header is present, and make a note of this for later.

  4. Check if the browser identifies as one of the following browsers, "IE", "Opera", or "Firefox", and is missing the "Accept" header. There are two options one could take in this case; they are listed below. If checking all non-bot User-Agents excluding mobiles, it is advised to use the first method, just in case (see the sketch after this list).

    • One would be to show it a captcha page where a human may continue but a bot would get stuck. Mark the IP and User-Agent as having been given a captcha check and note whether they answer it properly.

    • The other is to mark the IP and block it from accessing the site, sending a 403 status and no further content.

    Most major web browsers will send this header along with the request, which tells the web server what content the browser can accept. I have only listed the major browser providers, but this is usually safe for all known browsers except a few mobile browsers, which is why the mobile browser checks are in place earlier.
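
A quick Python sketch of these mobile and Accept-header checks, assuming the request headers are available as a simple dict (adjust the lookups for whatever your framework exposes); the "MSIE" token standing in for "IE" is my assumption.

MOBILE_TERMS = ("Windows CE", "PalmOS", "MIDP-", "Portalmmm",
                "Symbian OS", "UP.Browser")
MAJOR_BROWSERS = ("MSIE", "Opera", "Firefox")   # "MSIE" assumed as the IE token

def is_mobile(headers: dict) -> bool:
    # Items 1 and 2: UA term match or presence of the x-wap-profile header.
    ua = headers.get("User-Agent", "")
    return any(term in ua for term in MOBILE_TERMS) or "x-wap-profile" in headers

def missing_accept_suspect(headers: dict) -> bool:
    # Items 3 and 4: a major desktop-browser UA that sends no Accept header.
    ua = headers.get("User-Agent", "")
    claims_major = any(name in ua for name in MAJOR_BROWSERS)
    return claims_major and not is_mobile(headers) and "Accept" not in headers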


Webwork




msg:3615241
 11:37 am on Mar 31, 2008 (gmt 0)

Given the prevalence of unfriendly bot activity, why isn't the following true:

1. Hosting providers exist that offer robust bot blocking as a value added feature or benefit of their hosting plans. (They needn't make it an across the board offer of all hosting plans.)

It seems to me that, as a value-added service - one that would also cut down on bandwidth and server load - such an offer would be attractive.

Of course, those hosting providers that host scrapers and bots likely wouldn't be first to offer the service. ;-P

2. Absent #1 an entity, such as CPanel, would offer bot identification and blocking software-as-a-service on a per-server basis, with routine automatic uploads.

Under version #2 the service could offer opt-out functions, built into the bot blocking control panel, which would allow user control of IP range or block exclusion, etc.

The downside of bot blocking is the risk of blocking a friendly bot. IF the search engines are a contributing cause of bad bot activity, then one would think that the search engines, as an expression of their do-no-harm policy, would support efforts and initiatives to identify, track and block bad bots. At the very least one would think that the search engines would provide the bot blocking services with a ready remedy to reverse the effects of any improvident blocking of their bots.

If the search engines aren't helping third-party software developers or hosting firms to identify, track and block bad bots what would be the reason(s)?

[edited by: Webwork at 11:57 am (utc) on Mar. 31, 2008]

mikedee




msg:3615290
 12:35 pm on Mar 31, 2008 (gmt 0)

If hosting companies offered bot blockers, then the bots would easily work around them. It is possible to make bots unidentifiable from normal visitors. I think you guys are fighting a losing battle by trying to block them. Captchas may work on a limited basis, but they make things harder for your users too.

Number 3 doesn't look good to me; blocking everyone via robots.txt does not help new search engines entering the market. From what I have seen, bad bots do not get robots.txt anyway. Blocking a bot that takes robots.txt but then accesses a blocked page would be more reasonable.

Even if you block a specific bot, it is trivial to crawl your site through one of the many public caching services.

Romeo




msg:3615335
 1:58 pm on Mar 31, 2008 (gmt 0)

4. ..... Generally I prefer to send the user a 403 status code with no further content, so as not to waste valuable bandwidth on bad bots, and not to supply the bot owners with information on how to sneak around the anti-scraping measures put in place on the website.

Yes, but the 403 itself may already be too much information for them. It may be more fun to send them a 200 with an empty page containing just a few blank characters, or some short alternate content.

I once even came across a public forum discussion where a scraper script kiddie complained that his scraping script might have a bug, because when trying to scrape www.<mysite>.com it was getting no content, and asked what to do about that bug.
Lol. That made my day, of course.

Kind regards,
R.

Clark




msg:3615485
 5:00 pm on Mar 31, 2008 (gmt 0)

Thank you for starting this thread. I've been looking for a tutorial of best practices on this topic.

himalayaswater




msg:3615647
 8:27 pm on Mar 31, 2008 (gmt 0)

Neat. Here is my robots.txt that only allows Google, Y!, MSN, and Teoma.

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:

Please note that bad bots will mostly ignore your robots.txt.

There is also a list of bad-bot IPs; load them into your firewall using a shell script or a PHP script (a rough sketch below) - [spamhaus.org...]
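
For example, a small Python sketch that loads a locally downloaded block list into iptables. It assumes one CIDR per line with ";" starting comments (as in the Spamhaus DROP format); the file name and chain are illustrative choices only.

import subprocess

def load_blocklist(path: str = "drop.txt") -> None:
    with open(path) as fh:
        for line in fh:
            cidr = line.split(";")[0].strip()    # strip trailing comments
            if not cidr:
                continue
            # Append a DROP rule for each listed network.
            subprocess.run(["iptables", "-A", "INPUT", "-s", cidr, "-j", "DROP"],
                           check=True)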

Hope this helps someone.

JAB Creations




msg:3615763
 11:20 pm on Mar 31, 2008 (gmt 0)

I read with great interest Brett's comments - blogged in a very unusual place - when he was dealing with this issue here.

I am very quiet about my security (and it's quiet to my human visitors), but one thing I don't allow is any open-source spider. Don't get me wrong, I'm all for open source, but online, believe it or not, I'm mostly business-minded. Commercial spiders are allowed on my site so long as they aren't redistributable in any way (open source or not).

Beyond that it only takes a little ingenuity to combat bad bots. It's not really that difficult and the logic behind it is a lot of fun. :)

- John

incrediBILL




msg:3615784
 11:48 pm on Mar 31, 2008 (gmt 0)

It's not really that difficult

Most of them aren't that difficult, but when you get into the wonderful world of commercial spybots it can become quite difficult, because they don't want to be found in the first place and have the money to hide quite effectively.

Imagine, if you would, a bot using MSIE's default user agent and 100 different IP addresses from various locations around the world. How easy would it be to spot even a sequential scan of your site that hops from service provider to service provider, or country to country?

Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world, which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.

OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...

Ocean10000




msg:3615833
 1:13 am on Apr 1, 2008 (gmt 0)

If they read robots.txt, log the IP address and User-Agent and note whether, by the rules outlined in robots.txt, they should be banned or not. I usually assume anything that reads the robots.txt file is a bot, or someone snooping around who is up to no good.

Make sure robots.txt only allows the bots you wish to have crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.

I left out that I have a dynamic robots.txt, similar to WebmasterWorld here, where only the bots I want are shown the version that does not block everything (a rough sketch below). So if they are not white-listed to receive the unblocked robots.txt, the next file they take will get them blocked by default.
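
A bare-bones sketch of that dynamic robots.txt in Python, assuming some is_whitelisted() check you already trust (for example the DNS validation earlier in the thread); this is only an outline of the logic, not the module Ocean10000 runs.

OPEN_ROBOTS = "User-agent: *\nDisallow:\n"      # white-listed crawlers: full access
CLOSED_ROBOTS = "User-agent: *\nDisallow: /\n"  # everyone else: fully disallowed

def robots_txt(remote_ip: str, user_agent: str, is_whitelisted) -> str:
    # Non-white-listed clients get the blocking version, so any page they
    # fetch afterwards marks them as ignoring robots.txt.
    if is_whitelisted(remote_ip, user_agent):
        return OPEN_ROBOTS
    return CLOSED_ROBOTS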

As for new search engines, I stopped being an early adopter when I had to pay the high bandwidth bill every month that allowing all the new spiders free access caused. I like to have some profits, and to eat nicely, thank you very much.

wilderness




msg:3615862
 1:58 am on Apr 1, 2008 (gmt 0)

Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.

OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...

Unique concept!

Imagine how the scheming and collective group above (as well as many of those within "the www" who banter about the theme of "free access" or "public domain") perceive a collective group of webmasters at SSID discussing limiting the access of possible infractors? ;)

Saw where some universities were taking part in a public-domain archive of their libraries through Archive.org (I believe), rather than Google, because Google presented too many possibilities for future access restrictions and future paid-access (subscription) possibilities.

Bewenched




msg:3616507
 7:16 pm on Apr 1, 2008 (gmt 0)

Thank you so much, Ocean, for addressing this issue and explaining in layman's terms what we can potentially do about it. You may think about adding TheFind to the list of acceptable bots, since PayPal has now enlisted them to list PayPal merchants' products. Just a thought.

On a side note, I always find it fun to feed known scrapers either mashed-up copy, text that has been completely reversed so it's backwards and mixed up, and/or their own tail through a proxy. Yes, it's mean, but so is using my content to display ads and trashing our rankings.

We recently signed up with HackerSafe, and one of the things that threw a flag in their system was disclosing ANY directories within robots.txt.

blend27




msg:3616777
 4:48 am on Apr 2, 2008 (gmt 0)

@mikedee

-- bot blockers then the bots would easily work around them. It is possible to make bots unidentifiable from normal visitors. --

But 99% of them (scrapers) have no clue how to do it. I've compiled a long list of hosting/datacenter ranges over the past couple of years. Anything that comes from those gets an automatic boot. The ones that do know what they are doing would most likely rent a server and a big PIPE.

The rest get stuck in later subroutines/logic, plus random spider traps.

incrediBILL




msg:3616855
 7:02 am on Apr 2, 2008 (gmt 0)

It is possible to make bots unidentifiable from normal visitors.

They can try, but even trying to look 100% human they often fail.

The problem they have is that looking too human slows the scraping process down and uses a significant amount of bandwidth just to hide their activities.

It's all a cat-and-mouse game. I've automatically snared some impressive bots; they still hit bot traps and can't respond to unknown situations, which makes them vulnerable.

chandrika




msg:3617381
 5:38 pm on Apr 2, 2008 (gmt 0)

Any thoughts on using an IP list such as the one at [iplists.com...] to cloak a sitemap? I have noticed that most scrapers are using my sitemaps in their scripts. If I disallow all except known spider IPs from viewing the sitemaps, might this help?

Ocean10000




msg:3618069
 1:36 pm on Apr 3, 2008 (gmt 0)

Any thoughts on using an IP list such as the one at [iplists.com...] to cloak a sitemap? I have noticed that most scrapers are using my sitemaps in their scripts. If I disallow all except known spider IPs from viewing the sitemaps, might this help?

Restricting access to your sitemaps file is a good idea to stop unwanted bots from reading it. Whether you use the lists of IPs offered at that site is your choice. I think what you are asking is whether the public lists provided are good enough to build your sitemap access list with. Personally, I would only use those lists as a small part of the solution, not the whole part. I use a bunch of smaller tests to validate the allowed bots, so as to weed out the fake bots without having to maintain lists of IPs for each bot.

chandrika




msg:3618083
 1:56 pm on Apr 3, 2008 (gmt 0)

Thanks, I had come to the conclusion today that I would run a risk relying on public IP lists, and had found a few pages about spider traps and such that I thought might be a better solution. Also, as you say, keeping any kind of IP list updated could be difficult.

Is it that kind of spider trap that you use? I saw a few, such as a little test to see if the bot follows the robots.txt rules; that sounded like a good idea. Do you do stuff like that?

Ocean10000




msg:3618999
 12:48 pm on Apr 4, 2008 (gmt 0)

Thanks, I had come to the conclusion today that I would run a risk relying on public IP lists, and had found a few pages about spider traps and such that I thought might be a better solution. Also, as you say, keeping any kind of IP list updated could be difficult.

Is it that kind of spider trap that you use? I saw a few, such as a little test to see if the bot follows the robots.txt rules; that sounded like a good idea. Do you do stuff like that?

Yes, I use a custom-written module for ASP.NET which allows me to filter out the unwanted traffic by following the steps I outlined in my posts here, and then some (a rough sketch of a robots.txt trap is below).
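
For the curious, here is a bare-bones Python sketch of the robots.txt-compliance trap being discussed; the trap URL is hypothetical, and since the real module is ASP.NET this is only an outline of the logic.

# The trap path is listed under Disallow in robots.txt and never linked in a
# way a human would follow, so anything requesting it either ignored
# robots.txt or is crawling blindly.
TRAP_PATH = "/private/do-not-crawl/"     # hypothetical

banned_ips = set()

def check_spider_trap(remote_ip: str, path: str) -> bool:
    """Ban and flag any client that requests the trap path."""
    if path.startswith(TRAP_PATH):
        banned_ips.add(remote_ip)
        return True
    return False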
