
Forum Moderators: Ocean10000 & incrediBILL


How to Identify bad bots from legitimate users

Need help on hints and tips that give away a bad bot

     

liam_85

11:09 am on Aug 3, 2010 (gmt 0)

5+ Year Member



Hello All,

I have been reading this forum, but I am still having a hard time figuring out which entries in my access logs are bad bots.

219.91.141.244 - - [31/Jul/2010:05:08:18 -0700] "PROPFIND /halfstar.gif HTTP/1.1" 500 - "-" "-"


It accessed the site many times in a few minutes. An nslookup reveals:


>nslookup 219.91.141.244
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: 244-141-91-219.static.youtele.com
Address: 219.91.141.244


I also checked another IP identifying itself as yahoo slurp:



67.195.111.185 - - [31/Jul/2010:05:22:13 -0700] "GET /example.html HTTP/1.0" 500 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"


>nslookup 67.195.111.185
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: b3091326.crawl.yahoo.net
Address: 67.195.111.185

C:\Users\MOR02>
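One way to triage entries like these is to parse the combined-log fields programmatically and flag oddities (blank user-agents, unusual methods like PROPFIND, which is a WebDAV probe). A minimal sketch in Python; the regex assumes Apache combined log format, and the field names are my own:

```python
import re

# Apache combined log format: IP, identity, user, [time],
# "request", status, bytes, "referer", "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d+) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return the log fields as a dict, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('219.91.141.244 - - [31/Jul/2010:05:08:18 -0700] '
        '"PROPFIND /halfstar.gif HTTP/1.1" 500 - "-" "-"')
entry = parse_line(line)
print(entry["method"], entry["status"], entry["ua"])  # PROPFIND 500 -
```

A blank ("-") user-agent combined with a method no browser sends is already a strong bad-bot signal, before any DNS work.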

wilderness

2:05 pm on Aug 3, 2010 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



219.91.141.244 - - [31/Jul/2010:05:08:18 -0700] "PROPFIND /halfstar.gif HTTP/1.1" 500 - "-" "-"

67.195.111.185 - - [31/Jul/2010:05:22:13 -0700] "GET /example.html HTTP/1.0" 500 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

I have been reading this forum but i am still having a hard time figuring out what is a bad bot from the access logs.


You've bigger issues than bot identification.

Note the 500 following HTTP/1.0" or HTTP/1.1" in the two lines above?

Either your server is not functioning at all, or it is sending HTTP requests in a loop.
(Please note: either issue could be due to a syntax error you may have previously introduced while modifying your .htaccess.)

jdMorgan

2:37 pm on Aug 3, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Check your server error log when you get a 500-Server Error -- It will often tell you exactly what is wrong.

The number one most common problem when blocking unwelcome visitors is a failure to "allow" a custom 403 error page to be served. The server detects the unwanted access, invokes the custom 403 error page, and finds that access to that error page is also denied, so it re-invokes 403 error handling, gets another error, and thus ends up in a loop. Eventually, the server gives up and throws a 500-Server Error.

So, if you are currently blocking these requests, look to your access-control code; it needs a tweak.

Jim
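The loop Jim describes can be broken by explicitly exempting the custom error document from the deny rules. A hedged .htaccess sketch; the paths and the "BadBot" UA substring are placeholders, not recommendations:

```apache
# Serve a custom page on 403s
ErrorDocument 403 /errors/403.html

RewriteEngine On
# Let everyone fetch the error page itself, or the server
# loops 403 -> 403 and eventually throws a 500.
RewriteRule ^errors/403\.html$ - [L]
# Example deny rule; "BadBot" is a placeholder UA substring
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]
```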

liam_85

2:55 pm on Aug 3, 2010 (gmt 0)

5+ Year Member



I was using:


RewriteCond %{HTTP_USER_AGENT} ^bot* [NC]
RewriteRule .* - [F,L]


To stop certain bots - I have since commented this out and put in a bot trap to try and save bandwidth.
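Incidentally, `^bot*` probably doesn't do what was intended: in a regular expression, `*` applies only to the preceding character, so with [NC] this pattern matches any user-agent beginning with "bo". A bare "bot" substring would be risky too, since it also matches Googlebot. A sketch of a more deliberate pattern; the UA names here are illustrative examples only:

```apache
RewriteEngine On
# Anchor specific known-abusive names at the start of the UA,
# or match distinctive library substrings anywhere
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (libwww-perl|lwp-trivial) [NC]
RewriteRule .* - [F]
```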

liam_85

10:39 am on Aug 4, 2010 (gmt 0)

5+ Year Member



Okay I have set up the bot trap and this morning there were 2 entries:

66.249.65.236 - - [2010-08-03 (Tue) 09:37:55] "GET /bot-trap/ HTTP/1.1" Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

67.195.111.185 - - [2010-08-03 (Tue) 09:51:24] "GET /bot-trap/ HTTP/1.0" Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)


I followed the steps from Google to try and confirm whether it was a legitimate Googlebot here [google.com]

I did nslookup on the ip and then a lookup again on the returned result and I got a non authoritative answer.


Microsoft Windows [Version 6.0.6001]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.

>nslookup 66.249.65.236
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: crawl-66-249-65-236.googlebot.com
Address: 66.249.65.236


>nslookup crawl-66-249-65-236.googlebot.com
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Non-authoritative answer:
Name: crawl-66-249-65-236.googlebot.com
Address: 66.249.65.236



Does this mean it is a genuine google crawler?
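The reverse-then-forward double lookup shown above is exactly the check Google recommends, and it can be scripted. A sketch in Python; the suffix list and function names are my own, and `verify_crawler` needs live DNS access:

```python
import socket

# Hostname suffixes used by the major crawlers' verified hosts
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".crawl.yahoo.net")

def suffix_ok(hostname, suffixes=CRAWLER_SUFFIXES):
    """Pure check: does the reverse-DNS name end in a known crawler domain?"""
    return hostname.rstrip(".").endswith(suffixes)

def verify_crawler(ip):
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the name and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        if not suffix_ok(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
        return ip in forward_ips                             # must round-trip
    except socket.error:
        return False
```

The round trip matters because anyone can fake a `googlebot.com`-looking PTR record for their own IP range; only the forward lookup, which is controlled by the real domain owner, confirms it.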

enigma1

12:13 pm on Aug 4, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Yes, that Google IP is legit for Googlebot.

liam_85

9:34 am on Aug 5, 2010 (gmt 0)

5+ Year Member



I wonder why it didn't follow robots.txt then? That is very bizarre.

User-agent: *

Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
User-agent: Slurp
Crawl-delay: 2


Any ideas?

enigma1

12:47 pm on Aug 5, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Ref:
[google.com...]

While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web


Robots.txt is more a set of "guidelines"; I never rely on it to restrict bot access.

jdMorgan

1:18 pm on Aug 5, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There is also an error in your robots.txt syntax. Although it is unlikely to affect "advanced" robots like Googlebot, it may affect less-sophisticated robots from other companies.

The syntax should be:
User-agent: Slurp
Crawl-delay: 2
Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
<one blank line>
User-agent: *
Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
<one blank line>

The 'format' of robots.txt is *not* free-form, and even the blank lines have significance to some robots.

Also, as you can see, the most-specific policy records should go first and the most-general one last, and you should assume that the policy records are mutually-exclusive per-robot.

What enigma1 posted above is very important to understand: robots.txt says, "Please do not fetch URL-paths beginning with this string from this host." Note that word "fetch" -- the robots.txt protocol says nothing about *listing* URLs in search results.

On the other hand, we have the on-page HTML directive <meta name="robots" content="noindex"> which specifically says, "Please don't include this page in your search index."

Now note that if you want to use that <meta name="robots" content="noindex"> directive, then the page with that directive on it *must not* be disallowed in robots.txt -- The 'good' robots.txt-compliant robots must be allowed to fetch the page in order to 'see' the <meta name="robots" content="noindex"> (!)

Jim
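The corrected file can be sanity-checked offline with Python's standard-library robots.txt parser. A sketch; example.com is a placeholder host and the paths are abbreviated from the thread:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Slurp
Crawl-delay: 2
Disallow: /cgi-bin/
Disallow: /bot-trap/

User-agent: *
Disallow: /cgi-bin/
Disallow: /bot-trap/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Slurp", "http://www.example.com/bot-trap/"))  # False
print(rp.can_fetch("*", "http://www.example.com/index.html"))     # True
```

This only tests what a compliant parser would do, which matches Jim's point: it says nothing about what a non-compliant bot, or an indexer that found the URL elsewhere, will actually do.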

liam_85

9:16 am on Aug 6, 2010 (gmt 0)

5+ Year Member



Thank you for your replies,

The issue is that as soon as the page is accessed, the agent is blacklisted. I have no meta information on the page (title, description, keywords, etc.), so it would be unlikely to rank for anything.

I have amended my robots.txt now. If you have any other ideas, they will be most welcome. I will continue to check and reference the blacklist daily for any critical search engines/users/hosts.

wilderness

3:43 pm on Aug 6, 2010 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Webmasters use a variety and combination of methods to restrict the visits of bots:
1) white-listing (UA includes "http")
2) black-listing (UA includes known abusive keywords)
3) scripts and traps

Many bots have converted from using the bot software's name to using standard (although in some instances flawed) browser user-agent strings.

Most that participate here have learned to be reserved in providing htaccess lines, because the bots and harvesters also monitor these forums and simply use our own materials against us.
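As a rough illustration of methods 1 and 2 above (a sketch only; the keyword list is an example, not a recommended production list), the idea is that legitimate crawlers usually embed an info URL in their UA, while known-abusive tools carry telltale substrings:

```python
# Black-list keywords -- examples only, not a vetted list
BAD_KEYWORDS = ("libwww", "harvest", "extractor")

def classify_ua(ua):
    """Crude triage of a user-agent string by the two list-based methods."""
    ua_lower = ua.lower()
    if any(k in ua_lower for k in BAD_KEYWORDS):
        return "blacklisted"
    if "http" in ua_lower:           # self-identifying crawler with info URL
        return "declared-bot"
    return "browser-or-unknown"
```

This is only a first pass; as noted throughout the thread, declared bots still need DNS verification, and browser-like UAs may be faked.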
 
