Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
How to Identify bad bots from legitimate users
Need help on hints and tips that give away a bad bot
liam_85

Msg#: 4181071 posted 11:09 am on Aug 3, 2010 (gmt 0)

Hello All,

I have been reading this forum, but I am still having a hard time figuring out from the access logs which bots are bad.

219.91.141.244 - - [31/Jul/2010:05:08:18 -0700] "PROPFIND /halfstar.gif HTTP/1.1" 500 - "-" "-"

This IP accessed the site many times within a few minutes; an nslookup reveals:


>nslookup 219.91.141.244
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: 244-141-91-219.static.youtele.com
Address: 219.91.141.244


I also checked another IP identifying itself as yahoo slurp:



67.195.111.185 - - [31/Jul/2010:05:22:13 -0700] "GET /example.html HTTP/1.0" 500 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"


>nslookup 67.195.111.185
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: b3091326.crawl.yahoo.net
Address: 67.195.111.185

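As an aside on that first log line: PROPFIND is a WebDAV method, and on a site that serves no WebDAV content, requests using it are a fairly reliable bad-bot giveaway. A minimal mod_rewrite sketch to refuse methods the site never serves (assumes Apache with mod_rewrite enabled; adjust the allowed list to your own needs):

```apache
# Refuse WebDAV and other non-standard request methods.
# GET, HEAD and POST remain allowed; everything else gets a 403.
RewriteEngine On
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|POST)$
RewriteRule .* - [F]
```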

 

wilderness

Msg#: 4181071 posted 2:05 pm on Aug 3, 2010 (gmt 0)

219.91.141.244 - - [31/Jul/2010:05:08:18 -0700] "PROPFIND /halfstar.gif HTTP/1.1" 500 - "-" "-"

67.195.111.185 - - [31/Jul/2010:05:22:13 -0700] "GET /example.html HTTP/1.0" 500 - "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

I have been reading this forum, but I am still having a hard time figuring out from the access logs which bots are bad.


You've bigger issues than bot identification.

Note the 500 following HTTP/1.0" or HTTP/1.1" in the two lines above?

Either your server is not functioning at all, or it is caught in a request loop.
(Please note: either issue could be due to a syntax error you may have previously introduced when modifying your .htaccess.)

jdMorgan

Msg#: 4181071 posted 2:37 pm on Aug 3, 2010 (gmt 0)

Check your server error log when you get a 500-Server Error -- It will often tell you exactly what is wrong.

The number one most common problem when blocking unwelcome visitors is a failure to "allow" a custom 403 error page to be served. The server detects the unwanted access, invokes the custom 403 error page, and finds that access to that error page is also denied, so it re-invokes 403 error handling, gets another error, and thus ends up in a loop. Eventually, the server gives up and throws a 500-Server Error.

So, if you are currently blocking these requests, look to your access-control code; it needs a tweak.
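To illustrate the loop Jim describes, one common shape for the fix is to exempt the error page itself before any blocking rules run. A sketch, assuming Apache mod_rewrite (the filename error403.html is hypothetical; use whatever your ErrorDocument actually points at):

```apache
ErrorDocument 403 /error403.html

RewriteEngine On
# Let everyone fetch the custom 403 page itself; if this page
# is also denied, the server re-invokes the 403 handler, loops,
# and eventually gives up with a 500-Server Error.
RewriteRule ^error403\.html$ - [L]
# ... blocking rules go below this point ...
```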

Jim

liam_85

Msg#: 4181071 posted 2:55 pm on Aug 3, 2010 (gmt 0)

I was using:


RewriteCond %{HTTP_USER_AGENT} ^bot* [NC]
RewriteRule .* - [F,L]


To stop certain bots. I have since commented it out and put in a bot trap to try to save bandwidth.
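It may also be worth noting that the pattern `^bot*` probably doesn't do what was intended: the `*` applies only to the preceding `t`, so it matches any user-agent beginning with "bo" ("bo", "bot", "bott"...), and most bad-bot user-agents don't begin with "bot" anyway. A sketch of the more conventional form, matching substrings anywhere in the UA (the names listed are placeholders, not a recommended blocklist):

```apache
# Match known-bad substrings anywhere in the user-agent string,
# case-insensitively. "badbot" and "examplecrawler" are placeholders.
RewriteCond %{HTTP_USER_AGENT} (badbot|examplecrawler) [NC]
RewriteRule .* - [F]
```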

liam_85

Msg#: 4181071 posted 10:39 am on Aug 4, 2010 (gmt 0)

Okay I have set up the bot trap and this morning there were 2 entries:

66.249.65.236 - - [2010-08-03 (Tue) 09:37:55] "GET /bot-trap/ HTTP/1.1" Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

67.195.111.185 - - [2010-08-03 (Tue) 09:51:24] "GET /bot-trap/ HTTP/1.0" Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)


I followed the steps published by Google to try to confirm whether it was a legitimate Googlebot here [google.com]

I did nslookup on the ip and then a lookup again on the returned result and I got a non authoritative answer.


Microsoft Windows [Version 6.0.6001]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.

>nslookup 66.249.65.236
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Name: crawl-66-249-65-236.googlebot.com
Address: 66.249.65.236


>nslookup crawl-66-249-65-236.googlebot.com
Server: host254.isg.ll.opaltelecom.net
Address: 62.24.243.1

Non-authoritative answer:
Name: crawl-66-249-65-236.googlebot.com
Address: 66.249.65.236



Does this mean it is a genuine google crawler?

enigma1

Msg#: 4181071 posted 12:13 pm on Aug 4, 2010 (gmt 0)

Yes, that IP is legit for Googlebot.
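The check the thread does by hand (reverse lookup of the IP, then forward lookup of the returned name to confirm it resolves back to the same IP) can be scripted. A minimal Python sketch; the resolver calls are injectable parameters so the logic can be tested offline, and the domain suffixes shown are examples, not a complete list:

```python
import socket

def is_verified_crawler(ip, allowed_suffixes,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname):
    """Forward-confirmed reverse DNS: the PTR name must end in an
    expected crawler domain, and resolving that name forward must
    return the original IP. Otherwise the UA/PTR may be spoofed."""
    try:
        host = reverse(ip)[0]          # e.g. crawl-66-249-65-236.googlebot.com
    except OSError:
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False                   # PTR points somewhere unexpected
    try:
        return forward(host) == ip     # forward lookup must round-trip
    except OSError:
        return False

# Example usage (requires network access):
# is_verified_crawler("66.249.65.236", (".googlebot.com",))
```

Anyone can set a PTR record claiming to be `googlebot.com`; only the forward-lookup round trip proves the name is really under Google's control.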

liam_85

Msg#: 4181071 posted 9:34 am on Aug 5, 2010 (gmt 0)

I wonder why it didn't follow robots.txt then? That is very bizarre.

User-agent: *

Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
User-agent: Slurp
Crawl-delay: 2


Any ideas?

enigma1

Msg#: 4181071 posted 12:47 pm on Aug 5, 2010 (gmt 0)

Ref:
[google.com...]

While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web


Robots.txt is more a set of "guidelines"; I never rely on it to restrict bot access.

jdMorgan

Msg#: 4181071 posted 1:18 pm on Aug 5, 2010 (gmt 0)

There is also an error in your robots.txt syntax. Although it is unlikely to affect "advanced" robots like Googlebot, it may affect less-sophisticated robots from other companies.

The syntax should be:
User-agent: Slurp
Crawl-delay: 2
Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
<one blank line>
User-agent: *
Disallow: /cgi-bin/
Disallow: /error_log
Disallow: /error.php
Disallow: /bot-trap/
<one blank line>

The 'format' of robots.txt is *not* free-form, and even the blank lines have significance to some robots.

Also, as you can see, the most-specific policy records should go first and the most-general one last, and you should assume that the policy records are mutually-exclusive per-robot.

What enigma1 posted above is very important to understand: robots.txt says, "Please do not fetch URL-paths beginning with this string from this host." Note that word "fetch" -- the robots.txt protocol says nothing about *listing* URLs in search results.

On the other hand, we have the on-page HTML directive <meta name="robots" content="noindex"> which specifically says, "Please don't include this page in your search index."

Now note that if you want to use that <meta name="robots" content="noindex"> directive, then the page with that directive on it *must not* be disallowed in robots.txt -- The 'good' robots.txt-compliant robots must be allowed to fetch the page in order to 'see' the <meta name="robots" content="noindex"> (!)

Jim

liam_85

Msg#: 4181071 posted 9:16 am on Aug 6, 2010 (gmt 0)

Thank you for your replies,

The issue is that as soon as the page is accessed, the agent is blacklisted. I have no meta information on the page: no title, description, keywords, etc. The page would be unlikely to rank for anything.

I have amended my robots.txt now. If you have any other ideas, they will be most welcome. I will continue to check and reference the blacklist daily for any critical search engines/users/hosts.

wilderness

Msg#: 4181071 posted 3:43 pm on Aug 6, 2010 (gmt 0)

Webmasters use a variety and combination of methods to restrict the visits of bots.
1) white-listing (UA includes Http)
2) black-listing (UA includes known abusive keywords)
3) scripts and traps

Many bots have converted from using the bot software's own name to using standard (although in some instances flawed) browser user-agent names.

Most that participate here have learned to be reserved in providing htaccess lines, because the bots and harvesters also monitor these forums and simply use our own materials against us.
