Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Inktomi User-Agent blocked
Inktomi blocked
cyberdyne




msg:3595654
 4:04 pm on Mar 9, 2008 (gmt 0)
Hi,
I found the following IP 'denied by server' in my error logs today: 74.6.8.102
It resolves to Inktomi, which I believe is Yahoo.

I cannot find any reference to Yahoo, Slurp or Inktomi in my .htaccess file's User-Agent blocks (only in the 'permits'), nor any 'deny from' line for either 74.6.8.102 or its CIDR range, 74.6.0.0/16.

Can anyone suggest another way I might have inadvertently blocked this IP, please? I clearly do not wish to block it.

Thank you.

 

cyberdyne




msg:3598338
 11:49 am on Mar 12, 2008 (gmt 0)

Still need help with this if possible please.

Regarding the above, when using 'Order Deny,Allow', can I put an 'allow from' line above my 'deny from' lines in order to permit the above IP addresses and overrule whatever block I have inadvertently set up? E.g.:

Order Deny,Allow
allow from 74.6.0.0/16 "#Inktomi - Yahoo"

deny from 111.222.333.444
deny from 211.222.333.444
deny from 311.222.333.444

Thank you.

cyberdyne




msg:3598345
 11:57 am on Mar 12, 2008 (gmt 0)

Also, does Inktomi crawl with the identity of 'yahoo-blogs/v3.9' or 'yahoo-mmcrawler', by any chance?

Thanks

jdMorgan




msg:3598467
 1:59 pm on Mar 12, 2008 (gmt 0)

> can I use an 'allow from' line above my 'deny from' lines in order to permit the above IP addresses and over-rule any block

The order of your "Allow from" directives relative to your "Deny from" directives in the file does not matter; they are processed in groups, in the sequence specified by the "Order" directive. That is, with "Order Deny,Allow" as in your code above, all "Deny from" directives are evaluated first, then all "Allow from" directives. A request that matches a "Deny from" is refused unless it also matches an "Allow from"; a request that matches neither is allowed.
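To illustrate the point about ordering, here is a minimal sketch. It assumes Apache 2.2-style access control (mod_authz_host; on Apache 2.4 the same directives require mod_access_compat), and the denied addresses are documentation placeholders rather than real offenders.

```apache
# Deny directives are evaluated first, then Allow directives,
# regardless of where each line appears in the file.
Order Deny,Allow
# Inktomi - Yahoo (comment kept on its own line)
Allow from 74.6.0.0/16
# Placeholder addresses from the RFC 5737 documentation ranges
Deny from 192.0.2.10
Deny from 198.51.100.0/24
```

With these lines, a request from 74.6.8.102 is let through even if a Deny also happened to match it, because the Allow group is evaluated last.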

74.6.0.0/16 is a valid Inktomi/Yahoo IP address range, but I can't answer about the yahoo-blogs or yahoo-mmcrawler user-agents; these are Disallowed in robots.txt on my sites because there's no blog or media content I'd want indexed out of context. As such, all I can say is that the address range is valid for Yahoo.

Note that it's a very good idea to 'override' your access control to unconditionally allow all user-agents (even 'bad' ones) to access your custom 403 error page (if you use one) and your robots.txt file. If access to your custom 403 error page is denied, then any attempt to access your site by a Deny'ed (unwelcome) user-agent will basically put your server into a 403-Forbidden loop. The server responds to the denied attempt by trying to serve the custom 403 error page, but access to that page is also denied. So, it tries to serve the custom 403 error page, but access to that page is denied... You get the picture. (Failure to prevent this problem can be thought of as a low-impact-but-still-unpleasant denial-of-service mechanism -- provided by the Webmaster!)

If access to the robots.txt file is denied, some robots (although not the major ones) will take that as carte blanche to spider your entire site. Although they likely won't be successful (because of your "Deny"s), they will waste a lot of bandwidth and make a mess of your log files and stats.

You can provide for these functions using mod_setenvif:

ErrorDocument 403 /403error.html
#
# ...(Other directives)
#
SetEnvIf Request_URI "/(403error\.html|robots\.txt)$" allowit
#
Order Deny,Allow
#
# ...(Other Allows and Denys)
#
Allow from env=allowit

Note also that comments should be placed on separate lines, as shown, to prevent Apache from generating warnings. These warnings, even if not logged due to LogLevel settings, will still waste processing time.

Jim

cyberdyne




msg:3598495
 2:33 pm on Mar 12, 2008 (gmt 0)

Thank you very much Jim,
I should have noted that I do in fact have a line permitting my bad-bot files. I presume it is sufficient:

RewriteRule (robots\.txt|block\.html|403\.shtml)$ - [L]

Having read through your post, would I be correct in deducing that I can use something like the following to permit the currently blocked Inktomi IP range?

Options +FollowSymlinks All -Indexes
ErrorDocument 403 /403.shtml
#
RewriteRule (robots\.txt|block\.html|403\.shtml)$ - [L]
#
Order Deny,Allow
deny from 111.222.255.255
deny from 211.222.255.255
#
Order Allow,Deny
#Inktomi - Yahoo"
74.6.0.0/16

Thank you as always.
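As written, the proposed block above would probably not behave as intended: a second Order directive in the same file replaces the first rather than starting a new section, and a bare address range is not a directive without "Allow from" in front of it. A hedged sketch of a single-section version follows, assuming Apache 2.2 syntax, the same placeholder addresses, and the mod_setenvif approach from Jim's post; the variable name allowit is borrowed from that example.

```apache
# Mixing signed and unsigned options ("+FollowSymlinks All -Indexes")
# is not valid syntax; give every option a sign.
Options +FollowSymLinks -Indexes
ErrorDocument 403 /403.shtml
#
# Flag requests for the error pages and robots.txt
SetEnvIf Request_URI "/(robots\.txt|block\.html|403\.shtml)$" allowit
#
# One Order directive only; a later one would replace it
Order Deny,Allow
Deny from 111.222.255.255
Deny from 211.222.255.255
# Inktomi - Yahoo
Allow from 74.6.0.0/16
# Always let the flagged files through
Allow from env=allowit
```

The RewriteRule pass-through is left out of this sketch because it only stops later mod_rewrite rules from running; it does not exempt a request from "Deny from" lines, which are handled by a separate module.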

cyberdyne




msg:3600359
 8:36 am on Mar 14, 2008 (gmt 0)

Help! Please,

I'm still, regrettably, blocking 'User-agent: Yahoo! Slurp' and Inktomi IPs, and I have no idea how.

My error log reads:
[Fri Mar 14 06:02:42 2008] [error] [client 74.6.28.28] Directory index forbidden by Options directive: /home/mylogin/html/

The corresponding raw log entry reads:
lj511178.crawl.yahoo.net - - [14/Mar/2008:06:02:42 +0000] "GET /html/ HTTP/1.0" 403 671 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

BUT...
I also have an entry:
[Fri Mar 14 03:34:04 2008] [error] [client 209.191.123.33] File does not exist: /home/mylogin/html/file.html"
That IP also belongs to Yahoo.

This leads me to assume that it is the IP that's being blocked, and not any combination of the User-agent strings.

I've checked and double-checked my .htaccess and cannot find either 74.6.28.28 or 74.6.0.0/16.

I have removed all references to 'Yahoo!', 'Slurp' and 'Mozilla' from my Disallows and ensured that 'Yahoo!' and 'Slurp' are in the allow section.

Does anyone have any suggestions as to what else might be blocking them?

Thank you in advance for any advice.
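The first error-log line quoted above is worth a second look. "Directory index forbidden by Options directive" is the message Apache logs when a directory URL is requested, no index file is found in that directory, and "Options -Indexes" prevents an automatic listing. That 403 would be returned to any visitor requesting /html/, not only to Yahoo, so it may not indicate an IP or user-agent block at all. A minimal sketch of the relevant settings, assuming the /html/ directory is meant to be public and that an index file exists or will be created there (both are assumptions, as are the file names):

```apache
# Keep automatic directory listings switched off
Options -Indexes
# Name the file(s) to serve when a directory URL such as /html/ is requested
DirectoryIndex index.html index.shtml
```

If /html/ is not meant to be reachable, the 403 is arguably the correct response and nothing needs unblocking.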

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved