Yahoo-VerticalCrawler appears to have ignored robots.txt

         

isitreal

2:24 pm on Apr 26, 2004 (gmt 0)


I've been running a spider trap on a site for a while now, with this robots.txt:

User-agent: *
Disallow: /stats/stop_spider.htm

where stop_spider.htm is the spider trap page, but this morning I got this notification:

IP address: 66.77.73.32
Navigator user agent: Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com;
referrer: [alltheweb.com...]

Assuming this is a real Yahoo crawler and not a spoof, this seems to indicate that Yahoo didn't pay attention to the robots.txt file, which has been up for over a month now.
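One way to test the "real crawler or spoof" question is a forward-confirmed reverse DNS lookup on the logged IP: resolve the IP to a hostname, then resolve that hostname back and check that it returns the same IP. The Python below is only a rough sketch of that idea. The function name and the injectable `reverse`/`forward` parameters are my own, and it says nothing about which hostnames Yahoo actually uses; that still has to be judged by looking at the returned name.

```python
import socket


def forward_confirmed_host(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the reverse-DNS hostname for ip if it resolves back to ip.

    Returns None when either lookup fails or the forward lookup does not
    include the original address (a sign the PTR record may be bogus).
    The reverse/forward parameters exist only so the lookups can be
    swapped out for testing; by default the real resolver is used.
    """
    try:
        hostname = reverse(ip)[0]          # PTR lookup: IP -> hostname
        addresses = forward(hostname)[2]   # A lookup: hostname -> IPs
    except OSError:                        # covers socket.herror / gaierror
        return None
    return hostname if ip in addresses else None
```

Calling `forward_confirmed_host("66.77.73.32")` on a machine with working DNS would give either a hostname to inspect or `None`. A user-agent string alone proves nothing, since anyone can send it.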

Last week the Yahoo crawler spidered the whole site without triggering the spider trap.
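To rule out a mistake in the robots.txt itself, the two lines quoted above can be fed to a standard parser to see how a compliant crawler should read them. This is a minimal sketch using Python's `urllib.robotparser`. The `is_allowed` helper is just for illustration, and it shows what a well-behaved parser does, not what Yahoo's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules exactly as quoted in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/stop_spider.htm
"""


def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler may fetch path under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


UA = "Yahoo-VerticalCrawler-FormerWebCrawler/3.9"
print(is_allowed(UA, "/stats/stop_spider.htm"))  # the trap page
print(is_allowed(UA, "/main/page5.htm"))         # an ordinary page
```

If the first call reports the trap page as disallowed and the second reports the ordinary page as allowed, the rule is written correctly and the question is whether the crawler fetched and honored it.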

The only thing I can think of that might have gone wrong is that I just switched the site to Apache mod_rewrite to convert to search-engine-friendly URLs,

mysite.com/index.htm?section=main&page=5

is now mysite.com/main/page5.htm

through this .htaccess:

RewriteRule ^(.*)/overview\.htm /index.htm?section=$1&page=0 [NC,L]
RewriteRule ^(.*)/page(.*)\.htm /index.htm?section=$1&page=$2 [NC,L]
RewriteRule ^(.*)/$ /index.htm?section=$1 [NC]

but as far as I can see this shouldn't make a difference. Any ideas? Has anyone else seen this, or is this a fake Yahoo spider?
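To check whether the rewrite rules could touch the trap URL, the three patterns can be tried against a few paths outside Apache. The Python below is a simplified simulation, not mod_rewrite itself. It assumes the rules live in an .htaccess at the site root (so the leading slash is stripped before matching), treats [NC] as case-insensitive matching, and stops at the first rule that matches. The `rewrite` helper is hypothetical.

```python
import re

# The three RewriteRule patterns from the .htaccess above, with the
# substitutions written as Python templates (\1 instead of $1).
RULES = [
    (r"^(.*)/overview\.htm", r"/index.htm?section=\1&page=0"),
    (r"^(.*)/page(.*)\.htm", r"/index.htm?section=\1&page=\2"),
    (r"^(.*)/$", r"/index.htm?section=\1"),
]


def rewrite(path):
    """Return the rewritten URL for path, or None if no rule matches."""
    relative = path.lstrip("/")  # per-directory rules see no leading slash
    for pattern, target in RULES:
        match = re.search(pattern, relative, re.IGNORECASE)  # [NC]
        if match:
            return match.expand(target)
    return None


for path in ("/main/page5.htm", "/main/overview.htm", "/stats/stop_spider.htm"):
    print(path, "->", rewrite(path))
```

Under these assumptions the trap page matches none of the three patterns, so it is served at its original URL and the Disallow line still points at the right path.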