The big news is that they claim they're invalidating all the old Slurp IP addresses, so anyone who validates by IP instead of reverse DNS-based identification [ysearchblog.com] of Slurp is about to be in a world of hurt until the new IPs are known.
They claim that Slurp 3.0 will recognize the old Slurp information, which means the robots.txt file should be OK, but those of you who use very narrow rewrite rules might need to update. Additionally, reverse DNS validation against the crawl.yahoo.net domain will continue to function properly for the new, smaller set of IPs.
Many sites that didn't heed the call to use rDNS validation for the major SEs will start bouncing Slurp, so this will be ugly.
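For anyone who hasn't set up rDNS validation yet, here is a minimal Python sketch of the idea: reverse-resolve the visiting IP, check that the hostname sits under crawl.yahoo.net, then forward-resolve that hostname and confirm it maps back to the same IP. The function name `is_verified_slurp` and the injectable resolver parameters are my own illustration, not anything Yahoo publishes, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumption: Slurp hosts resolve under this domain, per the blog post cited in the thread.
CRAWLER_SUFFIX = ".crawl.yahoo.net"


def is_verified_slurp(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes forward-confirmed reverse DNS for crawl.yahoo.net."""
    # The resolvers are injectable (hypothetical parameters) so the logic can be
    # exercised without touching the network.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure

    # The leading dot in the suffix stops names like "evilcrawl.yahoo.net" from passing.
    if not host.endswith(CRAWLER_SUFFIX):
        return False

    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling `is_verified_slurp(remote_ip)` with the default resolvers does two live DNS lookups per call, so in practice you would probably cache the result per IP rather than run it on every request.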
Enough is enough!
Per their own NEW press release.
#SetEnvIf User-Agent "Slurp/3.0" keep_out
SetEnvIf User-Agent "Slurp/1.0" keep_out
SetEnvIf User-Agent "Slurp/2.0" keep_out
SetEnvIf User-Agent "email@example.com" keep_out
SetEnvIf User-Agent "Yahoo! Slurp;" keep_out
In addition I have some very old references to the following (have no idea when they were last used):
Nor have I kept up to date on the following, which are contained in my robots.txt:
Yahoo! DE Slurp
Yahoo! Slurp China
So is this the Range 188.8.131.52/16?
If so, this is what's coming to my sites from that range:
Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
Slurp/3.0 was first seen around 2007-11-19 and got caught in the trap, trying...
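If you want to sort your logs by which Slurp flavor is visiting, a rough Python sketch along these lines might help. The regex and the `parse_slurp` helper are my own guess at the pattern based on the UA strings and robots.txt names mentioned in this thread, so treat them as an assumption rather than a documented format. Since the UA is trivially spoofed, this is only for classifying log lines, not for validating the bot.

```python
import re

# Assumed pattern: "Yahoo! [REGION ]Slurp[ Region][/version]", inferred from the
# names seen in this thread (Yahoo! Slurp, Yahoo! DE Slurp, Yahoo! Slurp China,
# Yahoo! Slurp/3.0).
SLURP_RE = re.compile(
    r"Yahoo! (?:(?P<prefix>[A-Z]{2}) )?Slurp"
    r"(?: (?P<suffix>[A-Za-z]+))?"
    r"(?:/(?P<version>\d+\.\d+))?"
)


def parse_slurp(user_agent):
    """Return {'region': ..., 'version': ...} for a Slurp UA, or None if not Slurp."""
    match = SLURP_RE.search(user_agent)
    if match is None:
        return None
    region = match.group("prefix") or match.group("suffix")
    return {"region": region, "version": match.group("version")}


if __name__ == "__main__":
    # UA prefixes only; the thread shows them truncated after the semicolon.
    samples = [
        "Mozilla/5.0 (compatible; Yahoo! Slurp;",
        "Mozilla/5.0 (compatible; Yahoo! DE Slurp;",
        "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0;",
    ]
    for ua in samples:
        print(ua, "->", parse_slurp(ua))
```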
|So is this the Range 184.108.40.206/16? |
According to their blog post:
|The crawlers will start crawling from a different and much smaller set of IP addresses, but it'll still be from the crawl.yahoo.net domain. |
So I'm not sure if that means they're switching to a completely new set of IPs or just dropping a large segment of their existing IPs, but it does say "different" so the jury is still out on what that means until we can verify it.
From what I've seen, Slurp is still crawling with the new and old UAs. They do seem to be using different approaches, but I haven't figured out what the intention is yet. It's funny watching them both grab pages at the same time, though ;)
Still, these are our sites they're unleashing themselves on. Would be nice to be told what to expect, eh?
|Still, these are our sites they're unleashing themselves on. Would be nice to be told what to expect, eh? |
They couldn't care less what webmasters desire, at least those few who are aware of their activity.
The bots and their Dr. Frankensteins have simply grown accustomed to crawling as they please, with as many different bots running simultaneously as they like.
Unfortunately, even if every participant here banded together in a joint denial, it wouldn't slow down the crawling of the bots in any manner, nor even make them blink and wonder...
These guys are the new Microsoft. On one of my sites, they are the second biggest spider by volume each month. In March they downloaded approximately 37K pages and were responsible for about 647 referrals. I am strongly considering banning them.
So far today, Slurp/3.0 has crawled only 46 of the 21K total pages Slurped.
Googlebot came in 3rd with only 8K pages and msnbot claimed 2nd with 11K pages, making Slurp the biggest crawler, and it's been this way for many weeks now.
The bot that crawls the least sends the most traffic, the irony.
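To put rough numbers on that, here is a quick Python sketch using the rounded figures quoted above. It assumes the three bots named are the only ones that matter for the comparison, which ignores every other crawler, so the percentages are shares of the big three rather than of all bot traffic.

```python
# Rounded page counts as quoted in the thread ("K" taken as thousands).
pages_today = {"Slurp": 21_000, "msnbot": 11_000, "Googlebot": 8_000}
slurp3_pages = 46  # pages fetched by the Slurp/3.0 UA specifically


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    total = sum(pages_today.values())
    # Largest crawler first.
    for bot, pages in sorted(pages_today.items(), key=lambda item: -item[1]):
        print(f"{bot}: {share(pages, total):.1f}% of pages crawled by these three")
    slurp3_share = share(slurp3_pages, pages_today["Slurp"])
    print(f"Slurp/3.0: {slurp3_share:.2f}% of all Slurped pages")
```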
Thanks for the info. I've seen Slurp coming in from the following new Class Cs today:
As well as a number of Slurp visits from older ranges.
Since everyone is looking for a list of IPs, the following is a list of IPs from which I have seen Slurp/3.0 coming since last November.
I have noticed a thing or two about the newest Slurp/3.0 vs. the other versions: (1) the new version downloads all the style sheets over again, but does supply the referrer for each style sheet; (2) it comes through a proxy server, which I see in the "Via" header supplied.
I am still digging for more details and doing comparisons to see if there is anything else worth noting here.