|Bye Bye Yahoo Slurp, Hello crawl.yahoo.net|
| 4:21 pm on Jun 5, 2007 (gmt 0)|
|As of today, the transition is complete and all machines crawling as Slurp are now in crawl.yahoo.net. You can see this change in your web server logs, where the page accesses from inktomisearch.com are being fully replaced by crawl.yahoo.net contacts. Note that this does not cover other Yahoo! crawlers, such as Yahoo! China, and other verticals, like Yahoo! Shopping, Yahoo! Travel, etc., which have their own user-agent. |
Don't fret though; there is no need to change your robots.txt file because the crawler user-agent is still Yahoo! Slurp. If you use IP-based filtering, there is no need to change that either, since the IP addresses from which we crawl remain the same. However, please ensure that your network or firewall setup does not keep crawl.yahoo.net out, as otherwise we won't be able to include your content in our results.
With this transition complete, we also encourage you to set up reverse DNS-based authentication of our crawler to ensure that no rogue bots masquerading as 'Slurp' visit your site.
Yahoo Slurp now crawl.yahoo.net [ysearchblog.com]
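The reverse DNS-based authentication mentioned in the announcement can be sketched roughly as follows. This is only an illustration, not Yahoo's published procedure: the function name, the injectable lookup parameters, and the exact suffix check are assumptions made for the example. The idea is to resolve the visiting IP to a hostname, check that the hostname sits under crawl.yahoo.net, then resolve that hostname forward and confirm it maps back to the same IP.

```python
import socket


def is_yahoo_crawler(ip,
                     reverse_lookup=socket.gethostbyaddr,
                     forward_lookup=socket.gethostbyname_ex,
                     suffix=".crawl.yahoo.net"):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The lookup functions are parameters (an assumption of this sketch)
    so the logic can be exercised without touching the network.
    """
    try:
        # Step 1: reverse DNS, IP address -> hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname must end with the crawler domain. Matching on
    # the leading dot rejects names like "crawl.yahoo.net.evil.example".
    if not hostname.lower().rstrip(".").endswith(suffix):
        return False

    try:
        # Step 3: forward DNS, hostname -> list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # Step 4: the original IP must be among the forward results.
    return ip in addresses
```

In practice you would only run this for requests whose user-agent claims to be Slurp, and cache the result per IP, since two DNS lookups on every hit would be expensive.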
| 5:49 pm on Jun 5, 2007 (gmt 0)|
This is about 18 months after the rest of the crawler world updated their DNS, but they still deserve a pat on the back for finally getting it done.
Now what about Yahoo's crawlers using a common CACHE server?
Why do we need to allow an army of Yahoo spiders to redundantly abuse our servers?
Is it a conceptual problem that Yahoo can't share pages already downloaded?
When I posed that question to one of their engineers I was given a lame excuse that the various crawlers had different needs.
OK, what could one crawler need that's different when you download a page?
The images? The CSS? Well, you certainly don't need to download the page AGAIN just to get those items, and you can cache anything else downloaded and share it as well; it's not rocket science. If the age of the cached page is the issue, download it again, just to the CACHE server for all to share.
Funny, Google managed to make some of their crawlers share CACHE, so we know it can be done.
FWIW, the only thing worse than Yahoo's army of crawlers is the ton of Nutch bots out there.
[edited by: incrediBILL at 5:51 pm (utc) on June 5, 2007]
| 7:21 am on Jun 6, 2007 (gmt 0)|
About time! Although I wish the multiple crawlers would share a cache too.
phpBB3 boards rely on labeling bots to remove session IDs from the URIs when they visit. I just looked in the "who's online" section of a buddy's forum, and of the 29 users online over the past 30 minutes, 4 are people and 25 are bots.
The board owner has all bots share one account, so the who's online section reads... Users online : googlebot, googlebot, googlebot etc.
| 11:02 pm on Jun 6, 2007 (gmt 0)|
Actually, from my perspective, Yahoo's change and declaration of a consistent method for identifying their bots is a small evolutionary improvement above that of Google.
Yahoo is representing that webmasters can check to see if any bot reflecting the useragent of "Slurp" is actually from Yahoo by doing the reverse DNS lookup. As we all know, there's no difficulty for nefarious dataminers in changing their useragent strings to masquerade as a search engine bot -- so that just leaves one with being able to identify via IP address.
Google doesn't offer this sort of standard identification method, at least according to their help materials:
According to that, you cannot necessarily know for sure that something pretending to be Googlebot is actually Googlebot, even if it appears to be coming from an IP address block that shows non-Google ownership info. Everyone's been nervous about that, since folks have supposed that Google might be visiting sites from IP addresses that they could've purchased through a proxy, so as to hide whether the addresses are owned by Google or not.
So, I think Yahoo's move is actually a bit more advanced than Google's squirrely refusal to commit to a particular method of IDing the IP/Domain.
As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons:
- Some sites perform content negotiation, delivering up their pages in multiple different languages, depending on the browser's Accept-Language request setting. So, a search engine might have to make multiple requests for every page in order to see if it needed to index multiple versions for people in different countries. If a site only has one version/language per page URL, then this could seem unnecessary.
- Accept headers can also specify the desired content format, so a server might deliver content in one format for a wireless device's request and in another for a PC's request. I think it's possible that the mobile Googlebot may request the same pages which were already requested by Googlebot, for instance - so, it wouldn't be just Yahoo that does this, but Google as well.
So, I think it's hard for the search engines to know for sure that a site won't deliver up a few different versions of a page at a particular URL without asking for that page in a few different ways.
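To make the content negotiation point concrete, here is a minimal sketch of how a shared crawler cache could key stored pages. It is purely illustrative and not how Yahoo or Google actually work: the `cache_key` function and the example URL are hypothetical. It relies on the standard HTTP `Vary` response header, which names the request headers the server used to pick a variant.

```python
def cache_key(url, request_headers, vary_header):
    """Build a cache key from the URL plus the request headers that the
    server listed in its Vary response header.

    Two crawlers can share a cached copy only when their keys match.
    """
    # Header names in Vary are comma-separated and case-insensitive.
    varied = sorted(
        name.strip().lower()
        for name in vary_header.split(",")
        if name.strip()
    )
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # Missing headers are recorded as an empty string.
    return (url, tuple((name, lowered.get(name, "")) for name in varied))


# A site that negotiates on language yields a distinct key per language...
english = cache_key("http://example.com/", {"Accept-Language": "en"}, "Accept-Language")
german = cache_key("http://example.com/", {"Accept-Language": "de"}, "Accept-Language")
print(english == german)

# ...while a site with one version per URL yields a single shareable key.
plain_a = cache_key("http://example.com/", {"Accept-Language": "en"}, "")
plain_b = cache_key("http://example.com/", {"Accept-Language": "de"}, "")
print(plain_a == plain_b)
```

The catch is that a crawler only learns what a site varies on after it has asked at least once, and not every server sends an accurate Vary header, which is consistent with the point that some repeat requests are hard to avoid.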
| 7:33 am on Jun 7, 2007 (gmt 0)|
|Google doesn't offer this sort of standard identification method, at least according to their help materials |
That would be incorrect. Google did this first back in September, and it's documented here:
| 7:36 am on Jun 7, 2007 (gmt 0)|
|As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons: |
I've got two more:
1. Poor organizational structure that led each little group at Yahoo to have its own little fiefdom and individual crawlers.
2. Not caring that their crawlers don't all honor robots.txt properly, which makes webmasters upset.
| 12:30 pm on Jun 7, 2007 (gmt 0)|
Ah - interesting that they're now suggesting similar authentication in the blog.
Though, you can clearly see that I'm still technically correct with my statement: Google's Webmaster help section has not been updated to include that methodology and it still says to just use the user-agent.
I think they need to update the help materials to make it really feel like a commitment to that authentication method...
| 8:48 pm on Jun 9, 2007 (gmt 0)|
Here are my logs regarding Yahoo at this moment of the day. Can you please help me? I already banned about 60 IP addresses like those above. Can you please tell me what to do to limit the number of Yahoo crawlers on my website?
| 9:27 pm on Jun 9, 2007 (gmt 0)|
|I already banned about 60 IP addresses like those above. Can you please tell me what to do to limit the number of Yahoo crawlers on my website? |
Why would you ban Yahoo IPs unless you don't want Yahoo traffic?
Yahoo AND Google use many IPs during their crawl. Those aren't different crawlers; they are the same crawler unless the user agent changes.
For any major search engine you should only use ROBOTS.TXT to disallow the various crawlers, never block them by IP as you can totally destroy your search engine rankings.
[edited by: incrediBILL at 9:27 pm (utc) on June 9, 2007]
| 3:02 pm on Jun 11, 2007 (gmt 0)|
meetzah2, incrediBILL's right - if you ban search engine bots by IP addresses, you can ruin your rankings in the search engine. Likewise, banning them completely via robots.txt could do the same.
Even though you don't fully explain the problem you're experiencing, I'm going to go out on a limb and assume that perhaps your server is having problems keeping up with too-frequent requests from Slurp?
If you're really just trying to reduce the frequency of the spider requests, Yahoo! allows you to specify a "Crawl-delay" in your robots.txt file.
You could set the delay from 5 to 10 to space out the requests coming in to you. The search engines need to be able to request your pages in order to index your content and make it findable through their results for your users, but they don't want to request pages so frequently as to become a de facto denial-of-service attack. So, unless you have some reason you don't want your website/pages to be found on the internet, just limit the crawl rate and don't ban anything.
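As a rough illustration of the "Crawl-delay" suggestion, the snippet below embeds a sample robots.txt and reads it back with Python's standard `urllib.robotparser`. The value 5 is simply the low end of the range suggested above, and the `ROBOTS_TXT` string is an example, not a recommendation for any particular site. How a given crawler interprets the number is up to that crawler, so check its help pages.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: slow Slurp down without disallowing anything.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay applies only to the user-agent group that declares it.
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("SomeOtherBot")

# No Disallow lines are present, so Slurp may still fetch every page.
slurp_allowed = parser.can_fetch("Slurp", "/index.html")

print(slurp_delay, other_delay, slurp_allowed)
```

Because the file contains no Disallow rule, the pages stay crawlable and indexable; only the pacing changes.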
See Yahoo!'s help section for more details: