Home / Forums Index / Yahoo / Yahoo Search Engine and Directory
Forum Library, Charter, Moderators: martinibuster

Yahoo Search Engine and Directory Forum

Bye Bye Yahoo Slurp, Hello crawl.yahoo.net

 4:21 pm on Jun 5, 2007 (gmt 0)

As of today, the transition is complete and all machines crawling as Slurp are now in crawl.yahoo.net. You can see this change in your web server logs, where page accesses from inktomisearch.com are being fully replaced by crawl.yahoo.net contacts. Note that this does not cover other Yahoo! crawlers, such as Yahoo! China, or other verticals, like Yahoo! Shopping, Yahoo! Travel, etc., which have their own user-agents.

Don't fret, though; there is no need to change your robots.txt file, because the crawler user-agent is still Yahoo! Slurp. If you use IP-based filtering, there is no need to change that either, since the IP addresses from which we crawl remain the same. However, please ensure that your network or firewall setup does not keep crawl.yahoo.net out, as otherwise we won't be able to include your content in our results.

With this transition complete, we also encourage you to set up reverse-DNS-based authentication of our crawler to ensure that no rogue bots masquerading as 'Slurp' visit your site.
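The forward-confirmed reverse-DNS check described above can be sketched in Python. This is an illustrative sketch, not an official Yahoo API; the trusted suffixes and function names are my own:

```python
import socket

# Hostname suffixes the search engines have documented for their crawlers.
TRUSTED_SUFFIXES = (".crawl.yahoo.net", ".googlebot.com")

def has_trusted_suffix(hostname):
    """Pure check: does the PTR hostname end in a trusted crawler domain?"""
    return hostname.rstrip(".").endswith(TRUSTED_SUFFIXES)

def is_real_crawler(ip):
    """Reverse-DNS plus forward-confirmation check.

    1. Look up the PTR record for the visiting IP.
    2. Require it to end in a trusted crawler domain.
    3. Resolve that hostname forward again and require the original IP
       to be among the results (defeats spoofed PTR records).
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not has_trusted_suffix(hostname):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in addrs

if __name__ == "__main__":
    # A spoofer claiming to be Slurp from an arbitrary IP fails the check.
    print(is_real_crawler("127.0.0.1"))
```

Step 3 is the important one: anyone can publish a PTR record claiming to be `crawl.yahoo.net`, but only Yahoo can make that hostname resolve back to the visiting IP.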

Yahoo Slurp now crawl.yahoo.net [ysearchblog.com]



 5:49 pm on Jun 5, 2007 (gmt 0)

This is about 18 months after the rest of the crawler world updated their DNS, but they still deserve a pat on the back for finally getting it done.


Now what about Yahoo's crawlers using a common CACHE server?

Why do we need to allow an army of Yahoo spiders to redundantly abuse our servers?

Is it a conceptual problem that Yahoo can't share pages already downloaded?

When I posed that question to one of their engineers I was given a lame excuse that the various crawlers had different needs.

OK, what could one crawler need that's different when you download a page?

The images? The CSS? You certainly don't need to download the page AGAIN just to get those items; you can cache anything else downloaded and share it as well. It's not rocket science. If the age of the cached page is the issue, download it again, just to the CACHE server for all to share.

Funny, Google managed to make some of their crawlers share CACHE, so we know it can be done.

FWIW, the only thing worse than Yahoo's army of crawlers is the ton of Nutch crawlers out there.


[edited by: incrediBILL at 5:51 pm (utc) on June 5, 2007]


 7:21 am on Jun 6, 2007 (gmt 0)

About time! Although I wish the multiple crawlers would share a cache too.

phpBB3 boards rely on labeling bots in order to remove session IDs from the URLs when they visit. I just looked in the "who's online" section of a buddy's forum, and of 29 users online over the past 30 minutes, 4 were people and 25 were bots.

The board owner has all bots share one account, so the who's online section reads: Users online: googlebot, googlebot, googlebot, etc.
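The bot-labeling behaviour described above can be sketched like this. The bot tokens and URL shape are illustrative only, not phpBB's actual code:

```python
# Known crawler tokens to match against the User-Agent header
# (an illustrative list, not phpBB's real bot table).
KNOWN_BOT_TOKENS = ("slurp", "googlebot", "msnbot")

def is_known_bot(user_agent):
    """Case-insensitive substring match against known crawler tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

def build_url(path, sid, user_agent):
    """Append the session ID only for human visitors, so crawlers
    see one stable URL per page instead of one per session."""
    if is_known_bot(user_agent):
        return path
    return f"{path}?sid={sid}"
```

Without this, every crawler visit would mint a fresh session ID, and the search engines would index the same page under endless sid-bearing URLs.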

Ironic huh?


 11:02 pm on Jun 6, 2007 (gmt 0)

Actually, from my perspective, Yahoo's change and declaration of a consistent method for identifying their bots is a small evolutionary improvement above that of Google.

Yahoo is representing that webmasters can check to see if any bot reflecting the user-agent of "Slurp" is actually from Yahoo by doing the reverse DNS lookup. As we all know, there's no difficulty for nefarious dataminers in changing their user-agent strings to masquerade as a search engine bot -- so that just leaves one with being able to identify via IP address.

Google doesn't offer this sort of standard identification method, at least according to their help materials:


According to that, you cannot necessarily know for sure whether something pretending to be Googlebot is actually Googlebot, even if it appears to be coming from an IP address block that shows non-Google ownership info. Everyone's been nervous about that, since folks have supposed that Google might be visiting sites from IP addresses purchased through a proxy, so as to hide whether the addresses are owned by Google or not.

So, I think Yahoo's move is actually a bit more advanced than Google's squirrely refusal to commit to a particular method of IDing the IP/Domain.

As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons:

- Some sites perform content negotiation, delivering up their pages in multiple different languages, depending on the browser's Accept-Language request setting. So, a search engine might have to make multiple requests for every page in order to see if it needed to index multiple versions for people in different countries. If a site only has one version/language per page URL, then this could seem unnecessary.

- Accept headers can also specify the desired content format, so a server might deliver content to a wireless-device request in a different format than to a PC's request. I think it's possible that the mobile Googlebot may request the same pages already requested by Googlebot, for instance - so it wouldn't be just Yahoo that does this, but Google as well.

So, I think it's hard for the search engines to know for sure that a site won't deliver up a few different versions of a page at a particular URL without asking for that page in a few different ways.
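The content negotiation described in the first point can be sketched as a small Accept-Language picker. This is a simplified illustration (real servers follow the HTTP spec's fuller matching rules, including region subtags and wildcards):

```python
def negotiate_language(accept_language, available):
    """Pick the best available language for an Accept-Language header.

    Parses entries like "en-US,en;q=0.8,fr;q=0.5" and returns the
    available primary language with the highest client preference
    (q-value), falling back to the first available language.
    """
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, qpart = piece.partition(";")
        q = 1.0  # per HTTP, a missing q-value means q=1
        qpart = qpart.strip()
        if qpart.startswith("q="):
            try:
                q = float(qpart[2:])
            except ValueError:
                q = 0.0
        # Keep only the primary subtag: "en-US" -> "en".
        prefs.append((lang.strip().lower().split("-")[0], q))
    prefs.sort(key=lambda lq: -lq[1])
    for lang, q in prefs:
        if q > 0 and lang in available:
            return lang
    return available[0]
```

A crawler that wants every variant of such a page has no choice but to re-request the same URL once per Accept-Language value it cares about, which is one plausible reason for the duplicate fetches discussed above.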


 7:33 am on Jun 7, 2007 (gmt 0)

Google doesn't offer this sort of standard identification method, at least according to their help materials

That would be incorrect, Google did this first back in September and it's documented here:


 7:36 am on Jun 7, 2007 (gmt 0)

As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons:

I've got two more:

1. Poor organizational structure that led each little group at Yahoo to have its own little fiefdom and individual crawler.

2. Yahoo doesn't care that its crawlers don't all honor robots.txt properly, making webmasters upset.


 12:30 pm on Jun 7, 2007 (gmt 0)

Ah - interesting that they're now suggesting similar authentication in the blog.

Though, you can clearly see that I'm still technically correct with my statement: Google's Webmaster help section has not been updated to include that methodology and it still says to just use the user-agent.

I think they need to update the help materials to make it really feel like a commitment to that authentication method...


 8:48 pm on Jun 9, 2007 (gmt 0)

Here are my logs regarding Yahoo at this moment of the day. Can you please help me? I already banned about 60 IP addresses like those above. Can you please tell me what to do to limit the number of Yahoo crawlers on my website?

lj511076.crawl.yahoo.net lj511089.crawl.yahoo.net lj511097.crawl.yahoo.net lj511102.crawl.yahoo.net lj511117.crawl.yahoo.net lj511122.crawl.yahoo.net lj511141.crawl.yahoo.net lj511147.crawl.yahoo.net lj511172.crawl.yahoo.net lj511194.crawl.yahoo.net lj511205.crawl.yahoo.net lj511211.crawl.yahoo.net lj511229.crawl.yahoo.net lj511284.crawl.yahoo.net lj511332.crawl.yahoo.net lj511341.crawl.yahoo.net lj511351.crawl.yahoo.net lj511364.crawl.yahoo.net lj511393.crawl.yahoo.net lj511436.crawl.yahoo.net lj511449.crawl.yahoo.net lj511452.crawl.yahoo.net lj511456.crawl.yahoo.net lj511540.crawl.yahoo.net lj511555.crawl.yahoo.net lj511573.crawl.yahoo.net lj511607.crawl.yahoo.net lj511608.crawl.yahoo.net lj511635.crawl.yahoo.net lj511642.crawl.yahoo.net lj511677.crawl.yahoo.net lj511737.crawl.yahoo.net lj511768.crawl.yahoo.net lj511783.crawl.yahoo.net lj511808.crawl.yahoo.net lj511863.crawl.yahoo.net lj511886.crawl.yahoo.net lj511931.crawl.yahoo.net lj511941.crawl.yahoo.net lj511950.crawl.yahoo.net lj511996.crawl.yahoo.net lj512007.crawl.yahoo.net lj512010.crawl.yahoo.net lj512020.crawl.yahoo.net lj512032.crawl.yahoo.net lj512035.crawl.yahoo.net lj512070.crawl.yahoo.net lj512081.crawl.yahoo.net lj512091.crawl.yahoo.net lj611054.crawl.yahoo.net lj611096.crawl.yahoo.net lj611108.crawl.yahoo.net lj611197.crawl.yahoo.net lj611212.crawl.yahoo.net lj611272.crawl.yahoo.net lj611283.crawl.yahoo.net lj611400.crawl.yahoo.net lj611493.crawl.yahoo.net lj611500.crawl.yahoo.net lj611706.crawl.yahoo.net lj611855.crawl.yahoo.net lj611938.crawl.yahoo.net lj611952.crawl.yahoo.net lj611997.crawl.yahoo.net lj612090.crawl.yahoo.net lj612114.crawl.yahoo.net lj612131.crawl.yahoo.net lj612168.crawl.yahoo.net lj612172.crawl.yahoo.net lj612218.crawl.yahoo.net 
lj612515.crawl.yahoo.net lj612570.crawl.yahoo.net lj612590.crawl.yahoo.net rz502206.inktomisearch.com rz502209.inktomisearch.com rz502210.inktomisearch.com rz502220.inktomisearch.com rz502225.inktomisearch.com rz502232.inktomisearch.com rz502234.inktomisearch.com rz502236.inktomisearch.com rz502244.inktomisearch.com rz502248.inktomisearch.com rz502258.inktomisearch.com rz502260.inktomisearch.com rz502263.inktomisearch.com rz502265.inktomisearch.com rz502272.inktomisearch.com rz502276.inktomisearch.com rz502284.inktomisearch.com rz502285.inktomisearch.com rz502299.inktomisearch.com rz502300.inktomisearch.com rz502308.inktomisearch.com rz502313.inktomisearch.com rz502319.inktomisearch.com rz502320.inktomisearch.com rz502330.inktomisearch.com rz502333.inktomisearch.com rz502340.inktomisearch.com rz502342.inktomisearch.com rz502345.inktomisearch.com rz502364.inktomisearch.com rz502367.inktomisearch.com rz502368.inktomisearch.com rz502372.inktomisearch.com rz502375.inktomisearch.com rz502383.inktomisearch.com rz502384.inktomisearch.com rz502394.inktomisearch.com rz502395.inktomisearch.com rz502397.inktomisearch.com rz502401.inktomisearch.com rz502403.inktomisearch.com rz502404.inktomisearch.com rz502406.inktomisearch.com rz502408.inktomisearch.com rz502409.inktomisearch.com rz502410.inktomisearch.com rz502443.inktomisearch.com rz502454.inktomisearch.com rz502455.inktomisearch.com rz502461.inktomisearch.com rz502466.inktomisearch.com rz502470.inktomisearch.com rz502476.inktomisearch.com rz502480.inktomisearch.com rz502481.inktomisearch.com rz502483.inktomisearch.com rz502492.inktomisearch.com rz502494.inktomisearch.com rz502504.inktomisearch.com rz502507.inktomisearch.com rz502511.inktomisearch.com rz502517.inktomisearch.com rz502518.inktomisearch.com rz502523.inktomisearch.com rz502532.inktomisearch.com rz502542.inktomisearch.com rz502546.inktomisearch.com rz502562.inktomisearch.com rz502572.inktomisearch.com rz502573.inktomisearch.com rz502576.inktomisearch.com 
rz502577.inktomisearch.com rz502583.inktomisearch.com rz502587.inktomisearch.com rz502589.inktomisearch.com rz502591.inktomisearch.com rz502598.inktomisearch.com rz502603.inktomisearch.com rz502606.inktomisearch.com rz502611.inktomisearch.com rz502612.inktomisearch.com rz502613.inktomisearch.com rz502625.inktomisearch.com rz502630.inktomisearch.com rz502632.inktomisearch.com rz502633.inktomisearch.com rz502635.inktomisearch.com


 9:27 pm on Jun 9, 2007 (gmt 0)

I already banned about 60 IP addresses like those above. Can you please tell me what to do to limit the number of Yahoo crawlers on my website?

Why would you ban Yahoo IPs unless you don't want Yahoo traffic?

Yahoo AND Google use many IPs during their crawl; they aren't different crawlers, they are the same crawler unless the user-agent changes.

For any major search engine, you should only use robots.txt to disallow the various crawlers; never block them by IP, as you can totally destroy your search engine rankings.
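For example, a robots.txt along these lines (the paths are hypothetical) keeps Slurp out of selected areas without banning it outright:

```
# Keep Yahoo's crawler out of hypothetical private areas
# while leaving the rest of the site crawlable.
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /private/
```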

[edited by: incrediBILL at 9:27 pm (utc) on June 9, 2007]


 3:02 pm on Jun 11, 2007 (gmt 0)

meetzah2, incrediBILL's right - if you ban search engine bots by IP addresses, you can ruin your rankings in the search engine. Likewise, banning them completely via robots.txt could do the same.

Even though you don't fully explain the problem you're experiencing, I'm going to go out on a limb and assume that your server is having trouble keeping up with too-frequent requests from Slurp.

If you're really just trying to reduce the frequency of the spider requests, Yahoo! allows you to specify a "Crawl-delay" in your robots.txt file.

You could set the delay to somewhere from 5 to 10 seconds to space out the requests coming in to you. The search engines need to be able to request your pages in order to index your content and make it findable through their results for your users, but they don't want to request pages so frequently as to become a de facto denial-of-service attack. So, unless you have some reason you don't want your website/pages to be found on the internet, just limit the crawl rate and don't ban anything.
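For example, a robots.txt entry like this (the 10-second value is just an illustration) asks Slurp to slow down without blocking it:

```
# Ask Yahoo's crawler to wait 10 seconds between requests;
# other crawlers are unaffected by this block.
User-agent: Slurp
Crawl-delay: 10
```

Note that Crawl-delay is a Yahoo (and MSN) extension; it is not part of the original robots.txt standard, and not every crawler honors it.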

See Yahoo!'s help section for more details:


WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved