
Yahoo Search Engine and Directory Forum

    
Bye Bye Yahoo Slurp, Hello crawl.yahoo.net
engine




msg:3359253
 4:21 pm on Jun 5, 2007 (gmt 0)

As of today, the transition is complete and all machines crawling as Slurp are now in crawl.yahoo.net. You can see this change in your web server logs, where the page accesses from inktomisearch.com are being fully replaced by crawl.yahoo.net contacts. Note that this does not cover other Yahoo! crawlers, such as Yahoo! China, or other verticals, like Yahoo! Shopping, Yahoo! Travel, etc., which have their own user-agents.

Don't fret though; there is no need to change your robots.txt file because the crawler user-agent is still Yahoo! Slurp. If you use IP-based filtering, there is no need to change that either, since the IP addresses from which we crawl remain the same. However, please ensure that your network or firewall setup does not keep crawl.yahoo.net out; otherwise we won't be able to include your content in our results.

With this transition complete, we also encourage you to set up reverse DNS-based authentication of our crawler to ensure that no rogue bots masquerading as 'Slurp' visit your site.

Yahoo Slurp now crawl.yahoo.net [ysearchblog.com]
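
For anyone wondering what "reverse DNS-based authentication" might look like in practice, here is a minimal Python sketch. It is not Yahoo's code: the function name, the injectable lookup parameters, and the decision to trust only the crawl.yahoo.net suffix are assumptions made for illustration. The idea is to resolve the visiting IP to a hostname, check the hostname's domain, then resolve that hostname forward again and confirm it maps back to the same IP.

```python
import socket

# Assumption: only hostnames under crawl.yahoo.net are trusted,
# per the announcement quoted above.
TRUSTED_SUFFIXES = (".crawl.yahoo.net",)


def is_verified_slurp(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without touching the network; by default they use real DNS.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        # Step 1: reverse lookup, IP -> hostname.
        host = reverse_lookup(ip).lower().rstrip(".")
        # Step 2: the hostname must end with a trusted suffix.
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # Step 3: forward lookup, hostname -> IPs, must include the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (no record, lookup failure).
        return False
```

Calling `is_verified_slurp(remote_ip)` with no extra arguments performs live DNS lookups, so in a real setup you would probably cache the result per IP rather than check on every request. The forward lookup in step 3 matters because whoever controls an IP block can make its reverse record claim any hostname.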

 

incrediBILL




msg:3359324
 5:49 pm on Jun 5, 2007 (gmt 0)

This is about 18 months after the rest of the crawler world updated their DNS, but they still deserve a pat on the back for finally getting it done.

<RANT>

Now what about Yahoo's crawlers using a common CACHE server?

Why do we need to allow an army of Yahoo spiders to redundantly abuse our servers?

Is it a conceptual problem that Yahoo can't share pages already downloaded?

When I posed that question to one of their engineers I was given a lame excuse that the various crawlers had different needs.

OK, what could one crawler need that's different when you download a page?

The images? the CSS? well you certainly don't need to download the page AGAIN just to get those items and you cache anything else downloaded and share it as well, it's not rocket science. If it's the age of the cached page that's the issue, download it again, just to the CACHE server for all to share.

Funny, Google managed to make some of their crawlers share CACHE, so we know it can be done.

FWIW, the only thing worse than Yahoo's army of crawlers is the ton of Nutch's out there.

</RANT>

[edited by: incrediBILL at 5:51 pm (utc) on June 5, 2007]

Kurgano




msg:3359759
 7:21 am on Jun 6, 2007 (gmt 0)

About time! Although I wish the multiple crawlers would share a cache too.

phpBB3 boards rely on labeling bots to remove session IDs from the URIs when they visit. I just looked in the "who's online" section of a buddy's forum, and of 29 users online over the past 30 minutes, 4 are people and 25 are bots.

The board owner has all bots share one account, so the who's online section reads... Users online: googlebot, googlebot, googlebot, etc.

Ironic huh?

Silvery




msg:3360576
 11:02 pm on Jun 6, 2007 (gmt 0)

Actually, from my perspective, Yahoo's change and declaration of a consistent method for identifying their bots is a small evolutionary improvement over Google's approach.

Yahoo is representing that webmasters can check whether any bot presenting the user-agent "Slurp" is actually from Yahoo by doing a reverse DNS lookup. As we all know, nefarious dataminers have no difficulty changing their user-agent strings to masquerade as a search engine bot, which leaves the IP address as the only reliable means of identification.

Google doesn't offer this sort of standard identification method, at least according to their help materials:

[google.com...]

According to that, you cannot necessarily know for sure that something pretending to be Googlebot is actually Googlebot, even if it appears to be coming from an IP address block that shows non-Google ownership info. Everyone's been nervous about that, since folx have supposed that Google might be visiting sites from IP addresses that they could've purchased through a proxy, so as to hide whether the addresses are owned by Google or not.

So, I think Yahoo's move is actually a bit more advanced than Google's squirrely refusal to commit to a particular method of IDing the IP/Domain.

As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons:

- Some sites perform content negotiation, delivering up their pages in multiple different languages, depending on the browser's Accept-Language request setting. So, a search engine might have to make multiple requests for every page in order to see if it needed to index multiple versions for people in different countries. If a site only has one version/language per page URL, then this could seem unnecessary.

- Accept headers can also specify the desired content format, so a server might deliver content in one format for a wireless device's request and in another for a PC's. I think it's possible that the mobile Googlebot may request the same pages that were already requested by Googlebot, for instance - so it wouldn't be just Yahoo that does this, but Google as well.

So, I think it's hard for the search engines to know for sure that a site won't deliver up a few different versions of a page at a particular URL without asking for that page in a few different ways.
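
To illustrate the point for the shared-cache argument, here is a rough Python sketch, not anything Yahoo or Google is known to use. It assumes a hypothetical `cache_key` helper and a hypothetical default list of negotiated headers. If a shared crawler cache keyed pages on URL alone, two crawlers asking for different languages or formats would wrongly receive the same copy, so the key has to include the request headers the server negotiates on.

```python
def cache_key(url, request_headers, vary=("accept", "accept-language")):
    """Build a cache key from the URL plus the negotiated request headers.

    `vary` names the headers assumed to influence the response; a real
    cache would take this list from the server's Vary response header.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    normalized = {name.lower(): value.strip() for name, value in request_headers.items()}
    return (url,) + tuple(normalized.get(name, "") for name in vary)


# Hypothetical requests: same URL, different language preferences.
english = cache_key("http://example.com/page", {"Accept-Language": "en"})
german = cache_key("http://example.com/page", {"Accept-Language": "de"})
repeat = cache_key("http://example.com/page", {"accept-language": "en"})
```

Here `english` and `german` are different keys, so both versions would have to be fetched once each, while `repeat` matches `english` and could be served from the shared copy.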

incrediBILL




msg:3360888
 7:33 am on Jun 7, 2007 (gmt 0)

Google doesn't offer this sort of standard identification method, at least according to their help materials

That would be incorrect, Google did this first back in September and it's documented here:
[googlewebmastercentral.blogspot.com...]

incrediBILL




msg:3360893
 7:36 am on Jun 7, 2007 (gmt 0)

As for a reason why a company's bots might request a page multiple times rather than share it, I can see at least two reasons:

I've got two more:

1. Poor organizational structure that led each little group at Yahoo to have its own little fiefdom and its own crawler.

2. Yahoo doesn't care that its crawlers don't all honor robots.txt properly, or that this upsets webmasters.

Silvery




msg:3361015
 12:30 pm on Jun 7, 2007 (gmt 0)

Ah - interesting that they're now suggesting similar authentication in the blog.

Though, you can clearly see that I'm still technically correct with my statement: Google's Webmaster help section has not been updated to include that methodology and it still says to just use the user-agent.

I think they need to update the help materials to make it really feel like a commitment to that authentication method...

meetzah2




msg:3363044
 8:48 pm on Jun 9, 2007 (gmt 0)

Here are my logs regarding Yahoo at this moment of the day. Can you please help me? I already banned about 60 IP addresses like these. Can you please tell me what to do to limit the number of Yahoo crawlers on my website?

74.6.28.234 lj511076.crawl.yahoo.net
74.6.28.143 lj511089.crawl.yahoo.net
74.6.28.151 lj511097.crawl.yahoo.net
74.6.28.156 lj511102.crawl.yahoo.net
74.6.28.171 lj511117.crawl.yahoo.net
74.6.28.102 lj511122.crawl.yahoo.net
74.6.28.121 lj511141.crawl.yahoo.net
74.6.28.70 lj511147.crawl.yahoo.net
74.6.28.22 lj511172.crawl.yahoo.net
74.6.28.44 lj511194.crawl.yahoo.net
74.6.27.105 lj511205.crawl.yahoo.net
74.6.27.111 lj511211.crawl.yahoo.net
74.6.27.72 lj511229.crawl.yahoo.net
74.6.26.202 lj511284.crawl.yahoo.net
74.6.26.146 lj511332.crawl.yahoo.net
74.6.26.155 lj511341.crawl.yahoo.net
74.6.26.165 lj511351.crawl.yahoo.net
74.6.26.104 lj511364.crawl.yahoo.net
74.6.26.76 lj511393.crawl.yahoo.net
74.6.26.46 lj511436.crawl.yahoo.net
74.6.25.207 lj511449.crawl.yahoo.net
74.6.25.210 lj511452.crawl.yahoo.net
74.6.25.214 lj511456.crawl.yahoo.net
74.6.25.120 lj511540.crawl.yahoo.net
74.6.25.78 lj511555.crawl.yahoo.net
74.6.25.23 lj511573.crawl.yahoo.net
74.6.24.205 lj511607.crawl.yahoo.net
74.6.24.206 lj511608.crawl.yahoo.net
74.6.24.233 lj511635.crawl.yahoo.net
74.6.24.136 lj511642.crawl.yahoo.net
74.6.24.171 lj511677.crawl.yahoo.net
74.6.24.27 lj511737.crawl.yahoo.net
74.6.23.206 lj511768.crawl.yahoo.net
74.6.23.221 lj511783.crawl.yahoo.net
74.6.23.142 lj511808.crawl.yahoo.net
74.6.23.123 lj511863.crawl.yahoo.net
74.6.23.16 lj511886.crawl.yahoo.net
74.6.22.209 lj511931.crawl.yahoo.net
74.6.22.219 lj511941.crawl.yahoo.net
74.6.22.228 lj511950.crawl.yahoo.net
74.6.22.46 lj511996.crawl.yahoo.net
74.6.20.205 lj512007.crawl.yahoo.net
74.6.20.208 lj512010.crawl.yahoo.net
74.6.20.218 lj512020.crawl.yahoo.net
74.6.20.230 lj512032.crawl.yahoo.net
74.6.20.233 lj512035.crawl.yahoo.net
74.6.20.164 lj512070.crawl.yahoo.net
74.6.20.101 lj512081.crawl.yahoo.net
74.6.20.111 lj512091.crawl.yahoo.net
74.6.74.124 lj611054.crawl.yahoo.net
74.6.74.6 lj611096.crawl.yahoo.net
74.6.74.46 lj611108.crawl.yahoo.net
74.6.73.165 lj611197.crawl.yahoo.net
74.6.73.90 lj611212.crawl.yahoo.net
74.6.73.47 lj611272.crawl.yahoo.net
74.6.72.227 lj611283.crawl.yahoo.net
74.6.72.8 lj611400.crawl.yahoo.net
74.6.71.187 lj611493.crawl.yahoo.net
74.6.71.162 lj611500.crawl.yahoo.net
74.6.70.75 lj611706.crawl.yahoo.net
74.6.69.139 lj611855.crawl.yahoo.net
74.6.69.14 lj611938.crawl.yahoo.net
74.6.69.49 lj611952.crawl.yahoo.net
74.6.68.167 lj611997.crawl.yahoo.net
74.6.68.42 lj612090.crawl.yahoo.net
74.6.68.21 lj612114.crawl.yahoo.net
74.6.67.201 lj612131.crawl.yahoo.net
74.6.67.160 lj612168.crawl.yahoo.net
74.6.67.143 lj612172.crawl.yahoo.net
74.6.67.111 lj612218.crawl.yahoo.net
74.6.75.29 lj612515.crawl.yahoo.net
74.6.74.212 lj612570.crawl.yahoo.net
74.6.74.231 lj612590.crawl.yahoo.net
74.6.21.140 rz502206.inktomisearch.com
74.6.21.143 rz502209.inktomisearch.com
74.6.21.144 rz502210.inktomisearch.com
74.6.21.154 rz502220.inktomisearch.com
74.6.21.159 rz502225.inktomisearch.com
74.6.21.166 rz502232.inktomisearch.com
74.6.21.168 rz502234.inktomisearch.com
74.6.21.170 rz502236.inktomisearch.com
74.6.21.104 rz502244.inktomisearch.com
74.6.21.108 rz502248.inktomisearch.com
74.6.21.118 rz502258.inktomisearch.com
74.6.21.120 rz502260.inktomisearch.com
74.6.21.123 rz502263.inktomisearch.com
74.6.21.125 rz502265.inktomisearch.com
74.6.21.75 rz502272.inktomisearch.com
74.6.21.79 rz502276.inktomisearch.com
74.6.21.14 rz502284.inktomisearch.com
74.6.21.15 rz502285.inktomisearch.com
74.6.21.29 rz502299.inktomisearch.com
74.6.21.30 rz502300.inktomisearch.com
74.6.21.38 rz502308.inktomisearch.com
74.6.21.43 rz502313.inktomisearch.com
74.6.21.49 rz502319.inktomisearch.com
74.6.19.10 rz502320.inktomisearch.com
74.6.19.20 rz502330.inktomisearch.com
74.6.19.23 rz502333.inktomisearch.com
74.6.19.30 rz502340.inktomisearch.com
74.6.19.32 rz502342.inktomisearch.com
74.6.19.35 rz502345.inktomisearch.com
74.6.18.202 rz502364.inktomisearch.com
74.6.18.205 rz502367.inktomisearch.com
74.6.18.206 rz502368.inktomisearch.com
74.6.18.210 rz502372.inktomisearch.com
74.6.18.213 rz502375.inktomisearch.com
74.6.18.221 rz502383.inktomisearch.com
74.6.18.222 rz502384.inktomisearch.com
74.6.18.232 rz502394.inktomisearch.com
74.6.18.233 rz502395.inktomisearch.com
74.6.18.235 rz502397.inktomisearch.com
74.6.18.135 rz502401.inktomisearch.com
74.6.18.137 rz502403.inktomisearch.com
74.6.18.138 rz502404.inktomisearch.com
74.6.18.140 rz502406.inktomisearch.com
74.6.18.142 rz502408.inktomisearch.com
74.6.18.143 rz502409.inktomisearch.com
74.6.18.144 rz502410.inktomisearch.com
74.6.18.103 rz502443.inktomisearch.com
74.6.18.114 rz502454.inktomisearch.com
74.6.18.115 rz502455.inktomisearch.com
74.6.18.121 rz502461.inktomisearch.com
74.6.18.126 rz502466.inktomisearch.com
74.6.18.73 rz502470.inktomisearch.com
74.6.18.79 rz502476.inktomisearch.com
74.6.18.10 rz502480.inktomisearch.com
74.6.18.11 rz502481.inktomisearch.com
74.6.18.13 rz502483.inktomisearch.com
74.6.18.22 rz502492.inktomisearch.com
74.6.18.24 rz502494.inktomisearch.com
74.6.18.34 rz502504.inktomisearch.com
74.6.18.37 rz502507.inktomisearch.com
74.6.18.41 rz502511.inktomisearch.com
74.6.18.47 rz502517.inktomisearch.com
74.6.18.48 rz502518.inktomisearch.com
74.6.17.201 rz502523.inktomisearch.com
74.6.17.210 rz502532.inktomisearch.com
74.6.17.220 rz502542.inktomisearch.com
74.6.17.224 rz502546.inktomisearch.com
74.6.17.136 rz502562.inktomisearch.com
74.6.17.146 rz502572.inktomisearch.com
74.6.17.147 rz502573.inktomisearch.com
74.6.17.150 rz502576.inktomisearch.com
74.6.17.151 rz502577.inktomisearch.com
74.6.17.157 rz502583.inktomisearch.com
74.6.17.161 rz502587.inktomisearch.com
74.6.17.163 rz502589.inktomisearch.com
74.6.17.165 rz502591.inktomisearch.com
74.6.17.172 rz502598.inktomisearch.com
74.6.17.103 rz502603.inktomisearch.com
74.6.17.106 rz502606.inktomisearch.com
74.6.17.111 rz502611.inktomisearch.com
74.6.17.112 rz502612.inktomisearch.com
74.6.17.113 rz502613.inktomisearch.com
74.6.17.125 rz502625.inktomisearch.com
74.6.17.73 rz502630.inktomisearch.com
74.6.17.75 rz502632.inktomisearch.com
74.6.17.76 rz502633.inktomisearch.com
74.6.17.78 rz502635.inktomisearch.com

incrediBILL




msg:3363061
 9:27 pm on Jun 9, 2007 (gmt 0)

I already banned about 60 IP addresses like these. Can you please tell me what to do to limit the number of Yahoo crawlers on my website?

Why would you ban Yahoo IPs unless you don't want Yahoo traffic?

Yahoo AND Google both use many IPs during a crawl. Those aren't different crawlers; they are the same crawler unless the user agent changes.

For any major search engine you should only use ROBOTS.TXT to disallow the various crawlers. Never block them by IP, because doing so can totally destroy your search engine rankings.

[edited by: incrediBILL at 9:27 pm (utc) on June 9, 2007]

Silvery




msg:3364382
 3:02 pm on Jun 11, 2007 (gmt 0)

meetzah2, incrediBILL's right - if you ban search engine bots by IP addresses, you can ruin your rankings in the search engine. Likewise, banning them completely via robots.txt could do the same.

Even though you don't fully explain the problem you're experiencing, I'm going to go out on a limb and assume that perhaps your server is having problems keeping up with too-frequent requests from Slurp?

If you're really just trying to reduce the frequency of the spider requests, Yahoo! allows you to specify a "Crawl-delay" in your robots.txt file.

You could set the delay to somewhere between 5 and 10 to space out the requests coming in to your server. The search engines need to be able to request your pages in order to index your content and make it findable through their results for your users, but they don't want to request pages so frequently as to become a de facto denial-of-service attack. So, unless you have some reason you don't want your website/pages to be found on the internet, just limit the crawl rate and don't ban anything.

See Yahoo!'s help section for more details:

[help.yahoo.com...]
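
As a rough sketch of the Crawl-delay suggestion, the Python below embeds a sample robots.txt and reads it back with the standard library's `urllib.robotparser`. Two things are assumptions rather than anything confirmed in this thread: the `User-agent: Slurp` token and the value of 5. Check Yahoo!'s help pages for the exact token and for how they interpret the number.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: slows Slurp down without disallowing anything.
# "Slurp" as the user-agent token and the delay of 5 are assumptions.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay this file requests of Slurp, and whether Slurp may still crawl.
slurp_delay = parser.crawl_delay("Slurp")
slurp_allowed = parser.can_fetch("Slurp", "/index.html")
```

Because the file has no Disallow lines, the crawler is still allowed everywhere. The directive only asks it to space out its requests, and that is the difference from banning IPs.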


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved