DeeCee - 8:13 pm on Mar 5, 2012 (gmt 0)
The Yahoo Cache System is one of those technologies that is ripe for abuse, like all the search engine APIs.
Used by the good guys, it is a valid technology. Used by the bad guys, it is very negative.
Yahoo bought its edge-caching technology from Inktomi years ago. It is basically a proxy caching system, and also a CDN that can speed things up significantly when used for that purpose. To my knowledge they use it both for their own caching and for their APIs. A few years ago Yahoo open-sourced the technology (Inktomi's Traffic Server), which now lives on as Apache Traffic Server.
While it does have its applications, and it in itself is not "bad", the problem is that when the Yahoo cache system reaches into our sites and grabs page sources, we do not know why or on whose behalf. Sort of like all the anonymous bots hiding out behind the Amazon EC2 cloud.
Under the Yahoo Developer Network, Yahoo provides (paid) APIs for executing searches remotely, directly specifying URLs to pick up. The unfortunate side effect of Yahoo's (and, for that matter, Google's) page caching is that it can be used by info trackers, mark scanners, and spammers that would otherwise have been blocked from accessing our sites.
In fact, articles have been written about how to use SE caches to access otherwise blocked sites, whether blocked as individual users or otherwise.
The caching systems are also another example of how the search engines, as content scrapers, use our content as a revenue driver for themselves: the Yahoo system, for example, caches the full source content of site pages and then sells API access for outsiders to search that content. (Plus, for the "bad guys", it serves up content they would not have been able to access on their own.)
Another example of bad use of SE search APIs is the fact that blog spammer software uses them directly to help set up blog and forum spam attacks. (The black-hat software XRumer, for example, comes with the add-on HRefer, which uses the SE APIs to find and prioritize the blogs and forums to be spammed, so they attack the highest-ranked sites for a topic's specific keywords first for maximum effect.)
Personally, I block the YahooCacheSystem bots. I dislike both content scrapers and systems that act as proxies, hiding the real users.
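For anyone wanting to do the same, a minimal .htaccess sketch for Apache (2.2-style access control) would look something like this. The user-agent string "YahooCacheSystem" is what I match on; check your own raw logs for the exact string these bots send before relying on it:

```apache
# Flag requests whose User-Agent contains "YahooCacheSystem"
# (match string is case-insensitive; verify it against your own logs)
SetEnvIfNoCase User-Agent "YahooCacheSystem" bad_bot

# Allow everyone else, deny anything flagged above
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

The same SetEnvIfNoCase line can carry additional patterns if you block other scrapers by user-agent as well.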
Notice, though, that blocking it might be for naught. The normal Bing and Yahoo Slurp robots stash cached pages as well, unless you have caching blocked in your meta headers. If I am guessing correctly, the YahooCacheSystem is merely a secondary bot, fetching content not already "lifted" by the normal bots, or content that has gone stale in the base caches when someone calls for it. The Yahoo API allows its users to specify a max cache age.
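If you want to block the caching itself rather than the bot, the standard way is the "noarchive" directive, either as a robots meta tag in each page's head ( meta name="robots" content="noarchive" ) or, sketched here site-wide via Apache's mod_headers:

```apache
# Send the noarchive directive on every response, asking the major
# engines (Google, Bing/Yahoo) to index but not keep a cached copy.
# Requires mod_headers to be enabled.
Header set X-Robots-Tag "noarchive"
```

Note this is a request the well-behaved engines honor; it does nothing against scrapers that ignore robots directives altogether.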