Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
YahooCacheSystem - what is it?
motorhaven




msg:4425064
 2:30 pm on Mar 5, 2012 (gmt 0)

Yes, I've searched on this, but can't find the answer.

Like many others YahooCacheSystem comes in and fetches both the home index and favicon.ico. What I haven't been able to find in discussion here is what this crawler is and does with these. Does anyone know? Block them or not?

 

lucy24




msg:4425140
 5:38 pm on Mar 5, 2012 (gmt 0)

When in doubt, block ;)

:: joining you in line waiting for an answer to the "what are they up to?" question ::

keyplyr




msg:4425210
 7:47 pm on Mar 5, 2012 (gmt 0)


I wouldn't block YahooCacheSystem.

DeeCee




msg:4425219
 8:13 pm on Mar 5, 2012 (gmt 0)

The Yahoo Cache System is one of those technologies that is ripe for abuse.. Like all the Search Engine APIs.
If used by the good guys, it is a valid technology. If used by the bad guys, it is very negative.

Yahoo years ago bought edge caching technology from Inktomi. Basically a proxy caching system. Also a CDN used to speed up things significantly when used for that purpose. To my knowledge, they use it both for their own caching and for their APIs. A few years ago Yahoo made the technology (Inktomi's Traffic Server) open source; it is now Apache Traffic Server.

While it does have its applications, and it in itself is not "bad", the problem is that when the Yahoo cache system reaches into our sites and grabs page sources, we do not know why or on behalf of whom. Sort of like all the anonymous bots hiding out behind the Amazon EC2 cloud.

Under the Yahoo Developer Network, Yahoo provides (paid) APIs to execute searches remotely, directly specifying URLs to pick up. The unfortunate side effect of Yahoo's (and, for that matter, Google's) page caching is that it can be used by info trackers, mark scanners, and spammers that would otherwise have been blocked from accessing our sites.

In fact, articles have been written about how to use SE caches to access otherwise blocked sites. As blocked individual users and otherwise.

The caching system is also another example of how the search engines, as content scrapers, use our content as revenue drivers for themselves: the Yahoo system, for example, caches the full source content of site pages and sells API access for outsiders to search this content. (Plus, for the "bad guys", it serves up content they would not have been able to access on their own.)

Another example of bad use of SE search APIs is the fact that blog spammer software uses them directly to help set up their blog and forum spammer attacks. (Black Hat software XRumer for example comes with the add-on HRefer, which uses the SE APIs to find and prioritize the blogs and forums that should be spammed so they attack highest-ranked sites for the topic's specific keywords first for maximum effect).

Personally, I block the YahooCacheSystem bots. I dislike both content scrapers and systems that act as proxies, hiding the real users.

Notice, though, that blocking it might be for naught. The normal Bing or Yahoo Slurp robots stash cached pages as well, unless you have caching blocked in your meta headers. If I am guessing correctly, the YahooCacheSystem is merely a secondary bot for content not already "lifted" by the normal bots or content that is outdated in the base caches when someone calls on it. The Yahoo API allows its users to specify a max cache age.

motorhaven




msg:4425228
 8:35 pm on Mar 5, 2012 (gmt 0)

Thank you! That is exactly the information I was looking for. I've seen many prominent members here say they block it, but never an in-depth reason why.

wilderness




msg:4425235
 8:47 pm on Mar 5, 2012 (gmt 0)

I wouldn't block YahooCacheSystem.


Well, they certainly don't know how to read:

(and before another chirps in; I do realize this is overkill)

<meta name="MSSmartTagsPreventParsing" content="TRUE">
<META NAME="robots" CONTENT="noarchive">
<meta name="robots" content="noimageindex, nomediaindex" />
<META HTTP-EQUIV="pragma" CONTENT="no-cache">

I must have near a dozen listings in robots.txt for all the Yahoo crap over the years, and I'm certainly NOT going to add another.

DeeCee




msg:4425262
 9:57 pm on Mar 5, 2012 (gmt 0)

The YahooCacheSystem does not even look at robots.txt.
I believe they do not look at it as a robot, since it is "merely" a caching system acting on behalf of another user. The user that is trying to hide.

My blocks for YahooCacheSystem are purely about denying them access, through .htaccess or other methods. If their "customers" want my stuff, they can try to load directly and skip Yahoo's theft tool..
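
For anyone doing the deny in application code rather than .htaccess, here is a minimal sketch of the same idea in Python. The function name and the token list are my own assumptions for illustration, not anything Yahoo documents; it only shows the "match the User-Agent, answer 403" decision.

```python
# Hypothetical helper: decide the HTTP status from the User-Agent alone.
# The token list is an assumption; extend it to match your own logs.
BLOCKED_UA_TOKENS = ("yahoocachesystem",)


def status_for_user_agent(user_agent):
    """Return 403 if the User-Agent contains a blocked token, else 200."""
    ua = (user_agent or "").strip().lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return 403
    return 200
```

A User-Agent string is trivially faked, so this only stops agents that announce themselves honestly.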

I would not even try to "talk" to them. They are like Google Preview, who does not care about such things either. (Except through blocking any and all snippets with 'nosnippet').

dstiles




msg:4425278
 10:58 pm on Mar 5, 2012 (gmt 0)

Most of the YCS hits I get seem to be from mobiles. At the moment I allow those, although there do seem to be a few that are non-mobile hits.

DeeCee




msg:4425281
 11:09 pm on Mar 5, 2012 (gmt 0)

Most of mine appear as a Yahoo host/domain that looks like it is caching mobile.. But I see no purpose behind that caching.. It still merely introduces a second connection slow-down behind any mobile using the caching. My site -> Yahoo -> Mobile.

Unless there is a ton of mobiles all using that same cache to service that exact page before the page expires in the caching system, the cache serves no real purpose. An individual user gets no benefit unless either many others have visited that page shortly before, or that same user is reloading that same page over and over. In the latter case, that is even only under the assumption that Yahoo can load up my content faster than I can.
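
To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. The model and the parameter names are my own simplification (one cache hop, a fixed hit ratio, fixed transfer times), not measurements of Yahoo's system.

```python
def expected_latency_via_cache(hit_ratio, t_cache_to_client, t_origin_to_cache):
    """Average time to deliver a page through a caching proxy.

    Every request pays the cache-to-client leg; only cache misses
    (a fraction 1 - hit_ratio of requests) also pay the origin-to-cache leg.
    """
    return t_cache_to_client + (1.0 - hit_ratio) * t_origin_to_cache


def cache_is_faster(hit_ratio, t_cache_to_client, t_origin_to_cache, t_direct):
    """True when going through the cache beats loading directly from the origin."""
    via_cache = expected_latency_via_cache(
        hit_ratio, t_cache_to_client, t_origin_to_cache
    )
    return via_cache < t_direct
```

With a hit ratio of zero (a page nobody else has requested recently), the cached path is simply the sum of both legs, which is the two-step slow-down described above.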

And honestly, I don't quite believe in it.. Again, because it is a proxy hiding the real user.

DeeCee




msg:4425282
 11:15 pm on Mar 5, 2012 (gmt 0)

BTW.. A second reason I believe that those page loads are scams is that the user apparently is not really caching, merely lifting selective content. The only thing loaded is my main html for the page and favicon.ico (which I believe is an automatic).

So apparently these "users" need neither my CSS, my JS, nor my images to "see" my content. That makes it look fake to me.

keyplyr




msg:4425288
 11:34 pm on Mar 5, 2012 (gmt 0)

I filter all IPs and verify authenticity. I catch the occasional imposter, but most all hits using the YCS UA are from ycar13.mobile.bf1.yahoo.com - hence I do not block it, but only make sure it's from Yahoo.
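
One common way to do that kind of check is forward-confirmed reverse DNS: look up the hostname for the IP, make sure it ends in the expected domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below is an illustration of that technique, not necessarily the exact filter described above. The function name, the `.yahoo.com` suffix check, and the injectable `reverse`/`forward` resolvers are assumptions; it handles IPv4 only.

```python
import socket


def is_yahoo_host(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed Yahoo IP (IPv4).

    reverse: callable ip -> hostname (defaults to a PTR lookup)
    forward: callable hostname -> list of IPs (defaults to an A lookup)
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        # The leading dot stops look-alikes such as "notyahoo.com".
        if not host.endswith(".yahoo.com"):
            return False
        # The hostname must resolve back to the IP we started from.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
```

DNS lookups are slow, so in practice the result would be cached per IP rather than looked up on every hit.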

wilderness




msg:4425308
 12:29 am on Mar 6, 2012 (gmt 0)

BTW.. A second reason I believe that those page loads are scams is that the user apparently is not really caching, merely lifting selective content. The only thing loaded is my main html for the page and favicon.ico (which I believe is an automatic).


same here and they're eating 403s.

keyplyr




msg:4425317
 12:52 am on Mar 6, 2012 (gmt 0)

same here and they're eating 403s.

So what are all those thousands of SBC-Yahoo mobile users seeing? Red X's?

wilderness




msg:4425323
 1:19 am on Mar 6, 2012 (gmt 0)

Green fu's ;)

My pages (generally speaking) are in excess of 1,000 words of text with accompanying images. Some pages are in the 3-5k in word count.

My sites were never intended for mobiles.

FWIW, and from the past few weeks on this reactivated site, I'm seeing 2-3 requests per day from this UA, hardly thousands.

FWIW2, I've spent a month of days and nights working to make this site presentable for both visitors and the SE's.
Should the visitors not like the restrictions, they may simply explore other options for the media (none).

keyplyr




msg:4425345
 2:38 am on Mar 6, 2012 (gmt 0)

I'm seeing 2-3 requests per day from this UA, hardly thousands

I wasn't referring to the UA. I was referring to all the mobile users who depend on the cache loading quickly, and who otherwise sometimes wait for a very slow download. Mobile phones don't have the CPU power nor broadband speed of PCs and depend on caches by their carriers.

Today's mobile rendering is astonishingly good. Even if you didn't optimize your pages for mobiles, your traffic has an increasing chance of being from a mobile appliance. For every PC or lap-top sold, there are 10x's that in mobile appliances.

wilderness




msg:4425352
 2:53 am on Mar 6, 2012 (gmt 0)

For every PC or lap-top sold, there are 10x's that in mobile appliances.


more power to 'em!
I hope they enjoy facebook and twitter without web pages ;)

DeeCee




msg:4425359
 3:08 am on Mar 6, 2012 (gmt 0)

@keyplyr,
See my earlier reply on why those caches have little to no impact, other than slowing them down even further by introducing a two-step process to get to actual web-pages..
Caches only work to speed things up on web-pages that are loaded over and over by many. A first load will always be slower than loading directly from the original source.

All the mobile phones I see access my sites directly.
Almost all the cache calls I see from Yahoo do not bother to load all pieces, but merely the text. It is a scam.

As an example, last night I moved a site to a new domain (less than two hours old or so after registration), and the YahooCacheSystem came by immediately and started lifting text (only text, no CSS, JS, or images). Not through the 301 redirect from the old location, but directly going for the new domain, which could ONLY have been found by an information tracker that follows newly created domains and starts lifting information.

lucy24




msg:4425385
 4:31 am on Mar 6, 2012 (gmt 0)

Always from 98.139.241.24n. UA always a bare "YahooCacheSystem". Have never asked for anything but front page plus favicon. Now that they're getting 403'd they no longer ask for the favicon-- which is kinda funny, since I let everyone have that. The front page changes about once every six months. (I am not a front-driven site.) D'you suppose they assume that all other pages change even less often?

Far as I know, they have never ever even once asked for robots.txt. Well, maybe back in 2007 when I wasn't paying attention.

DeeCee




msg:4425387
 4:44 am on Mar 6, 2012 (gmt 0)

@lucy24,
The crawlers that go for only the front-page (and sometimes the favicon) are typically info-scanners. Tracking for basic information about the site. Analytics IDs, affiliate IDs, site meta description, site title, a snapshot of the site, and other stuff to add to the information they already loaded from the domain registration information. Info to sell on their own site, about your site. Such as the D*****Tools scanner, which is illegal to mention here. :)

Full content scrapers typically arrive in the middle of the system, hitting on pages loaded out of the search engine APIs, and they often never even touch the front page.

keyplyr




msg:4425397
 5:25 am on Mar 6, 2012 (gmt 0)


@DeeCee

Yes, I'm aware of what a cache is and how it works ;)

All the mobile phones I see access my sites directly.

You'd not know this. Cached images, scripts, etc. are loaded from the mobile carrier prior to the request being fulfilled from the server.

Your scraping suspicions are likely due to bad agents not Yahoo. As I said earlier, whitelist filtering stops all that.

DeeCee




msg:4425399
 5:41 am on Mar 6, 2012 (gmt 0)

keyplyr,
As I mentioned, Yahoo's cache showed up to offload only HTML from a brand-new domain. Images/CSS/JS could not have been pre-cached. There simply had never been any visitors to that domain before, caching or otherwise.

lucy24




msg:4425456
 8:31 am on Mar 6, 2012 (gmt 0)

Such as the D*****Tools scanner, which is illegal to mention here

Yes, I remember when I discovered that it's an unprintable word. (The opposite of wmw, which is apparently an unabbreviable word.) Is there an explanation lurking somewhere that I'm expected to have read? :)

dstiles




msg:4425717
 9:44 pm on Mar 6, 2012 (gmt 0)

Has anyone actually used yahoomobile on a YCS-blocked site? If so, what do you see?

Key_Master




msg:4425738
 10:57 pm on Mar 6, 2012 (gmt 0)

The original IP requesting the content from your server isn't exactly hidden. YahooCacheMobile uses the X-Forwarded-For header:

HTTP_X_FORWARDED_FOR{'174.252.103.23, 98.137.80.20'}

Lots of mobile phone carriers use proxies. Almost all of them will forward the original IP using the same header. I use Opera Mini and Blackberry browsers. Both port my carrier's IP through a proxy.
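
As a rough illustration of reading that header, here is a small Python sketch that splits an X-Forwarded-For value into its IP chain, with the original client first and each proxy after it. The function name is hypothetical. Keep in mind the header is supplied by whoever sends the request, so it can be forged and should only be trusted when the connecting IP is a proxy you have already verified.

```python
import ipaddress


def forwarded_chain(header_value):
    """Split an X-Forwarded-For value into valid IP strings, client first."""
    chain = []
    for part in (header_value or "").split(","):
        candidate = part.strip()
        try:
            chain.append(str(ipaddress.ip_address(candidate)))
        except ValueError:
            # Skip entries that are not IP addresses, e.g. "unknown".
            continue
    return chain


chain = forwarded_chain("174.252.103.23, 98.137.80.20")
original_client = chain[0]  # '174.252.103.23'
```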

dstiles




msg:4426184
 8:38 pm on Mar 7, 2012 (gmt 0)

Yes, I know that (assuming that was a response to me). :)

What I'm getting at is: if YCS is blocked and you access the site using yahoomobile, do you get the block notice (eg 403) or a web page. In other words, is blocking YCS causing problems to users or is it like G web preview - just another fish of a scarlet hue?

keyplyr




msg:4426211
 9:27 pm on Mar 7, 2012 (gmt 0)

if YCS is blocked and you access the site using yahoomobile, do you get the block notice (eg 403) or a web page


@dstiles

I think I covered that in my post above (msg:4425345)

wilderness




msg:4426216
 9:44 pm on Mar 7, 2012 (gmt 0)

dstiles,
4-5 years ago I spoke with a cousin at a funeral.
He inquired about the address for one of my websites and then immediately attempted to access it with a T-Mobile phone (which I had denied).

The denial didn't show a 403, rather that the site simply didn't exist.
Whether that's the procedure for all cells or not, I haven't a clue.

keyplyr




msg:4426232
 11:05 pm on Mar 7, 2012 (gmt 0)



wilderness - mobile phones of today have no comparison to those 5 years ago.

dstiles




msg:4426681
 10:24 pm on Mar 8, 2012 (gmt 0)

So the cache isn't by-passed if it fails then? Great. :(


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved