Forum Moderators: Robert Charlton & goodroi
It is a G cache file appearing in the index which is effectively a DUPLICATE COPY of your web page! It is easy to find. Just search on the following:
"allinurl:yourdomain.com cache"
or
"allinurl: search?q=cache" to see all the 60K or so in the database.
You will see something in the resultant SERPs like the following:
[66.102.7.104...]
Anyone find it or NOT find it who has dropped drastically? Or find it and NOT dropped? This could be another possible important update bug.
According to ARIN, that IP is Google's:
[ws.arin.net...]
There are thousands of these listings, though that number is a very small part of an 8 billion page index:
[google.com...]
Google blocks /search URLs with Robots Exclusion Protocol:
[66.102.7.104...]
Disallow: /search
Google use internal rules for www.google.com, rather than relying on the Robots Exclusion Protocol, but it seems that in this case they just follow the /robots.txt and list the URLs without fetching them.
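For anyone who wants to check this for themselves, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt body is a hypothetical one modelled on the single Disallow: /search rule quoted above (the "User-agent: *" line, the example cache URL and the is_fetch_allowed helper are my assumptions, not a copy of Google's actual file).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, modelled on the one rule quoted in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_fetch_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a crawler obeying robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URLs only; the cache query string is made up.
cache_url = "http://66.102.7.104/search?q=cache:example"
home_url = "http://66.102.7.104/"

print(is_fetch_allowed(ROBOTS_TXT, cache_url))
print(is_fetch_allowed(ROBOTS_TXT, home_url))
```

If the rule applies, a compliant crawler never requests the /search URL, which would fit the URL-only listings people are seeing.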
So far I can point to a number of affected sites that share this as a common standout while playing hunt the gremlins.
In addition there are other funnies showing up in various searches.
It appears that the numbers of such are declining, but who knows.
If Google don't crawl the near duplicate page, then the near duplicate page won't affect you in Google. No need to hope. :-)
To cause a problem, Google would need to fetch the cache pages. It looks like it doesn't, most likely due to the /robots.txt
Like I said, I hope you are correct.
Been in the software world long enough to not discount that what could be, should be, and would be can be different or the same.
I keep saying I am not a Google expert etc. I just note things in common and things that are unique.
The chips can fall where they fall.
After you remove that which can't be, then you have that which is.
1 + 1 can be other than 2 depending. You can do the math or the biology ;).
However, I know of at least one search engine that does allow the cache to be crawled... (deliberately or not I don't know)
[google.com...]
Er (Not 100% sure if this link is allowed as it is a search engine that is sometimes discussed here - feel free to delete if you wish)
He he - I don't mean the link to Google - that is discussed here a lot :) Ansearch is sometimes discussed in the Oz/Asia forum I guess.
I've emailed G several times and they have never responded.
Is this the same problem everyone is talking about here? If so, how can we give G the message that this is a big problem?
>it's a different case when Google finds near duplicate content to your page - that's where you can be affected.
Um, a CACHED page IS by definition A 100% duplicate (plus a couple of % extra at the top added by Google)
>If Google don't crawl the near duplicate page, then the near duplicate page won't affect you in Google. No need to hope. :-)
Here's where I'm a little more skeptical. Are you saying that you can GUARANTEE that just because Google doesn't DISPLAY the title/description that the page content does not exist ANYWHERE in the index for comparison for duplicate content? I find it hard to believe that once it has it, G would pass up the opportunity to retain that info for personal reference, especially now that they're doing their utmost to eliminate scrapers copying other sites.
I have to admit IT does NOT show up when searching for the unique first line on our site (although 80 OTHER scraper sites which contain it DO show up ahead of us)
> Um, a CACHED page IS by definition A 100% duplicate (plus a couple of % extra at the top added by Google)
And if the cached page isn't in the index, it isn't a duplicate in the index.
Google displays URL-only results because it has not fetched the URL. If Googlebot has not fetched the URL then it does not have the contents, and it does not have the opportunity to 'retain the info for personal reference'.
Maybe another way to put it is that the backlink data is separate. Once Google knows of the backlink, the backlink stays even after the page with the link on it goes URL only, until the next backlink update.
Therefore, a page previously cached, including a Google cache page itself which got accidentally indexed as explained at the beginning of this thread, COULD cause a duplication penalty, since the fact that it is URL only indicates it was previously cached, right?
I also just came across a page that they are indexing for us, which is of the format:
www.our-domain.com/?abc.htm
As per the rules, this resolves to the same page as:
www.our-domain.com/
but why isn't G smart enough to realize it is not supposed to be indexing it twice? I don't even know where they got the link from, not US! When I click on the "similar pages" link next to it, our home page comes up first, so it is OBVIOUSLY considering them identical or pretty close to it.
Sounds like Bourbon has a LOT of bugs and I'm REALLY tired of being the target of them and penalized for something that is not my fault!
>That will sort out the listings properly.
What an excellent idea. I see there are about 4000 of these bad cache pages indexed for the single IP I was having a problem with.
I tried it. (At this point I'm desperate enough to try anything)
I submitted: 216.239.57.104/robots.txt to the URL console.
URL console said:
"We cannot process robots.txt files that contain Allow: lines. "
Next idea anyone?
You can see backlinks for URLs that were previously indexed, but are now URL-only. (I suspect that Steve meant to type "backlinks" instead of "URL only listings").
The canonicalisation for example.com/?whatever is for a separate thread, (one that we have had often).
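For the example.com/?whatever case, one way to see which of your own URLs collapse to the same page is to normalise them before comparing. This is only a rough sketch: it assumes the site ignores the query string entirely (as the poster says theirs does for the home page), and canonical_url is a made-up helper name, not anything Google uses.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the query string and fragment so equivalent URLs compare equal.

    Assumption: the server returns the same page regardless of the query
    string. That is true for the home page described above, but not for
    sites whose content genuinely depends on query parameters.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


urls = [
    "http://www.our-domain.com/?abc.htm",
    "http://www.our-domain.com/",
]

# Both URLs reduce to one canonical form, so the set has a single member.
print({canonical_url(u) for u in urls})
```

A list of indexed URLs that shrinks when run through something like this points to the variants worth redirecting to one address.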
> I also know of at least one site which has had noindex, nofollow set for ALL search engines since its inception and its home page still shows up as URL only in the index twice (www and non-www).
That's normal, if the URL is also disallowed in /robots.txt
The reason is that Google won't fetch the URL if it is excluded in the /robots.txt, so it won't see the noindex,nofollow META tag and doesn't know to remove the URL.
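To make that interaction concrete, here is a toy model of the behaviour described in this post. It is a simplification of what posters in this thread report, not Google's actual logic; the Page class, the field names and the state labels are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    disallowed_by_robots: bool  # URL is excluded in /robots.txt
    has_meta_noindex: bool      # page carries a noindex META tag


def listing_state(page: Page, has_inbound_links: bool = True) -> str:
    """Toy model of how a page ends up listed (an assumption, not Google's code)."""
    if page.disallowed_by_robots:
        # The crawler never fetches the page, so the META tag is never seen.
        # The URL can still be listed if other pages link to it.
        return "url-only" if has_inbound_links else "not-listed"
    if page.has_meta_noindex:
        # The page is fetched, the noindex tag is read, and the URL is dropped.
        return "not-listed"
    return "full-listing"


blocked = Page("http://example.com/", disallowed_by_robots=True, has_meta_noindex=True)
open_noindex = Page("http://example.com/", disallowed_by_robots=False, has_meta_noindex=True)

print(listing_state(blocked))
print(listing_state(open_noindex))
```

In this model the noindex tag only takes effect when the crawler is allowed to fetch the page, which is why lifting the /robots.txt exclusion is what lets the URL drop out.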
This indexing of the Ansearch cached pages by other search engines seems to be an increasing problem not just in AU but many other countries as they roll out more search engines & directories. UK US NZ etc etc.
I'm just wondering what the potential long term impact of this duplicate data is going to be for smaller websites (that are not strong in SE positioning). It's very possible that the likes of Google could drop the real pages as duplicates.
Does anyone know if Ansearch is the only supposedly mainstream SE that caches & creates pages made up of other websites' data and doesn't protect them from indexing with robots.txt? Or are there other similar ones that we should be aware of? (I don't mean typical scraper sites etc.)