Google Lists its Own Cache Pages

Forum Moderators: Robert Charlton & goodroi

Anyone else recently dumped in G seeing the following?

MikeNoLastName

9:44 pm on Jun 27, 2005 (gmt 0)

The following has come to my attention as another possible common factor on at least two independent sites which have dropped drastically recently, and I've only been able to find it associated with domains of ours which have dropped and on NONE which stayed alive and well.

It is a G cache file appearing in the index which is effectively a DUPLICATE COPY of your web page! It is easy to find. Just search on the following:

"allinurl:yourdomain.com cache"
or
"allinurl: search?q=cache" to see all the 60K or so in the database.

You will see something in the resultant SERPs like the following:

[66.102.7.104...]

Anyone find it or NOT find it who has dropped drastically? Or find it and NOT dropped? This could be another possible important update bug.

ciml

1:05 pm on Jun 28, 2005 (gmt 0)

Well spotted, MikeNoLastName, but this does not create a duplicate content risk, as URL-only listings have no content to be too similar to yours.

According to ARIN, that IP is Google's:
[ws.arin.net...]

There are thousands of these listings, though that number is a very small part of an 8 billion page index:
[google.com...]

Google blocks /search URLs with Robots Exclusion Protocol:
[66.102.7.104...]

Disallow: /search

Google use internal rules for www.google.com, rather than rely on the Robots Exclusion Protocol, but it seems that in this case they just follow the /robots.txt and list the URLs without fetching them.
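As a rough illustration of the rule quoted above, Python's standard urllib.robotparser can be given a one-rule robots.txt to show that a compliant crawler may not fetch any /search URL, cache requests included. Only the "Disallow: /search" line comes from Google's file as quoted; the User-agent line, the crawler name and the cache query are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt: only the Disallow line is quoted in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_fetch_allowed(url: str, agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical cache URL in the style of the listings discussed above.
cache_url = "http://66.102.7.104/search?q=cache:example"
home_url = "http://66.102.7.104/"

print("cache URL fetchable:", is_fetch_allowed(cache_url))
print("home page fetchable:", is_fetch_allowed(home_url))
```

A crawler that obeys this file can still learn that the cache URL exists from links elsewhere, which would be consistent with the URL-only listings.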

theBear

3:52 pm on Jun 28, 2005 (gmt 0)

ciml, the folks with the 67,100 affected pages hope that is really the case.

So far I can point to a number of affected sites that share this as a common standout of playing hunt the gremlins.

In addition there are other funnies showing up in various searches.

It appears that the numbers of such are declining, but who knows.

ciml

6:33 pm on Jun 28, 2005 (gmt 0)

TheBear, it's a different case when Google finds near duplicate content to your page - that's where you can be affected.

If Google don't crawl the near duplicate page, then the near duplicate page won't affect you in Google. No need to hope. :-)

To cause a problem, Google would need to fetch the cache pages. It looks like it doesn't, most likely due to the /robots.txt

moltar

6:37 pm on Jun 28, 2005 (gmt 0)

The reason that those URLs are listed is that someone linked to that URL from another page. Google knows that this URL exists, but cannot fetch it because it's banned by its own robots.txt.

This comes in useful sometimes. You can find the page just by its URL, even if the content was not indexed.

theBear

7:21 pm on Jun 28, 2005 (gmt 0)

Sorry ciml and moltar.

Like I said, I hope you are correct.

Been in the software world long enough to not discount that what could be, should be, and would be can be different or the same.

I keep saying I am not a Google expert etc. I just note things in common and things that are unique.

The chips can fall where they fall.

After you remove that which can't be, then you have that what is.

1 + 1 can be other than 2 depending. You can do the math or the biology ;).

Dayo_UK

7:29 pm on Jun 28, 2005 (gmt 0)

I agree that the above example of links to the Google cache doesn't cause a problem, as they are excluded by robots.txt.

However, I know of at least one search engine that does allow the cache to be crawled... (deliberately or not, I don't know)

[google.com...]

Er (Not 100% sure if this link is allowed as it is a search engine that is sometimes discussed here - feel free to delete if you wish)

He he - I don't mean the link to Google - that is discussed here a lot :) Ansearch is sometimes discussed in the Oz/Asia forum, I guess.

stroudtx

8:32 pm on Jun 28, 2005 (gmt 0)

Our site has recently dropped drastically in G, and when I check the inurl I see 2 at first; then, when I show more, I can see they have at least 6 different versions cached. Some have PRs of 3 and most are PR 0. Our real site has a PR of 4.

I've emailed G several times and they have never responded.

Is this the same problem everyone is talking about here? If so, how can we give G the message that this is a big problem?

stroudtx

8:46 pm on Jun 28, 2005 (gmt 0)

As a follow up to my last post, does it make sense to make IIS do a 301 forward to our home page from a non-www request that is one of the two cached on G?

g1smd

9:34 pm on Jun 28, 2005 (gmt 0)

301 redirect - yes

to home page - no, no, no, definitely not

Use a 301 redirect from non-www to the www version of the same page.

Using Apache, this is a three-line directive in the .htaccess file.

Google for "301 redirect ISAPI windows" for IIS information.
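To make the "same page, not the home page" point concrete, here is a minimal Python sketch of the logic only. It is not Apache or IIS configuration; the function name and example.com URLs are made up for illustration, and it assumes the canonical host is simply the bare host with "www." in front.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def www_redirect(url: str) -> Optional[str]:
    """Return the Location for a 301, or None if the host is already www.

    Path and query string are preserved, so each non-www page maps to
    the www version of the same page rather than to the home page.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host.startswith("www."):
        return None  # already canonical (or not an absolute URL)
    netloc = "www." + host
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path or "/", parts.query, parts.fragment)
    )


# Hypothetical requests: the first needs a 301, the second does not.
print(www_redirect("http://example.com/widgets/page.htm?id=3"))
print(www_redirect("http://www.example.com/widgets/page.htm?id=3"))
```

A real site with other subdomains would need a stricter host check than this simple prefix test.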

MikeNoLastName

1:37 am on Jun 29, 2005 (gmt 0)

ciml,

>it's a different case when Google finds near duplicate content to your page - that's where you can be affected.

Um, a CACHED page IS by definition a 100% duplicate (plus a couple of % extra at the top added by Google)

>If Google don't crawl the near duplicate page, then the near duplicate page won't affect you in Google. No need to hope. :-)

Here's where I'm a little more skeptical. Are you saying that you can GUARANTEE that just because Google doesn't DISPLAY the title/description that the page content does not exist ANYWHERE in the index for comparison for duplicate content? I find it hard to believe that once it has it, G would pass up the opportunity to retain that info for personal reference, especially now that they're doing their utmost to eliminate scrapers copying other sites.

I have to admit IT does NOT show up when searching for the unique first line on our site (although 80 OTHER scraper sites which contain it DO show up ahead of us)

stroudtx

3:18 am on Jun 29, 2005 (gmt 0)

I looked at the site that sells this product, and it has a free light version that can do the job. How difficult is it to set up?

All I'd want to do is forward my non-www address to the correct www domain address.

Thanks - Great suggestion!

Mike

ciml

2:56 pm on Jun 29, 2005 (gmt 0)

Mike

> Um, a CACHED page IS by definition A 100% duplicate (plus a little couple % extra at the top added by Google)

And if the cached page isn't in the index, it isn't a duplicate in the index.

Google displays URL-only results because it has not fetched the URL. If Googlebot has not fetched the URL then it does not have the contents, and it does not have the opportunity to 'retain the info for personal reference'.

MikeNoLastName

7:27 pm on Jul 2, 2005 (gmt 0)

OK ciml,
I ALMOST bought into your explanation. However, I've just found what I believe to be proof to the contrary! If what you say is true, that if a listing is URL only, then G doesn't know what is on that page to compare for duplicates, then there should be NO URL only results in BACKLINK results, since by your definition G doesn't know what is on the URL only page in order to know that it links to another page. However, I have found numerous cases where there ARE URL only pages listed in backlinks! In fact, probably anyone can find them, they are quite common. So now what do you have to say?

steveb

9:01 pm on Jul 2, 2005 (gmt 0)

You won't find URL only listings for pages never cached, but you will find them for pages that were cached at one time.

Maybe another way to put it is that the backlink data is separate. Once Google knows of the backlink, the backlink stays even after the page with the link on it goes URL only, until the next backlink update.

g1smd

9:18 pm on Jul 2, 2005 (gmt 0)

Well, there is a simple way here:

Use the Google URL Console to submit Google's own robots.txt file to their own removal tool.

That will sort out the listings properly.

MikeNoLastName

8:05 pm on Jul 3, 2005 (gmt 0)

>You won't find URL only listings for pages never cached, but you will find them for pages that were cached at one time.

Therefore, a page previously cached, including a Google cache page itself which got accidentally indexed as explained at the beginning of this thread, COULD cause a duplication penalty, since the fact that it is URL only indicates it was previously cached, right?

I also just came across a page that they are indexing for us, which is of the format:
www.our-domain.com/?abc.htm

As per the rules, this resolves to the same page as:

www.our-domain.com/

but why isn't G smart enough to realize it is not supposed to be indexing it twice? I don't even know where they got the link from; it wasn't from US! When I click on the "similar pages" link next to it, our home page comes up first, so it is OBVIOUSLY considering them identical or pretty close to it.
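For what it's worth, here is a small Python sketch of why the two addresses look different to a crawler even though the server returns the same page. The function name is made up, and it assumes the server ignores the unused query string, which is typical for a static home page but not guaranteed.

```python
from urllib.parse import urlsplit


def same_page_ignoring_query(a: str, b: str) -> bool:
    """True if two URLs share scheme, host and path, whatever the query."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.hostname, pa.path or "/") == (
        pb.scheme,
        pb.hostname,
        pb.path or "/",
    )


plain = "http://www.our-domain.com/"
with_query = "http://www.our-domain.com/?abc.htm"

# As strings the URLs differ, so a crawler can list both...
print(plain == with_query)
# ...but "?abc.htm" parses as a query string on the same path "/".
print(same_page_ignoring_query(plain, with_query))
```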

Sounds like Bourbon has a LOT of bugs and I'm REALLY tired of being the target of them and penalized for something that is not my fault!

g1smd

8:09 pm on Jul 3, 2005 (gmt 0)

>You won't find URL only listings for pages never cached

Yes you can. I know of a site that has a lot of pages that are password protected. They all showed as URL-only entries until we excluded them with robots.txt.

MikeNoLastName

9:35 pm on Jul 3, 2005 (gmt 0)

I also know of at least one site which has had noindex, nofollow set for ALL search engines since its inception and its home page still shows up as URL only in the index twice (www and non-www).

MikeNoLastName

11:57 pm on Jul 3, 2005 (gmt 0)

>Well, there is a simple way here:
>Use the Google URL Console to submit Google's own robots.txt file to their own removal tool.

>That will sort out the listings properly.

What an excellent idea. I see there are about 4000 of these bad cache pages indexed for the single IP I was having a problem with.

I tried it. (At this point I'm desperate enough to try anything)

I submitted: 216.239.57.104/robots.txt to the URL console.

URL console said:

"We cannot process robots.txt files that contain Allow: lines. "

Next idea anyone?

steveb

1:56 am on Jul 4, 2005 (gmt 0)

"Yes you can."

You have to read the question to understand the answer. The response concerns backlinks.

But I did not state it correctly. A page that was previously *indexed* but has now gone URL-only can still show in backlinks. A page that was never indexed will not show as a backlink.

ciml

10:58 am on Jul 4, 2005 (gmt 0)

> So now what do you have to say?

You can see backlinks for URLs that were previously indexed, but are now URL-only. (I suspect that Steve meant to type "backlinks" instead of "URL only listings").

The canonicalisation for example.com/?whatever is for a separate thread, (one that we have had often).

> I also know of at least one site which has had noindex, nofollow set for ALL search engines since its inception and its home page still shows up as URL only in the index twice (www and non-www).

That's normal if the URL is also disallowed in /robots.txt

The reason is that Google won't fetch the URL if it is excluded in the /robots.txt, so it won't see the noindex,nofollow META tag and doesn't know to remove the URL.
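The interaction described here can be sketched as a toy model in Python. This is only an illustration of the reasoning in this post, not Google's actual pipeline; the function name, the state labels and the crude META tag check are all assumptions.

```python
from urllib.robotparser import RobotFileParser


def listing_state(robots_txt: str, url: str, page_html: str,
                  agent: str = "Googlebot") -> str:
    """Toy model of how a robots.txt-compliant crawler might list a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        # The page is never fetched, so any noindex META tag on it is
        # never seen; a URL known from links can remain as URL-only.
        return "url-only"
    html = page_html.lower()
    if 'name="robots"' in html and "noindex" in html:
        # Fetched, and the page itself asks not to be indexed.
        return "dropped"
    return "indexed"


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
blocked = "User-agent: *\nDisallow: /\n"
unblocked = "User-agent: *\nDisallow:\n"

print(listing_state(blocked, "http://www.example.com/", page))
print(listing_state(unblocked, "http://www.example.com/", page))
```

Under this model, the way to get a noindex tag honoured is to let the crawler fetch the page, i.e. not to exclude it in /robots.txt as well.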

excell

5:22 am on Jul 27, 2005 (gmt 0)

Dayo_UK mentioned: [quote]I agree that the above example of links to the Google cache dont cause a problem as they are excluded by robots.txt.
However, I know of at least one search engine that does allow the cache to be crawled... (delibrately or not I dont know)....
[/quote]

This indexing of the Ansearch cached pages by other search engines seems to be an increasing problem, not just in AU but in many other countries (UK, US, NZ, etc.) as they roll out more search engines and directories.

I'm just wondering what the potential long-term impact of this duplicate data is going to be for smaller websites (those that are not strong in SE positioning). It's very possible that the likes of Google could drop the real pages as duplicates.

Does anyone know if Ansearch is the only supposedly mainstream SE that caches and creates pages made up of other websites' data without protecting them from indexing with robots.txt? Or are there other similar ones that we should be aware of? (I don't mean typical scraper sites etc.)