Google & Accuracy Of Site Indexing

austtr

5:06 am on Jan 29, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google has always encouraged us to let deleted pages return a 404 status… don't obsess over them, they are natural, it’s a sign of a healthy, normal site, they'll disappear from the index in good time once Google is certain the 404 status is genuine… etc etc. In short, trust Google to know what they are doing. Fair enough… happy to do that.

But I am more than a little curious when, using the site: command to check the indexing of a site, Google says it is still indexing pages that went 404 years ago, in some cases more than 15 years ago… So what happened to the advice that 404s drop from the index in good time?

On the same subject, same site…. why does the GSC sitemap report show that as many as 40% of the entries in the sitemap.xml are not being indexed?

Am I missing something?

FranticFish

8:39 am on Jan 29, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have noticed that Google can seem to ignore the pages you want indexed and then go out of its way to index non-existent pages (years back, result pages from forms that weren't locked down could seriously hurt a site's traffic by filling up the index with soft 404s).

Is it possible that they are getting signals from elsewhere that are confusing them about which pages are important and which aren't?
- links to pages that are now 404
- site architecture doesn't indicate that new pages are as important as you think they are

engine

11:27 am on Jan 29, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Could those 404s be linked from somewhere else? If so, Google's just feeding off the links.

RedBar

12:58 pm on Jan 29, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



in some cases more than 15 years ago


I have the same.

Could those 404s be linked from somewhere else? If so, Google's just feeding off the links.


This is precisely my situation with old-time hotlinkers and trade references that haven't been updated since before 2000.

I don't worry about it.

Andy Langton

1:07 pm on Jan 29, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google definitely has a record of older site iterations. What particular form that record takes is something of an unknown quantity. Even with just Webmaster Tools crawl error data you can follow the loop of old URL >> linked from >> another old URL >> linked from >> another old URL and never arrive at an external URL or live URL. This data seems more prominent in Webmaster Tools than it used to be, but Googlebot has always been requesting these URLs anyway.
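
You can see this directly in raw server logs, too. Here's a minimal Python sketch, assuming a combined-format access log; the path is a placeholder, and matching the user-agent string alone is naive (verified Googlebot identification needs a reverse DNS check):

```python
import re
from collections import Counter

# Placeholder path -- point this at your own combined-format access log.
LOG_PATH = "access.log"

# Combined log format:
# host ident user [date] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

dead_hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        m = LINE_RE.search(line)
        # Naive UA match; a proper check should verify via reverse DNS.
        if m and "Googlebot" in m.group("ua") and m.group("status") == "404":
            dead_hits[m.group("path")] += 1

# The URLs Googlebot keeps requesting even though they 404.
for path, count in dead_hits.most_common(20):
    print(f"{count:6d}  {path}")
```

Run that against a few months of logs and the long-dead URLs Googlebot still asks for float straight to the top.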

I've always wondered how much impact is implied. Of course, you take a hit if you change site URLs and don't migrate them. But this is what the overwhelming majority of sites do, because migrating URLs is rarely a task undertaken by developers. There must be some value in maintaining all of this, presumably? Cool URIs Don't Change [w3.org], of course ;)

why does the GSC sitemap report show that as many as 40% of the entries in the sitemap.xml are not being indexed


Sitemaps are a "hint" to Google as to content that might be worth indexing - not a forceful measure to get content into their database. Google rarely indexes 100% of sitemap-linked pages. On Wordpress, for instance, most people have an auto-generated sitemap that is practically guaranteed to submit useless pages that won't be indexed. To find out whether you have a practical or theoretical problem, you need to find out which pages in the sitemap aren't indexed - see the sketch below for a first pass.
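
As a rough first pass, you can at least weed out sitemap entries that can't be indexed because they no longer resolve. A sketch along these lines - the sitemap URL is a placeholder, it assumes a plain urlset sitemap (not a sitemap index), and an HTTP check can only rule pages out; it can't tell you what Google has actually indexed:

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder -- substitute your own sitemap URL.
SITEMAP_URL = "https://www.example.com/sitemap.xml"
SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

urls = [loc.text.strip() for loc in tree.iter(SM_NS + "loc")]
print(f"{len(urls)} URLs in sitemap")

for url in urls:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as r:
            status = r.status
    except urllib.error.HTTPError as e:
        status = e.code
    # urlopen follows redirects, so anything flagged here is a hard
    # failure (404/410/5xx) -- a sitemap entry Google cannot index.
    if status != 200:
        print(status, url)
```

Whatever survives that cull but still isn't indexed is where the real question lies.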

austtr

5:18 am on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The site gets a clean bill of health from Screaming Frog, Integrity etc… there are no broken links generating 404 status codes that might misdirect the indexing. There are hundreds of defunct inbound links out there on 3rd party sites, but indexing of the site is not influenced (or shouldn’t be) by other webmasters' poor housekeeping. Those links are reported in GSC as crawl errors and are not related in any way to indexing.

Try to find the obsolete URLs that Google insists are in the index (a text search for the URL in inverted commas) and the search result shows nothing found. Great… Google seems to be saying: ”you know those URLs we keep telling you about in your index… well, they don’t actually exist.”

Hence the OP… is there a question over the accuracy of Google’s indexing? If they have built a profile of a site that is wrong, is there potential for that site to be adversely affected? Is there a way to "flush and refresh" the index?

aristotle

12:36 pm on Jan 30, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The site: operator tends to be unreliable. If you want to know for sure whether a page is in Google's index, search for a snippet of text (in quotes) from its content.
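
If you want to script that snippet step, you can pull a run of words from the page itself and paste the quoted result into Google. A standard-library sketch - the URL is a placeholder, and the word offsets are arbitrary choices, not anything Google prescribes:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# Placeholder -- the page whose indexing you want to verify.
URL = "https://www.example.com/some-page.html"

with urllib.request.urlopen(URL) as resp:
    html = resp.read().decode("utf-8", errors="replace")

parser = TextExtractor()
parser.feed(html)
words = " ".join(parser.chunks).split()

# Take a run of words from the body copy; mid-page text is usually more
# distinctive than the boilerplate navigation at the top of the page.
start = min(50, max(0, len(words) - 12))
snippet = " ".join(words[start:start + 12])
print(f'Search Google for: "{snippet}"')
```

No results for a genuinely distinctive snippet is a much stronger signal that the page isn't indexed than anything site: tells you.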

Here's a phenomenon that I've observed occasionally over the years: an old, well-established, high-ranking page will completely disappear from Google's search results for a day or two, then suddenly re-appear with its old rankings. I've also seen this happen with Bing. Does anyone have an explanation for this?