|301'd pages last forever as WMT internal linking pages|
First I'll say, for this particular site, every page links to the home page three times: two links with the text "Home" and one link with a keyword-free, generic brand name for the site. Why? Obviously for the convenience of the visitor. (Is this considered bad practice these days? Linking to Home?)
Webmaster tools (WMT) and a Google quirk.
Using Webmaster tools, "Search Traffic", "Internal Links", then clicking on the report for the Home Page of the site; this report shows more pages than exist on the site, as internally linking to the site. How many more? Very close to the number of removed pages that 301 redirect to a page on the site (whether the page exists or not).
I know Webmaster Tools has many quirks, but I believe when it is apparent that "tools" must be pulling data directly from Google's databases one has to believe the data is accurate. Everything in the story below is validated by actual log content from the site. So when I say a 301 was returned, that is what was reported by logs. Logs have been kept for this site since mid 2004.
A story of a couple 301 redirects.
Once upon a time (sometime before 2009) there was a page named a-b-c.htm, it was renamed A-B-c.htm (for reasons forgotten) and the original was redirected with a 301 code in htaccess. This 301 redirect is still in the htaccess today. There have been no links to a-b-c.htm on the site since 2009. The target page for the redirect, A-B-c.htm was removed from the site and a 410 GONE was reported for the page; this was done at the end of June 2013. Googlebot crawled the a-b-c.htm page two more times, was redirected to A-B-c.htm, where a 410 GONE was returned. Other bots still do crawl a-b-c.htm. I know some bots seem to have trouble with case (they ignore it!).
Googlebot has not crawled a-b-c.htm since Jul 7th 2013.
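For anyone who wants to check their own logs the same way, here is a minimal sketch of pulling the last Googlebot request for a path out of an access log. It assumes the Apache "combined" log format; the function name, the sample lines and the IP address are invented for illustration and are not taken from the site's real logs. Note that it matches on the user-agent string only, which anyone can fake.

```python
import re
from datetime import datetime

# Assumed: Apache "combined" log format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def last_googlebot_hit(lines, path):
    """Return (timestamp, status) of the latest Googlebot request for path, or None.

    The path comparison is case-sensitive, so /a-b-c.htm and /A-B-c.htm
    are treated as different URLs.
    """
    latest = None
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        if m.group("path") != path or "Googlebot" not in m.group("agent"):
            continue
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        if latest is None or when > latest[0]:
            latest = (when, int(m.group("status")))
    return latest

# Invented sample lines (hypothetical data, documentation IP range).
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    '192.0.2.10 - - [01/Jul/2013:10:00:00 +0000] "GET /a-b-c.htm HTTP/1.1" 301 0 "-" "%s"' % GOOGLEBOT,
    '192.0.2.10 - - [01/Jul/2013:10:00:01 +0000] "GET /A-B-c.htm HTTP/1.1" 410 0 "-" "%s"' % GOOGLEBOT,
    '192.0.2.10 - - [07/Jul/2013:08:15:00 +0000] "GET /a-b-c.htm HTTP/1.1" 301 0 "-" "%s"' % GOOGLEBOT,
    '192.0.2.20 - - [20/Aug/2013:12:00:00 +0000] "GET /a-b-c.htm HTTP/1.1" 301 0 "-" "SomeOtherBot/1.0"',
]

if __name__ == "__main__":
    print(last_googlebot_hit(SAMPLE, "/a-b-c.htm"))
```

The later request from the other bot is ignored, so the result reflects only what Googlebot (by user agent) last saw for that exact path.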
YET to this day the page a-b-c.htm (gone since 2008) is still reported as an internal page linking to the home page. Oddly, the page A-B-c.htm is not reported as an internally linked page today. Other pages (which no longer exist), with proper 301 redirects, are listed in the WMT internally linking pages report and even show an appropriate preview of the page redirected to. And in fact, the number of pages the internal links report indicates is the actual number of pages on the site plus all the pages that are now 301 redirected to other pages. Google probably does keep track of these old (non-existent) pages to make sure the redirects aren't abused in some way. I suppose the person that designed the WMT "internal links" report may not have realized this database contained this basically outdated information when considering the "Internal Links" perspective. But then one also has to ask: is Google actually considering these non-existent pages and links? It's certainly likely Google has archived these old pages.
The Google site: command intermittently corroborates this incorrect internal links page count from WebMasterTools. If the site: command is used on this site, typically the number of pages reported is fairly accurate, but randomly, the number of pages reported for this command approximates the number of pages indicated by the internal links report. It's not something I can reproduce, but I have seen it.
My fix for this will be to set up a 410 Gone return for all these pages (they are GONE), and then, to make sure Google eliminates them, I will link internally to these non-existent pages until Google attempts to crawl them at least 3 times. 410 GONE does seem to reliably stop Google from crawling a page. But my goal is having these pages truly disappear from the WebMasterTools report.
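The "crawled at least 3 times" check could be tallied from the logs along these lines. This is only a sketch: the threshold of three is my own rule of thumb from above, and the function name and the (path, user agent, status) tuple shape are made up for the example.

```python
from collections import Counter

def paths_seen_gone(hits, min_crawls=3):
    """Return the set of paths Googlebot has fetched with a 410 at least min_crawls times.

    hits is an iterable of (path, user_agent, status) tuples; that shape is
    an assumption for this sketch, not a standard log format.
    """
    counts = Counter(
        path
        for path, agent, status in hits
        if status == 410 and "Googlebot" in agent
    )
    return {path for path, n in counts.items() if n >= min_crawls}

if __name__ == "__main__":
    # Hypothetical data for illustration only.
    demo = [
        ("/a-b-c.htm", "Googlebot/2.1", 410),
        ("/a-b-c.htm", "Googlebot/2.1", 410),
        ("/a-b-c.htm", "Googlebot/2.1", 410),
        ("/old.htm", "Googlebot/2.1", 410),
        ("/old.htm", "SomeOtherBot/1.0", 410),
    ]
    print(paths_seen_gone(demo))
```

Once a path shows up in that set, the temporary internal link to it can be removed.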
Also, I just wanted to pass this observation about 301'd pages on.
P.S. I'm practicing run on sentences with BIG words and acronyms, I hear Google considers these at least intermediate? Hey, hey.... It's also astonishing how many pages on the web with virtually no content are "Advanced". But I digress.
It has been my experience (since 1996) that no search engine (and I mean ALL OF THEM) ever forgets a URL it has crawled.
Whether it is an accident that these old urls get back into the crawl, or deliberate, that I can't say, but they are definitely still in their index. I can say that pages GONE 410'd "way back when" are still, from time to time, (re)appearing in my logs.
I don't think there is any way to make them go away.
Conversely, on the side of the search engine (any of them), I wouldn't take a website's word that the page was REALLY gone... it might come back and then what?
These days I just ignore it, keep the site(s) clean and move on.
I have no problem with Google remembering all these pages, but what is important is these non-existent pages are still considered by webmaster tools, and Google, as pages that "internally" link to the home page of the site; this is not the case!
In one case, this has not been true since 2008; in the others, not for at least 5 months.
Is this a bug?
Of course there are many bugs, but the only way to eradicate them is to point them out.
Publishing them (bugs) in WebMasterWorld is more effective than contacting Google directly.
|there was a page named a-b-c.htm, it was renamed A-B-c.htm (for reasons forgotten) and the original was redirected with a 301 code in htaccess. This 301 redirect is still in the htaccess today. There have been no links to a-b-c.htm on the site since 2009. The target page for the redirect, A-B-c.htm was removed from the site and a 410 GONE was reported for the page |
Now, wait a minute, that's another version of the redirect chain.
a-b-c >> A-B-c
A-B-c >> 410
should be replaced by
a-b-c >> 410
A-B-c >> 410
in parallel. Or, if you prefer, (a-b|A-B)-c >> 410
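To spell out what that pattern does, here is a small sketch in Python rather than htaccess syntax. The leading slash, the .htm extension and the function name are my assumptions; the point is only that one case-sensitive pattern covers both spellings and nothing else.

```python
import re

# One pattern for both spellings; deliberately case-sensitive.
GONE = re.compile(r"^/(a-b|A-B)-c\.htm$")

def status_for(path):
    """Return 410 for either retired URL, or None if this rule does not apply."""
    return 410 if GONE.match(path) else None

if __name__ == "__main__":
    for p in ("/a-b-c.htm", "/A-B-c.htm", "/a-B-c.htm", "/index.htm"):
        print(p, status_for(p))
```

Both old names get the 410 directly, so no request ever has to pass through a redirect first.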
"via this intermediate link" is weird. I've talked about this elsewhere in the context of moving sites. They will say both
onetwothree "via this intermediate link" fourfivesix
fourfivesix "via this intermediate link" onetwothree
even though fourfivesix has never redirected to onetwothree. Only in the other direction.
|it might come back and then what? |
Then there would be newly discovered links to it, wouldn't you think? But I do realize that a search engine's mind does not work like yours and mine.
If they are being redirected to pages that no longer exist, then the redirect can't work.
A redirect doesn't mean you go there. It only means the browser is instructed to make a new request. There's no prior information about whether the new request will be any more successful than the old one.
But if A redirects to B, and then later B is removed, then the 410 should be served at both A and B.
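One way to catch this kind of oversight is to walk the redirect map and list every source whose final target is gone. A rough sketch, assuming the redirects are available as a simple source-to-target mapping and the removed pages as a set; both structures and the function names are hypothetical.

```python
def resolve(path, redirects, gone, max_hops=10):
    """Follow redirects from path and return (final_path, status).

    Status is 410 if the final path is in gone, otherwise 200 is assumed.
    """
    seen = set()
    hops = 0
    while path in redirects:
        if path in seen or hops >= max_hops:
            raise ValueError("redirect loop or chain too long at %s" % path)
        seen.add(path)
        path = redirects[path]
        hops += 1
    return path, (410 if path in gone else 200)

def sources_to_retire(redirects, gone):
    """Return redirect sources that should serve 410 themselves."""
    return sorted(
        src for src in redirects
        if resolve(src, redirects, gone)[1] == 410
    )

if __name__ == "__main__":
    # Hypothetical map: A redirects to B, and B was later removed.
    redirects = {"/a-b-c.htm": "/A-B-c.htm", "/old.htm": "/new.htm"}
    gone = {"/A-B-c.htm"}
    print(sources_to_retire(redirects, gone))
```

Anything this returns is a redirect that leads nowhere and can be switched to a 410 at the source.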
|But if A redirects to B, and then later B is removed, then the 410 should be served at both A and B. |
I agree entirely. This one case was simply an oversight, one that I imagine happens to webmasters frequently. But should Google report this page, a page that has been gone for 5 years, as a page that links internally to the site?
In addition to this one unusual case, correctly 301 redirected pages have been in this report for 5 months now. (I never really looked this deeply into this report before.)
I'd be interested to know if anyone else sees pages in this report that have not existed for a long time, listed as pages linking internally to another page.
Looking at this report:
|Webmaster tools, "Search Traffic", "Internal Links", then clicking on the report for the Home Page of the site; |
Should Google be listing non-existent pages as internally linking to other pages?
I'll certainly be fixing all of these cases with 410's, but most are just properly 301 redirected pages.
|A redirect doesn't mean you go there |
Well, it means that the browser will try to go there. But in this case there's nowhere to go.
But this is a question of semantics. I'm sure that we both understand how it works.