|Removal Requests expiring - dupe pages reappearing|
Because of a very brief domain name duplication back in October 2012, Webmaster tools tells me I have thousands of incoming links from a bogus domain we own.
Briefly, widgetworld.com and mywidgets.co.uk were serving the same pages. The former was an error, but we quickly got on it. Any request for any page at widgetworld.com results in a 410 - Gone, and through Webmaster Tools we've submitted both site removal and page removal requests on widgetworld.com for any page that links to mywidgets.co.uk.
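For anyone curious what "410 everything" looks like in practice, here's a minimal sketch as a WSGI app. This is a hypothetical illustration, not our actual server config (we do it at the web server level); the robots.txt carve-out just reflects that robots.txt itself still returns 200.

```python
# Minimal WSGI sketch: every path on the retired domain returns
# 410 Gone, except robots.txt which is still served with a 200.
# Hypothetical illustration only, not the actual server setup.

def app(environ, start_response):
    if environ.get("PATH_INFO") == "/robots.txt":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"User-agent: *\nDisallow:\n"]
    start_response("410 Gone", [("Content-Type", "text/plain")])
    return [b"Gone"]
```

The same effect is usually achieved with a blanket rule in the web server config rather than application code.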
So about every ten days the number of inbound links reported in Webmaster tools from widgetworld.com to mywidgets.co.uk drops from the high thousands to the low thousands. And we get better organic results and everyone's happy. No longer is the site dragged down by these bogus old duplicate links.
Then, the process reverses itself. Why? Because Google expires all the removal requests for widgetworld.com; widgetworld.com returns to its top spot for incoming (duplicate) links, and down we go. And the process starts again: all the URL removal requests have to be resubmitted. It's a sort of splat-the-rat game. The removals generally last about ten days with the status 'removed' before they're 'expired'.
I don't really understand what 'expired' can mean, but Google's determination to keep these old, gone, dead, unwanted links alive is really hurting us.
Any suggestions would be gratefully received.
Have you thought about redirecting instead of 410?
welcome to WebmasterWorld, partyark!
have you checked to see if you have any inbound links to these widgetworld urls?
Let me get this right - even if I remove URLs with Google Webmaster Tools and block them in robots.txt, they may still get indexed?
Sure. If someone links to them, or they're discovered some other way.
If you don't want them indexed, you need to slap a robots meta tag of NOINDEX on them.
And then you need to remove the disallow in robots.txt, because if Google isn't supposed to crawl them, they won't see the NOINDEX.
robots disallow is for crawling.
robots meta tag is for indexing.
Crawling and indexing are not the same thing.
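You can see the crawl side of that distinction with Python's stdlib robots.txt parser: with a blanket Disallow, a well-behaved bot won't fetch any page at all, so it can never see a 410 status or a NOINDEX meta tag on those pages. (The domain and path here are just the example names from this thread.)

```python
# With "Disallow: /", a compliant crawler may not fetch anything,
# so it never sees the 410 / NOINDEX on the pages themselves.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "http://widgetworld.com/old-page"))  # False
```

Drop the blanket Disallow and `can_fetch` goes back to True - which is exactly when the bot can finally discover the 410 or the NOINDEX.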
If they serve a 410 will they not be dropped from the index? Provided robots.txt isn't blocking the crawlers from seeing that of course.
In theory, but you never know. I'm still occasionally getting URLs from a version of a site from four+ years ago in GWT. Google never really forgets.
Thanks for the replies.
To be clear, robots.txt is correctly disallowing any crawling on widgetworld.com (the bogus domain). Apart from robots.txt (200) it has been 410'ing all requests for anything for some time.
According to Webmaster Tools, there are no incoming or external links to widgetworld.com. It has no keywords, and total indexed pages is zero. But despite this, it's clear that somewhere in the Google engine it's very much alive.
I think what's happening is that the RemoveURL requests are "expired" because they're returning 410 - or perhaps because there's a site removal on widgetworld.com too. As soon as that expiry is triggered, which usually takes about ten days from the remove request, they re-appear as incoming links to my real domain... and then my domain takes a pasting in the organic results and the whole process of requesting removals starts again.
One extra phenomenon - on widgetworld.com (which I'm really trying hard to remove from the index) there is a list of 'Crawl Errors' in webmaster tools.
This has a bunch of 410'd pages like I'd expect ... but it has got stuck. It hasn't added any since early May, when it reached 1,000, and the graph is completely flat-lined since that time. It looks like there's some sort of hard limit on the number of pages in error.
UPDATE: The failure to grow the list of 410s coincides with adding a robots.txt directive to disallow crawling - that sort of makes sense. Except that it directly contradicts Google's own guidelines on removing URLs, which state that a robots.txt entry is a good thing.
I'm going to try to allow crawling and see if I can get more pages added to the Crawl Errors list.
[edited by: partyark at 7:57 pm (utc) on Jun 10, 2013]
What do you see in your server logs? Does googlebot fetch those pages?
As netmeg writes above: if you disallow crawling of all pages in robots.txt then googlebot never will try to fetch the pages, and thus never will see the 410. You need to remove the disallow in robots.txt so it can fetch the pages and then it will see the 410 status code.
I have a vague memory that Google is supposed to not play well with 410s.
If the crawl-and-noindex doesn't do it, perhaps try a 404 for a bit and see if it helps?
But, unless widgetworld.com has bad links that you don't want affecting the correct domain, I would have 301ed each page to the correct page, myself. :)
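For the record, the per-page 301 approach is trivial to sketch; something like this WSGI handler (hypothetical, using the example domains from this thread) maps every old path onto its counterpart on the correct domain.

```python
# Sketch of the per-page 301 approach: redirect every path on the
# old domain to the same path on the correct domain. Hypothetical
# illustration; domain names are the example ones from this thread.

def redirect_app(environ, start_response):
    target = "https://mywidgets.co.uk" + environ.get("PATH_INFO", "/")
    start_response("301 Moved Permanently", [("Location", target)])
    return [b""]
```

In practice this is normally a one-line rewrite rule in the server config rather than application code.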
Thanks, we did 301 for about 4 months. This is a sensible thing to do if the pages were never improperly indexed. However, if these pages have previously been indexed with 200s, they still remain as dupes in the internals of Google. And you still get penalised. And worst of all, the 301 doesn't do anything to clear the old page out of whatever corner of Google the original page is stored.
What we need is some way of telling google 'please take this page out of your link graph'. RemoveUrl won't do it, that's just for the results pages. In fact the only way I can see is to wait for google to crawl your page and if it consistently returns a 410/404 and/or NoIndex then after a few months you might be lucky enough to see it pulled out. But there seems very little you can do to encourage this crawling, and for a site that you're trying to 'demote' the crawling can take years.
And even then it's not really out. I can look at each of the 1,000 pages that have now been crawled as 410, and see what links to that page.... and there are thousands more internal pages even though 'internal links' are reported as zero.
The lesson is: once in, (almost) never forgotten.
In my experience, proper 301s disappear a lot faster than 404s or 410s.
I just looked in my logs for a domain that I replaced 5 or 6 years ago, with a 301 on everything to the matching page - googlebot is still hitting the old domain. Hmmm... not what I expected.
Have you thought about a Change of Address in Webmaster Tools?
netmeg - thanks, do you mean 'disappear from search results' or 'disappear altogether'? The former is easily done, the latter rather less so. I'm also now pretty certain that a 301'd page that was previously indexed will still be retained in some form within the google index, even if it doesn't appear in the results.
leadegroot - yes we did 'Change of Address'. Didn't make any difference. Interesting that you are getting requests for pages that have been extinct for ages.
The only thing I can say with certainty is that pages will stop showing up as incoming external links to my core domain if they return a 410 enough times (or NoIndex). The question is then whether they'll still be harming my organic rankings because of duplicate issues.
|I have a vague memory that Google is supposed to not play well with 410s. |
If the crawl-and-noindex doesn't do it, perhaps try a 404 for a bit and see if it helps?
No, it plays very well with 410s [webmasterworld.com], always has.
partyark, if the 410s are being correctly served then the pages should remain out of the index. Exactly how are you serving the 410 status?
They're being served with a 410 http response; when they're crawled they end up in Crawl Errors with a 410. So that's working ok.
Returning a 410 (or 404, or 301) does do what you'd expect: the page no longer shows up, and perhaps some of its 'juice' flows to where you want it to if you're 301'ing. Except it's not the whole story...
What I know so far:
RemoveURL's documentation is misleading. It says that you should back up any RemoveURL request with any or all of a) a robots.txt exclusion; b) a 410/404; c) a meta NoIndex. However, if you do a) your page will just be added to the Uncrawlable stack. It will go from the results, but Google will say "I can't see this page any more, so I'll assume it's still linking to whatever it did". If you do b) Google says "Hey this page is GONE I better remove any links, but I'll still hang on to the page content just in case."
So how do I encourage Google to go and look again at the pages and completely remove them from its index? I've actually got quite a few of these bogus domains (yeah, yeah) so I can do a bit of testing.
What I've done for one bogus domain is to allow crawling through robots.txt, but to continue to return 410s. What I think will happen is that crawling will happen very slowly, and once Google is sure the page is "gone" its links seem to get culled. However, I'm reasonably sure that the page content is still stored, so the dupe issue might not be solved.
On the second domain, I've allowed crawling and all pages now return 200 with a NoIndex, and an empty BODY. I'm hoping that this new, empty content will replace the old stuff, and that the NoIndex will have the effect of removing links.
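For clarity, here's a sketch of what that second experiment looks like as a WSGI app: every URL answers 200 with a robots NOINDEX meta tag and an otherwise empty body. Hypothetical illustration only.

```python
# Second experiment: every URL returns 200 OK with a NOINDEX meta
# tag and an otherwise empty body. Hypothetical sketch only.
NOINDEX_PAGE = (
    b"<!DOCTYPE html><html><head>"
    b'<meta name="robots" content="noindex">'
    b"<title></title></head><body></body></html>"
)

def noindex_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])
    return [NOINDEX_PAGE]
```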
On both domains I've put in some entries into sitemaps to see if that encourages crawl rate.
|If you do b) Google says "Hey this page is GONE I better remove any links, but I'll still hang on to the page content just in case." |
Yeah, Google does like to hang on to stuff, but that does not mean keeping things in its index if not necessary. If you click the link I posted above you'll see what Google's John Mueller has to say about 410s.
It's a simple process that seems to have become complicated in this case. All you have to do is return a *valid* 410 response. There's no need to mess around with robots.txt because if the pages are gone, well, they can't be crawled.
What I was trying to get to when I asked "Exactly how are you serving the 410 status?" was technically, exactly how are you returning that response and that's because way up above you seemed to conflate robots.txt and server responses:
|To be clear, robots.txt is correctly disallowing any crawling on widgetworld.com (the bogus domain). Apart from robots.txt (200) it has been 410'ing all requests for anything for some time. |
Look into sitemap assisted redirects for clearing out the old URLs of pages that have been 301'd.
Basically, you feed a sitemap to the 301'd URLs so Google finds them... once your old pages are out of the serps, you then update to the correct sitemap.
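A sketch of building such a sitemap with Python's stdlib, listing the old (now 301'd or 410'd) URLs so crawlers revisit them sooner. The URLs are placeholders using the example domain from this thread.

```python
# Build a sitemap listing the OLD urls so the crawler re-fetches
# them and sees the 301/410. Placeholder URLs; sketch only.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = u
    return ET.tostring(urlset, encoding="unicode")

xml_out = build_sitemap([
    "http://widgetworld.com/old-page-1",
    "http://widgetworld.com/old-page-2",
])
```

Once the old pages are out of the SERPs, you swap this out for the correct sitemap, as described above.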
|So how do I encourage Google to go and look again at the pages and completely remove them from its index? |
Remove the robots.txt block and step aside quickly. That's all it takes. Well, it worked for me on a slew of microscopic-traffic pages that had been roboted-out for a year. So it really shouldn't be necessary to do anything more.
Yes, Google's own documentation is misleading. In this case you have to translate it as "do as we DO, not as we say" :(
If there's no page then 404 or 410 is the reality for me. I use the 410 when I want people to know definitely - it's gone.
10 days? Thought it was 90. Option: search results and cache?
Any page still live - <meta name="robots" content="noarchive" />
In my case I find someone's linking to the page from somewhere. Sometimes, depending on where the link is coming from, I redirect the referrer to an index or some page with an explanation.