Forum Moderators: Robert Charlton & goodroi
Unfortunately, after the 6 months are up those pages and sites which I removed from Google have reappeared in the index in their old format - some of the cache dates go back to September last year. But the robots.txt files still disallow Googlebot from either specific pages or entire sites, and Google seems to be totally ignoring this. I would have thought that, after the temporary time-frame has elapsed, Google would check the robots.txt files again and if they still didn't allow access, then the pages would remain out of the index.
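One way to rule out a robots.txt mistake before blaming Google is to test the file's rules offline. The sketch below uses Python's standard `urllib.robotparser`; the domain, paths and robots.txt contents are made up for illustration. It only shows what the file asks crawlers to do, not what Google actually keeps in its index.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Hypothetical robots.txt blocking one page and one folder for Googlebot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /old-page.html
Disallow: /private/
"""

for path in ("/old-page.html", "/private/report.html", "/index.html"):
    url = "http://www.example.com" + path
    print(path, "allowed" if googlebot_allowed(ROBOTS_TXT, url) else "blocked")
```

If the pages you removed show as blocked here, the file itself is doing what it should, and the reappearing listings are coming from data Google already held.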
But apparently they never left the index at all, but only the SERPs. Google retained the information that they had at the time of requesting removal, and now some of those pages are re-appearing in SERPs.
Has anyone else had this happen to them? Does anyone have any advice on how to permanently remove URLs from Google, or am I going to have to manually submit a list of robots.txt files to their URL removal tool twice a year for the rest of my life?
Cheers
John
I deleted the old pages using the URL removal tool, and now they are completely back and rank higher in the SERPs than the real pages.
I attempted to re-remove them, and the request is totally ignored by the tool.
So now I'm waiting for further dup content penalties... as if there could be more than I already have :(
When the six months had passed, my URLs reappeared. Those that were still online got the most recent snippet immediately, but those that have been returning 404 for the last six months reappeared with the last snippets known to Google.
I re-removed them with URL Console, and they're gone again. I'm afraid they'll be back in November, unless Google engineers do something about this.
They appear as full entries, with cache, title and description, but all from a very old version of the page as Googlebot can no longer fetch the pages due to robots.txt exclusions. Some of the caches go back to March 2004.
While I've been able to go through the removal tool again and take most out, since I originally removed the pages 6 months ago I have set up 301 redirects on some sites, which means that I cannot use the removal tool for pages on those sites: Google will be unable to retrieve a robots.txt file for the old domain, as the request will be redirected to the new domain.
Any suggestions on this? I've contacted Google through their "contact us" forms and heard nothing back, but I'd really rather not wait for GoogleBot to notice that the pages in its index are either redirected, or actually barred using robots.txt.
cheers
John
The new Google Sitemaps feature seems to be the answer for these types of problems. It's like a robots.txt on steroids that you register with Google, showing how you want your site indexed.
Wizard, good tip on the RewriteRule.
Reid, I can't use a 410 because the pages are not actually gone, I just don't want them in Google's index. It does work though, I've used it on pages that actually are "gone" in the past.
I agree that the new sitemaps facility is a great step forward, but I think that it's more for telling Google about new pages, or pages that may be more difficult for it to find, than about pages that you don't want indexed.
cheers
John
Wonder if this is something to do with fixing the redirect hell they created a few months ago?
I'm almost certain it is the cause of a duplication penalty which is bringing our entire domain-b.com down. I've written G support, tried resubmitting domain-a.com/, putting links to it from heavily spidered pages, even replacing a real page there and removing the redirect to domain-b.com. This caused the old home page www.domain-a.com/ to get re-indexed as itself (as expected) in less than 2 days as a PR6, but domain-a.com/ still remains! Googlebot appears to visit it daily and get the 301, but never wants to remove it! We've tried everything short of url console remove. I'm really scared to try a url console remove since I've heard there are some issues with G removing BOTH the domain-a.com AND the www.domain-a.com when one or the other is specified, which we definitely CAN'T risk at any cost. Also I suspect a 404 error caused by removing the page temporarily will do little more than lose us the PR generated by all our old links to this page.
Any suggestions for SAFELY getting it recognized as being redirected? If not, I'm thinking of pointing it at a competitor, or perhaps G, to see if it brings them down instead!
Things were perfect at the start of May. Google had picked up all the redirects and things were all listed correctly. A fake sitemap, installed on another site, pointing to URL versions for pages that we didn't want listed had helped greatly.
In late May, right at the start of update, Google suddenly listed all four versions of every page for the site (with and without www, and with and without the trailing / on the URL - every page of the site is an index page in a folder) and it stayed that way for several weeks. The three extra versions were all without title and description. The 118 "real" pages were all with title and description.
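To make the "four versions of every page" concrete, here is a small sketch that folds the with/without-www and with/without-trailing-slash variants onto one canonical URL. It assumes, as described above, that every page is an index page in a folder, so appending a trailing slash is always safe; `example.com` and the `widgets` folder are placeholders, and the choice of the www form as canonical is an assumption too.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Map any of the four variants to the www + trailing-slash form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host          # assumption: the www form is the one to keep
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"                   # safe only because every page is a folder index
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


# The four listings Google showed for a single (hypothetical) page.
variants = [
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets",
    "http://example.com/widgets/",
    "http://example.com/widgets",
]

for v in variants:
    print(v, "->", canonical_url(v))
```

A list of listed URLs run through a function like this shows quickly how many distinct pages are really behind a pile of duplicate entries.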
The problem fixed itself about a week ago. Nothing on the site or in the links was changed.
In the last few days that 64.233.167.104 datacentre has reverted to showing the "broken" listings again - a mix of www and non-www, many duplicates, many pages missing, etc - and the cache date for all those pages is 6 months ago.
For other searches, old pages are appearing in the SERPs, pages that should not appear for that search term because the content of those pages no longer contains that information. Again, each of those pages has a six-month old cache.
For pages that are new online (first published in the last 3 or 4 months or so), pages that have been "fully indexed" {full title and description, and cached at least weekly} since being online, the pages now appear as URL-only entries in the 64.233.167.104 datacentre SERPs.
The basis for the data at that IP is an index from early January, with some newly spidered information added in. The data predates the time that much of the redirect problem was seen, but for sites that have fixed their redirect problems the older data reverts back too far in time.
Back a short time ago during this update a huge pile of old stuff started surfacing under our IP addy.
If you were to multiply the number of real pages by the number of server aliases that were exposed prior to March 12 the number might be close to correct. I can't view most of it. The fact it is there at all has the boss very concerned.
Others have pointed this out too, including a post in the "dealing with bourbon" thread [webmasterworld.com...] msg 545 earlier today.
I see that dmoz still has a mess although it doesn't look nearly as bad as it did.
Also a good suggestion on the global domain-a.com/* -> domain-b.com rewrite. Unfortunately, at this point it's only the home page and a handful of other pages being redirected to domain-b.com (lucky thing, since G seems to hate domain-b in this update while domain-a is flying right where it used to be).
If this doesn't clear up in the next week or so, I'm going to try redirecting domain-b.com/index.htm back to domain-a/index.htm INSTEAD and see if that'll make a difference. I really don't have much to lose at this point since I'm already penalized.
>>Do a 301 from domain.com/ to www.domain.com/ so that
>> each domain is set up properly and then do a global
>>redirect from www.domaina.com/* to www.domainb.com/*
Actually I'm doing a global 301 rewrite from domain.com/* to www.domain.com/* THEN a 301 redirect of www.domaina.com/index.html and index.htm to www.domainb.com/
BTW, when redirecting the homepage (i.e. www.domain-a.com/), I've never seen a way in the docs to redirect "/" itself in the .htaccess, you have to actually redirect "/index.html -> [domain-b.com"...] right? (or am I missing something?). Doing the former causes an infinite loop error or is ignored if I recall because the server OS is already redirecting / -> /index.html before it gets to the .htaccess.
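For anyone wanting to sanity-check a redirect setup like this on paper before touching the live .htaccess, here is a toy model. It treats redirects as a plain lookup table and follows them until they stop or loop. The table entries are hypothetical stand-ins for the rules discussed above, and this is not how Apache evaluates rewrite rules internally; it only illustrates the chain and the loop case.

```python
def follow_redirects(url, rules):
    """Follow url through the rules table and return every URL visited.

    Raises ValueError if the chain comes back to a URL it has already seen.
    """
    seen = [url]
    while url in rules:
        url = rules[url]
        if url in seen:
            raise ValueError("redirect loop: " + " -> ".join(seen + [url]))
        seen.append(url)
    return seen


# Hypothetical two-step setup: non-www to www first, then old home page to new domain.
RULES = {
    "http://domain-a.com/index.html": "http://www.domain-a.com/index.html",
    "http://www.domain-a.com/index.html": "http://www.domain-b.com/",
}
print(" -> ".join(follow_redirects("http://domain-a.com/index.html", RULES)))

# Hypothetical loop: "/" is mapped to "/index.html", which is redirected back to "/".
LOOP = {"/": "/index.html", "/index.html": "/"}
try:
    follow_redirects("/", LOOP)
except ValueError as err:
    print(err)
```

Every extra hop in a chain is another URL a crawler has to process, so a table like this is also a quick way to spot chains longer than they need to be.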
On the other subject mentioned a couple of posts back, which might shed some light on some of the checks G is doing: we had our server management subdomain URL pop up on Google as a URL-only listing a couple of days ago! This subdomain was until recently only accessible by IP address, and WE didn't even know it was now accessible by subdomain name (something to the effect of: admin.domain-a.com). It's never been linked to from ANYWHERE or even known about by ANYONE. The only way G could have found out about it was by an exhaustive IP address search, or maybe by backtracking previously collected IPs of websites. The current admin.domain-a.com IP is separate from the rest of our IP block and HAD previously been used by our primary domain (www.domain-a.com), which was switched to another IP almost a year ago when we changed server hardware and our ISP made us change the way our server and nameserver were configured and assigned to IPs. Is this how they found the domain-a.com/ URL (which is also listed as only a URL), and is it the reason they are so determined to keep it in the database, since it has been confirmed by them as the primary PTR record response to an IP lookup? A way to detect multiple domains on an IP? Just an idea.