Forum Moderators: Robert Charlton & goodroi


Pages Reappear 6 Months After URL Removal Tool

Even with /robots.txt exclusion?


johnt

9:11 am on Jun 14, 2005 (gmt 0)

10+ Year Member



A little over 6 months ago I used Google's URL removal tool, in conjunction with robots.txt files, to remove a number of pages & sites from Google's index, or so I thought.
At the time, I'm pretty sure that Google said this was temporary for 90 days. Now it says 6 months. Fine.

Unfortunately, after the 6 months are up those pages and sites which I removed from Google have reappeared in the index in their old format - some of the cache dates go back to September last year. But the robots.txt files still disallow Googlebot from either specific pages or entire sites, and Google seems to be totally ignoring this. I would have thought that, after the temporary time-frame has elapsed, Google would check the robots.txt files again and if they still didn't allow access, then the pages would remain out of the index.
But apparently they never left the index at all, but only the SERPs. Google retained the information that they had at the time of requesting removal, and now some of those pages are re-appearing in SERPs.

Has anyone else had this happen to them? Does anyone have any advice on how to permanently remove URLs from Google, or am I going to have to manually submit a list of robots.txt files to their URL removal tool twice a year for the rest of my life?

Cheers

John

ciml

5:23 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do they appear as URLs only (which I'd expect with /robots.txt exclusion) or with their titles and descriptions (which I wouldn't expect after Google had seen the /robots.txt)?

sailorjwd

6:00 pm on Jun 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got the same thing happening. Last January I reorganized a few folders and their names changed.

I deleted the old pages using the URL removal tool and now they are completely back and rank higher in the SERPs than the real pages.

I attempted to re-remove them and the request is totally ignored in the tool.

So now I'm waiting for further dup content penalties... as if there could be more than I already have :(

Wizard

6:17 pm on Jun 14, 2005 (gmt 0)

10+ Year Member



I have noticed it in my thread: [webmasterworld.com...]

When the six months had passed, my URLs reappeared. Those that were still online had the most recent snippet immediately, but those that have returned 404 for the past six months reappeared with the last snippets known to Google.

I re-removed them with URL Console, and they're gone again. I'm afraid they'll be back in November, unless Google engineers do something about this.

johnt

8:18 am on Jun 15, 2005 (gmt 0)

10+ Year Member



> Do they appear as URLs only (which I'd expect with /robots.txt exclusion) or with their titles and descriptions (which I wouldn't expect after Google had seen the /robots.txt)?

They appear as full entries, with cache, title and description, but all from a very old version of the page as Googlebot can no longer fetch the pages due to robots.txt exclusions. Some of the caches go back to March 2004.

While I've been able to go through the removal tool again and take most out, since I originally removed the pages 6 months ago I have set up 301 redirects on some sites, which means that I cannot use the removal tool for pages on those sites - Google will be unable to retrieve a robots.txt file for the domain as it will be redirected to the new domain.

Any suggestions on this? I've contacted Google through their "contact us" forms and heard nothing back, but I'd really rather not wait for GoogleBot to notice that the pages in its index are either redirected, or actually barred using robots.txt.

cheers

John

Wizard

12:42 pm on Jun 15, 2005 (gmt 0)

10+ Year Member



> I have set up 301 redirects on some sites, which means that I cannot use the removal tool for pages on those sites - Google will be unable to retrieve a robots.txt file for the domain as it will be redirected to the new domain.

Use a RewriteRule to exempt robots.txt from the 301 redirect, so the old domain still serves its own robots.txt.
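
A minimal sketch of what that exception might look like in the old domain's .htaccess. It assumes the site-wide 301 is itself done with mod_rewrite; if the redirect comes from a mod_alias Redirect line instead, this exception would not stop it. The hostname www.new-domain.example is a placeholder, not a name from this thread.

```apache
RewriteEngine On

# Leave robots.txt alone: "-" means no substitution and [L] stops rule
# processing, so the file is served from the old domain as usual.
RewriteRule ^robots\.txt$ - [L]

# Everything else is permanently redirected to the new domain.
# www.new-domain.example is a placeholder hostname (assumption).
RewriteRule ^(.*)$ http://www.new-domain.example/$1 [R=301,L]
```

With rules in this order, a request for /robots.txt on the old domain should get the file itself rather than a 301, which gives the removal tool something to fetch.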

Reid

3:44 pm on Jun 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you return a 410 Gone then they will be removed permanently. The robots.txt submission will remove them immediately, but I didn't know they would come back again.

The new Google Sitemaps feature seems to be the answer for these types of problems. It's like a robots.txt on steroids that you register with Google, showing how you want your site indexed.
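
For reference, a rough sketch of two ways to return 410 Gone from an Apache .htaccess file. Both paths are hypothetical examples, and this only suits pages that really are retired.

```apache
# Option A (mod_alias): answer 410 Gone for one retired page.
# /old-page.html is a hypothetical path.
Redirect gone /old-page.html

# Option B (mod_rewrite): the [G] flag returns 410 Gone for everything
# under a retired folder. retired-folder/ is a hypothetical path.
RewriteEngine On
RewriteRule ^retired-folder/ - [G]
```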

johnt

3:55 pm on Jun 15, 2005 (gmt 0)

10+ Year Member



Thanks for the tips guys.

Wizard, good tip on the RewriteRule.

Reid, I can't use a 410 because the pages are not actually gone, I just don't want them in Google's index. It does work though, I've used it on pages that actually are "gone" in the past.

I agree that the new sitemaps facility is a great step forward, but I think that it's more for telling Google about new pages, or pages that may be more difficult for it to find, than about pages that you don't want indexed.

cheers

John

g1smd

4:34 pm on Jun 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Look at 216.239.37.99 and see that it is full of pages cached 6 and 7 months ago (even contains pages that no longer exist, or have completely changed their content since) - looks like someone at the 'plex is comparing old data with new algo against new data with new algo.

Wonder if this is something to do with fixing the redirect hell they created a few months ago?

MikeNoLastName

8:04 am on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We have a closely related problem.
About 6 months ago we 301 redirected the home page of www.domain-a.com to www.domain-b.com. Everything was fine for 6 months. We also have 301 redirects from all non-www to www domains. However, for some stupid reason, at the start of Bourbon G found an old reference to [domain-a.com...] which shows up as ONLY a URL, no cache, no supplemental, no description, just a plain URL listing: [domain-a.com...]
It is triple 301 redirected:
domain-a.com -> domain-a.com/ -> www.domain-a.com/ -> www.domain-b.com/
but it is only the second stage which is getting indexed.

I'm almost certain it is the cause of a duplication penalty which is bringing our entire domain-b.com down. I've written G support, tried resubmitting domain-a.com/, putting links to it from heavily spidered pages, even replacing a real page there and removing the redirect to domain-b.com. This caused the old home page www.domain-a.com/ to get re-indexed as itself (as expected) in less than 2 days as a PR6, but domain-a.com/ still remains! Googlebot appears to visit it daily and get the 301, but never wants to remove it! We've tried everything short of url console remove. I'm really scared to try a url console remove since I've heard there are some issues with G removing BOTH the domain-a.com AND the www.domain-a.com when one or the other is specified, which we definitely CAN'T risk at any cost. Also I suspect a 404 error caused by removing the page temporarily will do little more than lose us the PR generated by all our old links to this page.

Any suggestions for SAFELY getting it recognized as being redirected? If not, I'm thinking of pointing it at a competitor, or perhaps G and see if it brings them down instead!

johnt

8:39 am on Jun 16, 2005 (gmt 0)

10+ Year Member



Maybe GoogleBot is getting confused by the number of redirects. You could try redirecting domain-a.com directly to www.domain-b.com without all the intervening steps.
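
One way to express that single hop in domain-a.com's .htaccess, as a sketch only. It assumes mod_rewrite is available and that this rule sits before any general non-www to www rule, so that it wins for the home page.

```apache
RewriteEngine On

# Home page requested on the bare host (no www): go straight to the final
# destination in one 301 instead of hopping through www.domain-a.com first.
RewriteCond %{HTTP_HOST} ^domain-a\.com$ [NC]
RewriteRule ^$ http://www.domain-b.com/ [R=301,L]
```

In a document-root .htaccess the pattern ^$ matches a request for "/", because the leading slash is stripped before matching.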

g1smd

11:04 am on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I installed a 301 redirect in mid-March on a friend's site (of 118 pages) and it took 4 or 5 weeks for Google to sort things out. Previously some pages were attached to www and some to non-www and many were without title and description. Many pages were duplicate listed, and many pages were not listed at all.

Things were perfect at the start of May. Google had picked up all the redirects and things were all listed correctly. A fake sitemap, installed on another site, pointing to URL versions for pages that we didn't want listed had helped greatly.

In late May, right at the start of update, Google suddenly listed all four versions of every page for the site (with and without www, and with and without the trailing / on the URL - every page of the site is an index page in a folder) and it stayed that way for several weeks. The three extra versions were all without title and description. The 118 "real" pages were all with title and description.

The problem fixed itself about a week ago. Nothing on the site or in the links was changed.

In the last few days that 64.233.167.104 datacentre has reverted to showing the "broken" listings again - a mix of www and non-www, many duplicates, many pages missing, etc - and the cache date for all those pages is 6 months ago.

For other searches, old pages are appearing in the SERPs, pages that should not appear for that search term because the content of those pages no longer contains that information. Again, each of those pages has a six-month old cache.

For pages that are new online (first published in the last 3 or 4 months or so), pages that have been "fully indexed" {full title and description, and cached at least weekly} since being online, the pages now appear as URL-only entries in the 64.233.167.104 datacentre SERPs.

The basis for the data at that IP is an index from early January, with some newly spidered information added in. The data predates the time that much of the redirect problem was seen, but for sites that have fixed their redirect problems the older data reverts back too far in time.

Reid

5:24 pm on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



MikeNoLastName - first I would go to the Apache forum (I assume that's what you are using) and find out the proper standard way to deal with domain.com vs. domain.com/. My guess is that this is already handled by the server and that a redirect for it may be redundant.
I have never heard of redirecting domain.com to domain.com/.
Do a 301 from domain.com/ to www.domain.com/ so that each domain is set up properly, and then do a global redirect from www.domaina.com/* to www.domainb.com/*
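
Roughly, that two-step layout could look like this in domain-a's .htaccess. This is a sketch assuming mod_rewrite, with hostnames following the domain-a / domain-b naming used earlier in the thread. The RewriteCond host checks keep the rules from firing on www.domain-b.com even if both names happen to share one .htaccess.

```apache
RewriteEngine On

# Step 1: canonical hostname. Bare domain -> www, path preserved.
RewriteCond %{HTTP_HOST} ^domain-a\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain-a.com/$1 [R=301,L]

# Step 2: whole-site move. Every path on www.domain-a.com goes to the
# same path on www.domain-b.com.
RewriteCond %{HTTP_HOST} ^www\.domain-a\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain-b.com/$1 [R=301,L]
```

Note that a request to the bare domain takes two hops under this layout (first to www.domain-a.com, then to www.domain-b.com).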

Reid

5:27 pm on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



g1smd - I would take a wait-and-see approach to this.
Things were working fine and then went back to how they were before the fix on a certain datacenter.
Google is still messing around, maybe comparing the algo on old data with the algo on new data. I would wait and see.

theBear

7:02 pm on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Reid and g1smd,

Back a short time ago during this update a huge pile of old stuff started surfacing under our IP addy.

If you were to multiply the number of real pages by the number of server aliases that were exposed prior to March 12 the number might be close to correct. I can't view most of it. The fact it is there at all has the boss very concerned.

Others have pointed this out too, including a post in the "dealing with Bourbon" thread [webmasterworld.com...] msg 545 earlier today.

I see that dmoz still has a mess although it doesn't look nearly as bad as it did.

MikeNoLastName

10:05 pm on Jun 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi
Reid, Yes the domain-a.com -> domain-a.com/ IS internal and already handled automatically by the Apache default server configuration, not in the .htaccess, I just included it in case anyone knew of this maybe being a problem. I have no simple control over it without manually going into the low level config files and changing it for everyone. It's impossible to tell from the logs exactly what version Googlebot is using to access it, but I DO see 301's coming up for " / ".

Also a good suggestion on the global domain-a.com/* -> domain-b.com rewrite, unfortunately at this point it's only the home page and a handful of other pages being redirected to domain-b.com (lucky thing since G seems to hate domain-b in this update while domain-a is flying right where it used to be.)

If this doesn't clear up in the next week or so, I'm going to try redirecting domain-b.com/index.htm back to domain-a/index.htm INSTEAD and see if that'll make a difference. I really don't have much to lose at this point since I'm already penalized.

>>Do a 301 from domain.com/ to www.domain.com/ so that
>> each domain is set up properly and then do a global
>>redirect from www.domaina.com/* to www.domainb.com/*

Actually I'm doing a global 301 rewrite from domain.com/* to www.domain.com/* THEN a 301 redirect of www.domaina.com/index.html and index.htm to www.domainb.com/

BTW, when redirecting the homepage (i.e. www.domain-a.com/), I've never seen a way in the docs to redirect "/" itself in the .htaccess, you have to actually redirect "/index.html -> [domain-b.com"...] right? (or am I missing something?). Doing the former causes an infinite loop error or is ignored if I recall because the server OS is already redirecting / -> /index.html before it gets to the .htaccess.
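
On the question of redirecting "/" itself: a sketch using mod_alias RedirectMatch, which matches the full URL path as a regular expression, so the bare "/" can be targeted directly. It assumes domain-a has its own .htaccess. If www.domain-b.com were served from the same directory, an unconditional rule like this would redirect domain-b's home page to itself, which is one possible source of a loop.

```apache
# Matches "/", "/index.html" and "/index.htm" only; deeper paths are untouched.
RedirectMatch 301 ^/(index\.html?)?$ http://www.domain-b.com/
```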

On the other subject mentioned a couple posts back, which might shed some light on some of the checks G is doing, we had our server management subdomain URL pop up on google as a URL-only listing a couple days ago! This subdomain was until recently only accessible by IP address and WE didn't even know it was now accessible by subdomain name (something to the effect of: admin.domain-a.com). It's never been linked to from ANYWHERE or even known about by ANYONE. The only way G could have found out about it was by an exhaustive IP address search, or maybe by backtracking previously collected IPs of websites. The current admin.domain-a.com IP is separate from the rest of our IP block and HAD previously been used by our primary domain (www.domain-a.com) which was switched to another IP almost a year ago when we changed server hardware and our ISP made us change the way our server and nameserver was configured and assigned to IPs. Is this how they found the domain-a.com/ URL (which also is listing as only a URL), and the reason they are so determined to keep it in the database since it has been confirmed by them as the primary PTR record response to an IP lookup? A way to detect multiple domains on an IP? Just an idea.

Reid

4:08 pm on Jun 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



MikeNoLastName - back in May, after the unnamed update before Bourbon, when Google was apparently fixing the 302 problem, some others and I were seeing old forgotten 301s reappearing in the SERPs. One person had an old domain show up from over 2 years ago that was no longer a 301; it was gone, but it still showed up in the SERPs. I just used the removal tool (submit robots.txt) to nuke them. I wouldn't recommend this to you though, not at this point.
What you should do, since you have some questions about low-level Apache workings (or maybe some stray "-" or something in .htaccess):
Go to the Apache forum and talk to JDMorgan about the way you have it set up. This guy is the Apache wizard. He may point out some slight syntax problem you may have. Even a space in the wrong place or something like that could be sending the wrong message. Or he can explain how these several redirects are interpreted or should be set up. He is the only one I know who would have that kind of knowledge definitively.