
having old pages removed from google

8 months old and are still there!

10:35 am on Jan 27, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 23, 2002
posts:110
votes: 0


I had about 160,000 pages containing old, no-longer-accurate information - all are about 8 months old.

I have 301-redirected all of those pages to newer pages (completely different URLs) with different, newer content, hoping to pass any backlinks from the old pages on to the newer pages.
However, even though Google followed the redirects to the newer pages, as I can see from my log files (it followed them 3-4 months ago), it still holds old copies of about 80,000 pages - none have been updated, even though Googlebot fetched the redirects. These copies are 8 months old!
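
(For reference, I'm doing the redirects in Apache's .htaccess; the rule below is just a simplified sketch with made-up directory names, not my real setup:)

  # hypothetical example: every old page 301'd to its replacement under a new path
  RedirectMatch 301 ^/oldpages/(.*)$ http://www.example.com/newpages/$1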

Does anyone know how to get around this? Is there any way to have these old pages completely booted out of Google?

10:16 am on Jan 30, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 23, 2002
posts:110
votes: 0


Anyone?
11:45 am on Jan 30, 2005 (gmt 0)

Senior Member from KZ (lammert)

joined:Jan 10, 2005
posts:2952
votes: 35


I have seen the same for pages on my site. Some were deleted in April 2004 but are still not removed from the index. Google seems to need some time to delete old pages at the moment. Maybe other sites are linking to your pages, which can be a reason for Google to keep a placeholder for the page in its index. In that case there is only a line with the URL and no actual content.

You can check whether Google is likely to delete your pages soon (a couple of example queries are sketched after this list):

  • Is there a cache? If not, the pages will probably be removed.
  • Are the search results marked as "supplemental results"? If so, the pages will probably be removed.
  • Is the cache dated Dec 31, 1969? If so, the pages will probably be removed.
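
For example, queries along these lines in Google (with example.com standing in for your own domain) will show the cache date for a single URL and what is still listed overall - supplemental results are labelled as such on the results page:

  cache:www.example.com/old-page.html
  site:www.example.com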

There is a recent thread about Google taking measures against scraper sites at [webmasterworld.com...] If that assumption is correct and Google is currently experimenting with the best way to wipe out scraper sites, it may have suspended the deletion of files from its index: the more garbage there is in the index, the better it can test whether the algorithm is working correctly.

1:58 pm on Jan 30, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 28, 2004
posts:112
votes: 0

9:54 am on Jan 31, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 23, 2002
posts:110
votes: 0


lammert, all pages are supplemental results.
There is a cached version, dated somewhere around 8-10 months ago.

Any idea how long it would take to get these pages off Google?

macdave: robots.txt has been disallowing those pages for 5 months now. Google just doesn't take them out.

10:10 am on Jan 31, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 13, 2003
posts:373
votes: 0


This has something to do with duplication, I "think".

I moved content from one domain to the other. I could never get the same content to rank in the new domain, but it still ranked in the old domain as a "supplemental result". PR was still assigned to pages that were no longer there, and Google's cache still showed the old content. There was entirely new content in the old domain, and the swap was done overnight.

Have you moved the content? Is there some reason Google would flag that content "set"? It may in some way not be clean for Google - perhaps somebody else has copied it. Have you done a footprint search, selecting chunks of text in quotes, to see if it resides somewhere else on the web?
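
By a footprint search I just mean pasting an exact chunk of text from one of the pages into Google inside quotation marks, for example:

  "a distinctive sentence copied word for word from one of the old pages"

(that line is only a placeholder - use a real sentence from your own content).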

Hope this helps.

David
PS: I have not bothered to address the issue, as the domains are not important to me. I am going to this week, and I think that changing the content in the new domain (away from that of the "set" that was transferred from the old one, which now shows as supplemental) may help.

11:20 am on Jan 31, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


macdave - the problem with Google's removal system is that if you already have a 301 on the page/site in .htaccess, the page removal tool will not be able to access the page in order to remove it.
2:55 pm on Jan 31, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 28, 2004
posts:112
votes: 0


flex55 - robots.txt just tells Google not to spider specific URLs; it doesn't remove URLs that are already in the index. In this case, robots.txt is actually hurting you, because it's not allowing Googlebot to see the 301s. If it could see the 301s, the old URLs would eventually drop out of the index as G-bot crawled your site (depending on crawl frequency, it might take a while). A quicker alternative:

1. Put the URLs you want to have removed into robots.txt - you've already done this. (A sample robots.txt is sketched just after this list.)
2. Submit your robots.txt to the Remove URL tool: [services.google.com:8882...]
3. Have a beer or two -- in 24-36 hours or so those URLs will be out of the index.
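
The robots.txt entries themselves are just ordinary Disallow lines. A minimal sketch (the paths are made up - substitute your own):

  User-agent: Googlebot
  Disallow: /oldpages/
  Disallow: /old-widget-page.html

A Disallow on a directory covers everything beneath it, so you don't need to list all 160,000 URLs one by one.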

marval - The remove tool doesn't need to see the 301 in order to remove the URL. It just looks at robots.txt and matches it against files in the index. When using robots.txt the remove tool doesn't do any spidering (except for grabbing robots.txt).

3:50 pm on Jan 31, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


I guess I wasn't really clear - if you use .htaccess to set the 301s for the pages instead of putting the redirect on the page itself, the removal tool will never get to the files and will error out. I've tried many times to get entire domains removed, and it errors out every time because of this.

Maybe if you could post an example of what you mean by putting the URLs in the robots.txt file, I could run a test to see if I can get that to work.

4:45 pm on Jan 31, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 28, 2004
posts:112
votes: 0


301 redirects are understood by Googlebot but not by the Remove URL tool.

If you 301 a URL, the change will be picked up by Googlebot in the course of its normal crawling. Once it's seen the 301, it will take a day or two for the URL to drop out of the index, and possibly longer for Googlebot to come back and index the new URL. In flex55's case, robots.txt is preventing Googlebot from crawling those URLs to even see the redirects, so the URLs are staying in the index.

There are 3 ways to use the Remove URL tool:

1) "Remove pages, subdirectories or images using a robots.txt file." Set up your robots.txt to disallow the URLs you want to remove, then tell the Remove URL tool to read your robots.txt. This is by far the easiest way to remove large number of URLs from the index.

2) "Remove a single page using meta tags." Add meta "noindex" to individual pages and submit those URLs to the Remove URL tool one-by-one. This works well, but can be tedious if you have more than a few URLs, because you must add meta tags to each page and submit each individually.

3) "Remove an outdated link." Submit individual URLs that return a 404 status code. Easier than 2, but you can't provide redirects for your visitors. The page must really be gone.

Methods 2 and 3 retrieve the URLs you feed them and require specific responses in order to operate. Neither understands redirects of any kind, so it's no surprise that you'd get an error. But with method 1 (robots.txt), the tool doesn't even try to spider your site, so it doesn't matter how, or whether, you've redirected a URL. The tool just looks at robots.txt for a list of URLs to remove and matches that against what's already in the index.

If you're not familiar with robots.txt, see [robotstxt.org] or read up in the robots.txt forum [webmasterworld.com].

2:09 pm on Feb 1, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 23, 2002
posts:110
votes: 0


macdave: Thanks, I just submitted my robots.txt to Google.
I hope to see the pages out soon. Thanks again!
3:14 pm on Feb 1, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 3, 2001
posts:368
votes: 3


"Once it's seen the 301, it will take a day or two for the URL to drop out of the index, and possibly longer for Googlebot to come back and index the new URL."

That is not quite the case. I 301'd a site last year - the entire old site still exists in Google with cache dates of around March 16, 2004, so in just another 6 weeks I will have year-old dead pages still showing up.

It is amazing how they managed to double their index size a few months ago, isn't it?!

4:38 pm on Feb 1, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 28, 2004
posts:112
votes: 0


My recent experience is that 301'd URLs will drop from the index within 48-72 hours of being spidered, so the time to drop will depend on how frequently your site is spidered. Have your 301s been hit by Googlebot? (A quick way to check your logs is sketched below.)

n.b. I'm referring to 301s within a single site. Google may treat cross-domain 301s differently.
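
One quick way to check is to grep your raw access log for Googlebot requests to the old URLs - a rough sketch, assuming a standard Apache combined log at a made-up path:

  grep Googlebot /var/log/apache/access_log | grep "GET /oldpages/"

If the status code right after the request string is 301 rather than 200, the redirects are being served to Googlebot.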

2:10 am on Feb 2, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Oct 30, 2002
posts:404
votes: 0


That's where I'm seeing the problem. I'm trying to get 5 domains out of Google that it thinks are dupes of my real site, and the 301s have been followed - I get spidered normally at least 10 times every day. The 301s and the robots file have been there for a month now, and every time I do a site:mydomainnotwantedingoogle it's still showing all of the pages. But I guess I'll just have to wait.