
Google SEO News and Discussion Forum

    
Removing Lots of Pages from Google Index
PPC_Chris




msg:4317609
 8:44 pm on May 25, 2011 (gmt 0)

Does anyone know the best way to get a lot of pages removed from Google's index? Pre-Panda, my site had about 100K pages indexed. Then we got hit by Panda 2 on April 11. We decided to get rid of the vast majority of these (say 97K of the 100K) that Google might consider thin pages.

So even though these pages accounted for a very low percentage of our traffic and an even lower percentage of our revenue, they did make up 97% of total pages in our site. So, on April 25, we used 301 redirects to point all of these pages to the appropriate higher level pages in our site and updated our sitemap in GWT to show only the 3K pages still live.

After reading some comments from the Google support forums and on these boards, earlier this week we decided to 404/410 all of the thin content pages, as this would be a better signal to Google that these pages have been permanently deleted. The problem is that Google still has about half of the pages that we have removed in its index.

If my thinking is correct then Google's algorithm could still consider the vast majority of the pages in our site to be of low quality... 50K 404 pages still indexed compared to only 3K of pages Google theoretically considers to be high quality pages.

According to GWT, Googlebot is crawling about 6-10K pages/day, but only de-indexing about 1-2K/day. I worry that if and when Google decides to re-evaluate sites that were Pandalized, it is going to appear as though the majority of our pages are still thin.

So, bottom line: does anyone know of any way for me to speed up the de-indexing of these 50K pages? The only crazy idea I can think of is to resubmit our sitemap with the full 100K and hope that this speeds up Google's crawling of these pages. But it seems insane to submit a sitemap where 97% of the URLs are 404s, especially when it seems we are already in the doghouse with Google. I can't imagine actually doing this.

Does anyone have any ideas that can help us get rid of these pages from the index at a faster rate?

 

tedster




msg:4317688
 11:27 pm on May 25, 2011 (gmt 0)

You can and should submit a new sitemap that lists ONLY the current URLs. You can also use your WebmasterTools account and make use of the new "Immediate Temporary URL Removal" tool [webmasterworld.com]. It's good for 90 days and in that time, Google should have a good chance to process at least most of your changes.
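
If it helps, here's a quick sketch of regenerating that sitemap from a flat list of the live URLs (the file names are just placeholders; use whatever your system can export):

# Minimal sketch: build a sitemap that lists ONLY the live URLs,
# read here from a plain-text file with one absolute URL per line
# (file names are hypothetical).
from xml.sax.saxutils import escape

def build_sitemap(url_file="live_urls.txt", out_file="sitemap.xml"):
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
    with open(out_file, "w") as f:
        f.write(xml)

if __name__ == "__main__":
    build_sitemap()

Then resubmit the file in Webmaster Tools as usual.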

walkman




msg:4317713
 12:34 am on May 26, 2011 (gmt 0)

Your best hope is to have them all in a certain directory and remove that directory; otherwise it's going to be painful. Very painful with 97,000 pages.

There's another way: pinging each 410 page. There are ways of automating that, but I'm not sure how Google is going to like 97,000 pings. And even when G does come, I'm not sure it really notes in the db that the page is gone.

CouponUs




msg:4320372
 8:11 am on Jun 1, 2011 (gmt 0)

You can also try to remove all the pages and 301 the useful content to a new address. It will take some time for Google to reindex the content, and there is a little bit of weight lost through the transfer.

Sgt_Kickaxe




msg:4320521
 1:56 pm on Jun 1, 2011 (gmt 0)

Removing them and letting them return 404 would have worked too. There is very little that Google relies on webmasters to provide; they will act on what the crawler finds first and foremost (exclusively?). There is a chance of losing trust no matter what signals you send Google, and they don't require any signals for pages that visitors can't see anymore... :P

Redirecting pages that had backlinks, so that visitors don't see 404s, is the main concern.

freejung




msg:4320591
 3:35 pm on Jun 1, 2011 (gmt 0)

Why would returning 404/410 be better than redirecting to the category page with a 301? With the 301, any juice you have from backlinks to those pages is preserved, visitors who click on outdated links or SERP listings will still get to a real page, and I expect Google will eventually sort out which pages are still on the site.

With just letting them go 404, I would be concerned about losing both link juice and visitors, a high price to pay for the possibility that the pages might be de-indexed a little quicker, which doesn't seem to be happening anyway.

goodroi




msg:4320609
 4:20 pm on Jun 1, 2011 (gmt 0)

Managing 97,000 301 redirects can cause real system issues if it's not set up properly. I also suspect many of the 97,000 pages have no link juice to preserve, so a 301 redirect is not going to help them.

tedster




msg:4320610
 4:22 pm on Jun 1, 2011 (gmt 0)

Whether to 301 or 404/410 should be a decision that rests on actual data, and nothing else. Here are the two data-driven questions about the URL being deleted:

1. Does it have a decent backlink profile - something worth preserving?
2. Does it get ENTRY traffic?

If a URL is one of 97,000 being considered for removal, I'd expect less than 1% will get a "yes" to one of those two questions. In a similar situation on a recent project, our team found 80 URLs worth redirecting out of 30,000.
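
If you want to triage them mechanically, here's a rough sketch (the CSV and its column names are hypothetical; use whatever your backlink and analytics exports give you):

# Rough sketch of applying the two questions above to a list of removal candidates.
# Expects a CSV with hypothetical columns: url, external_links, entry_visits_90d
import csv

def split_urls(report_csv="removal_candidates.csv", min_links=1, min_entries=10):
    redirect, gone = [], []
    with open(report_csv, newline="") as f:
        for row in csv.DictReader(f):
            has_links = int(row["external_links"]) >= min_links
            has_entries = int(row["entry_visits_90d"]) >= min_entries
            # "Yes" to either question -> worth a hand-picked 301 target;
            # otherwise let the URL return 404/410.
            (redirect if has_links or has_entries else gone).append(row["url"])
    return redirect, gone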

I prefer a 301 redirect that goes to a well chosen URL, one that matches the value and function of the removed page. I do not use a broad brush approach that redirects them all to the home page. I even avoid a high level category page as much as possible. 301 says "this content has permanently moved to a new location" and I take that rather strictly.

A 404/410 status is much quicker for a search engine to process. Processing 97,000 URLs that use a 301 redirect requires a whole lot of trust checking because 301 has historically been a major spamming tool.

My guiding rule is to redirect as little as possible. I've seen many sites that have generated their own ranking troubles over time by being very casual with the 301 status.

freejung




msg:4320716
 8:04 pm on Jun 1, 2011 (gmt 0)

OK. The reason I ask is that I'm considering similar measures -- though in my case the number of URLs is on the order of 1000, and many of them have links or entry traffic.

Managing 97,000 301 redirects can cause real system issues if its not setup properly

Well, that would depend on your platform and what the URLs are like -- it might be possible to do it with a good regex in .htaccess, or you could set up your CMS to return the redirect only when such a URL is actually requested. In my situation I would do the latter, as it would be fairly easy to set up in my CMS and should perform quite well if done right.
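
For instance, something along these lines (the URL pattern and target are made up) would only fire when one of the retired URLs is actually requested:

# Sketch of a single-rule redirect at the CMS/application level, assuming the old
# thin pages share one URL shape, e.g. /widgets/12345-some-title.html -> /widgets/
# (pattern and paths are hypothetical).
import re

OLD_PAGE = re.compile(r"^/(?P<category>[a-z0-9-]+)/\d+-[^/]+\.html$")

def redirect_for(path):
    """Return the 301 target for a retired detail page, or None to serve the request normally."""
    m = OLD_PAGE.match(path)
    return f"/{m.group('category')}/" if m else None

The .htaccess equivalent would be a single RewriteRule with the [R=301,L] flags.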

I prefer a 301 redirect that goes to a well chosen URL, one that matches the value and function of the removed page.

Totally agree, and maybe in Chris' situation that doesn't make sense. In my case the redirect would be to a subcategory page which would now have all of the unique content from the redirected page -- in other words, what I'm talking about is putting the text from the individual pages, which is fairly short, directly on the subcategory pages and then redirecting the individual pages to the subcategories. So the content actually will have moved to the new location verbatim.

g1smd




msg:4320737
 8:53 pm on Jun 1, 2011 (gmt 0)

With 97 000 redirects there are two viable approaches.

If the "mapping" of old URLs to new URLs is "simple", you might be able to do the whole thing with just a small number of rules, one each for "products", "categories", "reviews", etc.

If the "mapping" is "complex" and especially if there are "parts" that appear in the new URL that cannot be extracted from the old URL path and/or parameters, then you will need to internally rewrite requests for old URLs to a script. That script will then look up the new URL in an array or in a database, and then return the redirect headers.

However, is it appropriate to redirect 97 000 requests?

If there is no direct individual new match for each of a bunch of old URLs, many people are tempted to redirect them all to a single new URL. It is not a good idea to funnel a large number of old pages to a single new URL. In this case a "410 Gone" response is more appropriate.
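
A bare-bones sketch of that lookup script (the mapping shown is hypothetical; a database table works the same way), with the 410 fallback for URLs that have no sensible one-to-one target:

# Minimal WSGI sketch: old URLs are internally rewritten to this app, which looks
# the requested path up in a mapping and answers 301, or 410 when there is no match.
# The example mapping entries are hypothetical.
OLD_TO_NEW = {
    "/old/product-123.html": "/products/blue-widget/",
    "/old/review-456.html": "/reviews/blue-widget/",
}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    new_url = OLD_TO_NEW.get(path)
    if new_url:
        start_response("301 Moved Permanently", [("Location", new_url)])
        return [b""]
    # No individual match: say the page is gone rather than funnel it to one URL.
    start_response("410 Gone", [("Content-Type", "text/plain")])
    return [b"This page has been permanently removed."]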

On that note, I recently made some changes to return "410 Gone" for several thousand pages on a site, rather than redirect elsewhere. I'll be interested to see how quickly Google gets to grips with that these days. Google's reaction to 3xx and 4xx codes isn't what it used to be.

freejung




msg:4320755
 9:31 pm on Jun 1, 2011 (gmt 0)

look up the new URL in an array or in a database

Yeah, that's essentially what I'm talking about. Let me elaborate a bit on what I'm thinking of doing, because I'd love to hear feedback on it.

The pages I'm thinking of removing are image detail pages. They basically have a high-res version of an image (the thumbnail of which appears on a subcategory page) and a caption and nav to the next and previous images. Formerly, it made sense to have each of these on its own URL, now not so much. So what I'm thinking of doing is:

-Put the full caption text on the subcategory page where the thumbnail of the image appears. Set these pages up to display the full images in a lightbox using JS as this will not create a separate URL for each image (with JS off, just link straight to the image file).

-Replace each image page with (essentially) a script that determines the appropriate page to redirect to and returns the redirect. This redirect will point to the subcategory page with a query string parameter that causes the image in question to open in the lightbox display, so that a visitor to the old URL will see the same image at the same resolution as it was previously on the image page.

-Use a canonical element to suggest that Google consider the subcategory page as a single page rather than separate pages for each query string, since it will contain nearly identical HTML content regardless of the query string.

Thus there will be several (maybe a dozen or so) old URLs redirecting to the same base URL, which I don't think is excessive especially because the exact same content will be there. A visitor to the old page will see content nearly identical to what they would have seen before, just in a lightbox rather than on a separate page. Link juice is preserved and hopefully consolidated into the single subcategory URL, which now has lots of unique content and causes visitors to spend much more time on the same "page" while they scroll through the images.
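
For concreteness, a rough sketch of what I have in mind (the URL shapes are made up):

# Old image detail page   /gallery/sunset-42.html
# redirects (301) to      /gallery/?img=42
# and the subcategory page declares one canonical URL regardless of the query string.
import re

IMAGE_PAGE = re.compile(r"^/gallery/[a-z0-9-]+-(\d+)\.html$")

def redirect_for(path):
    m = IMAGE_PAGE.match(path)
    return f"/gallery/?img={m.group(1)}" if m else None

def canonical_tag(subcategory_path="/gallery/"):
    # Emitted in <head> for every ?img=... variant so they consolidate to one URL.
    return f'<link rel="canonical" href="https://www.example.com{subcategory_path}">'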

With smart preloading it should perform better and feel smoother to the visitor, preserve site functionality, and consolidate superfluous URLs without sacrificing juice or clicks in a way that should appear pretty much seamless to the average visitor. Any downside you guys can think of?

grimmer




msg:4321247
 7:29 pm on Jun 2, 2011 (gmt 0)

My guiding rule is to redirect as little as possible. I've seen many sites that have generated their own ranking troubles over time by being very casual with the 301 status.

tedster, can you elaborate this a little bit more? What kind of troubles have you seen?

We also have over 100,000 possibly duplicate pages; we have tagged those as "NOINDEX" since Panda. We do have traffic and possibly link juice from those pages. After reviewing this thread, we think we shall 301 redirect those pages to corresponding pages with similar functions.

walkman




msg:4321257
 7:47 pm on Jun 2, 2011 (gmt 0)

grimmer, 100,000 redirects will almost certainly flag your site or place your pages in a 'let's wait and see' sandbox. 301-ing a domain with xx,000 backlinks to yours will most likely call for a manual review. Then, all bets are off.

grimmer




msg:4321293
 9:04 pm on Jun 2, 2011 (gmt 0)

walkman, thanks for the advice.

We are an ecommerce site with about 8,000 products, and each product comes in over 20 colors. Before Panda, we had a page for each product color, so we had a lot of quite similar pages for each product. Since Panda, we have redesigned our product page to show all colors, and the previous product/color pages are not needed anymore. We had put "NOINDEX" on those pages. But after reviewing this thread, we thought a 301 redirect from the product/color pages to the new product page might be a better option, because it will carry the link juice to the new page.

But will it cause any trouble?

whatson




msg:4321306
 9:51 pm on Jun 2, 2011 (gmt 0)

404 the pages, and eventually they will drop out of the index. However, some people are reporting further losses in rankings after dropping so many pages from their site. I am guessing this might be something to do with PR, i.e. the PR has been spread across all the extra pages, but once they are gone they are no longer able to feed the PR back into the site. I am guessing you just need to wait for a big re-indexing update from Google before all is well again.

walkman




msg:4321325
 10:29 pm on Jun 2, 2011 (gmt 0)

But after reviewing this thread, we thought a 301 redirect from the product/color pages to the new product page might be a better option, because it will carry the link juice to the new page.


Use rel canonical. [google.com...]

Perfect for you

PPC_Chris




msg:4324014
 2:00 pm on Jun 9, 2011 (gmt 0)

Here is an update on this issue if anyone is interested:

April 11: Hit by Panda 2.0, organic traffic drops off tremendously
April 25: Decided to remove 97% (97K) of indexed pages via 301 redirects (about half of the pages de-indexed by May 23, at a rate of 1-2K/day)
May 23: Switched the pages from 301 redirects to 410 (de-indexing continued at 1-2K/day for about a week)
June 1: De-indexing of pages slows to a few hundred pages/day with some days seeing no change
June 9: A huge drop of indexed pages, down to about 8K

I'm not exactly sure what to make of this, but this has been my experience trying to get pages that Google could see as low quality out of the index.

EmptyRoom




msg:4331546
 1:09 pm on Jun 27, 2011 (gmt 0)

Can you give us an update on this, please? Any increase in traffic for "good" pages?

g1smd




msg:4331553
 1:22 pm on Jun 27, 2011 (gmt 0)

@PPC: I've recently removed 50K pages from Google in a similar way (for a different reason: duplicate content due to botched site configuration). Removal from SERPs is quite slow these days.

I do wonder if the size of their index will decrease if lots of sites are doing this.

PPC_Chris




msg:4331588
 2:46 pm on Jun 27, 2011 (gmt 0)

Here is an update... a few days after our indexed pages dropped to 8K, they shot back up to about 30K. And now Google is removing only 100-200 pages per day.

Right now we have 3K total live pages on our site - pages that are not returning a 404/410 error. According to Google Webmaster Tools, they are crawling an average of 10K pages/day. How Google could be crawling a minimum of 6K 404 pages per day yet removing only 100-200 from its index is baffling.

At this point, I am about ready to submit a sitemap that includes all of our currently indexed 404 pages.

g1smd




msg:4331595
 2:55 pm on Jun 27, 2011 (gmt 0)

There is a lag between what they crawl and what they remove from SERPs.

They need to be sure that the 404 isn't just a temporary glitch on your site; using 410 might be better.

Sitemap files must include ONLY valid URLs. Don't link to non-valid URLs from your sitemap.

potentialgeek




msg:4331616
 3:38 pm on Jun 27, 2011 (gmt 0)

Does anyone know the best way to get a lot of pages removed from Google's index? Pre-Panda, my site had about 100K pages indexed. Then we got hit by Panda 2 on April 11. We decided to get rid of the vast majority of these (say 97K of the 100K) that Google might consider thin pages.

What's the update on your rankings?

PPC_Chris




msg:4331629
 4:01 pm on Jun 27, 2011 (gmt 0)

g1smd: we are using 410
potentialgeek: our rankings are basically unchanged since Panda hit on April 11

lucy24




msg:4331634
 4:04 pm on Jun 27, 2011 (gmt 0)

At this point, I am about ready to submit a sitemap that includes all of our currently indexed 404 pages.

Google's sitemap handling seems to be cumulative. (Bing's documentation implies that they do it differently.*) That is, once a page is on a sitemap they crawl it forever, even if you feed them a new sitemap that's entirely different. Once they've decided a page exists, there's almost nothing you can do to purge it from their memory.


* They say that if a lot of pages return a 301, they "don't trust" the sitemap. It would take a hell of a nerve to say this if they don't revise their database when an up-to-date sitemap is submitted.

g1smd




msg:4331637
 4:06 pm on Jun 27, 2011 (gmt 0)

Once they've decided a page exists, there's almost nothing you can do to purge it from their memory.

We'll see how the site with 25 000 URLs now returning "410 Gone" fares, shall we? Game on.

PPC_Chris




msg:4333638
 2:30 pm on Jul 1, 2011 (gmt 0)

Well, Google is still only de-indexing these pages at a rate of about 100-200/day. However, I did notice something very interesting... if I do a site:{domain.com} search at AOL, all of our 410 pages are gone from their index!

So Google crawls thousands of 410 pages every day, yet refuses to remove the vast majority of them from its index. But the Google-powered AOL search has correctly de-indexed these pages.

And we still (of course) have not recovered in the least from Panda despite following all Google suggestions to fix any thin-content issues we may have had...
