
Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Best method of getting pages out of Google's index
realmaverick




msg:4480538
 12:29 pm on Jul 31, 2012 (gmt 0)

I'm struggling to get a few hundred thousand pages out of Google's index.

I recently discovered several hundred thousand indexed pages that were coming from my members' status updates. These were pages with nothing more than "hello", "good morning", etc.

I quickly added noindex, follow to them and after a few weeks, the count went down to 1,000 but has since gone back up to 85,000 and has stayed there for about a week now.

All of the pages appear under /statuses/. Would it be a better idea to remove that DIR via Webmaster Tools and then disallow it via robots.txt?

After the introduction of statuses in December, the site was hit by Panda in January. Nothing is ever 100% certain, but I think there's a good chance these pages did the damage. So I want to get them removed asap, hopefully to catch the next Panda refresh.

Thanks.

 

robzilla




msg:4480560
 1:11 pm on Jul 31, 2012 (gmt 0)

Since there's no value-add for these pages, I would probably request removal via WMT. That's the quickest and easiest way to get rid of so many zero-value pages. When you simply return 404 or 410, as I have been doing, Google will continue to try to crawl them, and they will sit in the index for quite some time. I noticed a large portion of deleted pages disappeared right before the last Panda update, but there are still about a dozen left, even though I have been returning 410 Gone for a month and a half now. I was scared to remove them via WMT as I feared I might lose some (historical) value, somewhere, somehow, I don't know. I doubt that applies to your pages.
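For anyone wondering what "returning 410 Gone" looks like at the application level, here is a minimal sketch as a plain Python WSGI app. The /statuses/ prefix comes from the first post; the function name and the fallback 200 response are assumptions for illustration only, and in practice this would usually be done in the web server or forum software config instead.

```python
def gone_for_statuses(environ, start_response):
    """Minimal WSGI app: answer 410 Gone for anything under /statuses/."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/statuses/"):
        # 410 tells crawlers the page was removed on purpose and will not return.
        status, body = "410 Gone", b"Gone"
    else:
        # Placeholder for the rest of the site (assumption for this sketch).
        status, body = "200 OK", b"OK"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

It can be tried locally with the standard library's wsgiref.simple_server.make_server. As noted above, the status code alone does not make the pages drop out of the index quickly.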

netmeg




msg:4480580
 2:08 pm on Jul 31, 2012 (gmt 0)

I don't think you'll suffer by removing the directory in GWT if it's really mostly low content pages, and personally, I would just slap NOINDEX on everything in it going forward because I find that to be a little more effective than robots.txt. Less chance of random URLs with no snippets showing up.
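If you want to spot-check that the NOINDEX really is on the pages, something like the following rough sketch could help. It only uses the Python standard library; the class and function names are made up for this example, and it looks only at the robots meta tag, not at any X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, follow" into individual lowercase directives.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html):
    """Return True if the page's robots meta tag contains 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Feeding it the HTML of a few sample pages from the directory shows whether the tag made it into the template.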

realmaverick




msg:4480618
 5:15 pm on Jul 31, 2012 (gmt 0)

I've requested that the /statuses/ directory be removed. I have blocked the dir via robots.txt and every one of the pages has noindex, follow.

I realise that with robots.txt in place, Google won't see the noindex tag; it was just a failsafe in case some stray pages get through.

Should do the trick.

netmeg




msg:4480665
 7:11 pm on Jul 31, 2012 (gmt 0)

(unless someone links to them or something)

realmaverick




msg:4480706
 9:16 pm on Jul 31, 2012 (gmt 0)

True, netmeg.

Now that many of the URLs are disappearing, I've noticed there are some others that have not been caught. The URLs start with index.php?app=members

These pages are noindexed. I'm guessing I cannot use the same method as above, because this is not a directory? I know I can block it via robots.txt, but can I remove all URLs containing that string via WMT?

seoholic




msg:4480722
 10:35 pm on Jul 31, 2012 (gmt 0)

I don't see how it could hurt to create an HTML sitemap with links to these pages and use WMT-->Health-->Fetch as Googlebot-->Submit to index-->URL and all linked pages
This could speed up the crawling.
"Select if your page is new or has been recently updated. Google doesn't guarantee to index all submitted URLs."

I have no experience with the URL removal tool.
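The HTML sitemap idea above could be sketched roughly like this. The function name, the page title and the example URL are assumptions for illustration; the only point is to produce one plain page that links to every URL you want recrawled.

```python
from html import escape


def build_html_sitemap(urls, title="Pages to recrawl"):
    """Return a simple HTML page with one link per URL."""
    items = "\n".join(
        # Escape the URL so characters like & are valid inside the attribute.
        f'    <li><a href="{escape(u, quote=True)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )
```

The resulting page would be the one submitted with "URL and all linked pages". Whether Google then recrawls every linked page is not guaranteed, as the quoted help text says.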

lucy24




msg:4480733
 11:27 pm on Jul 31, 2012 (gmt 0)

The urls start with index.php?app=members


Do you mean, literally, "app=members", or is it "app={some number}"? Or "app=members" with an obligatory follow-up like "app=members&memberid={some number}"? You can ask g### to disregard specific parameters, so everything containing "memberid=" collapses to a single page.

afaik, you can't tell it to disregard combinations of parameters: "disregard id= if the url also contains number=" or "disregard memid= and name= if the url contains both" that kind of thing.
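To see which of your own URLs fall into that group, you could parse the query strings from a crawl or log export. A small sketch, assuming the literal "app=members" form; the function name is invented for this example, and "memberid" is only the guess from the question above, not a confirmed parameter name.

```python
from urllib.parse import urlsplit, parse_qs


def is_members_url(url):
    """Return True for index.php URLs whose 'app' parameter is exactly 'members'."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # parse_qs returns a list of values per parameter name.
    return parts.path.endswith("index.php") and params.get("app") == ["members"]
```

Running a list of indexed URLs through this gives a quick count of how many pages the parameter setting would affect.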

klark0




msg:4480763
 1:04 am on Aug 1, 2012 (gmt 0)

maverick, you probably shouldn't use robots.txt plus noindex,follow.

Blocking via robots.txt doesn't prevent URLs from being indexed. It also doesn't remove existing URLs from the index. If you have added noindex,follow to the pages, then Google will never know about it, since they can't read the page. They'll only know about the URL.

At least that's what I understand and have experienced.

Just do the removal inside WMT, and ensure noindex,follow on all pages you want out of the index. There's really no need for anything in your robots.txt, besides an Allow all and a link to your sitemap index.
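A quick way to see the conflict is to check a URL against the robots.txt rules with Python's standard urllib.robotparser. This is only a sketch: the Disallow rule below is an assumed version of the block described earlier in the thread, and example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, modelled on the /statuses/ block in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /statuses/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

If is_crawlable(ROBOTS_TXT, "https://example.com/statuses/123") comes back False, a compliant crawler will not fetch the page, so it never gets to read the noindex tag inside it.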

realmaverick




msg:4480940
 2:45 pm on Aug 1, 2012 (gmt 0)

Klark, I first requested removal of the DIR via WMT. Google specify that to make this request, you MUST block the directory via robots.txt, which makes sense.

The noindex, follow was just a fail safe. But I'm pretty confident removing via WMT plus blocking via robots.txt will do the trick.

realmaverick




msg:4480942
 2:46 pm on Aug 1, 2012 (gmt 0)

@seoholic, the pages are now gone from the index.

@lucy, the URLs are literally app=members, but there is a member ID associated later on in the URL.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved