Best method of getting pages out of Google's index

     
12:29 pm on Jul 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


I'm struggling to get a few hundred thousand pages out of Google's index.

I recently discovered several hundred thousand indexed pages that were coming from my members' status updates. These were pages with nothing more than "hello", "good morning" etc.

I quickly added noindex, follow to them and after a few weeks the count went down to 1,000, but it has since gone back up to 85,000 and has stayed there for about a week now.

All of the pages appear under /statuses/. Would it be a better idea to remove that DIR via Webmaster Tools and then disallow it via robots.txt?

After the introduction of statuses in December, the site was hit by Panda in January. Nothing is ever 100% certain, but I think there's a good chance these pages did the damage. So I want to get them removed asap, hopefully to catch the next Panda refresh.

Thanks.
1:11 pm on July 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:965
votes: 68


Since there's no value-add for these pages, I would probably request removal via WMT. That's the quickest and easiest way to get rid of so many zero-value pages. When you simply return 404 or 410, as I have been doing, Google will continue to try to crawl them, and they will sit in the index for quite some time. I noticed a large portion of deleted pages disappeared right before the last Panda update, but there are still about a dozen left, even though I have been returning 410 Gone for a month and a half now. I was scared to remove them via WMT as I feared I might lose some (historical) value, somewhere, somehow, I don't know. I doubt that applies to your pages.
2:08 pm on July 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12670
votes: 141


I don't think you'll suffer by removing the directory in GWT if it's really mostly low content pages, and personally, I would just slap NOINDEX on everything in it going forward because I find that to be a little more effective than robots.txt. Less chance of random URLs with no snippets showing up.
5:15 pm on July 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


I've requested that the /statuses/ directory be removed. I have blocked the dir via robots.txt and every one of the pages has noindex, follow.

I realise that with robots.txt in place Google won't see the noindex tag. It was just a failsafe in case some stray pages get through.

Should do the trick.
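
For anyone wanting to sanity-check a robots.txt block like the one described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example.com URLs and the `is_crawlable` helper are assumptions for illustration; only the /statuses/ path comes from this thread. Python's parser is also not guaranteed to match Googlebot's behaviour in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; only the /statuses/ path is taken from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /statuses/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# example.com stands in for the real site.
blocked = is_crawlable(ROBOTS_TXT, "https://example.com/statuses/12345")
allowed = is_crawlable(ROBOTS_TXT, "https://example.com/forum/topic-1")
print(blocked, allowed)
```

This only checks whether a crawler may fetch a URL. It says nothing about whether the URL is currently in the index.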
7:11 pm on July 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12670
votes: 141


(unless someone links to them or something)
9:16 pm on July 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


True netmeg.

Now that many of the urls are disappearing, I've noticed there are some others that have not been caught. The urls start with index.php?app=members

These pages are noindexed. I'm guessing I cannot use the same method as above, because this is not a directory? I know I can block it via robots.txt, but can I remove all URLs containing that string via WMT?
10:35 pm on July 31, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:June 29, 2010
posts: 87
votes: 0


I don't see how it could hurt to create an HTML sitemap with links to these pages and use WMT-->Health-->Fetch as Googlebot-->Submit to index-->URL and all linked pages.
This could speed up the crawling.
"Select if your page is new or has been recently updated. Google doesn't guarantee to index all submitted URLs."

I have no experience with the URL removal tool.
11:27 pm on July 31, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12693
votes: 244


The urls start with index.php?app=members


Do you mean, literally, "app=members", or is it "app={some number}"? Or "app=members" with an obligatory follow-up like "app=members&memberid={some number}"? You can ask g### to disregard specific parameters, so everything containing "memberid=" collapses to a single page.

afaik, you can't tell it to disregard combinations of parameters: "disregard id= if the url also contains number=" or "disregard memid= and name= if the url contains both" that kind of thing.
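
To see which of your own URLs fall into the "app=members" group before deciding how to handle them, a rough sketch along these lines can help. It uses Python's standard `urllib.parse`. The function names and the example.com URLs are made up for illustration, and "memberid" is only the guessed parameter name from the question above, not something confirmed about the site.

```python
from urllib.parse import urlsplit, parse_qs


def is_member_url(url: str) -> bool:
    """True if the URL is index.php with the query parameter app=members."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    return parts.path.endswith("index.php") and params.get("app") == ["members"]


def member_id(url: str, id_param: str = "memberid"):
    """Return the member id value, or None if the parameter is absent.

    `id_param` is an assumed name; the real parameter may differ.
    """
    values = parse_qs(urlsplit(url).query).get(id_param)
    return values[0] if values else None


# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/index.php?app=members&memberid=42",
    "https://example.com/index.php?app=forums&topic=7",
]
member_urls = [u for u in urls if is_member_url(u)]
print(member_urls)
```

Run against a list of URLs from your server logs or sitemap, this shows how many pages the parameter setting would affect.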
1:04 am on Aug 1, 2012 (gmt 0)

Junior Member

joined:Jan 9, 2012
posts: 192
votes: 0


maverick, you probably shouldn't use robots.txt plus noindex,follow.

Blocking via robots.txt doesn't prevent urls from being indexed. It also doesn't remove existing urls from the index. If you have added noindex,follow to the pages, then Google will never know about it, since they can't read the page. They'll only know about the URL.

At least that's what I understand and have experienced.

Just do the removal inside WMT, and ensure noindex,follow on all pages you want out of the index. There's really no need for anything in your robots.txt, besides an Allow all and a link to your sitemap index.
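
The point about noindex being invisible behind a robots.txt block can be sketched as below, using Python's standard `html.parser`. This is an illustration of the reasoning only, not how Google actually processes pages. `RobotsMetaParser` and `noindex_is_visible` are hypothetical names, and the `crawl_allowed` flag is assumed to come from a separate robots.txt check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def noindex_is_visible(html: str, crawl_allowed: bool) -> bool:
    """A crawler can only act on noindex if it is allowed to fetch the page."""
    if not crawl_allowed:
        return False  # page never fetched, so the meta tag is never read
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_is_visible(PAGE, crawl_allowed=True))
print(noindex_is_visible(PAGE, crawl_allowed=False))
```

The same page gives a different answer depending on `crawl_allowed`, which is the whole argument: the tag only counts if the page can be crawled.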
2:45 pm on Aug 1, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


Klark, I first requested removal of the DIR via WMT. Google specify that to make this request, you MUST block via robots.txt. Which makes sense.

The noindex, follow was just a fail safe. But I'm pretty confident removing via WMT plus blocking via robots.txt will do the trick.
2:46 pm on Aug 1, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


@seoholic, the pages are now gone from the index.

@lucy, the urls are app=members but there is a member ID associated later on in the URL.