Google SEO News and Discussion Forum

Best way to *remove* pages from Google
Nice 404 page, or 301 to home page?
sublime1




msg:748365
 2:04 am on Jun 8, 2005 (gmt 0)

We tried an experiment last winter that resulted in a whole lot of new pages on a site that was already rather large. The new pages are thin and really don't do anything for users, and based on traffic, Google seems to agree: the pages are indexed, but they get essentially no search traffic.

We would like to remove these pages and restore focus to the quality content on the site.

We could return a 404 "Not Found" or a 301 "Moved Permanently". In either case, the resulting page could know how to get users close to what they were looking for by examining the URL (e.g. "WidgetMaster 2003 is no longer available, here are other widgets from Widgets, Inc.").

Any suggestions?

Thanks!

 

Reid




msg:748366
 11:37 am on Jun 8, 2005 (gmt 0)

I went through the same process.
Don't do 404. Google will keep requesting the pages, pages related to them will go supplemental, and Google referrals will plummet. Not good.

Here is what you do:

Disallow the removed pages in robots.txt.
Use a 301 via .htaccess to direct any referrals to an alternate page.

Submit your robots.txt file to the Google URL removal tool - it has an option where you submit your robots.txt.
Within 48 hrs those pages will be wiped from Google.
Those URLs will not be requested again for 6 months (or maybe never), so write off those filenames for future use.
Make sure you validate your robots.txt first; anything disallowed which appears in the index will be removed.
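
A minimal sketch of that setup, assuming Apache with mod_alias and a hypothetical retired directory /widgets-experiment/ redirecting to a hypothetical /widgets/ landing page.

robots.txt:

User-agent: Googlebot
Disallow: /widgets-experiment/

.htaccess:

# Send stray visitors to the main widgets page with a 301 Moved Permanently
RedirectMatch permanent ^/widgets-experiment/ /widgets/

(Once the directory is disallowed, Googlebot won't fetch those URLs again, so the 301 mainly serves human visitors arriving from old links.)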

sublime1




msg:748367
 12:40 pm on Jun 8, 2005 (gmt 0)

Reid --

Thanks very much! I am glad I asked and appreciate your reply.

sublime1

yanyading




msg:748368
 3:23 pm on Jun 8, 2005 (gmt 0)

I requested removal of 2 directories of my China travel website, but I ran into the "Significant" Google update - only 300 URLs are left in Google, and all the page content disappeared. I am not sure whether this was caused by my removal request...

Hope my English writing is good enough to understand - I am not a native English speaker... :-)

Reid




msg:748369
 8:09 pm on Jun 8, 2005 (gmt 0)

yanyading - when you remove files or directories by submitting robots.txt, you can see which files have been removed by looking at the 'options' page.
There will be a list of the files requested to be removed and their status.
Status will be:
Pending - requested to be removed, but not done yet
Complete - removed files
Request denied - if you change your robots.txt before the removal so that the files are no longer disallowed, your request for removal will be denied.

If you make a mistake and disallow too much, for example:

user-agent: googlebot
disallow: /

this robots.txt will cause your whole website to be removed. So be careful and make sure you know your robots.txt exactly.

If you make a mistake and there are files pending which you don't want removed, then change your robots.txt immediately, before the removal bot comes:

user-agent: *
disallow:

This (or an empty robots.txt) will cause nothing to be removed, and all your requests will be denied. Once they are 'complete' it is too late to change it.

grippo




msg:748370
 8:35 pm on Jun 9, 2005 (gmt 0)

Make your web server return response code 410 HTTP_GONE for those pages. This will cause Googlebot to stop requesting them, and will also delete them from the database.
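
A minimal way to do this on Apache (mod_alias; the filename is hypothetical, borrowing sublime1's example product):

# Answer requests for the retired page with 410 Gone instead of 404 Not Found
Redirect gone /widgetmaster-2003.html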

sublime1




msg:748371
 9:18 pm on Jun 9, 2005 (gmt 0)

Grippo --

Are you certain that Google will respond correctly to 410? That sure sounds like the way to go if it does.

Thanks!

sailorjwd




msg:748372
 10:12 pm on Jun 9, 2005 (gmt 0)

You could put adsense on them...
From my experience those pages will be gone from the SERPs within the next update.

Ohhh, and make the titles similar to the filename.

grippo




msg:748373
 11:14 pm on Jun 9, 2005 (gmt 0)


Grippo --

Are you certain that Google will respond correctly to 410? That sure sounds like the way to go if it does.

Thanks!

Yes, 100%. For ages I had thousands of pages which responded REDIRECT 302, and later REDIRECT 301, just because I decided to move foo.org/dir to dir.foo.org, and foo.org/dir/* stayed listed for years (most of them without a title) until I managed to respond 410 HTTP_GONE. The beauty of all this is that it's just common sense.

BillyS




msg:748374
 12:00 am on Jun 10, 2005 (gmt 0)

Make your web server return response code 410 HTTP_GONE for those pages. This will cause Googlebot to stop requesting them, and will also delete them from the database.

Technically, this is saying...

Yup, the page used to be here, but now it's gone.

I've used this technique before and both Yahoo and Google handle it correctly.

Reid




msg:748375
 12:37 am on Jun 10, 2005 (gmt 0)

You could put adsense on them...
From my experience those pages will be gone from the SERPs within the next update.
Ohhh, and make the titles similar to the filename.


Please save the sarcasm for less technical threads, like PR or the last update. Some people have a hard time with this stuff and are easily confused.

I have heard from others on WW that 410 works too. I was just in 'remove pages from Google' mode.

Either method will work, but robots.txt will do it almost instantly; I'm not sure how long a 410 takes before the URL is actually removed.
A robots meta tag on the page (some say it works, others say not) will also remove the page, but it takes a month or so.

Wizard




msg:748376
 8:45 am on Jun 10, 2005 (gmt 0)

In my experience, using a 301 is a good way, but slow. Even if I put a link to the old URL on a frequently spidered page, it takes at least a few weeks.

The URL Console is fast - I often use it to remove pages returning 404, and that takes a few days - but there is a downside: after six months, removed URLs tend to reappear in the index, despite the fact that they have been returning 404 ever since removal. Currently, I am having a lot of trouble removing, again, the outdated pages I removed in November.

Johan007




msg:748377
 9:09 am on Jun 10, 2005 (gmt 0)

Guys, I want to remove all the pages in my "MoviePrints" folder (considered to be spam, but it is a decent affiliate shop). Does this look OK?

User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.pdf$
Disallow: /*.avi$

Natashka




msg:748378
 10:37 am on Jun 10, 2005 (gmt 0)

Well, I think you can just put one line in your .htaccess file:

RedirectMatch gone /MoviePrints/.*

and everything inside that folder will be gone.
Maybe I am wrong - I am not a UNIX guru - but that's how I did it, and it works so far.
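
One small caveat: RedirectMatch takes an unanchored regular expression, so the pattern above would also match /MoviePrints/ appearing deeper inside some other path. Anchoring it is slightly safer:

# Match only URLs whose path starts with /MoviePrints/
RedirectMatch gone ^/MoviePrints/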

Reid




msg:748379
 4:16 pm on Jun 10, 2005 (gmt 0)

User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.pdf$
Disallow: /*.avi$


This will remove all MoviePrints, images, banners, products, .gif, .jpg, .pdf, and .avi files.

A 404 will keep coming back unless you disallow the URL with robots.txt or return a 410.
Google does not interpret a 404 as a removed page; it treats it as a temporary error.

The proper way would be to return a 410 with .htaccess, and then use robots.txt to remove it quickly if you like.

Reid




msg:748380
 4:30 pm on Jun 10, 2005 (gmt 0)

Wait, there is a better way.

If you are still getting traffic to those URLs (the 404s), then what I like to do is set up a 301 redirect to pick up stray traffic. Disallow the pages in robots.txt and remove them from Google. Leave it that way for 3-4 months, until there are no more requests for those URLs - there are other search engines, and the disallow should cause them to remove the pages eventually too.
After the requests peter out, change the 301 to a 410 and take the entries out of the robots.txt file.
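
A sketch of that two-phase .htaccess setup, assuming Apache and the same hypothetical /widgets-experiment/ directory as above:

# Phase 1 (roughly the first 3-4 months): catch stray traffic with a 301
RedirectMatch permanent ^/widgets-experiment/ /widgets/

# Phase 2 (after requests peter out): swap the 301 for a 410
# RedirectMatch gone ^/widgets-experiment/

During phase 1 the directory is also disallowed in robots.txt; in phase 2 that entry is removed, per the steps above.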

Trisha




msg:748381
 6:26 pm on Jun 11, 2005 (gmt 0)

I have a number of pages to remove also, so how do you 'return a 410 with .htaccess' anyway?

And what should you do if you have lots of individual pages that no longer exist, but other files in their directory do still exist, so you can't just block the whole directory?

confused ellie




msg:748382
 8:44 pm on Jun 11, 2005 (gmt 0)

We had several hundred pages, maybe even more, that we just switched to 404s because they were old and in some cases had duplicate content - we had updated the site to a new look and still had the old site up. I take it this was the wrong way? We set the 404s about 2 evenings ago.

I am feeling rather concerned right now. :/

Also, I am a little confused about where this 'options' page is that we can watch if we submit a robots.txt file. Where would we submit that?

Thanks!

Ellie

Reid




msg:748383
 5:56 am on Jun 12, 2005 (gmt 0)

so how do you 'return a 410 with .htaccess' anyway?
And what should you do if you have lots of individual pages that no longer exist, but other files in their directory do still exist, so you can't just block the whole directory?


How to do a 410: you should go to the Apache forum to learn about .htaccess. Never just cut and paste stuff you don't understand into .htaccess - know what you are doing, and you will find help there. Other servers have different methods of sending a 410; their respective forums will help.

Each individual page would have to be disallowed, unless you want to remove the whole directory.
If you can get them returning 410, then you could just let it ride - they will be removed. .htaccess can do wildcards, if that helps.
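
A sketch of both cases, assuming Apache mod_alias and hypothetical filenames:

# Retire one specific page
Redirect gone /tours/beijing-2003.html

# Retire many pages matching a pattern (the 'wildcards', via a regex)
RedirectMatch gone ^/tours/2003-.*\.html$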
We had several hundred pages, maybe even more, that we just switched to 404s because they were old and in some cases had duplicate content - we had updated the site to a new look and still had the old site up. I take it this was the wrong way? We set the 404s about 2 evenings ago.
I am feeling rather concerned right now. :/


You should be concerned - Google chokes on 404s; with several hundred of them, your website could be in the supplemental index within a month or 2.
Also, I am a little confused about where this 'options' page is that we can watch if we submit a robots.txt file. Where would we submit that?


[services.google.com:8882...]

Once you sign up, you get into the 'options' page, where you are given 3 options.
The first option is to submit a robots.txt file.
There is a large grey area on the right side of the 'options' page; that is where your requests and their status will appear.

Before you submit your robots.txt file, it is critical that you understand it and validate it. This tool is able to remove your entire domain from Google for 6 months (if you disallow: /).

confused ellie




msg:748384
 3:14 pm on Jun 12, 2005 (gmt 0)

Thanks Reid, I appreciate the help.

confused ellie




msg:748385
 5:18 pm on Jun 12, 2005 (gmt 0)

OK, robots.txt file all set up and uploaded. Hopefully it's not too late, but at least it's there now.

We followed Google's instructions exactly for:

"Remove part of your website"

Thanks again for the tips!

Ellie

Trisha




msg:748386
 10:45 pm on Jun 13, 2005 (gmt 0)

Thanks Reid! I have set up the 410's and haven't yet decided if I will also do the robots/url removal thing or not.

Reid




msg:748387
 12:44 am on Jun 14, 2005 (gmt 0)

confused - you should see on the 'options' page which files Google intends to remove. If something is wrong, you have a little time (24-48 hrs) to edit robots.txt before the removal actually happens. If you change robots.txt to allow everything, all your requests will be denied, and then you can try again.

Trish - even just returning a 410 is good enough; I'm just not sure how long it takes. I would guess within a crawl, but if not, it would happen on the next update.

Johan007




msg:748388
 11:10 am on Jun 14, 2005 (gmt 0)

Thanks guys, but do note that Windows servers do NOT have .htaccess.

Anyway, I re-wrote it because not all spiders do wildcards:


User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$

Source: [google.com...]

Reid




msg:748389
 5:13 pm on Jun 14, 2005 (gmt 0)

Only Apache uses .htaccess; other servers can return a 410, but through different tools - check the forum for your server for the method.

user-agent: * should be the last record in robots.txt, because all robots (or most) will follow the directives of their own record or of *, whichever comes first.
In the robots.txt above, Googlebot-Image may follow the user-agent: * directives without ever seeing the User-agent: Googlebot-Image record.
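
Under that conservative first-match assumption, Johan007's file would be reordered like this:

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$

User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/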
