We would like to remove these pages and restore focus to the quality content on the site.
We could return a 404 "Not Found" or a 301 "Moved Permanently". In either case, the resulting page could examine the requested URL and point users toward what they were looking for (e.g. "WidgetMaster 2003 is no longer available; here are other widgets from Widgets, Inc.").
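On Apache, one lightweight way to get that behaviour is a custom error document; the script name here is hypothetical:

ErrorDocument 404 /no-longer-available.php
# Apache passes the originally requested path to the error document
# (e.g. in the REDIRECT_URL environment variable), so the script can
# parse it and suggest current widgets instead.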
Any suggestions?
Thanks!
Here is what you do:
Disallow the removed pages in robots.txt
Use a 301 via .htaccess to redirect any referral traffic to an alternate page (sketched below).
Submit your robots.txt file to the Google URL removal tool; it has an option specifically for submitting a robots.txt file.
Within 48 hours those pages will be wiped from Google.
Those URLs will not be requested again for six months (or maybe never), so write off those filenames for future use.
Make sure you validate your robots.txt first; anything disallowed that appears in the index will be removed.
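For the robots.txt and 301 steps, a minimal sketch (the filenames are hypothetical; the 301 assumes Apache with mod_alias):

# robots.txt - keep the removed page out of the index
User-agent: *
Disallow: /widgets/widgetmaster-2003.html

# .htaccess - send stray visitors to a relevant alternate page
Redirect 301 /widgets/widgetmaster-2003.html /widgets/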
Hope my English writing is good enough to understand; I am not a native English speaker. :-)
If you make a mistake and disallow too much, for example:
user-agent: googlebot
disallow: /
this robots.txt will cause your whole website to be removed, so be careful and make sure you understand your robots.txt exactly.
If you make a mistake and there are pending requests for files you don't want removed, change your robots.txt immediately, before the removal bot comes.
user-agent: *
disallow:
This robots.txt (or an empty one) will cause nothing to be removed, and all your pending requests will be denied. Once requests are 'complete' it is too late to change them.
Grippo - Are you certain that Google will respond correctly to a 410? That sure sounds like the way to go if it does.
Thanks!
Yes, 100%. For ages I had thousands of pages which responded with a 302 redirect, and later a 301, just because I decided to move foo.org/dir to dir.foo.org, and foo.org/dir/* were listed for years (most of them without a title) until I managed to respond with 410 HTTP_GONE. The beauty of all this is that it's just common sense.
Make your web server return response code 410 HTTP_GONE for those pages. This will cause Googlebot to stop requesting them, and they will also be deleted from the index.
Technically, this is saying...
Yup, the page used to be here, but now it's gone.
I've used this technique before and both Yahoo and Google handle it correctly.
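On Apache, one way to return the 410 is mod_alias's "gone" keyword (a minimal sketch; the path is hypothetical):

# .htaccess - tell crawlers the page is gone for good
Redirect gone /widgets/widgetmaster-2003.html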
You could put AdSense on them...
From my experience those pages will be gone from the SERPs within the next update.
Ohhh, and make the titles similar to the filename.
I have heard from others on WW that 410 works too. I was just in 'remove pages from Google' mode.
Either method will work, but the robots.txt approach will do it instantly. I'm not sure how long a 410 takes before the URL is actually removed.
A robots meta tag on the page (some say it works, others say it doesn't) will also remove the page, but it takes a month or so.
The URL Console is fast; I often use it to remove pages returning 404, and this takes a few days. But there is a downside: after six months, removed URLs tend to reappear in the index, despite the fact that they have been returning 404 ever since removal. Currently, I am having a lot of trouble removing again the outdated pages I removed in November.
User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.pdf$
Disallow: /*.avi$
A 404 will keep coming back unless you disallow the URL in robots.txt or return a 410.
Google does not interpret a 404 as a removed page; it treats it as a temporary error.
The proper way would be to return a 410 via .htaccess, and then use robots.txt (and the removal tool) if you want it gone quickly.
If you are still getting traffic to those URLs (the 404s), what I like to do is set up a 301 redirect to pick up the stray traffic, disallow them in robots.txt, and remove them from Google. Leave it that way for 3-4 months until there are no more requests for those URLs; there are other search engines, and the disallow should cause them to remove the pages eventually, too.
After the requests peter out, change the 301 to a 410 and take the entries out of the robots.txt file.
And what should you do if you have lots of individual pages that no longer exist, but other files in the same directory do still exist, so you can't just block the whole directory?
I am feeling rather concerned right now. :/
Also, I am confused a little about where this "'options' page" is that we can watch if we submit a robots.txt file. Where would we submit that?
Thanks!
Ellie
So how do you 'return a 410 with .htaccess' anyway?
And what should you do if you have lots of individual pages that no longer exist, but other files in the same directory do still exist, so you can't just block the whole directory?
Each individual page would have to be disallowed, unless you want to remove the whole directory.
If you can get them returning 410, you could just let it ride; they will be removed. .htaccess can do wildcards if that helps (see the sketch below).
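A minimal wildcard sketch, assuming Apache with mod_alias and a hypothetical naming pattern for the dead pages:

# .htaccess - 410 only the discontinued pages that match the pattern,
# leaving the rest of the directory untouched
RedirectMatch gone ^/products/discontinued-.*\.html$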
We had several hundred pages, maybe even more, that we just switched to 404s because they were old and in some cases had duplicate content: we had updated our site to a new look and still had the old site up. I take it this was the wrong way? We set the 404s about two evenings ago.
I am feeling rather concerned right now. :/
Also, I am confused a little about where this "'options' page" is that we can watch if we submit a robots.txt file. Where would we submit that?
Once you sign up you get to the 'options' page, where you are given three options.
The first option is to submit a robots.txt file.
There is a large grey area on the right side of the 'options' page. That is where your requests and their status will appear.
Before you submit your robots.txt file it is critical that you understand it and validate it. This tool is able to remove your entire domain from Google for six months (if you disallow: /).
Trish - even just returning a 410 is good enough; I'm just not sure how long it takes. I would guess within a crawl, but if not, it would happen on the next update.
Anyway, I re-wrote it because not all spiders do wildcards:
User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$
Source: [google.com...]
User-agent: * should be the last group in robots.txt, because all robots (or most) will follow the directives of their own group or of *, whichever comes first.
In the above robots.txt, Googlebot-Image may follow the User-agent: * directives without ever seeing the User-agent: Googlebot-Image group.
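Following that suggestion, the same rules could be laid out with the specific group first and the catch-all last (a sketch based on the robots.txt posted above, not a guarantee about every crawler):

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$

User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/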