We would like to remove these pages and restore focus to the quality content on the site.
We could return a 404 "Not Found" or a 301 "Moved Permanently". In either case, the resulting page could examine the requested URL and point users toward what they were looking for (e.g. "WidgetMaster 2003 is no longer available; here are other widgets from Widgets, Inc.").
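On Apache, one lightweight way to get that behaviour is a custom error document; the script name here is hypothetical:

ErrorDocument 404 /no-longer-available.php
# Apache passes the originally requested path to the error document
# (e.g. in the REDIRECT_URL environment variable), so the script can
# parse it and suggest current widgets instead.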
Any suggestions?
Thanks!
Here is what you do:
Disallow the removed pages in robots.txt
Use a 301 via .htaccess to redirect any referral traffic to an alternate page (sketched below).
Submit your robots.txt file to the Google URL removal tool; it has an option specifically for submitting a robots.txt file.
Within 48 hours those pages will be wiped from Google.
Those URLs will not be requested again for six months (or maybe never), so write off those filenames for future use.
Make sure you validate your robots.txt first; anything disallowed that appears in the index will be removed.
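For the robots.txt and 301 steps, a minimal sketch (the filenames are hypothetical; the 301 assumes Apache with mod_alias):

# robots.txt - keep the removed page out of the index
User-agent: *
Disallow: /widgets/widgetmaster-2003.html

# .htaccess - send stray visitors to a relevant alternate page
Redirect 301 /widgets/widgetmaster-2003.html /widgets/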
Hope my English writing is good enough to understand; I am not a native English speaker. :-)
If you make a mistake and disallow too much, for example:
user-agent: googlebot
disallow: /
this robots.txt will cause your whole website to be removed, so be careful and make sure you understand your robots.txt exactly.
If you make a mistake and there are pending requests for files you don't want removed, change your robots.txt immediately, before the removal bot comes.
user-agent: *
disallow:
This robots.txt (or an empty one) will cause nothing to be removed, and all your pending requests will be denied. Once requests are 'complete' it is too late to change them.
Grippo - Are you certain that Google will respond correctly to a 410? That sure sounds like the way to go if it does.
Thanks!
Yes, 100%. For ages I had thousands of pages which responded with a 302 redirect, and later a 301, just because I decided to move foo.org/dir to dir.foo.org, and foo.org/dir/* were listed for years (most of them without a title) until I managed to respond with 410 HTTP_GONE. The beauty of all this is that it's just common sense.
Make your web server return response code 410 HTTP_GONE for those pages. This will cause Googlebot to stop requesting them, and they will also be deleted from the index.
Technically, this is saying...
Yup, the page used to be here, but now it's gone.
I've used this technique before and both Yahoo and Google handle it correctly.
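On Apache, one way to return the 410 is mod_alias's "gone" keyword (a minimal sketch; the path is hypothetical):

# .htaccess - tell crawlers the page is gone for good
Redirect gone /widgets/widgetmaster-2003.html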
You could put AdSense on them...
From my experience those pages will be gone from the SERPs within the next update.
Ohhh, and make the titles similar to the filename.
I have heard from others on WW that 410 works too. I was just in 'remove pages from Google' mode.
Either method will work, but the robots.txt approach will do it instantly. I'm not sure how long a 410 takes before the URL is actually removed.
A robots meta tag on the page (some say it works, others say it doesn't) will also remove the page, but it takes a month or so.
The URL Console is fast; I often use it to remove pages returning 404, and this takes a few days. But there is a downside: after six months, removed URLs tend to reappear in the index, despite the fact that they have been returning 404 ever since removal. Currently, I am having a lot of trouble removing again the outdated pages I removed in November.
User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.pdf$
Disallow: /*.avi$
A 404 will keep coming back unless you disallow the URL in robots.txt or return a 410.
Google does not interpret a 404 as a removed page; it treats it as a temporary error.
The proper way would be to return a 410 via .htaccess, and then use robots.txt (and the removal tool) if you want it gone quickly.
If you are still getting traffic to those URLs (the 404s), what I like to do is set up a 301 redirect to pick up the stray traffic, disallow them in robots.txt, and remove them from Google. Leave it that way for 3-4 months until there are no more requests for those URLs; there are other search engines, and the disallow should cause them to remove the pages eventually, too.
After the requests peter out, change the 301 to a 410 and take the entries out of the robots.txt file.
And what should you do if you have lots of individual pages that no longer exist, but other files in the same directory do still exist, so you can't just block the whole directory?
I am feeling rather concerned right now. :/
Also, I am confused a little about where this "'options' page" is that we can watch if we submit a robots.txt file. Where would we submit that?
Thanks!
Ellie
So how do you 'return a 410 with .htaccess' anyway?
And what should you do if you have lots of individual pages that no longer exist, but other files in the same directory do still exist, so you can't just block the whole directory?
Each individual page would have to be disallowed, unless you want to remove the whole directory.
If you can get them returning 410, you could just let it ride; they will be removed. .htaccess can do wildcards if that helps (see the sketch below).
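A minimal wildcard sketch, assuming Apache with mod_alias and a hypothetical naming pattern for the dead pages:

# .htaccess - 410 only the discontinued pages that match the pattern,
# leaving the rest of the directory untouched
RedirectMatch gone ^/products/discontinued-.*\.html$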
We had several hundred pages, maybe even more, that we just switched to 404s because they were old and in some cases had duplicate content: we had updated our site to a new look and still had the old site up. I take it this was the wrong way? We set the 404s about two evenings ago.
I am feeling rather concerned right now. :/
Also, I am confused a little about where this "'options' page" is that we can watch if we submit a robots.txt file. Where would we submit that?
Once you sign up you get to the 'options' page, where you are given three options.
The first option is to submit a robots.txt file.
There is a large grey area on the right side of the 'options' page. That is where your requests and their status will appear.
Before you submit your robots.txt file it is critical that you understand it and validate it. This tool is able to remove your entire domain from Google for six months (if you disallow: /).
Trish - even just returning a 410 is good enough; I'm just not sure how long it takes. I would guess within a crawl, but if not, it would happen on the next update.
Anyway, I re-wrote it because not all spiders do wildcards:
User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$
Source: [google.com...]
User-agent: * should be the last group in robots.txt, because all robots (or most) will follow the directives of their own group or of *, whichever comes first.
In the above robots.txt, Googlebot-Image may follow the User-agent: * directives without ever seeing the User-agent: Googlebot-Image group.
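Following that suggestion, the same rules could be laid out with the specific group first and the catch-all last (a sketch based on the robots.txt posted above, not a guarantee about every crawler):

User-agent: Googlebot-Image
Disallow: /*.gif$
Disallow: /*.jpg$

User-agent: *
Disallow: /MoviePrints/
Disallow: /images/
Disallow: /banners/
Disallow: /products/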