My question is, should I remove the pages and just issue a standard 404 not found response, or should I issue a 301 redirect back to my www.example.com page to capture any potential external links that might be going to these pages?
I think the "right" thing to do is issue the 404 and just ignore any links that are going here, but the SEO side of me says I should issue the 301 redirect from /products/widget.html to www.example.com/ to funnel any potential link weight those old pages have from external links to the main index.
It sounds like doing a 404 is the right thing, and just throwing away any potential external links that might be pointing to these pages, since links to the main index are worth more than links to the individual products?
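In Apache terms, the two options I'm weighing would look something like this (just a rough sketch, assuming mod_alias is available; the error page name is made up):

  # Option A: delete the files and let Apache return its default 404 Not Found,
  # optionally pointing visitors at a friendlier error page
  ErrorDocument 404 /not-found.html

  # Option B: 301 the old product page to the home page
  Redirect 301 /products/widget.html http://www.example.com/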
They are accessible now, but really buried, and it takes work to find them.
[mattcutts.com...]
Here is the important info from it:
"The most useful tidbit ... is that Google treats a 404 HTTP status code ... and a 410 HTTP status code ... in the same way ... once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being."
Thanks for the replies BTW.
But I wouldn't sweat it; the 'risks' associated with just binning the pages are small - it's all a matter of doing what is most helpful to the occasional visitor who may otherwise be confused.
Even with a (user friendly) 404, the pages will simply drop out of the index eventually, with no real harm to man or beast.
Another issue is the sloppy way MSN Search handles 301s (which can't be ignored) - it keeps a page saying "301 Moved" in the cache, which usually shows up at the very beginning of a site: search. There's more than just Google in life for sites, and I personally see no reason not to do exactly what the Apache documentation indicates is the correct handling for removed pages, when it's easy enough to do.
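For what it's worth, sending the 410 Gone status that marks a deliberately removed page is easy with mod_alias (a minimal sketch; the path is hypothetical):

  # Tell clients and bots the page was removed on purpose (410 Gone)
  Redirect gone /old-products/widget.html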
Ok, though the point, as far as Google is concerned, is to tell Googlebot a page no longer exists. Googlebot doesn't need to see a 404 or a 410 if the url is disallowed in robots.txt.
According to Vanessa Fox,
Content Removal:
Individual URLs
Choose this option if you'd like to remove a URL or image. In order for the URL to be eligible for removal, one of the following must be true:
* The URL must return a status code of either 404 or 410.
* The URL must be blocked by the site's robots.txt file.
* The URL must be blocked by a robots meta tag.
That way, the page's content will no longer be available for fetching or scraping by rogue bots and site scrapers, who could possibly find the URL through some unknown or forgotten link someplace - or else, it'll no longer be a valid dup if there are any unauthorized copies floating around someplace.
Done the right way very easily, with all bases covered.
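For illustration, the second and third options on that list might look like this (just a sketch; the path is hypothetical). In robots.txt:

  User-agent: *
  Disallow: /products/widget.html

And in the <head> of the page itself:

  <meta name="robots" content="noindex">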
Googlebot doesn't need to see a 404 or a 410 if the url is disallowed in robots.txt.
I am fairly sure that what Vanessa Fox was talking about there was simply the criteria for URLs that you submit to their fast-track URL Removal Tool.
For general web spidering, a URL blocked by robots.txt can hang around in the SERPs for years as a URL-only entry - especially if something somewhere still links to it.
Use the HTTP status codes to let everyone see the real status of the page. Don't hide it behind a spidering block.
She is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers, and outright content thieves who swipe entire pages and put their AdSense on them.
I have, and now I'll either have to redo and rewrite a couple of sites entirely, or ditch them altogether and let them sit parked someplace. Or maybe sell them; domains a few years old should be able to get me ten bucks or so, which is worth more than those sites are now.
No, she was talking about getting rid of URLs from the SERPs. Look up "vanessa fox 404" for details.
"For general web spidering, a URL blocked by robots.txt can hang around in the SERPs for years as a URL-only entry - especially if something somewhere still links to it."
Obviously.
"Don't hide it behind a spidering block."
No one suggested relying solely on a disallow directive.
Make it easier on Googlebot and install a robots disallow alongside the 404. Checking a URL path against a disallow directive is easier than checking response headers.
"Is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers and outright content thieves who swipe entire pages and put their Adsense on them."
How is that related to 404s or robots disallow? Attacking the credibility or authority of the messenger doesn't invalidate the message.
Again: if you have a robots.txt exclusion for a URL, then the bot never gets to "see" the 404. The robots.txt exclusion hides the real status of the page from the bot. That is not a good idea.
Let the bot see the 404. That 404 reply to the bot really does say that there is nothing there.
The robots.txt exclusion is especially weak. It merely says "do not access this URL". It says nothing about whether a page exists at that URL or not, nor should it.
The robots.txt exclusion still allows that "excluded" URL to show in the SERPs as a URL-only entry. If it continues to show, then it hasn't been removed.
A "404" HTTP status code returned for a URL categorically states that there is nothing there, and since there is nothing there, the URL gets completely removed far more quickly.
Don't hide the 404 status by using robots.txt as well. Use only the 404 page.
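If you want to confirm what a bot actually sees for a removed URL, check the response headers yourself (assuming curl is handy; the URL is just an example):

  curl -I http://www.example.com/products/widget.html
  # HTTP/1.1 404 Not Found  <- what the bot should see for a deleted page
  # (or "HTTP/1.1 410 Gone" if the server is configured to send that instead)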
"Is also unlikely to have had sites of her own that were unmercifully ravished by hijackers, scrapers and outright content thieves who swipe entire pages and put their Adsense on them."How is that related to 404s or robots disallow?
Halfdeck, it's already been explained in this thread how it's related. And no, the messenger likely hasn't had to deal with certain other issues and has no cause or obligation to address them. The messenger was addressing an issue from a certain perspective, but there are other considerations related to that issue which really concern no one but the webmaster who's being affected.
If a webmaster wants a page of theirs OUT of the index and wants it totally GONE and unavailable in any way, then 410 is the answer. g1 is 100% correct that robots.txt exclusion will hinder that - for well behaved bots. But if the page hasn't been physically removed, it'll still be accessible to those who don't honor robots.txt - or don't honor anything else, for that matter.
robots.txt exclusion is for excluding pages that are there from being spidered.
Halfdeck, it's already been explained in this thread how it's related.
My bad, I missed your previous message. It wasn't me; it was the beer.
A little over a week ago, I used both methods for two different sections of a site:
1) Physically remove a set of URLs and issue 404s. These are top level pages, frequently crawled. No internal links point to these pages now.
2) robots.txt disallow another set of URLs. These are also top level pages, about equal in PageRank to 404ed URLs, probably with similar crawl frequency. Some internal links still point to this set of URLs.
Now the 404 pages, all 16 of them, are gone from the index. The robots.txt disallowed pages are mostly URL-only.
I have a bunch of pages to remove (some still ranked, others not) as I want to reorganize my files. After 5 years, things get really messy.
I understand that I can just submit to Google the pages that are no longer valid. I don't know the link, though I have seen it before. Honestly, I don't know much about 410s or 404s. Can I just submit the pages that I want removed and be done with it? Will I be penalized in any way, or be at risk in my overall rankings (the rankings that matter - not from the pages I want to remove)? I really need help to do this right. As I am SEO challenged, I'd appreciate a step-by-step guide. Please help.
If the content has moved to a new URL, then set up a 301 redirect from the old URL to the new URL so that, wherever the old URL is still publicised, visitors to your site will still be able to get to that content.
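On Apache, for example, that can be as simple as this (a sketch, assuming mod_alias; the file names are placeholders):

  # Permanently redirect the old URL to its new location
  Redirect 301 /old-name.html http://www.example.com/new-name.html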
If the pages still exist, but you do not want them to be indexed, then add the <meta name="robots" content="noindex"> tag to each of the pages that you do not want to show up in the SERPs.
1) Pages to be removed completely. The pages are currently still on my site.
...so I should have the server send a 404 or 410 response, right? Which is better?
2) Pages that I want to rename, edit a little, and yet have the PR and traffic from the old page transferred to the new page name.
...use a 301 redirect, right? Should I remove the old file from the server immediately, or wait a little while? Also, will the PageRank and traffic picked up from the search engines eventually be transferred to the new page, or will it have to work its way up again?
3) Pages that I want to retire and remove, yet want all traffic transferred to some remaining old pages.
...what should I do for these ones?
Thanks a lot.
3) Isn't 100% clear. Are there inbound links to those pages?
2) I figure these pages that I want to reorganize are not very heavily visited, so I guess a 301 will be best, as suggested. OK, so I should remove them so they don't hang around.
3) I don't think there are inbound links, but I'd still like to transfer the traffic and PR these pages have to some existing pages.
A 404 response is just the default server behaviour if a file cannot be found which corresponds to the requested URL.
The problem with a 404 is that it could be your fault (Webmaster linking or scripting error), the server's fault (hardware/software error), or the client's fault (user/spider requesting invalid/corrupt URLs). That's why Google et al repeatedly request URLs that 404: They cannot tell if the 404 error condition is temporary or permanent because they do not know the cause. They can choose to re-request the 404'ed URL repeatedly forever, immediately dump it from their index, or choose some arbitrary time period or retry count after which they dump the URL.
But in two out of those three cases, you can suffer if they dump the URL before you can fix the problem; that page won't rank again for quite some time. Or if they won't dump it when you want them to, then you lose the bandwidth they consume re-requesting that obsolete URL again and again (I used to call this the "Inktomi syndrome").
Using a 410 to specifically say "This page has been intentionally removed" may be advantageous in the future, because SEs that support it properly can eliminate the unnecessary attempts to re-fetch URLs that don't exist any more.
Jim
What do you mean by "completely removed"?
* Do you want to remove the pages from your site AND from the SERPs?
If yes, then delete the file from the server and let the URL serve a 404 response - or delete the files and configure the server to return a 410 response.
* OR, do you just want to remove the pages from the SERPs but allow them to still exist on your site?
I get the feeling you want this latter option. If that is the case, then you need a <meta name="robots" content="noindex"> tag on each of those pages.