This is what I did to fix it:
In my httpd.conf, I used a RewriteRule to 301 the pages into [mysite.com...]
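Roughly, the rule looked like this (exact pattern trimmed, but the target is the 404.html you'll see in the headers below):

# Send a 301 for every matching request, pointing at /404.html.
RewriteRule (.*) http://www.mysite.com/404.html [R=301]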
It works. When I use Webmaster Tools - Fetch as Googlebot, I get this:
HTTP/1.1 301 Moved Permanently
Date: Sun, 18 Oct 2009 15:03:25 GMT
Server: Apache/2.0.52 (CentOS)
Location: [mysite.com...]
Content-Length: 322
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="http://www.mysite.com/404.html">here</a>.</p>
<hr>
<address>Apache/2.0.52 (CentOS) Server at www.mysite.com Port 80</address>
</body></html>
+++++++++++++++++++++++++++++
Of course, 404.html doesn't actually exist, so when you try THAT URL, you get:
HTTP/1.1 404 Not Found
Date: Tue, 20 Oct 2009 14:37:51 GMT
Server: Apache/2.0.52 (CentOS)
Content-Length: 288
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /404.html was not found on this server.</p>
<hr>
<address>Apache/2.0.52 (CentOS) Server at www.mysite.com Port 80</address>
</body></html>
+++++++++++++++++++++++++++++++++++++++++++++++++
So, in the SERPs, I had just under 5000 indexed pages showing my home page content. Now it appears that some of these URLs are losing their titles and descriptions in the SERPs, but they are still listed, with just the URL and a "Similar" link underneath.
+++++++++++++++++++++++++++++++++++++++++++++++++
I am aware of the URL Removal Request form. However, I am nervous about using it. I theorized that I need to make the fix, wait for Google to crawl the fix, and let all of these pages 404. I was hoping to see this occurring by watching the number of affected URLs start going down from 5000. However, the number of affected URLs is still 5000; they are simply losing their titles/descs. I assume this is good from the standpoint of getting all the duplicates of my home page out of the SERPs, but I still have all of these dead, empty URLs in the SERPs.
So, this has caused me to lose my 100% confidence in my solution. Am I dealing with this correctly?
1. Do I simply wait for google to update all of these URLs, like I am doing?
2. If so, when all 5000 are updated and none display my home page title/desc, I assume at that point I could do some housekeeping and use the URL Removal Request form and delete the now-empty-URLs once and for all? (I would at that point implement nofollow/noindex to make sure they never got back in. I can't do that yet because they won't get my 301->404 fix if I block googlebot from getting to them...)
3. Am I barking up the wrong tree? Am I solving this the wrong way, or in an unnecessary fashion?
4. Do I need to wait for the URLs to all 301 into the 404? Can I just do the URL Removal Request now and delete them all right now, regardless of whether they are showing my home page content or not? Would a URL Removal Request for the 5000 URLs right now solve my dupe content problem in one quick move? Or would that be dangerous?
++++++++++++++
Long post, sorry. I've read all the threads I can, and my fix is the culmination of much of what I've read here. I think I have this fixed, my headers look right, and the SERPs are updating, but while I sit here and wait I just want a second opinion that I am doing this right.
Thank you!
The titles and descriptions disappearing means the URLs have been shoved into one of the Supplemental Results databases and those URLs will appear as URL-only entries in the SERPs for some time to come.
I would not externally redirect these URLs to a non-existent URL. That's a two-step action for the bot to try to follow.
On the first visit it finds a 301, and it makes a note of the destination URL. Later, during the next indexing cycle, it visits the new URL, and now finds it is 404. Later on, it has to go back and make a note in the index that the redirected URL points to a 404 location.
The redirect is slowing the bot down in correctly dealing with these URLs.
What you should do here is either directly 301 redirect to the correct new content (but beware to NOT funnel multiple old URLs to one new URL) to retain the traffic -OR- directly return the 404 response for any and all URLs that no longer exist. The 404 page should give helpful links pointing to the new URLs for the content.
ErrorDocument 404 /errors/error404.html <-- You need to create this file.
RewriteRule pattern /this-path-does-not-exist [L]
This uses an internal rewrite, not an external redirect - and that difference is crucial as to what happens.
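Pulled together, a minimal httpd.conf sketch might look like this - the ^/wishlist/ pattern is just an invented placeholder for whatever your dead URLs actually match:

# The visible error page for every 404 on the site.
# This file must actually exist, or you'll trigger a secondary error.
ErrorDocument 404 /errors/error404.html

RewriteEngine On
# Internal rewrite (no [R] flag): the bot still sees the ORIGINAL URL,
# but the server can't find this target path, so it directly returns
# 404 along with the ErrorDocument content.
RewriteRule ^/wishlist/ /this-path-does-not-exist [L]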
First, thank you for responding!
These pages are all pages which would add a product to the user's wishlist, so there really isn't a page to 301 to. I need to kill them, then bar Google from getting these URLs again in the future.
I am currently doing a RewriteRule in my httpd.conf, but to a 404.html, like this:
RewriteRule (.*) [mysite.com...] [R=301]
Sorry, I am unsure what you mean by "/this-path-does-not-exist [L]"
I think this is better!
Now my 1st header is this:
HTTP/1.1 404 Not Found
Date: Tue, 20 Oct 2009 16:46:23 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Tue, 20 Oct 2009 16:46:23 GMT
ETag: W/"8a0db8-122-bd514300"
Accept-Ranges: bytes
Content-Length: 290
Connection: close
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>404 Error</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<h1>404 Error</h1>
<p>This page no longer exists.</p>
</body>
</html>
The titles and descriptions disappearing means the URLs have been shoved into one of the Supplemental Results databases and those URLs will appear as URL-only entries in the SERPs for some time to come.
Just relax? Resist the URL Removal Request temptation? Will my (read: your) new proper fix now get those URLs out of the supplemental results too? I only had my old incomplete fix in place for 2.5 days.
P.S.
(but beware to NOT funnel multiple old URLs to one new URL)
This is precisely what I did to get myself into this jam.
What I was earlier referring to, was to not funnel multiple URLs to one still-valid URL as redirects. Google might see that as 'dodgy'.
What you have now is correct. When asking for URL 'X', the server directly returns '404'. This will see Google fix up their mis-indexing at a faster rate than previously. No need for any 'removal tool'.
Do make sure that the error page that is shown to the user contains a friendly error message explaining what has happened.
I assume that you funneled them as rewrites to a real content file, and therefore all the URLs directly returned '200 OK' - and that's what got you in the mess.
Exactly. Total disaster.
What I was earlier referring to, was to not funnel multiple URLs to one still-valid URL as redirects. Google might see that as 'dodgy'.
Ah, I see now your distinction between rewrite and redirect. I must confess until now I used these two terms interchangeably. No, I was doing rewrites, not redirects.
What you have now is correct. When asking for URL 'X', the server directly returns '404'. This will see Google fix up their mis-indexing at a faster rate than previously. No need for any 'removal tool'.
10-4. I will leave everything as it is now, and be patient.
Do make sure that the error page that is shown to the user contains a friendly error message explaining what has happened.
You've said this a couple of times now, and I sense you mean it more than just in passing. Why is this so critical? I have a simple link in there now back to my home page, nothing fancy. Your mentioning this more than once leads me to believe I am missing the severity of what you're trying to point out. These particular pages were just "Add to your wish list" pages that took the user to an acknowledgement page in their account - useless for SERPs. The truth is, they still exist for the user, but I am rewriting the URL to the 404 if the user-agent is Googlebot. Sooo... theoretically, only bots will hit the 404, so it shouldn't be too critical what is on my 404 page, no?
"I'm sorry but that red rotating widget is out of stock, maybe you'd like to browse our selection of [left-handed green gadgets] instead".
:)
Your last post worries me a lot. You should NOT be rewriting to a 404 page. If you do, that URL request will return '200 OK' because the file will be found on the server.
Maybe I misunderstand your terminology. On that note the difference between a redirect and a rewrite is massive, even if the syntax changes are very small.
Especially don't do "something special for Google".
The fix you already implemented is correct: rewrite to a non-existent internal path, and that automatically causes the server to respond with a '404 header' and the contents of the file matching that defined by the server's ErrorDocument directive.
That is what is required, both for users and for Google.
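To make the syntax difference concrete, here are the two forms side by side (made-up /wishlist/ pattern and target, purely for illustration):

# External redirect: the client is sent a 301 plus a new URL to go fetch.
RewriteRule ^/wishlist/ http://www.mysite.com/some-new-page [R=301,L]

# Internal rewrite: the server silently serves a different path for the
# same URL; this path doesn't exist, so the response is a direct 404.
RewriteRule ^/wishlist/ /this-path-does-not-exist [L]

The visible difference is one flag and the form of the target, but the resulting HTTP conversation is completely different.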
I now understand that there's nothing much that users need to be told when hitting this URL, so no worries about making the 404 error message more useful.
You should NOT be rewriting to a 404 page. If you do, that URL request will return '200 OK' because the file will be found on the server.
Sorry, I was referring to this:
RewriteRule (.*) /this-path-does-not-exist [L]
It rewrites the URL so it returns a 404 code in the header. For sure, I am not doing a redirect.
That is what is required, both for users and for Google.
Well, I can't do it for users, because it is their Add to WishList page. My understanding is that going forward, I can use nofollow/noindex. What I am doing right now is just a stop-gap solution to get these URLs out of the SERPs. That's why I was kicking the tires on URL Removal Request, because I wish I could just do it all in a few clicks and be done with it. ; )
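For that future noindex step, I'm assuming something like this in httpd.conf would do it (mod_headers required, and the /wishlist/ path is just my placeholder):

# Tell robots never to index these URLs or follow links on them.
<LocationMatch "^/wishlist/">
    Header set X-Robots-Tag "noindex, nofollow"
</LocationMatch>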
+++++++++++++++++++++++
-sigh- I have another problem and I want to ask you because it is related and it may help me better understand the nuances of all of this.
I have another smaller batch of bad URLs Google got its hands on that I need to fix. They show up in my WMT as Crawl Errors - Not Followed, so not critical, but I want to clean up. They are a collection of bad rewrites from the past, and the result is a batch of URLs that look like this:
[mysite.com...]
[mysite.com...]
[mysite.com...]
[mysite.com...]
[mysite.com...]
[mysite.com...]
How do I fix these? 301 rewrite into the correct page, or RewriteRule pattern /this-path-does-not-exist [L]?
The real URL for all of the above is:
[mysite.com...]
From all you've said above, I am scared of 301ing them into
[mysite.com...]
But perhaps, because they've not been followed yet and they're just bad URLs, a 301 into the right URL might be fine. Note that this is not bot-specific code in my httpd.conf; it is for all users and Google.
If those URLs get any traffic, then redirect the requests to preserve it.
If they do not, then fail them with a 404 response.
Do make sure that NOTHING on your site links to those URLs.
If those URLs get any traffic, then redirect the requests to preserve it.
They don't. But so I can put this all into perspective and learn: if they did, how would I do the redirect without running into a problem with this:
What I was earlier referring to, was to not funnel multiple URLs to one still-valid URL as redirects. Google might see that as 'dodgy'.
That is the wrong move. All the URLs should serve a 404 header, and the visible error page content should explain what happened and provide a selection of links to click.
For a site where products go out of stock, the single (or a small number up to, say, a few dozen) URL for the old product should redirect to the single URL for the new product: old product A452 redirects to B563, and old product A274 redirects to F612 and so on.
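In mod_rewrite terms, that's just a short list of one-to-one external redirects, along these lines (product paths invented for illustration):

# Each discontinued product redirects to exactly one replacement.
RewriteRule ^/products/A452$ /products/B563 [R=301,L]
RewriteRule ^/products/A274$ /products/F612 [R=301,L]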
There's nothing worse than having you go off with half an understanding and then make the wrong move based on 'information you got from a forum'.
This stuff isn't trivial, or easy, and a minor mistake can trash your rankings, traffic and earnings, with few clues as to what the real problem is.
Correct terminology is key. Having a clear definition of exactly what you want to do (defined in terms of both external URLs and internal filepaths) before you start any coding at all is also a very good move.
More than once in this forum we have almost provided "the right answer to the wrong question", where people have only explained part of what they were trying to do and then later once the full picture emerged, sometimes what they wanted to do was entirely the wrong thing.
With this stuff the devil really is in the details. :)
So, it is now a few days later, and I just got an update in my WMT. I now have 3600 URLs listed as "Not Found" under Crawl Errors, with "404 (Not Found)" as the detail message. These are indeed all the ones I need to get out of the index. So, that tells me that my directives are all working.
However, I can still see them all listed in the SERPs when I do a site:mysite.com.
I would expect them to disappear from site:mysite.com, no?
And since these are all duplicates of my home page, and this is critically important to fix, what do I do now? Still wait for google to remove them from the SERPs, as evidenced with site:mysite.com?
First, the 'bots see the 404s. Then after a while, the 404s get displayed in GWMT. After another delay, the index is re-calculated and some of the 404ed URLs will be declared as 'disappeared'. Finally, the new index is pushed out to servers, and over the next few days starts showing up on Google servers world-wide. Then you see it.
Google search results show up in less than one second. Google updates do not.
Unfortunately, Google treats 410-Gone responses and 404-Not Found as identical. They don't trust 404 responses fully though, and appear to want to check many times that the resource is gone. OK, it was gone, is it still gone? Yup, still gone, but how about now? Yup, still gone... How about now? They may check your 404 URLs for several years...
If they treated 410-Gone as intended by the HTTP spec, then maybe they wouldn't have to check many times before deciding a resource is really, really gone. Unfortunately they don't, and the official meaning of 404 is simply that the resource was not found; it doesn't mean the resource has been removed, and it doesn't say why the resource wasn't found or for how long it might be gone. So you can understand that they think a 404ed resource might 'come back' and want to re-check many times.
Treating a 410-Gone as a "410-Gone and gone for good" would be most helpful in your situation. But alas, 'tis not to be...
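For completeness: if you ever did want to send '410 Gone' instead, mod_rewrite makes it a one-flag change - the [G] flag (same invented /wishlist/ pattern as earlier in the thread):

# Force a '410 Gone' response instead of a 404 for the dead URLs.
# The '-' means 'no substitution'; the [G] flag sets the status.
RewriteRule ^/wishlist/ - [G]

But given that Google currently treats 410 and 404 the same, it buys you nothing today.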
Jim
So, I gotta ask, can I short circuit the whole process (here he goes again!) and do a URL Removal Request on these URLs?
Or should I block the URLs in robots.txt again? Or Nofollow? Or a Noindex?
(My #1 priority was to get my home page content off of these pages so I could fix a dupe content issue and get my home page back, so at least I hope I am on my way to solving that. I guess having the 404ed empty URLs in the SERPs won't hurt me as much. I hope. I'd really like to get my rankings back... ; ))
I just want to make sure I have done all I can... ; )
Google doesn't react too well to URLs that keep on changing their status. You've been 200 then 301 or 302 then 404 in the last few weeks.
Absolutely do NOT add these URLs to the robots.txt file. You want Google to request them and be served a 404 response for each one.
No need for the removal tool. Google is well on the way to fixing this now that they have the data that the URLs are 404.
Actually, it's interesting to watch stuff being de-indexed. There's several stages to the process, and a few things to be learned as you watch how they do it.
Over the years you guys have always been there for me. Thank you. It's Friday night, I'm stressin', I didn't expect anyone to be around, but in minutes you have both replied and helped me out. Thank you.
OK, I will leave it all as is! It is a relief to have confirmation that I am on the right track. I'll have to decide if it is going to be fun to watch the rest of the process, or too nerve-wracking to watch the rest of the process. ; )
Cheers!
In the case of the URL Removal Tool, I avoid it because of the many threads I've seen here that say, "Help, I just made one typo and removed my whole site! -- What can I do?" And the answers that vary from "Cry" to "Jump out the window", etc. The useful ones say things like "Well, your site may return in a few months. Go fix the errors and then take half a year off."
Again, forget about this for a couple of weeks. If you cannot forget about it, then look into another line of work, because this one is gonna kill you by hypertension... :o (I'm not kidding; I've had a job like that, and was grateful to get out before it killed me.)
Oh, and it could be worse... Sometimes 301 redirects take nine months to take full effect in the SERPs... :)
Jim
So, I'd like to pose 2 scenarios, and ask for comments:
1. If someone were to have the above problem with lots of URLs that they need to delete, would all of the above advice still apply as-is, except now using a 410 instead of a 404?
2. For anyone who had the above problem and has tried to fix it with a 404, should they leave it as is now, or should they change the 404s into 410s? ; )
If the URL can be replaced by another very relevant one, then a 301 would be best. I replied in more detail in the thread describing Google's new approach to 410 responses [webmasterworld.com].
Jim