Rewriting URLs to tell search engines it's gone?

removing bad URLs from search engine index

         

mikesz

6:15 am on Apr 22, 2009 (gmt 0)

10+ Year Member



I was running a test on my forum site using a hack that claims that rewriting URLs to be more Search Engine friendly will help your PR etc.

I have removed the hack from my site and reconfigured it to run in pretty much the default mode. I didn't give any thought to the impact that might have on the search engine index, but found out soon enough that a couple of thousand 404 error messages resulted from that "minor" change, LOL.

I am thinking that I can use the following technique to tell the search engines that the URLs are gone (about 900 of them in total):

RewriteCond %{REQUEST_URI} ^/nogood.html$ [NC]
RewriteRule .* - [G,L]

Has anyone done this? I am thinking that it will remove them from the index immediately instead of waiting weeks or months for the search engine to drop the bad links.

Appreciate any feedback you might have on this or ideas to solve this problem.

TIA

WebWalla

8:10 am on Apr 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For Google at least, you can request immediate URL removal via Webmaster Tools. Probably the other engines have something similar.

mikesz

8:28 am on Apr 22, 2009 (gmt 0)

10+ Year Member



I know about that one, but it seems to take only single entries, not lists; I was thinking of something more automated. Google is listing just under 900 URLs that returned a 404 error.

Thanks for the reply. I appreciate it.

SteveWh

10:31 am on Apr 22, 2009 (gmt 0)

10+ Year Member



I didn't check to see if it got the pages deindexed immediately, but when I used a [G,L] rewrite the same as the one you posted, it did immediately stop Yahoo and Google (and probably the others, too) from spidering the page. It explicitly informs the requestor that the page is Gone and has no replacement.

Your purpose might be better served by creating 301 redirects from the old page names to the new ones [R=301,L]. That has a better chance (better than none, anyway) of crediting the new pages with the PageRank earned by the old ones. (I don't know; someone more knowledgeable can comment on whether that's a certainty.) But it might be more trouble than it's worth to do that for 900 pages unless they were very popular.

...except, having said that, my first attempt, before the [G,L], was to do a 301, and that did not stop search engines from crawling the original page URL. After waiting a couple of months and seeing that the page was still being crawled, I switched to [G], and it stopped them immediately.

[edited by: SteveWh at 10:38 am (utc) on April 22, 2009]

mikesz

10:59 am on Apr 22, 2009 (gmt 0)

10+ Year Member



Thanks, Steve, I appreciate the reply. The PageRank for this site has been 2 for a long time, and I installed the rewrite hack thinking that it would start to move in the right direction. It didn't, and, incredibly, with all the 404 craziness it hasn't gone in the wrong direction either, so I am thinking I have nothing to lose by making the 404s just gone. With a PR at this level I think it may not be an issue, and hopefully getting rid of the 404s will have a positive effect on it. The other odd thing is that a lot of my pages are ending up on the first page of specific keyword phrase searches even with a low PR, which I don't quite understand either.

SteveWh

1:08 pm on Apr 22, 2009 (gmt 0)

10+ Year Member



The PageRank measure is based on number of backlinks, and might be based only on that. It is only one factor of many that Google uses for assessing the value of pages, and it seems not to be an important factor in determining where pages rank in search results. Even if it ever was an important factor, my impression is that Google itself has lately been downplaying its importance. As you've discovered, low-PR pages can be at the top of SERPs.

jdMorgan

3:47 pm on Apr 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You don't need a separate RewriteCond. Just put the localized URL-path (i.e. no leading slash) into the RewriteRule itself:

RewriteRule ^nogood\.html$ - [G]

Note that [L] used with [G] is redundant, so I dropped the [L].
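
As a rough sketch of how this might sit in a .htaccess file in the site root: the first rule marks a single page as gone, and the second marks everything under a retired directory as gone. The directory name "old-section" is made up for illustration, and the sketch assumes mod_rewrite is enabled for the directory.

```apache
RewriteEngine On

# Single removed page: answer with 410 Gone
# (pattern is the URL-path without the leading slash)
RewriteRule ^nogood\.html$ - [G]

# Every URL under a retired directory (hypothetical name)
# No end anchor, so anything beginning with old-section/ matches
RewriteRule ^old-section/ - [G]
```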

For URLs which *do* have a direct or reasonably-relevant replacement, use a 301, by all means. Even though you mark URLs as "404-missing" or "410-Gone," the search engines will continue to ask for them as long as they find links to them on the Web (and quite a bit longer, too -- sometimes, for years), and you are throwing away all the backlinks to the old URLs and their associated PageRank/Link-popularity if you use a 4xx response.


RewriteRule ^nogood\.html$ http://www.example.com/good.html [R=301,L]

Also, if possible, do take advantage of similarities in the old URLs, so that you don't have to have one rule for each and every old URL. For example, to redirect the old URLs widgets-blue.html, widgets-green.html and widgets-<any-color>.html to new-widgets-blue.html, new-widgets-green.html, etc.:

RewriteRule ^widgets-([a-z]+)\.html$ http://www.example.com/new-widgets-$1.html [R=301,L]

Where you need to be more restrictive, for example, if widgets-red.html is not to be redirected, but only blue and green, use something like:

RewriteRule ^widgets-(blue|green)\.html$ http://www.example.com/new-widgets-$1.html [R=301,L]
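
Where the old and new URLs share no pattern at all, a lookup table is another option. The following is only a sketch using RewriteMap, which works in the main server or virtual-host configuration but not in .htaccess. The map name, file path and map entries are all made up for illustration.

```apache
# httpd.conf or <VirtualHost> (RewriteMap is not permitted in .htaccess)
RewriteEngine On
RewriteMap oldpages txt:/etc/apache2/oldpages.map

# oldpages.map (hypothetical) holds one "old-name new-name" pair per line:
#   about-us     company.html
#   price-list   catalogue.html

# Redirect only when the requested name has an entry in the map.
# In server context the URL-path keeps its leading slash,
# unlike the .htaccess patterns above.
RewriteCond ${oldpages:$1} ^(.+)$
RewriteRule ^/([^/]+)\.html$ http://www.example.com/%1 [R=301,L]
```

Requests with no entry in the map produce an empty lookup, so the condition fails and they fall through to whatever rules follow.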

Jim

mikesz

4:00 am on Apr 24, 2009 (gmt 0)

10+ Year Member



Thanks for the update Jim.

I just checked site: on Google and found about 500 links to my site, and most of them are 404 because they contain links like www.mysite.com/graphics-design/ when the actual URL is www.mysite.com/forumdisplay.php?f=16. Unfortunately, every one of them would need to be looked up to see what the real URL is so that a redirect can be constructed.

This search engine friendly URL rewriting seemed like a good idea at the time, but it's starting to look like it was a bad idea in light of the fact that I am reading now that search engines don't really care about "readable" URLs, and the advantage, if one exists, is in the "user friendly" plain English URL.

I don't see a way to "capture" all the broken URLs that are indexed now either, though I think it might be the same data that Google Webmaster Tools is reporting as 404.

So, the real question now is what real advantage do I get from doing manual redirects versus page gone?

TIA for any ideas about this (quick fix would be great, lol).

mikesz

1:32 am on Apr 27, 2009 (gmt 0)

10+ Year Member



Hello Jim,

The rewrite instruction isn't working for some reason.

Here is what I am doing per your suggestion. The URL that is in Google is:

[mysite.com...]

I added to my .htaccess file:

RewriteRule ^http://www.mysite.com/debating-room/$ [mysite.com...] [R=301,L]

Comes back with 404 like it wasn't even there.

Any ideas?

jdMorgan

1:57 am on Apr 27, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"You don't need a separate RewriteCond. Just put the localized URL-path (i.e. no leading slash) into the RewriteRule itself:"

I don't like to repeat myself, but that statement was very specific. Use a localized URL-path --not a protocol plus canonical URL-- in the RewriteRule pattern.


RewriteRule ^debating-room/$ http://www.example.com/forumdisplay.php?f=78 [R=301,L]
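
The RewriteRule pattern is only ever compared against the URL-path, so a scheme and hostname in the pattern can never match. If a rule really does need to depend on the hostname, a RewriteCond on %{HTTP_HOST} is the usual place for that test. A sketch, assuming the site answers on both example.com and www.example.com:

```apache
# Apply this redirect only to requests for these hostnames (assumed names)
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^debating-room/$ http://www.example.com/forumdisplay.php?f=78 [R=301,L]
```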

Jim

mikesz

2:35 am on Apr 27, 2009 (gmt 0)

10+ Year Member



Oops! I totally missed that one. Works great when you follow the instructions 8-) Sorry, should have paid more attention to the "localized URL-path" without the protocol plus canonical URL. That indeed was the problem.

Thank you very much

mikesz

3:53 am on Apr 28, 2009 (gmt 0)

10+ Year Member



Hello Jim,

Believe it or not, I am almost finished redirecting all the 404s on my site. One issue keeps showing up with Googlebot.

For example my redirect is:

RewriteRule ^free-mods/$ forumdisplay.php?f=57 [R=301,L]

But Googlebot is also trying to find free-mods without the trailing slash and gets a 404. I have just been adding a second rule each time, but should I be able to add the slash automatically if it's missing when Googlebot queries, or just continue adding another rule?

jdMorgan

4:21 pm on Apr 28, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't understand the question, but the simplest fix would be to put a question mark after the trailing slash in your RewriteRule pattern, thus making that trailing slash optional. Having done that, both variants will be redirected to your forumdisplay script.

Based on the previous discussion in this thread, I believe you could save yourself a lot of wasted time and frustration by spending some time with a print-out of this document [httpd.apache.org] and the regular-expressions tutorial cited in our Forum Charter. These are fundamental to proper implementation and trouble-free use of mod_rewrite rules. They are basic Webmaster kit.

Jim

g1smd

7:07 pm on Apr 28, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule ^free-mods/?$ forumdisplay.php?f=57 [R=301,L]

would redirect requests either with or without the trailing slash.

However, do check what happens when you request a non-www URL and what happens when you request a www URL.

Yep, you end up on the same sub-domain that you started on - unless you have another rule that then fixes that in a separate redirect.

Both 'not fixing it' and 'fixing it with a second redirect' are non-optimum.

When you redirect you need to include the full target domain name in the redirect:

RewriteRule ^free-mods/?$ http://www.example.com/forumdisplay.php?f=57 [R=301,L]
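
One way to arrange this so that no request needs two hops is to list the specific redirects first, each with the full canonical hostname, and put a general non-www to www rule after them. This is only a sketch for a .htaccess file in the site root, assuming www.example.com is the canonical hostname:

```apache
RewriteEngine On

# Specific old URL, with or without the trailing slash,
# sent straight to the canonical host in a single redirect
RewriteRule ^free-mods/?$ http://www.example.com/forumdisplay.php?f=57 [R=301,L]

# Catch-all for any other request that arrives on the non-www host
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Because the specific rule ends with [L] and already names the www host, a request for example.com/free-mods never reaches the catch-all and is redirected only once.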

mikesz

2:43 am on Apr 29, 2009 (gmt 0)

10+ Year Member



Thank you, Jim. I appreciate all your help and thanks for the pointers.

Hello G1smd, thank you as well, that is exactly what I needed to know, answered my question perfectly. I appreciate it.