Forum Moderators: phranque

I want to force a 404 error

Is redirecting to a non-existent page best practice?


taps

6:08 pm on Sep 27, 2005 (gmt 0)

10+ Year Member



I want to exclude certain pages from being crawled by msnbot. Since the URL pattern is too complicated, I cannot do that via robots.txt.

Normally you would use [F] to send a 403 Forbidden or [G] to send a 410 Gone to the bot. However, I think it is more natural to send a 404 error.

But as far as I know there is no flag to raise that error, so I'll try it this way, redirecting the request to a non-existent page:

RewriteCond %{REQUEST_URI} __s(_p[0-9]+)?\.html$
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteRule ^.*$ /nowhere.html

I do not feel very comfortable with that. I think the bot will receive a 302 first and only after that a 404.

What can I do?

jdMorgan

11:58 pm on Sep 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can 'compress' that code into two lines:

RewriteCond %{HTTP_USER_AGENT} msnbot/
RewriteRule __s(_p[0-9]+)?\.html$ /nowhere.html [L]

I added the [L] flag to save CPU time as well.

This code will not result in a 302 redirect response, because it does not do an external redirect; it does an internal rewrite. As such, it will return a 404 only, and is 'safe' as far as the server response code is concerned.
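As an aside: newer Apache 2.x builds of mod_rewrite accept non-redirect status codes on the R flag, so if your server supports that, you could return the 404 directly with no dummy page at all. A sketch, assuming such an Apache version:

RewriteCond %{HTTP_USER_AGENT} msnbot/
# With R=404 the substitution is discarded ('-'), rewriting stops, and a 404 is returned
RewriteRule __s(_p[0-9]+)?\.html$ - [R=404,L]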

However, I would suggest you make it a goal to re-arrange your page URLs as soon as possible so that you can use robots.txt to exclude msnbot -- for two reasons: First, who knows what msnbot will think of your site's quality when you return so many 404s. Second, anything more than a very small number of 'expected' 404s makes your server error log and stats hard to process, or almost useless.

Jim

taps

7:29 am on Sep 28, 2005 (gmt 0)

10+ Year Member



jd

Thanks a lot for your help. The pages mentioned are in fact internal rewrites to one PHP file, which look like this:

RewriteRule ^(.+)__s\.html$ /query.php?string=$1 [L]
RewriteRule ^(.+)__s_p([0-9]+)\.html$ /query.php?string=$1&intPageNum=$2 [L]

Regarding your suggestion: would it make sense to change those rewrites to

RewriteRule ^(.+)__s\.html$ /query.php?string=$1 [R=301,L]
[...]

After that, 'query.php' will be requested directly and can be excluded via robots.txt.
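A robots.txt rule along those lines (assuming query.php sits in the document root) might look like:

User-agent: msnbot
Disallow: /query.php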

Does this work?

jdMorgan

2:40 pm on Sep 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, but using a 30x redirect would result in the removal of all of your static URLs from the search engines' indexes, and destroy your rankings.

So, I can't recommend that at all.

The kind of re-architecting I'm talking about is to rename your static URLs to something like
/products/category/product.html or /products/category-product.html. This is in contrast to /images/category/product.gif or /images/category-product.gif. In other words, name the static pages in ways that allow easy control of robots, convenient cache-control header settings by directory name, etc.

It's hard to give good examples, since I'm not intimately familiar with your site, but the idea is that a well-laid-out hierarchical system makes it easy to implement robots and cache control, while still making sense to you and your visitors, and while keeping the URLs relatively short, e.g. "/prods/widgets/blue-widget.html".
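For instance, with a layout like that, a robots.txt exclusion or a per-directory cache-control setting becomes a couple of lines each. A sketch with hypothetical paths (the Header directive assumes mod_headers is loaded):

# robots.txt -- keep msnbot out of a generated-pages directory
User-agent: msnbot
Disallow: /search/

# httpd.conf -- cache static images for a week, set by directory name
<Directory "/var/www/site/images">
Header set Cache-Control "max-age=604800"
</Directory>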

Jim

taps

5:16 pm on Sep 28, 2005 (gmt 0)

10+ Year Member



Thanks again Jim,

In this case I really want to remove those URLs. I want my site to leave a smaller footprint in search engines. With the pages I'm going to remove now, my site produces a lot of redundant content, and I think these pages are one reason why my site doesn't rank well in MSN and Yahoo.

Restructuring my site in that directory manner is a great idea. I think I'll do that.