Forum Moderators: phranque
Normally you would use an [F] flag to send a 403 Forbidden, or a [G] flag to send a 410 Gone, to the bot. However, I think it is more natural to send a 404 error.
But AFAIK there is no flag to raise that error, so I'll try it this way, by redirecting the request to a non-existent page:
RewriteCond %{REQUEST_URI} __s(_p[0-9]+)?\.html$
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteRule ^.*$ /nowhere.html
I do not feel very comfortable with that, though. I think the bot will receive a 302 first, and after that it will receive a 404.
What can I do?
RewriteCond %{HTTP_USER_AGENT} msnbot/
RewriteRule __s(_p[0-9]+)?\.html$ /nowhere.html [L]
This code will not result in a 302 redirect response, because it does not do an external redirect, it does an internal rewrite. As such, it will return a 404 only, and is 'safe' as far as the server response code is concerned.
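As an aside -- and this is an assumption about your setup, since it only works on Apache 2.x -- mod_rewrite's R flag also accepts a status code outside the redirect range, in which case the substitution string is ignored and rewriting stops. That would let you answer the bot with a 404 directly, without needing a /nowhere.html page. A sketch:

```apache
# Sketch, assuming Apache 2.x (R with a non-3xx status code):
RewriteCond %{HTTP_USER_AGENT} msnbot
RewriteRule __s(_p[0-9]+)?\.html$ - [R=404,L]
```

The "-" means "no substitution"; with a non-3xx status the target would be dropped anyway.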
However, I would suggest you make it a goal to re-arrange your page URLs as soon as possible so that you can use robots.txt to exclude msnbot -- for two reasons: First, who knows what msnbot will think about your site's quality when you return so many 404s. Second, anything more than a very small number of 'expected' 404s makes your server error log and stats hard to process, or almost useless.
Jim
Thanks a lot for your help. The pages mentioned are in fact internally rewritten to one PHP file; the rules look like this:
RewriteRule ^(.+)__s\.html$ /query.php?string=$1 [L]
RewriteRule ^(.+)__s_p([0-9]+)\.html$ /query.php?string=$1&intPageNum=$2 [L]
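To illustrate with a made-up product name, and assuming these rules live in a .htaccess file (where the directory prefix, including the leading slash, is stripped before the pattern is matched), the two rules map requests like this:

```apache
# /blue-widget__s.html     ->  /query.php?string=blue-widget
# /blue-widget__s_p2.html  ->  /query.php?string=blue-widget&intPageNum=2
```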
Regarding your suggestion: would it make sense to change those rules to
RewriteRule ^(.+)__s\.html$ /query.php?string=$1 [R=301,L]
[...]
After that, 'query.php' will be requested directly and can be excluded via robots.txt.
Does this work?
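The robots.txt entry I have in mind would be something like this (relying on prefix matching, so it also covers the URLs with query strings):

```apache
User-agent: msnbot
Disallow: /query.php
```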
An external 301 redirect would send robots (and visitors) straight to the dynamic query.php URLs -- exposing exactly the URLs your static-looking ones were built to hide. So no, I can't recommend that at all.
The kind of re-architecting I'm talking about is to rename your static URLs to something like
/products/category/product.html or /products/category-product.html. This is in contrast to /images/category/product.gif or /images/category-product.gif. In other words, name the static pages in ways that allow for easy control of robots, convenient cache-control header settings by directory name, etc.
It's hard to give good examples, since I'm not intimately familiar with your site, but the idea is that a well-laid-out hierarchical system makes it easy to implement robots and cache control, while still making sense to you and your visitors, and while keeping the URLs relatively short, e.g. "/prods/widgets/blue-widget.html".
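For instance, if the generated search-result pages lived under a directory of their own -- /search/ here is purely a hypothetical name -- the robots exclusion becomes a one-liner that never needs touching again as pages come and go:

```apache
User-agent: msnbot
Disallow: /search/
```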
Jim
In this case I really do want to remove those URLs. I want my site to leave a smaller footprint in the search engines. The pages I'm about to remove produce a lot of redundant content, and I think they are one reason why my site doesn't rank well in MSN and Yahoo.
Restructuring my site in that directory manner is a great idea. I think I'll do that.