My biggest concern is that, because of the CMS I'm using and the way mod_rewrite is set up, a 302 is being returned rather than a 404.
Any suggestions as to what to look for as a cause?
To force a 404 in a general case, you can simply rewrite to a file that does not exist.
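As a rough sketch of that general case, assuming there is no catch-all rule later in the file that would pick the rewritten path back up, something like the following should do it. The target name /no-such-file.html is made up for illustration; any path that matches nothing on disk would work.

```apache
RewriteEngine On
# Send requests for the bogus URLs to a path that does not exist
# (hypothetical name), so the server answers with its normal 404 response.
RewriteRule ^/?bogus_file[1-4] /no-such-file.html [L]
```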
In the case of your site, how to force a 404 depends on exactly how you use your CMS. For example, if you use mod_rewrite to invoke a script to serve all content, then simply set up the mod_rewrite code to detect these bogus URLs, and exit from mod_rewrite before executing the rule that invokes your script.
The whole furball might look like this:
# Set variable "bogus_file" to "true" if incorrect URL for this site
RewriteCond %{REQUEST_URI} ^/bogus_file1 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file2 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file3 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file4
RewriteRule . - [E=bogus_file:true]
#
# Detect HTTP/1.1-compatible clients (those that send a Host header) and return 410 response for bogus files
RewriteCond %{HTTP_HOST} .
RewriteCond %{ENV:bogus_file} ^true$
RewriteRule . - [G]
#
# Else quit mod_rewrite, and let them go 404 before invoking script to generate page.
RewriteCond %{ENV:bogus_file} ^true$
RewriteRule . - [L]
#
# Rewrite all other requests to script for CMS page generation
# (I assume that you have something similar to this example)
RewriteCond %{REQUEST_URI} !^/script\.php$
RewriteRule (.*) /script.php?page=$1 [L]
Jim
Now that that's working, I still am curious as to why Slurp was pulling these pages to begin with. I understand that Slurp pulls random pages occasionally just to see how the server handles 404's, but why files from another domain on the same server?
jd - You mention that the ErrorDocument could be misconfigured. What should I look for there? I also have 3 domains 301'd to my main domain if that could make a difference. I don't ever link to those 3 domains, they're only used over the phone, so I don't see an issue with search engine stuff there.
My problem now is that query strings are being appended to the root - /?D=A /?D=N and /?N=N - and one of them is the highest rank in Yahoo for one of my main keyword sets. Because these variables are unused by the index script, it just shows the main page. It's not a problem to visitors, but I fear that having the same page indexed with differing query strings may ultimately cause ranking problems.
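One possible way to collapse those variants, offered only as a sketch: redirect the root URL to itself without the query string whenever one of those three unused parameters shows up. This assumes www.example.com stands in for the real hostname, that the rule sits ahead of the rule that invokes the CMS script, and that nothing on the site actually depends on D or N.

```apache
# Match only the three stray query strings reported above: D=A, D=N, N=N.
RewriteCond %{QUERY_STRING} ^(D=[AN]|N=N)$
# The trailing "?" on the target drops the query string from the redirect.
# www.example.com is a placeholder hostname.
RewriteRule ^/?$ http://www.example.com/? [R=301,L]
```

A 301 here tells the search engines that the plain root URL is the one to keep, rather than leaving three copies of the same page in the index.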
Incorrect (produces 302 redirect):
ErrorDocument 404 http://www.yourdomain.com/404page.html
Correct (serves the error page with a true 404 status):
ErrorDocument 404 /404page.html
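Since the rules earlier in the thread answer with 410 via [G], the same local-path form would presumably be wanted for that status as well. A minimal sketch, where /410page.html is a made-up filename:

```apache
# Local paths keep the original status code; a full URL would turn
# the response into a redirect.
ErrorDocument 404 /404page.html
# Hypothetical page for the 410 Gone responses produced by the [G] flag.
ErrorDocument 410 /410page.html
```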