Forum Moderators: phranque

Yahoo - Slurp pulling from wrong host

         

Nutter

2:44 am on Jun 8, 2005 (gmt 0)

10+ Year Member



I have several domains on a VPS account. Looking back through my log files, I see the Yahoo Slurp agent pulling files that don't exist on one of the domains. The problem is, they do exist on one of the other domains on that server. It appears to be just the Yahoo bot, and it doesn't seem to affect any of the other domains.

My biggest concern is that, because of the CMS I'm using and the way mod_rewrite is set up, a 302 is being returned rather than a 404.

Any suggestions as to what to look for as a cause?

jdMorgan

3:04 am on Jun 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd do a search on Yahoo for links to the invalid URLs as the cause.

> a 302 is being returned rather than a 404

and this will perpetuate the problem... :(

Jim

Nutter

3:12 am on Jun 8, 2005 (gmt 0)

10+ Year Member



Will RewriteRule ^/the/file/being/requested$ - [L,G] as one of the first rules take care of it, or is there a better way to send a 404? I'm pretty sure the 'G' sends a 410, which should be about the same.
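[Editor's note: a minimal sketch of that kind of rule, with a placeholder path. RewriteRule always takes a substitution argument ("-" when nothing is rewritten), and in per-directory .htaccess context the leading slash is stripped from the tested path:]

```apache
# Sketch: answer a known-bogus URL with a 410 Gone response
# (placeholder path -- substitute the actual file being requested)
RewriteRule ^the/file/being/requested$ - [G]
```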

jdMorgan

3:46 am on Jun 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[L] used with [G] is redundant, and you should only send 410 to HTTP/1.1 clients.

To force a 404 in the general case, you can simply rewrite to a file that does not exist.

In the case of your site, how to force a 404 depends on exactly how you use your CMS. For example, if you use mod_rewrite to invoke a script to serve all content, then simply set up the mod_rewrite code to detect these bogus URLs, and exit from mod_rewrite before executing the rule that invokes your script.

The whole furball might look like this:


# Set variable "bogus_file" to "true" if incorrect URL for this site
RewriteCond %{REQUEST_URI} ^/bogus_file1 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file2 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file3 [OR]
RewriteCond %{REQUEST_URI} ^/bogus_file4 [OR]
RewriteRule . - [E=bogus_file:true]
#
# Detect HTTP/1.1-compatible clients (HTTP/1.1 requires a Host: header)
# and return a 410 response for bogus files
RewriteCond %{HTTP_HOST} .
RewriteCond %{ENV:bogus_file} ^true$
RewriteRule . - [G]
#
# Otherwise (HTTP/1.0 client), quit mod_rewrite here so the request
# goes 404 before the page-generation script is invoked
RewriteCond %{ENV:bogus_file} ^true$
RewriteRule . - [L]
#
# Rewrite all other requests to script for CMS page generation
# (I assume that you have something similar to this example)
RewriteCond %{REQUEST_URI} !^/script\.php$
RewriteRule (.*) /script.php?page=$1 [L]

This won't work if you are using a 404 ErrorDocument to invoke the page-generation script, or if you use the same script to serve 404 responses that you use to serve pages.

Jim

Nutter

3:57 am on Jun 8, 2005 (gmt 0)

10+ Year Member



jd, I appreciate the help with rewrites, but that would be a band-aid-type solution. I would really like to know what I could have set wrong that has Slurp requesting these files. I've emailed Yahoo hoping they can tell me, but I expect it'll take several days to hear back.

Nutter

3:13 am on Jun 10, 2005 (gmt 0)

10+ Year Member



Well, I had to go a little deeper to get it to work the way I wanted. I went into the PHP for the CMS and found where it sent the 302 for pages it can't find. I replaced the Location: header with a 404 status header and then include()'ed the error page that should be showing. So robots get the 404 they need (or, more accurately, the 404 that I need them to get) and users get the nice "File not found, but here's some more stuff" page.
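[Editor's note: a hypothetical sketch of that kind of change, not the actual CMS code; the error-page file name and the commented-out "before" line are placeholders:]

```php
<?php
// Before: the CMS redirected unknown pages, which produced a 302:
//   header('Location: /index.php');

// After: send a real 404 status, then show the friendly error page.
// Note: header() must be called before any page output is sent.
header('HTTP/1.0 404 Not Found');
include 'error_page.php';  // visitors still see the helpful "not found" page
exit;
```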

Now that that's working, I'm still curious why Slurp was pulling these pages to begin with. I understand that Slurp occasionally requests random pages just to see how the server handles 404s, but why files from another domain on the same server?

jdMorgan

3:30 pm on Jun 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I can't guess why Slurp confused the domains. It could be some sort of test (the paranoid might say it's part of a check for cross-linked sites), possibly a 404-response check; it might have been an error; or it may be that somehow, somewhere, you've "exposed" the structure of your server. This sometimes happens if custom ErrorDocuments are misconfigured, and obviously can happen if there are errors in a shared dynamic-page-generation script.

Jim

Nutter

1:22 pm on Jul 6, 2005 (gmt 0)

10+ Year Member



I hate to be the guy to bump an old thread, but this keeps getting worse.

jd - You mention that the ErrorDocument could be misconfigured. What should I look for there? I also have three domains 301'd to my main domain, if that could make a difference. I never link to those three domains - they're only used over the phone - so I don't see a search-engine issue there.

My problem now is that query strings are being appended to the root - /?D=A, /?D=N, and /?N=N - and one of them is the highest-ranked result in Yahoo for one of my main keyword sets. Because the index script ignores these variables, it just shows the main page. That's not a problem for visitors, but I fear that having the same page indexed under differing query strings may ultimately cause ranking problems.
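[Editor's note: one possible way to fold those variants back into the canonical URL - a sketch only, not from this thread, and it assumes the home page never legitimately takes a query string:]

```apache
# Sketch: 301-redirect any query-string variant of the home page back to /
# The trailing "?" in the substitution discards the original query string.
RewriteCond %{QUERY_STRING} .
RewriteRule ^$ /? [R=301,L]
```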

jdMorgan

6:33 pm on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is critical that you use only local URL-paths in ErrorDocument directives. If you use a full canonical URL, then the server will produce a 302 redirect response:

Incorrect (produces 302 redirect):


ErrorDocument 404 http://www.yourdomain.com/404page.html

Correct (produces 404 response):

ErrorDocument 404 /404page.html

Jim