Google generating lots of 404's - "guess" spidering - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Google generating lots of 404's - "guess" spidering

caribguy

6:53 am on Mar 2, 2008 (gmt 0)

Google is "guess" spidering my dynamic site for non-existing, not linked to urls in the following format:

Existing urls:
http://www.example.com/eggs/ham
/eggs/over-easy
/eggs/benedict
/eggs/benedict/sausage
http://www.example.com/eggs - 301 redirect to http://www.example.com

"guessed" urls that generate 404's follow this pattern:
http://www.example.com/ham
/over-easy
/benedict
and even:
/sausage

I can somewhat follow the logic that since /eggs generates a "Moved permanently" Google might think that this would be true for pages that are in sub-directories.

How do I prevent this from happening?

[edited by: tedster at 12:34 pm (utc) on Mar. 2, 2008]
[edit reason] use example.com - it can never be owned [/edit]

tedster

12:42 pm on Mar 2, 2008 (gmt 0)

Google sometimes does do "test" spidering to look for common problems on a server. For instance, a website might be using a url rewriting scheme that is not technically sound and would allow any characters at all in one of the pseudo directories. Or it might be generating a 404 error page that does not actually return a 404 http status in the server header.

If your server response really returns a 404 http status (and not just an "error page" with a 200 status), then there's no issue I can see. Also, sometimes other sites feed bad urls into their links, either testing Google, or accidentally or even in an attempt to do something malicious. But again, returning a 404 status should be a sufficient guard against any trouble.

I'm not sure why this would be a concern. If you're certain that the bad urls are not coming from links on your own site, then there's no issue I can think of. There's no way I know of to stop any user agent from making any request, including googlebot. What you can do is make sure your server handles a bad request properly, and that's about it.

[edited by: tedster at 3:19 am (utc) on Mar. 3, 2008]

caribguy

2:51 pm on Mar 2, 2008 (gmt 0)

Thanks for the quick answer. I'm still puzzled though.


www.example.com 66.249.73.92 - - [01/Mar/2008:19:24:00 -0600] 
"GET /benedict/over-easy HTTP/1.1" 404 896 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.example.com 66.249.73.92 - - [01/Mar/2008:19:27:34 -0600] 
"GET /benedict/benedict HTTP/1.1" 404 884 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
www.example.com 66.249.73.92 - - [01/Mar/2008:19:29:04 -0600] 
"GET /sausage/toast HTTP/1.1" 404 893 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
www.example.com 66.249.73.92 - - [01/Mar/2008:19:30:45 -0600] 
"GET /over-easy/toast HTTP/1.1" 404 884 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
www.example.com 66.249.73.92 - - [01/Mar/2008:19:32:41 -0600] 
"GET /benedict/over-easy HTTP/1.1" 404 884 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Where the site always follows this pattern: /ingredient/type-of-preparation/side-dish

Google Webmaster Tools shows 106 URL's with 404 status, up to yesterday. Last night's crawl brought up approximately 350 new ones...

I'm baffled!

[edited by: tedster at 5:58 pm (utc) on Mar. 2, 2008]
[edit reason] fix side-scroll [/edit]

ecmedia

4:08 pm on Mar 2, 2008 (gmt 0)

Could your problem be the same issue being discussed here?
[webmasterworld.com...]

jdMorgan

4:11 pm on Mar 2, 2008 (gmt 0)

> http://www.example.com/eggs - 301 redirect to http://www.example.com

I agree that this might be the problem. If this were my site, I'd do this:
http://www.example.com/eggs (individual extension-less page URL) - 404-Not found
http://www.example.com/eggs/ (eggs subdirectory's index URL) - 403-Forbidden (unless this index page exists)

In both cases, you can provide a link on the custom error page back to the home page. But it's important to give the client the correct server response code, and to maintain the correct distinction between a "page URL" and a "directory URL" in order to avoid making SE bots "work too hard" and making them tired and confused. :)

This could also be an indicator of a problem in whatever mechanisms you use to generate static URLs on your pages, and then to rewrite those static URLs, when requested by a client, to the proper script for page generation. It's possible that if things are done in the wrong order, an intermediate-step URL may be getting "exposed" to HTTP clients. (Trying to keep the terminology general, since I don't know what specific mechanism you are using.)

So I'd say a thorough Y! search for strange incoming links, a link checker* run, and a thorough inspection of your server response headers might be in order -- and especially if you're doing user-agent or IP-based content delivery. *e.g. Xenu Link Sleuth

Also make sure that for any given "unique page" on your site, that there is one and only one URL that can be used to reach it. If this is not the case, then it may be very easy for one bad link somewhere/anywhere on the Web to put Googlebot onto the wrong track and start it spidering "odd" URLs. Not to mention the potential for duplicate-content issues (which is OT for this thread).

Jim

caribguy

5:43 pm on Mar 2, 2008 (gmt 0)

@ecmedia: agree that there are similarities, and read that whole thread before posting.

Main difference is that all forms on my site generate post requests, and I don't use query parameters to generate pages (there might in the future as content grows, but I intend to block for spiders and rewrite them). There is some ajax magic going on with query strings, but no problems there either.

@jdmorgan: will definitely try your suggested status messages, seems like the semantically correct thing to do. Will also run some requests through WireShark to see if any weird redirection takes place that I'm not aware of.

I use a Zope app server with page templates behind Apache. Will have to look into the distinction between "folders" and "content" - since Zope treats either type of object as a piece of content (possibly having children - subfolders) and allows me to run methods on the object by calling it on an URL (i.e. /eggs/benedict/sides ("sides" being a method) could give me a list of side dishes that can be served with eggs benedict). Adding or omitting a trailing slash does not change the content that is rendered.

The site has been public for less than a month, my server logs don't show any inbound links that should not be there. Nor any broken internal links. I did run the homepage and some subpages through the W3 link checker - looked good except for a "Method not allowed" response for HEAD requests on one of my document types (a templated presentation of data). Unfortunately I can't change that one, but I believe it's unrelated to the 404's. Will try your suggestion to cover all bases.

I have been running other websites with the same technology for years, and search engines have always been very kind - none of these were written to be SE optimized from the ground up, almost looks like it's biting me...

Thanks again for your help!

Norbert

(edited to clarify Zope/Apache)

jdMorgan

6:32 pm on Mar 2, 2008 (gmt 0)

> Adding or omitting a trailing slash does not change the content that is rendered.

That's a problem then, both from a semantics view and from a duplicate-content view. You can use Apache's mod_rewrite to force or remove a trailing slash using a 301-Moved Permanently redirect as required to canonicalize your URLs and to put the 'bots back on the right track.

The basic rule is: A trailing slash means it is a directory, which may or may not have a "directory index listing" enumerating its contents (or with the the semantic "extension" provided by the DirectoryIndex directive, a "home page" at that level), while no trailing slash means its definitely a discrete object -- a page, image, stylesheet script, or file, for lack of better terms. The two URL-forms are not interchangeable, and should not locate the same content.

It is true that search engines should be able to "figure it out," but as your problem demonstrates, they cannot always figure out the corner-cases and it is best not to rely on them to do so. Also, the more obscure the "deep back-end processing" required to "figure things out" gets, the more likely it is to have bugs or omissions, and the less priority will be given to its completeness and correctness. So best practice is to fully-control the "rules of engagement" on your site, and leave nothing to chance.

Jim

caribguy

12:54 am on Mar 4, 2008 (gmt 0)

http://www.example.com/eggs (individual extension-less page URL) - 404-Not found

I am moderately optimistic this did the trick. Will keep my eye out for any oddities: Google stopped requesting the 404 url's before I changed the redirect and has drastically lowered the number of overall requests. But: the pages it has been looking for today do all exist...

Thanks!