Yahoo slurp poorly coded

notorious index.html crawling

         

SEOPTI

11:37 pm on Jul 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Slurp always tries to get the index.html of a directory:

/folder/folder/ HTTP/1.0" 404 1049 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

It thinks that if there is a file in a directory, there must be an index.html - guess what it gets, a 404.

encyclo

3:16 am on Jul 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know about it being poorly coded; it could well be a programming decision to try to work backwards through the URL structure to search for further, unlinked content. Often you'll get at least an Apache-generated directory listing, from which Slurp could access other pages.

In itself, I don't see it as an illegitimate spidering technique, although there would be a risk of exposing otherwise unlinked items, perhaps against the better wishes of an inexperienced webmaster.
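
To make the "work backwards through the URL structure" idea concrete, here is a minimal Python sketch of how a crawler might derive parent directory URLs from a page it already knows. This is only an illustration of the technique being speculated about, not Slurp's actual code; the function name parent_directory_urls and the example.com host are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directory_urls(url):
    """Return the directory URLs above `url`, deepest first, ending at the root.

    Hypothetical helper: a guess at how a crawler could walk back up a path,
    not a description of any real spider's implementation.
    """
    parts = urlsplit(url)
    # Non-empty path segments, e.g. "/folder/folder/page.html" -> 3 segments.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return []  # already at the site root, nothing above it

    # Drop the last segment (the file, or the directory itself).
    segments = segments[:-1]

    candidates = []
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth]) + ("/" if depth else "")
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


if __name__ == "__main__":
    for candidate in parent_directory_urls("http://example.com/folder/folder/page.html"):
        print(candidate)
```

Each URL produced this way is a bare directory request, which is the kind of request shown in the log line in the first post.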

Gibble

3:25 am on Jul 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not, by chance, doing it because that's set as the default page in the web server settings, is it?

jdMorgan

3:46 am on Jul 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think the whole "unlinked content" issue makes it an illegitimate technique... Not to mention an annoyance. If I want it spidered, I'll link to it, thanks.

In .htaccess on Apache

 Options -Indexes 

Produces a 403-Forbidden response for any directory request where there is no index document as defined by DirectoryIndex.
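
As a rough way to reason about the outcome of a bare directory request, the behaviour can be modelled in a few lines of Python. This is a simplified sketch, not Apache's real logic: the function name directory_response and its parameters are invented for illustration, and it ignores rewrites, custom error documents, and access rules.

```python
def directory_response(files, indexes_enabled, directory_index=("index.html",)):
    """Simplified model of what a request for a directory URL returns.

    files            -- names of the files present in the directory
    indexes_enabled  -- True for "Options +Indexes", False for "Options -Indexes"
    directory_index  -- candidate index documents, as listed in DirectoryIndex
    """
    # An index document, if present, is served first.
    for name in directory_index:
        if name in files:
            return 200, f"serve {name}"
    # No index document: a listing is generated only if Indexes is on.
    if indexes_enabled:
        return 200, "generated directory listing"
    return 403, "Forbidden"


if __name__ == "__main__":
    print(directory_response({"page.html"}, indexes_enabled=True))
    print(directory_response({"page.html"}, indexes_enabled=False))
```

With Indexes off and no index document, the spider gets nothing it can follow to unlinked files.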

Jim