Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Y! and robots.txt
Partial ignore
Staffa
msg:3642811, 1:23 pm on May 6, 2008 (gmt 0)

My robots.txt file is more than a year old and has always been respected.
Lately I added some *.swf files to a disallowed directory and Y! goes after them. This directory also holds *.jpg and *.gif files, which it ignores.

What does Y! not understand in:

User-agent: *
Disallow: /somedirectory/

Anyone else seen this before?
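For what it's worth, the record as posted should block *.swf along with everything else in the directory. A quick check with Python's standard-library robots.txt parser (only indicative, since Slurp uses its own parser, and the file names here are invented for the example) agrees:

```python
# Sanity-check the quoted robots.txt record with Python's stdlib parser.
# Note: this is not Slurp's parser, so it is only indicative.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /somedirectory/",
])

# The record blocks every file type under the directory, *.swf included.
print(rp.can_fetch("Slurp", "/somedirectory/movie.swf"))  # False
print(rp.can_fetch("Slurp", "/somedirectory/photo.jpg"))  # False
print(rp.can_fetch("Slurp", "/other/page.html"))          # True
```

So by a plain reading of the standard, Y! fetching those *.swf files would be a violation of the record, not a syntax problem.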

 

bilalseo
msg:3642836, 1:52 pm on May 6, 2008 (gmt 0)

The code is fine, so what's wrong with Y!? Hey Staffa, I've been caught in the same situation and found zilch after experimenting with my robots.txt file. These days the crawlers make no sense when reading a robots.txt...

jdMorgan
msg:3642854, 2:09 pm on May 6, 2008 (gmt 0)

Slurp, like all robots, will obey the first robots.txt record that applies to its User-agent string. So be sure that you don't have any other records specific to Slurp, because if you do, it will honor only the first one.

Also, make sure the syntax of your file is 100% correct: all comments on separate lines starting with "#", and one and only one blank line after each record (including the last one).
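For illustration, a file following those rules might look like this (the Slurp record and the /private/ path are invented for the example):

```
# Comments belong on their own lines, starting with "#".
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow: /somedirectory/
```

Note the single blank line separating the records; per the advice above, one blank line should also follow the final record.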

Also, it's possible that Slurp has not yet processed your new robots.txt file. I prefer to post a new robots.txt file at least 24 hours before adding any content that I don't want spidered.

None of this may be applicable -- Just taking some guesses based on what you posted.

I have noticed that Slurp tries to fetch indexes for directories in which it has found content. That is, if it finds a link to /pages/foo.html, it may try to fetch /pages/. On Apache servers with "Options -Indexes" set, this results in a 403-Forbidden response. Similarly-configured IIS servers probably do the same. However, Slurp does seem to honor robots.txt even when doing this -- I've only seen it when fetching pages in that directory is allowed, and the only strange thing about it is that it's trying to fetch a directory index which is not linked-to anywhere on the Web.
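For reference, the Apache setting mentioned above is a one-line directive; this is a generic sketch (the directory path is a placeholder):

```apache
# In httpd.conf (or .htaccess, where overrides are allowed): disable
# auto-generated directory listings, so a request for a bare directory
# URL like /pages/ returns 403 Forbidden instead of an index page.
<Directory "/var/www/example">
    Options -Indexes
</Directory>
```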

Jim

Staffa
msg:3643011, 4:16 pm on May 6, 2008 (gmt 0)

Thank you Jim, as always, a most explicit reply.

...you don't have any other records specific to Slurp, because if you do, it will honor only the first one.

No mention of Slurp in the entire robots.txt file

the syntax of your file is 100% correct

checked and correct

...Slurp has not yet processed your new robots.txt file

File date : # Last Updated: 19/05/2007
Since then no new rules added to the robots.txt file, only new *.swf files added to the restricted directory

...if it finds a link to /pages/foo.html, it may try to fetch /pages/

The directory /images/ contains no "text" files (html, asp, etc.), only image files (jpg, gif, and swf) which are displayed on pages outside that directory, and it's only the *.swf files that Y! has fetched several times now.

bilalseo
msg:3643170, 6:36 pm on May 6, 2008 (gmt 0)

Have you checked your robots.txt file using Google Webmaster Tools? Another thing to try: create a new robots.txt file and place it in the root. Wait a day and see the results.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved