
Y! and robots.txt

Partial ignore

     
1:23 pm on May 6, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


My robots.txt file is more than a year old and has always been respected.
Lately I added some *.swf files to a disallowed directory and Y! goes after them. This directory also holds *.jpg and *.gif files which it ignores.

What does Y! not understand in:

User-agent: *
Disallow: /somedirectory/

Anyone else seen this before?
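
A rule like this is easy to sanity-check offline, since Python's standard-library robots.txt parser applies the usual prefix-matching logic. A minimal sketch -- the file names below are placeholders, not taken from the actual site:

# Check the posted record against a couple of hypothetical paths.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /somedirectory/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Both should print False: the rule makes no distinction by file type.
print(rp.can_fetch("Slurp", "/somedirectory/picture.jpg"))
print(rp.can_fetch("Slurp", "/somedirectory/movie.swf"))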

1:52 pm on May 6, 2008 (gmt 0)

Full Member

5+ Year Member

joined:Sept 11, 2007
posts:303
votes: 0


The code is fine, so what's wrong with Y!? Hey sfaffa, I've been caught in the same situation and found zilch after experimenting with my robots.txt file. These days the crawlers make no sense when it comes to reading a robots.txt file...

2:09 pm on May 6, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Slurp, like all robots, will obey the first robots.txt record that applies to its User-agent string. So be sure that you don't have any other records specific to Slurp, because if you do, it will honor only the first one.

Also, make sure the syntax of your file is 100% correct: all comments on separate lines starting with "#", and one and only one blank line after each record (including the last one).
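
Jim's first-record point can be illustrated with the same standard-library parser; the Slurp record and directory names below are invented for the example, and the string also shows the comment and blank-line conventions he describes:

# A Slurp-specific record means the "*" record no longer applies to Slurp.
from urllib.robotparser import RobotFileParser

rules = """\
# Comments go on their own lines, starting with "#".
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Slurp matches its own record, so /images/ is not blocked for it...
print(rp.can_fetch("Slurp", "/images/movie.swf"))         # True
# ...while other robots fall through to the "*" record.
print(rp.can_fetch("SomeOtherBot", "/images/movie.swf"))  # False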

Also, it's possible that Slurp has not yet processed your new robots.txt file. I prefer to post a new robots.txt file at least 24 hours before adding any content that I don't want spidered.

None of this may be applicable -- just taking some guesses based on what you posted.

I have noticed that Slurp tries to fetch indexes for directories in which it has found content. That is, if it finds a link to /pages/foo.html, it may try to fetch /pages/. On Apache servers with "Options -Indexes" set, this results in a 403-Forbidden response; similarly-configured IIS servers probably do the same. However, Slurp does seem to honor robots.txt even when doing this -- I've only seen it when fetching pages in that directory is allowed, and the only strange thing about it is that it's trying to fetch a directory index which is not linked to anywhere on the Web.
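
What the server returns for such a bare directory request is easy to check directly; a minimal sketch with Python's urllib, where the host and path are placeholders:

# See what status code a directory-index request gets.
import urllib.request
import urllib.error

try:
    with urllib.request.urlopen("http://www.example.com/images/") as resp:
        print(resp.status)   # 200: an index page or index document was served
except urllib.error.HTTPError as err:
    print(err.code)          # 403 is what Apache returns with "Options -Indexes"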

Jim

4:16 pm on May 6, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Thank you Jim, as always, a most explicit reply.

...you don't have any other records specific to Slurp, because if you do, it will honor only the first one.

No mention of Slurp in the entire robots.txt file.

the syntax of your file is 100% correct

Checked and correct.

...Slurp has not yet processed your new robots.txt file

File date: # Last Updated: 19/05/2007
Since then no new rules have been added to the robots.txt file; only new *.swf files were added to the restricted directory.

...if it finds a link to /pages/foo.html, it may try to fetch /pages/

The directory /images/ contains no "text" files (html, asp, etc.), only image files: jpg, gif and swf, which are displayed on pages outside that directory. And it's only the *.swf files that Y! has fetched, several times now.
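
One way to confirm that pattern is to pull Slurp's requests out of the server's access log; a rough sketch, assuming an Apache-style log file at a hypothetical path:

# List every /images/ file that Slurp has requested.
import re

request_re = re.compile(r'"GET (/images/[^" ]+)')

with open("access.log") as log:   # placeholder path and log format
    for line in log:
        if "Slurp" in line:
            match = request_re.search(line)
            if match:
                print(match.group(1))
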
6:36 pm on May 6, 2008 (gmt 0)

Full Member

5+ Year Member

joined:Sept 11, 2007
posts:303
votes: 0


Have you checked your robots.txt file with Google Webmaster Tools? Try another thing: create a new robots.txt file and place it in the root. Wait for a day and see the results.
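
It can also be worth testing the live file rather than the copy on disk, since the live file is what the crawler actually sees. A minimal sketch, with placeholder URLs:

# Fetch the live robots.txt and test the problem URL against it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()
print(rp.can_fetch("Slurp", "http://www.example.com/images/movie.swf"))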
 
