faulty robots.txt? - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi

faulty robots.txt?

tenaka

9:48 am on Dec 27, 2003 (gmt 0)

10+ Year Member

hi guys,

my robots.txt looks like:

quote:
--------------------------------------------------------------------------------

User-agent: *
Disallow: /cgi-bin/
Disallow: /gallery/
Disallow: /images/
Disallow: /stat_www/
Disallow: /stat_www_old/
Disallow: /survey/
Disallow: /templates/

--------------------------------------------------------------------------------

When I look at the logfile I see strange things: Googlebot is regularly visiting and indexing pages, but
quote:
--------------------------------------------------------------------------------
this one: 66.196.65.36 - Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
--------------------------------------------------------------------------------

only comes to my site, reads the robots.txt all the time, and then leaves again. Today it read it 10 times and nothing else.

Btw, which chmod permissions do I have to set on the robots.txt?
And how do some people manage to get a 404 error when retrieving my robots.txt?

Am I doing something wrong?

ncw164x

9:03 pm on Jan 2, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Your robots.txt file should be

[nameofyoursite...]

That is, in the root of your site. You should not be getting a 404 error via the browser if the file is in this directory.
Are you seeing a 404 error when Googlebot and Slurp request the file?

The file must be FTP'd in ASCII mode, not binary.

hope this helps

ncw164x

jdMorgan

9:16 pm on Jan 2, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

tenaka,

Welcome to WebmasterWorld [webmasterworld.com]!

To second ncw164x, your robots.txt looks fine. For added reassurance, run it through this robots.txt validator [searchengineworld.com].

Note that robots.txt Disallow patterns are prefix-matched. That is, the robot will not fetch any URL whose path begins with the string you specify after Disallow. Therefore, you can disallow both "/stat_www/" and "/stat_www_old/" using the single directive:


Disallow: /stat_www

The only side effect will be if you have other files or subdirectories whose names also start with "stat_www". For example, "/stat_www_public.html" would also be disallowed.
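
If you want to try the prefix matching out locally, here is a rough sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the shortened rule set, and the sample paths are just assumptions for illustration, and real crawlers may behave differently from this parser in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Shortened rule set using the single prefix directive (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /stat_www
"""


def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical paths: the first three all start with "/stat_www".
    for path in (
        "/stat_www/index.html",
        "/stat_www_old/index.html",
        "/stat_www_public.html",
        "/index.html",
    ):
        print(path, "allowed" if is_allowed(ROBOTS_TXT, path) else "disallowed")
```

Only "/index.html" comes out as allowed here, because the other three paths all begin with the disallowed prefix.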

Inktomi's Slurp is notoriously slow about digging deeply into sites; you may just have to wait a while. If your site is commercial and you want it spidered soon and frequently, consider the paid inclusion option.

chmod 644 should be fine; robots.txt is fetched just like any other text file or HTML page.
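
For anyone who would rather script the permission change than use an FTP client or a shell, a minimal sketch in Python follows. It assumes a Unix-like host and that you can run Python on the server; the function name make_world_readable and the example path are made up for illustration.

```python
import os
import stat


def make_world_readable(path: str) -> str:
    """Set mode 644 on `path` and return the resulting mode string.

    644 means: owner may read and write, group and others may only read.
    """
    os.chmod(path, 0o644)
    # stat.filemode() renders the mode the way `ls -l` does.
    return stat.filemode(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever your document root is.
    print(make_world_readable("robots.txt"))
```

On a regular file this should report "-rw-r--r--", which is enough for the web server to read and serve the file.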

Jim

tenaka

9:27 pm on Jan 2, 2004 (gmt 0)

10+ Year Member

thx guys,

I just got an email from Inktomi's tech support telling me that they retrieve the robots.txt from time to time to check that the pages are still up. They also told me that they have thousands of indexed pages from my cgi-bin (I banned them from it recently; that was a mistake on my side). Well, actually they said they have a few pages from my site in their index, and that if I have more I should link to them.

I checked again and they only have 3 pages and 1000 old ones from my cgi-bin...

I will just wait a little longer...

tenaka

2:13 pm on Jan 3, 2004 (gmt 0)

10+ Year Member

talking about 404 errors:

66.77.73.162 - FAST-WebCrawler/3.8 (crawler at trd dot overture dot com; [alltheweb.com...]

Date         Page         Status  Referer
01/03 15:01  /robots.txt  404     -

This was the first time FAST had visited me in a long time, and it did not find my robots.txt?

Of course I have one, and it is OK.
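
To see how often this happens, rows like the one above can be filtered with a few lines of Python. This is only a sketch: it assumes the stats tool always prints the five whitespace-separated fields shown (date, time, page, status, referer), and the names Hit, parse_row, and failed_robots_fetches, as well as the second sample row, are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    date: str
    time: str
    page: str
    status: int
    referer: str


def parse_row(row: str) -> Hit:
    """Split one 'date time page status referer' row into its fields."""
    date, time, page, status, referer = row.split(maxsplit=4)
    return Hit(date, time, page, int(status), referer)


def failed_robots_fetches(rows: Iterable[str]) -> List[Hit]:
    """Return the robots.txt requests that did not get a 200 response."""
    hits = (parse_row(row) for row in rows if row.strip())
    return [h for h in hits if h.page == "/robots.txt" and h.status != 200]


if __name__ == "__main__":
    rows = [
        "01/03 15:01 /robots.txt 404 -",  # row from the stats page
        "01/03 15:02 /index.html 200 -",  # hypothetical row for contrast
    ]
    for hit in failed_robots_fetches(rows):
        print(hit.date, hit.time, hit.page, hit.status)
```

If the 404s show up for only one crawler or only at certain times, that points at the server or host configuration rather than at the file itself.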