User-agent: Black Hole
Disallow: /
User-agent: Titan
Disallow: /
User-agent: WebStripper
Disallow: /
...and so on for about 100 blocked robots, then:
User-agent: *
Disallow: /myfinancialhistory/
Disallow: /memberinfo/
This is the site structure:
www.foobar.com/myfinancialhistory/bankaccounts.htm
www.foobar.com/memberinfo/criminalrecords.htm
I used the Google remove option a few days ago and the pages are still there. Is there something wrong with the robots.txt syntax? Should I move the User-agent: * stuff to the top?
Thanks
You probably want to keep that "User-agent: *" record at the end -- remember that good robots will obey the first record containing either a match on their user-agent name or "*", whichever comes first.
Check your file for extraneous characters - such as spaces at the end of lines, etc.
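If it helps, here's a minimal Python sketch for spotting that sort of thing in a local copy of the file -- it assumes the file has been saved as "robots.txt" in the current directory, so adjust the name to suit:

# Quick sanity check on a local copy of robots.txt for stray characters.
# Assumes the file is saved as "robots.txt" in the current directory.
with open("robots.txt", "rb") as f:
    raw = f.read()

if raw.startswith(b"\xef\xbb\xbf"):
    print("warning: file starts with a UTF-8 byte-order mark")

for num, line in enumerate(raw.decode("utf-8", errors="replace").splitlines(), 1):
    if line != line.rstrip():
        print(f"line {num}: trailing whitespace")
    for ch in line:
        if ord(ch) < 32 or ord(ch) > 126:
            print(f"line {num}: unexpected character {ch!r}")
            break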
More info:
Learn: [robotstxt.org...]
Validate: [searchengineworld.com...]
Jim
If you cannot find a problem in your robots.txt, then I recommend you write to the company at googlebot@google.com. Send them a copy of the entries in your logs showing where it did not follow your robots.txt directives, plus the URL of your site. They'll check out what happened with their bot.
An issue? What, with respect to 'regular' GoogleBot? No, that's not an issue.
But you do want your specific, per-robot stuff first, and then either allow or disallow the rest with the "User-agent: *" record at the end.
Also, since you say you have about 100 bad-bot disallows, you might want to peruse this old thread: [webmasterworld.com...]
Jim
Remember that good robots will obey the first record containing either a match on their user-agent name or "*", whichever comes first.
Do you have a source you can quote for that, Jim? The only place I remember order mattering like that (general versus specific) in robots.txt is in the Disallow/Allow statements.
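For what it's worth, here's a rough Python sketch of the selection behaviour being debated: a robot looks for a record that names its own user-agent and only falls back to the "*" record otherwise. The data is just the example from this thread, and this isn't meant to describe what any particular bot actually does:

# Rough illustration only: pick which robots.txt record a given robot would use.
def pick_record(records, robot_name):
    fallback = None
    for name, disallows in records:
        if name == "*":
            if fallback is None:
                fallback = disallows          # remember the catch-all record
        elif name.lower() in robot_name.lower():
            return disallows                  # a record naming this robot wins
    return fallback if fallback is not None else []

records = [
    ("WebStripper", ["/"]),
    ("*", ["/myfinancialhistory/", "/memberinfo/"]),
]
print(pick_record(records, "WebStripper"))  # ['/']
print(pick_record(records, "Googlebot"))    # ['/myfinancialhistory/', '/memberinfo/']

Under that reading, where the "*" record sits in the file doesn't change the outcome; under a strict first-match reading, it would.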