
Forum Moderators: goodroi


Please, help with robots.txt

     

abrodski

1:53 pm on Jul 29, 2009 (gmt 0)

5+ Year Member



Hello!

In Joomla's default robots.txt file, I see that the images directory is NOT allowed to be indexed. But below it there's a "stories" subdirectory which contains lots of graphic images. So would Google index the "stories" subdirectory? Or, because the parent directory (images) is marked as not allowed to be indexed, would the robots NOT index any subdirectories underneath it?
Disallow: /images/
That's how it appears in Joomla's default robots.txt file.
So I also added Disallow: /images/stories
BTW, do robots mind the spaces? I mean, in the original default Joomla robots.txt file there are no spaces, like:
User-agent: *Disallow:.......
I changed that to:
User-agent: * Disallow:.......
Also...after that...tor/
Disallow: /cache/Disallow: /components/Disallow:..........
You see, there is no space between each Disallow and the word before it.
Is it OK?

I also get an error for Googlebot in Webmaster Tools:
Line 16: / Syntax not understood

Text of http://example.com/robots.txt

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /images/stories/

As you see, I disallowed a subdirectory after disallowing its parent directory.
And after I used a robots.txt generator, I got this:
User-agent: *
Disallow: /components/
Disallow: /libraries/
Disallow: /images/
Disallow: /modules/
Disallow: /administrator/
Disallow: /xmlrpc/
Allow: /
Disallow: /cache/
Disallow: /language/
Disallow: /includes/
Disallow: /installation/
Disallow: /images/stories/
Disallow: /media/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /templates/
That doesn't make any sense to me. Why are some directories disallowed, while others seem to be both allowed and disallowed? Or maybe it means that the default on the site is to allow everything except what's disallowed. But then why does that Allow line sit in the middle of the Disallow lines and not right at the beginning?

[edited by: goodroi at 12:54 pm (utc) on July 30, 2009]
[edit reason] Please no urls [/edit]

abrodski

1:59 pm on Jul 29, 2009 (gmt 0)

5+ Year Member



Same thing if I try this in the generator: allow all, and then block all robots from /images/ and /images/stores/.
And I get:
User-agent: *
Disallow: /images/
Allow: /
Disallow: /images/stores/

jdMorgan

2:39 pm on Jul 29, 2009 (gmt 0)

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



It would be a very good idea for you to read the "Standard for Robot Exclusion [robotstxt.org]" rather than trying to guess at robots.txt syntax or functions.

If a directory is Disallowed, then all of its subdirectories are disallowed. And to be more specific, if any URL path-part is Disallowed, then all URL-paths beginning with that path-part are disallowed; robots.txt handling is based on prefix-matching.
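
To make the prefix-matching idea concrete, here is a minimal Python sketch. It is only an illustration of the rule described above, not how any particular robot is implemented; the function name is_disallowed and the sample paths are made up for this example.

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if the URL-path begins with any non-empty Disallow prefix.

    Hypothetical helper for illustration only. An empty Disallow value
    blocks nothing, so empty prefixes are skipped.
    """
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)


# The single rule from Joomla's default file.
rules = ["/images/"]

# Everything under /images/ starts with the same prefix, so it is covered.
print(is_disallowed("/images/stories/photo.jpg", rules))
# A path outside that prefix is not affected.
print(is_disallowed("/index.php", rules))
```

Under this rule, a separate Disallow: /images/stories/ line adds nothing once Disallow: /images/ is already present.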

You have three major solution options available:

1) If possible, use an on-page <meta name="robots" content="noindex,follow"> instead of Disallowing the top-level directory. This only works if all objects to be disallowed in that directory are HTML pages.

2) Move the allowed directory out from under any Disallowed directory. This is the better long-term solution, and works for all robots.

3) For Google and other major robots which explicitly state that they recognize it, use the new "Allow:" extension to the robots.txt protocol, and also provide a separate policy record for those robots which do not claim support for it. (Obviously, this means that either those robots will never be able to access the "allowed" directory below the Disallowed directory, or that you cannot Disallow the top-level directory to these robots.)
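
As a rough sketch of option 3, the Python below feeds a two-record robots.txt to the standard library's urllib.robotparser and checks what each robot may fetch. The file contents, the paths, and the robot name "SomeOtherBot" are assumptions made up for this example. Python's parser applies rules in file order, so the Allow line is placed before the Disallow line; a robot that picks the most specific matching rule should reach the same result with this ordering.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for a robot that supports Allow,
# and a catch-all record for every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /images/stories/
Disallow: /images/

User-agent: *
Disallow: /images/
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Print the decision for each robot and path.
for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/images/stories/photo.jpg", "/images/logo.png"):
        print(agent, path, parser.can_fetch(agent, path))
```

Note that a robot reads only the record that matches it, so in a real file any other Disallow lines would have to be repeated in the Googlebot record as well.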

Jim

 
