Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Please, help with robots.txt
abrodski




msg:3961762
 1:53 pm on Jul 29, 2009 (gmt 0)

Hello!

In Joomla's default robots.txt file, I see that the images directory is NOT allowed to be indexed... but below it there's a "stories" sub-directory which contains lots of graphic images. So would Google index the "stories" subdirectory? Or, because the parent directory (images) is marked as not allowed to be indexed, would the robots NOT index any subdirectories underneath it?
Disallow: /images/
That's how it appears in Joomla's default robots.txt file.
So I also added Disallow: /images/stories
BTW, do robots mind the spaces? I mean, in the original default Joomla robots.txt file there are no spaces... like:
User-agent: *Disallow:.......
I changed that to:
User-agent: * Disallow:.......
Also...after that...tor/
Disallow: /cache/Disallow: /components/Disallow:..........
You see, Disallow has no space between it and the previous word...
Is it OK?

I get an error in Googlebot:
In Webmaster tools...:
Line 16: / Syntax not understood

Text of http://example.com/robots.txt

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Disallow: /images/stories/

As you see... I disallowed a subdirectory after disallowing its parent directory...
And after I used a robots.txt generator, I got this:
User-agent: *
Disallow: /components/
Disallow: /libraries/
Disallow: /images/
Disallow: /modules/
Disallow: /administrator/
Disallow: /xmlrpc/
Allow: /
Disallow: /cache/
Disallow: /language/
Disallow: /includes/
Disallow: /installation/
Disallow: /images/stories/
Disallow: /media/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /templates/
That doesn't make any sense to me... why are some directories disallowed and some allowed-disallowed? Or... maybe it means that the default on the site is to allow everything except what's disallowed... but then why does that Allow command sit in between the blocks and not right at the beginning?

[edited by: goodroi at 12:54 pm (utc) on July 30, 2009]
[edit reason] Please no urls [/edit]

 

abrodski




msg:3961769
 1:59 pm on Jul 29, 2009 (gmt 0)

Same thing happens if I try this... Allow all, and then:
BlockAll robots/images/, /images/stores/
and I get...
User-agent: *
Disallow: /images/
Allow: /
Disallow: /images/stores/

jdMorgan




msg:3961796
 2:39 pm on Jul 29, 2009 (gmt 0)

It would be a very good idea for you to read the "Standard for Robot Exclusion [robotstxt.org]" rather than trying to guess at robots.txt syntax or functions.

If a directory is Disallowed, then all of its subdirectories are disallowed. And to be more specific, if any URL path-part is Disallowed, then all URL-paths beginning with that path-part are disallowed; robots.txt handling is based on prefix-matching.
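
To illustrate the prefix-matching rule, here is a minimal Python sketch. The function name is_disallowed and the sample paths are made up for illustration, and it only models plain Disallow prefixes as in the original standard (no wildcards, no Allow), so treat it as a rough model rather than how any particular robot is implemented.

```python
def is_disallowed(path, disallow_rules):
    """Return True if `path` begins with any non-empty Disallow prefix.

    Hypothetical helper for illustration only: it models simple
    prefix-matching and ignores Allow lines and wildcard extensions.
    """
    # An empty Disallow value blocks nothing, so skip empty rules.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/images/"]

# A subdirectory of a Disallowed directory is already covered by the parent rule,
# so an extra "Disallow: /images/stories/" line is redundant.
print(is_disallowed("/images/stories/photo.jpg", rules))

# A path that merely looks similar does not start with the prefix "/images/".
print(is_disallowed("/imagesets/photo.jpg", rules))
```

Under this model the first call reports the path as blocked and the second does not, because matching is purely on the leading characters of the URL-path.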

You have three major solution options available:

1) If possible, use an on-page <meta name="robots" content="noindex,follow"> instead of Disallowing the top-level directory. This only works if all objects to be disallowed in that directory are HTML pages.

2) Move the allowed directory out from under any Disallowed directory. This is the better long-term solution, and works for all robots.

3) For Google and other major robots which explicitly state that they recognize it, use the new "Allow:" extension to the robots.txt protocol, and also provide a separate policy record for those robots which do not claim support for it. (Obviously, this means that either those robots will never be able to access the "allowed" directory below the Disallowed directory, or that you cannot Disallow the top-level directory to these robots.)
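
As a rough sketch of option 3, the Python snippet below feeds a two-record robots.txt to the standard-library urllib.robotparser module and checks a few paths. The file contents, the "SomeOtherBot" name, and the sample paths are assumptions made up for illustration. Python's parser applies the first matching rule in file order, which is why the Allow line is placed before the Disallow line here; real robots may resolve Allow/Disallow conflicts differently, so check each robot's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for a robot that supports "Allow:",
# and a separate catch-all record for robots that may not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /images/stories/
Disallow: /images/

User-agent: *
Disallow: /images/
"""


def allowed(agent, path):
    """Return True if `agent` may fetch `path` under ROBOTS_TXT (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# Googlebot's record allows the subdirectory but still blocks the rest of /images/.
print(allowed("Googlebot", "/images/stories/photo.jpg"))
print(allowed("Googlebot", "/images/logo.png"))

# Any other robot falls through to the "*" record and is blocked from all of /images/.
print(allowed("SomeOtherBot", "/images/stories/photo.jpg"))
```

Note the trade-off described in option 3: robots covered only by the "*" record never reach /images/stories/ unless the top-level Disallow is dropped for them.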

Jim
