Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Having Trouble With an Allow Instruction.
Trying to adapt an example given by jdMorgan.
Durnovaria (5+ Year Member)
Msg#: 3816892 posted 3:03 pm on Dec 31, 2008 (gmt 0)


I have been referring to jdMorgan's post (#:1528392) in this old thread:

[webmasterworld.com...]

I am trying to follow the example to stop Google (and, as has become apparent, Ask Jeeves) from indexing certain pages on my site, and I have adapted this example, which was written for someone else's specific situation:

User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /cgi-bin/
Disallow: /robot.html

User-agent: *
Disallow: /cgi-bin/
Disallow: /wiget1.html
Disallow: /wiget2.html
Disallow: /robot.html

Using the above example I have created my disallow list and got this far:

User-agent: Googlebot
User-agent: Ask Jeeves/Teoma

User-agent: *
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm

I removed the Disallow from the Google/Ask Jeeves section because I thought it was specific to the other person's site. When I test my robots.txt file it brings up an error saying that I need to give an instruction under the Google and Ask Jeeves section.

What do I need to put in the Google and Ask Jeeves section to get their robots to proceed as normal to my pages to find the <meta name="robots" content="noindex"> tags that I will put there?

I'm confused because the example given above only had Disallow lines for cgi-bin and robot.html, and no other Allow instruction. Because there was no specific Allow instruction, I thought I could remove those apparently-irrelevant-to-me Disallow lines and it would work properly by default!

Any advice would be much appreciated!

Many thanks,

Mike

[edited by: Durnovaria at 3:05 pm (utc) on Dec. 31, 2008]

 

phranque (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member)
Msg#: 3816892 posted 1:14 am on Jan 1, 2009 (gmt 0)

if you are trying to exclude only google and ask from those 4 pages, you need this:
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm

in your example, the blank line after the User-agent: list stops that set of exclusions and the wildcard User-agent specification applies to all robots.
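This behaviour can be checked mechanically. Python's standard-library `urllib.robotparser` (used here purely as an illustration; the parsers built into actual crawlers may differ) also treats a blank line as the end of a record, so a User-agent group with no rules is discarded and those bots fall back to the `*` group:

```python
from urllib.robotparser import RobotFileParser

# Durnovaria's original file: the Googlebot / Ask Jeeves group has no
# Disallow line, so the blank line ends it as an empty record.
broken = """\
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma

User-agent: *
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm
"""

rp = RobotFileParser()
rp.parse(broken.splitlines())

# The empty group is discarded, so Googlebot falls through to the
# * group and is blocked from the pages -- not what was intended.
print(rp.can_fetch("Googlebot", "/my_page_1.htm"))  # False
```

In other words, with the file as originally written, Googlebot ends up blocked from the very pages it was supposed to visit.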

Durnovaria (5+ Year Member)
Msg#: 3816892 posted 12:08 pm on Jan 1, 2009 (gmt 0)

Thanks for your reply :-)

I was trying to follow the other example given and modify it for my needs.

The bit at the top with Google and Ask Jeeves in it was a separate section just for their robots. What it was supposed to do was direct Google and Ask Jeeves robots to go to the pages where they would find the "noindex" meta tag, due to the way they apparently log or index pages.

The remaining part of the file was for all other robots who apparently wouldn't have a problem excluding the pages in the list.

So what I think I need is an instruction under the Googlebot and Ask Jeeves section (but before the User-agent: * section) to send Google and Ask Jeeves to my pages. Apparently if I don't do that they will just use the normal Disallow list and still log the page URLs.

Mike

phranque (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member)
Msg#: 3816892 posted 12:49 pm on Jan 1, 2009 (gmt 0)

yes that is slightly different from what you had:
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow:

User-agent: *
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm

(note the "blank" disallow)
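The effect of that "blank" Disallow can likewise be demonstrated with Python's stdlib `urllib.robotparser` (an illustrative stand-in for any conforming parser): an empty `Disallow:` means "nothing is disallowed", so the named bots get an allow-all group of their own:

```python
from urllib.robotparser import RobotFileParser

# phranque's corrected file: the empty "Disallow:" gives the named
# bots a rule of their own, meaning "nothing is disallowed".
fixed = """\
User-agent: Googlebot
User-agent: Ask Jeeves/Teoma
Disallow:

User-agent: *
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm
"""

rp = RobotFileParser()
rp.parse(fixed.splitlines())

# Googlebot now matches its own (allow-all) group and can reach the
# pages to see the <meta name="robots" content="noindex"> tags...
print(rp.can_fetch("Googlebot", "/my_page_1.htm"))    # True
# ...while every other robot falls under * and is excluded.
print(rp.can_fetch("SomeOtherBot", "/my_page_1.htm"))  # False
```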

Durnovaria (5+ Year Member)
Msg#: 3816892 posted 1:26 pm on Jan 1, 2009 (gmt 0)


Thank you :-)

It's all new to me, so I didn't really know how it worked!

I have updated my file now, so hopefully that should work.

Thanks again,

Mike

phranque (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member)
Msg#: 3816892 posted 1:47 pm on Jan 1, 2009 (gmt 0)

you can validate your rules with Google Webmaster Tools (GWT):
Checking robots.txt - Webmaster Help Center [google.com]

Durnovaria (5+ Year Member)
Msg#: 3816892 posted 2:52 pm on Jan 1, 2009 (gmt 0)

I tried my robots.txt file in Google Webmaster Tools and it said the following:

Allowed by line 3: Disallow:
Detected as a directory; specific files may have different restrictions

Also, out of curiosity I tested it on this site as well: [searchenginepromotionhelp.com...]

On that site it said:

Line 1: User-agent: Googlebot
("The line below must be an allow, disallow or comment statement")

Line 2: User-agent: Ask Jeeves/Teoma

Line 3: Disallow:
("Missing / at start of file or folder name")
So it appears not to like Line 1, saying that the line below it should be an allow, disallow or comment statement, and it doesn't like Line 3, saying that a forward slash is missing at the start of the file name.

I've no idea what it's going on about, so I thought I would mention it!

Mike

:-)

g1smd (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 3816892 posted 2:58 pm on Jan 1, 2009 (gmt 0)

Some parsers cannot cope with multiple User-agent: lines preceding the Disallow: statement(s).

There must be one or more Disallow: statements after the User-agent: line(s).

There must be a blank line after the last Disallow: statement of each block (i.e. before the next User-agent: line).

If there is a specific section for Google then it reads only that section of the file. That is, it does NOT read the User-agent: * section at all.

This is the correct syntax if everything is allowed: Disallow:
If a checker says otherwise, then it is the checker that is faulty.
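g1smd's point that a robot with its own section ignores the `*` section entirely can also be demonstrated with Python's stdlib `urllib.robotparser` (again only as an illustrative parser; the bot names below are examples):

```python
from urllib.robotparser import RobotFileParser

# A robot with a section of its own reads only that section; the
# * section is never consulted for it.
rules = """\
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow: /my_page_1.htm
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/my_page_1.htm"))    # True: * is ignored
print(rp.can_fetch("Googlebot", "/cgi-bin/x"))        # False: its own rule
print(rp.can_fetch("AnyOtherBot", "/my_page_1.htm"))  # False: * applies
```

This is why a specific section must repeat any `/cgi-bin/`-style rules it still wants: it cannot inherit them from `*`.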

Durnovaria (5+ Year Member)
Msg#: 3816892 posted 3:17 pm on Jan 1, 2009 (gmt 0)


Okay, thanks. :-) I didn't doubt that what I was told here was correct, but I did wonder why that checker came up with those comments!

Mike

phranque (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member)
Msg#: 3816892 posted 7:24 am on Jan 2, 2009 (gmt 0)

so try this then:
User-agent: Googlebot
Disallow:

User-agent: Ask Jeeves/Teoma
Disallow:

User-agent: *
Disallow: /my_page_1.htm
Disallow: /my_page_2.htm
Disallow: /my_page_3.htm
Disallow: /my_page_4.htm

Durnovaria (5+ Year Member)
Msg#: 3816892 posted 1:43 pm on Jan 2, 2009 (gmt 0)


I'm happy with the one you did for me before, Phranque.

Again, out of curiosity, I ran that latest one through the checking program and it didn't like that either!

For lines 2 and 5 it said 'Missing / at start of file or folder name' and for line 11 (the final line) it said 'The line below must be an allow, disallow, comment or a blank line statement.'

I don't think I'll be using that checking program again!

Mike :-)
