| 12:41 pm on Oct 13, 2009 (gmt 0)|
I'm not sure which search engine you are trying to remove your URLs from. A common mistake is to use robots.txt to block access to a page after it has been indexed, and then add a "noindex" meta tag to make the search engine remove the page from its index. Since robots.txt is blocking access, the search engine can never crawl the page to see the "noindex" meta tag.
If you are using meta tags, make sure your robots.txt is not blocking access to those pages. Also run your robots.txt through an official validator to confirm it is doing what you want it to do. Good luck.
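In other words, for meta-tag removal the page must stay crawlable. A minimal sketch of the two pieces (the filename and paths here are hypothetical, not from the poster's site):

```
# robots.txt -- no Disallow covering the page you want de-indexed
User-agent: *
Disallow:

# old-page.html -- the tag the crawler must be able to fetch and read
<meta name="robots" content="noindex">
```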
| 12:50 pm on Oct 13, 2009 (gmt 0)|
Thanks for replying to this. I will investigate further but I think all is in order regarding the issues you have suggested.
An interesting (weird!) thing has shown up in Google Webmaster Tools: when I test against the robots file, it does not block the files specified when the setting is "User-agent: *", yet when I change the setting to "User-agent: Googlebot" it responds correctly, i.e. it blocks the files and folders disallowed in the robots file... strange goings-on indeed!
| 1:09 pm on Oct 13, 2009 (gmt 0)|
You might consider posting a shortened version of your robots.txt file for review with only a few URL-path Disallows/Allows showing in each 'section' and obscuring those URL-paths for your own security.
The problem could be one of syntax, structure, or user-agent-policy-record priority.
I can tell you that most robots.txt validators are flawed, and that none of them uses the search engine's actual robots.txt parsing code to evaluate the file... I've found discrepancies in *all* the major search engines' robots.txt validation tools.
| 1:18 pm on Oct 13, 2009 (gmt 0)|
I am 100% certain the syntax etc. is fine (I have been creating robots files for years), and there is nothing complex in this file - that is why the problem is driving me nuts...
As mentioned above - the robots file ignores Disallows issues after User-agent: * - but if I change this to User-agent: Googlebot it works perfectly... this seems to be the cause of the URL removals being denied, but I still want to know why it is ignoring the User-agent: * record...
| 2:33 pm on Oct 13, 2009 (gmt 0)|
Aight, we'll have to do things the hard/slow way, then.
> the robots file ignores Disallows issues after User-agent: *
The robots file ignores Disallows issues? AFAIK, the file just sits there on the server and gets fetched by robots, so this statement is unclear. I assume that you mean that Googlebot appears to ignore Disallow directives in the "User-agent: *" policy record, but that it seems to obey them if the "User-agent:" name in that record is changed from "*" to "Googlebot".
Since we can't see the file, a few questions come to mind:
Is there more than one User-agent policy record that applies to Googlebot (i.e. both "Googlebot" and "*")?
Where do the other UA policy records appear relative to the "User-agent: *" record?
Are these other UA policy records more or less "permissive" than the "User-agent: *" record?
Is there a completely-blank line after each UA policy record, including the last one in the file?
Are there any spurious blank lines, say, between a User-agent: line and a "Disallow:" line?
Do all comment lines begin with the required "#" character?
It'd likely be easier to spot the problem with an example to look at, unless you prefer to continue to insist that Gbot is broken instead of making sure that no-one can spot *any* problem --actual or potential-- with your robots.txt file.
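One way to sanity-check a file outside any search engine's flawed validator is Python's standard-library robots.txt parser. The rules below are a made-up example, not the poster's actual file:

```python
# Evaluate robots.txt rules against different user agents with the
# standard-library parser (urllib.robotparser).
import urllib.robotparser

# Hypothetical rules: a named Googlebot record plus a catch-all record.
rules = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its named record; any other bot falls through to "*".
# Both records disallow /private/, so both fetches should be refused.
print(rp.can_fetch("Googlebot", "/private/page.html"))     # False
print(rp.can_fetch("SomeOtherBot", "/private/page.html"))  # False
```

This at least shows how a spec-following parser resolves the named record vs. the catch-all, even if Googlebot's own parser differs in details.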
| 2:41 pm on Oct 13, 2009 (gmt 0)|
Ok jd - here is a sample of the file. The "issues" was a typo that was meant to say "issued"...
Anyway - see below:
Above doesn't work - but if the User-agent is changed to Googlebot, everything is fine.
The problem has been resolved in that I can now remove the files I want from the SERPs, but I would still like to know why the User-agent: * record appears to be functioning incorrectly...
| 2:43 pm on Oct 13, 2009 (gmt 0)|
The remainder of the robots file is simply more directories and files Disallowed in the same manner - the correct manner.
Now you can see why I am sure everything is fine syntax-wise :-)
| 3:11 pm on Oct 13, 2009 (gmt 0)|
It is usual to place the "User-agent: *" policy record last, as it then acts as a catch-all for those agents not already specifically matched.
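A conventionally structured file, then, looks something like this - one record per agent, a blank line between records, and the catch-all last (the paths are hypothetical):

```
User-agent: Googlebot
Disallow: /private/
Disallow: /temp/

User-agent: *
Disallow: /private/
Disallow: /temp/
```

Note that a robot matched by a named record uses only that record; it does not also read the "User-agent: *" rules, so Disallows generally need repeating in both.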
> Now you can see why I am sure everything is fine syntax-wise :-)
I listed policy-record "structure" and "priority" above --in addition to syntax-- as things to check.
| 3:24 pm on Oct 13, 2009 (gmt 0)|
Only using one User-agent line, and everything really does seem to be in order - I have had 3 others look at this file too.
I have just spoken to someone else about it and he suggests maybe something is up on the server-configuration side... will look into this and report back...
Thanks for all the input Jim - you too goodroi.
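For anyone chasing the server-configuration angle: search engines expect robots.txt to come back with an HTTP 200 status and a text/plain content type, so that is worth confirming first. An Apache sketch (hypothetical - adapt to your own server) to force the MIME type:

```
<Files "robots.txt">
    ForceType text/plain
</Files>
```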