5) Are the bots actually indexing the contents of those pages? I.e., are you seeing a title and snippet? Or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will usually only come up if you search for that particular URL.
I guess that's the problem: the robots.txt validates, and it's Google I have problems with. But it's not crawling the pages, just showing the URL in a site: search.
Maybe I can remove them from the site: search by using this meta tag: <META NAME="robots" CONTENT="noindex, nofollow, noarchive" />
is invalid. Never use * in the Disallow field when it is in the User-agent field. Most bots will see it as invalid.
Googlebot DOES allow it, though.
is perfectly valid but pointless, since Disallow: /page.php? would have the EXACT same effect. A more useful wildcard (for Googlebot) would be something like Disallow: /*? which would disallow any URL containing "?".
The original robots.txt should work for all bots but not if they are following an inbound link to the URL in question. In that case the listing would be URL-only and never go beyond that.
If the inbound link (IBL) is removed, the URL-only listing should also disappear within a few crawls.
will disallow any URL beginning with /page.php?
It will not disallow /page.php, but it will disallow /page.php?id=2, /page.php?id=3, etc.
robots.txt is based on prefix matching, meaning that any URL that begins with the prefix /page.php? will be disallowed.
If you disallow /page.php
then it will disallow /page.php AND /page.php?id=2, /page.php?id=3, etc., because they all begin with the prefix /page.php
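The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the matching idea, not how any particular bot implements it; the function name is_disallowed is made up for the example.

```python
def is_disallowed(url_path: str, disallow_prefix: str) -> bool:
    """Plain prefix match: the rule applies if the URL path starts with it."""
    # An empty Disallow value conventionally blocks nothing.
    if not disallow_prefix:
        return False
    return url_path.startswith(disallow_prefix)


if __name__ == "__main__":
    # Rule "/page.php?" leaves the bare page alone but blocks the query URLs.
    for path in ("/page.php", "/page.php?id=2", "/page.php?id=3"):
        print(path, is_disallowed(path, "/page.php?"))
    # Rule "/page.php" blocks all three, since they share that prefix.
    for path in ("/page.php", "/page.php?id=2", "/page.php?id=3"):
        print(path, is_disallowed(path, "/page.php"))
```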
Googlebot can use the wildcard
/*.php will disallow all .php files in every directory because you are essentially saying
disallow: /(any text string).php
all files with a .php extension would be matched but files with a different extension such as .html would not match.
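A rough Python sketch of that wildcard idea, assuming * stands for any run of characters and the rule is still matched from the start of the path. The helper names are invented for the example, and real bots may differ in the details.

```python
import re


def wildcard_rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Turn a Disallow value containing * into a regex anchored at the path start."""
    # Escape everything, then turn each escaped "*" back into "match anything".
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile(pattern)


def is_disallowed_wildcard(url_path: str, rule: str) -> bool:
    # re.match only matches at the beginning, which keeps the prefix behaviour.
    return wildcard_rule_to_regex(rule).match(url_path) is not None


if __name__ == "__main__":
    print(is_disallowed_wildcard("/dir/file.php", "/*.php"))   # .php in a subdirectory
    print(is_disallowed_wildcard("/dir/file.html", "/*.php"))  # different extension
    print(is_disallowed_wildcard("/page.php?id=2", "/*?"))     # URL containing "?"
    print(is_disallowed_wildcard("/page.php", "/*?"))          # no "?" in the URL
```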
MSN and Inktomi do not allow the wildcard * in the Disallow field, but they do obey User-agent: * (so does Googlebot).
Googlebot will obey the User-agent: * one because it comes first.
When you have multiple records in robots.txt naming multiple user-agents, User-agent: * should be the last entry, to pick up all bots that did not match a named User-agent string.
Remember, only Googlebot can handle * in the Disallow field.
In most cases a simple robots.txt with user-agent: * will give one command to all bots but in some cases it is useful to give different bots different directives.
I don't have a problem with any other search engines indexing URLs containing "?", only Googlebot, so I gave it a special directive.
I want 'em all out of my cgi-bin, and those first 2 bots are wasting my bandwidth, but they do obey robots.txt.
(just a crude example)
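For anyone who wants to check a layout like this before uploading it, here is a small sketch using Python's standard urllib.robotparser. The robots.txt text and the bot names ExampleBotA and ExampleBotB are placeholders I made up, not real crawlers. It only uses plain prefix rules, since that parser is not a model of Googlebot's wildcard handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named bots are shut out entirely,
# everyone else is only kept out of /cgi-bin/. The catch-all record is last.
ROBOTS_TXT = """\
User-agent: ExampleBotA
User-agent: ExampleBotB
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent, path in [
        ("ExampleBotA", "/index.html"),
        ("SomeOtherBot", "/index.html"),
        ("SomeOtherBot", "/cgi-bin/form.pl"),
    ]:
        print(agent, path, parser.can_fetch(agent, path))
```

The named bots get their own record and every other bot falls through to the User-agent: * record.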