Forum Moderators: goodroi
They all start with
User-agent: Googlebot
and then the next lines vary according to the source. All of these would seem to work?
Disallow: /*?
Disallow: /?
Disallow: /*?*
All of the above have been recommended. Google seems to like #1 and says to use that method in their support pages, but a test as of today (1/19/2005) had them displaying the error message "URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW /*? ". Maybe they have stopped allowing wildcards? Their own robots.txt file uses #2, and I've seen #3 suggested here and there. I'm afraid to try anything that might wipe out my index for 3 months without knowing what's worked for other people.
My real problem is that I have a page that uses incoming affiliate codes. So, something like index.asp?aff=1234. Google has indexed one of these and I want to get rid of it. The robots.txt file seemed like it might work, but I certainly don't want to disallow the root and have it fail to index the index.asp page. So, any suggestions anyone? Thanks in advance for any posts, and I know this has been asked a million times here, but nobody seems to finish the conversation with "this method worked for me..." ;-)
by the way, the robots.txt file I tried to register with google (that threw the error) was:
User-agent: Googlebot
Disallow: /*?
Thanks
-heuristick
In your suggestion, it would appear that the wildcard character isn't necessary, so the second option I listed (Disallow: /?) would work as well for blocking everything with a querystring from being indexed?
Your comment helps me out greatly for the particular situation I outlined, but I would also like to find an answer that works for all the pages with any querystrings. I found all sorts of advice scattered across the web, much of it conflicting, and much of it specific to a certain situation. Have you tried, tested, and succeeded with any of the three methods listed in my original post?
Again, thanks for your post. That more or less solves my particular situation, but for anyone else looking for a complete answer (which still includes me ;-), I'm still searching for a more definitive and universal answer.
Now I am looking for some input from experienced folks: what would be the fastest and most efficient way to get the new site structure into the SE indexes?
If a page is requested by a spider:
1. If '?' is in the URL, do a 301 on the page and point it to the new search-engine-friendly URL.
Or
2. If '?' is in the URL, place a meta tag:
<meta name="robots" content="noindex, follow">
Or
3. Use robots.txt to disallow re-crawling of the old indexed URLs with '?' in them.
Which way would be best to pass PR from the old pages to the new ones?
Page.cfm has PR of 3
Page.cfm?qurl=2 has PR of 2
What would be the best way to pass PR from Page.cfm?qurl=2 to Page.cfm/qurl/2.cfm? Both pages will have the same content.
Thanks for your input.
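As a sketch of option 1 above (the 301 redirect, which is generally the approach that passes link value from old URLs to new ones), here is a minimal Python function that computes a search-engine-friendly target for a query-string URL. The mapping from Page.cfm?qurl=2 to /Page.cfm/qurl/2.cfm mirrors the example in this post; the function name and the exact rewriting scheme are hypothetical, so adapt both to your own URL structure and serve the result with a 301 status.

```python
from urllib.parse import urlsplit, parse_qsl

def sef_redirect(url):
    """Return a search-engine-friendly 301 target for a '?' URL,
    or None if the URL has no query string and can be served as-is.

    Illustrative sketch: '?key=value' pairs become path segments,
    e.g. /Page.cfm?qurl=2 -> /Page.cfm/qurl/2.cfm.
    """
    parts = urlsplit(url)
    if not parts.query:
        return None  # already search-engine friendly
    # Flatten each (key, value) pair into consecutive path segments.
    segments = [s for pair in parse_qsl(parts.query) for s in pair]
    return parts.path + '/' + '/'.join(segments) + '.cfm'

print(sef_redirect('/Page.cfm?qurl=2'))  # /Page.cfm/qurl/2.cfm
print(sef_redirect('/Page.cfm'))         # None
```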
In your suggestion, it would appear that the wildcard character isn't necessary, so the second option I listed (Disallow: /?) would work as well for blocking everything with a querystring from being indexed?
No. A trailing wildcard is not necessary, because the Robots Exclusion Standard implicitly adds a wildcard to the end of each path.
But you still need a leading wildcard, so the general-purpose syntax to disallow any URL containing a '?' character is:
Disallow: /*?
The following one is equivalent (it uses an explicit trailing wildcard):
Disallow: /*?*
And the following one is simply wrong (some spider could interpret it as "disallow any index page containing a '?' character", but I'm not sure about it):
Disallow: /?
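One way to see the difference between the three directives is to implement Googlebot-style pattern matching yourself, since standard robots.txt parsers of that era treated patterns as literal path prefixes and did not expand '*'. The sketch below is an illustrative matcher (not Google's actual implementation): it treats '*' as "any sequence of characters", anchors the pattern at the start of the path, and leaves the end open, which gives the implicit trailing wildcard described above.

```python
import re

def is_disallowed(path, pattern):
    """Return True if `path` matches a Googlebot-style Disallow pattern.

    '*' matches any sequence of characters; the pattern is anchored at
    the start of the path and has an implicit trailing wildcard.
    Illustrative sketch only, not Google's actual matching code.
    """
    regex = '^' + '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.match(regex, path) is not None

# The affiliate URL from the original post:
url = '/index.asp?aff=1234'

print(is_disallowed(url, '/*?'))           # True  - leading wildcard matches any path
print(is_disallowed(url, '/*?*'))          # True  - equivalent, explicit trailing '*'
print(is_disallowed(url, '/?'))            # False - only matches paths starting with '/?'
print(is_disallowed('/index.asp', '/*?'))  # False - the plain page stays crawlable
```

Note how the last case shows the behavior the original poster wanted: index.asp?aff=1234 is blocked while index.asp itself remains crawlable.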
I found all sorts of advice scattered across the web, much of it conflicting, and much of it specific to a certain situation.
The original Robots Exclusion Standard is very vague and it leads to confusion.
Have you tried, tested, and succeeded with any of the three methods listed in my original post?
I have tried and used only the "Disallow: /*?" and it works flawlessly.
Also, do not pay too much attention to wildcard errors reported by Searchengineworld's robots.txt validator. The tool is a bit old and it does not support spider-specific (Googlebot) syntaxes.
It's odd, however, that Google uses the Disallow: /? style in their own robots.txt file. If you do a search for something that is blocked under this style (for instance, search google+mac) you'll get the www.google.com/mac page, while their robots.txt file has the line "Disallow: /mac?". So, it seems that the Disallow: /mac? directive works (no dynamic pages indexed) but still allows access to the /mac folder? Oh well, I've implemented the /*? method and it seems to be working (the last index knocked off the aff=#*$!x URL and the basic page is in its place).
Thanks so much for your help!
-heuristick
Thank you for your note. You are correct that the best way to prevent your query string URLs from being indexed is to use the following disallow line:
Disallow: /*?
We are aware that this type of disallow line cannot be accepted to our removal tool, and we are investigating this issue. Please be assured that although our tool will not accept these types of robots.txt files, our robots will follow these directions.
Regards,
The Google Team
--This clears up the issue of why the google submission tool rejected the robots.txt file created in the way they recommended. Looks like method #1 in my original post is indeed the correct way to do this, at least for google.
--As a follow up, I have had the robots.txt file in place since this post originated, and it has worked flawlessly with the Disallow: /*? option for googlebot. Hope this helps out someone else as much as it helped me...
-heuristick