Forum Moderators: goodroi
They all start with
User-agent: Googlebot
and then the next lines vary according to the source. All of these would seem to work?
Disallow: /*?
Disallow: /?
Disallow: /*?*
All of the above have been recommended. Google seems to like #1 and says to use that method in their support pages, but a test as of today (1/19/2005) had them displaying the error message "URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW /*? ". Maybe they have stopped allowing wildcards? Their own robots.txt file uses #2, and I've seen #3 suggested here and there. I'm afraid to try anything that might wipe out my index for 3 months without knowing what's worked for other people.
My real problem is that I have a page that uses incoming affiliate codes. So, something like index.asp?aff=1234. Google has indexed one of these and I want to get rid of it. The robots.txt file seemed like it might work, but I certainly don't want to disallow the root and have it fail to index the index.asp page. So, any suggestions anyone? Thanks in advance for any posts, and I know this has been asked a million times here, but nobody seems to finish the conversation with "this method worked for me..." ;-)
by the way, the robots.txt file I tried to register with google (that threw the error) was:
User-agent: Googlebot
Disallow: /*?
Thanks
-heuristick
In your suggestion, it would appear that the wildcard character isn't necessary, so the second option I listed (Disallow: /?) would work as well for blocking everything with a querystring from being indexed?
Your comment helps me out greatly for the particular situation I outlined, but I would also like to find an answer that works for all the pages with any querystrings. I found all sorts of advice scattered across the web, much of it conflicting, and much of it specific to a certain situation. Have you tried, tested, and succeeded with any of the three methods listed in my original post?
Again, thanks for your post. That more or less solves my particular situation, but for anyone else looking for a complete answer (which still includes me ;-), I'm still searching for a more definitive and universal answer.
Now I am looking for some input from experienced folks: what would be the fastest and most efficient way to get the new site structure into the SE indexes?
If a page is requested by a spider:
1. If '?' is in the URL, do a 301 on the page and point it to the new search-engine-friendly URL.
Or
2. If '?' is in the URL, place a meta tag:
<meta name="robots" content="noindex, follow">
Or
3. Use robots.txt to disallow re-crawling of the old indexed URLs with '?' in them.
Which way would be best to pass PR from the old pages to the new ones?
Page.cfm has PR of 3
Page.cfm?qurl=2 has PR of 2
What would be the best way to pass PR from Page.cfm?qurl=2 to Page.cfm/qurl/2.cfm? Both pages will have the same content.
Thanks for your input.
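As a sketch of option 1 above (the 301 redirect, which is generally the approach that passes link value from old URLs to new ones), here is a minimal Python function that computes a search-engine-friendly target for a query-string URL. The mapping from Page.cfm?qurl=2 to /Page.cfm/qurl/2.cfm mirrors the example in this post; the function name and the exact rewriting scheme are hypothetical, so adapt both to your own URL structure and serve the result with a 301 status.

```python
from urllib.parse import urlsplit, parse_qsl

def sef_redirect(url):
    """Return a search-engine-friendly 301 target for a '?' URL,
    or None if the URL has no query string and can be served as-is.

    Illustrative sketch: '?key=value' pairs become path segments,
    e.g. /Page.cfm?qurl=2 -> /Page.cfm/qurl/2.cfm.
    """
    parts = urlsplit(url)
    if not parts.query:
        return None  # already search-engine friendly
    # Flatten each (key, value) pair into consecutive path segments.
    segments = [s for pair in parse_qsl(parts.query) for s in pair]
    return parts.path + '/' + '/'.join(segments) + '.cfm'

print(sef_redirect('/Page.cfm?qurl=2'))  # /Page.cfm/qurl/2.cfm
print(sef_redirect('/Page.cfm'))         # None
```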
In your suggestion, it would appear that the wildcard character isn't necessary, so the second option I listed (Disallow: /?) would work as well for blocking everything with a querystring from being indexed?
No. A trailing wildcard is not necessary, because the Robots Exclusion Standard implicitly adds a wildcard to the end of each path.
But you still need a leading wildcard, so the general-purpose syntax to disallow any URL containing a '?' character is:
Disallow: /*?
The following one is equivalent (it uses an explicit trailing wildcard):
Disallow: /*?*
And the following one is simply wrong (some spider could interpret it as "disallow any index page containing a '?' character", but I'm not sure about it):
Disallow: /?
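One way to see the difference between the three directives is to implement Googlebot-style pattern matching yourself, since standard robots.txt parsers of that era treated patterns as literal path prefixes and did not expand '*'. The sketch below is an illustrative matcher (not Google's actual implementation): it treats '*' as "any sequence of characters", anchors the pattern at the start of the path, and leaves the end open, which gives the implicit trailing wildcard described above.

```python
import re

def is_disallowed(path, pattern):
    """Return True if `path` matches a Googlebot-style Disallow pattern.

    '*' matches any sequence of characters; the pattern is anchored at
    the start of the path and has an implicit trailing wildcard.
    Illustrative sketch only, not Google's actual matching code.
    """
    regex = '^' + '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.match(regex, path) is not None

# The affiliate URL from the original post:
url = '/index.asp?aff=1234'

print(is_disallowed(url, '/*?'))           # True  - leading wildcard matches any path
print(is_disallowed(url, '/*?*'))          # True  - equivalent, explicit trailing '*'
print(is_disallowed(url, '/?'))            # False - only matches paths starting with '/?'
print(is_disallowed('/index.asp', '/*?'))  # False - the plain page stays crawlable
```

Note how the last case shows the behavior the original poster wanted: index.asp?aff=1234 is blocked while index.asp itself remains crawlable.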
I found all sorts of advice scattered across the web, much of it conflicting, and much of it specific to a certain situation.
The original Robots Exclusion Standard is very vague and it leads to confusion.
Have you tried, tested, and succeeded with any of the three methods listed in my original post?
I have tried and used only the "Disallow: /*?" and it works flawlessly.
Also, do not pay too much attention to wildcard errors reported by Searchengineworld's robots.txt validator. The tool is a bit old and it does not support spider-specific (Googlebot) syntaxes.
It's odd, however, that Google uses the Disallow: /? style in their own robots.txt file. If you do a search for something that is blocked under this style (for instance, search google+mac) you'll get the www.google.com/mac page, while their robots.txt file has the line "Disallow: /mac?". So, it seems that the Disallow: /mac? directive works (no dynamic pages indexed) but still allows access to the /mac folder? Oh well, I've implemented the /*? method and it seems to be working (the last index knocked off the aff=#*$!x URL and the basic page is in its place).
Thanks so much for your help!
-heuristick
Thank you for your note. You are correct that the best way to prevent your query string URLs from being indexed is to use the following disallow line:
Disallow: /*?
We are aware that this type of disallow line cannot be accepted to our removal tool, and we are investigating this issue. Please be assured that although our tool will not accept these types of robots.txt files, our robots will follow these directions.
Regards,
The Google Team
--This clears up the issue of why the google submission tool rejected the robots.txt file created in the way they recommended. Looks like method #1 in my original post is indeed the correct way to do this, at least for google.
--As a follow up, I have had the robots.txt file in place since this post originated, and it has worked flawlessly with the Disallow: /*? option for googlebot. Hope this helps out someone else as much as it helped me...
-heuristick