Use & or &amp; in robots.txt

Which to use?

         

g1smd

1:14 am on Sep 18, 2007 (gmt 0)




The Problem...

A site has some URLs with session IDs indexed and those need to be removed from Google SERPs. They are Duplicate Content. Timescale unimportant - no need to get the removal tool out for this.

The site (hopefully) no longer gives session IDs out to anyone who is not logged in. So, no new URLs with sessions are being generated as far as anyone can tell.

Bots will indefinitely continue to request URLs that they already know about, including the rogue URLs that have session IDs in them.

Normally it would be simple to exclude these using robots.txt:

User-agent: Googlebot
Disallow: *s=

On this occasion there is another parameter that ends in an "s" in many of the URLs that DO need to be indexed.

So, what is needed is to:

Disallow: *&s=

This will ensure that only URLs including sessions are excluded.
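To illustrate the difference between the two rules, here is a minimal Python sketch of wildcard matching as Googlebot is generally understood to apply it ("*" matches any run of characters, a trailing "$" anchors the end of the URL). The function name rule_matches and both example URLs are assumptions made up for this illustration; they are not from the site in question, and this is not Google's actual implementation.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Approximate a Disallow pattern match: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where the pattern had '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Patterns are matched from the start of the path plus query string.
    return re.match(regex, url_path) is not None


# Hypothetical URLs: one that should stay indexed, one carrying a session ID.
wanted_url = "/list.php?cat=5&items=20"
session_url = "/list.php?cat=5&s=abc123"

for rule in ("*s=", "*&s="):
    print(rule, rule_matches(rule, wanted_url), rule_matches(rule, session_url))
```

Under these assumptions, "*s=" catches both URLs because "items=" also ends in "s=", while "*&s=" catches only the session URL. Note that "*&s=" would miss a URL where the session ID is the first parameter (for example "/list.php?s=abc123"), so that case would need its own rule if it can occur.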

.

The Question...

The problem is this. In the Valid HTML 4.01 source code, the URLs have the ampersand encoded as &amp; each time, as it should be.

So, which format does the robots.txt file actually need?

Disallow: *&s=
OR
Disallow: *&amp;s=

and why?

encyclo

1:30 am on Sep 18, 2007 (gmt 0)




Important note: the following opinion is a guess. So don't sue me if I'm wrong. ;)

You should use &, not &amp;. The entity reference is defined via the DTD for HTML 4.01, and recognized by default in XML. But robots.txt is neither HTML nor XML; it is a plain text file, so the entity reference is undefined there. The entity reference is required in the HTML document because & is reserved as the indicator of an entity reference, but the URI itself contains & and not &amp;. In a plain text file, & has no such meaning.
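A small Python sketch of that reasoning, assuming the crawler decodes HTML entities when it extracts a link and then compares robots.txt rules against the decoded URL as literal text. The href value below is a made-up example, and the substring checks stand in for the "*&s=" and "*&amp;s=" rules; this is an illustration of the guess above, not a statement of how any particular bot is implemented.

```python
from html import unescape

# The link as it is written in the HTML source (hypothetical example).
href_in_html = "/list.php?cat=5&amp;s=abc123"

# An HTML parser decodes the entity, so this is the URL that actually gets requested.
requested_url = unescape(href_in_html)
print(requested_url)

# robots.txt is plain text, so a rule is compared character for character.
print("&s=" in requested_url)      # the literal ampersand is present in the URL
print("&amp;s=" in requested_url)  # the entity text never appears in the URL
```

If the assumption holds, a rule written with &amp; would never match anything, because the five characters "&amp;" do not occur in the requested URL.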

g1smd

1:53 pm on Sep 21, 2007 (gmt 0)




That is what I would assume, but...