A site has some URLs with session IDs indexed, and those need to be removed from the Google SERPs: they are Duplicate Content. The timescale is unimportant, so there's no need to get the removal tool out for this.
The site (hopefully) no longer gives session IDs out to anyone who is not logged in. So, no new URLs with sessions are being generated as far as anyone can tell.
Bots will indefinitely continue to request URLs that they already know about, including the rogue URLs that have session IDs in them.
Normally it would be simple to exclude these using robots.txt:
User-agent: Googlebot
Disallow: *s=
On this occasion, though, there is another parameter that ends in an "s" in many of the URLs that DO need to be indexed, so the bare *s= pattern would match (and block) those as well.
So, what is needed is:
Disallow: *&s=
This ensures that only the session URLs are excluded, because the ampersand must appear immediately before s= (see the sketch below).
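To illustrate, here is a minimal Python sketch of Googlebot-style wildcard matching. It is a simplified stand-in for Google's actual rule processing, and the sample URLs (with a hypothetical colors= parameter playing the legitimate parameter that ends in "s") are made up:

import re

def rule_matches(pattern, path):
    # Translate a Googlebot-style Disallow value into a regex:
    # '*' matches any run of characters; a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("*s=", "/shop?colors=red"))        # True  - wrongly blocked
print(rule_matches("*&s=", "/shop?colors=red"))       # False - left alone
print(rule_matches("*&s=", "/shop?page=2&s=8f3e2c"))  # True  - session URL blocked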
The Question...
The problem is this. In the Valid HTML 4.01 source code, the URLs have the ampersand encoded as &amp;amp; each time, as it should be.
So, which format does the robots.txt file actually need?
Disallow: *&amp;amp;s=
OR
Disallow: *&s=
and why?
You should use [b]&[/b], not &amp;amp;. The entity reference &amp;amp; is defined by the DTD for HTML 4.01 and is recognized by default in XML. But robots.txt is neither HTML nor XML; it is a plain text file, so the entity reference is undefined there. The entity reference is required in the HTML document because & is reserved as the indicator of an entity reference, but the URI itself contains &, not &amp;amp;. In a plain text file, & has no such special meaning.
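To see the distinction in action, here is a short Python sketch (the href value is a made-up example):

import html

# The link as written in valid HTML 4.01 source, with the ampersand escaped:
href_in_source = "/page?cat=5&amp;amp;s=8f3e2c"

# What a browser or crawler actually requests once the HTML parser decodes it:
print(html.unescape(href_in_source))  # /page?cat=5&s=8f3e2c

Since robots.txt matches against the decoded URI, Disallow: *&s= is the rule that works; Disallow: *&amp;amp;s= would never match anything.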