Matching URL with encoded ampersand against robots.txt

Does anybody know whether Google, once it picks up URLs from a page's HTML and puts them on its to-do list, decodes them properly, i.e. replaces &amp; with & before comparing them to robots.txt (and whether anything has changed in this area recently)?

In our HTML all ampersands are encoded as &amp;. However, on occasion (the exception rather than the rule) I can see that Googlebot is trying to request URLs containing &amp;lang= or even &amp%3Blang= instead of translating this into &lang= as it should.
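To illustrate how the two bad forms could arise, here is a minimal Python sketch. The sample href and the variable names are made up for illustration, and this only shows one plausible route to each string, not what Googlebot actually does internally.

```python
import html
from urllib.parse import quote

# Hypothetical href exactly as it appears in the HTML source.
raw_href = "/page?id=1&amp;lang=en"

# Correct handling: decode HTML entities before using the URL.
decoded = html.unescape(raw_href)

# Faulty handling 1: the entity is never decoded, so "&amp;" is requested literally.
undecoded = raw_href

# Faulty handling 2: the undecoded URL is also percent-encoded, turning ";" into "%3B".
undecoded_escaped = quote(raw_href, safe="/?=&")

print(decoded)            # /page?id=1&lang=en
print(undecoded)          # /page?id=1&amp;lang=en
print(undecoded_escaped)  # /page?id=1&amp%3Blang=en
```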

This then results in spurious URLs being requested and causes errors on the server.

We stopped this via robots.txt a few months ago like this (and it seemed to work fine):

Disallow: /*&amp;lang=
Disallow: /*&amp%3Blang=
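For anyone wanting to check these patterns offline, below is a rough Python sketch of wildcard matching as it is publicly documented for robots.txt (* matches any run of characters, a trailing $ anchors the end, and rules match from the start of the path). The function name, rule list and sample paths are hypothetical, and the first rule is assumed to target the literal &amp;lang= string. It is an approximation, not Google's actual matcher.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Approximate robots.txt wildcard matching for a single rule."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" in place of each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a prefix rule.
    return re.match(regex, path, flags=re.DOTALL) is not None

# Assumed rules: both target the undecoded forms only.
disallow_rules = ["/*&amp;lang=", "/*&amp%3Blang="]

def is_blocked(path: str) -> bool:
    return any(rule_matches(rule, path) for rule in disallow_rules)

for sample in ("/page?id=1&lang=en",
               "/page?id=1&amp;lang=en",
               "/page?id=1&amp%3Blang=en"):
    print(sample, "blocked" if is_blocked(sample) else "allowed")
```

Under these assumptions the correctly decoded URL stays allowed and only the two spurious forms are blocked, so a URL reported as blocked in one place and allowed in another may have been decoded differently in each.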

On occasion we also see that GWT reports a URL as "blocked by robots.txt", although when I copy and paste that URL into the robots.txt test window in GWT, it is reported as Allowed.

However, in the last two weeks we have noticed in the GWT crawl stats that the number of KB downloaded has dropped significantly, although the number of pages crawled by Googlebot seems to be averaging the same as before.

Any ideas?

aakk9999
