Does anybody know if Google, once it picks up URLs from the HTML of a page and puts them in its to-do list, decodes them properly, i.e. replaces &amp; with & before comparing them against robots.txt (and whether anything has changed in this area recently)?
In our HTML all ampersands are encoded as &amp;; however, on occasion (the exception rather than the rule) I can see that Googlebot tries to request the URL with a literal &amp;lang= or even &amp;%3Blang= instead of translating this into &lang=, as it should.
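To illustrate the decoding step in question: the link below is a hypothetical example of how an ampersand should appear in HTML source, and `html.unescape` shows the form a well-behaved crawler should actually request (the path and parameter names are made up for illustration):

```python
import html

# A link as it should appear in HTML source, with the ampersand
# entity-encoded (hypothetical URL for illustration):
href = "/page?id=1&amp;lang=en"

# What a crawler should request after decoding the entity:
decoded = html.unescape(href)
print(decoded)  # /page?id=1&lang=en
```

When the entity is not decoded, the crawler ends up requesting a URL containing the literal text `&amp;lang=`, which is exactly the spurious form described above.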
This then results in spurious URLs being requested and causes errors on the server.
We stopped this via robots.txt a few months ago like this (and it seemed to work fine):
Disallow: /*&amp;lang=
Disallow: /*&amp;%3Blang=
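A quick way to sanity-check which URLs those wildcard rules catch is a small matcher sketch. This is my own approximation of Google-style path matching (Googlebot treats `*` in robots.txt paths as a wildcard), with hypothetical example URLs:

```python
import re

def rule_to_regex(rule):
    # Approximate Google-style robots.txt matching: '*' is a wildcard,
    # everything else is literal, and rules are anchored at the path start.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile("^" + pattern)

rules = ["/*&amp;lang=", "/*&amp;%3Blang="]
regexes = [rule_to_regex(r) for r in rules]

def blocked(url_path):
    return any(rx.match(url_path) for rx in regexes)

print(blocked("/page?id=1&amp;lang=en"))     # True  - undecoded entity form
print(blocked("/page?id=1&amp;%3Blang=en"))  # True  - partially re-encoded form
print(blocked("/page?id=1&lang=en"))         # False - properly decoded URL
```

The properly decoded URL stays crawlable; only the spurious forms are blocked, which is the intent of the two Disallow lines.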
On occasion we also see that GWT reports a URL as "blocked by robots.txt", although when I copy and paste the same URL into the robots.txt test tool in GWT, it reports the URL as Allowed.
However, in the last two weeks we have noticed in the GWT crawl stats that the number of kilobytes downloaded per day has dropped significantly, although the number of pages crawled by Googlebot seems to be averaging the same as before.
As a rule I try, whenever possible, to avoid ampersands in URLs, and I especially avoid multiple ampersands. Ampersands make things more difficult for search engine bots, and the last thing you want is to make the bots' job harder.
When dealing with sites that have ampersands, I have found that search engine crawling issues sometimes arise because a website linking to the ampersand page uses the opposite form (raw & versus the encoded entity) from the one the site was set up with.
Don't forget that .htaccess can be helpful for rewriting URL requests that aren't formatted the way you would prefer.
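As a hypothetical sketch of that approach (the parameter name `lang` and the exact patterns are assumptions; this assumes mod_rewrite is enabled and that the literal text `&amp;` survives into the query string), you could 301-redirect the malformed requests back to the decoded URL:

```apache
# Sketch only: if the query string contains a literal "&amp;lang=",
# redirect to the same path with a properly decoded "&lang=".
RewriteEngine On
RewriteCond %{QUERY_STRING} ^(.*)&amp;lang=(.*)$
RewriteRule ^(.*)$ /$1?%1&lang=%2 [R=301,L]
```

A permanent redirect like this also lets the bots consolidate the spurious URLs onto the canonical ones instead of just blocking them.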
I'm not sure why the crawl is using fewer KB while visiting the same number of pages. My guess would be that the pages might have returned a 304 status code and/or your images were not downloaded. As long as your traffic from the search engines is doing fine, I would not be worried.
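To see why 304s would explain that pattern: a conditional GET that is answered with 304 Not Modified carries no body, so the crawler still counts a page visited but downloads almost zero bytes. A minimal self-contained demonstration with a toy local server (my own sketch, not how Googlebot actually works):

```python
import http.server
import threading
import urllib.error
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if "If-Modified-Since" in self.headers:
            # Pretend the page is unchanged: 304 must not carry a body.
            self.send_response(304)
            self.end_headers()
        else:
            body = b"<html>full page</html>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d/" % server.server_address[1]

# Unconditional fetch: the full body is downloaded.
print(len(urllib.request.urlopen(base).read()))  # 22 bytes

# Conditional fetch: the server answers 304 with an empty body.
req = urllib.request.Request(
    base, headers={"If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT"}
)
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code, len(e.read()))  # 304 0  (urllib raises HTTPError for 304)
```

A crawl log full of 304s would show steady pages-crawled numbers with sharply reduced kilobytes downloaded, matching what GWT is reporting.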