In our HTML all ampersands are encoded as &amp;. However, on occasion (the exception rather than the rule) I can see that Googlebot is trying to request URLs with &amp;lang= or even &amp%3Blang= instead of translating this into &lang= as it should.
This then results in spurious URLs being requested and causes errors on the server.
We stopped this via robots.txt a few months ago like this (and it seemed to work fine):
Disallow: /*&amp;lang=
Disallow: /*&amp%3Blang=
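For anyone who wants to sanity-check rules like these offline, here is a minimal Python sketch of Google-style wildcard matching ('*' matches any run of characters, a trailing '$' anchors the end). The function names and the sample paths are made up for illustration, and it compares raw strings only; it does not try to reproduce whatever URL normalisation Googlebot or the GWT tester applies, which may be part of the discrepancy described here.

```python
import re
from typing import Iterable

def robots_pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match at the end of the path. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(path: str, patterns: Iterable[str]) -> bool:
    """Return True if any Disallow pattern matches the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)

# The two Disallow rules from the post.
RULES = ["/*&amp;lang=", "/*&amp%3Blang="]

if __name__ == "__main__":
    # Hypothetical paths, not taken from the real site.
    for path in ("/page.asp?id=1&amp;lang=en",
                 "/page.asp?id=1&amp%3Blang=en",
                 "/page.asp?id=1&lang=en"):
        print(path, "->", "blocked" if is_disallowed(path, RULES) else "allowed")
```

The correctly formed &lang= path should come out as allowed, and only the two mangled forms as blocked.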
On occasion we also see GWT report a URL as "blocked by robots.txt", although when I copy and paste that URL into the robots.txt test window in GWT, it is reported as Allowed.
However, in the last two weeks we have noticed in GWT crawling stats that the number of KB downloaded has dropped significantly although the number of pages crawled by googlebot seems to be averaging as before.
Any ideas?
When dealing with sites that have ampersands in their URLs, I have found that search engine crawling issues sometimes arise because a website linking to the page uses the opposite form of ampersand from the one the site was set up with.
Don't forget that .htaccess might be helpful in rewriting URL requests that aren't formatted the way you would prefer.
I'm not sure why the crawling is using fewer KB while visiting the same number of pages. My guess would be that the pages might have returned a 304 status code and/or your images were not downloaded. As long as your traffic from the search engines is doing fine I would not be worried.
I agree with you on ampersands; we are getting ready for a URL rewrite and came across this whilst doing a detailed analysis of the parameters in the site's URLs.
Unfortunately, no .htaccess (IIS6, no ISAPI, will have to use a bespoke db-based rewrite...)
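Whatever the rewrite ends up being built on, the clean-up step itself is small. A rough sketch in Python of the idea (the function name and sample URLs are hypothetical, and this is not tied to IIS or to any particular rewrite module): detect a literal &amp; or &amp%3B in the requested URL, swap it for a plain &, and use the "changed" flag to decide whether to answer with a 301 to the clean URL.

```python
import re

# Matches a leaked HTML-encoded ampersand in a URL: "&amp;" or "&amp%3B"
# (%3B is the percent-encoded semicolon). Case-insensitive, since the hex
# digits may arrive as %3b.
_BAD_AMP = re.compile(r"&amp(?:;|%3B)", re.IGNORECASE)

def normalize_url(url: str):
    """Return (clean_url, changed).

    clean_url has every leaked "&amp;" / "&amp%3B" replaced by "&".
    changed is True when a replacement happened, i.e. when the caller
    may want to redirect to clean_url instead of serving the request.
    """
    clean = _BAD_AMP.sub("&", url)
    return clean, clean != url

if __name__ == "__main__":
    # Hypothetical request URLs for illustration.
    for requested in ("/page.asp?id=1&amp;lang=en",
                      "/page.asp?id=1&amp%3Blang=en",
                      "/page.asp?id=1&lang=en"):
        print(requested, "->", normalize_url(requested))
```

Redirecting rather than blocking would let the mangled requests resolve to the real page, but that is a design choice to weigh against the existing robots.txt rules.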