|PHP errors ONLY on Baidu requests - how come? |
Hello all, looking to pick someone's brains on this issue that leaves me perplexed:
A site's error log is filled with these:
PHP Warning: Cannot modify header information - headers already sent ...
and all IPs are Baidu's, in the 18.104.22.168 - 22.214.171.124 range. OK, not all: a few per day (out of several thousand) come from random IPs that generate the same error. But the vast majority are Baidu spider visits. In the logic of the scripts running on the site, this error means that the request is missing a required field in the query.
The strangest thing, though, is this: I see a new error log line, cross-check the IP and timestamp in the access_log, and immediately open that URL, and I see no error, neither in the browser output nor in the error_log. The same URLs are routinely visited by Googlebot and other bots as well as regular visitors, and produce no error.
The script only accepts GET queries, and .htaccess rewrites them from the URL itself: /field1value/field2value.html becomes script.php?field1=field1value&field2=field2value
I have no logic in there that processes IPs - everyone should see the same output.
In other words, you cannot have a proper URL (which Baidu has - I see it in the access_log with the same timestamp/IP as the error line from error_log) and not have all the required GET fields in the query.
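One way to test that assumption is to have the script record exactly what it received whenever a required field is missing. Below is a minimal sketch, assuming PHP 7+ and the two field names from the rewrite example above; `missing_fields` and `describe_request` are hypothetical helper names, not part of the existing script. Under mod_rewrite, `REQUEST_URI` is normally the URL as requested and `QUERY_STRING` the query after rewriting, so logging both should show where a field gets lost.

```php
<?php
// Sketch only (PHP 7+). Field names follow the rewrite example in the post;
// both function names are hypothetical.

// Return the names of required fields that are absent or empty in $query.
function missing_fields(array $query, array $required): array
{
    $missing = [];
    foreach ($required as $name) {
        if (!isset($query[$name]) || $query[$name] === '') {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Build one log line describing the request as PHP received it.
function describe_request(array $server, array $missing): string
{
    // Escape control and non-ASCII bytes as octal so they show up in the log.
    $visible = function ($value) {
        return addcslashes((string) $value, "\0..\37\177..\377");
    };
    return sprintf(
        'missing=[%s] ip=%s uri=%s query=%s',
        implode(',', $missing),
        $server['REMOTE_ADDR'] ?? '-',
        $visible($server['REQUEST_URI'] ?? '-'),
        $visible($server['QUERY_STRING'] ?? '')
    );
}

// Intended use near the top of script.php, before any output:
// $missing = missing_fields($_GET, ['field1', 'field2']);
// if ($missing) {
//     error_log(describe_request($_SERVER, $missing));
// }
```

The resulting lines land in the same error_log as the warning, so they can be matched against the access_log by IP and timestamp.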
So, how does Baidu manage to get an error when everybody else visiting the same URL does not? Can they add something to the query that does not make it into the access logs? Can they append something (say, a string in the Chinese Traditional charset, Big5) that Apache does not include in the access logs and yet processes in .htaccess?
I hope this explanation makes sense to you guys. I would appreciate any comment or a tip on how to troubleshoot this. I don't want to just wholesale-ban Baidu only to fix PHP errors...
|I don't want to just wholesale-ban Baidu only to fix PHP errors... |
Seems like a perfectly acceptable reason to me. Lots of folks wholesale-ban Baidu just because they're, uhm, Baidu. They're somewhere on the continuum between "I don't like your face" and "Shoot to kill".
See ongoing thread next door in SSID: it's peppered with phrases like "I feel your pain" and "I spoke too soon" that always seem to come up once Baidu gets involved...
lucy24, thank you for your input and Happy 2013! Although I do understand where you're coming from, the site is about electronics, and you would really shoot yourself in the foot by blocking access to ALL Chinese users, given that EVERYTHING in this particular industry is made in China. It may come as a surprise to people in different niches, but at least on these subjects I do have legitimate user-generated content from visitors in China (shocking, I know). So a wholesale ban of Baidu is not a good idea, at least in my specific case.
I guess my biggest problem is that I don't understand how what I observe is even technically possible: what's so special about the Baidu bot that its request for a URL differs from what other agents send? Perhaps if I could see the raw HTTP requests and responses the way the Live HTTP Headers FF addon lets you see them on the client side, I could figure it out, but I am not sure how to do that from the server side.
Does anyone have an idea about watching the low-level HTTP exchange from the server side?
|Perhaps if I could see the raw HTTP requests and responses the way Live HTTP Headers FF addon lets you see on the client side, |
There is a script (floating around here somewhere) that checks headers and then sends the headers information to a log.
Good luck finding it.
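For what it's worth, a script like that can be quite short. Here is a minimal sketch, assuming PHP 7+ running under Apache; `request_snapshot` and the log path are made-up names for illustration, and this is not the script mentioned above. It rebuilds the request line and headers from `$_SERVER`, so it is an approximation of what the client sent rather than the raw bytes on the wire.

```php
<?php
// Sketch only (PHP 7+). Rebuilds a readable view of the request from the
// entries PHP exposes in $_SERVER; the function name is hypothetical.
function request_snapshot(array $server): string
{
    $lines = [sprintf(
        '%s %s %s',
        $server['REQUEST_METHOD'] ?? '-',
        $server['REQUEST_URI'] ?? '-',
        $server['SERVER_PROTOCOL'] ?? '-'
    )];
    foreach ($server as $key => $value) {
        // Request headers appear as HTTP_* keys, e.g. HTTP_ACCEPT_LANGUAGE.
        if (strpos((string) $key, 'HTTP_') === 0) {
            $words = ucwords(strtolower(str_replace('_', ' ', substr($key, 5))));
            $lines[] = str_replace(' ', '-', $words) . ': ' . $value;
        }
    }
    return implode("\n", $lines);
}

// Intended use at the top of script.php (the path is a placeholder):
// file_put_contents(
//     '/path/to/request.log',
//     date('c') . ' ' . ($_SERVER['REMOTE_ADDR'] ?? '-') . "\n"
//         . request_snapshot($_SERVER) . "\n\n",
//     FILE_APPEND | LOCK_EX
// );
```

Comparing a Baidu entry with a browser entry for the same URL should show whether the two requests really are identical by the time they reach PHP.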