Hello all, looking to pick someone's brains on this issue that leaves me perplexed:
A site's error log is filled with lines like this:
PHP Warning: Cannot modify header information - headers already sent ...
and all the IPs are Baidu's, in the 180.76.0.0 - 180.76.255.255 range. OK, not all - a few per day (out of several thousand) come from other, seemingly random IPs that trigger the same error. But the vast majority are Baidu spider visits. In the logic of the scripts running on the site, this error means that the request is missing a required field in the query.
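For anyone who wants to check the split on their own logs, here is a rough sketch of how the tally could be done. It assumes the error log carries a "[client IP]" or "[client IP:port]" token on each line and only handles IPv4; the sample lines are made up for illustration and are not copied from my log.

```python
import ipaddress
import re
from collections import Counter

# 180.76.0.0 - 180.76.255.255 is the same range as 180.76.0.0/16.
BAIDU_NET = ipaddress.ip_network("180.76.0.0/16")

# Assumed log token: "[client 1.2.3.4]" or "[client 1.2.3.4:56789]" (IPv4 only).
CLIENT_RE = re.compile(r"\[client (\d{1,3}(?:\.\d{1,3}){3})(?::\d+)?\]")


def tally_by_source(lines):
    """Count error-log lines by whether the client IP is inside BAIDU_NET."""
    counts = Counter()
    for line in lines:
        match = CLIENT_RE.search(line)
        if not match:
            counts["no_client"] += 1
            continue
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            # Looked like an IP but was not valid (e.g. an octet above 255).
            counts["no_client"] += 1
            continue
        counts["baidu" if ip in BAIDU_NET else "other"] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    "[error] [client 180.76.5.23] PHP Warning: Cannot modify header information - headers already sent ...",
    "[error] [client 180.76.200.1:51234] PHP Warning: Cannot modify header information - headers already sent ...",
    "[error] [client 203.0.113.9] PHP Warning: Cannot modify header information - headers already sent ...",
]
print(tally_by_source(sample))
```

Feeding it the real file would just mean passing the open file object instead of the sample list.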
The strangest thing, though, is this: I see a new error log line, cross-check the IP and timestamp in the access_log, and immediately open that URL myself - and I see no error, either in the browser output or in the error_log. The same URLs are routinely visited by Googlebot and other bots, as well as by regular visitors, and produce no error.
The script only accepts GET queries, and they are translated from the URL itself using .htaccess:
/field1value/field2value.html becomes
script.php?field1=field1value&field2=field2value
I have no logic in there that processes IPs - everyone should see the same output.
In other words, you cannot have a proper URL (which Baidu has - I see it in the access_log with the same timestamp/IP as the error line from error_log) and not have all the required GET fields in the query.
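To make that reasoning concrete, here is a minimal sketch of the mapping as I understand it. The regex is a hypothetical stand-in for my rewrite pattern (the real .htaccess rule may differ), and the function names are made up for this illustration; it only models the URL-to-query translation, not Apache itself.

```python
import re
from urllib.parse import parse_qs

# Hypothetical stand-in for the rewrite pattern; the real rule may differ.
REWRITE_RE = re.compile(r"^/([^/]+)/([^/]+)\.html$")
REQUIRED = ("field1", "field2")


def rewrite(path):
    """Return the query string the rule would build, or None if the path does not match."""
    match = REWRITE_RE.match(path)
    if match is None:
        return None
    return "field1={}&field2={}".format(match.group(1), match.group(2))


def missing_fields(path):
    """List the required fields that would be absent or empty after the rewrite."""
    query = rewrite(path)
    if query is None:
        # The path never reaches the script through this rule.
        return list(REQUIRED)
    params = parse_qs(query)  # blank values are dropped by default
    return [name for name in REQUIRED if name not in params]


print(missing_fields("/field1value/field2value.html"))
print(missing_fields("/field1value/.html"))
```

Under this model, any path that matches the pattern always yields both fields, which is exactly why the errors puzzle me.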
So, how does Baidu manage to get an error when everybody else visiting the same URL does not? Can they add something to the query that does not make it into the access logs? Can they append something (say, a string in the Traditional Chinese charset, Big5) that Apache leaves out of the access logs and yet processes in .htaccess?
I hope this explanation makes sense to you guys. I would appreciate any comments or tips on how to troubleshoot this. I don't want to wholesale-ban Baidu just to get rid of PHP errors...