
Apache Web Server Forum

    
PHP errors ONLY on Baidu requests - how come?
1script
msg:4532260
12:17 am on Jan 1, 2013 (gmt 0)

Hello all, looking to pick someone's brains on this issue that leaves me perplexed:

A site's error log is filled with lines like this:
PHP Warning: Cannot modify header information - headers already sent ...
Nearly all of the IPs are Baidu's, in the 180.76.0.0 - 180.76.255.255 range. Not quite all: a few per day (out of several thousand) are random IPs that generate the same error. But the vast majority are Baidu spider visits. In the logic of the scripts running on the site, this error means that the request is missing a required field in the query.
The strangest thing is this: I see a new error log line, cross-check the IP and timestamp in the access_log, and immediately open that URL myself, and I see no error, neither in the browser output nor in the error_log. The same URLs are routinely visited by Googlebot and other bots as well as regular visitors, and they generate no error.
The script only accepts GET queries, and they are translated from the URL itself using .htaccess: /field1value/field2value.html becomes script.php?field1=field1value&field2=field2value
I have no logic in there that processes IPs, so everyone should see the same output.

In other words, you cannot have a proper URL (which Baidu has - I see it in the access_log with the same timestamp/IP as the error line from error_log) and not have all the required GET fields in the query.
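
To make the URL-to-query translation concrete, here is a minimal sketch in Python (not Apache configuration). The thread does not show the actual .htaccess rule, so the pattern below is an assumption: two non-empty path segments followed by .html. Only the names script.php, field1 and field2 come from the post; rewrite() and PRETTY_URL are hypothetical helpers.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Assumed pattern: the real RewriteRule is not shown in the thread.
# Two non-empty path segments, the second one ending in ".html".
PRETTY_URL = re.compile(r"^/([^/]+)/([^/]+)\.html$")


def rewrite(path: str) -> Optional[str]:
    """Map a pretty path to the internal script URL, or None if it does not match."""
    match = PRETTY_URL.match(path)
    if match is None:
        return None
    # Build the query string the script would receive after the rewrite.
    query = urlencode({"field1": match.group(1), "field2": match.group(2)})
    return "/script.php?" + query
```

Under this assumed pattern, a path either yields both fields or is not rewritten at all, which is the property the post relies on. Feeding the exact request lines from the access_log through a copy of the real pattern would be one way to check whether that property actually holds for the Baidu requests.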

So, how does Baidu manage to get an error when everybody else visiting the same URL does not? Can they add something to the query that does not make it into the access logs? Can they append something (say, a string in the Chinese Traditional charset, Big5) that Apache does not include in the access logs and yet processes in .htaccess?

I hope this explanation makes sense to you guys. I would appreciate any comment or a tip on how to troubleshoot this. I don't want to just wholesale-ban Baidu only to fix PHP errors...
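
The manual cross-check described above (matching an error_log line to the access_log by IP and timestamp) could be automated along these lines. This is a sketch under assumptions: the thread does not show the log formats, so the patterns assume Apache 2.2-style error_log lines ("[Tue Jan 01 00:17:03 2013] [error] [client 1.2.3.4] message") and common/combined access_log lines, with both logs written in the same local time. All function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed log formats (not shown in the thread); adjust to the server's
# actual ErrorLog and LogFormat settings.
ERROR_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[error\] \[client (?P<ip>[^\]]+)\] (?P<msg>.*)$"
)
ACCESS_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\] ]+) [^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3})'
)


def parse_error_line(line):
    """Return (ip, timestamp, message) or None if the line does not match."""
    m = ERROR_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return m.group("ip"), when, m.group("msg")


def parse_access_line(line):
    """Return (ip, timestamp, request line, status) or None if no match."""
    m = ACCESS_RE.match(line.strip())
    if m is None:
        return None
    # The UTC offset is ignored: both logs are assumed to use local time.
    when = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S")
    return m.group("ip"), when, m.group("request"), int(m.group("status"))


def correlate(error_lines, access_lines):
    """Pair each error with access entries sharing its IP and second."""
    by_key = defaultdict(list)
    for line in access_lines:
        parsed = parse_access_line(line)
        if parsed:
            ip, when, request, status = parsed
            by_key[(ip, when)].append((request, status))
    matches = []
    for line in error_lines:
        parsed = parse_error_line(line)
        if parsed:
            ip, when, msg = parsed
            for request, status in by_key.get((ip, when), []):
                matches.append((ip, when, request, status, msg))
    return matches
```

Calling correlate(open("error_log"), open("access_log")) (file names assumed) returns one tuple per matched pair. Listing the exact request lines and status codes side by side may show whether the failing requests differ in some small way, such as the HTTP method or a stray character, from the ones that succeed.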

 

lucy24
msg:4532278
2:18 am on Jan 1, 2013 (gmt 0)

1script wrote: "I don't want to just wholesale-ban Baidu only to fix PHP errors..."

Seems like a perfectly acceptable reason to me. Lots of folks wholesale-ban Baidu just because they're, uhm, Baidu. They're somewhere on the continuum between "I don't like your face" and "Shoot to kill".

See ongoing thread next door in SSID: it's peppered with phrases like "I feel your pain" and "I spoke too soon" that always seem to come up once Baidu gets involved...

1script
msg:4532301
6:10 am on Jan 1, 2013 (gmt 0)

lucy24, thank you for your input and Happy 2013! Although I do understand where you're coming from, the site is about electronics, and you really shoot yourself in the foot if you block access to ALL Chinese users, given that EVERYTHING in this particular industry is made in China. It may come as a surprise to people in other niches, but at least on these subjects I do have legitimate user-generated content from visitors in China (shocking, I know). So a wholesale ban of Baidu is not a good idea, at least in my specific case.
I guess my biggest problem is that I don't understand how what I observe is even technically possible: what's so special about the Baidu bot that its request for a URL can differ from the request other agents send for the same URL? Perhaps if I could see the raw HTTP requests and responses, the way the Live HTTP Headers FF addon lets you see them on the client side, I could figure it out, but I am not sure how to do that from the server side.
Does anyone have an idea about watching the low-level HTTP exchange from the server side?
Thanks!

wilderness
msg:4532306
7:33 am on Jan 1, 2013 (gmt 0)

1script wrote: "Perhaps if I could see the raw HTTP requests and responses the way Live HTTP Headers FF addon lets you see on the client side..."


There is a script (floating around here somewhere) that checks the request headers and then writes the header information to a log.

Good luck finding it.
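
This is not the script mentioned above, which the thread does not reproduce. It is only a rough sketch, in Python rather than PHP, of what a header-logging script of that kind might do. It is a stand-alone test listener, not something that plugs into the live site: the log file name, the port, and the names format_dump, DumpHandler and serve are all assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_dump(client_ip, request_line, headers):
    """Render one request as a plain-text block: client, request line, headers."""
    lines = ["--- " + client_ip, request_line]
    lines += [f"{name}: {value}" for name, value in headers]
    return "\n".join(lines) + "\n"


class DumpHandler(BaseHTTPRequestHandler):
    log_path = "request_dump.log"  # hypothetical file name

    def do_GET(self):
        # Append the request line and every header exactly as parsed.
        dump = format_dump(
            self.client_address[0], self.requestline, self.headers.items()
        )
        with open(self.log_path, "a", encoding="utf-8") as fh:
            fh.write(dump)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"logged\n")


def serve(host="127.0.0.1", port=8080):
    """Run the listener until interrupted (host and port are assumptions)."""
    HTTPServer((host, port), DumpHandler).serve_forever()
```

Running serve() and requesting a URL from it appends one block per request to the log, with the request line first and one header per line. On the real site the same idea would have to live inside the PHP script or in Apache's own logging, so that it sees the requests Baidu actually sends.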

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved