Baidu UA

         

wilderness

5:10 pm on Dec 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr has an inactive thread from June 2014 [webmasterworld.com].

Note the underscore added to the word "spider":

123.125.71.109 - - [08/Dec/2014:07:04:36 -0700] "GET /MyFolder/ HTTP/1.1" 301 524 "-" "Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spi_der.html)"

lucy24

9:56 pm on Dec 9, 2014 (gmt 0)

Idle query: Once the UA string contains the element www.baidu.com-- or merely "baidu"-- is there any possibility that it will be something other than a Baidu-related spider?

Pro tip: Use the [ code ] markup inline to prevent unwanted smileys ;)

Speaking of which, isn't the form
;)
already part of your personal "wonky punctuation" lockout arsenal? It could be
; \w\w-\w\w;\)

or
\b\w\w-\w\w;\)

if you want to narrow it down to dubious language strings.
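
For anyone who wants to try those two patterns before committing them to a config file, here is a minimal Python sketch. It assumes Python's `re` module behaves closely enough to the server's regex engine for expressions this simple, and it escapes the closing parenthesis in both patterns, which Python requires. The sample UA strings are the two quoted in this thread; the names `WITH_SPACE`, `WORD_BOUNDARY` and `matches` are made up for illustration.

```python
import re

# Variant 1: "; xx-xx;)" with a literal space after the first semicolon.
WITH_SPACE = re.compile(r"; \w\w-\w\w;\)")
# Variant 2: a word boundary instead of the leading "; ".
WORD_BOUNDARY = re.compile(r"\b\w\w-\w\w;\)")

# UA strings copied from this thread.
BAIDU_UA = (
    "Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 "
    "(KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 "
    "(compatible; +http://www.baidu.com/search/spi_der.html)"
)
ARCHIVER_UA = "Mozilla/4.0 (compatible;)"


def matches(pattern, ua):
    """Return True if the compiled pattern occurs anywhere in the UA string."""
    return pattern.search(ua) is not None


if __name__ == "__main__":
    for name, pattern in (("with space", WITH_SPACE), ("word boundary", WORD_BOUNDARY)):
        print(name, matches(pattern, BAIDU_UA), matches(pattern, ARCHIVER_UA))
```

Worth noting: the UA in the first post has no space after the semicolons (`2.3.7;zh-cn;)`), so only the `\b` variant catches it. Neither pattern touches the plain `(compatible;)` archiver string, since there is no language code in it.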

:: detour for quick check of raw logs ::

Huh.
Mozilla/4.0 (compatible;)
Forgot about that one. It's some kind of archiver.* But everything else with
;)
is a lockout of one kind or another. There's also
SV1;)
and one or two other patterns associated with MSIE 6.


* Which reminds me that I should poke a comprehensive hole for Keewaytinook Okimakanak** to go with the Northwestel/Qiniq holes.
** As illustrated by the fact that I can type the name "cold".

wilderness

9:33 pm on Dec 10, 2014 (gmt 0)

lucy,
I really only posted this because "spider" ("crawler" is likely the next most common) is one of the dozen or so commonly abused UA keywords, and they chose to add a variation.

lucy24

9:51 pm on Dec 10, 2014 (gmt 0)

spi_?der


But if nobody but baidu is using it, it probably isn't worth adding to your code-- especially in htaccess, where the RegEx has to be re-parsed on every request.
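
If you do the filtering outside htaccess, say in a log-analysis script, the re-parsing cost goes away because the pattern can be compiled once. A rough Python sketch of the optional-underscore pattern follows; the function name `mentions_spider` and the case-insensitive flag are assumptions of this example, not anything from the thread.

```python
import re

# Compiled once at import time; the optional underscore covers both spellings.
# re.IGNORECASE is an assumption made for this sketch.
SPIDER = re.compile(r"spi_?der", re.IGNORECASE)


def mentions_spider(ua):
    """Return True if the UA string contains "spider" or "spi_der"."""
    return SPIDER.search(ua) is not None


if __name__ == "__main__":
    print(mentions_spider("(compatible; +http://www.baidu.com/search/spi_der.html)"))
    print(mentions_spider("Mozilla/4.0 (compatible;)"))
```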

:: quick detour to raw logs ::

Nope, they've never tried it on me.

:: further detour to look up ::

Huh. If source can be believed, it really is a Baidu spider.
baiduspider-123-125-71-109.crawl.baidu.com 
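
A reverse lookup on its own can be faked by whoever controls the PTR record, so the usual extra step is to resolve the returned hostname forward again and check that it points back to the same IP. Here is a minimal sketch of that check in Python. The `.crawl.baidu.com` suffix is an assumption taken only from the hostname above, and `is_baidu_crawler` plus its injectable `reverse`/`forward` parameters are hypothetical names chosen so the logic can be tried without touching real DNS.

```python
import socket


def is_baidu_crawler(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include the IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Assumed suffix, based only on the hostname seen in this thread.
    if not hostname.lower().endswith(".crawl.baidu.com"):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # Performs live DNS lookups; the result depends on current DNS records.
    print(is_baidu_crawler("123.125.71.109"))
```

The suffix test uses `endswith` rather than a substring search so that a name like `x.crawl.baidu.com.example.net` does not slip through.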

Why would they lie in their own UA? (The "spi_der" element occurs in the http:// part of the string.) Linking to a nonexistent page (I checked) will hardly inspire confidence. Is it possible the request is logged, and they do it to get information about which websites actually look at their own access logs?

:: further detour to Live Headers to see if there are any intermediate steps between original URL and Search Error (not 404) page ::

Nope, just a straight 302 redirect. Some further cursory experimentation suggests that this is their generic handling of all 404s at baidu.com. Or, at least, the ones originating from my IP ;)

keyplyr

10:05 am on Dec 31, 2014 (gmt 0)

Yeah, those guys are a real piece of work.