There isn't any bad code (W3C validated) and I use nothing remotely like this anywhere on the site. This has been going on for a while now. The logs show one or two regular GETs, then one or two of these, usually from the same IP but not always, with various browsers, operating systems, and UAs.
They don't seem interested in actual files with a straight GET (notice the 206) but go for the index files in subdirectories and then try this onclick thingie on one or more files in the subdirectory.
Small sample from logs with IP and files/directories changed to protect the innocent:
111.222.333.444 - - [08/Jun/2003:22:22:35 -0500] "GET /aaaaa.htm\" onclick=\"kr(this,'wr','61','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [08/Jun/2003:22:22:35 -0500] "GET /aaaaa.htm\" onclick=\"kr(this,'wr','61','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [08/Jun/2003:22:51:03 -0500] "GET / HTTP/1.0" 200 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; QXW0334h)"
111.222.333.444 - - [09/Jun/2003:08:46:39 -0500] "GET /bbbbb.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 3.0; Windows 98; DigExt)"
111.222.333.444 - - [09/Jun/2003:08:47:22 -0500] "GET /bbbbb.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; QXW03318)"
111.222.333.444 - - [09/Jun/2003:12:53:56 -0500] "GET /\" onclick=\"kr(this,'wr','40','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:12:53:56 -0500] "GET /ccccccc/ HTTP/1.0" 200 8000 "-" "Mozilla/5.0 (compatible; Konqueror/2.0; X11); Supports MD5-Digest; Supports gzip encoding"
111.222.333.444 - - [09/Jun/2003:12:53:56 -0500] "GET /ccccccc/iiiii.htm HTTP/1.0" 206 8000 "-" "Mozilla/5.0 (compatible; Konqueror/2.0; X11); Supports MD5-Digest; Supports gzip encoding"
111.222.333.444 - - [09/Jun/2003:12:53:56 -0500] "GET /ccccccc\" onclick=\"kr(this,'wr','46','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:12:53:56 -0500] "GET /ccccccc\" onclick=\"kr(this,'wr','46','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:13:09:11 -0500] "GET /dddddd.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Hotbar 2.0)"
111.222.333.444 - - [09/Jun/2003:14:55:55 -0500] "GET /eeeee HTTP/1.0" 301 311 "-" "Mozilla/4.72 [en]C-CCK-MCD (Win98; U)"
111.222.333.444 - - [09/Jun/2003:14:55:55 -0500] "GET /eeeee/ HTTP/1.0" 200 8000 "-" "Mozilla/4.72 [en]C-CCK-MCD (Win98; U)"
111.222.333.444 - - [09/Jun/2003:14:55:59 -0500] "GET /bbbbb.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.72 [en]C-CCK-MCD (Win98; U)"
111.222.333.444 - - [09/Jun/2003:15:01:30 -0500] "GET /bbbbb.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT; EOCstring)"
111.222.333.444 - - [09/Jun/2003:15:01:30 -0500] "GET /ffffff-jjjjjj.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT; EOCstring)"
111.222.333.444 - - [09/Jun/2003:15:01:30 -0500] "GET /ffffff/kkkkkkk.htm\" onclick=\"kr(this,'wr','28','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:15:01:30 -0500] "GET /ffffff/kkkkk.htm\" onclick=\"kr(this,'wr','28','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:15:01:30 -0500] "GET /dddddd.htm\" onclick=\"kr(this,'wr','26','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:16:19:06 -0500] "GET /ggggg.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.5; MSN 2.5; Windows 98; AT&T WNS5.0; AT&T WNS5.2)"
111.222.333.444 - - [09/Jun/2003:21:29:00 -0500] "GET /hhhhh.htm\" onclick=\"kr(this,'wr','74','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [09/Jun/2003:21:29:00 -0500] "GET /lllllll.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt; Yahoo-1.0)"
111.222.333.444 - - [09/Jun/2003:21:29:03 -0500] "GET /hhhhh.htm\" onclick=\"kr(this,'wr','74','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [10/Jun/2003:09:15:28 -0500] "GET /\" onclick=\"kr(this,'wr','97','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [10/Jun/2003:09:15:29 -0500] "GET /\" onclick=\"kr(this,'wr','97','i') HTTP/1.0" 400 374 "-" "-"
111.222.333.444 - - [10/Jun/2003:09:27:50 -0500] "GET / HTTP/1.0" 200 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95; CNETHomeBuild03171999)"
[Tue Jun 10 00:25:20 2003] [error] [client xxx.xxx.xxx.xxx] GET /name of page requested\" onclick=\"kr(this,'wr','75','i') HTTP/1.0
These are the variations of the above example:
/\" onclick=\"kr(this,'wr','95','i') HTTP/1.0
onclick=\"kr(this,'wr','87','i') HTTP/1.0
onclick=\"kr(this,'wr','75','i') HTTP/1.0
onclick=\"kr(this,'wr','73','i') HTTP/1.0
onclick=\"kr(this,'wr','64','i') HTTP/1.0
onclick=\"kr(this,'wr','63','i') HTTP/1.0
onclick=\"kr(this,'wr','48','i') HTTP/1.0
onclick=\"kr(this,'wr','34','i') HTTP/1.0
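For anyone who wants to pull these requests out of their own access log, here is a minimal sketch in Python. It assumes the log is in the combined format shown in the sample above and that embedded double quotes appear escaped as \" (as they do in the sample); the names LOG_LINE, parse and is_broken_onclick_request are made up for this example.

```python
import re
from collections import Counter

# Combined log format; quoted fields may contain escaped quotes (\").
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<agent>(?:[^"\\]|\\.)*)"$'
)

def parse(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    return m.groupdict() if m else None

def is_broken_onclick_request(entry):
    """True for the malformed requests discussed in this thread (400 plus onclick)."""
    return entry["status"] == "400" and r'\" onclick=\"' in entry["request"]

# Three lines copied from the log sample above.
SAMPLE = [
    r'''111.222.333.444 - - [08/Jun/2003:22:22:35 -0500] "GET /aaaaa.htm\" onclick=\"kr(this,'wr','61','i') HTTP/1.0" 400 374 "-" "-"''',
    r'''111.222.333.444 - - [08/Jun/2003:22:51:03 -0500] "GET / HTTP/1.0" 200 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0; QXW0334h)"''',
    r'''111.222.333.444 - - [09/Jun/2003:08:46:39 -0500] "GET /bbbbb.htm HTTP/1.0" 206 8000 "-" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 3.0; Windows 98; DigExt)"''',
]

entries = [parse(line) for line in SAMPLE]
flags = [is_broken_onclick_request(e) for e in entries]
status_counts = Counter(e["status"] for e in entries)

for entry, broken in zip(entries, flags):
    marker = "BROKEN" if broken else "ok"
    print(marker, entry["status"], entry["request"])
```

Run against a whole log file, the same two functions would let you tally the broken requests per IP or per day.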
Ok, so now that I know what it is ....
Why the 206s? All requests for /filename.htm are partial GETs. These visits didn't start before 5/11/03, and every GET for a filename.htm has returned only a 206.
Why is it calling files that have been 301 for 7 months but not calling the renamed file?
Why is it calling dead files that it discovered, on an earlier pass, were 404?
And is this related to the paid version of (keyword software)?
And finally, I would have thought that (keyword software) would work correctly, not send bad syntax that returns a 400?
Any ideas?
Don't understand the 206 that well either. What do they get with partial content? Are they just looking at the head for an updated date, or do they actually get part of the file, perhaps the title and description tags?
And, since I was thinking about going to the paid version of this s/w, I wonder what/why it is written with bad syntax (the 400's in my log example)?
I thought I remembered seeing a thread about this and did try searching here, and on Google, before posting, but couldn't find a relevant thread. ;)
10.2.7 206 Partial Content
The server has fulfilled the partial GET request for the resource. The request MUST have included a Range header field (section 14.35) indicating the desired range, and MAY have included an If-Range header field (section 14.27) to make the request conditional.
You just made the acquaintance of possibly the most broken bot out there, and you're asking why it does a lot of stupid things? ;)
Don't understand the 206 that well either
Since they always fetch 8000 bytes, I guess they just want to limit their bandwidth requirements. They're researching keywords, after all, and according to certain theories, those near the top of each page are considered to be the most relevant.
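To make the 206 concrete: with a partial GET the client receives real bytes of the file, not just headers. Below is a rough Python sketch of how a server answers a simple single-range request. The header value "bytes=0-7999" is an assumption (the logs only show that 8000 bytes came back, not what the bot actually sent), and serve_range is a made-up helper, not any server's real code.

```python
import re

def serve_range(body: bytes, range_header: str):
    """Answer a single 'bytes=start-end' range request.

    Returns (status, content_range, payload). Only the simple
    'bytes=N-M' / 'bytes=N-' forms are handled in this sketch.
    """
    m = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if not m:
        # Anything this sketch does not understand is ignored: full response.
        return 200, None, body
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else len(body) - 1
    if end < start:
        # An end before the start is an invalid range, so the header is ignored.
        return 200, None, body
    if start >= len(body):
        # The range starts past the end of the file.
        return 416, f"bytes */{len(body)}", b""
    end = min(end, len(body) - 1)  # clamp to the actual file size
    return 206, f"bytes {start}-{end}/{len(body)}", body[start:end + 1]

# A stand-in 20000-byte page; the first 8000 bytes are what gets sent back.
page = b"x" * 20000
status, content_range, payload = serve_range(page, "bytes=0-7999")
print(status, content_range, len(payload))
```

So a bot asking for the first 8000 bytes gets the top of the page, which usually covers the title and meta tags plus whatever text comes first.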
I wonder what/why it is written with bad syntax (the 400's in my log example)?
Because the person who wrote this bot hasn't figured out yet that a second double quote ends a URL. That's probably one of the most stupid mistakes you can make in an HTML parser, and they still don't seem to have noticed after running that way for at least a month. Really raises your confidence in their product, doesn't it?
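One way such a bug can arise, sketched in Python. This is only a guess at the kind of mistake involved, not the bot's actual code: the anchor tag below is hypothetical markup reconstructed from the logged request, and both patterns are made up for illustration. A greedy match runs on to the last double quote on the line instead of stopping at the one that closes the href value.

```python
import re

# Hypothetical markup, reconstructed from the request seen in the logs.
HTML = '''<a href="/aaaaa.htm" onclick="kr(this,'wr','61','i')">some link</a>'''

# Buggy: ".*" is greedy, so it swallows everything up to the LAST double quote.
GREEDY = re.compile(r'href="(.*)"')

# Fixed: the value ends at the first double quote after href=".
STOP_AT_QUOTE = re.compile(r'href="([^"]*)"')

greedy_url = GREEDY.search(HTML).group(1)
correct_url = STOP_AT_QUOTE.search(HTML).group(1)

print("greedy: ", greedy_url)
print("correct:", correct_url)
```

The greedy result is exactly the path that shows up in the 400 lines, once you allow for the log writing each embedded double quote as \".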
couldn't find a relevant thread.
Just search for the IP:
[webmasterworld.com...]
[webmasterworld.com...]
Aah... I read the RFC but didn't understand it. Thanks, bird, for that extra bit of clarification.
Just search for the IP:
Maybe I better go take GoogleGuy's search tutorial ;). Never thought to try searching on the IP either. Think I'll add 6/10 and 6/11 to my 2003 Dumb Days List.
Thanks for the help!