Heads up for a voracious scraper

Mokita

1:36 am on Sep 25, 2006 (gmt 0)

10+ Year Member



This IP raced through several of our interlinked sites yesterday taking absolutely every single page!

It requested about 8-10 pages per second and did not request robots.txt.

The UA is: "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98"

From Japan IP 221.191.105.***
Reverse DNS is ***.tokyo.ocn.ne.jp
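For anyone wanting to check their own logs for the same pattern, here is a minimal sketch, assuming Apache-style combined log lines like the ones quoted in this thread. It reports hosts whose peak request rate in any single second reaches a threshold, and whether that host ever asked for robots.txt. The function name `summarize`, the `burst_threshold` default of 5, and the example addresses in the usage note are illustrative assumptions, not anything taken from the original logs.

```python
import re
from collections import Counter, defaultdict

# Combined Log Format: host ident user [time] "request" status bytes ...
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

def summarize(lines, burst_threshold=5):
    """Return hosts whose busiest second had at least burst_threshold requests."""
    per_second = defaultdict(Counter)   # host -> {timestamp string: request count}
    fetched_robots = set()              # hosts that requested /robots.txt

    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        host = m.group("host")
        # The log timestamp has one-second resolution, so the raw string
        # works as a per-second bucket key without any date parsing.
        per_second[host][m.group("time")] += 1
        if m.group("path").split("?")[0] == "/robots.txt":
            fetched_robots.add(host)

    report = {}
    for host, counts in per_second.items():
        peak = max(counts.values())
        if peak >= burst_threshold:
            report[host] = {
                "peak_per_second": peak,
                "fetched_robots": host in fetched_robots,
            }
    return report
```

Calling `summarize(open("access.log"))` (the file name is a placeholder) returns a dict keyed by host. A host showing a high peak with `fetched_robots` set to `False` matches the behaviour described above, though a threshold that suits one site may be too strict or too loose for another.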

[edited by: volatilegx at 2:04 am (utc) on Sep. 25, 2006]
[edit reason] obscured IP address and hostname [/edit]

Mokita

2:19 am on Sep 25, 2006 (gmt 0)

10+ Year Member



Further investigation since I first posted reveals this to be an active spambot:

[projecthoneypot.org...]

keyplyr

6:24 am on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It took just my index.html, which pretty much proves it's not the browser it pretends to be:

221.191.105.*** - - [24/Sep/2006:01:13:26 -0400] "GET / HTTP/1.0" 200 11072 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

[edited by: volatilegx at 2:31 pm (utc) on Sep. 26, 2006]
[edit reason] obfuscated ip addresses [/edit]

GaryK

2:58 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Many of those IP addresses from Project Honey Pot correspond to ones I have in my database as using randomized (aka scrambled) user agents. Typically I see a normal-looking user agent hit a page, and if it doesn't get rejected, the other user agent grabs the same page. Are you sure you're not missing the actual user agent that's doing the scraping?
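One rough way to look for the paired-agent pattern described above is to group log lines by host and page and keep the pairs that were fetched under more than one user agent. This is only a sketch, assuming combined log format; the name `paired_agents` is hypothetical, and the check ignores request order and timing, so shared proxies or NAT gateways will produce false positives.

```python
import re
from collections import defaultdict

# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def paired_agents(lines):
    """Map (host, path) to its user agents when more than one agent was seen."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            seen[(m.group("host"), m.group("path"))].add(m.group("agent"))
    # Keep only the host/page pairs requested under several different agents.
    return {key: agents for key, agents in seen.items() if len(agents) > 1}
```

An empty result for the suspect host is consistent with a single agent doing all the fetching, but it does not prove it.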

Mokita

2:28 pm on Oct 1, 2006 (gmt 0)

10+ Year Member



Hi Gary,

No, I'm not missing the actual agent - the raw log shows dozens of sequential requests, all looking similar to what keyplyr posted.

***.tokyo.ocn.ne.jp - - [24/Sep/2006:11:36:59 +1000] "GET / HTTP/1.0" 200 9194 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

(I made a typo in my first message - inadvertently left off the last bracket for the UA.)

It certainly is arresting to the eye when you come across that many requests happening so quickly all from the same IP.

The other interesting thing was that the bot also requested all file bookmarks separately, e.g.:
"GET /file.html HTTP/1.0"
"GET /file.html#bookmark1 HTTP/1.0"
"GET /file.html#bookmark2 HTTP/1.0"
etc etc

So in some cases, it took the same file numerous times. <sigh>
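The bookmark requests make a handy fingerprint: a browser strips the `#fragment` part of a URL before sending the request, so a `#` in the logged request target suggests a client that followed href values verbatim. A minimal sketch for pulling such requests out of a log follows; the function name `fragment_requests` is an assumption for illustration.

```python
import re

# Matches the quoted request field, e.g. "GET /file.html#bookmark1 HTTP/1.0"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+)(?: HTTP/[\d.]+)?"')

def fragment_requests(lines):
    """Return request targets that contain a '#' fragment."""
    hits = []
    for line in lines:
        m = REQUEST_RE.search(line)
        if m and "#" in m.group("target"):
            hits.append(m.group("target"))
    return hits
```

Run over the three example requests above, this returns the two targets that carry a bookmark and skips the plain `/file.html` request.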