Forum Moderators: open

Braindead Scraper With Garbage UA

jmccormac

4:44 pm on Feb 13, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For the last few weeks, one of my sites has seen a braindead scraper/bot network from dialin.net (a German ISP). The primary one seems to have a pile of garbage as the user agent (e.g. "sMgpl3vnyqau i o3virMel"). I've had to 403 it, but it still keeps hammering away. The IP changes periodically but the characteristics are always the same: a two-page attempt every few minutes. As dialin.net is a fairly big ISP, would it be better to redirect any traffic from the ISP to some kind of CAPTCHA page? A scraper trying to download a site with approximately 300 million pages points to a rather special kind of idiocy. There is also a second scraper from the same ISP, though with a static UA.
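For what it's worth, the 403-by-UA/IP approach can be sketched in Apache 2.2-style config along these lines (the UA prefix is the one quoted above; the IP prefix is a placeholder, not the scraper's actual range):

```apache
# Sketch only: flag requests by garbage-UA prefix or by IP prefix,
# then deny them. Replace the placeholder IP with the real range.
SetEnvIfNoCase User-Agent "^sMgpl3vnyqau" bad_scraper
SetEnvIf Remote_Addr "^192\.0\.2\." bad_scraper
Order Allow,Deny
Allow from all
Deny from env=bad_scraper
```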

Regards...jmcc

wilderness

6:56 pm on Feb 13, 2010 (gmt 0)

Here's a site search on random UAs [google.com]. The solutions provided are rather old (although still effective), so you may need to tweak them a little.

jmccormac

10:24 pm on Feb 13, 2010 (gmt 0)

Thanks for the help. The ISP is t-dialin.net rather than dialin.net. I had the scraper 403ed on an IP basis for a few days but it still kept hammering away.

Regards...jmcc

dstiles

11:50 pm on Feb 13, 2010 (gmt 0)

I get a lot of trouble from t-dialin, but since my customers sometimes get traffic from Germany/Austria I can't block the ISP, although I'd like to. I wonder if the duff traffic is actually from Austria, which has a bad name for exploits/spam.

I also get a lot of letter/number-only UAs. I match them with the regex ^[a-zA-Z0-9 ,\./] and kill any hits by blocking the IP (this is, of course, after checking for valid Mozilla/bot etc. openings). It catches a lot and hasn't thrown a false positive yet.
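That filter can be sketched in Python roughly as follows. The whitelist of "valid openings" is illustrative (the post doesn't list the exact set), and the character class is anchored so it only matches UAs made entirely of those characters, which is my reading of the description:

```python
import re

# Known-good UA openings are checked first (illustrative list, not
# the poster's exact whitelist).
VALID_OPENINGS = re.compile(r"^(Mozilla|Opera|Googlebot|msnbot|Yahoo! Slurp)", re.I)

# A UA made of nothing but letters, digits, spaces and light
# punctuation, with no recognised opening, is treated as garbage.
LETTER_NUMBER_ONLY = re.compile(r"^[a-zA-Z0-9 ,./]+$")

def is_garbage_ua(ua: str) -> bool:
    if VALID_OPENINGS.match(ua):
        return False  # looks like a real browser or a known bot
    return bool(LETTER_NUMBER_ONLY.match(ua))

print(is_garbage_ua("sMgpl3vnyqau i o3virMel"))      # True (the UA from the opening post)
print(is_garbage_ua("Mozilla/5.0 (Windows NT 6.1)"))  # False
```

In practice the hit would then feed an IP blocklist rather than just a per-request 403, as described above.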