Home brewed spider?

216.237.31.*** scraping my site

         

Megaclinium

7:28 pm on Feb 25, 2008 (gmt 0)

10+ Year Member



A few days ago someone arrived at my site via a normal outside page that links to it. (I only found that log entry later, after noticing the problems described below and going back through the logs.)

I started noticing a lot of 404 errors (thousands) from this address. I looked at the log and this person was scraping jpegs from my site without going through the web page that links to them.

No robot is named in any of the log entries, just a standard-looking PC browser. He tried grabbing jpegs sequentially (my web page generator renames pictures to a sequence number and increments it). EVERY request for a jpeg was OK, but each one also generated a GET that returned a 404 page.

Is this some kind of home brewed robot? or some page accelerator? I'd think some 'get site' would show as going thru normal web pages.

Funny as it sounds, the scraper failed to get additional pages that weren't in decimal sequence (my web page generator doesn't count in decimal). A page grabber package would have done a better job. Has this guy been plaguing anyone else, or is it just my content he was trying to grab? The # resolved to a huge block of ISP address (Verizon, in GA I think ) so I couldn't ban them all.
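
For anyone wanting to spot the same pattern in their own logs, here is a minimal Python sketch that tallies 404 responses per client address. It assumes the log is in Apache's "combined" format; the regex, the function name and the sample lines (which use reserved documentation addresses, not the real visitor) are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_404s_by_ip(lines):
    """Return a Counter mapping client IP -> number of 404 responses."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            counts[match.group("ip")] += 1
    return counts

# Hypothetical sample data; real entries would come from the access log file.
SAMPLE_LOG = [
    '203.0.113.7 - - [25/Feb/2008:19:28:01 +0000] "GET /pix/0001.jpg HTTP/1.1" 200 51234 "-" "Mozilla/4.0 (compatible; MSIE 7.0)"',
    '203.0.113.7 - - [25/Feb/2008:19:28:01 +0000] "GET /missing.htm HTTP/1.1" 404 312 "-" "Mozilla/4.0 (compatible; MSIE 7.0)"',
    '203.0.113.7 - - [25/Feb/2008:19:28:02 +0000] "GET /missing.htm HTTP/1.1" 404 312 "-" "Mozilla/4.0 (compatible; MSIE 7.0)"',
    '198.51.100.20 - - [25/Feb/2008:19:30:00 +0000] "GET /index.htm HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for ip, hits in count_404s_by_ip(SAMPLE_LOG).most_common(10):
        print(ip, hits)
```

An address that sits far above everyone else in this tally is a candidate for a closer look at its requests and UA before deciding on a ban.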

wilderness

12:06 am on Feb 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The # resolved to a huge block of ISP address (Verizon, in GA I think ) so I couldn't ban them all.

The Class C ranges 0-63 of this IP (that is, 216.237.0.* through 216.237.63.*) all belong to the same backbone provider.
1) Determine whether you get any other traffic from the backbone range.
a) If you don't get traffic from backbone ranges, then you alone must decide what is beneficial or detrimental to your own website(s).
2) Perhaps you could focus on the backbone's subnet ranges (however, from the few I looked at, most of the subnet ranges are operated by the backbone as well).

NOT VERIZON

Is this some kind of home brewed robot? or some page accelerator? I'd think some 'get site' would show as going thru normal web pages.

You haven't provided a UA line from your logs. Is there software named in the UA that is image specific?

If you're unable to deny by the name of a well-known harvesting tool contained within the UA, you may be able to use multiple criteria (the IP range and a portion of the UA) to reduce the number of innocents blocked.

Don

Megaclinium

12:13 am on Feb 26, 2008 (gmt 0)

10+ Year Member



Thanks - it was just one address (still hitting the site) but the moderator put it as '**'. I've just denied this one IP address access to the site. I don't see anything on my control panel to deny a specific UA. (I'm new here: the UA is the part of the log record showing what their machine is, which sometimes shows the spider name, I presume?)

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"

was what was in the log records for all of them right after the referring page.
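
As a rough illustration of where those fields sit, the Python sketch below pulls the referring page and the UA out of one log line. It assumes the combined format, where the last two double-quoted fields are the referer and the UA, and it does not handle escaped quotes inside a field. The function name, the address, the request path and the referer in the sample line are invented; only the UA string is the one quoted above.

```python
import re

def referer_and_agent(line):
    """Return (referer, user_agent) from a combined-format log line, or None.

    Assumption: fields are wrapped in plain double quotes with no escaped
    quotes inside, so the last two quoted fields are referer and UA.
    """
    quoted = re.findall(r'"([^"]*)"', line)
    if len(quoted) < 3:  # expect request, referer, user-agent
        return None
    return quoted[-2], quoted[-1]

# Hypothetical log line; the IP, path and referer are placeholders.
SAMPLE_LINE = (
    '203.0.113.7 - - [25/Feb/2008:19:28:01 +0000] '
    '"GET /pix/0001.jpg HTTP/1.1" 200 51234 '
    '"http://www.example.org/links.htm" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; '
    '.NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"'
)

if __name__ == "__main__":
    print(referer_and_agent(SAMPLE_LINE))
```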

wilderness

1:07 am on Feb 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Temporarily (because the .NET version will change on updates), you could focus on the end of the UA combined with the IP range:

# UA ends with ".30)" and the request comes from the 216.237.31 Class C
RewriteCond %{HTTP_USER_AGENT} \.30\)$
RewriteCond %{REMOTE_ADDR} ^216\.237\.31\.
RewriteRule .* - [F]

The 4th paragraph in this link provides an example of the "Combined" log format and its field data:

[httpd.apache.org...]
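
Before deploying a two-condition rule like this, it can help to check offline which visitors it would catch. The Python sketch below mirrors the idea (UA ending in "30)" AND address in 216.237.31.*); it is only an approximation of how mod_rewrite evaluates the conditions, and the function name and test addresses are made up. Note that the closing parenthesis has to be escaped in the pattern, otherwise the regex is invalid.

```python
import re

# Same two criteria as the rewrite conditions, expressed as Python regexes.
UA_TAIL = re.compile(r"30\)$")               # UA ends with "30)"
IP_PREFIX = re.compile(r"^216\.237\.31\.")   # address in 216.237.31.*

def would_be_forbidden(remote_addr, user_agent):
    """True only when BOTH the IP range and the UA tail match."""
    return bool(IP_PREFIX.match(remote_addr) and UA_TAIL.search(user_agent))

SCRAPER_UA = (
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; "
    ".NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
)

if __name__ == "__main__":
    # The last octets below are placeholders, not the real visitor.
    print(would_be_forbidden("216.237.31.5", SCRAPER_UA))      # both match
    print(would_be_forbidden("216.237.32.5", SCRAPER_UA))      # wrong range
    print(would_be_forbidden("216.237.31.5", "Mozilla/5.0"))   # wrong UA
```

Requiring both matches is what keeps other visitors in the same range, or other users of the same browser build, from being blocked.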

Megaclinium

12:20 am on Feb 28, 2008 (gmt 0)

10+ Year Member



Now I realize why I was getting thousands of 404 errors: I have site leech protection turned on in the control panel. They can't harvest the .jpegs directly; they have to go through the .htm pages, and this guy wasn't. Boitho seemed to be doing the same thing, so I've banned it.
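
For readers unfamiliar with leech protection, the usual idea is a referer check on image requests. The Python sketch below shows that logic in the abstract; it is an assumption about how such a feature typically decides, not a description of what this particular control panel does. The host name `www.example.com`, the suffix list and the function name are hypothetical.

```python
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png")

def is_direct_image_grab(path, referer, own_host="www.example.com"):
    """True if an image is requested without a referer from our own pages.

    Assumption: an empty or "-" referer counts as a direct grab. Some
    setups allow empty referers because privacy tools strip the header.
    """
    if not path.lower().endswith(IMAGE_SUFFIXES):
        return False  # only image requests are protected
    if referer in ("", "-"):
        return True
    return urlparse(referer).hostname != own_host

if __name__ == "__main__":
    print(is_direct_image_grab("/pix/0001.jpg", "-"))
    print(is_direct_image_grab("/pix/0001.jpg", "http://www.example.com/page1.htm"))
    print(is_direct_image_grab("/page1.htm", "-"))
```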

keyplyr

2:17 am on Feb 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Probably just some cable subscriber using a download tool. Very common and hardly worth the defensive effort IMO.