Bad Spiders and Proxy Servers

Using Opera Proxy

         

grandma genie

12:54 am on May 23, 2012 (gmt 0)

10+ Year Member



This is probably nothing, but considering all the very strange visitors coming to my website lately, I thought I would pass this one along.

This IP: 141.0.9.nnn hit the home page only, had no referer, and this user agent: Opera/9.80 (iPhone; Opera Mini/7.1.32694/27.1741; U; en) Presto/2.8.119 Version/11.10

I checked the IP on Project Honey Pot, and they think it could be a "harmless" spider. The IP belongs to Opera Software ASA and is a Confirmed Proxy Server. This same IP visited my site 6 months ago, only that time it had a referer of h**p://www.2zoo.com/vb/showthread.php?t=nnn, which is an Arabic-language site. The user agent back then was: Opera/9.80 (J2ME/MIDP; Opera Mini/4.2.18154/26.1153; U; ar) Presto/2.8.119 Version/10.54

I've noticed a number of visitors coming from proxy servers. This visitor hasn't done anything... yet. But I'm starting to get trigger happy. Just the other day this bad boy (192.114.71.nn) snagged all my pix and a day earlier this one (62.219.8.nnn) grabbed everything else. The 192 IP had no referer or user agent. The 62 used these user agents:
Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.91 Safari/534.30
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Mozilla/6.0 (compatible)
Mozilla/5.0 (compatible)

Both of these IPs are from Bezeq International. The scuttlebutt is that it is the PicScout bot working for Getty, looking for copyright infringement. They (Getty) are trying to buy up all the stock photos on the internet, then extort money from anyone who is using one of "their" pix. So, does that mean if you are using some freebie pix from XYZ Company, and Getty has now purchased that company, that they can come after you for copyright infringement? That's what I call EVIL.

keyplyr

1:26 am on May 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I block all known proxy ranges and even block the terms "proxy" and "proxi" in UA strings as well as referrers. I may lose a few real visitors, but I save myself a lot of trouble.
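
For readers wondering what that kind of filter might look like, here is a minimal sketch in Apache mod_rewrite (.htaccess). It assumes mod_rewrite is enabled; the pattern and the 403 response are illustrative choices, not keyplyr's actual rules.

```apache
RewriteEngine On
# Match "proxy" or "proxi" anywhere in the User-Agent, case-insensitively ...
RewriteCond %{HTTP_USER_AGENT} prox[iy] [NC,OR]
# ... or anywhere in the Referer header.
RewriteCond %{HTTP_REFERER} prox[iy] [NC]
# Answer any matching request with 403 Forbidden.
RewriteRule ^ - [F]
```

A substring match this broad will also catch legitimate referrers whose URL merely contains "proxy", which is the trade-off the post accepts.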

wilderness

2:20 am on May 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



gg,
these UAs are also easily blacklisted.

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Mozilla/6.0 (compatible)
Mozilla/5.0 (compatible)

ends with 5\.1\)
ends with compatible\)
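
As a rough sketch, those two endings could be written as anchored mod_rewrite conditions like the ones below. This is an illustration, not wilderness's own rule set. Each of the three UAs listed ends with a closing parenthesis, so the patterns include it before the `$` anchor.

```apache
RewriteEngine On
# "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" ends with "5.1)"
RewriteCond %{HTTP_USER_AGENT} 5\.1\)$ [OR]
# "Mozilla/5.0 (compatible)" and "Mozilla/6.0 (compatible)" end with "compatible)"
RewriteCond %{HTTP_USER_AGENT} compatible\)$
# Answer any matching request with 403 Forbidden.
RewriteRule ^ - [F]
```

The first condition will match any UA that ends in "5.1)", including a genuine bare MSIE-on-XP string with no trailing tokens, so it is worth checking logs before deploying it.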

grandma genie

4:14 am on May 23, 2012 (gmt 0)

10+ Year Member



I think the problem with stopping these types of events is that once one has occurred, they don't usually come back the same way. It's like closing the barn door after the horse has bolted. I think that is why these bots come with a variety of UAs - hoping at least some of them will get through all our attempts to thwart them. I think wilderness has mentioned before that there are myriad combinations you can use to block a UA. I will use these suggestions to continue to keep that old barn door shut... and the spiders out.

wilderness

5:26 am on May 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's like closing the barn door after the horse has bolted. I think that is why these bots come with a variety of UAs -


gg,
Over an extended period of time (it's been more than a decade for me) you implement various solutions, and after a while, the possibility of a new method of entry is reduced drastically.
I certainly do NOT see a fraction of the complete site crawls that I saw years ago; the doors have simply been closed.

The time it takes to get there also lets you build skills in recognizing problems and finding solutions that are simply not available to the masses, and in many instances those skills, born of experience, can't even be put into words.

motorhaven

4:12 am on May 24, 2012 (gmt 0)

10+ Year Member Top Contributors Of The Month



Agree completely with wilderness.

I've tightened up filtering over the past few months (even more than I did in the past). This time I took the route of focusing not on the bots, but on real users. Learn to recognize accurately that it's a real user, even one coming through a corporate or military proxy, and the effectiveness skyrockets while requiring very little blacklisting. Whitelisting is the way to go.

The small number of legit users getting blocked, a fraction of a tenth of a percent, is a trade-off well worth making. With each passing week I see fewer captchas filled in at the bot traps, because every time a legit user fills one out my program gives me everything imaginable about their browser session. This gives me the opportunity to further fine-tune my whitelist filtering for better bot trapping and fewer false positives.

Kendo

4:25 am on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



124.238.243.* HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98)


This one hits one of our sites every few minutes, touching most pages. It confuses our stats because when it hits one page it follows 6 other links on that page, so our stats show thousands of hits on a single page like page.asp?id=xyz

The IP block seems to belong to China and each visit is a different member of that range. Is this a search engine that anyone would welcome? Or is it a site scraper looking for web forms to spam?

Kendo

4:39 am on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, does that mean if you are using some freebee pix from XYZ Company, and Getty has now purchased that company, that they can come after you for copyright infringement?


Free pix were provided for free, so a change of ownership should not affect your rights.

However, it's best not to use free pics because people can tell, and if an image looks familiar then you can appear to be dodgy. We always use originals or images that we purchase the rights for. That said, I have noticed of late that one competitor has started using the same image stock and even uses the same images as we use in our ads.

About image tracking... if one resamples or edits the image then tracking becomes useless.

Kendo

4:55 am on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



124.238.243.* HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98)


It seems to be raised elsewhere on this forum... [webmasterworld.com...]

I just did a count of hits from this app, and decided that I can do without the 25,939 page hits per day.

wilderness

5:46 am on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



124.238.243.* HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+5.5;+Windows+98)


The IP block seems to belong to China and each visit is a different member of that range. Is this a search engine that anyone would welcome? Or is it a site scraper looking for web forms to spam?


It's hardly an SE, and if it were legitimate, it would use the proper identification protocol.

1) You should have a deny in place for this malformed UA.
2) I'm assuming you're denying the precise Class D IP, rather than correctly denying the provider's entire Class B range.
3) There's a possibility you could consider denying the entire Class A.
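
A sketch of what points 2 and 3 could look like in mod_rewrite, assuming an Apache server and reading "Class B" and "Class A" as the first two octets and the first octet of the 124.238.243.* address quoted above. That reading and the 403 response are assumptions, not something the post spells out.

```apache
RewriteEngine On
# Point 2: everything under 124.238.x.x (first two octets fixed)
RewriteCond %{REMOTE_ADDR} ^124\.238\.
RewriteRule ^ - [F]

# Point 3 (far broader): everything under 124.x.x.x
# Left commented out; enable only if you are sure you want it.
# RewriteCond %{REMOTE_ADDR} ^124\.
# RewriteRule ^ - [F]
```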

lucy24

7:58 am on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just the other day this bad boy (192.114.71.nn) snagged all my pix and a day earlier this one (62.219.8.nnn) grabbed everything else.

Eeeuw, it's the Bezeq tag team! They do eventually come back, so you don't feel as if you've wasted your time blocking them. Last time I saw them, the htmlbot was new to me and got several hundred files, but the picbot was familiar so it got a few thousand 403s. Num, num.

A third member of the team is 82.80-81. (I think technically 82.80.248-255 but who's counting.) I've only met them without a UA, so they'd be blocked anyway.

And unless I've overlooked something, every single image on my site is either public domain in the US (published before 1922) or created by me, so take that, Getty :-P

dstiles

9:07 pm on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Kendo - I blocked 124.238.224.0 - 124.238.255.255 over two years ago with the note that it's a server farm. A quick check suggests it still is.

grandma genie

1:13 am on May 25, 2012 (gmt 0)

10+ Year Member



Wilderness is right about the benefits of learning methods of problem solving over time. Compared to when I first began, I have learned quite a lot. Am still learning. Being on a hosted server has its limitations. It took me a day to get an answer from my host about the PHP-CGI vulnerability exploit. Their servers don't use the PHP-CGI configuration.

I have the 82.80 range blocked. Also blocked 124.238. Am working on the Linode IP ranges. It seems like the more stuff I block, the stranger my traffic becomes. Almost zombie-ish.

As for image use, most of mine are manufacturer images for the products I sell. I take quite a few of my own. And I have one of those huge sets of CDs with thousands of images. They are quite nice. Getty does not own them... yet.

Kendo

3:45 am on May 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I blocked 124.238.224.0 - 124.238.255.255 over two years ago


Thanks. I dedicated yesterday to creating detection and redirection for a very long list of user-agents, something that has been on my to-do list for ages.

keyplyr

7:33 am on May 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I block that ChinaNet range entirely.
124.236.0.0 - 124.239.255.255
124.236.0.0/14
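
The two notations above describe the same block: a /14 fixes the first 14 bits, so the second octet runs over four values, 236 through 239. A minimal sketch of denying it, assuming Apache 2.2-style access control (the thread does not say which directives keyplyr actually uses):

```apache
# Apache 2.2 (mod_authz_host): allow everyone except the /14
Order Allow,Deny
Allow from all
Deny from 124.236.0.0/14

# Apache 2.4 equivalent (use instead of the block above, not alongside it):
# <RequireAll>
#     Require all granted
#     Require not ip 124.236.0.0/14
# </RequireAll>
```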

wilderness

7:02 pm on May 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I block that ChinaNet range entirely.


ditto

RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]

lucy24

9:20 pm on May 25, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I block that ChinaNet range entirely.
124.236.0.0 - 124.239.255.255
124.236.0.0/14

I'm more cold-blooded.

124.220.0.0/14
124.224.0.0/12
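
Worked out, 124.220.0.0/14 covers 124.220.0.0 - 124.223.255.255 and 124.224.0.0/12 covers 124.224.0.0 - 124.239.255.255, so together the two blocks span 124.220 through 124.239. A sketch of the same span as a single mod_rewrite condition (an illustration, not lucy24's actual rule):

```apache
RewriteEngine On
# Second octet 220-239: "2", then 2 or 3, then any digit
RewriteCond %{REMOTE_ADDR} ^124\.2[23][0-9]\.
# Answer any matching request with 403 Forbidden.
RewriteRule ^ - [F]
```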

Kendo

4:37 am on May 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



124.238.224.*


Interesting... this IP block has so far pulled 454,277 web pages from one of our sites, but while that site supports many clients in China, each with hundreds of members, none of them belong to this IP block.

It can't be a caching service because I see the same pages hit every few minutes. If this has been going on for more than 2 years, how come they are still connected?

keyplyr

7:25 am on May 27, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I'm more cold-blooded.

124.220.0.0/14
124.224.0.0/12


Not the same ranges as 124.236/14, and 124.220/14 is mostly Gov't R&D anyway. Have you seen bad agents from that range?