Forum Moderators: open
They don't identify themselves whatsoever when hitting my servers, don't ask for robots.txt, and use a variety of generic UAs.
193.238.230.* "libwww-perl/5.803"
193.238.230.* "libwww-perl/5.805"
193.238.230.* "Mozilla/5.0 (compatible)"
193.238.230.* "Mozilla/5.0"
They crawl from HostWay in France using the IPs 193.238.230.109-138 as far as I know.
The DNS for those IPs all resolves to optioncarriere, except one which resolved to careerjet.
Hostway info:
inetnum: 193.238.228.0 - 193.238.231.255
descr: Hostway
country: FR
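For anyone who wants to deny that whole allocation rather than chase individual addresses, the inetnum above (193.238.228.0 - 193.238.231.255) is the single CIDR block 193.238.228.0/22. A minimal sketch of the range check, using Python's standard `ipaddress` module (the `is_blocked` helper name is mine, not from any particular server software):

```python
import ipaddress

# The HostWay allocation quoted above: 193.238.228.0 - 193.238.231.255
BLOCKED_NET = ipaddress.ip_network("193.238.228.0/22")

def is_blocked(ip: str) -> bool:
    # Containment test covers the whole /22, including 193.238.230.109-138
    return ipaddress.ip_address(ip) in BLOCKED_NET

print(is_blocked("193.238.230.120"))  # inside the crawler's range -> True
print(is_blocked("8.8.8.8"))          # outside -> False
```

The same /22 can of course go straight into a firewall or deny list instead of application code.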
That's all I have for now.
[edited by: incrediBILL at 12:34 am (utc) on April 13, 2008]
But in this case this lamer just crawls my homepage; it doesn't fetch robots.txt, it just tries to crawl. And all it gets is a nice 403 and no content.
I first started seeing it come from this range on 02-23-2007 with "libwww-perl/5.803" as the User-Agent. The only other User-Agent I've seen from this range is "Mozilla/5.0+(compatible)", and that was back on 12-02-2007. The "libwww-perl/5.803" requests are still ongoing but are blocked.
I block it, and so should you if you can. "libwww-perl/5.803" is a pest User-Agent; blocking it altogether will save you problems and grief.
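Since the versions rotate (5.803, 5.805, ...), the safest rule is a substring match on "libwww-perl" rather than an exact string. A minimal sketch, assuming you can inspect the User-Agent header in your application (the function name is hypothetical):

```python
def is_pest_ua(user_agent: str) -> bool:
    # Case-insensitive substring match: "libwww-perl/5.803",
    # "libwww-perl/5.805", and any future version all hit the same rule.
    return "libwww-perl" in user_agent.lower()

print(is_pest_ua("libwww-perl/5.803"))  # True -> serve a 403
```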
My widgets are focused on the North American market.
RIPE, APNIC (except some of the Oceania ranges), LACNIC, JPNIC, AFRINIC, and any other non-North-American ranges that I'm able to differentiate are denied.
I do make exceptions for specific ranges, should a widget-industry person refer someone to me from a denied registry.
Unfortunately, North American Internet providers and their policies of assigning focused ranges are far different from the policies of other countries and continents.
Thus I explain to these folks that adjustments may not work in five minutes, and that a fixed IP range (which I've been told is expensive in Europe) is the long-term solution.
Don
Most abhor this
Surprisingly I don't, and I'm torn on many aspects of international blocking.
When I was running ecommerce sites I too only accepted US and Canadian orders so I didn't need to block them specifically because the checkout page only had US and Canada as billing and delivery options. However, to combat Internet fraud I did block all other countries from gaining access to the checkout page.
My current main site is international, so I don't really block any country, but I have been forced to put up CAPTCHAs on first contact from a couple of countries with the worst automated scraper abuse.
Furthermore, I don't firewall entire countries, because the problems I have are with scrapers and spammers, so I selectively lock certain IPs out of SMTP or HTTP access only, based on the abuse.
It's rather complicated, wish I could be as straightforward as you are and just block the planet ;)
What about just "Mozilla/5.0" and "Mozilla/5.0 (compatible)" any humans using those?
I see "Mozilla/5.0 (compatible)" used often by filtering corporate networks, where the corporate firewall/proxy server sees a request for a website it doesn't know and downloads the page using that user-agent. Then, if the page passes all its rules, it allows the original request to go through.
As for "Mozilla/5.0", just ban it; no major service that I know of uses that user-agent for anything other than scraping nowadays.
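The key point when banning these is to match the generic strings exactly, since real browsers always append platform details after "Mozilla/5.0". A minimal sketch of that exact-match idea (function name is mine; whether to block "(compatible)" is your call given the corporate-proxy caveat above):

```python
# Generic UAs quoted in this thread; exact strings only, never substrings,
# so real browser UAs like "Mozilla/5.0 (Windows NT ...) ..." are untouched.
GENERIC_UAS = {"Mozilla/5.0", "Mozilla/5.0 (compatible)"}

def is_generic_ua(user_agent: str) -> bool:
    return user_agent.strip() in GENERIC_UAS

print(is_generic_ua("Mozilla/5.0"))  # True -> candidate for blocking
```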
I am not sure why they do not use the same UA as the original request; that has always stumped me too. I think it's a default in the software: when a request doesn't match any of the known blacklist topics, it fetches the content again, and since it doesn't save the original UA that requested the page, it has to use a stand-in UA in its place. It then sends the content to the software maker's company to be reviewed so it can be classified properly. Updates to the blacklists usually happen daily, sometimes more often. But the software is designed to be transparent, and classification as good or bad may take some time, so it makes another request and lets the original through.