Forum Moderators: open
They don't identify themselves whatsoever when hitting my servers, don't ask for robots.txt, and use a variety of generic UAs.
193.238.230.* "libwww-perl/5.803"
193.238.230.* "libwww-perl/5.805"
193.238.230.* "Mozilla/5.0 (compatible)"
193.238.230.* "Mozilla/5.0"
They crawl from HostWay in France using the IPs 193.238.230.109-138 as far as I know.
The DNS for those IPs all resolves to optioncarriere, except one which resolved to careerjet.
Hostway info:
inetnum: 193.238.228.0 - 193.238.231.255
descr: Hostway
country: FR
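For anyone who wants to deny that whole allocation rather than chase individual addresses, the inetnum above (193.238.228.0 - 193.238.231.255) is the single CIDR block 193.238.228.0/22. A minimal sketch of the range check, using Python's standard `ipaddress` module (the `is_blocked` helper name is mine, not from any particular server software):

```python
import ipaddress

# The HostWay allocation quoted above: 193.238.228.0 - 193.238.231.255
BLOCKED_NET = ipaddress.ip_network("193.238.228.0/22")

def is_blocked(ip: str) -> bool:
    # Containment test covers the whole /22, including 193.238.230.109-138
    return ipaddress.ip_address(ip) in BLOCKED_NET

print(is_blocked("193.238.230.120"))  # inside the crawler's range -> True
print(is_blocked("8.8.8.8"))          # outside -> False
```

The same /22 can of course go straight into a firewall or deny list instead of application code.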
That's all I have for now.
[edited by: incrediBILL at 12:34 am (utc) on April 13, 2008]
But in this case this lamer just crawls my homepage; it doesn't fetch robots.txt, it just tries to crawl. And all it gets is a nice 403 and no content.
I first started seeing it come from this range on 02-23-2007 with "libwww-perl/5.803" as the User-Agent. The only other User-Agent I've seen from this range is "Mozilla/5.0+(compatible)", and that was back on 12-02-2007. The "libwww-perl/5.803" requests are still ongoing but are blocked.
I block it, and so should you if you can. "libwww-perl/5.803" is a pest User-Agent; blocking it altogether will save you problems and grief.
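Since the versions rotate (5.803, 5.805, ...), the safest rule is a substring match on "libwww-perl" rather than an exact string. A minimal sketch, assuming you can inspect the User-Agent header in your application (the function name is hypothetical):

```python
def is_pest_ua(user_agent: str) -> bool:
    # Case-insensitive substring match: "libwww-perl/5.803",
    # "libwww-perl/5.805", and any future version all hit the same rule.
    return "libwww-perl" in user_agent.lower()

print(is_pest_ua("libwww-perl/5.803"))  # True -> serve a 403
```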
My widgets are focused on the North American market.
RIPE, APNIC (except some of the Oceania ranges), LACNIC, JPNIC, AFRINIC, and any other non-North-American ranges that I'm able to differentiate are denied.
I do make exceptions for specific ranges, should a widget-industry person refer someone to me from a denied registry.
Unfortunately, North American Internet providers and their policies of assigning focused ranges are far different from the policies of other countries and continents.
Thus I explain to these folks that adjustments may not work in five minutes, and that a fixed IP range (which I've been told is expensive in Europe) is the long-term solution.
Don
Most abhor this
Surprisingly I don't, and I'm torn on many aspects of international blocking.
When I was running ecommerce sites I too only accepted US and Canadian orders so I didn't need to block them specifically because the checkout page only had US and Canada as billing and delivery options. However, to combat Internet fraud I did block all other countries from gaining access to the checkout page.
My current main site is international, so I don't really block any country, but I have been forced to put up CAPTCHAs on first contact from a couple of countries with the worst automated scraper abuse.
Furthermore, I don't firewall entire countries, because the problems I have are with scrapers and spammers, so I selectively lock certain IPs out of SMTP or HTTP access only, based on the abuse.
It's rather complicated, wish I could be as straightforward as you are and just block the planet ;)
What about just "Mozilla/5.0" and "Mozilla/5.0 (compatible)" any humans using those?
I see "Mozilla/5.0 (compatible)" used often by filtering corporate networks, where the corporate firewall/proxy server sees a request for a website it doesn't know and downloads the page using that user-agent. Then, if the page passes all its rules, it allows the original request to go through.
As for "Mozilla/5.0", just ban it; no major service that I know of uses that user-agent for anything other than scraping nowadays.
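The key point when banning these is to match the generic strings exactly, since real browsers always append platform details after "Mozilla/5.0". A minimal sketch of that exact-match idea (function name is mine; whether to block "(compatible)" is your call given the corporate-proxy caveat above):

```python
# Generic UAs quoted in this thread; exact strings only, never substrings,
# so real browser UAs like "Mozilla/5.0 (Windows NT ...) ..." are untouched.
GENERIC_UAS = {"Mozilla/5.0", "Mozilla/5.0 (compatible)"}

def is_generic_ua(user_agent: str) -> bool:
    return user_agent.strip() in GENERIC_UAS

print(is_generic_ua("Mozilla/5.0"))  # True -> candidate for blocking
```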
I am not sure why they do not use the same UA as the original request; that has always stumped me too. I think it's a default in the software: when a request doesn't match any of the known blacklist topics, it fetches the content again, and since it doesn't save the original UA that requested the page, it has to use a stand-in UA in its place. It then sends the content to the software maker's company to be reviewed so it can be classified properly. Updates to the blacklists usually happen daily, sometimes more often. But the software is designed to be transparent, and classification as good or bad may take some time, so it makes another request and lets the original through.