Forum Moderators: open


Jakarta Commons-HttpClient/3.0-rc1

Anyone seeing this?


The Contractor

2:13 pm on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would venture to say, from its habits, that it's being used to scrape/include content. I looked at the Apache documentation quickly and wasn't sure whether I should post here or in the Apache forum.

GaryK

5:34 pm on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've had it banned for ages. I forgot to make a note of why I banned it, but usually it's because it didn't read or respect robots.txt.

The Contractor

5:56 pm on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



thanks, that's what I've done

mcneely

11:55 pm on Aug 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Tried this
RewriteCond %{HTTP_USER_AGENT} ^Jakarta

but doesn't seem to work

I've been having to resort to banning it by IP.

wilderness

1:43 am on Aug 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"but doesn't seem to work"

If that is the ONLY RewriteCond, or the last one before your RewriteRule, then it will work fine.

If you have other RewriteCond lines following it, you're missing the [OR]:

RewriteCond %{HTTP_USER_AGENT} ^jakarta [OR]
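To make the structure concrete, here is a minimal sketch of how such conditions fit together in an .htaccess file; the second UA pattern is just a placeholder for illustration, not something from this thread:

```apache
RewriteEngine On

# Every condition except the last carries [OR];
# the conditions apply only to the RewriteRule immediately below them.
RewriteCond %{HTTP_USER_AGENT} ^jakarta [OR]
RewriteCond %{HTTP_USER_AGENT} ^someotherbot

# Return 403 Forbidden for any matching user agent
RewriteRule .* - [F]
```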

mcneely

12:28 pm on Aug 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is just one of many, and the [OR] is present in the .htaccess because it is nestled among all of the many other creepy crawlies that I don't allow.

I've looked the whole of my file through to be sure that everything is space delimited and it appears to be as it should be.

I thought that since I had the "^" I wouldn't need to include the "_Commons" part of it...
Should I include that as well then?
Or should I write it to include [NC]?... or not?
I thought maybe the [NC] wasn't necessary if I had the "^", as none of my others have the [NC], just the "^".

This Jakarta is something that has only recently been coming around. I added the .htaccess rule after I realised that robots.txt didn't work... then, after noticing that the .htaccess might not be doing the trick, I began to deny the IP 64.94.163.*** before I finally started logging the 403s for it.

volatilegx

1:06 pm on Aug 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "^" (caret) matches the start of a string, so ^Jakarta means any string starting with "Jakarta". You could leave off the "^" to match "Jakarta" anywhere in the string.
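A quick sketch of the difference (the UA strings in the comments are illustrative examples, not logged values):

```apache
# Anchored: matches only UAs that BEGIN with "Jakarta",
# e.g. "Jakarta Commons-HttpClient/3.0-rc1"
RewriteCond %{HTTP_USER_AGENT} ^Jakarta

# Unanchored: matches "Jakarta" anywhere in the UA,
# e.g. "SomeTool/1.0 (Jakarta Commons-HttpClient)"
RewriteCond %{HTTP_USER_AGENT} Jakarta
```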

mcneely

4:14 pm on Aug 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is what I thought

I'll leave it be and just watch it for a bit.

Chances are that I missed something along the way; I'll be leaving the "^" in. Could be the .htaccess is doing its job as it ought, and I'm just not paying close enough attention to it.

jd01

2:43 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RewriteCond %{HTTP_USER_AGENT} ^jakarta [OR]

Rewrite Info

Conditions only affect the rule immediately following them.
[NC] = No Case : to match jakarta OR Jakarta OR JakaRta, use [NC]
^ = Beginning of a Line : independent of [NC] : without [NC], if the UA does not begin with a lowercase j, nothing will be blocked by this condition.
$ = End of a Line : not mentioned above, but good to know.

I always use this style:

RewriteCond %{HTTP_USER_AGENT} jakarta [NC,OR]

If I want someone blocked I do not want them throwing a character before the name or changing the case of a character to be all they have to do to get by.

You can also combine and shorten for efficiency

E.g. I don't know of any user agents starting with or containing "jaka" that I want to let through, so there's no point in checking for the whole string... "jaka" is enough to know I don't really need them on the site.

Also, if I were blocking jaka and joke, there would be no point in two rules. Of course, sometimes I block a couple of extras this way because of how the patterns work out, but as long as they are not real user agents there is not too much concern. In this example, I also block the UAs jake and joka:

RewriteCond %{HTTP_USER_AGENT} j(a¦o)k(a¦e)

This blocks j followed by an a or o (ja or jo), followed by a k (jak or jok), followed by an a or an e, so jaka, joke, joka, and jake are all blocked.
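Note that the broken bar "¦" above is just how the forum displays the regex alternation character; in an actual .htaccess file you would type the ordinary pipe:

```apache
# Blocks jaka, joke, jake, and joka in any case, anywhere in the UA
RewriteCond %{HTTP_USER_AGENT} j(a|o)k(a|e) [NC,OR]
```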

Hope this helps the rewriters.

Justin

mcneely

5:47 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So all of mine are written this way:

RewriteCond %{HTTP_USER_AGENT} ^jakarta [OR]

and the last one doesn't contain the [OR]

If I get you right, I add the [NC,OR] to eliminate any possibility of an upper / lower case letter change.

Basically, what I have written only targets that exact name and no other variations?

If eliminating the potential for variations to get through means that I've got to do a bit of rewriting, well then, I'm all up for that.

wilderness

11:54 am on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For reference, the first page of the search (below) shows that ALL of them use lower case in the UA:

[google.com...]

jd01

8:29 pm on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I get you right, I add the [NC,OR] to eliminate any possibility of an upper / lower case letter change.

Basically what I have written only targets that exact name and no other variations?

Yes and Yes and the exact name at the beginning of the UA string. To block the name anywhere in the UA string, remove the ^ from the beginning of the line. Not having the OR on the last line is correct.

Here are a couple of lines from my file:

# Web followed by any of the strings
RewriteCond %{HTTP_USER_AGENT} Web(Account¦apt¦opier¦ank¦hack¦trip¦ip¦ter¦andit) [NC,OR]

# Wget
RewriteCond %{HTTP_USER_AGENT} Wget [NC,OR]

# Begins exactly with User-Agent
RewriteCond %{HTTP_USER_AGENT} ^User-Agent [OR]

# Xenu
RewriteCond %{HTTP_USER_AGENT} Xenu [NC,OR]

# robot or abot that is not gigablast, gigabot, walhello - these two conditions must be in this order, and the first must not carry an [OR], so an implied AND joins them
RewriteCond %{HTTP_USER_AGENT} (Ro¦a)bot [NC]
RewriteCond %{HTTP_USER_AGENT} !(Giga(blast¦bot)¦walhello) [NC]
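Written with ordinary pipes instead of the forum's "¦", and with the rule the pair would feed into (the RewriteRule line is an assumption, not quoted from Justin's file), the implied-AND pattern looks like this:

```apache
# Condition 1: UA contains "robot" or "abot"; no [OR], so an
# implied AND joins it to the next condition
RewriteCond %{HTTP_USER_AGENT} (Ro|a)bot [NC]

# Condition 2: AND the UA does NOT contain gigablast, gigabot, or walhello
RewriteCond %{HTTP_USER_AGENT} !(Giga(blast|bot)|walhello) [NC]

# Fires only when both conditions hold
RewriteRule .* - [F]
```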

Hope this helps.

Justin

wilderness

9:01 pm on Aug 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



# Web followed by any of the strings
RewriteCond %{HTTP_USER_AGENT} Web(Account¦apt¦opier¦ank¦hack¦trip¦ip¦ter¦andit) [NC,OR]

Lot of extra crap for nothing when

RewriteCond %{HTTP_USER_AGENT} ^Web [NC,OR]

works as well.

What ever floats your boat :)

jd01

12:51 am on Aug 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am not as spider savvy as you, and did not want to accidentally block a good one, so I thought it better to be specific. =)

I know mod_rewrite - I am spider illiterate.

Justin

wilderness

2:53 am on Aug 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



did not want to accidentally block a good one

Hey Justin,

Nothing good begins with "Web" or contains the word "spider" ;)

Although I have noticed an old SetEnv on spider catching the Lycos mod spider in the last week. It's the only exception I've seen on the word.
(Before this recent round of activity from Lycos, I'm unable to recall when the last time I saw their bot was active.)

Don

jd01

6:40 am on Aug 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cool. Thanks, Don...

Justin

Riamus

8:37 pm on Aug 26, 2005 (gmt 0)

10+ Year Member



I'm wondering about this one as I just saw it myself...

From what I've seen online, this isn't a spider or a bot, but a wrapper for Java developers. I suppose someone could wrap a bot/spider in it, but I'd think that people can use it in normal ways as well and just be browsing a site.

Or, am I wrong? I personally don't care about bad bots unless they are actually causing problems. And, if this can be a legitimate person viewing the site, I definitely don't want them banned.

jdMorgan

5:41 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> but I'd think that people can use it in normal ways as well and just be browsing a site.

Normal people browsing a site have no need for a Java wrapper. They use a regular browser. The Java wrapper strongly implies that it is a Java program being used to fetch pages from our sites. And a program that fetches pages is what we call a robot or a spider. If it checks and obeys robots.txt, that might be OK, but I've never seen one do it.

The problem with a lot of library functions and open source robots is that they're written by programmers who are utterly naive to the abuse that takes place on the Web. They provide very powerful tools for both good and bad.

But when all the Webmaster sees is the bad, then the tool is widely-banned, which makes it useless for the good. So, without an effective and enforceable terms-of-use agreement for all users, and enforcement of same, many many initially-useful tools end up being useless because the abuse greatly exceeds the good use, and they end up banned.

Look at Indy Library. It's just a useful HTTP functions library that anyone might have a good use for. But it's so widely banned because of abuse that it's now unusable. Same with LWP-simple and many others.

Jim