Forum Moderators: open
So the repeat offender shown below, 65.214.39.180, may now appear as:
180.39.214.65.in-addr.arpa
?
Because Ask/Jeeves has been sneaking in via IP for a while now. For example, I have notes to myself with info dating back to last December:
65.214.39.180
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)
12/09 19:20:17 /dir/file.html 403 -
65.214.39.180
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)
01/03 11:32:56 /dir/file.html 403 -
01/03 19:05:13 /dir/file.html 403 -
65.214.39.180
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)
04/06 12:26:47 /dir/file.html 403 -
Always different directories and filenames, always the same IP and UA.
Never robots.txt, never proper ID. Thus the 403s.
P.S.
I'm sure there's a shorter, more elegant way to do this (waves to Jim), but here's what I've been using re numeric Jeeves:
## ASK JEEVES
RewriteCond %{REMOTE_ADDR} ^65\.214\.36\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.214\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.214\.38\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.214\.39\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.29\.50\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.75\.152\.120$
RewriteRule ^.*$ - [F]
(If they are using in-addr.arpa now, I'll have to figure out how to rewrite a range of ^line-starters. Aw, shoot.)
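If they do start showing up under in-addr.arpa-style names, one option is to match the resolved hostname instead of the raw address. A rough, untested sketch; note that %{REMOTE_HOST} makes mod_rewrite do a DNS lookup per request, and the directhit.com/ask.com patterns here are just guesses at their reverse-DNS domains:

```apache
## ASK JEEVES -- match by reverse-DNS hostname (untested sketch)
## %{REMOTE_HOST} forces a DNS lookup per request when
## HostnameLookups is Off, so this costs more than an IP match.
RewriteCond %{REMOTE_HOST} \.directhit\.com$ [NC,OR]
RewriteCond %{REMOTE_HOST} \.ask\.com$ [NC]
RewriteRule .* - [F]
```

As I understand it, a failed reverse lookup leaves %{REMOTE_HOST} as the bare IP, so numeric rules are still worth keeping alongside this.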
RewriteCond %{REMOTE_ADDR} ^65\.214\.(3[6-9])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.29\.50\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.75\.1(2[89]|[3-8][0-9]|9[01])\.
RewriteRule .*$ - [F]
BTW, the "$" ("ends with") is only used if the pattern contains the fourth ("Class D") octet, i.e. a complete address.
And no ^ needed, eh? Okay.
RewriteRule .*$ - [F]
My rewrites run over 800 lines and they all function with that ending.
I seem to recall using the ^ in the beginning and then later seeing Jim mention that it was redundant and not necessary.
edited by wilderness.
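To make the anchor question concrete, here's an untested side-by-side of where the "$" belongs and where it doesn't:

```apache
## Prefix match (a whole range): no "$" -- the address
## keeps going after the dot.
RewriteCond %{REMOTE_ADDR} ^65\.214\.36\. [OR]
## Exact match (one complete address): "$" pins the end.
RewriteCond %{REMOTE_ADDR} ^65\.75\.152\.120$
RewriteRule .* - [F]
## Beware: ^65\.214\.36\.$ matches NOTHING, because it demands
## the string end right after the trailing dot.
```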
BTW, you do need to fix the line where the forum breaks the pipe character.
RewriteCond %{REMOTE_ADDR} ^65\.214\.(3[6-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.29\.50\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.75\.1(2[89]|[3-8][0-9]|9[01])\.
RewriteRule .*$ - [F]
Should read:
RewriteCond %{REMOTE_ADDR} ^65\.214\.(3[6-9])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.29\.50\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.75\.1(2[89]|[3-8][0-9]|9[01])\.
RewriteRule .*$ - [F]
The forum breaks the pipe character, so that line requires correction.
It identified itself as a crawler too.
Just curious why you're all posting how to block Ask/Teoma, as I get traffic from them. Not as much as from Google/Yahoo/MSN, but they seem to be picking up steam with all the new advertising they're doing. Blocking probably isn't wise at this juncture unless you want your competitors to get that traffic instead of you.
I don't know if anyone else is blocking A/J/T solely by IP besides me. When they use their proper name(s) and ask for robots.txt, then everything's A-OK.
No traffic to speak of from Ask.com -- not yet, anyway. Even with one of their descriptions.
P.S.
LOL. We're all here, posting/replying within minutes of each other... Who brought the beer? :)
My rewrites run over 800 lines and they all function with that ending.
Not meaning to rain on your parade, but chasing them in this manner is an endless task, and there are botnets out there that use random gibberish user agents, which means any list you build is basically meaningless against the worst of them.
Here's a sample of what I'm talking about:
209.190.21. bedmdFjkFhc4a noFjajakffieapvngdtpwxk
209.190.21. gdouk6Ss6nnykg66hvojc6txjsecuu
209.190.21. aphErvbtijj vulgctlslo
209.190.21. jgbhwntsdlprxcwogijI8orrw b8
209.190.21. DrbspcgyubxrpeikfiihxD mh
209.190.21. jvAhnviAjwwud8gymvewtcqhehgbAcytyqdxq
209.190.21. cvwkvl6kfujhqlujqblFl dffrepmrxdspmdFjq
To help thwart this type of nonsense I posted a sample of what an opt-in whitelist might look like in the Apache forum a while back:
[webmasterworld.com...]
The beauty of that approach is you spend ZERO time blocking bots, which is endless, and only ADD the new bots you want to give access, which is very minimal.
I'm using something similar on my site, but it's dynamic server-side scripting, not an .htaccess file, which is why I tossed out sample untested code just to give people an idea how I might go about it with Apache.
The reason I don't bother with blocking in Apache is that it won't even stop the worst of the problem, which is stealth-mode bots; a script is the only way to identify and intercept them in real time, so I just perform the user-agent blocking there as well.
[edited by: incrediBILL at 11:12 pm (utc) on June 11, 2006]
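For anyone who wants to try the opt-in idea in plain .htaccess rather than a script, here's a rough, untested sketch; the agents listed are placeholders, not a vetted whitelist:

```apache
## Opt-in whitelist sketch (untested): forbid any user agent
## that is neither a mainstream browser string nor one of the
## few bots you've invited. Adjust both lists to taste.
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot|Teoma) [NC]
RewriteRule .* - [F]
```

Stealth bots that claim to be a browser still slip through, which is exactly the part only a server-side script can catch.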
bilfw02-3.directhit.com
I'll never understand (or worse, appreciate) this method.
The original IP (65.214.36.73) and range are in the first post; that was just for edification, to show the source was directhit.com, which is owned by Ask.
However, I did notice I glossed over the user agent they used which was bizarre:
"teoma_agent1"
Not meaning to rain on your parade, but chasing them in this manner is an endless task
[snip]
I do appreciate your concern and offer of assistance; however, bots and unidentified crawlers are only a small portion of my rewrites.
I'll give you an example:
I have a sub-folder on one of my sites that utilizes frames to display bio paragraphs on inductees in the body frame. The next two columns in the body frame contain images.
The top frame has A-Z. The left frame has the site map.
The sub-folder has its own FAQ, which visitors rarely read. The first page of the FAQ offers a download of the entire bios as text.
One day I may get up on the "wrong side of the bed" and deny a range because a visitor didn't read the FAQ and download the text, instead eating up my bandwidth by going from A to Z and saving the pages.
Additionally, there are organizations which have nothing to do with bots, but are interested in the material on my pages, and are denied for that reason.
I'm glad that your white-list works for you.
For me?
What if the next visitor is willing to provide research fees to enhance my goal of archival?
The visitors who come to my pages (or are allowed) are rewarded with highly on-topic material on a subject they are interested in, and of which there is very little quality or depth anywhere on the internet (they CANNOT get it anywhere else). (The majority of the pages are NOT cached.)
I'm glad that your white-list works for you.
For me?
What if the next visitor is willing to provide research fees to enhance my goal of archival?
When it comes to bots I guess I'm not seeing how the whitelist conflicts with your other goals, as it just simplifies the usual bot war to non-existent.
Search engines and end users get in without a problem, and you could still deny specific individuals and referrals, which I do. However, some person aiming an offline downloader you've never heard of at your site would get stopped, unless it's a stealth crawler/bot that claims to be a browser.
The last batch of stealth bots is a whole new thread, or forum, unto itself.
FYI, most people think I'm "bot blocking," which is wrong; I'm "bot including," which is why I research all the new stuff my scripts identify to see whether I should let them in or not. If it's anything of value, I open a door for them; if it's a leech like all the aggregator sites, I keep them out, as they didn't get anything in the first place.