DKIMRepBot

dkim-reputation.org

dstiles

2:20 am on Mar 28, 2009 (gmt 0)

This ripped through one of my sites today - and I do mean ripped!

UA: Mozilla/5.0 (compatible; DKIMRepBot/1.0; +http://www.dkim-reputation.org)
IP: 85.10.242.nnn (same as web site)
rDNS: repsys.dkim-reputation.org
Robots.txt: NO
Bot page: not that I could find.
Contact details on web site: an email address - I haven't bothered contacting them as yet.

154 pages in just under 4 minutes - just about every page on the site. It also gobbled up the CSS but ignored the images and javascript. For some reason it missed the contact form - there MAY be an explanation for that (blocked because of the suspect UA with completely blank headers) but I haven't checked; some of my sites block on that, some don't. It also requested a sub-folder of the site without the trailing "/" - no idea why or where it got that from, since it wasn't following referers.

It read robots.txt first and I have trouble believing it made a decision based on that in under two seconds. In any case, it went ahead and took pages that were listed as forbidden in robots.txt. And since no-one could have foreseen the bot it's unlikely anyone would have blocked it specifically in any case.

DKIM is supposed to be an anti-spam system, so why is it scraping my web server? I have a suspicion but they would have to be really dumb if it were the reason!

[edited by: incrediBILL at 5:17 am (utc) on Mar. 28, 2009]
[edit reason] obscured IPs [/edit]

GaryK

2:40 am on Mar 29, 2009 (gmt 0)

It's late and I'm not feeling too bright. So I'll ask. What's your suspicion?

jdMorgan

3:02 am on Mar 29, 2009 (gmt 0)

> It read robots.txt first and I have trouble believing it made a decision based on that in under two seconds.

With a clock rate of 2 to 4 gigahertz, a CPU can make a yes/no decision in one-quarter to one-half of one billionth of a second. If it's a dual-core, it could make two non-co-dependent decisions in that time.

And even allowing for a program needing 1,000,000 sequential instructions (which is likely too high an estimate) to reach the *specific* decision you mentioned but didn't elaborate on, it would still only take a 2 GHz single-core machine 1/2000 of one second (0.0005 seconds) to make that decision.

> And since no-one could have foreseen the bot it's unlikely anyone would have blocked it specifically in any case.

iBill's chorus: Whitelist, whitelist, whitelist.
That UA isn't getting into any of my sites, and neither is anyone else hosted in that same server farm. And no headers with a Mozilla-compatible UA? --- forget it!
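
Something along these lines in .htaccess would handle both - a rough, untested sketch, assuming Apache mod_rewrite:

RewriteEngine On
# Refuse the reported bot UA outright
RewriteCond %{HTTP_USER_AGENT} DKIMRepBot [NC]
RewriteRule .* - [F]
# Refuse anything claiming Mozilla compatibility that sends no Accept header at all
RewriteCond %{HTTP_USER_AGENT} ^Mozilla [NC]
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* - [F]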

> DKIM is supposed to be an anti-spam system, so why is it scraping my web server? I have a suspicion but they would have to be really dumb if it were the reason!

They probably are that dumb, but at least they told us in their User-agent string: Reputation management for sale, and we Webmasters get to pay the bandwidth costs and take the server performance hit so they can download our sites... No thanks, I decline. You're not on the whitelist. :)

Jim

dstiles

3:30 am on Mar 29, 2009 (gmt 0)

Gary: The site sells things allied to the sort of thing that gets advertised in spam, but nowhere near the same thing. I wonder if they are trying to discover rogue sellers.

On the other hand, they aren't being polite about it.

Jim: Yeah, I know. Still seems over-fast to me, especially as it doubtless had lots of other things on its mind as well, AND it totally ignored the findings anyway. Sort of, "Let's fool the idiots into thinking we're checking the file, then just do what we like."

As to robots.txt - it's far easier to trap UAs, headers and behaviours, especially as at least 99.9% of bots don't even bother to check. This one actually slipped through a ban on a technicality but got logged as "suspicious"; it and its kin are now banned.

jdMorgan

6:48 pm on Mar 29, 2009 (gmt 0)

I didn't mention robots.txt. I'm suggesting you don't "trap UAs and headers," but rather, allow only certain UAs and headers. The whole "catch a bad guy" thing leaves you closing the barn door after the horses have already run out. A whitelist-based approach means you decide what you will *allow* to enter your site, and anything else is shown a short, general, and somewhat-misleading-but-polite page about "Something's wrong, please try again later." Meanwhile, your server can send you an e-mail or specifically log the attempt, in order to make it easy for you to decide whether to allow that user-agent or HTTP header discrepancy, and modify your filters to suit.
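
As a rough, untested illustration (assuming Apache mod_rewrite; the allow-list and the page name are only placeholders, not anyone's real rules):

RewriteEngine On
# Anything not on the short allow-list below gets the generic "try again later" page
# (e-mail notification or special logging of these refusals is left out of this sketch)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[45]\.0\ \(
RewriteCond %{REQUEST_URI} !^/(robots\.txt|sorry\.html)$
RewriteRule .* /sorry.html [L]

The point is not the particular patterns but the shape of the logic: everything is refused unless it matches something you have explicitly decided to allow.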

Robots.txt says, "Please stay out." But you need a guy behind the door with a baseball bat to whack anyone who he doesn't recognize... "If yer name's not Dan, yar not comin' in!" So I don't rely on robots.txt for anything but its original stated purpose -- bandwidth reduction.

Jim

dstiles

10:08 pm on Mar 29, 2009 (gmt 0)

You cited Bill's white-list, which I understand INCLUDES a carefully crafted robots.txt. Please correct me if I misunderstood, Bill.

It's difficult to whitelist UAs since they vary considerably; it's easier to stop bad ones - and you can't whitelist or blacklist on UAs alone anyway. A significant number of entries in my logs are perfectly valid UAs with other serious defects.

Remember I'm running a server for commercial enterprises: if there is a chance that a UA is good I have to let it through, other factors permitting - at least for a trial.

And if I got an email every time something violated either a white or black list I'd be knee-deep in emails. Logs - yes: extensively.

tangor

10:23 pm on Mar 29, 2009 (gmt 0)

A general whitelist will take care of the SIGNIFICANT traffic. Minor blacklisting for the more egregious unwanted visitors is okay and usually manageable. But there comes a point of diminishing returns the closer one wishes to approach "zero". For me that point is somewhere around 5% that I don't catch, or let live, because <2000 requests a week from two dozen or so "bots" or bad actors is not "significant". I consider it a cost of doing business, just as large B&M stores factor shoplifting into their bottom line.

Samizdata

1:55 am on Mar 30, 2009 (gmt 0)

Another vote for whitelisting here.

The cited UA would have tripped two filters and got nothing but a robots.txt saying "no".

Any other request would get a 403 (richly deserved).

Whitelists are not the only tool in the box, but as noted above they deal with a significant number of nuisances very easily - and more importantly, they do so automatically.
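
By way of illustration only - an untested Apache mod_rewrite sketch, not my actual filters:

RewriteEngine On
# A UA that has tripped the filters (the cited bot, in this case) may still
# fetch robots.txt, but every other request draws a 403
RewriteCond %{HTTP_USER_AGENT} DKIMRepBot [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]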

...

dstiles

8:22 pm on Mar 30, 2009 (gmt 0)

And what would you do about, say:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Both of these are dubious: some hits are fine, others are very bad indeed. It requires more than just whitelisting UAs.

Bots are easy when they trail telltales behind them such as "nutch", but what about the ones that merely lack a leading "mozilla", which can be valid for some browsers? The more common browser UAs are far more insidious and can very easily rip a whole site if not parsed correctly.

None of which, of course, has anything to do with my original posting, which was intended to warn of what I think is new scraping activity from a source that, to my mind, should not be scraping at all.

I may give up reporting these, since this is by no means the first time I've been hit by "why don't you whitelist" propaganda. No one here knows my exact methods and I'm not going to explain further, beyond saying that I mix methods: black, white and shades of grey are all included in appropriate degree. My methods work well for the platform I'm using and the type of sites I maintain, and that's the important point.

jdMorgan

9:21 pm on Mar 30, 2009 (gmt 0)

I whitelist MSIE requests matching something like
^Mozilla/4\.0\ \(compatible;\ MSIE\ ([3-6]\.[0-9]{1,2}|6\.0)(;\ [^;)]+)*;\ Windows\ (NT\ (4\.0|5\.(01?|[12])|6\.0)|98;\ Win\ 9x\ 4\.90|98|95)(;\ [^;]+)*\)
and then further qualify any such whitelisted requests by checking that the various Accept headers are correct for MSIE 6.0.
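
In mod_rewrite terms that works out to something roughly like this - a simplified, untested sketch rather than my production rules:

RewriteEngine On
# Anything claiming to be MSIE must match the UA pattern, or it is refused
RewriteCond %{HTTP_USER_AGENT} MSIE [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/4\.0\ \(compatible;\ MSIE\ [3-6]\.
RewriteRule .* - [F]
# A real MSIE always sends an Accept header, so an empty one marks a fake
RewriteCond %{HTTP_USER_AGENT} MSIE [NC]
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* - [F]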

BTW, I wasn't referring specifically to iBIll's whitelist above, I was referring to the *technique* of whitelisting in general. A basic tenet of computer security is not to decide which behaviors to block, but rather to decide which to allow. The various posts here at WebmasterWorld dealing with SQL query injections cover this subject fairly well.

Jim

[edited by: jdMorgan at 9:22 pm (utc) on Mar. 30, 2009]