Forum Moderators: open
UA: Mozilla/5.0 (compatible; DKIMRepBot/1.0; +http://www.dkim-reputation.org)
IP: 85.10.242.nnn (same as web site)
rDNS: repsys.dkim-reputation.org
Robots.txt: NO
Bot page: not that I could find.
Contact details on web site: email: I haven't bothered as yet.
154 pages in just under 4 minutes - just about every page on the site. It also gobbled up the CSS but ignored the images and JavaScript. For some reason it missed the contact form - there MAY be an explanation for that (blocked because of the suspect UA - the request headers were completely blank) but I haven't checked; some sites block on that, some don't. It also requested a sub-folder of the site without the trailing "/" - no idea why, or where it got that URL from, since it wasn't following referers.
It read robots.txt first and I have trouble believing it made a decision based on that in under two seconds. In any case, it went ahead and took pages that were listed as forbidden in robots.txt. And since no-one could have foreseen the bot it's unlikely anyone would have blocked it specifically in any case.
DKIM is supposed to be an anti-spam system, so why is it scraping my web server? I have a suspicion but they would have to be really dumb if it were the reason!
[edited by: incrediBILL at 5:17 am (utc) on Mar. 28, 2009]
[edit reason] obscured IPs [/edit]
With a clock rate of 2 to 4 gigahertz, a CPU can make a yes/no decision in one-quarter to one-half of a billionth of a second. If it's a dual-core, it could make two independent decisions in that time.
And even allowing for a program requiring 1,000,000 sequential instructions (likely too high an estimate) to reach the *specific* decision you mentioned but didn't elaborate on, it would still take a 2 GHz single-core machine only 1/2000 of a second (0.0005 seconds) to make that decision.
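The back-of-envelope arithmetic above can be checked directly. This sketch assumes, as the post does, a 2 GHz single-core CPU retiring one instruction per cycle and a generous 1,000,000-instruction budget for the decision:

```python
# Timing sanity check: one cycle at 2 GHz, and the assumed
# 1,000,000-instruction budget for the robots.txt decision.
CLOCK_HZ = 2_000_000_000          # 2 GHz
INSTRUCTIONS = 1_000_000          # assumed upper bound for the decision

seconds_per_cycle = 1 / CLOCK_HZ  # half a billionth of a second
decision_time = INSTRUCTIONS / CLOCK_HZ

print(seconds_per_cycle)  # 5e-10 s per cycle
print(decision_time)      # 0.0005 s, i.e. 1/2000 of a second
```

So even the most pessimistic instruction count leaves the two-second window thousands of times longer than needed.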
> And since no-one could have foreseen the bot it's unlikely anyone would have blocked it specifically in any case.
iBill's chorus: Whitelist, whitelist, whitelist.
That UA isn't getting into any of my sites, and neither is anyone else hosted in that same server farm. And a Mozilla-compatible UA arriving with no other headers? --- forget it!
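The check described here - a UA that claims Mozilla compatibility but arrives with otherwise blank headers - can be sketched roughly like this (an illustrative example, not anyone's actual ruleset; the header list is an assumption about what real browsers typically send):

```python
# Flag a request whose UA claims to be Mozilla-compatible but which
# arrives without any of the headers a real browser normally sends.
def looks_like_fake_browser(user_agent: str, headers: dict) -> bool:
    """True if the UA claims Mozilla compatibility but the
    accompanying request headers are suspiciously bare."""
    expected = {"Accept", "Accept-Language", "Accept-Encoding"}
    claims_browser = user_agent.startswith("Mozilla/")
    sent = {h for h in expected if headers.get(h)}
    return claims_browser and not sent

ua = "Mozilla/5.0 (compatible; DKIMRepBot/1.0; +http://www.dkim-reputation.org)"
print(looks_like_fake_browser(ua, {}))                       # True: blank headers
print(looks_like_fake_browser(ua, {"Accept": "text/html"}))  # False
```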
> DKIM is supposed to be an anti-spam system, so why is it scraping my web server? I have a suspicion but they would have to be really dumb if it were the reason!
They probably are that dumb, but at least they told us in their User-agent string: Reputation management for sale, and we Webmasters get to pay the bandwidth costs and take the server performance hit so they can download our sites... No thanks, I decline. You're not on the whitelist. :)
Jim
On the other hand, they aren't being polite about it.
Jim: Yeah, I know. It still seems over-fast to me, especially as it doubtless had lots of other things on its mind as well, AND it totally ignored the findings anyway. Sort of, "Let's fool the idiots into thinking we're checking the file, then just do what we like."
As to robots.txt - it's far easier to trap UAs, headers and behaviours, especially as at least 99.9% of bots don't even bother to check. This one actually slipped through a ban on a technicality but got logged as "suspicious"; it and its kin are now banned.
Robots.txt says, "Please stay out." But you need a guy behind the door with a baseball bat to whack anyone who he doesn't recognize... "If yer name's not Dan, yar not comin' in!" So I don't rely on robots.txt for anything but its original stated purpose -- bandwidth reduction.
Jim
It's difficult to whitelist UAs since they vary considerably; it's easier to stop bad ones - and you can't whitelist or blacklist on UAs alone anyway. A significant number of entries in my logs show perfectly valid UAs but with serious other defects.
Remember I'm running a server for commercial enterprises: if there is a chance that a UA is good I have to let it through, other factors permitting - at least for a trial.
And if I got an email every time something violated either a white or black list I'd be knee-deep in emails. Logs - yes: extensively.
The cited UA would have tripped two filters and got nothing but a robots.txt saying "no".
Any other request would get a 403 (richly deserved).
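The "robots.txt says no, everything else gets a 403" policy for a client that has tripped the filters can be sketched as follows (a minimal illustration of the dispatch logic, not the poster's actual code):

```python
# A flagged client may still fetch robots.txt - and gets told to stay
# out - but every other request from it draws a 403.
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"

def respond(path: str, flagged: bool):
    """Return (status, body) for a request, given the client's flag."""
    if not flagged:
        return 200, None           # normal handling elsewhere
    if path == "/robots.txt":
        return 200, DISALLOW_ALL   # a robots.txt saying "no"
    return 403, None               # richly deserved

print(respond("/robots.txt", flagged=True))   # (200, disallow-all body)
print(respond("/index.html", flagged=True))   # (403, None)
```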
Whitelists are not the only tool in the box, but as noted above they deal with a significant number of nuisances very easily - and more importantly, they do so automatically.
...
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Both of these are dubious: some hits are fine, others are very bad indeed. It requires more than just whitelisting UAs.
Bots are easy when they trail telltales such as "nutch" behind them, but what about UAs that lack a leading "Mozilla", which can be perfectly valid for some browsers? The more common browser UAs are far more insidious and can very easily wipe a site if not parsed correctly.
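The distinction above - obvious bot tokens are easy, a missing "Mozilla/" prefix is only a weak signal, and common browser UAs prove nothing by themselves - might be sketched like this (the token list is illustrative, not exhaustive):

```python
# Rough first-pass UA triage. A telltale token means "bot"; a missing
# "Mozilla/" prefix is merely unusual (valid for some browsers); a
# browser-shaped UA proves nothing and needs other checks entirely.
BOT_TOKENS = ("nutch", "crawler", "spider", "bot")

def classify_ua(ua: str) -> str:
    lowered = ua.lower()
    if any(tok in lowered for tok in BOT_TOKENS):
        return "bot"        # trails a telltale behind it
    if not ua.startswith("Mozilla/"):
        return "suspect"    # unusual, but valid for some browsers
    return "unknown"        # looks like a browser; needs other signals

print(classify_ua("NutchCVS/0.8"))                        # bot
print(classify_ua("Opera/9.27 (Windows NT 5.1; U; en)"))  # suspect
print(classify_ua("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))  # unknown
```

Note that the "unknown" bucket is exactly where the two dubious MSIE strings above land, which is the point: UA parsing alone cannot settle them.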
None of which, of course, has anything to do with my original posting, which was intended to warn of what I think is new scraping activity by a source that, to my mind, should not be scraping at all.
I may give up reporting these, since this is by no means the first time I've been hit with "why don't you whitelist" propaganda. No one here knows my exact methods, and I'm not going to explain further beyond saying that I mix techniques: black, white and shades of grey are all included in appropriate degree. My methods work well for the platform I'm using and the type of sites I maintain, and that's the important point.
BTW, I wasn't referring specifically to iBill's whitelist above; I was referring to the *technique* of whitelisting in general. A basic tenet of computer security is not to decide which behaviors to block, but rather to decide which to allow. The various posts here at WebmasterWorld dealing with SQL query injection cover this subject fairly well.
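That allow-don't-block tenet, applied to input validation (a hypothetical example in the spirit of those SQL-injection threads, not taken from them), looks like this: rather than enumerating bad characters, accept only inputs matching a known-good pattern.

```python
# Whitelist validation: define what IS allowed and reject everything
# else, instead of trying to enumerate every dangerous input.
import re

USERNAME_OK = re.compile(r"[A-Za-z0-9_]{1,32}")  # the whitelist

def is_valid_username(name: str) -> bool:
    """True only if the whole string matches the allowed pattern."""
    return bool(USERNAME_OK.fullmatch(name))

print(is_valid_username("jdMorgan"))      # True
print(is_valid_username("x' OR '1'='1"))  # False: not on the whitelist
```

An injection attempt never needs to be recognized as such; it simply fails to match the pattern of anything that was ever allowed.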
Jim
[edited by: jdMorgan at 9:22 pm (utc) on Mar. 30, 2009]