"Referrer equals request URL" bot

Identify a bot where the referrer repeats the requested URL

btherl

9:50 pm on Oct 27, 2011 (gmt 0)

Hi,

There is a particular class of bot traffic that originates from dynamic IPs around the world, and which has the following characteristics:

1. The referrer is always the same as the URL requested
2. The user agent changes on each request, and occasionally includes obsolete user agents such as "Mozilla/0.91 Beta (Windows)" and "Mozilla/0.6 Beta (Windows)". They appear to be picked randomly from a pre-defined list.
3. The IP address is a dynamic IP, commonly RU, UA, DE and FR but occurring in many countries
4. The domain requested is different in each request.

The pattern is obvious on multiple domains but very difficult to catch on a single domain. Requests frequently include a query string, and this is always repeated identically in the referer.
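
For anyone who wants to check their own logs for this, here is a minimal Python sketch of the test described above. It assumes you have already parsed each log line into the Host header, the request target (path plus query string) and the Referer; the function names and the record keys (`ip`, `host`, `target`, `referer`) are made up for illustration and are not from any particular log tool.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def referer_repeats_request(host, target, referer):
    """True when the Referer is the very URL being requested (URI=REF)."""
    if not referer or referer == "-":
        return False  # no Referer sent, or the log's placeholder for "none"
    try:
        ref = urlsplit(referer)
    except ValueError:
        return False  # malformed Referer; not the pattern we are looking for
    path, _, query = target.partition("?")
    return (
        ref.netloc.lower() == host.lower()      # host names are case-insensitive
        and (ref.path or "/") == (path or "/")  # treat an empty path as "/"
        and ref.query == query                  # query string must repeat exactly
    )


def domains_per_ip(records):
    """Map each client IP to the set of domains it hit with URI=REF."""
    seen = defaultdict(set)
    for rec in records:
        if referer_repeats_request(rec["host"], rec["target"], rec["referer"]):
            seen[rec["ip"]].add(rec["host"].lower())
    return dict(seen)
```

Feeding `domains_per_ip` the combined records from several sites is one way to surface the cross-domain pattern, since a single flagged hit on one domain proves very little on its own.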

Is anyone familiar with this and can shed some light on it? The URLs tend to suggest forums, guestbooks and comment pages, so I suspect it is a forum spamming package.

Pfui

12:35 am on Oct 28, 2011 (gmt 0)

Whatever it is, it's virulent, and cancerous. I first started seeing the fake URI=REF in early 2008 on one site with two message boards, on just one board. Since then it's 'expanded' to regular .html pages in almost every directory, and from just a few IPs in as many weeks to scores of IPs and hundreds of URI=REF hits every single day.

There are two reasons why I've not brought it up here:

1.) The less bot-runners know how we spot them, the better. Plus the program, whatever it is (and I'm not sure it's just one) is more spam harvester/comment spammer/whatever than a search engine spider/bot per se.

2.) There's not one bot or bot-runner to block. The compromised IPs -- dynamic AND fixed -- are bot-running real and fake UAs (some just a few; some many scores) as indicated by Project Honey Pot (PHP) [projecthoneypot.org...] and Stop Forum Spam [stopforumspam.com...] records.

For example:

PHP Threat Level 42: [projecthoneypot.org...]
PHP Threat Level 48 (5 wks ago: 47): [projecthoneypot.org...]

Because of #1, I'm reluctant to go into any further detail about how it works on-site. Suffice it to say that whatever the malignancy, it's tricky. I use mod_rewrite to combat the patterns and .htaccess (& a firewall) to combat the worst sources -- China; Russia; Ukraine; but yep, no one's immune.

If you're seeing it, good luck stopping it. I've had my fingers in the dike for years to no avail.

On a pattern-related note:

The ever-egregious TalkTalk [google.com...] employs the same URI=REF pattern with robots.txt and .html files. Except for the robots.txt requests, all are blocked.

FWIW...

It gives me the willies to see so many compromised machines marauding around, seemingly unchecked, independent of the usual exploits. Wherefore art thou "security" companies? Or perhaps 'It' has a name by now?

Pfui

12:54 am on Oct 28, 2011 (gmt 0)

P.S. to the OP

I let my obsession with fighting It precede my curiosity about It:)

"The domain requested is different in each request."

Could you explain this, please? Are you reading composite logs from multiple sites? Are there boards (or query string URIs) on the domains where you're seeing the requests?

"obvious on multiple domains but very difficult to catch on a single domain"

Interesting. My experience is the exact opposite. I only see it on the one domain, even though other IPs in the same Class C link to it (and it to them).

btherl

1:29 am on Oct 28, 2011 (gmt 0)

Thanks for your comments! Here is one example of mine; it looks like the same thing:

[projecthoneypot.org...]

I also won't say what I'm doing about it. It's a tricky one, but the nature of its activities will always give it away, even if they get rid of the obvious giveaways.

Thanks for pointing out TalkTalk - I found it using the same URI=REF pattern and the same UA and IP addresses as in the other thread about it here.

lucy24

1:48 am on Oct 28, 2011 (gmt 0)

By the usual yawn-provoking coincidence...

[webmasterworld.com...]

Yawp! Just remembered that I added a piece to my htaccess to address this one, but never got around to uploading it. Oops.

I've got a couple of files that seem especially popular, so the rule is just written for them:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com/$1
RewriteRule (fun/AlonzoMelissa|ebooks/\w+/\w+)\.html goaway.html [L]


Hope I got that in the right order. Putting $1 up there in the Condition made me very uneasy, but I think it's correct.

Oh, and sometimes they give my top-level Index page as referer. But only for pages that aren't linked from the index page.

btherl

2:41 am on Oct 28, 2011 (gmt 0)

I'm a bit suspicious of that RewriteCond line too. What is the $1 referring to?

Pfui

3:21 am on Oct 28, 2011 (gmt 0)

Aside: lucy's example is not applicable to dynamic query string URI=REF hits. (And please no code that might be because of #1, above.)

lucy24

4:06 am on Oct 28, 2011 (gmt 0)

"I'm a bit suspicious of that RewriteCond line too. What is the $1 referring to?"

The captured part in the Rule. It becomes a little less unnerving when you remember that the Rule is evaluated before the Conditions, so the capture has already taken place. All I can establish by my own testing is that requesting the file in the normal way doesn't make the server explode. I'll have to wait for the next robot to see if it works as intended.

"And please no code that might be because of #1, above."

It's the only form I've personally met. But you could easily adjust it for the query string-- especially if it's in a form that a real query would never have.

Staffa

2:08 pm on Oct 28, 2011 (gmt 0)

The IP numbers mentioned at Project Honey Pot are from OVH, a known host of unsavouries.

As an aside, just saw MJ12bot coming from another OVH range (94.23.42.135), good luck to them ;o)

dstiles

9:30 pm on Oct 28, 2011 (gmt 0)

Anything from ANY server farm - especially OVH - is automatically blocked here.

MJ12bot is also permanently blocked. I block all distributed bots. It's far too easy to forge them - and yes, I know Majestic thinks otherwise, but I'm just not playing that kind of game.

g1smd

10:10 pm on Oct 28, 2011 (gmt 0)

RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com/$1

It would be great if it worked that way, but unfortunately $1 can only be used "on the left".

RewriteCond  $1 <pattern>
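
Since backreferences are only expanded "on the left", one way around it, as far as I know, is to put both values into the test string and let the regular expression compare them with its own `\1` style backreference. The Python sketch below only illustrates that idea. The function name, the space separator and the exact pattern are assumptions for the example, and none of it has been tested in a RewriteCond.

```python
import re

# Test string layout (assumed for this sketch): "<referer> <host> <target>".
# Group 1 captures the Referer's host and group 2 its path plus query string;
# \1 and \2 then require the host and request target to repeat them exactly.
SAME_URL = re.compile(r"^https?://([^/\s]+)(/\S*) \1 \2$")


def same_url(referer, host, target):
    """True when the Referer is the URL made of host + request target."""
    test_string = f"{referer} {host} {target}"
    return SAME_URL.match(test_string) is not None
```

The comparison is case-sensitive and relies on the separator (a space here) never appearing inside the values, so anything carried over to a live server would need its own testing.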

lucy24

1:29 am on Oct 29, 2011 (gmt 0)

Phooey. But I can achieve the same thing using an extra RewriteCond that essentially duplicates the Rule, right?

RewriteCond %{REQUEST_URI} ^/(fun/AlonzoMelissa|ebooks/\w+/\w+)\.html
RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com/%1

The details of the Rule then become redundant, except that it saves the server from having to evaluate the Condition every single time.

I think I can count on the fingers of one hand the number of htaccess blunders that didn't lead to a prompt 500 error. If the Error Logs haven't gone berserk, I assume I'm home free.

Pfui

3:08 am on Oct 29, 2011 (gmt 0)

"Phooey."

Yes?

: )

g1smd

6:15 am on Oct 29, 2011 (gmt 0)

RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com/%1

Same applies to %1. It can only be used "on the left".

RewriteCond %1 <pattern>

lucy24

8:20 am on Oct 29, 2011 (gmt 0)

But, but, but, :: splutter ::

What's the difference?

[webmasterworld.com...]

:: uneasily looking around for approach of moderator with large, sharp scissors ::

g1smd

5:50 pm on Oct 29, 2011 (gmt 0)

The difference ... is that today I read every word of the question. :)