We have a huge archive of posts and a reliable sign of botnet [en.wikipedia.org] attacks is when the URI matches the referer (REF). For three-plus years I've watched an alarming (and alarmingly increasing) number of coordinated forum spambots [en.wikipedia.org] flail against blessed mod_rewrite and eat 403s, so much so that I now simply beat 'em off with: http://127.0.0.1/ [L,R=301].
(And a tip o' the hat to Jim Morgan for helping with the 'block direct GETs and PUTs w/ self-referrer & missing query string' coding.)
Today:
Below is a list of rapid-fire hits from "google.com" to html and cgi-generated files, ALL of which were URI=REF hits. In addition, ALL of the hits went to one directory where ALL bots are disallowed by robots.txt as well as by host and UA in root and per-directory .htaccess files. Alas, I don't have an IP because we do rDNS on the server.
Were these hits from any other host, they'd be 100% typical of a zombie [en.wikipedia.org]. Even a dusty 2005/2006 UA is typical. Actually, the number of URI=REF hits far exceeds typical zombie visits. And this isn't just any host...
Thoughts?
google.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Netscape/8.0.4
10/06 07:02:59
10/06 07:03:01
10/06 07:03:04
10/06 07:03:07
10/06 07:03:08
10/06 07:03:09
10/06 07:03:10
10/06 07:03:13
10/06 07:03:15
10/06 07:03:18
10/06 07:03:19
10/06 07:03:21
10/06 07:03:22
10/06 07:03:25
10/06 07:03:28
10/06 07:03:29
10/06 07:03:30
10/06 07:03:32
10/06 07:03:35
10/06 07:03:39
10/06 07:03:40
10/06 07:03:41
10/06 07:03:43
10/06 07:03:46
10/06 07:03:49
10/06 07:03:50
10/06 07:03:51
10/06 07:03:52
10/06 07:03:54
10/06 07:03:57
10/06 07:04:00
10/06 07:04:01
10/06 07:04:03
10/06 07:04:04
10/06 07:04:07
10/06 07:04:10
10/06 07:04:11
10/06 07:04:12
P.S. I'm now blocking ^google.com$ from the entire site.
*Here's an ELF excerpt of the URI=REF pattern:
google.com - - [06/Oct/2009:07:03:22 -0700] "GET /dir/file.html HTTP/1.0" 301 221 "http://www.example.com/dir/file.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Netscape/8.0.4"
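For anyone who wants to pull URI=REF hits like the one above out of a combined-format log offline, here is a minimal Python sketch. The regex, the function name `is_self_referred`, and the choice to compare only the referer's path and query string against the requested URI (ignoring the hostname) are my assumptions for illustration, not anything taken from the server config discussed in this thread.

```python
import re
from urllib.parse import urlsplit

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)(?: (?P<proto>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def is_self_referred(line):
    """Return True when the referer's path (plus query) equals the requested URI."""
    m = COMBINED.match(line)
    if m is None:
        return False  # not a combined-format line
    ref = urlsplit(m.group("referer"))
    if not ref.netloc:
        return False  # blank or "-" referer
    ref_uri = ref.path or "/"
    if ref.query:
        ref_uri += "?" + ref.query
    return ref_uri == m.group("uri")
```

Hostname comparison is left out on purpose in this sketch, so a referer from a different site with the same path would also be flagged; tighten that if it matters for your logs.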
Most likely someone/something abusing one of Google's proxy-like services (mobile translator, language translator, etc.), but it's impossible to be sure without IP addresses and additional HTTP header info.
Jim
Most likely someone/something abusing one of Google's proxy-like services
1.) Currently we use the "NCSA extended/combined log format" as our access_log config in httpd.conf:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
KEY (a.k.a. Apache 1.3: Module mod_log_config [httpd.apache.org])
%h: Remote host
%l: Remote logname (from identd, if supplied)
%u: Remote user (from auth; may be bogus if return status (%s) is 401)
%t: Time, in common log format time format (standard english format)
%r: First line of request
%>s: Status (...of the *last* request)
%b: Bytes sent, excluding HTTP headers. In CLF format...
2.) From the mod_log docs, it looks like we could log additional request headers or Environment Variables [httpd.apache.org], for example, after User-Agent:
LogFormat "[as above] \"%{User-Agent}i\" \"%{header}i\" \"%{header}i\"" combined
(Note: We've got to keep an eye on server space, but we could disable all rewrite_log files until needed to debug an error. They're currently all set to "RewriteLogLevel 2".)
Any recommendations as to which headers/info would be most useful were I able to include same in my posts? Now's your chance, gang:)
google.com
Opera/9.01 (Windows NT 5.1; U; en)
robots.txt? NO
referer? Yes BUT... Did not hit the on-site referer in the same session.
Looks like I need to rewrite ^google\.com$ in all .htaccess files. Why can't even one major SE play by all the rules all the time?
195.24.76.232 13/10/2009 21:05:21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01 (forums fast scraper) 7
195.24.76.232 13/10/2009 21:05:20 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01 (forums fast scraper) 12
Here is the hit count for that IP and a sample Apache log entry:
fgrep -c '195.24.76.232' access_log.1
41
195.24.76.232 - - [13/Oct/2009:21:05:21 +0100] "GET /viewforum.php?f=4 HTTP/1.0" 503 155 "http://forums.example.co.uk/viewforum.php?f=4" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01" In:- Out:-:-pct. "-"
No HTTP headers are stored for these, sorry.
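The fgrep -c count above tallies every hit from one IP. A rough Python sketch for counting only the self-referred hits per client host might look like the following. The regex and the function name `self_referred_counts` are hypothetical, and the pattern assumes the log line starts with the standard combined-format fields; anything after the referer, such as the extra fields in the sample entry, is ignored.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Only the fields needed here: client host, requested URI, and referer.
# Everything after the referer (user agent, custom fields) is ignored.
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<uri>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)

def self_referred_counts(lines):
    """Tally hits per client host, keeping only hits where URI matches referer."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # skip lines that do not look like access-log entries
        ref = urlsplit(m.group("referer"))
        ref_uri = ref.path + ("?" + ref.query if ref.query else "")
        if ref.netloc and ref_uri == m.group("uri"):
            counts[m.group("host")] += 1
    return counts
```

To run it over a rotated log, pass an open file object, e.g. `self_referred_counts(open("access_log.1"))`, then look at `.most_common(10)` on the result.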
GaryK: your experience was likely something else. Although I have no personal experience of it on my own forum, phpBB2/3 can auto-register GoogleBot to remove session IDs.
The IP 195.24.76.232 for example is in project honeypot's anti-spam database: I would hope google could never get an IP into that. The honeypot lists several UAs that frequently crop up in site abuse but the URLs listed include pharma paths/filenames.
I really can't see google being associated with that IP unless it is researching badhats, in which case why advertise?
Thing is, if something fishy is going on -- from spoofing to hijacking to hacking to who knows what -- and if Google's not involved (which may or may not be pretty big Ifs), it's a bit hair-raising to think someone's getting away with something against 'em right under their, and our, noses.