google.com -- spoof? spider? botnet zombie? employee?

         

Pfui

6:19 pm on Oct 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Backstory:

We have a huge archive of posts, and a reliable sign of botnet [en.wikipedia.org] attacks is when the URI matches the referer (REF). For three-plus years I've watched an alarming (and alarmingly increasing) number of coordinated forum spambots [en.wikipedia.org] flail against blessed mod_rewrite and eat 403s, so much so that I now simply beat 'em off with: http://127.0.0.1/ [L,R=301].

(And a tip o' the hat to Jim Morgan for helping with the 'block direct GETs and PUTs w/ self-referrer & missing query string' coding.)

Today:

Below is a list of rapid-fire hits from "google.com" to html and cgi-generated files, ALL of which were URI=REF hits. In addition, ALL of the hits went to one directory where ALL bots are disallowed by robots.txt as well as by host and UA in root and per-directory .htaccess files. Alas, I don't have an IP because we do rDNS on the server.

Were these hits from any other host, they'd be 100% typical of a zombie [en.wikipedia.org]. Even a dusty 2005/2006 UA is typical. Actually, the number of URI=REF hits far exceeds typical zombie visits. And this isn't just any host...

Thoughts?

google.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Netscape/8.0.4

10/06 07:02:59
10/06 07:03:01
10/06 07:03:04
10/06 07:03:07
10/06 07:03:08
10/06 07:03:09
10/06 07:03:10
10/06 07:03:13
10/06 07:03:15
10/06 07:03:18
10/06 07:03:19
10/06 07:03:21
10/06 07:03:22
10/06 07:03:25
10/06 07:03:28
10/06 07:03:29
10/06 07:03:30
10/06 07:03:32
10/06 07:03:35
10/06 07:03:39
10/06 07:03:40
10/06 07:03:41
10/06 07:03:43
10/06 07:03:46
10/06 07:03:49
10/06 07:03:50
10/06 07:03:51
10/06 07:03:52
10/06 07:03:54
10/06 07:03:57
10/06 07:04:00
10/06 07:04:01
10/06 07:04:03
10/06 07:04:04
10/06 07:04:07
10/06 07:04:10
10/06 07:04:11
10/06 07:04:12

P.S. I'm now blocking ^google\.com$ from the entire site.

Here's an ELF (extended log format) excerpt of the URI=REF pattern:

google.com - - [06/Oct/2009:07:03:22 -0700] "GET /dir/file.html HTTP/1.0" 301 221 "http://www.example.com/dir/file.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Netscape/8.0.4"
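For anyone who wants to scan a log for the same URI=REF pattern, here is a minimal Python sketch. It is not the mod_rewrite rule used on this server. It assumes combined-format lines like the excerpt above; the names `COMBINED` and `is_self_referred` are made up for illustration. It compares only the referer's path and query string against the requested URI, because the log line does not record which site was requested.

```python
import re
from urllib.parse import urlsplit

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def is_self_referred(line):
    """Return True when the referer's path (plus query) equals the requested URI."""
    m = COMBINED.match(line)
    if m is None:
        return False  # not a combined-format line
    ref = urlsplit(m.group("referer"))
    if not ref.netloc:
        return False  # "-" or a referer without a host part
    ref_uri = ref.path or "/"
    if ref.query:
        ref_uri += "?" + ref.query
    return ref_uri == m.group("uri")


# The log line quoted in the post above.
sample = (
    'google.com - - [06/Oct/2009:07:03:22 -0700] "GET /dir/file.html HTTP/1.0" '
    '301 221 "http://www.example.com/dir/file.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Netscape/8.0.4"'
)
print(is_self_referred(sample))
```

A stricter check would also compare the referer's hostname with the site's own name, which this sketch leaves out.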

jdMorgan

12:06 pm on Oct 7, 2009 (gmt 0)

Looks totally bogus to me... Netscape has never included "MSIE" in its user-agent string.

Most likely someone/something abusing one of Google's proxy-like services (mobile translator, language translator, etc.), but it's impossible to be sure without IP addresses and additional HTTP header info.

Jim

GaryK

2:23 pm on Oct 7, 2009 (gmt 0)

Most likely someone/something abusing one of Google's proxy-like services

A few weeks ago I needed to look at a phpBB-powered forum whose admins restricted viewing of the forums to registered members. I'm not sure if this is novel or not, but since I noticed GoogleBot was on the site at the time, I changed my UA to GoogleBot and came at the site via Google Translate. I was able to see all the forums.

Pfui

5:32 pm on Oct 7, 2009 (gmt 0)

Which additional HTTP header info would be most useful in situations similar to the OP? Is there a preferred Apache log config specifically for bot- or zombie-spotting?

1.) Currently we use the "NCSA extended/combined log format" as our access_log config in httpd.conf:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

KEY (a.k.a. Apache 1.3: Module mod_log_config [httpd.apache.org])

%h: Remote host
%l: Remote logname (from identd, if supplied)
%u: Remote user (from auth; may be bogus if return status (%s) is 401)
%t: Time, in common log format time format (standard english format)
%r: First line of request
%>s: Status (...of the *last* request)
%b: Bytes sent, excluding HTTP headers. In CLF format...

2.) From the mod_log_config docs, it looks like we could log additional request headers (or Environment Variables [httpd.apache.org]), for example, after User-Agent:

LogFormat "[as above] \"%{User-Agent}i\" \"%{header}i\" \"%{header}i\"" combined

(Note: We've got to keep an eye on server space but we could disable all rewrite_log files until needed to debug an error. They're currently all set: "RewriteLogLevel 2")

Any recommendations as to which headers/info would be most useful were I able to include same in my posts? Now's your chance, gang:)

Pfui

3:32 pm on Oct 18, 2009 (gmt 0)

This just in... From the log of another site on the same server as the OP's:

google.com
Opera/9.01 (Windows NT 5.1; U; en)

robots.txt? NO
referer? Yes BUT... Did not hit the on-site referer in the same session.

Looks like I need to rewrite ^google\.com$ in all .htaccess files. Why can't even one major SE play by all the rules all the time?

AlexK

7:47 pm on Oct 18, 2009 (gmt 0)

I spotted an IP from google.com doing a fast scrape on my forums - caught using the Bad-Bots Stopper Script [webmasterworld.com]. Here is the precis of that (stored on my forum blog for reference):

195.24.76.232 13/10/2009 21:05:21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01 (forums fast scraper) 7 
195.24.76.232 13/10/2009 21:05:20 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01 (forums fast scraper) 12

note: a max of 12 hits / sec, so not that fast in the current context, when the fastest attempted scraper on my site to date was at >300 hits/sec.

Here is a sample Apache log entry:

fgrep -c '195.24.76.232' access_log.1
41

195.24.76.232 - - [13/Oct/2009:21:05:21 +0100] "GET /viewforum.php?f=4 HTTP/1.0" 503 155 "http://forums.example.co.uk/viewforum.php?f=4" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01" In:- Out:-:-pct. "-"

No HTTP headers are stored for these, sorry.
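The hits-per-second figure quoted above can be approximated straight from a combined-format access log. The following is a rough Python sketch, not the Bad-Bots Stopper Script itself; the function name `peak_hits_per_second` is hypothetical, and it assumes the host is the first field on each line and that timestamps look like the sample entry.

```python
from collections import Counter
from datetime import datetime


def peak_hits_per_second(lines, host):
    """Return (timestamp, count) for the busiest one-second bucket, or None."""
    buckets = Counter()
    for line in lines:
        if not line.startswith(host + " "):
            continue  # a different client
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # no timestamp field
        try:
            stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp not in the expected format
        buckets[stamp] += 1
    if not buckets:
        return None
    return buckets.most_common(1)[0]


# The single log entry quoted in the post above.
sample = [
    '195.24.76.232 - - [13/Oct/2009:21:05:21 +0100] "GET /viewforum.php?f=4 HTTP/1.0" '
    '503 155 "http://forums.example.co.uk/viewforum.php?f=4" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.01"'
]
print(peak_hits_per_second(sample, "195.24.76.232"))
```

In practice `lines` would be an open file handle on the access log, so the whole file never has to sit in memory.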

Pfui

11:17 pm on Oct 18, 2009 (gmt 0)

Curiouser and curiouser...

Googling the IP that Alex snagged, 195.24.76.232 (rDNS: "google.com"), shows that the apparently Luxembourg-based IP is (in)famous for forum-spamming.

Might this "google.com" be an example of Host spoofing? Domain hijacking? Paranormal activity? :)

AlexK

8:37 am on Oct 20, 2009 (gmt 0)

Hmm. Not a word of explanation from Google (do they read this section?) nor anyone else. It seems worthwhile bumping this one back up to see if we can catch some attention.

GaryK: your experience was likely something else. Although I have no personal experience of it on my own forum, phpBB2/3 can auto-register GoogleBot to remove session IDs.

dstiles

9:10 pm on Oct 20, 2009 (gmt 0)

I'm not convinced this actually is google. Following up info at Robtex suggests this and several other IPs are merely pointers to google.com or fraudulent rDNS setups.

The IP 195.24.76.232 for example is in project honeypot's anti-spam database: I would hope google could never get an IP into that. The honeypot lists several UAs that frequently crop up in site abuse but the URLs listed include pharma paths/filenames.

I really can't see google being associated with that IP unless it is researching badhats, in which case why advertise?
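One way to test the "fraudulent rDNS" theory is a forward-confirmed reverse DNS check: look up the PTR name for the IP, resolve that name forward, and see whether the original IP is among the results. Whoever controls an IP block's reverse zone can make it claim any hostname, but they cannot make the real google.com resolve back to their address. The sketch below is only an illustration under those assumptions. `forward_confirmed` is a made-up name, it handles IPv4 only, and the stub resolvers use documentation-range addresses, not data from this thread.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return the PTR hostname if it resolves back to `ip`, else None."""
    try:
        hostname = reverse(ip)[0]         # PTR lookup: IP -> name
        addresses = forward(hostname)[2]  # A lookup: name -> list of IPv4s
    except OSError:                       # covers socket.herror / socket.gaierror
        return None
    return hostname if ip in addresses else None


# Stub resolvers standing in for real DNS (hypothetical data).
def stub_reverse(ip):
    return ("google.com", [], [ip])


def stub_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


# The PTR record claims google.com, but google.com does not point back.
print(forward_confirmed("198.51.100.7", stub_reverse, stub_forward))
```

Called with just an IP, the function uses the system resolver, so results depend on live DNS at the time of the lookup.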

Pfui

10:45 pm on Oct 20, 2009 (gmt 0)

dstiles: Yep, you found the same info I did in my 'Curiouser and curiouser' msg. Weird, huh?

Thing is, if something fishy is going on -- from spoofing to hijacking to hacking to who knows what -- and if Google's not involved (which may or may not be pretty big Ifs), it's a bit hair-raising to think someone's getting away with something against 'em right under their, and our, noses.