fake Googlebot?

         

zerillos

7:57 am on Dec 10, 2008 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've been seeing activity from the IP 68.39.68.12 with a user agent identical to Googlebot's, but the IP resolves to comcast.net.

Is this a fake Googlebot, or is Mr. G using one of Comcast's IPs?

Thank you!

wilderness

10:41 am on Dec 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Fake.

If you're seeing that IP, you'll see more IPs, and you'll see more from different Comcast IPs.

Read this thread from here down:
[webmasterworld.com...]

The solution is either to require the Google IP ranges or to deny the invalid spaces in the UA.

I would, however, urge caution: in the past few days, I've seen requests with standard browser UAs from the same IPs that were using the FAKE Google UA.
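The "deny the invalid spaces" approach can be sketched as a mod_rewrite rule. This is a hypothetical pattern, not a tested signature: the fake UA reported later in this thread carries a doubled space after "compatible;", where the genuine Googlebot UA has a single space.

```apache
# Deny UAs carrying the doubled space after "compatible;" seen in the
# fake Googlebot UA (pattern quoted because it contains literal spaces)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "compatible;\ \ Googlebot" [NC]
RewriteRule .* - [F]
```

Genuine Googlebot requests are unaffected, since the real UA has only a single space there.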

zerillos

1:42 pm on Dec 10, 2008 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thank you! This was helpful. I've blocked that IP.

caribguy

4:37 pm on Dec 15, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Might as well include:

67.210.114.* "Mozilla/5.0 (compatible; Googlebot/2.1; hxxp://www.google.com/bot.html)"

Resulting in my rewriting all access from Lunar Pages to the dark side of the moon...

jdMorgan

5:17 pm on Dec 15, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Also 221.148.19.* with a slightly-defective user-agent string.

Jim

keyplyr

11:22 pm on Dec 15, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



zerillos - Googlebot is spoofed so much by so many that IP white listing is the most time effective method IMO. On a shared host unix server I use mod_rewrite via .htaccess:

RewriteCond %{HTTP_USER_AGENT} ^(AdsBot|AppEngine|Mediapartners|PageFetcher)-Google [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Examples of allowed IPs, one RewriteCond per IP
# (comments go on their own lines; Apache doesn't allow trailing comments)
RewriteCond %{REMOTE_ADDR} !^12\.34\.56\.78$
RewriteCond %{REMOTE_ADDR} !^12\.34\.56\.78$
RewriteCond %{REMOTE_ADDR} !^12\.34\.56\.78$
RewriteRule .* - [F]

Any other IP using a Google UA that's not one of the allowed IP addresses will be served a 403 Forbidden.

For a list of valid Google IP address ranges see: [iplists.com...]

zerillos

12:46 am on Dec 16, 2008 (gmt 0)

10+ Year Member Top Contributors Of The Month



This is probably the best idea. But I'm sure Mr. G won't make it public if he starts using different IPs for his bots. What if Google starts using new IPs and the server returns 403 Forbidden to valid Googlebot requests?

janharders

1:01 am on Dec 16, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Either rely on specialised services to quickly gather the new official IPs, or just put in a little work yourself:
If the user agent is Googlebot's, do a reverse lookup on the IP. If the host is one of Google's hosts, e.g. *.google.com, then do a forward lookup on that host and check that it returns the same IP that is requesting the file, since anyone can set up a reverse DNS entry pointing to Google. That is, AFAIK, the only way Google has officially approved to identify its crawlers.

Once you have the result, you can white- or blacklist the specific IP (or subnet, if you like), so the whole process only has to be done once per IP.
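The two-step check described above can be sketched in Python. This is a hypothetical helper, not Google's code; the accepted hostname suffixes follow Google's published guidance, and the lookup functions are injectable so the logic can be exercised without live DNS:

```python
import socket

def is_real_googlebot(ip, reverse_dns=socket.gethostbyaddr,
                      forward_dns=socket.gethostbyname):
    # Step 1: reverse-lookup the requesting IP to get its PTR hostname.
    try:
        host = reverse_dns(ip)[0]
    except OSError:
        return False
    # Step 2: the PTR must be in a Google-owned domain...
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Step 3: ...and a forward lookup of that hostname must return the
    # same IP, since anyone can point a reverse DNS entry at Google.
    try:
        return forward_dns(host) == ip
    except OSError:
        return False
```

A passing result can then be cached per IP (the white/blacklisting described above), so the double lookup only runs once per address.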

GaryK

1:18 am on Dec 16, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The easiest and most reliable way to ensure it's Googlebot is a reverse DNS lookup. It will always return a PTR that ends with googlebot.com.

[googlewebmastercentral.blogspot.com...]

[edited by: GaryK at 1:19 am (utc) on Dec. 16, 2008]

GaryK

7:05 pm on Dec 16, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oops. I was in a bit of a hurry last night. I forgot to mention you also need to do a forward DNS lookup in order to avoid problems associated with reverse DNS/PTR spoofing like Jan mentioned above. Sorry about that.

Megaclinium

5:30 am on Dec 23, 2008 (gmt 0)

10+ Year Member



I've got this fake googlebot also,

UA: Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]

Came from 75.146.149.xx; it grabbed some root pages plus some subpages in a hurry and failed speed tests. But then, an MSN bot also recently failed the speed test and got grabby quickly.

fake googlebot resolved to Comcast Business (not home user) Minnesota range:
Comcast Business Communications, Inc. CBC-CM-5 (NET-75-144-0-0-1)
75.144.0.0 - 75.151.255.255
Comcast Business Communications, Inc. CBC-MINNESOTA-9 (NET-75-146-144-0-1)
75.146.144.0 - 75.146.159.255

Had to 403 him.

wilderness

5:58 am on Dec 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Megaclinium,
Did the UA on this one include the double trailing space after the semi-colon?

TIA

Megaclinium

6:33 am on Dec 27, 2008 (gmt 0)

10+ Year Member



Yes, it did

Here's the exact UA, pasted:
"Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
(don't know if this box will change it, tho; split line there)

It tried twice more after an hour or so, same UA. I 403'd it after the first time. Hasn't been back since the 23rd.

Megaclinium

6:36 am on Dec 27, 2008 (gmt 0)

10+ Year Member



After reading the post above and pasting it back into Notepad, it looks like one of the blanks was removed when posted here.

But I have my site monitoring write the original records out to a .txt file hourly, in addition to writing a log database table with the complete original record unparsed, keyed by IP + date/time.

Megaclinium

4:45 am on Dec 29, 2008 (gmt 0)

10+ Year Member



Now back on a Verizon range: 71.103.249.xx

Megaclinium

6:20 am on Jan 13, 2009 (gmt 0)

10+ Year Member



Now coming from theplanet.com in Houston:
74.52.177.#*$!

dstiles

12:26 am on Jan 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Coming from everywhere and anywhere!

I suspect a few may be from idiots using (eg) Firefox User-Agent rotators, since the hits I get are seldom in the "scrape" category.

jdMorgan

12:51 am on Jan 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dstiles,

Can you briefly characterize their behavior? I have no idea what they would do, because I'm immediately booting them from my sites. As a result, all I can say is that each fake Gbot instance seems to go away after it gets a 403-Forbidden response -- I have no idea what they'd do if allowed access.

Thanks,
Jim

dstiles

9:33 pm on Jan 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim...

I'm getting three or four bad googles a day but unfortunately I have insufficient time to analyse them properly.

Of three hits today on our UK-based server with several virtual sites...

Site (A): UK broadband IP hit home page on site once and went away with a 403. Nothing else seen.

Site (B): UK broadband IP hit three related sites (only one has trap installed) before getting a 403 on site.

Site (C): US broadband IP hit home page then pricelist page, 403 each time, then left.

(B) came in with an unterminated folder in the URL (ie no trailing "/"), got a 301 and then came back with two hits to the same home page - 1 second between each hit so a slow browser or robot?

(C) suggests prior knowledge of site structure - 2.5 minutes between hits. The site is very low traffic so it was easy to notice a hit, 30 minutes previous, to robots.txt and another hit 2 seconds later than that on the home page from inktomi IP 74.6.17.n.

All instances had only the single google UA.

The Inktomi IP range is probably not significant. Various IPs in the range appeared in all logs as above although not as close as the one noted. The range also read other files such as CSS so probably shared between robot and page-checker.

Whether my suggestion of rotating UAs is correct I can't be sure; it was just a feeling. Today's batch suggests not, since as I understand it the UAs rotate per access, and there has been no previous access from these IPs with a different UA (that I can find!).

GaryK

9:36 pm on Jan 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At the risk of sounding rude, something I absolutely don't wanna do, why does everyone seem to be making such a big deal out of fake Googlebot? Just ban anything that isn't legit after doing a full round-trip DNS lookup and move on. It's really not worth all the attention you all are giving it. :)

blend27

8:30 pm on Apr 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The funny part is that after all this time this IP is still operational and trying to scrape.

Latest UA: Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)

The IP is in ProjectHoneypot Database as well.

GaryK

7:09 pm on Apr 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's really not so hard to believe. I hear from people almost every day who haven't a clue how to ban bots, and others who don't even know how or what to look for. I try to get them to sign up here because you all have provided a wonderful learning experience for me. Sadly, though, most don't seem to sign up. And so the fake Googlebots continue to scrape away.

wilderness

7:43 pm on Apr 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gary,
How many references at Webmaster World do you recall in which a webmaster does not even know where or how to locate their visitor logs ;)

It seems most never get beyond the general and worthless stats provided by webhosts.

Don

GaryK

2:10 am on Apr 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



True, Don. But I've never seen a name I recognize as someone I referred here. It's a shame, really. You're right about the worthless stats, too; I know because when I first started I had those kinds of stats. Now I use a stats package from the same vendor as my mail server software.