[ws.arin.net...]
Nothing good ever comes from these ranges: scrapers, forum/guestbook spam attempts, and parasite hosting.
This might be a bit harsh as a description of this host, but their customers really got to me at the beginning of 2006.
foof, that was it....
IP: 67.228.175.XX
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*
Connection: Keep-Alive
Content-Length: 0
Host: www.example.com
User-Agent: Google Bot 2 Beta
Accept-Language: en-us
Apparently there is a Beta 2 of a scraper tool in the works....
Oddly enough, over the past few days I've been getting exactly the same UA from Google on their 66.249.71.* crawl block. Has Google changed the UA?
74.86.114.zzz - - [22/Sep/2008:14:14:08 -0500] "GET /MyFolder/Sub/Sub/Sub/index.html HTTP/1.1" 200 10170 "-" "+SitiDi.net/SitiDiBot/1.0 (+Have Good Day)"
74.86.114.zzz - - [22/Sep/2008:14:14:09 -0500] "GET /Mypage.html HTTP/1.1" 200 3656 "-" "aranhabot"
74.86.114.zzz - - [22/Sep/2008:14:14:09 -0500] "GET / HTTP/1.1" 200 6096 "-" "+SitiDi.net/SitiDiBot/1.0 (+Have Good Day)"
74.86.114.zzz - - [22/Sep/2008:14:14:11 -0500] "GET /MyOtherPage.html HTTP/1.1" 403 - "-" "ArabyBot (compatible; Mozilla/5.0; GoogleBot; FAST Crawler 6.4; [araby.com;)"...]
74.86.114.zzz - - [22/Sep/2008:14:14:11 -0500] "GET / HTTP/1.1" 200 6096 "-" "AcoiRobot"
only allowing valid IP ranges for those UAs
That doesn't fly anymore; you need to use full-trip DNS checking, just as the SEs recommend. An SE just recently expanded into new IPs - some of you may know which one ;)
However, either full-trip DNS or IP range checking boots the fake Googlebots and eliminates proxy hijacking, so I'm always wondering why this topic keeps popping back up when the solution that totally eliminates fake SE spiders has been officially endorsed for two years now.
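For anyone who hasn't seen it spelled out, the "full trip" check boils down to a reverse lookup followed by a forward lookup. A rough PHP sketch - the is_genuine_crawler() name and the suffix list are my own illustrative choices, not any SE's official code:

<?php
// Rough sketch of a full-trip (reverse-then-forward) DNS check.
// The helper name and the allowed host suffixes are illustrative choices.
function is_genuine_crawler($ip, array $allowed_suffixes)
{
    $host = gethostbyaddr($ip);              // reverse lookup: IP -> host name
    if ($host === false || $host === $ip) {
        return false;                        // no usable PTR record
    }

    $suffix_ok = false;
    foreach ($allowed_suffixes as $suffix) { // e.g. '.googlebot.com'
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffix_ok = true;
            break;
        }
    }
    if (!$suffix_ok) {
        return false;                        // rDNS doesn't belong to the claimed SE
    }

    // forward lookup: the host name must resolve back to the original IP
    $forward = gethostbynamel($host);
    return is_array($forward) && in_array($ip, $forward, true);
}

// Example: only trust a request claiming to be Googlebot if its IP round-trips
// through a googlebot.com (or google.com) host name.
$trusted = is_genuine_crawler($_SERVER['REMOTE_ADDR'],
                              array('.googlebot.com', '.google.com'));

The second lookup is the whole point: anyone can publish a PTR record that says googlebot.com, but only Google can make that host name resolve back to the requesting IP.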
only allowing valid IP ranges for those UAs
That doesn't fly anymore; you need to use full-trip DNS checking, just as the SEs recommend. An SE just recently expanded into new IPs - some of you may know which one ;)
Bill,
Time and time again in this forum, I've seen multiple solutions which result in the same or similar actions.
What works for one of us may not be usable (by choice or understanding) for another.
If keyplr has an effective method (even though you may perceive flaws), it is, after all, his boat.
A few of us have been doing these access limitations for so long that we've long had in place restrictions on IP ranges and UAs which do not allow the visitor deceptions that others may see.
Perhaps one of the reasons for the lack of use of "full trip DNS" is how often this forum avoids providing valid examples, in order to keep methods hidden from harvesters and others?
I don't use "full trip DNS" for anything, nor do I have bots harvesting my pages endlessly.
Don
not providing valid examples
I think that is a little unfair - jdMorgan went to great lengths (as usual) last year to show how it can be done in .htaccess - though the technique may not work on some shared hosting configurations.
[webmasterworld.com...]
An alternative method using an auto-prepended PHP file has also been posted:
[webmasterworld.com...]
I confess that I too found it all rather baffling, and abandoned several attempts to implement it - though coincidentally I finally got the latter to work yesterday (go me!) after making a couple of adjustments for my hosting environment.
Meanwhile my sites use the "allow valid IP ranges" method, which (as incrediBILL conceded) is effective enough if you can stay up-to-date with the ranges used by the search engines.
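For illustration, that approach boils down to something like the sketch below - the in_cidr() helper and the single Googlebot range shown are just examples, and the range list is exactly the part that has to be maintained by hand:

<?php
// Rough sketch of the "allow valid IP ranges" method. The in_cidr() helper and
// the example range are illustrative; the list must be kept current as the
// engines add new address space.
function in_cidr($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$googlebot_ranges = array('66.249.64.0/19');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

if (stripos($ua, 'googlebot') !== false) {
    $from_google = false;
    foreach ($googlebot_ranges as $range) {
        if (in_cidr($ip, $range)) {
            $from_google = true;
            break;
        }
    }
    if (!$from_google) {
        header('HTTP/1.1 403 Forbidden');    // claims Googlebot, wrong address space
        exit;
    }
}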
I don't know which SE just expanded into new IPs - but I do know how to find out.
...
I don't know which SE just expanded into new IPs - but I do know how to find out.
I'm an unfair guy ;)
Jim has in the past provided, and continues to provide, marvelous insights in this forum and many others. Unfortunately, unless links are marked and/or utilized (as you have done), the references - and the benefit of the sweat of Jim's brow - are limited.
Jeeves is such a pest; who cares how many ranges they add.
I seem to recall another new SE range; however, the "pokes and probes" couldn't have been too overwhelming, or else I'd have made a notation or access adjustment.
More SoftLayer IP Ranges here.
That company is in my top 8 $%^&*@ list when it comes to scrape attempts and MFA site hosting. I don't know what attracts the "evil" to it.
We just got hit with a SoftLayer scraper and reported it as well.
I'm getting really tired of seeing hosts like SoftLayer (SL) bashed. SL does not do any scraping. Its customers do the scraping. Unless you report them to SL, they'll continue to scrape. Every one that I've reported is no longer hosted at SL. They are most likely now hosted by someone else. The Planet, perhaps? How about ServerBeach, or a ton of other hosts? So let's blame the scrapers, not the companies that host them.
Now then, why are we arguing about how to recognize bad bots? We've always had the attitude in this forum that whatever works best for you is what you should use. If you want to ban blocks of IP addresses, do it. If you want to use rDNS to see if a user agent is really from the SE it claims to be from, do it. If you want to rely only on header data, do it. If you want to rely on the pattern of files a bot takes, then that's what's best for you.
Eenie, meenie, miney, mo works also.
and everybody crawled out of the wrong side of the bed
How about ServerBeach
ServerBeach is about as squeaky clean as any data center I've ever found, which is why I use them and Peer 1 exclusively these days.
Any AUP violations or hacked servers get swatted down almost instantly.
As far as SoftLayer is concerned, when my server was under a botnet attack two years ago they took relatively swift action to disable a vulnerable server that kept getting used over and over again.
why are we arguing about how to recognize bad bots
OK, I'll show you why ;)
Here's how I just successfully faked Googlebot, using Google's own servers, in a way that would pass many bot traps.
I set my user agent to "Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...] and started accessing pages via Google's translator.
Here's what showed up on my server:
66.249.85.85 "GET /testgooglebot HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...] (via translate.google.com)"
If you're just using "66.249.64.0 - 66.249.95.255" or "66.249.64.0/19" to validate that it's a Google IP, plus the fact that it claims to be Googlebot, and your Googlebot UA validation is sloppy, then your site has just been duped by the Google translator.
Now imagine I know another location at Google that does the same thing but doesn't tack on the telltale ",gzip(gfe) (via translate.google.com)" - then you're completely duped.
Similar holes may exist in Yahoo, Live, etc...
If you don't want people to be able to use the SEs' facilities against you - which they do - full-trip DNS is the only way to fly.
The fact remains, however, that lots of unwanted robotic nuisances come from such places, and as far as I am aware nothing useful is lost by blocking their IP ranges.
If I am wrong about this I would welcome enlightenment.
...
Because of my user agent project I want to attract all the bots I can and see how they behave.
I use a combination of methods to see where they're really from, including rDNS. But until they get too abusive I won't turn them away. So for me I need the user agent, IP, rDNS, header data, the pattern of files taken, whether it reads robots.txt, honeypots, and a few other things.
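Purely as an illustration of the kind of raw data that involves, a per-request logging stub might look like this (the log path and the handful of fields are my own choices, not anyone's production setup):

<?php
// Illustrative only: record IP, rDNS, UA, a couple of headers and the URL
// requested, so patterns can be reviewed later.
$ip   = $_SERVER['REMOTE_ADDR'];
$line = array(
    date('c'),
    $ip,
    gethostbyaddr($ip),                                              // rDNS, if any
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
    isset($_SERVER['HTTP_ACCEPT'])     ? $_SERVER['HTTP_ACCEPT']     : '-',
    $_SERVER['REQUEST_URI'],                                         // reveals file-taking patterns
);
file_put_contents('/tmp/bot-signals.log', implode("\t", $line) . "\n", FILE_APPEND);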
Others may not need that level of sophistication.
So I have to return to my original statement that we cannot tell others the best way to handle bots.
It's got to be whatever method works best for them, even if that means the occasional phony bot slips through. For them the amount of code that has to be written to handle these bots might not be worth it. Not everybody is dealing with thousands of uniques a day, so the few bad bots that visit just aren't the drain on resources that they might be on other sites.
So now there are a few scenarios to consider.
If you're doing full UA matching, it could break if Googlebot changes a single letter:
"^Mozilla/5\.0 (compatible; Googlebot/2\.1; [google\.com...]
Then of course the looser UA matching can be duped as shown above:
"^Mozilla/5\.0\ \(compatible;\ googlebot/"
So the best bet for full functionality, with the least likelihood of being duped or broken by changes at Google, is the code Jim showed in the other thread, where you specify only the minimum matching value of "googlebot" and let full-trip DNS do its job.
FWIW, all bot-blocking methods are perfectly valid; I'm just trying to show the pitfalls, traps, and loopholes that other methods may present, is all.
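In other words, something along these lines - a sketch that reuses the is_genuine_crawler() helper from the earlier sketch, with the suffix list again being my own illustrative choice:

<?php
// Sketch of the "loose UA match plus full-trip DNS" combination.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (preg_match('/googlebot/i', $ua)) {
    // The UA only has to contain "googlebot"; the DNS round trip does the real
    // work, so a cosmetic change to Google's UA string can't break the rule.
    if (!is_genuine_crawler($_SERVER['REMOTE_ADDR'],
                            array('.googlebot.com', '.google.com'))) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}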
Because of my user agent project I want to attract all the bots I can and see how they behave
Not disagreeing with you, as I too let things in just a little to see how they behave.
Just showing the flaws in various methodologies for those who want their filters as tight as possible.
Some may not need it, or may not be able to do it; it's just education.