Google Home Page as Referrer and Random User Agents

Is this a scraper?

         

dataguy

8:42 pm on Nov 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I operate a large site which scrapers seem to love. We do a pretty good job of keeping them at bay, but I've been watching something new that I'm not sure of.

We get hit about once a minute by one or more clients that have only this in common:

HTTP_REFERER: [Google.com...] and random U/A's, typically less common U/A's such as iPhone, Android, Firefox

Is there anything that legitimately returns http_referer as "http://www.Google.com/"?

I know my rankings are good, but not that good.

wilderness

12:53 am on Nov 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there any consistency in the IP ranges?

encyclo

1:06 am on Nov 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is there anything that legitimately returns http_referer as "http://www.Google.com/"?

Yes - the "I'm feeling lucky" button.

Are they loading images? CSS/JS files?

dataguy

3:17 pm on Nov 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There isn't any consistency in IP ranges. They are from all over the world. I've only ever seen one page view from each IP.

I'll have to check to see if they are loading the css file.
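One rough way to run that check, sketched in Python. The input format (a list of `(ip, path)` pairs already parsed from the access log), the function name, and the list of asset suffixes are all assumptions for illustration, not anything from my actual setup:

```python
from collections import defaultdict

def pages_without_assets(hits, asset_suffixes=(".css", ".js", ".png", ".gif", ".jpg")):
    """Return the sorted IPs that requested pages but never any asset file.

    `hits` is an iterable of (ip, path) pairs; parsing the log into that
    shape is assumed to happen elsewhere.
    """
    asked_for_asset = defaultdict(bool)
    for ip, path in hits:
        # Ignore any query string before checking the file suffix.
        bare_path = path.lower().split("?", 1)[0]
        is_asset = bare_path.endswith(asset_suffixes)
        asked_for_asset[ip] = asked_for_asset[ip] or is_asset
    return sorted(ip for ip, seen in asked_for_asset.items() if not seen)
```

An IP that shows up in the result fetched HTML only, which is typical of scrapers, though a browser with a warm cache can look the same on a single page view.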

My system looks at various factors from the remote header, assigning each factor a different weight. A few of the factors have to do with the referral string. This is the first time I've seen this, so it's throwing my system off track.
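To give an idea of the approach, here is a minimal sketch of that kind of weighted scoring. The factor names, the weights, and the choice of headers are made up for this example and are not the real ones; it also assumes the headers arrive as a plain dict with canonical capitalisation:

```python
# Hypothetical factors and weights, for illustration only.
WEIGHTS = {
    "bare_google_referer": 3.0,
    "missing_accept_language": 2.0,
    "uncommon_user_agent": 1.0,
}

# Referer values that point at the Google home page rather than a results page.
BARE_GOOGLE = {
    "http://www.google.com/",
    "http://www.google.com",
    "http://google.com/",
    "http://google.com",
}

def score_request(headers):
    """Return a suspicion score for one request; higher means more bot-like."""
    referer = headers.get("Referer", "").lower()
    user_agent = headers.get("User-Agent", "")
    score = 0.0
    if referer in BARE_GOOGLE:
        score += WEIGHTS["bare_google_referer"]
    if "Accept-Language" not in headers:
        score += WEIGHTS["missing_accept_language"]
    if any(token in user_agent for token in ("iPhone", "Android")):
        score += WEIGHTS["uncommon_user_agent"]
    return score
```

No single factor blocks anyone; a request is only acted on once the total crosses some threshold, which is why a new referer pattern like this one can skew the totals.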

dstiles

7:55 pm on Nov 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been blocking [google.com...] (neither www nor trailing /) for some time. It's not a common referer on my sites. I've now added versions with and without / and www to see what happens. If it's google "lucky" then I'm not interested in the visitor anyway.

One thought that occurs: if the hits are in infrequent bursts, could it be games-players or botnets? There are some odd sites out there and I know we've been hit by off-line players (esp. TV quizzes) in the past searching for answers. Is it possible some online game site has incited users (possibly via installation of software) to use a google referer to allay suspicions? Probably not, but it might be something similar. We've often seen botnets hitting through random compromised zombies, different IP on each hit, so it may be something similar to that. A bit odd using google as a referer, though.

Of course, it could be google log-spamming via zombies. :)

Or even MS mounting a smear campaign! :)

incrediBILL

10:18 pm on Nov 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The proper referrer from Google will always start with "http://www.google.com/", so blocking "http://google.com/" is perfectly valid as best I can tell.

I'm not interested in the visitor anyway.

Redirect them to my server, I'll bet their money is still green.

dstiles

12:55 am on Nov 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think blocking the exact referer text ^http://www.google.com/$ should be safe enough.
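As a sketch of that exact match, here it is in Python rather than in server config. The function name is hypothetical. Note that the dots are escaped here; left unescaped, as in the pattern I typed above, each dot would match any character. The match is case-insensitive because the referer has been reported with a capital G:

```python
import re

# Matches only the bare home page URL, not search result URLs.
EXACT_GOOGLE_HOME = re.compile(r"http://www\.google\.com/", re.IGNORECASE)

def is_bare_google_referer(referer):
    """True only when the whole referer is the Google home page URL."""
    # fullmatch anchors both ends, so no ^ or $ is needed in the pattern.
    return EXACT_GOOGLE_HOME.fullmatch(referer) is not None
```

A normal search click carries a path and query string after the host, so it falls through this test untouched.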

Who hits "lucky" if they are seriously looking for products or information? I wouldn't expect it to return anything sensible and so no "sale".

incrediBILL

1:00 am on Nov 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Considering I have a lot of #1 rankings, no way would I ever block "lucky".

Might as well just do this:

User-agent: Googlebot
Disallow: /

Seriously, if I showed you some of the insanity I often find on the web and the logic behind those that do it, you would reconsider.

Perhaps they just know you're the #1 result, didn't remember your domain, but know "lucky" will get back to your site without additional clicks and delays.

I've heard stranger things...

dstiles

9:00 pm on Nov 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



To me it's always sounded like purely random chance so I've never considered it and certainly didn't think anyone would actually USE it! Certainly I've never heard of any of my customers or colleagues trying it - and some of them do some really dumb things.

Having now tried it, it seems to redirect immediately to the top site. Which seems highly dangerous in today's world where compromised sites rate well in google.

I've taken your advice and removed the www version. When I get time I'll add a tracer instead of a block on it.
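Roughly what I have in mind for the tracer, sketched in Python. The logger name, the function name, and the fields logged are assumptions; nothing like this is written yet:

```python
import logging

tracer = logging.getLogger("referer_tracer")

# The www variants, with and without the trailing slash.
TRACED_REFERERS = {"http://www.google.com/", "http://www.google.com"}

def trace_request(ip, referer, user_agent):
    """Log hits carrying the bare www Google referer; never block them.

    Returns True when the hit was traced, False otherwise.
    """
    if referer.lower() in TRACED_REFERERS:
        tracer.info("bare google referer ip=%s ua=%s", ip, user_agent)
        return True
    return False
```

The visitor is served normally either way; the log just builds up enough evidence to decide later whether these hits are people using "lucky" or something automated.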