Strange hits from Google's IP range trip my anti-scraper

bcc1234

5:21 am on Jul 30, 2007 (gmt 0)

I'm getting hits from 72.14.192.0/18, which seems to belong to Google. But what's weird is that reverse DNS doesn't respond with the usual PTR record "*.googlebot.com."

On top of that, the request headers sent by the client usually have an X-FORWARDED-FOR header with some Comcast IP.

The clients from that range don't break the robots.txt restrictions but do hit hidden links on occasion.

Because reverse DNS is not set up the way it is for the usual Googlebot, such hits trip the anti-scraping protection.

Is there a way for Google to either confirm or deny that it is their range?

I wouldn't mind adding it to the white list, but would like to make sure those are real Google-related hits.
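For reference, the kind of DNS check described above can be sketched roughly as follows. This is only an illustration of a forward-confirmed reverse DNS test, not the actual anti-scraper code; the function name `is_verified_googlebot` and the injectable lookup parameters are made up for the example.

```python
import socket

def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True only if ip has a *.googlebot.com PTR record and that
    hostname resolves back to the same ip (hypothetical helper)."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        # No PTR record at all, as with the hits from 72.14.192.0/18.
        return False
    if not hostname.rstrip(".").lower().endswith(".googlebot.com"):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup guards against someone faking their own PTR record.
    return ip in addresses
```

A hit that fails this test is not necessarily hostile; it only means it is not the regular crawler.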

Bones

10:51 am on Jul 30, 2007 (gmt 0)

You might want to do a search for Google Web Accelerator.

theBear

12:39 pm on Jul 30, 2007 (gmt 0)

Is the agent really Googlebot?

That is a Google IP, and the Web Accelerator would cause a prefetch of hidden links. Because of that it basically acts like a bot, and thus your defense system jabbered at you.

The prefetch can be turned off. It takes just a few lines in .htaccess; you should be able to find them if you do a search.
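As a rough idea of what such a rule checks, here is a minimal sketch in Python rather than .htaccess. It assumes the accelerator marks its prefetch requests with an `X-moz: prefetch` request header; that header name is an assumption here, not something stated in this thread, and the function names are made up for the example.

```python
def is_prefetch_request(headers):
    """headers: mapping of request header names to values.
    Assumes prefetches carry 'X-moz: prefetch' (an assumption)."""
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False

def status_for(headers):
    # Refuse prefetches with 403 so hidden links are never pre-loaded;
    # ordinary requests pass through with 200.
    return 403 if is_prefetch_request(headers) else 200
```

Refusing only the prefetches should leave ordinary page views through the accelerator untouched.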

bcc1234

12:57 pm on Jul 30, 2007 (gmt 0)

Not a Googlebot user agent.
That's what I thought: it's some kind of Google proxy or gateway.

At first I thought those were human reviewers working for Google, but because they seemed to hit the trap URLs all too often, I wasn't sure.

Do you know if it's possible at all to use that accelerator as a proxy? In other words, can a scraper use it somehow to copy content?

If not, then I'll just add the whole range to the whitelist and be done with it. If yes, then it gets more tricky.
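A middle ground between whitelisting the whole range and blocking it might look like the sketch below: for hits coming from 72.14.192.0/18, count them against the address in X-FORWARDED-FOR instead of the Google address. It assumes the proxy appends the real client as the last entry of that header, which I have not confirmed; the name `accounting_ip` is made up for the example.

```python
import ipaddress

# Range from the first post; treated as the trusted proxy range (assumption).
GOOGLE_PROXY_RANGE = ipaddress.ip_network("72.14.192.0/18")

def accounting_ip(remote_addr, x_forwarded_for=None):
    """Pick the address the anti-scraper should count hits against."""
    remote = ipaddress.ip_address(remote_addr)
    if remote not in GOOGLE_PROXY_RANGE or not x_forwarded_for:
        # Never trust the header from addresses outside the proxy range.
        return remote_addr
    # Only the last entry was written by the proxy itself; anything
    # before it could have been supplied by the client.
    candidate = x_forwarded_for.split(",")[-1].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return remote_addr
    return candidate
```

That way a scraper going through the proxy would still be throttled per end user, while ordinary accelerator users would not all be lumped together under one Google address.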

bcc1234

1:09 pm on Jul 30, 2007 (gmt 0)

That's the latest user agent I'm seeing:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.5); .NET CLR 1.1.4322)"

So blocking by UA won't work.

Also, it seems like all hits are fetched through those IPs, not just the pre-loaded stuff. I'm not sure, maybe that's how it's supposed to work.

theBear

1:12 pm on Jul 30, 2007 (gmt 0)

That is exactly how prefetch works.

Search for it on WebmasterWorld; there are ways to turn it off at the server end.

Here ya go:

[webmasterworld.com...]

You may want to search a bit further depending on what you want to do with prefetch; there are other ways to handle it.

But the information is out there.

[edited by: theBear at 1:20 pm (utc) on July 30, 2007]

bcc1234

1:27 pm on Jul 30, 2007 (gmt 0)

OK, thanks.

Do you know if it can be used by scrapers in some way?
Like faking requests to the accelerator and pretending to be a toolbar so that Google does the fetching?