|Strange hits from Google's IP range trip my anti-scraper|
I'm getting hits from 220.127.116.11/18, which seems to belong to Google. But what's weird is that reverse DNS doesn't respond with the usual PTR record "*.googlebot.com."
On top of that, the request headers sent by the client usually have an X-FORWARDED-FOR header with some Comcast IP.
The clients from that range don't break the robots.txt restrictions but do hit hidden links on occasion.
Because reverse DNS for these IPs is not set up the way it is for the usual Googlebot, such hits trip the anti-scraping protection.
Is there a way for Google to either confirm or deny that it is their range?
I wouldn't mind adding it to the whitelist, but I would like to make sure those are real Google-related hits.
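For reference, the usual Googlebot check described above (reverse DNS, then a forward lookup to confirm) can be sketched roughly like this. It is only a sketch: the function name and the injectable lookup parameters are made up for illustration, the suffix list is an assumption based on the hostnames Google has published for its crawlers, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumed suffixes for genuine Google crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    The lookup functions are parameters (hypothetical design choice)
    so the logic can be exercised without touching real DNS.
    """
    try:
        # PTR lookup: first element of the result is the hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record at all

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # PTR exists but is not a Google crawler name

    try:
        # Forward lookup: third element is the list of IPv4 addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # The hostname must resolve back to the original IP.
    return ip in addresses
```

An IP from the range in question would fail this check because its PTR record is missing or does not end in googlebot.com, which is exactly why it trips the trap.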
You might want to do a search for Google Web Accelerator.
Is the agent really Googlebot?
That is a Google IP, and the Web Accelerator would cause a prefetch of hidden links. Because of that, it basically acts like a bot, and thus your defense system jabbered at you.
The prefetch can be turned off. It takes just a few lines in .htaccess, and you should be able to find them if you do a search.
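If the check has to live in application code rather than .htaccess, something along these lines might do. To my knowledge the Web Accelerator marked its prefetch requests with an "X-moz: prefetch" request header; treat that as an assumption and confirm it against your own logs. The function names are hypothetical, and `headers` is assumed to be a plain dict of request headers.

```python
def is_prefetch_request(headers):
    """Return True if the request carries an 'X-moz: prefetch' header.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower().startswith("prefetch"):
            return True
    return False


def status_for(headers):
    """Pick an HTTP status: refuse prefetches, serve everything else."""
    return 403 if is_prefetch_request(headers) else 200
```

Refusing only the prefetches means a real click on the same link still goes through, so the hidden trap links are no longer hit by the accelerator on its own.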
Not Googlebot user agent.
That's what I thought: it's some kind of Google proxy or gateway.
At first, I thought those were human reviewers working for Google, but because they seemed to hit the trap URLs all too often, I wasn't sure.
Do you know if it's possible at all to use that accelerator as a proxy? In other words, can a scraper use it somehow to copy content?
If not, then I'll just add the whole range to the whitelist and be done with it. If yes, then it gets more tricky.
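A middle ground between whitelisting the whole range and blocking it might be to count hits against the X-FORWARDED-FOR address, but only when the connection itself comes from the proxy range. A rough sketch, with assumptions: the range below is a documentation placeholder and not the real Google range, the function name is made up, and `headers` is assumed to be a dict keyed exactly as "X-Forwarded-For".

```python
import ipaddress

# Placeholder range for illustration only; substitute the proxy
# range you have actually confirmed.
TRUSTED_PROXY_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def effective_client_ip(remote_addr, headers):
    """Return the IP the anti-scraper should count hits against.

    The X-Forwarded-For header is honored only when the connecting
    address is inside a trusted proxy range; anyone else could have
    forged it.
    """
    remote = ipaddress.ip_address(remote_addr)
    if not any(remote in net for net in TRUSTED_PROXY_RANGES):
        return remote_addr

    forwarded = headers.get("X-Forwarded-For", "")
    # The rightmost entry is the one appended by the proxy we trust;
    # entries to its left were supplied by the client and may be fake.
    candidate = forwarded.split(",")[-1].strip()
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        # Missing or malformed header: fall back to the proxy address.
        return remote_addr
```

With this, a scraper sitting behind the proxy still gets tracked under its own address and can still trip the trap, while ordinary accelerator users are not lumped together under one Google IP.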
That's the latest user agent I'm seeing:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.5); .NET CLR 1.1.4322)"
So blocking by UA won't work.
Also, it seems like all hits are fetched through those IPs, not just the pre-loaded pages. I'm not sure; maybe that's how it's supposed to work.
That is exactly how prefetch works.
Search for it on WebmasterWorld; there are ways to turn it off at the server end.
Here ya go:
You may want to search a bit further. Depending on what you want to do with prefetch, there are other ways to handle it.
But the information is out there.
[edited by: theBear at 1:20 pm (utc) on July 30, 2007]
Do you know if it can be used by scrapers in some way?
Like faking requests to the accelerator and pretending to be a toolbar so that Google does the fetching?