cache.google.com requests

         

sirius

3:58 am on Aug 14, 2017 (gmt 0)



Hi,

I have noticed a huge increase in traffic to my website from IP addresses with a reverse DNS of cache.google.com.

None of these IPs are Google's servers, and they appear to be coming from around the world.

If you do a reverse DNS lookup on any of these IPs, they claim to be cache.google.com.

Here is a small list of some of the IPs from the past few days.

Have you noticed anything like this? Does the cache.google.com bot come to your website with these types of requests?

Do you think these requests are legitimate, or should I attempt to block them?

[edited by: keyplyr at 9:38 am (utc) on Aug 14, 2017]
[edit reason] IPs removed [/edit]

keyplyr

9:46 am on Aug 14, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi sirius,

I snipped the IPs since we don't publish individual IPs because of privacy concerns, but before I did I looked a few up and can verify you're correct. While the IPs are registered to various sources, they all rDNS to cache.google.com.

I believe these are Google AMP sites, but I'm not completely sure since I haven't seen this activity before. Thanks for reporting it.

What were the requests after on your site? Were these just referrers, or were the requests for certain files?

sirius

3:03 pm on Aug 14, 2017 (gmt 0)



Hi Keyplyr,

Thank you for approving my post!

I'm not sure what these are, and I'm wondering if they are part of an illegal proxy that has been hitting my site.

I have been battling a content scraper that takes my content as soon as I post it, and these requests are new to me.

These requests are coming to new posts on my website, with pretty much the same pattern as the other cloud proxy requests I was finally able to block. Whenever I make a new post, a cache.google.com request will come for that file.

What's weird is that ALL of my traffic is from the USA, and more than 75% of these cache.google.com requests are from other countries (AG, AT, etc.).

Shouldn't something on the Google.com domain point to Google-owned IP addresses?

Any help or suggestions would be greatly appreciated.

sirius

3:04 pm on Aug 14, 2017 (gmt 0)



I also do not have Google AMP setup or configured in any way.

lucy24

3:57 pm on Aug 14, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shouldn't something on the Google.com domain point to Google-owned IP addresses?

Not necessarily; even legitimate Google functions can benefit from using non-US or non-ARIN IP ranges. A couple years back, there was considerable blahblah about Google crawling from non-US ranges, though I don't know if they ever really did it, and if so, what IP ranges they used.

"cache" and "translate" (whether Google or some other major search engine) are both approaches used by scrapers, so that's a legitimate concern. Only you can decide whether it's worth the risk of also blocking law-abiding human users.

keyplyr

8:17 pm on Aug 14, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One possibility is your site is being framed by someone using an AMP page.

I say this because of your statement:
Whenever I make a new post, a cache.google.com request will come for that file.


You can stop remote framing of your pages with either or both of these methods:

A script:

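<script type="text/javascript">
// If this page is not the top-level window, it is being framed:
// replace the framing page with this page's own URL.
if (window.top !== window.self) {
window.top.location.href = window.self.location.href;
}
</script>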

A response header set in .htaccess (requires mod_headers):

Header always set X-Frame-Options "DENY"


Another measure you should consider is the Content Security Policy [wiki.mozilla.org]

sirius

2:54 pm on Aug 17, 2017 (gmt 0)



Thank you for the tips, keyplyr,

I have tried using the header as suggested to block iframing and the problem is still there.

I am now more sure these requests are not legitimate and I am working on redirecting them to the homepage for newer pieces of content.

What I'm attempting:

I have been collecting the IP addresses for over a week and now have a pretty big list. I have scanned the ranges close to each initial IP address to find the other offending IPs in that pool.

I now have thousands of IP addresses that all say cache.google.com.

All of these requests will no longer be able to request content that is newer than 5 days.
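
As a rough sketch of that rule (not my actual setup), the check could look something like the Python below. The network ranges are placeholder documentation addresses, and the names should_redirect, BLOCKED_NETWORKS and FRESH_WINDOW are made up for the example:

```python
import ipaddress
from datetime import datetime, timedelta, timezone

# Hypothetical collected ranges; these are documentation-only addresses,
# not the real pools seen in the logs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/25")
]

# Content newer than this is withheld from the listed ranges.
FRESH_WINDOW = timedelta(days=5)


def should_redirect(ip, published_at, now=None, networks=BLOCKED_NETWORKS):
    """Return True if a request from ip for a post published at
    published_at should be sent to the homepage instead."""
    now = now or datetime.now(timezone.utc)
    addr = ipaddress.ip_address(ip)
    is_fresh = now - published_at < FRESH_WINDOW
    in_blocked_range = any(addr in net for net in networks)
    return is_fresh and in_blocked_range
```

Storing whole ranges instead of thousands of single addresses keeps the list short, and older posts stay reachable from those ranges.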

There are also still new IP pools coming in, but they are mostly outside the USA.

So for all non-USA traffic, I am doing a reverse DNS lookup before the page loads and taking the appropriate action.

Do you think this type of setup would have any bad effect if some of these requests are legitimate?
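
For reference, whoever controls an IP range also controls its reverse DNS (PTR) record, so a PTR name on its own proves nothing. The usual check is forward-confirmed reverse DNS: look up the PTR name, then resolve that name forward and see whether it returns the original IP. A minimal Python sketch, assuming the accepted domain suffixes are passed in by the caller (the function names here are made up for the example):

```python
import socket


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of addresses that hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def is_forward_confirmed(ip, suffixes, reverse=reverse_lookup,
                         forward=forward_lookup):
    """True only if ip's PTR name is under one of suffixes AND that
    name resolves back to ip. Addresses are compared as plain strings,
    which is adequate for IPv4."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    under_suffix = any(
        hostname == s or hostname.endswith("." + s) for s in suffixes
    )
    if not under_suffix:
        return False
    return ip in forward(hostname)
```

Called as is_forward_confirmed(ip, ("google.com", "googlebot.com")), an IP whose PTR says cache.google.com fails the check unless cache.google.com also resolves back to that same IP. The suffix list is an assumption to adjust as needed.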

keyplyr

4:36 pm on Aug 17, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have tried using the header as suggested to block iframing and the problem is still there.
It may take a few weeks until cached content expires.

I recommend using those 3 methods of security, along with HTTPS, anyway.

However, without knowing exactly what we are seeing here, any countermeasure is only speculation.

You can safely block any server farm range. You can temporarily block individual ISP IP addresses.

There is always collateral damage. Server farms lease to companies that may be beneficial to your interests. They may also lease to home DSL. Company employees may surf from work, etc.

If you choose to block connectivity, it's highly recommended to keep a diligent watch on your logs to see exactly who is getting blocked.
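
One way to keep that watch is to tally blocked requests per IP from the access log. A minimal Python sketch, assuming the log is in Apache's common/combined format and that blocked requests are answered with a 403 (both are assumptions to adjust for your own server; blocked_by_ip is a made-up name):

```python
import re
from collections import Counter

# Matches the start of an Apache common/combined log line:
# client IP, identity, user, [timestamp], "request line", status.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def blocked_by_ip(lines, blocked_statuses=("403",)):
    """Count requests per client IP whose status marks them as blocked."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") in blocked_statuses:
            counts[match.group("ip")] += 1
    return counts
```

Feeding it an open log file, for example blocked_by_ip(open("access.log")) with a hypothetical file name, and reading counts.most_common(20) shows at a glance which addresses are hitting the block most often.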