
What is "cache.google.com"?

cache.google.com

         

JamesSC

7:39 pm on Jul 9, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm trying to understand more about a source of user hits which was initially discussed on Webmasterworld here:

[webmasterworld.com ]

The visitations come from any number of entities which appear to be ISPs rather than server farms but which all fall under the aegis "hostname: 'cache.google.com'" and which all present as their first request to my WordPress blog

GET /?taxonomy=post_tag&term=[TAG] HTTP/1.1


which is then automatically redirected to

GET /tag/[TAG]/ HTTP/1.1


after which the normal array of front page requests proceeds - absent a request for the URL itself referenced by [TAG].

I am also getting requests from this same cache.google.com group source which appear to be pre-publication page references of the sort

GET /?p=###### HTTP/1.1


followed by the full array of front page requests including the post-publishing permalink URL.
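For anyone wanting to pick these two request shapes out of their own logs, here is a minimal Python sketch. It assumes each log record exposes the raw request line (for example "GET /?p=123 HTTP/1.1"); the function name and the labels it returns are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def classify_request(request_line: str) -> str:
    """Label a raw request line as 'tag-query', 'post-id-query', or 'other'.

    'tag-query' is the /?taxonomy=post_tag&term=... form and
    'post-id-query' is the /?p=<digits> form described above.
    """
    parts = request_line.split()
    if len(parts) < 2:
        return "other"
    # parts[1] is the request target, e.g. "/?taxonomy=post_tag&term=foo"
    query = parse_qs(urlsplit(parts[1]).query)
    if query.get("taxonomy") == ["post_tag"] and "term" in query:
        return "tag-query"
    if "p" in query and query["p"][0].isdigit():
        return "post-id-query"
    return "other"
```

Counting the labels per client IP would show whether the query-form requests really do cluster on the cache.google.com addresses.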

I have a Google Analytics account, and the GA header is embedded on my front page.

To the best of my knowledge I have not been hacked in any way, nor does anyone else possess a back door key into my site, but I find the structure of these requests peculiar and their implied collective source a mystery.

Has anyone had any further experience with cache.google.com since August, 2017?

Thanks.

lucy24

9:54 pm on Jul 9, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Any relation to
webcache.googleusercontent.com
which I occasionally see as a referer? The oddity there is that it only requests supporting files, not pages, though the full referer string does indicate what page is involved. No WP, no GA on my side.

entities which appear to be ISPs rather than server farms but which all fall under the aegis "hostname: 'cache.google.com'"
Clarify, please. Who is supplying the hostname? Your server?

JamesSC

10:35 pm on Jul 9, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



No, Lucy, this doesn't appear to be googleusercontent.com, which I've also encountered.

Clarify, please. Who is supplying the hostname? Your server?


A lookup of the IP making the requests. Here is a typical one, redacted:

ip: "[REDACTED]"
hostname: "cache.google.com"
city: "[REDACTED]"
region: "[REDACTED]"
country: "US"
loc: "[REDACTED]"
postal: "[REDACTED]"
timezone: "America/Chicago"
asn: Object
  asn: "[REDACTED]"
  name: "[REDACTED]"
  domain: "[REDACTED]"
  route: "[REDACTED]"
  type: "isp"
company: Object
  name: "[REDACTED]"
  domain: "[REDACTED]"
  type: "isp"
privacy: Object
  vpn: false
  proxy: false
  tor: false
  hosting: false
abuse: Object
  address: "[REDACTED]"
  country: "US"
  email: "[REDACTED]"
  name: "[REDACTED]"
  network: "[REDACTED]"
  phone: "[REDACTED]"
domains: Object

On other lookup sites cache.google.com is listed as the PTR record. Some do not refer to it at all.

The IPs themselves supposedly belong to Acme Anvils, LLC, or Childfog Educational District, or any number of other entities, but regardless of their real world commercial function, all that I have seen so far exhibit that peculiar request behavior and all have cache.google.com attached as hostname/PTR record.
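The lookup described here can be reproduced without a third-party site. A PTR record is published by whoever controls the reverse zone for an address (normally the network operator), not by the name it points to, so it is worth checking whether the name also resolves forward to the same IP. The sketch below does both steps with Python's standard socket module. It is IPv4-only, and the injectable `reverse` and `forward` parameters are an illustrative convenience so the function can be exercised without network access.

```python
import socket


def ptr_lookup(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return (ptr_name, forward_confirmed) for an IPv4 address.

    ptr_name is None when the address has no PTR record.
    forward_confirmed is True only if the PTR name resolves back to `ip`.
    """
    try:
        name = reverse(ip)[0]  # (hostname, aliases, addresses)
    except (socket.herror, socket.gaierror):
        return None, False
    try:
        addresses = forward(name)[2]
    except (socket.herror, socket.gaierror):
        # The name exists only as a PTR target and has no address of its own.
        return name, False
    return name, ip in addresses
```

A result of ("cache.google.com", False) would match what is described in this thread: the PTR points at a name that does not resolve back to the address.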

JamesSC

9:49 pm on Jul 13, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Another flurry of these things. In each case the first request is for a WordPress tag on a new WordPress post rather than for the permalink, which is the norm. None requests the permalink itself, but each does request the cached page, as is the norm. Each comes from a different private/commercial ISP. All but one fall under the hostname/PTR record cache.google.com; the exception is cache-azo2.google.com.

There is some scenario which maps to this behavior - what is it? A WordPress tag scraper using a variety of compromised computers, all sharing cache.google.com?(?) Something to do with the way a Chrome browser bookmark works? COVID-19?

JamesSC

4:31 am on Jul 17, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



The scenario is that 7 out of 10 of the IPs associated with the unresolvable (if you can, by all means post the IP of cache.google.com or any other info at all here) cache.google.com are being reported by at least one lookup site as "suspected proxy servers".

As with Sirius in August, 2017 [webmasterworld.com ], the only way of effecting access control if desired is by individual IP or range after the fact, unless someone knows how to control resolvable server IPs/ranges via the as yet unresolvable cache.google.com.

As the one who post-edited Sirius' original August, 2017 post, Keyplyr can of course vouch for the relationship of the random IP addresses to cache.google.com.

lucy24

5:18 am on Jul 17, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"suspected proxy servers".
Do they send the X-Forwarded-For header? My impression is that the googloid ranges (74.125.blahblah, 66.102.blahblah) generally do.

JamesSC

11:58 am on Jul 17, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



No, Lucy.

Here are the header log entries (redacted) for the latest three visits making the peculiar WordPress-syntax direct tag request, belonging to the cache.google.com hostname/PTR record group, and being reported as "suspected proxy servers":

2020-07-13:20:48:44
URL: /tag/[TAG]/
IP: 205.213.###.#
Content-Length: 0
Connection: close
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
Host: www.example.com


2020-07-13:20:30:59
URL: /tag/[TAG]/
IP: 198.70.##.##
Content-Length: 0
Connection: close
Accept-Language: en-US
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
Host: www.example.com


2020-07-15:13:19:44
URL: /tag/[TAG]/
IP: 76.75.##.###
Content-Length: 0
Connection: close
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
Host: www.example.com

Again, I don't believe these can be characterized as googloid ranges - this is one of the mysteries: what is cache.google.com's IP or range?
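Checking many such entries by eye gets tedious, so here is a small Python sketch that parses one header log entry in the "Name: value" layout shown above and reports any proxy-revealing headers. The set of header names is an assumption about what a forwarding proxy might add (X-Forwarded-For is the one asked about; Via, Forwarded and X-Real-IP are included as common companions), and the function names are illustrative.

```python
# Header names a forwarding proxy might add (assumed list, lower-cased).
PROXY_HEADERS = {"x-forwarded-for", "via", "forwarded", "x-real-ip"}


def parse_entry(block: str) -> dict:
    """Parse one log entry of 'Name: value' lines into a dict.

    Keys are lower-cased. Lines without ': ' (such as the leading
    timestamp line) are skipped.
    """
    fields = {}
    for line in block.strip().splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields


def proxy_headers_present(fields: dict) -> list:
    """Return the sorted names of any proxy-revealing headers in the entry."""
    return sorted(PROXY_HEADERS.intersection(fields))
```

For the three entries above the result would be an empty list, which agrees with the "No, Lucy" answer: nothing in the headers themselves identifies these clients as proxies.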

These are, rather, non-Google servers with their own respective names and ranges, all sharing "hostname: cache.google.com" and for which the entity cache.google.com maintains the PTR records. And, with only a few exceptions which could easily be compromised zombie computers, being reported as "suspected proxy servers". Clarification: only those servers making the peculiar WordPress-syntax direct tag request which did not share "hostname: cache.google.com" were not reported as "suspected proxy servers"; all "suspected proxy servers" did.

One can find any number of references to "cache.google.com" or "hostname: cache.google.com" on the Internet, including the erroneous one that it belongs to Google Ireland Limited and resolves to 208.65.152.234 (that entity, in point of fact, hosts three YouTube IPs and a Google IP, and shares the same characteristic as the suspects above, save not being a "suspected proxy server").

So my little situation is beginning to shrink in my mind before the larger questions:

- What is cache.google.com, which is neither webcache.googleusercontent.com nor safebrowsing-cache.google.com (although the last may be related)?

- Why does it not have an associated IP or range?

- Why, out of the reputed 94,000+ servers for which it maintains PTR records, are such a large percentage being reported as "suspected proxy servers"?


[edited by: not2easy at 7:46 pm (utc) on Jul 17, 2020]
[edit reason] exemplified hostname [/edit]

JamesSC

2:07 pm on Jul 27, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



So...more of the same, what appears to me now to be identical to what Sirius first reported here [webmasterworld.com ]:

Immediately upon publication, and when I say immediately, I mean within seconds - and now the first tipoff - a swarm of hits, all from various "suspected proxy servers", random business and educational ISPs; all rDNSing back to cache.google.com; all innocuous and legitimate UAs, one, oddly, with no corresponding header log entry; all available header log entries apparently innocuous and benign as well.

How did they know to come immediately upon publication? Options:
- someone within my host, monitoring my site with a script (possible, but statistically unlikely; but many technically proficient people working in IT know of my site)
- one of my subscribers, for some reason wanting to visit my site using the same UA across a swarm of cache.google.com proxies immediately upon my publishing anything new
- some other external script-monitoring

The nature of this activity and intentions:
- unknown and malignant, but scraping or other evidence not yet demonstrable
- unknown and benign, only using proxies for anonymity
- unknown and something else

Access control, whether desired or not: none beyond after the fact blocking of random "suspected proxy server" IPs, IP ranges, or ASNs.

Reported twice now to Webmasterworld by two different people, Sirius and myself, as yet no clues as to what the phenomenon represents.

dstiles

10:54 am on Jul 29, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Could it be your broadband ISP caching to save it for other visitors using their bandwidth? I know this used to happen in the days of slow deliveries. Maybe still?

JamesSC

2:13 pm on Jul 29, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Could it be your broadband ISP caching to save it for other visitors using their bandwidth? I know this used to happen in the days of slow deliveries. Maybe still?

I find it difficult to understand how my broadband ISP would randomly refract responses to my new posts through the reported 94,000 servers (a claim made by an entity returned in an Internet search for c.g.c. offering a software package for purchase) for which cache.google.com, without any IP itself, maintains the PTR records, the majority of which I've checked are also reported as "suspected proxy servers", plus a few others not associated with cache.google.com.

There are other returns on the Internet for cache.google.com in addition to the Sirius one I cited, none informative. I would have hoped inquiring minds would have sussed out the particulars of cache.google.com long, long ago.

Before anyone volunteers to clap palm to my forehead, in a moment of clarity yesterday I did so myself: it is almost a certainty my new posts are being picked up by a subscriber or subscribers to my blog, through email or feed, if not one/several who has/have a tab of me open continuously.

Thereafter - hypothetically, to ensure IP anonymity - my newest offering is then accessed for review through some site offering a plethora of anonymizing proxy servers, a bulk of which 1) share cache.google.com as their PTR record keeper and 2) are reported as "suspected proxy servers". None of these responding server IPs has repeated itself, so the likelihood of them being actual subscribing workers at Acme Anvil or No Child Helped Educational Authority accessing me during a quiet moment at work seems slender to me.

This could be entirely benign, a soul or souls merely jealously guarding their privacy. If not for the peculiar log entries, I may never have noticed them. So far no obvious wholesale scraping of content has occurred, and should it be being done piecemeal into another language, I would never know it.

So for the time being my interest necessarily rests less on any actual damage being done to me and more on this mysterious entity cache.google.com and its unknown role in 1) maintaining all these apparently entirely unrelated ISP PTR records 2) of which ISPs a majority get reported as "suspected proxy servers" and 3) its future utility for someone actually intending my site harm.

If this sort of thing, benign or not (and let's assume for simplicity limited solely to the cache.google.com folks), were happening to one of you, how would you go about attempting to assert control over it, should you ever determine you needed to? I am already implementing htaccess proxy control headers, and because I use a VPN frequently myself, for the time being I'm not inclined to use the more astringent PHP script available for proxy-blocking purposes.

Steven29

6:11 pm on Jul 29, 2020 (gmt 0)



Hi,

I == Sirius.

These cache.google.com requests appear to have no relation to Google, as far as I have experienced, and blocking them will increase your rankings but attract WAY more visits.

I am currently blocking over 30% of requests to the website....

I believe these requests are related to a seo attack and using hacked machines.

What I've experienced is these requests are taking content from the internet and spinning it (sometimes exact content on hundreds of domains to get you penalized for higher search keywords) by putting the content on more hacked machines.

Just do a search in Google for the Past 1 hour for any 3 letter random combination. You will see 90% of the links being viruses and spam. These domains and links are cloaking and doing other things to benefit certain people and domains.

Example searches for the past 1 hour: "xli", "sse", "bew" and so on. Look for all of the domains with the random subdomains.

Where do you think those hacked machines are getting all of the "content"?

JamesSC

8:48 pm on Jul 29, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks, Steven29.

I understand the content spinning and this is the sort of thing I fear, although I'm not getting the number of hits in this context one would expect from a concerted spinning effort. But, then, one browser-copied copy is all it takes to seed the process, no?

One particular keyword which I am incongruously ranking for has been bouncing up and down since the beginning of the year - usually in concert with Google's updates.

I tried your 3 letter combination thing, but I haven't yet gotten anything obvious; perhaps I'm misunderstanding you.

If I am understanding you, though, what you are blocking are the individual servers revealed to have this relationship with cache.google.com I'm talking about, but only after the fact.

Any more light you or anyone else could shed on this mysterious cache.google.com and the phenomena associated with it would indeed be appreciated.

Steven29

9:58 pm on Jul 29, 2020 (gmt 0)



Hi,

I wanted to add more information into these requests and answer your question above:

Here is a screen shot of what I see doing a search on Google for the past 1 hour: [i.ibb.co...]

If you click any of those links, you will be redirected about 10 times and then 99% to a page to install a Browser Extension that usually has less than 1,000 active users. Is this not what you see? It's for anything and these links flood my niche searching for real keywords. My Google Alerts provide 80% hacked links these days.

------

There is an article by doing a Google Search that seems to imply these are legitimate requests made to Google Analytics to "speed up google analytics", which makes no sense as these requests are hitting the server and not the Analytics Pixel.

There are some additional subdomains that come through periodically, like:

cache.google.com.64.90.64.in-addr.arpa
external-192-154-121-5.cache.google.com
71.cache.google.com
100.cache.google.com
133.cache.google.com
akr.cache.google.com

etc.
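A small Python sketch for recognising these hostname variants in one place. The pattern is an assumption built only from the names reported in this thread (cache.google.com, numbered and lettered subdomains of it, and cache-azo2.google.com). It deliberately does not match the in-addr.arpa form, webcache.googleusercontent.com, or safebrowsing-cache.google.com. Since a PTR record can be set to anything by whoever runs the reverse zone, a match is a hint and not proof of what the machine is.

```python
import re

# Matches: cache.google.com, <anything>.cache.google.com, cache-<id>.google.com
# (optionally with a trailing dot). Built from names seen in this thread only.
GGC_PTR = re.compile(
    r"^(?:[a-z0-9-]+\.)*cache(?:-[a-z0-9]+)?\.google\.com\.?$",
    re.IGNORECASE,
)


def looks_like_ggc(hostname: str) -> bool:
    """Return True if a PTR hostname follows the cache.google.com naming."""
    return bool(GGC_PTR.match(hostname.strip()))
```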

I see repeat visits from the IPs, usually never in the same day, but within 1 - 4 days, and when returning the last .x of the IP will in most cases differ by 1 number.

The only requests that seem to have some merit are:

209.141.121.128/26

AND

IPS on the scnresearch [dot] com/ network, which I believe is a network of websites that bring 'Analytics' to help competitors find content to post.

I see requests coming with a referrer of "https://www.google.com/" (usually to the homepage, but see requests periodically to different pages with this referrer).

15:14:42 - 7/29 | 67.218.93.x | /loginxxxxxx | Referer: [google.com...] (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36 Edg/83.0.478.61
15:12:38 - 7/28 | 67.218.93.x | /xxxxxx | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
15:15:32 - 7/27 | 67.218.93.x | /xxxxx | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
10:37:30 - 7/27 | 67.218.93.x | /xxxxxx | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
16:12:27 - 7/26 | 67.218.93.x | / | Referer: [google.com...] (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36

iamlost

11:43 pm on Jul 29, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mea culpa, I read this thread on first post but was in a hurry and thought others would know and answer, then forgot to check back.

cache.google is the default server name (rDNS) of every GGC (Google Global Cache) edge node that Google provides in partnership with ISPs for timely delivery of Google content.


Server Naming / Reverse DNS
Please configure reverse DNS entries for all servers’ IP addresses to cache.google.com

From page 7 of GGC Installation and Operation Guide [docshare04.docshare.tips]
Note: the guide dates from 2011, while GGC was still in beta.
Note: I’ve blanket blocked all such since 2008.

JamesSC

12:53 am on Jul 30, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks for the additional info, Steven29. I believe I've seen some of those same ranges myself.

JamesSC

1:16 am on Jul 30, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Note: I’ve blanket blocked all such since 2008.


When you say you have blanket blocked all such, iamlost, which all have you blocked, how have you done so - after the fact of discovering [an IP? something else?], or before the fact [using what?]? - and in your case why did you find it necessary?

I've looked through the pdf and I have yet to discover anything I might get a grip on.

And thanks much for following up.

iamlost

2:54 am on Jul 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




When you say you have blanket blocked all such, iamlost, which all have you blocked, how have you done so

As part of my bot mitigation I run rDNS on non-whitelisted IPs; and put timeouts on cache.google results.

and in your case why did you find it necessary?

It was eating bandwidth without giving anything in return. And why would I want a Google services CDN server chatting with my info site? It’s not a traffic referrer. Heck I block SEs including G from over half my content. But then iamlost :)
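The rDNS step described above could be sketched roughly as follows in Python. This is one reading of "put timeouts on cache.google results", taken here to mean a temporary block per IP; the class name, the one-hour default, and the injectable `lookup` and `clock` parameters are all illustrative assumptions, not a description of the actual setup.

```python
import time


class RdnsGate:
    """Allow or temporarily block clients based on their PTR hostname."""

    def __init__(self, lookup, whitelist=(), timeout_seconds=3600,
                 clock=time.monotonic):
        self.lookup = lookup              # callable: ip -> hostname or None
        self.whitelist = set(whitelist)   # IPs that skip the rDNS check
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.blocked_until = {}           # ip -> time at which the block ends

    def allow(self, ip: str) -> bool:
        if ip in self.whitelist:
            return True
        now = self.clock()
        if self.blocked_until.get(ip, 0) > now:
            return False                  # still timed out, no new lookup
        hostname = (self.lookup(ip) or "").lower().rstrip(".")
        if hostname == "cache.google.com" or hostname.endswith(".cache.google.com"):
            self.blocked_until[ip] = now + self.timeout_seconds
            return False
        return True
```

The `lookup` callable would typically wrap a reverse DNS query. Caching the block per IP keeps the number of lookups down, which matters because resolving rDNS on every request is slow.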

JamesSC

4:53 am on Jul 30, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks, iamlost. Makes sense.

So that's what we have here: a Google-serving, in multiple senses of the term, CDN. I'm still not sure what role it's playing in my particular situation, though, nor whether I might have as efficient tools to deal with it if I need to. I already know if I use mod_authz in my .htaccess file to deny by domain I'll screw up my logs; not worth the merely hypothetical threat at this juncture. Maybe something using mod_setenvif.

Thanks again to all.

Steven29

4:56 pm on Jul 30, 2020 (gmt 0)



"From page 7 of GGC Installation and Operation Guide [docshare04.docshare.tips]"

How does this make sense? If the requests were coming to IMAGES or JAVASCRIPT to serve cached content... that may speed things up...

In my experience, the requests are only requesting "NO-CACHE" pages.... So how would this even work?

"Our Edge Network is how we connect with ISPs to get traffic to and from users"

What does Google have to do with any user connecting to any website? Do they have the power to re-route traffic and send "cached content" or anything else they seem fit?

iamlost

8:23 pm on Jul 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It’s how Google gets YouTube et al content to users faster, as over half (whatever is popular) of requested content is cached closer (the user’s ISP) than Google’s own data centres.

Steven29

9:40 pm on Jul 30, 2020 (gmt 0)



"It’s how Google gets YouTube et al content to users faster, as over half (whatever is popular) of requested content is cached closer (the user’s ISP) than Google’s own data centres."

I don't use YouTube at all and I specifically have my pages to No-Cache and have different versions of the website depending on the type of browser.

Does what you're saying mean that users are allowed to "bypass" my Firewall, because they are now being served content directly from Google and the ISP and therefore do not need to request from my server?

How do I opt-out?

iamlost

10:27 pm on Jul 30, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There is no ‘opt out‘. It is simply traffic. The referring IPs are diverse, hundreds if not thousands of ISPs (remember that these are Google hardware racks located at ISPs to speed their users’ requests), and all rDNS resolve to cache.google regardless of actual location. The only ‘tell tale’ is that cache.google resolution, so most if not all other referral identification methods simply aren’t in play. Of course, unfortunately, rDNS resolving mass traffic is impractical for various reasons for many/most sites.
Note: I haven’t investigated this for over a decade so there may be some other ID method, if so great; I’m not aware of one however.

Steven29

1:07 am on Jul 31, 2020 (gmt 0)



I fail to see how the caching of anything on my website would help benefit the end user to load faster. Maybe if they are showing an image through their domain, as a snippet or something? Why can't the Google Bot serve the file to these "nodes"?

Just think about it? Does Google also cache my SSL certificate and key/response or are they making SSL certificates for every website on the internet so they can serve the cached file? Just downloading the SSL Certificate for 1 file and then downloading the real SSL Certificate would be slower than requesting the original file.

I can see how caching of their own products would have an increase of speed, but there should be no reason they are requesting every single webpage on a website.

I saw these requests before.. but here and there. I have now blocked the entire cloud network ranges and now see these requests for every piece of content.

Maybe something like this has been compromised? Check this out: [medium.com...]

If so, there is a huge vulnerability with the Google services being intercepted, as the SSL would most likely be compromised as well if installed on these nodes (why wouldn't it be?).

Any ISP with this "service" would be able to decipher any traffic to Google services. They are trusting "thousands of ISPs" globally with their key and response? What a bad idea.

Why aren't we being told about this New Awesome Technology anywhere? Is it like the "Li-Fi" not to be confused with "Wi-Fi" networks?

"The term was first introduced by Harald Haas during a 2011 TEDGlobal talk in Edinburgh.[1]. In technical terms, Li-Fi is a light communication system that is capable of transmitting data at high speeds over the visible light, ultraviolet, and infrared spectrums. In its present state, only LED lamps can be used for the transmission of visible light.[2]"

JamesSC

2:18 am on Jul 31, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



My understanding from iamlost is that the only thing supposed to be flowing through cache.google.com is copies of Google content stashed geographically closer to the participating ISPs users: a Google-only CDN to distribute only Google content to participating ISP users.

To date, I have not detected any overtly malicious activity from these GGC-based requests - with two caveats.

1. They began sometime after I advertised and promoted my site in what can only be described as a cyber version of the original Star Wars bar: script kiddies began hatching from the woodwork, which in turn provided an excellent ongoing survey of bad guy farms. So far, all controllable at will.

2. What tipped me off at all to these GGC servers was my receiving, as external requests, queries whose formulations I have only previously seen internally in WordPress. I checked my header logs, and even the search engines which spider such content only use the external URL versions. Differently put, it seemed the calls were coming from within the house.

Now at this point, all these "suspected proxy servers" may make sense: they may appear to artificial programs assessing them no different from actually dedicated proxies. Except that in my case they also appeared to be actually being uniformly used as proxies.

So the phenomenon I've been experiencing behaves like proxies and formulates requests like someone with back end access to my site.

Do I block GGC indiscriminately? The jury is still out for me. So far, no discernible damage, but the creepiness factor remains very high.

So what is it in my power to do, if I wish to?

My little SetEnvIf code fragment hasn't definitively done anything yet, and, since I'm on an Apache server, it's almost a certainty HostnameLookups has been turned off anyway, stopping rDNS lookups at the server level.

I do employ a true WAF firewall as the belt to my .htaccess suspenders, and it has the power to block certain critical things by ASN, so that, should I desire to, I can

1. Obtain any IP revealing GGC as hostname.

2. Block the entire ISP by AS number through the WAF.

Given the destroy the village to save it nature of much of that, I'd really like something more elegant, but, on the other hand, if these GGC nodes really can be used as proxies at will by either insiders or outsiders with the know-how, what we are really talking about here are not compromised zombie computers, but rather compromised zombie ISPs.

Corrections and better solutions are welcome.
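As a middle ground between blocking single IPs and blocking a whole ASN, the after-the-fact list could be widened to ranges and collapsed. The Python sketch below uses the standard ipaddress module. Widening every IP to its /24 is an assumption (it fits the report that returning IPs often differ only in the last number, but it will overblock), it handles IPv4 only, and the output lines assume Apache 2.4 `Require not ip` syntax, which would need to sit inside a `<RequireAll>` block alongside `Require all granted`.

```python
import ipaddress


def collapse_to_networks(ips, prefix=24):
    """Widen each IPv4 address to its /prefix network and merge neighbours."""
    nets = {ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips}
    return list(ipaddress.collapse_addresses(nets))


def apache_deny_lines(networks):
    """Format networks as Apache 2.4 'Require not ip' lines (assumed syntax)."""
    return [f"Require not ip {net}" for net in networks]
```

The IPs fed in would be only those already confirmed by lookup to carry the cache.google.com PTR, so the list stays tied to observed behaviour and does not condemn an entire ISP.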

Steven29

7:29 pm on Jul 31, 2020 (gmt 0)



In my experience, these requests are not as simple as blocking them as they are tied in with millions of other ip addresses. If you focus on blocking 1 door, the requests will come through another door.

It has taken me almost 2 years to stop what I call these attacks.

If you are being flooded by cache.google.com you are on the right track of blocking the bots! I know they are directly tied to a competitor that magically posts whatever I do, no matter the time of the day, within 20 minutes in most cases.

I can see the request come through and then they post it immediately after! I have now made it much harder for them, as I'm sometimes blocking 1,000 requests from all new IPs before something will get through. It has made the rate at which the copying happens much slower and I am finally able to get some pieces of content up without them finding it!

For these requests, I made a simple database that records every ip that visits the website and uses a cron job to check the rDNS. If the IP is cache.google.com I will store it forever and re-run checks on them every 2 weeks to make sure they are still resolving that way. When first setting up, you will want a way to add bulk ips into the list for the cron job to check.
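A rough Python sketch of that kind of setup, using SQLite from the standard library. The table layout, function names and the injectable `lookup` callable are illustrative assumptions; only the overall flow (record every IP, check rDNS from a scheduled job, keep cache.google.com matches and re-check them every 2 weeks, allow bulk adds) follows the description above.

```python
import sqlite3
import time

RECHECK_SECONDS = 14 * 24 * 3600  # re-check flagged IPs every 2 weeks


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS seen_ips (
               ip TEXT PRIMARY KEY,
               hostname TEXT,
               is_ggc INTEGER NOT NULL DEFAULT 0,
               checked_at REAL)"""
    )
    return db


def record_ips(db, ips):
    """Add visitor IPs (one or in bulk); already-known IPs are left untouched."""
    db.executemany("INSERT OR IGNORE INTO seen_ips (ip) VALUES (?)",
                   [(ip,) for ip in ips])
    db.commit()


def run_checks(db, lookup, now=None):
    """Check never-checked IPs and flagged IPs whose last check is 2 weeks old.

    `lookup` is a callable ip -> PTR hostname or None. Returns the number of
    IPs checked in this run. Intended to be called from a cron job.
    """
    now = time.time() if now is None else now
    due = db.execute(
        "SELECT ip FROM seen_ips "
        "WHERE checked_at IS NULL OR (is_ggc = 1 AND checked_at <= ?)",
        (now - RECHECK_SECONDS,),
    ).fetchall()
    for (ip,) in due:
        hostname = (lookup(ip) or "").lower().rstrip(".")
        is_ggc = int(hostname == "cache.google.com"
                     or hostname.endswith(".cache.google.com"))
        db.execute(
            "UPDATE seen_ips SET hostname = ?, is_ggc = ?, checked_at = ? WHERE ip = ?",
            (hostname, is_ggc, now, ip),
        )
    db.commit()
    return len(due)
```

The block list is then simply `SELECT ip FROM seen_ips WHERE is_ggc = 1`.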

I wish I could provide more information, as I've tried to get it many times, but even asking on the Google Forums will get your post terminated for "Violating their Rules"?

tomorrow

11:55 pm on Aug 2, 2020 (gmt 0)

5+ Year Member



hello there. are you using chrome? i noticed a while back that under certain circumstances, some URLs entered into the chrome omnibox would be met shortly afterwards by a visit from a cache.google.com IP address posing as a recent (but not necessarily the most recent) version of chrome. i noticed this because purely internal URLs were showing up in apache logs within seconds or minutes of hitting them. the calls were coming from inside the house, as you put it. i could not replicate this on other computers, in other browsers on the same computer, or in any other scenario. just on one computer and only in chrome. i set up several honeypot URLs across several domains and was able to replicate this over and over again on command. until i could not because it just stopped suddenly.

the best guess i have put together is that google is scanning any URL it finds externally from the browser. this could be a part of its safebrowsing program or some other threat scanning mechanism? this might explain why there is no documentation about it anywhere. i imagine that google doesn’t want to make public knowledge all of the methods it uses to discover new risks. by employing server side URL scanning separately from indexing/crawling, they could arguably make a case for ignoring robots.txt directives and/or privacy concerns about indexing information that isn’t meant to be crawled

i will note that all things put into the chrome omnibox - including perfectly good URLs - are sent to google as a search query. if i’m correct in my guess, google scans all URLs (or at least some portion of them) that it comes across separately from its typical crawling/indexing and uses this for some other purpose (my guess is threat assessment)

this appears to be new behavior and it doesn’t seem to be entirely consistent but there seem to be a few anecdotes around the internet regarding calls coming from inside the house attached to IPs that rDNS to cache.google.com
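for anyone who wants to repeat the honeypot test, a minimal python sketch. it assumes you can serve arbitrary paths on your own domain and read the access log as text lines; the "/hp/" prefix and the function names are made up for illustration.

```python
import secrets


def make_honeypot_path(prefix="/hp/"):
    """Return an unguessable path that is never linked or published anywhere."""
    return f"{prefix}{secrets.token_urlsafe(16)}"


def honeypot_hits(log_lines, honeypot_paths):
    """Return (path, log_line) pairs for every log line mentioning a honeypot path."""
    hits = []
    for line in log_lines:
        for path in honeypot_paths:
            if path in line:
                hits.append((path, line))
    return hits
```

enter a fresh path in one browser only, then watch for any second client requesting the same path. since the path exists nowhere else, a later hit from a different IP can only have learned it from that browser session or something observing it.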

JamesSC

4:36 pm on Aug 3, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



hello there. are you using chrome?


Extremely interesting.

No, I don't, but the UA both Steven29 and I have observed as the agent in these visits has been Chrome for Windows on a desktop:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36

I say extremely interesting for several reasons: An external actor somehow commandeering a private CDN as large as GGC as his own personal proxy pool to be used as desired requires some explaining. The CDN doing so itself for its own reasons requires much less.

In my case, I have yet to observe real damage of the sort of scraping Steven29 reports, but the requests were abnormal enough to make me take notice.

One obvious question on the table remains: what is the real difference between Bad Hacker being able to use GGC as a proxy and Google itself using GGC in the way you described, clearly beyond its supposed brief of simply delivering cat videos to remote grannies more rapidly? Each is using the same network to take liberties with my site I would prefer neither take.

Steven29

5:29 pm on Aug 8, 2020 (gmt 0)



"I have yet to observe real damage of the sort of scraping Steven29 reports, but the requests were abnormal enough to make me take notice.".

Here is how it affects me, but the scenario would be different for each niche.

Let's say for example, somebody was doing this to this forum for new interesting news...

Every time a new topic would be posted, the post would be replicated on 3 or 4 Tier #1 websites that each have 500,000 - 1.5 million Facebook fans starting within 20 minutes and posted on 1 after another until they take the #1 spot.

If that doesn't work after posting on the 4 websites, in come Tier #2 websites of spun content (sometimes they rank, so Google doesn't understand grammar?) in hopes of grabbing the position.

If all else fails, then comes a spam network of hundreds of those hacked links (see screen shot above) that will use your exact text and flood the search until your page is removed and only shows if you click "If you like, you can repeat the search with the omitted results included."

That is how this network is affecting me and can all be accomplished within 24 - 48 hours maximum.

I have many ideas I blog and post about, I even wait days or post at odd hours - but the exact same thing happens.

Since blocking all of the cloud networks, VPN networks, and these requests, I have seen a huge improvement: this happens far less often.

The first way they try to "get the ranking": if my post went up at 7:31 AM, they will set their WordPress post time to 7:01 AM, which makes Google think I'm copying them.

There are LOTS of shady practices and outright violations of both Google's and Facebook's rules, but no action has ever been taken.

"Like all of our posts and leave comments to be entered into a drawing. The most posts you like and comment on, the better your chances are. Good luck!"

^------ the above scenario has been posted 2 times daily for the past 2 years and winners are not even contacted, but rather posted on a page you have 8 hours to check before it "resets".

There are LOTS more things that make you go WTF?

P.S. The 4 "Tier #1" websites are actually Tier #2 (if their plan would work), as there is a domain name bought for $500,000 that is waiting to be the Tier #1.

This happens for every single piece of content they post, whether it be from me or another competitor, and their domains cover multiple niches where they are doing the same thing. Reporting them will result in violations on YOUR account.

If this type of activity is allowed (after lots of trying, I finally got a response that said "We are not the police"), say so and I will blow them out of the water.

BotNumber1

10:45 am on Feb 12, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



I was wondering about this same thing myself and through a Google search found this thread. I found this website about this that's interesting:
https://www.thyngster.com/google-using-isps-to-cache-google-analytics-endpoint


Now in my case I suspect shenanigans from IPs whose PTRs are cache.google.com, and here's why:

I run a forum much like this one and recently had a user create more than one account, which I don't allow. Due to my layers of security, they were forced to use their real IP address from a legit popular ISP to sign up. When I looked in my CloudFlare firewall logs, I saw two blocked IPs that were within minutes, if not seconds, of each other, and I'm pretty sure they are related somehow. For one, one of the blocked IPs made a query to activate the user account, which you can see in the image below. Like this forum, you have to activate your account via email. Comparing the query from that blocked IP with the email my forum sent to this user, they match. So I know it's the same user, just two different IPs: the legit IP that was used to sign up, and another IP that was blocked and has a PTR/host of cache.google.com. Googling these two IPs has led to some hits. One hit was a log of some sort from a guy trying to get Let's Encrypt to work.

I don't really understand this cache.google.com thing though. What's really interesting is the IPs used for registration on my forum. A legit IP and another IP for clicking the activation link in email.

I can see legit ISPs caching Google stuff, but to see a legit IP sign up on my forum and then use another IP with a cache.google.com host to activate the account via email is weird. And there are a total of three IPs here. The legit IP, and two I'll show images below.
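
For anyone who wants to script the PTR check described above, here is a minimal sketch in Python, not a definitive tool. It does a reverse lookup and then a forward lookup to see whether the hostname resolves back to the same IP. The function names (`reverse_lookup`, `forward_lookup`, `classify_ip`) are my own, hypothetical ones, and the forward check as written only covers IPv4. Keep in mind that a PTR record is set by whoever controls the reverse zone for that IP block, so a match without forward confirmation is only a hint about who operates the address.

```python
import socket
from typing import Callable, List, Optional, Tuple


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses hostname resolves to (empty list on failure)."""
    try:
        _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
        return addrs
    except (socket.herror, socket.gaierror):
        return []


def classify_ip(
    ip: str,
    target: str = "cache.google.com",
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> Tuple[Optional[str], bool, bool]:
    """Return (ptr_hostname, ptr_matches_target, forward_confirmed).

    forward_confirmed is True only if the PTR matches the target AND the
    hostname resolves back to the original IP.
    """
    hostname = reverse(ip)
    if hostname is None:
        return (None, False, False)
    matches = hostname.rstrip(".").lower() == target
    confirmed = matches and ip in forward(hostname)
    return (hostname, matches, confirmed)
```

Calling `classify_ip("203.0.113.7")` (a documentation-range placeholder, not one of the IPs from my logs) performs live DNS queries. The `reverse` and `forward` parameters exist only so the logic can be exercised with stand-in resolvers.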

Shodan and Censys have nothing and neither does Greynoise or AbuseIPDB. If you haven't heard of those, well there you go. Here's another.
https://hidemy.name/en/proxy-checker/


Images:

IP shown is different than the registration IP. Both are different providers, but this one was a cache.google.com host. I must have accidentally removed that bit from the pic. [imgur.com...]

This was the second or first hit within minutes or seconds apart. [imgur.com...]

The common denominators are Chrome, Win 10, and an old HTTP version.
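
Those three traits could be checked against a server access log with something like the sketch below. It assumes Apache/nginx "combined" log format, where the user agent is the last quoted field, which is not what the CloudFlare firewall view shows. Treating HTTP/1.0 and HTTP/1.1 as the "old" versions is also my assumption. The names `LINE_RE`, `OLD_PROTOCOLS`, and `matches_fingerprint` are made up for illustration. On its own this fingerprint is far too common to block on, so it is only useful together with the PTR check.

```python
import re

# Request line ("METHOD path HTTP/x.y") plus the final quoted field,
# which is the user agent in combined log format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>HTTP/[\d.]+)"'
    r'.*"(?P<ua>[^"]*)"\s*$'
)

# Assumption: "old HTTP version" means anything before HTTP/2.
OLD_PROTOCOLS = {"HTTP/1.0", "HTTP/1.1"}


def matches_fingerprint(line: str) -> bool:
    """True if the log line shows Chrome on Windows 10 over an old HTTP version."""
    m = LINE_RE.search(line)
    if m is None:
        return False
    ua = m.group("ua")
    # Note: other Chromium-based browsers (e.g. Edge) also send "Chrome/".
    return (
        m.group("proto") in OLD_PROTOCOLS
        and "Windows NT 10.0" in ua
        and "Chrome/" in ua
    )
```

Running every line of a log through `matches_fingerprint` and then PTR-checking only the IPs that match keeps the number of DNS lookups down.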

Any guesses as to what this cache.google.com thing is? Is that even a factor here or am I looking for something else?