Google-AMPHTML bot

Undocumented Google crawler?

Giacomo

8:50 am on Jul 13, 2021 (gmt 0)

Hi all, long time no write here. :)

I've been seeing dozens of daily requests by a "Google-AMPHTML" User-agent that I never noticed before, coming in from various verified Google IP addresses, such as:

66.102.8.219 - - [12/Jul/2021:09:15:55 +0200] "GET [snip] HTTP/1.1" 200 6512 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.90 Mobile Safari/537.36 (compatible; Google-AMPHTML)"

I looked up the U-A in Google's documentation [developers.google.com] and it is not mentioned. A Google search brought me here: [user-agents.net...]

Looks like the bot is following incoming links from Twitter. I see that it doesn't support JavaScript nor cookies, and ignores robots.txt, as it's been requesting URLs under a Disallowed path on this website (yep, I'm 110% sure about that). It's not even an AMP site, BTW.

Anyone seen this guy before and/or have any idea what it is?

Thanks!

lucy24

6:40 pm on Jul 13, 2021 (gmt 0)

66.102 is Google but it isn't a crawl range; it's one of those other Googloid areas, meaning it could be anything, legit or not. AMPHTML does sound like it means AMP + HTML, doesn’t it.

incoming links from Twitter. I see that it doesn't support JavaScript nor cookies, and ignores robots.txt

Well, that's rude of it, especially when you consider that the Twitterbot does seem to honor robots.txt. Did you mean that it asks and doesn't honor, or doesn't ask in the first place? And, tangentially, have you met robots that do use cookies? I haven’t. (I have one set of pages that redirect smartphones if there isn't a cookie saying “Yes, yes, I’ve been here before and know what I’m doing”; the mobile Googlebot always gets redirected.)

Giacomo

9:49 pm on Jul 13, 2021 (gmt 0)

AMPHTML does sound like it means AMP + HTML, doesn’t it.

No doubt it’s an AMP pages bot/proxy. The odd thing, as I said, is that the website getting crawled is not using AMP HTML. Maybe those tweets linking to it have been reposted to some AMP pages somewhere, go figure.

Did you mean that it asks and doesn't honor, or doesn't ask in the first place?

I haven't seen a single request for /robots.txt from Google-AMPHTML in my logs.

And, tangentially, have you met robots that do use cookies?

Yeah. Not regular search engine crawlers, but bots nonetheless.
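For anyone who wants to run the same robots.txt check on their own logs, here is a minimal sketch. It assumes the Apache/nginx combined log format (the format of the log line quoted at the top of the thread); the regex and the function name `robots_requests` are illustrative and not taken from anyone's actual setup.

```python
import re

# Combined Log Format:
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def robots_requests(lines, agent_token):
    """Return (total_requests, robots_txt_requests) for user agents
    whose string contains agent_token. Unparseable lines are skipped."""
    total = 0
    robots = 0
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m or agent_token not in m.group("agent"):
            continue
        total += 1
        # Ignore any query string when comparing the path.
        if m.group("path").split("?", 1)[0] == "/robots.txt":
            robots += 1
    return total, robots
```

Calling it as `robots_requests(open("access.log"), "Google-AMPHTML")` (the file name is a placeholder) gives the two counts. A result with a non-zero total and zero robots.txt requests would match what is described above, though it only covers the log period examined and says nothing about robots.txt being fetched under a different user agent.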

Giacomo

8:17 am on Jul 15, 2021 (gmt 0)

66.102 is Google but it isn't a crawl range; it's one of those other Googloid areas, meaning it could be anything, legit or not.

Could you please elaborate on that conclusion? What do you mean by "not legit"? Thanks.

lucy24

4:55 pm on Jul 15, 2021 (gmt 0)

In this case, all I mean is that there's no comprehensive list of everything that uses the range 66.102.0.0/18, or what they use it for. What, for example, does Google Preview do? (It used to be an adjunct of searching, but I don't think that function has existed for years.) Is Google Translate used strictly for law-abiding human purposes? (Some sites block it because it can be a way for scrapers to sneak in.)
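As a side note, checking whether an address falls inside a range like this does not need any hand arithmetic. A minimal sketch using Python's standard `ipaddress` module follows; the helper name `in_range` is made up for illustration.

```python
import ipaddress


def in_range(ip, cidr):
    """True if the address ip lies inside the CIDR block cidr."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


block = ipaddress.ip_network("66.102.0.0/18")
print(block.num_addresses)                         # 16384 (2**14 addresses)
print(in_range("66.102.8.219", "66.102.0.0/18"))   # True
```

The address tested is the one from the log line in the first post. A /18 holds 16,384 addresses, which is four times the 4,096 in a /20.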

:: detour to recent logs, looking only at this specific range (there are others that behave similarly) ::

Google Favicon can be considered legitimate. But I'd forgotten about Google Docs. In fact, if I ever knew what it does or is supposed to do, I've forgotten that too. Chrome-Lighthouse is another one that I Have Doubts about; perversely, the fact that I saw it request robots.txt once makes me more suspicious rather than less. I've never been sure of GoogleImageProxy, especially since its UA continues to say Firefox 11 (!).

Total for 66.102.0.0/18 (i.e. this range by itself takes up four times as much IPv4 real estate as the entire crawl range) in calendar year 2021:

One (1) Feedfetcher

Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)

which I consider a bad_agent.

Two (2) apparent humanoids: one an image from Google Search; one a page (only) with the dreaded ",gzip(gfe)" tacked on to the UA. (Possibly a repeat visitor via translate, judging by /piwik/ request coming in at the same time from a non-Google IP.)

A handful (less than 2% of total) weblight:

Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19

I block these under the designation botnet_agent

about 3% of total: Chrome-Lighthouse

Mozilla/5.0 (Linux; Android 7.0; Moto G (4)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4143.7 Mobile Safari/537.36 Chrome-Lighthouse

I used to block this UA, but currently don't. Worth noting that one of the requests was for /asset-manifest.json -- a file I do not have, never did have, and which probably isn't intended for humans anyway.

about 4% of total: ImageProxy

Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

another 4%: snippet

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google (+https://developers.google.com/+/web/snippet/)

Blocked, probably for deficient headers.

yet another 4%: SearchByImage

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7

These are all blocked, probably due to the Firefox/3.

edging up to 5%: GoogleDocs

Mozilla/5.0 (compatible; GoogleDocs; documents; +http://docs.google.com)

about 1/6 of total: Favicon

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon

That includes most but not all of the 301s (from HTTP to HTTPS), which is understandable because part of its job involves GSC.

and finally Preview, accounting for more than half of all requests:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/89.0.4389.112 Safari/537.36

You can see what a mishmosh it all is.
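A breakdown like the one above can be produced by matching a distinctive token in each user-agent string and counting. The following is a rough sketch only: the tokens are lifted from the UA strings quoted in this post, while the labels, the `classify` and `shares` helpers, and the "other" bucket are assumptions for illustration.

```python
from collections import Counter

# Token found in the UA string -> short label used in the tally.
AGENT_TOKENS = {
    "Feedfetcher-Google": "Feedfetcher",
    "googleweblight": "weblight",
    "Chrome-Lighthouse": "Chrome-Lighthouse",
    "GoogleImageProxy": "ImageProxy",
    "/web/snippet/": "snippet",
    "Google-SearchByImage": "SearchByImage",
    "GoogleDocs": "GoogleDocs",
    "Google Favicon": "Favicon",
    "Google Web Preview": "Preview",
}


def classify(agent):
    """Return the label of the first known token found in agent."""
    for token, label in AGENT_TOKENS.items():
        if token in agent:
            return label
    return "other"


def shares(agents):
    """Map each label to its percentage of all requests in agents."""
    counts = Counter(classify(a) for a in agents)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Feed `shares` the user-agent field of every request from the range in question, one entry per request. Anything landing in "other" is worth reading by hand, since that is where an unfamiliar agent would show up.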

not2easy

5:56 pm on Jul 15, 2021 (gmt 0)

Google uses the 66.102. ranges for site verification in GSC. I had them blocked because of mischievous users and had to open them up somewhat. I have no idea how frequently they want to look at that, but my notes show:
66.102.128.0/20 11/18
66.102.0.0/20 8/20

Giacomo

7:36 pm on Jul 15, 2021 (gmt 0)

In this case, all I mean is that there's no comprehensive list of everything that uses the range 66.102.0.0/18, or what they use it for.

Lucy24, I don’t think Google ever published such lists, did they?

The only official way that I’m aware of to “verify if a web crawler accessing your server really is Googlebot (or another Google user agent)” is this: [developers.google.com...]

When I said “verified Google IP addresses”, I meant just that. Nothing more, nothing less.

So, Google-AMPHTML does appear to be a legitimate (meaning genuine, not spoofed) Google crawler. We just don’t know what it’s used for, being undocumented.
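For reference, the verification procedure mentioned above (reverse DNS lookup, check the domain, then a forward lookup that must return the original IP) can be sketched as follows. This is an illustration, not Google's code: the suffix list is my recollection of what the documentation names and should be checked against the current page, and the function name `is_google_ip` and the injectable `reverse`/`forward` parameters are made up so the logic can be exercised without network access.

```python
import socket

# Assumed list of accepted hostname suffixes; verify against Google's
# current documentation before relying on it.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_google_ip(ip, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for an IPv4 address.

    reverse and forward default to the real resolver calls and can be
    swapped for stubs when testing.
    """
    try:
        # Step 1: reverse lookup gives the claimed hostname.
        host = reverse(ip)[0]
        # Step 2: the hostname must end in an accepted domain.
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False
        # Step 3: forward lookup of that hostname must include the IP,
        # otherwise the PTR record could simply be forged.
        return ip in forward(host)[2]
    except OSError:
        # Covers socket.herror / socket.gaierror (no record, lookup failure).
        return False
```

Note that `socket.gethostbyname_ex` resolves IPv4 only, so IPv6 addresses would need `socket.getaddrinfo` instead. A pass here says the address belongs to Google, which is all "verified" means in this thread; it says nothing about which Google service sent the request or why.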