

Googlebot will start crawling over HTTP/2 in November 2020

         

phranque

11:42 pm on Sep 22, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Quick summary: Starting November 2020, Googlebot will start crawling some sites over HTTP/2.

(source: Google Webmaster Blog
Googlebot will soon speak HTTP/2 [webmasters.googleblog.com]
Thursday, September 17, 2020)

also worthy of note:
How to opt out

Our preliminary tests showed no issues or negative impact on indexing, but we understand that, for various reasons, you may want to opt your site out from crawling over HTTP/2. You can do that by instructing the server to respond with a 421 HTTP status code when Googlebot attempts to crawl your site over h2.
...


fyi:
https://tools.ietf.org/html/rfc7540#section-9.1.2
The 421 (Misdirected Request) Status Code
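For anyone wondering what the opt-out might look like in practice, here is a rough Apache sketch. This is hypothetical and untested, not anything Google published: it assumes mod_rewrite is loaded, that your Apache exposes the HTTP/2 request protocol as "HTTP/2.0" in SERVER_PROTOCOL, and that you actually want to single out Googlebot rather than disable h2 entirely.

```apache
# Hypothetical sketch: answer Googlebot's h2 requests with 421
# (Misdirected Request) so it falls back to HTTP/1.1.
RewriteEngine On
RewriteCond %{SERVER_PROTOCOL} =HTTP/2.0
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# mod_rewrite's R flag accepts arbitrary status codes; non-3xx codes
# stop rewriting and return that status directly.
RewriteRule ^ - [R=421,L]
```

The simpler alternative, of course, is to remove h2 from the server's Protocols line so nobody negotiates HTTP/2 at all.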

phranque

11:55 pm on Sep 22, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



interesting that the HTTP/2 spec went with a 421 status code instead of the preexisting 505 from the HTTP/1.1 spec:
https://tools.ietf.org/html/rfc7231#section-6.6.6
(505 HTTP Version Not Supported)

lucy24

1:12 am on Sep 23, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible 505 is intended for servers that are genuinely not able to support /2.0, as opposed to 421 “we could, but we just don’t feel like it”? (Query: But ... why?)

I don't know what response a /2.0 request would get if the server can't support it (Apache version < 2.4.whatever-it-is, for example). Whatever it is, it doesn't make it as far as logs.

iamlost

8:08 pm on Sep 23, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Server push: This feature is not yet enabled; it's still in the evaluation phase. It may be beneficial for rendering, but we don't have anything specific to say about it at this point.

is an intriguing mention.

Typically push allows the server to push resources to a browser prior to receiving an explicit request, on the principles that (1) the server knows the required resources of the requested page, and (2) the server may have a better understanding of the current network connection for optimisation of page delivery/load time.

It definitely can help with browser request-to-render time; where it helps with a bot returning resources to its home server, I'm uncertain. Yes, it could certainly decrease wait time, and Googlebot's waits definitely add up; however, that is only true when the resources pushed are capped at the resources required/wanted. And I can, in my opaque rubber ball, foresee 'SEOs' sending all sorts of crap in all sorts of configurations in attempts to game the system. Google might well spend more time dropping or cleaning received data than is saved by accepting a push. And currently I see far too many poorly configured HTTP/2 servers, especially mangled push attempts; I expect such will only increase as the mass of less tech-savvy webdevs join the parade.
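For reference, a minimal push configuration on Apache can be sketched roughly like this. It assumes mod_http2 and mod_headers are loaded, and /css/site.css is a placeholder path, not anything from this thread; mod_http2 turns Link preload headers into pushes when H2Push is on.

```apache
# Sketch: push a stylesheet alongside any HTML response.
H2Push on
<FilesMatch "\.html$">
    Header add Link "</css/site.css>;rel=preload;as=style"
</FilesMatch>
```

Misconfigure that FilesMatch or the Link header and you get exactly the kind of mangled push attempts described above.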

lucy24

8:31 pm on Sep 23, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Typically push allows server to push resources to a browser prior to receiving an explicit request
That's interesting, because I have occasionally wondered what happens if a server sends out material that hasn't been requested--for example if a malign robot puts a fake IP on its request, so the material is sent to some unsuspecting stranger. Where does it go? What happens to it at the end? (Obviously not especially relevant in the current situation, since non-HTTPS and HTTP/2 are pretty much mutually exclusive.)

blend27

8:51 pm on Sep 23, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



--- because I have occasionally wondered what happens --

If the sign on the door says no shoes, no shirt = no access, then one can't really get in; and if one can't get in, then one can't get out, not even with a residue of honey on their beard.

@iamlost .. :all sorts of crap in all sorts of configurations....

If I could, I would have voted thrice for that comment.

JorgeV

9:37 pm on Sep 23, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Hello,

It was about time. HTTP/2 has been an official standard since 2015!

SumGuy

1:36 pm on Sep 27, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



> Googlebot will start crawling some sites over HTTP/2.

What does that look like? What shows up in my server logs? What files are requested (and on what ports) according to what-ever this http/2 thing is?

lucy24

3:33 pm on Sep 27, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What shows up in my server logs?
Are you on Apache? Look for the GET element:
"GET / HTTP/1.0"
"GET / HTTP/1.1"
"GET / HTTP/2.0"
Ordinarily they'll be using the same 443 port as the HTTPS requests you have (I hope!) been getting for years.
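If you want a quick tally of which protocol versions are hitting you, something like this works on a combined-format Apache log. The log lines below are fabricated samples purely for illustration; point awk at your real access.log instead of the here-document.

```shell
# Write a few fabricated combined-log sample lines to a temp file.
cat <<'EOF' > /tmp/sample_access.log
66.249.66.1 - - [15/Oct/2020:06:25:24 +0000] "GET / HTTP/1.1" 200 5120
66.249.66.1 - - [15/Oct/2020:06:26:01 +0000] "GET /page HTTP/2.0" 200 4096
192.0.2.7 - - [15/Oct/2020:06:27:13 +0000] "GET / HTTP/1.0" 200 5120
EOF
# Field 2 between the double quotes is the request line; its third
# word is the protocol version. Tally versions across the log.
awk -F'"' '{split($2, r, " "); print r[3]}' /tmp/sample_access.log | sort | uniq -c
```

On the sample above that prints one line each for HTTP/1.0, HTTP/1.1, and HTTP/2.0, which makes any Googlebot shift toward HTTP/2 easy to spot over time.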

As I understand it, HTTP/2.x requires HTTPS, so you will not see anything new on the HTTP side, assuming your server keeps separate logs. (Mine does, and it's shared hosting so I assume this is standard practice.)

phranque

11:37 pm on Sep 27, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What files are requested (and on what ports) according to what-ever this http/2 thing is?

same files and same port.
the number of connections would change.
(i.e., most of the difference is not in the "HTTP layer")

SumGuy

12:14 am on Sep 28, 2020 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm running the Abyss web server on a Win-NT4 box. IIS4 is serving HTTP, and Abyss is taking care of HTTPS. I have a lot of PDF files that I've been redirecting to HTTPS. Abyss has been up and running, serving a copy of the site, for about 2+ years now, and Google has been crawling (and linking to) the HTTPS site for a while, but it's only been during the past month that I've been redirecting HTML file-gets from HTTP to HTTPS.

All that said, I never see the "http/x.y" method show up in the logs. If it's a GET or a HEAD, that's all I see. But I know (now) that Abyss (at least the version I'm running) does not serve HTTP/2.

lucy24

1:08 am on Sep 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I never see the "http/x.y" method show up in the logs. If it's a get, or head, that's all I see.
If you think the knowledge would be useful, it is probably possible to customize this element, assuming you control the server. Looking from the other direction, some site administrators like to restrict HTTP/1.0 access, since not many legitimate robots--and no humans--use it.

There's definitely no difference in request headers between 1.x and 2.0; that was the first thing I looked at when I saw my server starting to report HTTP/2.0 requests.
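The HTTP/1.0 restriction mentioned above might look like this in Apache. A sketch, not a recommendation; it assumes mod_rewrite, and THE_REQUEST is the full request line as received, e.g. "GET / HTTP/1.0".

```apache
# Sketch: refuse HTTP/1.0 requests outright with a 403.
RewriteEngine On
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteRule ^ - [F]
```

Whether that is worth doing depends on how many legitimate HTTP/1.0 clients you still see; test against your own logs first.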

dstiles

2:46 pm on Sep 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there something in Apache that has to be set/changed to allow HTTP/2.x?

lucy24

3:40 pm on Sep 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For starters, you have to be on 2.4.17 or higher, which you probably are by now. One information source is here [http2.pro], or if you prefer there's the horse’s mouth [httpd.apache.org]. In particular there’s a necessary module, “aptly named mod_http2” (thank you, Apache, for that touch of humanity).
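For completeness, once you are on 2.4.17+ the enabling step is roughly the following (module path varies by distro, and many packaged builds ship mod_http2 already loaded):

```apache
# Load the aptly named module, then advertise h2 ahead of HTTP/1.1
# so clients can negotiate it over TLS.
LoadModule http2_module modules/mod_http2.so
Protocols h2 http/1.1
```

The Protocols directive can go in the main server config or a VirtualHost; without it, Apache keeps speaking plain HTTP/1.1 even with the module loaded.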

dstiles

12:45 pm on Sep 29, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, Lucy. I'll investigate. :)

engine

11:26 am on Nov 16, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Googlebot latest documentation [developers.google.com...]

lucy24

5:50 pm on Nov 16, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I’d forgotten this thread, so I checked my logs for HTTP/2 from the Googlebot IP (66.249.blahblah). Interestingly, there was a scattering of HTTP/2 requests throughout October--where “a scattering” = about 1/70 of all Googlebot requests during the month--but none so far in November. They seem to be entirely random: an HTTP/2.0 request for some page might be followed by an HTTP/1.1 request for some other page a few minutes later. And the same page will be requested by HTTP/2 one day, HTTP/1.1 another day.

Edit: Pages only, until October 31 when they requested a couple of stylesheets. And these are interesting, because unlike the page requests, the HTTP/2 stylesheets were intertwined with HTTP/1.1 requests for the same two stylesheets, all with the same page as referer. This, in turn, let me see that the response size is smaller for 2.0 than for 1.1.

engine

2:25 pm on Jan 14, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google is now sending out notifications to site owners that their sites are being crawled over HTTP/2.

aristotle

7:47 pm on Jan 22, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Maybe this is a dumb question, but how does google send these notifications?


[edited by: not2easy at 10:24 pm (utc) on Jan 22, 2021]
[edit reason] Author request [/edit]

phranque

10:28 pm on Jan 22, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



how does google send these notifications?

afaik via gmail to GSC account owners.