First week of logs from new https site - Webmaster General forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque


First week of logs from new https site

Looking at Google and Bing hits

SumGuy

5:02 am on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

To summarize: My HTTP website has been running on IIS4 / NT4 since 1998. A month or so ago I began looking into HTTPS but couldn't get a workable cert for IIS4, so I have installed Abyss web server and replicated all my site's original files over to the Abyss server. Other than making sure my site answers on both http (IIS4) and https (Abyss) I have no re-direct (301 - yes?) from http to https.

Looking at the Abyss logs from Nov 8 to 14, Google has hit the https site and requested only these files:

/.well-known/assetlinks.json (got 404)
/ads.txt (got 404)
/index.html
/robots.txt

It has requested the above multiple times during those 6 days (total = 31) from 66.249.64.x and 66.249.66.x. So it has not been crawling the site at all. Bing has been more active (more hits) but it's only asking for /index.html and (rarely) /robots.txt. Bing hits came from 157.55.39.x and 207.46.13.x, about a couple hundred requests in total. So Bing isn't crawling the https site either.
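A tally like the one above can be produced from the raw log with a short script. This is only a sketch: it assumes the server writes NCSA common/combined-format lines, and the IP prefixes are simply the ones quoted in this post, not an authoritative list of crawler addresses. The names `CRAWLER_PREFIXES` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Prefixes taken from the addresses quoted in the post (not an official list).
CRAWLER_PREFIXES = {
    "66.249.64.": "google",
    "66.249.66.": "google",
    "157.55.39.": "bing",
    "207.46.13.": "bing",
}

# Assumption: NCSA common/combined log format, e.g.
# IP - - [date] "GET /path HTTP/1.1" 200 1234
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def tally(lines):
    """Count requests per (crawler, path, status) for known crawler prefixes."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        for prefix, name in CRAWLER_PREFIXES.items():
            if m.group("ip").startswith(prefix):
                counts[(name, m.group("path"), m.group("status"))] += 1
                break
    return counts
```

Calling `tally(open("access.log"))` (the file name is a placeholder) gives a counter keyed by crawler, path and status code, which makes it easy to see at a glance whether anything beyond /index.html and /robots.txt is being requested, and whether the responses are 200s.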

I'm wondering why google and bing, after getting the index.html page, aren't using it to crawl the site yet. They've been doing that to the http site for many years.

Yesterday it appears I did get an actual hit to the https site (from an IP in Israel) with these parameters:

IP: 83.130.239.X
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Referer:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&ved=(lots of alphanumeric characters)&url=https%3A%2F%2F(my-domain)%2F&usg=(more alphanumeric)

Accept: text/html, application/xhtml+xml, image/jxr, */*
Accept-Language: en-US,en;q=0.7,he;q=0.3
Accept-Encoding: gzip, deflate

This is my first experience seeing extended logging info like the accept headers. Am I missing anything?

Regarding the above referer URL - I'm a bit surprised or puzzled by the extended string. I don't see that in the logs for the http site. Is there any use I can make out of this extended URL?
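For what it's worth, the referer is an ordinary URL, so its query string can be unpacked with standard tools. A minimal Python sketch follows. It assumes that the `url` parameter holds the result that was clicked and that `cd` is the position of that result on the page; the second point is a common reading of the parameter, not something confirmed in this thread. `describe_google_referer` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

def describe_google_referer(referer):
    """Pull the readable parts out of a google.com/url?... referer string."""
    parts = urlsplit(referer)
    # parse_qs percent-decodes values and drops empty ones such as "q=".
    params = parse_qs(parts.query)
    return {
        "host": parts.hostname,
        "target": params.get("url", [None])[0],   # the page that was clicked
        "position": params.get("cd", [None])[0],  # assumed: rank in the results
    }
```

The empty `q=` means the search terms themselves are not passed along, so the referer shows which page was clicked but not what was searched for.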

phranque

5:29 am on Nov 16, 2018 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

are you using GSC and BWT to track crawling and indexing of the http and https sites?

I have no re-direct (301 - yes?) from http to https.

why not?
you should 301 redirect http requests to the same path on the https hostname.
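The "same path on the https hostname" rule is independent of the server software. As a rough sketch of the mapping (not of how IIS4 or Abyss would be configured), the Location value for each 301 could be computed like this in Python. `https_location` is a hypothetical helper name, and the sketch assumes the https site answers on the default port 443 under the same hostname.

```python
from urllib.parse import urlsplit, urlunsplit

def https_location(request_url):
    """Return the https URL with the same host, path and query string."""
    parts = urlsplit(request_url)
    # .hostname drops any explicit port; assumes https runs on the default 443.
    return urlunsplit(
        ("https", parts.hostname or "", parts.path or "/", parts.query, "")
    )
```

The point of the design is that every http URL maps to exactly one https URL, so deep links and query strings survive the move instead of everything being sent to the home page.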

I'm wondering why google and bing, after getting the index.html page, aren't using it to crawl the site yet.

are you internally linking to the https urls?

lucy24

6:12 am on Nov 16, 2018 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I'm wondering why google and bing, after getting the index.html page, aren't using it to crawl the site yet.

To put everyone's mind at rest, please say explicitly that B and G have not just been requesting index.html, but have been receiving it (200 response).

Or rather--ahem, cough-cough--what I'd really prefer to hear is that they have been requesting example.com/index.html and receiving a 301, followed immediately by a request for example.com/ and-that's-all. (I don't remember about bing, but last time I looked closely, Google follows redirects almost immediately--within a few minutes--unless they happen to have crawled the redirect target within the last hour or so.)

Meanwhile, what's going on with http requests? Are they separately logged? The search engines should be requesting everything on their ordinary shopping lists, and all of those requests should be receiving 301 responses to https. (301 rather than the default 302 because, well, you're not going back.) So even if for some reason they hadn't stumbled across the https site, they'd be made aware of it by new redirects.

While you're at it, please double-check robots.txt and make sure it doesn't accidentally have any Disallows that you put in there temporarily while you were making the transition.

You should expect a full top-to-bottom Google crawl within no more than 24 hours after they discover the existence of https://example.com/ And probably sooner still, if you've got a big site. (Again, I'm reporting my personal experience.) So, yeah, something is Not Right.

justpassing

8:52 am on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

Do you have a canonical tag, or a canonical HTTP header?

If your HTTPS index.html has a canonical tag pointing to your non-HTTPS index.html, this tells search engines to prefer the non-HTTPS site.

In your log, was the HTTPS index.html response the right size? (meaning the page was served as it should)

For your internal navigation, are you using relative or absolute URLs? (If all your navigation links from index.html point to the non-HTTPS pages, the crawler will never browse the HTTPS version.)
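Both checks (canonical tag and absolute http:// links) can be automated over a saved copy of the page. This is just a sketch using Python's standard HTML parser; `LinkAudit` is a hypothetical class name, and it only looks at `<link rel="canonical">` and `<a href>`, not at HTTP headers or other elements.

```python
from html.parser import HTMLParser

class LinkAudit(HTMLParser):
    """Record the canonical URL (if any) and anchors that hard-code http://."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.http_links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = href
        elif tag == "a" and href.lower().startswith("http://"):
            self.http_links.append(href)
```

After `audit = LinkAudit(); audit.feed(html_text)`, an `audit.canonical` that starts with `http://`, or a non-empty `audit.http_links`, would point crawlers back at the non-HTTPS site.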

SumGuy

2:51 pm on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

> please say explicitly that B and G have not just been requesting
> index.html, but have been receiving it (200 response).

> In your log, was the HTTPS index.html response the right size?
> (meaning the page was served as it should)

Yes, B&G have been requesting HTTPS and receiving HTTPS index.html with 200 response. Bytes transferred indicates the entire file was received.

> I'd really prefer to hear is that they have been requesting
> example.com/index.html and receiving a 301, followed immediately
> by a request for example.com/ and-that's-all.

Abyss is serving files to G&B (except for ads.txt and the .well-known/assetlinks.json thing, because I don't have those files). In order for Abyss to serve anyone any files, they have to be requesting https://mydomain.com/whatever directly, because I have not yet implemented a re-direct from the http site to the https site.

In other words, G&B "discovered" my HTTPS server by themselves, by testing for it. They must do this at least once a day, and do it continuously, given how quickly they hit the Abyss server once I had it up and running. They must always be in an HTTPS search/discovery mode for any/all HTTP domains/sites in their archive - they are not depending on an http -> https redirect. Whether G&B treat a non-redirected HTTPS site as a different site from the http version of the same site in terms of search / content, I don't know.

> Meanwhile, what's going on with http requests? Are they separately logged?
> The search engines should be requesting everything on their ordinary
> shopping lists, and all of those requests should be receiving 301 responses
> to https.

IIS4 is still serving and logging HTTP requests like it has been for the past 20 years, with no regard for or awareness of the Abyss HTTPS server now operating on port 443. G&B are still crawling it like they have for years, with no change in frequency or in the variety of files they are requesting. I haven't yet modified the HTTP site-files (index.html) to re-direct to HTTPS - I'm not opposed to that, I do intend to do that, I'm just not in any hurry.

> While you're at it, please double-check robots.txt

I've edited (cut down) the robots file (made it less restrictive) on the https site. It never had any mention of G/B anyways.

> You should expect a full top-to-bottom Google crawl within no more than
> 24 hours after they discover the existence of https://example.com/ And
> probably sooner still, if you've got a big site. (Again, I'm reporting
> my personal experience.) So, yeah, something is Not Right.

As I've said, G/B discovered the HTTPS functionality within 24 hours of the Abyss server coming on-line, even though I have no http -> https redirect.

> Do you have a canonical tag, or a canonical HTTP header?

I have <meta NAME="KEYWORDS" and <meta NAME="DESCRIPTION" and <title> lines. But I think that's all. Those haven't changed for 10+ years. The site was designed back in 2000 and the word "canonical" does not appear in the html code anywhere. I've just looked up what a "canonical tag" is and I guess I should incorporate that into our site-code.

> If your HTTPS index.html has a canonical tag pointing to your non-HTTPS
> index.html, this tells search engines to prefer the non-HTTPS site.

Well ok, but I don't have the canonical tag.

By the way, the hit from Israel (with google referer) I mentioned in the first post was a full HTTPS hit. Their browser grabbed all files (gifs, etc) from the Abyss server. The IIS4 HTTP server didn't serve any files to them.

> For your internal navigation are you using relative or absolute URLs ?

I've just looked at the code for the index.html page - all links are relative. The string "http" does not appear anywhere in the index.html file.

justpassing

3:12 pm on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

Then, I am not sure what is going on.

Now, there is a possibility that if G/B find the same site, on both HTTPS and HTTP, but without redirection from HTTP to HTTPS that they continue to consider the HTTP one to be the official site.

SumGuy

3:30 pm on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

> Now, there is a possibility that if G/B find the same site, on both
> HTTPS and HTTP, but without redirection from HTTP to HTTPS
> that they continue to consider the HTTP one to be the official site.

Except that on Nov 14, someone in Israel did a google search for something, and google gave them a result URL that pointed to us (index.html) using HTTPS. Meanwhile google is continuing to give HTTP results (index.html) pointing to us as usual.

One change I will make to the html files today on the HTTPS site is to hard-code all links (change them from relative) to HTTPS my-domain. See if that kicks google into crawling them.

lucy24

5:40 pm on Nov 16, 2018 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I have not yet implemented a re-direct from the http site to the https site

Why in the world not, for ### sake? The redirect should have been set up the moment you verified that the https is working properly.

justpassing

5:44 pm on Nov 16, 2018 (gmt 0)

Top Contributors Of The Month

One change I will make to the html files today on the HTTPS site is to hard-code all links (change them from relative) to HTTPS my-domain. See if that kicks google into crawling them.

Not a good idea in my opinion. There is just one way to do things right when switching: redirect all HTTP requests to HTTPS.

phranque

1:09 am on Nov 17, 2018 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

what I'd really prefer to hear is that they have been requesting example.com/index.html and receiving a 301, followed immediately by a request for example.com/

^^^this^^^

tangor

1:51 am on Nov 17, 2018 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Sounds like you are confusing the se's by attempting to have cake and eat it too...

redirect the http server to the https server and get on with life.

There IS a difference between http and https and running both simply will not work properly.