
Forum Moderators: goodroi

robots.txt and HTTPS

     
10:59 pm on Aug 20, 2017 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:14426
votes: 576


To redirect or not to redirect?

When I changed my personal site to HTTPS about three months ago, I put in a universal redirect: all requests, no exceptions, were redirected from HTTP to HTTPS. Since then, I've noticed one law-abiding search engine regularly asking for material in a disallowed directory--but only on the HTTP side, never HTTPS.
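For reference, a universal redirect of that sort typically amounts to a couple of lines of server config, something like this minimal sketch (assuming Apache with mod_rewrite in .htaccess; the hostname is a placeholder, and this may not match the exact rules in use here):

RewriteEngine On
# Any request that did not arrive over HTTPS is sent to the same URL over HTTPS
RewriteCond %{HTTPS} off
RewriteRule (.*) https://www.example.com/$1 [R=301,L]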

This makes me wonder if I should poke a hole for robots.txt, leaving it as the one file accessible both ways. What do other people think?
1:24 am on Aug 21, 2017 (gmt 0)

Moderator from US 

keyplyr

joined:Sept 26, 2001
posts:10631
votes: 630


Some bots cache robots.txt for up to a month, so that may explain the request. Personally, I wouldn't change my setup to accommodate an agent so cheap that it can't update at least every 24 hours, the way most bots do.

Just curious, though: if they are "regularly asking for material in a disallowed directory," why would you poke a hole to allow it over either HTTP or HTTPS? Doesn't disallow mean disallow, no matter the protocol?
2:53 am on Aug 21, 2017 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:14426
votes: 576


My hypothesis is that they're thinking of http and https as two separate sites, just like with-and-without www. Since they have no current robots.txt on file for the http site, they're behaving as though it doesn't exist. They never ask for disallowed material on the https side, where they get robots.txt on a regular basis.
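If that's what is happening, the crawler is keeping two separate records, one per origin (the hostname below is a placeholder for the actual site):

http://www.example.com/robots.txt    <- currently answers with a 301 to the https version
https://www.example.com/robots.txt   <- the copy they fetch on a regular basis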

While typing the above, I was running a multi-file search to verify my "on a regular basis" assertion. They used to ask pretty consistently once a day, but since January 2016 they've gone berserk, getting robots.txt many times a day--a behavior that continues up to the present. (Hm. Maybe they bought up the bingbot's contract?)

Now here's a weird detail: before the https move, they never once got a 301 response on robots.txt, meaning that they have never once asked for the wrong-www form. Maybe they really are confusing protocol with hostname.
4:44 am on Aug 21, 2017 (gmt 0)

Moderator from US 

keyplyr

joined:Sept 26, 2001
posts:10631
votes: 630


Well, if it's Google, Bing, or Yandex, you can manually submit your current robots.txt via their webmaster tools.

But really, I wouldn't worry about it. You have a 301 redirect that works correctly and it seems they have a bug.

Not much else to do.
5:12 am on Aug 21, 2017 (gmt 0)

Administrator from US 

not2easy

joined:Dec 27, 2006
posts:3556
votes: 196


Is the disallowed directory listed in the robots.txt they fetch via https, and they request it anyway, but only via http? That doesn't quite make sense, because a "Disallow: /directory" rule doesn't use any protocol. It would be the same as if you were to disallow a given page with "Disallow: /page.html" and they requested it via http. Is it only requests to that disallowed directory that come in over the http protocol?
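For illustration, the rules themselves really are bare path prefixes, with no scheme or hostname anywhere in the file (the names below are placeholders); whether a crawler applies the copy it fetched over https to the requests it makes over http is exactly what's in question here:

User-agent: *
Disallow: /example-directory/
Disallow: /example-page.html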

Does their request for /disallowed-directory via http return a 301? It makes me wonder whether their request is due to finding a link on some site to that directory via http. I don't see the harm in serving robots.txt via either/or, but it is peculiar.
3:49 pm on Aug 21, 2017 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:14426
votes: 576


Does their request for /disallowed-directory via http return a 301?

Yes, absolutely everything does, without exception. That's why I wondered if it might be a good idea to serve robots.txt--ONLY--as-is, rather than redirecting to https.
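The hole would be a single exception tested ahead of the redirect, roughly like this (again only a sketch, assuming Apache/mod_rewrite and a placeholder hostname):

RewriteEngine On
# Serve robots.txt as-is on both protocols; redirect everything else to HTTPS
RewriteCond %{HTTPS} off
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]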

The robot in question is Seznam. While poring over their info pages I learned, incidentally, that they understand "Allow:". Some sources have claimed that nobody but Google recognizes this usage, but evidently this is in error.
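For anyone who hasn't used it, "Allow:" re-opens part of an otherwise disallowed area. A crawler that supports it would stay out of the directory except for the one allowed file (names below are placeholders; the more specific Allow line is listed first, which keeps it effective whether the crawler uses first-match or longest-match rules):

User-agent: *
Allow: /example-directory/public-page.html
Disallow: /example-directory/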

I've written to inquire, but thanks to time zones I don't expect them even to see the question until tomorrow.

:: irritably wondering if skies will remain firmly overcast until precisely 11:30, when it will no longer matter ::
6:49 pm on Sept 3, 2017 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:14426
votes: 576


:: bump ::

I discovered while looking for something else that Google (yes, keyplyr, the real Googlebot from the real crawl range) very briefly did the same thing--or rather, the exact opposite of the same thing. As part of its comprehensive crawl of the https site it requested several URLs in roboted-out directories.

Next time I move a site to https I'm going to exempt robots.txt from the redirect and see if anything is different.
11:50 am on Oct 19, 2017 (gmt 0)

New User

joined:Feb 2, 2017
posts:20
votes: 1


You should definitely do both: redirect AND change robots.txt.