Bingbot, Robots.txt and Port 80

or When Will Bingbot Grow Up?

         

dstiles

3:20 pm on Apr 25, 2021 (gmt 0)

Ever since moving sites to HTTPS (port 443) Bingbot has spent part of its energy hitting robots.txt on Port 80, which begets 403 responses. Meanwhile, Bingbot ignores my requirement in robots.txt to not visit CSS files and suchlike, especially for preview scans. It occurs to me the two may be linked.

How can I open port 80 to JUST bingbot - or is it even worth it? Bing is not a major SE in the UK so it may be as well to ignore this idiocy and save my time. On the other hand I have a couple of sites of international use.

JorgeV

3:35 pm on Apr 25, 2021 (gmt 0)

Hello,

Your port 80 is open; otherwise your server could not issue a 403 code.

In any event, you need to keep port 80 open to handle and redirect non-https requests.

As for the problem of Bingbot not respecting your robots.txt file, you can always post the content of the file here; maybe there is a mistake in the syntax.

lammert

3:43 pm on Apr 25, 2021 (gmt 0)

Returning 403 on port 80 may not be the best approach. Bots (and more important, humans) following old existing links to the http content may get the error and never return. A universal 301 redirect from port 80 to the equivalent URL on port 443 should do the trick.

not2easy

3:46 pm on Apr 25, 2021 (gmt 0)

You can add a line to your https rewrite:
RewriteCond %{SERVER_PORT} 80
This should also take care of old incoming links that were never updated to https.
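
For context, here is a minimal sketch of where that condition might sit in a complete redirect rule. It assumes mod_rewrite is enabled and that the rule lives in .htaccess or a Directory section; www.example.com is a placeholder hostname, not anyone's real setup.

```apache
RewriteEngine On
# Only act on requests that arrived on port 80 (plain http).
# The leading "=" makes this an exact string comparison, so 8080 would not match.
RewriteCond %{SERVER_PORT} =80
# Permanent redirect to the same path on https (www.example.com is a placeholder).
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]
```

Any query string on the original request is carried over to the redirect target automatically.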

lucy24

5:08 pm on Apr 25, 2021 (gmt 0)

I have always exempted robots.txt requests from all canonicalization: first www and later https. Some robots seem to get confused when a request for robots.txt is redirected, and you don't want to give them any excuse (“I TRIED to read robots.txt, honest I did, but I just got a redirect :: whine ::”)

But why on earth would requests to port 80 receive a 403? They should get a redirect to the https version of the page. (Matter of fact, I do block one narrow category of http requests, involving a few specific pages and one specific UA, because it marks the behavior of one botnet. But certainly not everyone, everywhere.) In years past, the redirect might have been limited to requests that include the Upgrade-Insecure-Requests header, but these days all human browsers can do https, and robots simply need to get with the program.

phranque

10:25 pm on Apr 25, 2021 (gmt 0)

i would reinforce what others have said here - the 301 redirect from recognizable port 80 requests to a known https: url would be the most technically robust solution and the best for user experience.

among other problems with the 403 approach, it effectively wastes any value of inbound links to legacy http: urls.

dstiles

1:49 pm on Apr 26, 2021 (gmt 0)

Sorry, folks, brain fade. I should have added that Port 80 is open for the web sites and that the 301 etc is in place and working for all relevant requests from almost every source except bing on robots.txt. 403 is only issued to baddies (mostly non-bot) and to Bing for robots.txt ONLY. According to the other_vhosts_access log, bing is offered a redirect for robots.txt AND for pages such as index.php. It follows the latter but not the former.

Googlebot sometimes hits Port 80 for robots.txt but is never issued a 403 so presumably it follows the redirect; in fact there is evidence it does.

SumGuy

12:07 am on May 8, 2021 (gmt 0)

If search engines ask for robots.txt on port 80, I give it to them. If they ask on port 443, they'll get it there as well. Same for ads.txt (which only google asks for). I don't see the problem serving it up to a search bot on port 80 if that's how it wants to get it. Edit: All other site file requests on port 80 get a 301 redirect to https, regardless of who is asking (bot or human).
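
One way that policy might be written in Apache, as a sketch only. It assumes mod_rewrite, a placeholder hostname www.example.com, and that robots.txt and ads.txt sit in the document root; none of this is taken from an actual config.

```apache
RewriteEngine On
# Plain-http requests only.
RewriteCond %{SERVER_PORT} =80
# Leave robots.txt and ads.txt alone so they are served directly on port 80.
RewriteCond %{REQUEST_URI} !^/(robots|ads)\.txt$
# Everything else gets a permanent redirect to https.
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]
```

The two conditions are ANDed, so a port 80 request for /robots.txt falls through the rule and is served as a normal file.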

dstiles

9:09 am on May 8, 2021 (gmt 0)

This is not a case of "wanting". As far as I can tell I have allowed all bots to access the server on 80 and 443, but for some reason bingbot (and only bingbot) asks for robots.txt without providing ANY protocol (eg HTTP/1.1), and my server will not allow that - or there is something else that's missing. Whatever, it ends up going to errdoc...
(my tests...)
IP: 40.77.167.46 20210507 19:01:23 403 443 TLSv1.2
Host: www.example.co.uk Page: /errdoc.php URL: /robots.txt
(bot headers...)
Cache-Control: no-cache
Connection: Keep-Alive
Pragma: no-cache
Accept: */*
Accept-Encoding: gzip, deflate
From: bingbot(at)microsoft.com
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Non-robots header...
(my tests...)
IP: 207.46.13.64 20210508 0:04:01 200 443 TLSv1.2
Host: www.example.co.uk Page: /eye-07.php
bot: bing
ips: bing:207.46.13.64
http: ok:HTTP/1.1
browser: Safari:Safari/953
(bot headers...)
Cache-Control: no-cache
Connection: Keep-Alive
Pragma: no-cache
Accept: */*
Accept-Encoding: gzip, deflate
From: bingbot(at)microsoft.com
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

I've even made an exception for missing protocol for robots.txt and bingbot...
<if "! (%{REQUEST_URI} =~ m#robots\.txt#i) || ! (%{HTTP_USER_AGENT} =~ m#bingbot#i) ">
SetEnvIf Request_Protocol HTTP/(0\.9|1\.0) proto=too_low:$0 http=bad:$0
</if>

lucy24

4:01 pm on May 8, 2021 (gmt 0)

> bingbot (and only bingbot) asks for robots.txt without providing ANY protocol (eg HTTP/1.1)

What does this look like in access logs?

Looking it up, I discovered that
(1) at some time when I wasn't paying attention, bingbot started using HTTP/2.0 sporadically (not consistently), and
(2) robots.txt requests almost always come in pairs-or-more. (Casual eyeballing turned up an instance of twelve consecutive requests.) I would have assumed this means with-and-without www--as noted elsewhere, I don't canonicalize--but headers say no.

dstiles

9:00 am on May 10, 2021 (gmt 0)

For one site, two consecutive hits this morning:

40.77.167.51 - - [09/May/2021:09:39:29 +0100] "GET /robots.txt HTTP/1.1" 403 3500 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
40.77.167.51 - - [09/May/2021:09:39:30 +0100] "GET /robots.txt HTTP/1.1" 200 5124 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

No sign of it in other_vhosts nor error.log.

> bingbot started using HTTP/2.0

And that could be it!

I assumed when setting up a brand new apache server a few weeks ago that it would handle 2.0 out-of-the-box. I really should do more checking, but I was eager to get the server online and had other things to contend with. Ok, no excuse.

I spent an hour or more yesterday trying to set up http/2 on apache. Not as easy as one would expect. I still cannot get it working: an error switching to mpm_event (or mpm_worker) from mpm_prefork - multi-threading required and have to recompile the module with some switch or other. WHY? Shouldn't apache provide ready-compiled modules for this eventuality? Or even (heaven forbid!) provide an http/2 system ready-working?

lucy24

3:30 pm on May 10, 2021 (gmt 0)

> two consecutive hits this morning

As I was saying ... ;)

But “HTTP/1.1” is right there. What did you mean when you said they’re not providing a protocol?

Apache's HTTP/2.0 information is here [httpd.apache.org], associated with the helpfully named mod_http2 [httpd.apache.org]. Key quote:
> You must enable HTTP/2 via Protocols in order to use the functionality described in this document.
where “Protocols” is a Core directive.
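
As a rough sketch of what that might look like per virtual host. This assumes mod_http2 and mod_ssl are already loaded and that a threaded MPM (event or worker) is in use; the hostname and certificate paths are placeholders.

```apache
# Port 443: offer HTTP/2 over TLS first, fall back to HTTP/1.1.
<VirtualHost *:443>
    ServerName www.example.com
    Protocols h2 http/1.1
    SSLEngine on
    # Certificate paths below are placeholders.
    SSLCertificateFile /path/to/fullchain.pem
    SSLCertificateKeyFile /path/to/privkey.pem
</VirtualHost>

# Port 80: plain HTTP/1.1 only; cleartext HTTP/2 (h2c) is not offered here.
<VirtualHost *:80>
    ServerName www.example.com
    Protocols http/1.1
</VirtualHost>
```

The order of the names in Protocols expresses preference, so a client that cannot do h2 still gets HTTP/1.1.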

dstiles

4:24 pm on May 10, 2021 (gmt 0)

> But “HTTP/1.1” is right there

Yes. Sorry. Two consecutive hits but the baddy does not appear in site logs nor any other (except those I create myself - see above for the 8th).

I'm working through the installation setup for http/2 now. It's a pain because it's a live site and I do not want to kill it. I have a spare server I'm bringing online for mail; I'll practice on that. What's so hard about them adding a switch? :(

JorgeV

5:37 pm on May 10, 2021 (gmt 0)

Hello,

Apache has been extremely helpful in the development of the Internet, but it is now showing the age of its architecture and design. You should consider switching to a more modern web server such as Nginx, LiteSpeed or Caddy.

dstiles

9:13 am on May 11, 2021 (gmt 0)

I did consider nginx recently but it's a learning curve I have no time for. :(

lammert

9:32 am on May 11, 2021 (gmt 0)

Apache is not dead, and still in second position world-wide on public facing websites [news.netcraft.com]. With Nginx often just used as a proxy-server/load-balancer for heavy sites offloading the traffic to more versatile back-end servers, the percentage of active Apache servers may well be higher. I feel Apache is still a safe bet to serve sites in 2021.

The standard of HTTP/2.0 allows both un-encrypted connections to port 80 and encrypted connections to port 443. But all major browser manufacturers have stated that they will only support HTTP/2.0 on connections with TLS encryption. It is one of their many subtle ways to force all webmasters to move to https.

If you are switching to HTTP/2.0, serving all files, including robots.txt, over port 443 is therefore in practice mandatory.

dstiles

10:32 am on May 12, 2021 (gmt 0)

I'm going to switch to http/2 but it will have to wait a few weeks. I'm still in the process of moving sites from Windows and setting up a new mail server - upon which I will try out http/2.

Active apache servers - depends on the meaning of "active" but I run two off-line apache servers on my local network for very minor applications. Plus, on the online mail server, a copy to aid letsencrypt certs renewal; it used also to run squirrelmail until I decided it was no longer useful and becoming dangerously out of date.