Bingbot 403

Hitting robots.txt and getting a 403 rejection

dstiles

10:51 am on Nov 3, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Anyone else seeing this? It's been happening for at least a year, and bingbot is the only "legal" bot to do it.

The log entry (the IP varies, but the RDNS is always msnbot) is:
"GET /robots.txt HTTP/1.1" 403 3854 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

This is sometimes, but not always, followed immediately by a 200 hit. The headers are identical.

jmccormac

11:03 am on Nov 3, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The wonderfully clueful people in Bing seem to have added a pile of new IPs without reverse DNS and are actively crawling from them. Some of the IPs are from ranges that have been the source of bad activity. Perhaps the first hit was from a blocked IP range and the second was from an MSN/Bing range with rdns? (The Microsoft UK thread has some of the ranges. [webmasterworld.com...] )

Regards...jmcc

dstiles

1:47 pm on Nov 3, 2020 (gmt 0)


No, sorry. These are from long-term IPs with correct RDNS. The robots.txt file is explicitly allowed and the bingbot test is:
<if "-R '13.66.139.0/24' || -R '40.77.167.0/24' || -R '157.55.39.0/24' || -R '207.46.13.0/24' ">
SetEnvIfExpr "%{REMOTE_ADDR} =~ /(.+)/" ips=bing:$0
BrowserMatch bingbot bing bot=bing
Require env bing
</if>

jmccormac

2:05 pm on Nov 3, 2020 (gmt 0)


There were some other Microsoft UK ranges that have been hitting some sites with the Bing UA and no rdns. They are listed on the other thread. That 403 activity is a bit odd if it is not one of them. I think that some of the 13.66.cc.dd IPs did not have rdns set (would have to check back through logs).

Regards...jmcc

lucy24

4:56 pm on Nov 3, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it in your power to figure out which mod issued the 403, and on what grounds? (Sometimes it's impossible to tell, even when it's your own server. But logging environment variables helps.)
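
For anyone wanting to try that, here is a minimal sketch of what such logging could look like on Apache 2.4. The variable names (ips, bing, bot) are the ones from the bingbot test quoted earlier; the format nickname "envdebug" and the log file name are made up for illustration. CustomLog has to sit at server or virtual-host level, not inside a <Directory> block.

```apache
# Combined-style log line with three environment variables appended.
# A variable that was never set is logged as "-".
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{User-Agent}i\" ips=%{ips}e bing=%{bing}e bot=%{bot}e" envdebug

# Hypothetical file name; only robots.txt requests are written to it.
CustomLog "logs/robots-debug.log" envdebug "expr=%{REQUEST_URI} == '/robots.txt'"

# Ask mod_authz_core to report each Require evaluation in the error log.
LogLevel authz_core:debug
```

With authz_core at debug level, the error log should show which Require line granted or denied a given request, which is usually the quickest way to learn who issued a 403.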

Require env bing
I don't understand this line. Doesn't the <If> envelope ensure that "bing" will always be defined?

Do you have a <Files> envelope that lets everyone see robots.txt? If you have 403s issued by mod_rewrite, do you also have a preliminary [L] rule letting everyone get through to robots.txt? It's hard to understand why any robots.txt request, ever, would receive a 403. Sure, different visitors might see different content, but they don't need to know that. (Heh, heh.)
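
A rough sketch of the two "let everyone through" options mentioned above, assuming Apache 2.4 and that the lines live in a <Directory> block or .htaccess (neither is taken from the configuration under discussion):

```apache
# Option 1: an explicit authorization exception for robots.txt.
<Files "robots.txt">
    Require all granted
</Files>

# Option 2: if mod_rewrite rules issue the 403s, stop rewriting early
# for robots.txt. The pattern has no leading slash in per-directory context.
RewriteEngine On
RewriteRule ^robots\.txt$ - [L]
```

One caveat, as far as the documentation goes: <If> sections are merged after <Files> sections, so a Require inside a matching <If> block can still replace the exception in option 1.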

dstiles

8:01 pm on Nov 4, 2020 (gmt 0)


jmccormac - I am discounting the "unofficial" ms ranges but I haven't encountered them anyway.

lucy - robots.txt is not covered by the set-env file I use (generally a replacement for most of what would otherwise be in htaccess). I record the header vars for each hit but they are the same for 403 robots and for 200 other pages. The htaccess-replacement file, included in every web site conf file, is added thus:
<Directory "/srv/brisacu">
DirectoryIndex index.php
AllowOverride All
Include /etc/apache2/use-setenv.conf
</Directory>

The final lines of the use-setenv.conf file are roughly:
Require env favicon
Require expr %{REQUEST_URI} =~ m#/robots\.txt#
Require expr %{REQUEST_URI} =~ m#favicon\.ico|apple-touch-icon\.png|apple-touch-icon-precomposed\.png#i
<RequireAll>
Require method GET POST HEAD
<RequireNone>
Require env (several of these trapping bad things)
...
</RequireNone>
</RequireAll>
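
For reference, Apache 2.4 is documented to treat Require lines that sit outside any container as if they were wrapped in <RequireAny>, so the lines above should behave roughly like the sketch below. The variable name bad_thing is a hypothetical stand-in for the elided "bad things" tests, and the anchored robots.txt pattern is an assumption, not the original expression.

```apache
<RequireAny>
    # Any one of these succeeding is enough to grant access.
    Require env favicon
    Require expr "%{REQUEST_URI} =~ m#^/robots\.txt$#"
    <RequireAll>
        Require method GET POST HEAD
        <RequireNone>
            # bad_thing is hypothetical; the real file has several such lines.
            Require env bad_thing
        </RequireNone>
    </RequireAll>
</RequireAny>
```

If that reading is right, a robots.txt request should pass on the second line alone, unless a later-merged section such as a matching <If> supplies its own Require and, under the default AuthMerging Off, replaces this logic.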

The only other reference in the file to robots.txt is for tagging bad bots...

<if " ! %{HTTP_USER_AGENT} =~ m#((Apple|bing|Exa|Google|istella|Twitter)bot|(Mojeek|Seznam|Yandex)Bot|BingPreview|DuckDuck|facebook|Let's Encrypt|Qwantify|Vagabondo|Yeti)# && ! %{REQUEST_URI} =~ m#/robots\.txt#">
BrowserMatch .{0,10}([Bb]ot|crawl|rank|review|spider).{0,10} bot_is=bad_robot:$0
</if>

The reason for the line:
Require env bing

is to shortcut the processing, otherwise it may be recorded by something further down the file.

I know it's possible I've goofed somewhere, but bingbot is the only one (as far as I can recall) that generates a 403, though by no means every time (yesterday there were 18 403s). I do not have easy access to successful calls to robots.txt; I just see them go past in the site logs, but I know they exist. The baddies, by the way, apply to all sites on the server.

Before someone asks, the only things in most of the htaccess files are "header set" lines and the occasional site-specific clause.