Forum Moderators: phranque

OpenAI tried to scan my site

         

Scooter24

3:36 pm on May 12, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



It ignored robots.txt, tried to index disallowed directories, and got blocked by the automated malicious-bot protection system (which adds a "deny from ***" line to .htaccess).

The declared user agent is GPTBot/1.0; +https://openai.com/gptbot) and the IP range is 52.230.152.*

So far it's 29 IP addresses in that range, which are now denied in .htaccess.

I'm a bit puzzled that OpenAI are so unprofessional.

Brett_Tabke

3:46 pm on May 12, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Interesting. You specifically disallowed GPTBot?

User-agent: GPTBot
Disallow: /

Scooter24

4:02 pm on May 12, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



No, just the IP addresses (one by one, automatically) from which the bot tried to access the disallowed directories.

not2easy

5:00 pm on May 12, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Instead of blocking them one by one, you could block all of them with 52.224.0.0/11.
It is a MSFT range that is directly allocated.
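
If you want to sanity-check that the /11 really covers the addresses you are seeing before putting it in .htaccess, here is a minimal sketch using Python's standard ipaddress module. The sample addresses are made up for illustration (only the 52.230.152.* prefix comes from this thread), and the helper name is hypothetical.

```python
import ipaddress

# The /11 suggested above.
MSFT_RANGE = ipaddress.ip_network("52.224.0.0/11")


def is_covered(ip: str) -> bool:
    """Return True if ip falls inside the suggested /11 block."""
    return ipaddress.ip_address(ip) in MSFT_RANGE


# Hypothetical sample addresses: two inside the reported 52.230.152.*
# range and one just below the start of the /11.
samples = ["52.230.152.1", "52.230.152.200", "52.223.255.255"]

for ip in samples:
    print(ip, is_covered(ip))

# A /11 is a big net: this prints how many addresses one rule would deny.
print(MSFT_RANGE.num_addresses)
```

Keep in mind how wide that block is: a single /11 rule denies about two million addresses, not just the 29 logged so far.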

Scooter24

5:52 pm on May 12, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



Well, the point I'm trying to make is that it is strange that GPTBot ignores robot.txt.

Scooter24

5:53 pm on May 12, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



Sorry, I meant robots.txt

lucy24

12:14 am on May 13, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"The declared user agent is GPTBot/1.0; +https://openai.com/gptbot)"
Is that the full UA string? I find a few thousand blocked requests from
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
most but not all from 52.230. The IP, in combination with a minor header deficit, gets them soundly blocked.
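
For anyone who wants to pick these out of their own logs, a rough sketch of pulling the bot token out of a UA string like the one above. The regex and function name are my own assumptions for illustration, not anything OpenAI publishes, and it only handles the "compatible; Name/version" shape shown here.

```python
import re

# The full UA string quoted above.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)

# Assumed pattern: a "compatible; Name/version" pair inside the UA.
BOT_TOKEN = re.compile(r"compatible;\s*([A-Za-z0-9_-]+)/([\d.]+)")


def bot_token(user_agent: str):
    """Return (name, version) for a 'compatible; Name/x.y' UA, else None."""
    match = BOT_TOKEN.search(user_agent)
    return (match.group(1), match.group(2)) if match else None


print(bot_token(GPTBOT_UA))
```

The UA is whatever the client chooses to send, so this is only good for sorting log lines, not for proving who is behind a request.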

robots.txt is tricky because of the similarly named, but law-abiding, ChatGPT-User. Honorable robots are supposed to heed anything in robots.txt that could conceivably apply to them--which is not to say that they actually do.

Scooter24

4:59 pm on May 13, 2024 (gmt 0)

10+ Year Member Top Contributors Of The Month



The full UA string is indeed what you have there.

The directives in robots.txt are as follows:

User-agent: *
Disallow: /folder1/
Disallow: /folder2/

and so on.

The bot should follow these directives and not try to scan these folders.
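
To show what a compliant parser makes of those rules, here is a small sketch with Python's standard urllib.robotparser. It only covers the two folders quoted above (the "and so on" part is left out), example.com is a placeholder host, and the may_fetch helper is a hypothetical name. This is how one well-behaved parser reads the file; it says nothing about what GPTBot does internally.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above; the remaining folders are omitted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/
Disallow: /folder2/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path: str, agent: str = "GPTBot") -> bool:
    """True if `agent` may fetch `path` under the rules above."""
    # example.com is a placeholder host, not the real site.
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ("/folder1/page.html", "/folder2/", "/index.html"):
    print(path, may_fetch(path))
```

With only a "User-agent: *" group in the file, the wildcard group is the one that applies to GPTBot, so the two folders come back as not fetchable while /index.html is allowed.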