Interesting set of hits using forged searchbot User Agents

Including DeepSeek and xAI (?!)


SumGuy

1:38 am on Jun 27, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



You may recall that I IP-block a large share of the IPv4 address space (currently 34.7%) from reaching my web server. The reduced IPv4 universe that remains still contains a small and diverse set of rogue hosting players whose traffic gets through, such as this one:

23.161.169.62 (AS400529 Infraly, LLC)

Today's several dozen hits from that IP consisted of blindly asking for files in various /.env, /.git and /api paths and json files like config.json, firebase-adminsdk.json, google-credentials.json, secrets.json and service-account.json - none of which I have. It also crawled part of my site (html files only). I never see that sort of combination (probe for vulnerabilities and then crawl the site a little).
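For anyone who wants to pick these probe requests out of a log, here is a minimal Python sketch. The pattern is an assumption built only from the paths named above (it deliberately leaves out /api, since a real site may serve that), and `is_probe` is a hypothetical helper name, not something from my setup.

```python
import re

# Assumed pattern list, taken from the paths named in the post:
# dot-directories (.env, .git) and a handful of credential-style JSON files.
PROBE_PATTERN = re.compile(
    r"(^|/)\.(env|git)(/|$|\.)"
    r"|(^|/)(config|firebase-adminsdk|google-credentials|secrets|service-account)\.json$",
    re.IGNORECASE,
)


def is_probe(path: str) -> bool:
    """Return True if the request path looks like a credential/config probe."""
    path = path.split("?", 1)[0]  # ignore any query string
    return PROBE_PATTERN.search(path) is not None
```

Usage would be along the lines of `is_probe("/.git/config")` for each requested path; anything that returns True on a site that has none of these files is a probe by definition.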

But here's the thing - it alternated these hits using a variety of user-agents. The complete list:

CCBot/2.0 (https://commoncrawl.org/faq/)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/bot)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Google-CloudVertexBot; +https://cloud.google.com/vertex-ai-bot)
Mozilla/5.0 (compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot)
Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Mozilla/5.0 (compatible; xAI-SearchBot/1.0; +https://x.ai)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)

(I don't think I've seen Google-CloudVertexBot before, a topic for another thread?)

I find this list very useful because of the presence of these two UAs:

Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/bot)
Mozilla/5.0 (compatible; xAI-SearchBot/1.0; +https://x.ai)

I have only seen them once before, in April this year, from a rogue IP (208.92.235.45 - AS399244), another crack-pot entity (now IP-blocked). It asked for a handful of .env and .git files. So I'm counting those as fake DeepSeek and xAI hits.

I have rarely (and I mean rarely) seen a forged hit claiming a main-line search-bot UA. And even then it was not part of a session that systematically worked through a list of search bots.
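The pattern described above (one IP cycling through many crawler UAs) is easy to flag mechanically. A rough Python sketch, assuming Apache/nginx "combined" log lines where the UA is the last quoted field; the function name and the threshold of three distinct bot names are arbitrary choices for illustration:

```python
import re
from collections import defaultdict

# Bot name tokens taken from the UA list in this post.
BOT_TOKENS = (
    "CCBot", "Baiduspider", "bingbot", "ClaudeBot", "DeepSeekBot",
    "Googlebot", "Google-CloudVertexBot", "OAI-SearchBot", "PerplexityBot",
    "xAI-SearchBot", "YandexBot", "ChatGPT-User", "GPTBot",
)

# Assumes "combined" log format: client IP first, User-Agent as last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"\s*$')


def multi_bot_ips(lines, min_bots=3):
    """Map each IP to the bot names it claimed, keeping IPs with >= min_bots names."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token.lower() in ua:
                seen[match.group("ip")].add(token)
    return {ip: sorted(bots) for ip, bots in seen.items() if len(bots) >= min_bots}
```

No genuine crawler shares an address with a dozen competitors, so any IP this returns is forging at least most of what it claims.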

But the more important thing here for me is the DeepSeek and xAI search-bot UAs. I have never seen a legit hit from either of those two bots, so I can only wonder whether the UAs above are real working UAs or fabricated guesses at what they might look like.

Has anyone here ever had hits from those 2 bots? Legit hits, not forged?
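For the main-line bots, at least, "legit or forged" can be settled with forward-confirmed reverse DNS: look up the IP's hostname, check the domain, then resolve that hostname and confirm it maps back to the same IP. A minimal Python sketch follows. The suffix table covers only Google and Bing, which document this method for their crawlers; I know of no equivalent published hostnames for DeepSeekBot or xAI-SearchBot, so those simply return False here. The `reverse` and `forward` parameters are hypothetical hooks so the logic can be exercised without live DNS, and the default forward lookup is IPv4-only.

```python
import socket

# Hostname suffixes used for crawler verification by reverse DNS.
# Deliberately limited to Google and Bing; extend only from official docs.
VERIFIED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def verify_crawler(ip, bot, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed crawler identity."""
    # Default resolvers use the standard library; both can raise OSError subclasses.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    suffixes = VERIFIED_SUFFIXES.get(bot)
    if not suffixes:
        return False  # no known verification domain for this bot name
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(suffixes):
            return False  # PTR record is outside the crawler's domain
        return ip in forward(host)  # hostname must resolve back to the same IP
    except OSError:
        return False  # no PTR record or lookup failure: treat as unverified
```

A hit from 23.161.169.62 claiming Googlebot would be checked with `verify_crawler("23.161.169.62", "Googlebot")`; both lookups hit DNS, so cache the answers rather than calling this per request.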

lucy24

4:26 pm on Jun 28, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



DeepSeekBot: a whopping total of 8 (eight) over the past year and a half, five of them for .json. Seven 403s and a 429.
xAI-SearchBot: total of 13 (thirteen) in the same period, all blocked except one that began surprisingly by requesting robots.txt. (Could it have been legit? With this pattern of requests, it seems an academic distinction, in the same way that I don't much care if that Chinese visitor is really from Baidu.)

Out of those 21 requests, all but one were to secondary sites, mainly two test sites.

Side excursion to logged headers tells me all of them had significant header deficits--the same 3 each time. This is personally reassuring, since header-based blocking is no longer as reliable as it was ten years ago when I started doing it. It means that even if they didn't come from blocked IPs--header logs wouldn't say--they would still be blocked. Mwa ha ha.
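For readers who haven't tried header-based checks, the idea can be sketched in a few lines of Python. Big caveat: the three header names below are placeholders chosen purely for illustration, since the post does not say which three were missing, and the function names and threshold are likewise hypothetical.

```python
# Placeholder list: NOT the three headers from the post, which are not named there.
EXPECTED_HEADERS = ("Accept", "Accept-Encoding", "Accept-Language")


def missing_headers(headers, expected=EXPECTED_HEADERS):
    """Return the expected header names that are absent or empty (case-insensitive)."""
    present = {name.lower() for name, value in headers.items() if value.strip()}
    return [name for name in expected if name.lower() not in present]


def has_header_deficit(headers, threshold=3):
    """True when at least `threshold` of the expected headers are missing."""
    return len(missing_headers(headers)) >= threshold
```

Here `headers` is a plain dict of request header names to values. Some legitimate crawlers also send sparse headers, so a check like this works best as one signal among several.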

<tangent>
Gosh, it's been ages since I've seen a fake Googlebot. Remember when they used to be ubiquitous? In the past year-and-a-half I find exactly 2619 of them, where once it would have been in the tens if not hundreds of thousands.

"none of which I have"
A nifty alternative to a 403 is to return a manual 404. Less work for the server--it doesn't have to go looking for the file to establish that it doesn't exist--and conveys no information to the requester.
</tangent>
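A manual 404 of the kind described in the tangent can be sketched as a small piece of Python WSGI middleware. This is an illustration only, not the setup used by anyone in this thread; the pattern and the name `manual_404` are assumptions.

```python
import re

# Assumed pattern: a few of the probe paths mentioned earlier in the thread.
PROBE = re.compile(r"(^|/)\.(env|git)(/|$)|(^|/)secrets\.json$", re.IGNORECASE)


def manual_404(app, pattern=PROBE):
    """Wrap a WSGI app so matching paths get a 404 without any file lookup."""
    def middleware(environ, start_response):
        if pattern.search(environ.get("PATH_INFO", "")):
            body = b"Not Found\n"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]  # answered here; the wrapped app is never called
        return app(environ, start_response)
    return middleware
```

Because the decision is made on the path string alone, the server never touches the filesystem, and the requester cannot tell a blocked path from a genuinely missing one.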

Martin Potter

8:20 pm on Jun 28, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



<tangent>
I am just a small player here, but some years ago I took lucy's advice and set things up to return a stated "404" for both 404 and 403 responses. The 403 response also triggers making a separate log entry which I find useful for more easily pinpointing things to look up in the server logs, by giving me a specific date/time of the visit or IP address of the visitor. Mind you, I still crawl through the server logs but life is a little easier with the separate "403" log.
</tangent>
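The arrangement in the tangent above (answer 404 to the visitor, but keep a separate record of what was really a 403) might look roughly like this in Python. It is a sketch under assumptions: the logger name, the line format, and both function names are invented for illustration and are not the poster's actual configuration.

```python
import logging


def make_blocked_logger(handler):
    """Build a dedicated logger for blocked requests, writing to the given handler."""
    logger = logging.getLogger("blocked403")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these entries out of the main log
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def respond_blocked(logger, ip, path, user_agent):
    """Record the block as a 403 internally, but return a 404 status for the client."""
    logger.info('403 %s "%s" "%s"', ip, path, user_agent)
    return "404 Not Found"
```

In practice the handler would be something like `logging.FileHandler("blocked-403.log")` (a hypothetical filename), created once at startup, giving a short file of timestamps and IPs to look up in the full server logs.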