Verifying SemrushBot

Learning to understand what's a spoof and what's not

         

Solution2

9:06 am on Jan 18, 2021 (gmt 0)

5+ Year Member



I've written a PHP routine for my sites that verifies bots the way the larger search engines specify: get the hostname of the IP address, check that it belongs to one of the specified domains, and then check that the hostname also resolves back to the IP address I started with. It works fine for Googlebot variations, Bingbot, and various others.
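
The three steps described above (reverse lookup, domain check, forward confirmation) can be sketched roughly as follows. This is an illustration in Python, not the PHP routine itself, which is not shown in the thread; the function name `verify_bot_ip` and the injectable lookup parameters are assumptions made for the example.

```python
import socket


def _reverse(ip):
    # PTR lookup: returns the primary hostname for the IP address.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # Forward lookup: returns the set of addresses (IPv4 and IPv6) for the hostname.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_bot_ip(ip, allowed_domains, reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if ip passes a forward-confirmed reverse DNS check."""
    # Step 1: get the hostname of the IP address.
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False

    # Step 2: the hostname must be an allowed domain or a subdomain of one.
    # Matching on "." + domain avoids accepting e.g. "evilgooglebot.com".
    if not any(hostname == d or hostname.endswith("." + d) for d in allowed_domains):
        return False

    # Step 3: the hostname must resolve back to the IP we started with.
    # Note: this is a plain string comparison, so IPv6 addresses would need
    # to be in the same textual form on both sides.
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A call such as `verify_bot_ip(ip, ["googlebot.com", "google.com"])` would perform live DNS queries. The lookup functions are parameters only so that the logic can be exercised without a network; a production routine would probably also cache results.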

Not for SemrushBot, though. Of course, they don't say that their bot should be verified this way. In fact, they don't specify any way to verify that a SemrushBot user agent is really SemrushBot. However, I require that for full access to my site. I saw that they have a domain bot.semrush.com, so I thought I'd use that for verification.

Here is what happens when I check on the WSL Linux command line:
$ host 85.208.98.21
21.98.208.85.in-addr.arpa domain name pointer bot.semrush.com.
$ host bot.semrush.com
bot.semrush.com has address 104.17.153.1
bot.semrush.com has address 104.17.154.1

So, different IP addresses, which explains why my standard PHP routine doesn't see the bot as valid.

That raises the question of what this means. Why are the IP addresses different? Can I nevertheless take the hostname bot.semrush.com as sufficient indication that this is a real SemrushBot?

When I disallowed SemrushBot in robots.txt a couple of weeks ago (I removed that rule later), I saw three IPs with the hostname bot.semrush.com, but with the user agent Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). The IPs were 46.229.173.66, 46.229.173.67 and 46.229.173.68.
$ host 46.229.173.66
66.173.229.46.in-addr.arpa domain name pointer bot.semrush.com.

If the hostname bot.semrush.com is sufficient to validate SemrushBot, then this bot apparently does not obey robots.txt, and spoofs Googlebot to get under the radar.

not2easy

12:38 pm on Jan 18, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There is a bot information directory here, courtesy of lucy24 and the 2020 edition has a good listing for the semrush bot, a little over halfway down the page here: [webmasterworld.com...]

46.229.173 is listed there as well as the 85.208.98. IPs and more behavioral info.

Solution2

1:08 pm on Jan 18, 2021 (gmt 0)

5+ Year Member



I am going to continue with my old username Solution1, as I found the password for it.

Solution1

1:08 pm on Jan 18, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



(Continuing with my old username, now that I found the password for it.)

Thanks @not2easy.
I had already read through that post.

I'm interested in understanding this, and in finding a general solution that I can reliably implement on my websites.

What I am looking to understand in particular is why this happens:
$ host 85.208.98.21
21.98.208.85.in-addr.arpa domain name pointer bot.semrush.com.
$ host bot.semrush.com
bot.semrush.com has address 104.17.153.1
bot.semrush.com has address 104.17.154.1

Why are the IP addresses found with the latter command different from the IP address I started with?
How much can I trust that that first command tells me that this is really SemrushBot?

not2easy

1:57 pm on Jan 18, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Welcome back Solution1! Glad you recovered your years of history.

104.17.153.1 belongs to Cloudflare, a CDN. I have no clue where/how you are obtaining the IPs listed, so I'm not much help to tell you why they differ, sorry.

I do things differently, determining visitors' IPs from my access logs and deciding whether they are human or not based on their activities. UAs could be anyone, I just shared the available information on Semrush.

Solution1

2:29 pm on Jan 18, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Oh, I see. So, Cloudflare acts as a CDN for SemrushBot, and that's why host bot.semrush.com doesn't get the original IP back.

Since Cloudflare is a trustworthy party, I suppose I can trust that the hostname really means that this is the SemrushBot.
But I'm still wondering how trustworthy the hostname belonging to an IP address is, since Google, Bing and others request that you get the IP belonging to the found hostname too, in order to verify.

lucy24

5:27 pm on Jan 18, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Semrush is complicated. If you email them, they will send you a list of their current UAs with associated IP. The version I've got--probably dating back a year or two, so I recommend re-checking--includes (showing only the variable part of the UA):
213.174.146.211-213: SemrushBot-BA
192.243.55, 213.174.152: SemrushBot/1.0~bm
46.229.164-168: SemrushBot/1.2~bl, SemrushBot/2~bl, SemrushBot/3~bl
46.229.161.131: SEMrushBot
46.229.173.66-67: SemrushBot-SA
213.174.147.83, 192.243.56.76: SemrushBot-SI
213.174.153.121: ContentAnalyzerBot/1.0
I haven't personally seen the last three UAs.

Making it harder, the assorted UAs do not all send identical headers. I won't give details, but I currently have separate rules for
BrowserMatch SemrushBot
BrowserMatch SemrushBot-BA
BrowserMatch ^SEMrushBot$

On second thought, it may be more than a year or two. Current logs show mainly
46.229.168.138
SemrushBot/6~bl
185.191.171.23
SemrushBot/7~bl
where the latter showed up around November 2020. In all cases, UA and IP tend to match down to the last digit, which should make it easier. I haven't personally seen them from IPv6, but don't treat this as dispositive, since only my personal site has an IPv6 address.

Solution1

6:06 pm on Jan 18, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks @lucy24.

It looks like the IP addresses change too often to verify reliably by exact address.
I guess I'll email them to ask whether the bot.semrush.com hostname can be relied on for verification in the longer term.

Kendo

9:25 pm on Jan 18, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Looks like the IP addresses will change too much to reliably verify with precision.

I block all of these bots because they are collecting data for competitors. I began by blocking them by IP at the firewall but found that some were using too many IP blocks.

But then I found that their weakness was that they can't help themselves but spruik in their user agent. I now action this per site and they get redirected to misinformation.

Solution1

7:43 am on Jan 19, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have Adsense on my site, and I thought that Semrush is used by advertisers to look for sites where to advertise. That's my reasoning for allowing SemrushBot, despite the bad behavior of spoofing Googlebot when disallowed in robots.txt.

Kendo

5:21 am on Jan 20, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Semrush is used by advertisers to look for sites where to advertise

First I have ever heard this. But your competitors will be using it though to analyse your keywords and strengths.

Solution1

11:12 am on Jan 20, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Going by what keyplyr said in MSG#4890247 in thread "So blocking semrush. block bot or IP?"
If you publish ads, Semrush can be a highly beneficial agent to allow. Do the research.

tangor

1:32 am on Jan 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Aside: semrush obeys robots.txt (at least for me), so never had a need to investigate any of the above.

Solution1

11:12 am on Jan 21, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



There's this thread on this forum "SEMRush posing as Googlebot?" So apparently they've been spoofing Googlebot for a longer time.

lucy24

5:47 pm on Jan 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There's not much point to spoofing googlebot [webmasterworld.com], since any self-respecting site will block visitors who call themselves Googlebot while coming from anywhere other than 66.249.blahblah. And surely Semrush knows this, or has learned it since 2017.

I didn't do a comprehensive search, but looking for recent “Googlebot” from 46.anything turns up a whopping total of four hits, all from early 2020.

Solution1

6:21 pm on Jan 21, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Google actually warns against hardcoding an IP range for Googlebot.
[developers.google.com]

lucy24

10:27 pm on Jan 21, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google actually warns against
I’m sure they do, but has anyone yet met a bona fide googlebot from anything other than their one known /20 crawl range?

Solution1

10:27 am on Jan 22, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



@lucy24, No, but generally IPs change sooner or later.
With more and more use of IPv6, the big search engines will probably start using IPv6 as well at a certain point.

When Googlebot starts using IPv6, they'll probably notify websites, just as they're sending notifications for crawling over HTTP/2 right now. But I'd rather not take the risk, especially as they publish a fail-safe way of verifying Googlebot.

Bing does the same thing [bing.com]. They warn against using hardcoded IP addresses or address ranges for verifying Bingbot.
So do other bots: Yandex [yandex.com], Baiduspider [help.baidu.com], BLEXBot [webmeup-crawler.com].

lucy24

7:02 pm on Jan 22, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Googlebot is an outlier in that it has always managed to do all its crawling from a single /20, while other search engines such as Bing and Yandex can come from all over the map. (Has anyone ever figured out how the heck they manage this?) BLEXbot is distributed, so it’s no use trying to keep up.

Quick detour to logs reveals that, in addition to Seznam--predictably an early adopter--BLEX and MJ12 both use IPv6 addresses at least some of the time. But so far no other search engine.

Solution1

9:25 pm on Jan 22, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I got these IPv6 addresses that are verifiably from Yandex:

YandexBot
2a02:6b8:c08:3408:0:492c:3b19:0
2a02:6b8:c0a:1293:0:492c:addf:0
2a02:6b8:c14:6ca0:0:492c:d4ea:0

YandexMetrika
2a02:6b8:c1a:2eaf:0:4f77:3fee:0

All Yandex network from AS13238 = 2a02:6b8::/32.

Solution1

9:37 pm on Jan 22, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



BLEXBot distributed? You mean that it would run from people's computers on IP addresses from ISPs?
The IPs (verified on *.webmeup.com domains) that I got in the past week were all from Hetzner datacenters:
176.9.1.27
49.12.131.247
94.130.18.160
94.130.18.163

lucy24

10:18 pm on Jan 22, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You mean that it would run from people's computers
Whoops, no, just that they range all over the server-farm map. (Do there exist any non-malign robots that run off individual humans' computers? I don't think I have ever met one.)

phranque

10:26 pm on Jan 22, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



(Do there exist any non-malign robots that run off individual humans' computers? I don't think I have ever met one.)

MJ12Bot? [mj12bot.com]

for more details:
[majestic12.co.uk...]

Solution1

11:35 am on Jan 29, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I emailed SEMrush about verifying their bot. I received an Excel sheet from them with all kinds of details, but for most of their bots, that boils down to validation by ASN check.

The IP addresses need to be in AS209366. Currently, that is these IP ranges:
85.208.96.0/22
85.208.98.0/24
185.191.171.0/24

This goes for their user agents:
SemrushBot-SWA, SemrushBot-CT, SemrushBot, SemrushBot-SI, SemrushBot|SemrushBot-BA|*, SemrushBot-SEOAB

For their other user agents they recommend a reverse DNS check. That's for their user agents:
SemrushBot-SA|SiteAudit, SemrushBot-BA, SemrushBot
The Excel sheet mentions bot.semrush.com, but the check should probably accept semrush.com and any subdomain of it.
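
The range check described above can be sketched as follows, again in Python rather than PHP. The three CIDR blocks are the ones quoted in this post and are only a snapshot; a real ASN check would look up the prefixes currently announced by AS209366 instead of hardcoding them. The names `SEMRUSH_RANGES` and `in_semrush_ranges` are made up for the example.

```python
import ipaddress

# Snapshot of the ranges listed in this post for AS209366 (may be out of date).
SEMRUSH_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("85.208.96.0/22", "85.208.98.0/24", "185.191.171.0/24")
]


def in_semrush_ranges(ip, ranges=SEMRUSH_RANGES):
    """Return True if ip parses as an address and falls inside one of the ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4 or IPv6 address string.
        return False
    # Compare versions explicitly so an IPv6 address is never tested
    # against an IPv4 network.
    return any(addr.version == net.version and addr in net for net in ranges)
```

With the addresses mentioned in this thread, 85.208.98.21 and 185.191.171.23 fall inside these ranges, while 46.229.173.66 and 46.229.168.138 do not.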