Forum Moderators: open


Single Sign On + Identifying Search Bots

question about websites switching to a metered paywall system


tube

6:11 pm on Oct 5, 2022 (gmt 0)



Hi all - I have a question about a set of websites that are switching to a metered paywall system. We want to identify search bots and exempt them from the login requirements that apply to human users.

May I ask a few questions about this scenario:

- We are developing an in-house single sign-on (SSO) solution that works across the sites on some browsers.
- This SSO development is in advance of a third-party metering solution.
- When users visit a page, they are temporarily redirected (302) to a log-in page.
- We want search engine crawlers not to encounter that 302.
- The in-house developers believe the solution is to identify search bots by the IP address lists that Google and Bing, for example, publish.
- Is this the best way to go about this for SEO, or is there a better industry standard? How would the devs stay on top of the IP address lists for the various search engines?
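For concreteness, the IP-list check the devs propose could be sketched like this with Python's stdlib. The ranges below are a small sample of Bing's published list, not the full file, and `is_bingbot_ip` is a hypothetical helper name:

```python
import ipaddress

# Sample CIDR ranges from Bing's published bingbot.json.
# Partial list for illustration only, not the full file.
BINGBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "13.66.139.0/24",
    "40.77.167.0/24",
    "157.55.39.0/24",
    "207.46.13.0/24",
)]

def is_bingbot_ip(ip: str) -> bool:
    """True if the address falls inside any published bingbot range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BINGBOT_RANGES)
```

In practice you would re-fetch the published JSON files on a schedule (both Google and Bing publish theirs) rather than hard-code the ranges.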

Thank you!

not2easy

7:33 pm on Oct 5, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hi tube and welcome to WebmasterWorld [webmasterworld.com]

This doesn't address the SEO aspects; I don't deal with paywall issues personally, but I'm sure there are others here who can share their suggestions and experiences.

IF you have a determined set of User Agents you would like to allow, you can sort by UA - BUT most distributed bots can claim any UA they like. Google has a known, published IP range. If Bing now has one too, that is something new; it has been wished for for a very long time.

MSFT hosting ranges send all kinds of rogue bots and, afaik, use some of those same IPs for their own bots. I block IP ranges that ignore robots.txt, but most of that comes from log examination. If you are tracking headers, you may get more insight into which bots are who they claim to be. I know of no simple set of rules, either UA or IP, to keep out unwanted bots.

I would start by examining recent logs to see which bots are already visiting, then decide which of them you see as useful.
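That log-examination step can be sketched minimally, assuming combined-format Apache/nginx logs where the user agent is the last quoted field (`top_user_agents` is a hypothetical helper name):

```python
import re
from collections import Counter

# In combined-format access logs the user agent is the final quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def top_user_agents(log_lines, n=20):
    """Tally user-agent strings across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)
```

Run it over a log file with `with open("access.log") as fh: top_user_agents(fh)` (the path is a placeholder) and you get a quick picture of which bots are hitting you.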

dstiles

7:58 am on Oct 6, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not just MSFT but loads of other server farms, PLUS compromised IPs, which can be server or broadband from any part of the world. And not just bots, but hacking, injection and even so-called "measurement" attempts.

My own take, which is fairly successful and was originally advised by lucy24 hereabouts, is to gather a set of parameters that indicate a human browser, whitelist those and then block everything else. That works much better than blacklisting. Make a few exceptions for genuine bots on the main site if that's relevant.
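A minimal sketch of that whitelist idea: treat a request as "probably human" only if it carries headers real browsers consistently send. The header set here is a starting assumption, not a complete rule set; the real rules are something you refine from your own logs:

```python
# Real browsers consistently send these request headers; many crude
# bots omit one or more of them. A heuristic starting point only.
REQUIRED_HEADERS = {"accept", "accept-language", "accept-encoding"}

def looks_like_browser(headers):
    """True if every required header is present (case-insensitive)."""
    present = {name.lower() for name in headers}
    return REQUIRED_HEADERS <= present
```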

I suggest reading a few old postings from this forum and from the PHP and Apache forums, which have a few specifics.

tube

4:46 pm on Oct 7, 2022 (gmt 0)



not2easy - Thank you for the welcome! Here is the bingbot list: bing.com/toolbox/bingbot.json

dstiles - I will look for the old postings... Meanwhile, in general, is the approach I noted in my original post the most typical way to separate search engines from human users? In other words, despite the security risks, is this an accepted method? And it's not cloaking or anything?

Will look for older threads now!

not2easy

5:07 pm on Oct 7, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



That URL appears to go to a script result file.

Peculiar sorting; I've sorted it to:
13.66.139.0/24
13.66.144.0/24
13.67.10.16/28
13.69.66.240/28
13.71.172.224/28
20.36.108.32/28
20.43.120.16/28
20.125.163.80/28
40.77.167.0/24
40.79.131.208/28
40.79.186.176/28
51.8.235.176/28
51.105.67.0/28
52.167.144.0/24
52.231.148.0/28
139.217.52.0/28
157.55.39.0/24
191.233.204.224/28
207.46.13.0/24

It seems they just scatter their bots in with the "keep-out" bunch. Not very intuitive for webmasters.

tangor

1:29 am on Oct 8, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How would the devs be able to stay on top of the ip address lists for diverse search engines?


By expending a lot of time and effort (and probably salary dollars as well) to create an internal blacklist ... remember: g and b are not the ONLY bots out there!

You might explore the concept of whitelisting?

YMMV!

dstiles

8:31 am on Oct 9, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



tube - As I said, determine the characteristics of "human" browsers and allow those in.

Most genuine search engines come from fixed IP ranges, which you have to establish, but ONLY let a bot in if its UA is the expected one for the IP. An alternative is to check the reverse DNS, which SHOULD be set up to indicate a bot, although some companies (eg bing) have not always held to that, and google often uses a proxy for some of its lesser bots, so I've never fully subscribed to that approach.
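That two-way DNS check can be sketched like this; the suffix list covers the crawler domains Google and Bing document for reverse-DNS verification, and `verify_bot_ip` is a hypothetical name:

```python
import socket

# Crawler hostname suffixes documented by Google and Bing.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_crawler(host: str) -> bool:
    """True if the hostname ends in a known crawler domain."""
    return host.endswith(CRAWLER_SUFFIXES)

def verify_bot_ip(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname belongs to a crawler
    domain, then forward-resolve it and confirm it maps back to the IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse (PTR) lookup
    except OSError:
        return False
    if not hostname_is_crawler(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except OSError:
        return False
```

The forward confirmation matters: anyone can set a PTR record claiming to be googlebot.com, but they can't make Google's forward DNS answer with their IP.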

But a blacklist of bad IPs will never really work - new machines get compromised every hour whilst other IPs stop being compromised and rejoin the living, so it's a losing fight.

One thing you can do is block MOST server farms - bearing in mind that a few wanted bots may come from there, and that Let's Encrypt uses various well-known ranges, such as Digital Ocean's, to check for renewals ON PORT 80.

tube

10:48 pm on Oct 13, 2022 (gmt 0)



dstiles and tangor - I don't know where the idea that we want to block bad IPs came in; the devs definitely want to whitelist search engines. My concern was with how difficult this becomes once we start dealing with search engines smaller than Google and Bing. But your notes and recommendations really set me at ease here - thank you for all the time you put into replying.

phranque

11:14 pm on Oct 13, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is this the best way to go about this for SEO?

I would suggest familiarizing yourself with this:
Structured data for subscription and paywalled content (CreativeWork) [developers.google.com]
This page describes how to use schema.org JSON-LD to indicate paywalled content on your site with CreativeWork [schema.org] properties. This structured data helps Google differentiate paywalled content from the practice of cloaking, which violates spam policies [developers.google.com]. Learn more about subscription and paywalled content [developers.google.com].
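As a rough sketch, the markup pattern from that Google documentation looks like the JSON-LD below; the `.paywall` selector is an assumption and must match the CSS class actually wrapping your gated content:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article title",
  "isAccessibleForFree": "False",
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": "False",
    "cssSelector": ".paywall"
  }
}
```

This tells Google that the content it crawls behind the paywall is intentionally hidden from non-subscribers, which is what distinguishes the setup from cloaking.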


SumGuy

2:36 pm on Oct 15, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Regarding the bingbot IP list, has anyone ever seen a bingbot hit from

13.66.144.0/24
13.67.10.16/28
13.69.66.240/28
13.71.172.224/28
20.36.108.32/28
20.43.120.16/28
20.125.163.80/28
40.79.131.208/28
40.79.186.176/28
51.8.235.176/28
51.105.67.0/28
52.167.144.0/24
52.231.148.0/28
139.217.52.0/28
191.233.204.224/28

because I haven't.

lucy24

5:30 pm on Oct 15, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



has anyone ever seen a bingbot hit from

Huh. Within the present calendar year, I haven't seen anyone from those ranges. (To save bother, I just checked /24 on all.) Cut back to /16 and there are bingbots from 13.66, but not .144, just 13.66.139, a popular IP since 2015 or so.

:: deeper delve into archived logs ::

Nope, absolutely nothing, where “ever” = in the last 10 years or so. (I think my original site logs go back to late 2011.) Excluding 13.66 I also checked the rest of the list for msnbot, but still nothing. And again, that's the whole /16, not /24 let alone /28.

Is that list of IPs from bing's own site?

SumGuy

11:53 pm on Oct 15, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



> Is that list of IPs from bing's own site?

Supposedly it's from here: bing.com/toolbox/bingbot.json
Regarding 13.66.139.0/24, I've only seen bingbot from there between Dec 2018 to Feb 2022.

I tried a few of those on the Verify Bingbot Tool

[bing.com...]

Verdict for IP address 13.67.10.16:
Verdict for IP address 20.36.108.32:
Verdict for IP address 20.125.163.80:
Verdict for IP address 51.105.67.0:

Yes - (for all of those) this IP address is a verified Bingbot IP address.

I then took another entry -
20.43.120.16/28

and modified it to:
20.43.220.16

and the tool reported:

Verdict for IP address 20.43.220.16:
No - this IP address is NOT a verified Bingbot IP address.

See also:
[bing.com...]

There is a remote possibility that those other bingbot IPs are used when you interact with, or direct, Bing Webmaster Tools to scan or index your site. Another remote possibility is that they're for indexing sites in very different locales (Asia? Europe? South America?)

For a slight diversion, check out the last post on this page:

[moz.com...]

dstiles

8:18 am on Oct 16, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have the following IP ranges enabled for bingbot (UK) but I don't think I get hits on all ranges. 157.55.39.0/24 is the usual one.

13.66.139.0/24
40.77.167.0/24
157.55.39.0/24
207.46.13.0/24