
Paywall content and getting indexed in Google.

Can you eat your cake and have it too?


sun818

8:20 pm on Jun 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting article in Bloomberg about the Wall Street Journal (WSJ) ranking lower after implementing its paywall [bloomberg.com]. Is it possible for news/content sites to eat their cake and have it too? I would think there is a way to whitelist robots (Googlebot) with full access to articles while presenting a paywall to new users.

lucy24

9:35 pm on Jun 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The details will obviously depend on your server type. But yes, it's extremely easy to set up multiple access criteria: “To see this content, you must either be logged in or be the googlebot.” The question is whether you’d want to. As a human, it disgusts me when something promising comes up in a search--and then when I click on the link I'm presented with two lines of text and a “You must be logged-in to read this article” message. Do a lot of humans use search engines to find content on sites they already subscribe to?
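On Apache, for instance, that "logged in OR Googlebot" rule might be sketched roughly like this (a hypothetical example: the `member` cookie name and `/paywall` URL are placeholders, and the user-agent test alone is spoofable, so it would still need the IP verification discussed later in the thread):

```apache
# Let logged-in users (a hypothetical "member" session cookie) or a
# visitor whose UA claims to be Googlebot through; redirect everyone
# else to the paywall page. UA alone can be faked, so a real setup
# must also verify the requesting IP really belongs to Google.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} !member= [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteRule ^articles/ /paywall [R=302,L]
```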

sun818

10:51 pm on Jun 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wall Street Journal stated they want to have a paywall and still have their content fully indexed by Googlebot. WSJ obviously found SE referral traffic to be more important than serving non-subscribed visitors. From a technical standpoint, I don't understand how you can allow Googlebot from Google, yet prevent someone from spoofing Googlebot and viewing the full content for free. Makes me think WSJ perhaps did not have a competent web server admin and lost out on revenue from search engine traffic because their change was implemented poorly.

lucy24

11:42 pm on Jun 5, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't understand how you can allow Googlebot from Google, yet prevent someone from spoofing Googlebot

Any website worth its salt will categorically block visitors who claim to be the Googlebot but come from a non-Google IP.

thiennp

2:17 am on Oct 3, 2017 (gmt 0)



Thanks lucy24, how can I do that for my site?

topr8

6:27 am on Oct 3, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>Thanks lucy24, how can I do that for my site?

The usual way is to test the UA for the expected Googlebot string - there are different Googlebots, and you may only want to allow some of them, or you may want to allow them all.
If the UA claims to be a Googlebot, then do a reverse and forward DNS lookup to check that it is coming from a Google IP address. You can also maintain a list of IP addresses that Google uses for its bots, or a combination of the two - for instance, do the reverse/forward DNS lookup and cache the result for a set period of time, so that next time you check the cache first before doing the lookup (which is obviously quicker).
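A minimal sketch of that reverse-then-forward (forward-confirmed reverse DNS) check, with the simple cache described above (function names are illustrative; a production cache would also expire entries after a set period):

```python
import socket

# Simple in-process cache: ip -> bool. A real deployment would add a TTL.
_cache = {}

def is_google_host(hostname):
    """True if the rDNS hostname belongs to Google's crawler domains."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the returned hostname and confirm it maps back to the same IP."""
    if ip in _cache:
        return _cache[ip]
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)      # reverse lookup
        ok = is_google_host(hostname) and socket.gethostbyname(hostname) == ip  # forward confirm
    except (socket.herror, socket.gaierror):
        ok = False
    _cache[ip] = ok
    return ok
```

The suffix check matters: an attacker can make *their own* IP reverse-resolve to anything, but they cannot make `crawl-x-x-x-x.googlebot.com` forward-resolve to their IP, which is why both directions must agree.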

jokopohannes

4:13 pm on Apr 15, 2018 (gmt 0)

5+ Year Member



Any website worth its salt will categorically block visitors who claim to be the Googlebot but come from a non-Google IP.


And how on earth do you know all the IP addresses Google bots are using? You might know some ranges, but if you're not the one deploying their server farms, you have no idea about all the ranges.

tangor

10:32 pm on Apr 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You don't have to know all the ranges, only the ranges of the bots you allow. That's a much smaller slice.

lucy24

2:06 am on Apr 16, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And how on earth do you know all the IP addresses Google bots are using?
If you have personally met the bona fide Googlebot crawling English-language content from a range other than 66.249.blahblah, many people hereabouts would like to hear about it. A few years back, Google said they were going to start using other, non-ARIN ranges. But I can't remember anyone posting hard evidence that they were actually doing so.

phranque

4:18 am on Apr 16, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you shouldn't rely solely on whitelisting IP ranges.

this is the proper way to verify googlebot IPs:
[support.google.com...]

Travis

7:32 am on Apr 16, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



And how on earth do you know all the IP addresses Google bots are using?

You don't have to. Since you are speaking about a dynamically generated page (PHP or other), at the beginning of your script you do a reverse DNS lookup on the client's IP address, and the hostname has to be "xxxx.googlebot.com." (or "xxxx.google.com."). Then, to be sure, you do a forward lookup on that hostname and you must get back the same IP address. That's all, and this is explained here: [support.google.com...]

Now, the thing is, it's a gray area to serve different content to Googlebot and human visitors. Big sites might not risk anything, but publishers like us risk getting our sites banned by doing so...

phranque

8:15 am on Apr 16, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



there's no risk of being banned if the implementation is done properly.

How to indicate paywalled content

Publishers should enclose paywalled content with structured data to help Google differentiate paywalled content from the practice of cloaking, where the content served to Googlebot is different from the content served to users. If no structured data is provided to indicate the paywall, the paywall may be mistaken for a form of cloaking and the content could be removed from Google.

For detailed specifications on implementing the structured data, visit our Developer documentation.

We encourage publishers to experiment cautiously with different amounts of free sampling so as not to unintentionally degrade user experience and reduce traffic coming from Google.

source: [support.google.com...]
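The structured data Google is describing there is the `isAccessibleForFree` markup on the article. A minimal JSON-LD sketch (the headline and the `.paywall` CSS selector are placeholders for your own page):

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Example paywalled article",
  "isAccessibleForFree": "False",
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": "False",
    "cssSelector": ".paywall"
  }
}
```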

perhaps wsj's implementation provides a "degraded user experience".


google's Developer documentation:
Subscription and paywalled content [developers.google.com]