Forum Moderators: Robert Charlton & goodroi
Questioning the wisdom of using fat pings to deal with scrapers
Really? Wow! That's news to me ...
Google Inc. has temporarily shut down a search engine feature that allows users to find real-time updates from Twitter, Facebook, FriendFeed and other social networking sites.
A message posted early Monday on Twitter by the team behind Google Realtime says the search feature has been temporarily disabled while Google explores how to incorporate its recently launched Google+ project into the feature. The tweet tells readers to "stay tuned."
Turns out, Google's Twitter deal expired.
So, I'm [slowly] trying to wrap my head around the concept of PubSubHubbub, and the only reason I looked at a protocol with a name like this is because tedster recommended it.
Tedster has laid out approaches he's been trying in several posts that might be helpful to you. I'll link to a couple of threads here, with some excerpted comments from each....
How do we tell Google we were wrongly Pandalized?
http://www.webmasterworld.com/google/4387503.htm [webmasterworld.com]
From tedster's Nov 22, 2011 posts...

I want to emphasize that my ideas about correlation between widespread syndication and being wrongly Pandalyzed are my own conjecture, nothing proven and nothing officially communicated. It's just what seems to make the most sense for the cases that have me scratching my head....
...What I'm trying with one site is to ramp up every "we are the canonical source" signal I can muster, including authorship tagging, pubsubhubbub, delayed RSS, no more full RSS feeds, etc. I'll let the forum know if it works.

There's evidence that Google "wants to" credit the original source in the SERPs, but many times a more authoritative source who is quoting in full or syndicating (even with full acknowledgement) will still rank higher.
And, on Jan 15, 2012...
Article pages not ranking since Panda 1.0
http://www.webmasterworld.com/google/4406778.htm [webmasterworld.com]
>>my articles get picked up by other sites<<
I recently worked with a site that had a similar issue. We made a couple of changes that seemed to improve indexing and ranking immediately.
1. Inaugurated authorship mark-up
2. Used pubsubhubbub (PuSH) to send Google "fat pings" immediately at publication
3. Delayed the standard feed until the PuSH feed was received
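The "fat ping" step above boils down to a plain HTTP publish notification to a PuSH hub: the publisher tells the hub the feed changed, and the hub fetches the feed and pushes the full entries to subscribers. Here is a minimal sketch in Python; the hub and feed URLs are hypothetical placeholders, not the specific endpoints anyone in this thread used.

```python
# Sketch: send a PubSubHubbub (PuSH) publish notification to a hub.
# The hub then fetches the feed and delivers the full content ("fat pings")
# to its subscribers. HUB_URL and FEED_URL are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

HUB_URL = "https://pubsubhubbub.example.com/"   # hypothetical hub endpoint
FEED_URL = "https://www.example.com/feed.rss"   # your feed's canonical URL

def build_publish_ping(hub_url, feed_url):
    """Build the form-encoded publish notification the PuSH spec defines."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(hub_url, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# To actually send the ping at publication time:
#   with urlopen(build_publish_ping(HUB_URL, FEED_URL)) as resp:
#       ok = resp.status == 204  # hubs answer 204 No Content on acceptance
```

The point of doing this immediately at publication is timing: Google can receive the new content before any scraper has had a chance to crawl it.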
Please take a look at the threads and provide more relevant details about what you've done, what type of site is ranking in your place, the feed situation, and the timeline regarding when this happened.
if you ping the hub (any open hub) with a full RSS feed, then THAT (full RSS) is what gets pushed to your subscribers, and one of those can be Google but can also be your scraper.
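That discovery risk exists because subscribers find the hub through the feed itself: a PuSH-enabled feed advertises its hub in a link element, which a scraper can read just as easily as Google can. A typical declaration looks like this (URLs hypothetical):

```xml
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example Site</title>
    <link>https://www.example.com/</link>
    <!-- the hub that subscribers send their subscription requests to -->
    <atom:link rel="hub" href="https://pubsubhubbub.example.com/"/>
    <!-- the canonical ("self") URL of this feed -->
    <atom:link rel="self" href="https://www.example.com/feed.rss"/>
    <!-- feed items go here -->
  </channel>
</rss>
```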
I just add the full feed to Google Reader (for debugging) and Google Webmaster Tools (as a sitemap).
Why would you add the full RSS as a sitemap (and risk that it just might be discovered by scrapers, too) if Google only reads URLs from sitemaps?
<FilesMatch "\.rss$">
Header set X-Robots-Tag "noarchive, nosnippet"
# deny by default, then allow only Google IP ranges
Order deny,allow
Deny from all
Allow from 64.18.0.0/20 64.233.160.0/19 66.102.0.0/20 66.249.80.0/20
Allow from 72.14.192.0/18 74.125.0.0/16 173.194.0.0/16
Allow from 207.126.144.0/20 209.85.128.0/17 216.239.32.0/19
</FilesMatch>
Is there an equivalent for getting Googlebot IPs?
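Rather than maintaining a static IP list (which goes stale as Google adds ranges), Google's documented way to verify Googlebot is a reverse-then-forward DNS check: reverse-resolve the requesting IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A minimal sketch:

```python
# Sketch: verify a visiting IP really is Googlebot via the
# reverse-then-forward DNS check, instead of a hardcoded IP list.
import socket

def is_google_hostname(hostname):
    """True if the reverse-DNS name falls under googlebot.com or google.com."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)              # reverse DNS
        if not is_google_hostname(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]      # forward confirm
    except (socket.herror, socket.gaierror):
        return False                                           # no PTR record, etc.
```

Since this involves two DNS lookups per check, in practice you would cache the verdict per IP rather than run it on every request.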
You can then use logs to catch scrapers, since they'll be the ones hitting your pages fast and furious.
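That log-based approach can be sketched as a simple per-IP request count over an access-log window. This assumes the common Apache/Nginx layout where the client IP is the first field of each line, and the threshold here is an arbitrary illustration, not a tuned rule:

```python
# Sketch: flag IPs that hit pages "fast and furious" in an access log.
# Assumes the client IP is the first space-separated field of each line;
# the threshold is illustrative and should be tuned to your traffic.
from collections import Counter

def suspect_ips(log_lines, threshold=100):
    """Return {ip: hit_count} for IPs whose count exceeds the threshold."""
    hits = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in hits.items() if n > threshold}
```

Pair this with the Googlebot verification above so you don't accidentally block a legitimate crawler that also fetches quickly.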
As for RSS feeds, it's best not to have any for the majority of sites.
As in almost all pubsub systems, the publisher is unaware of the subscribers, if any.