Adsense pub ID scraping issue (revisited)

Forum Moderators: martinibuster

Blekko and your Adsense ID

numnum

8:39 pm on Nov 3, 2011 (gmt 0)

We've broached this topic a few times in the last year here, but I want to revisit it.

Background: Blekko lists all sites using the same Adsense publisher. Just bring up any Adsense publisher site in the SERPs and click on the "Adsense" link. The feature is free for users at the expense of the publisher's privacy.

As far as I can determine, the only way to thwart this wonderful feature is to remove Adsense from all URLs on all my sites (not just some of them). But this is not a viable option for me. (If you remove it from one site it will still be referenced by the others in Blekko's SERPs.)

In an older thread on this subject (I can't locate it right now), someone suggested encrypting your HTML. I tried that with a free Web-based encryption tool, but that tool encrypts only the HTML -- not Javascript. Has anyone here figured out another way to thwart this intrusive Blekko feature? And in any event, encrypting Adsense code *might* violate Google's TOS, right?

CMidd

9:39 pm on Nov 3, 2011 (gmt 0)

Use a middleman ad server that loads AdSense. I think Blekko only pulls the page source and doesn't execute JavaScript.
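
To see what a source-only crawler would find, one option is to scan the raw HTML for anything that looks like a publisher ID. The sketch below is only an illustration: the pattern assumes IDs look like "pub-" or "ca-pub-" followed by a long run of digits, and find_publisher_ids, the sample pages, and the /ads/slot1.js path are made-up names, not part of any AdSense tooling.

```python
import re

# Assumption: publisher IDs appear as "pub-" or "ca-pub-" plus 10-20 digits.
PUB_ID_PATTERN = re.compile(r"\b(?:ca-)?pub-\d{10,20}\b")


def find_publisher_ids(html: str) -> list[str]:
    """Return the distinct publisher-ID-like strings found in raw HTML."""
    return sorted(set(PUB_ID_PATTERN.findall(html)))


# A page with the ad code inline: the ID is visible without running any script.
INLINE_PAGE = """
<script type="text/javascript">
google_ad_client = "ca-pub-0000000000000000";
</script>
"""

# A page that loads the ad code from a separate (hypothetical) script file.
MIDDLEMAN_PAGE = '<script src="/ads/slot1.js"></script>'

inline_ids = find_publisher_ids(INLINE_PAGE)
middleman_ids = find_publisher_ids(MIDDLEMAN_PAGE)
```

Note that moving the code into an external file only hides the ID from crawlers that never fetch that file; a crawler that downloads /ads/slot1.js would still see it.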

numnum

12:47 am on Nov 4, 2011 (gmt 0)

>> Use a middleman ad server that loads AdSense. I think Blekko only pulls the page source and doesn't execute JavaScript.

Can someone explain to me what "middleman adserver" means, how it works, and how to implement it? And has this approach worked for you personally in masking your publisher ID so that it doesn't appear in the source code, without interfering with Adsense delivery to your site?

incrediBILL

3:16 am on Nov 4, 2011 (gmt 0)

Just block Blekko from your sites like I do and my AdSense doesn't show in Blekko.

They don't provide any traffic yet, they waste resources, and they expose your AdSense ID.

BLOCK BLEKKO! [webmasterworld.com] ;)

koan

4:23 am on Nov 4, 2011 (gmt 0)

Like the Joker said "It's simple, we kill the Blekko". Blocked in my robots.txt file.
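
For reference, a robots.txt rule for Blekko can be sanity-checked with Python's standard urllib.robotparser before uploading it. This is a sketch under one assumption: that Blekko's crawler identifies itself with the token ScoutJet (check your own logs to confirm). The is_allowed helper and the example.com URL are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Assumption: Blekko's crawler announces itself as "ScoutJet".
ROBOTS_TXT = """\
User-agent: ScoutJet
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether a polite crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blekko_ok = is_allowed(ROBOTS_TXT, "ScoutJet", "http://example.com/page.html")
google_ok = is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/page.html")
```

This only tells you what a crawler that honors robots.txt would do; it does nothing against one that ignores the file.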

onepointone

6:11 am on Nov 4, 2011 (gmt 0)

That's fine, but you can find out the same thing using google...

incrediBILL

6:35 am on Nov 4, 2011 (gmt 0)

>> That's fine, but you can find out the same thing using google...

Yikes! There are a whole bunch of sites doing this!

Time to see if I can't find the bots doing all this, sheesh.

Lame_Wolf

7:55 am on Nov 4, 2011 (gmt 0)

>> Yikes! There are a whole bunch of sites doing this!

They've been there for ages. I am surprised you didn't know about them.

piatkow

9:45 am on Nov 4, 2011 (gmt 0)

>> I am surprised you didn't know about them

It probably hadn't even occurred to most publishers to look for them. That's a fundamental problem with the web: the information is there if you know that you want it. If you don't know that it exists, then you won't search for it.

incrediBILL

10:42 am on Nov 4, 2011 (gmt 0)

Here's a couple to block:

webdetail.org [webmasterworld.com...]
seeallweb.org [webmasterworld.com...]

I'll post more later.

Many of the sites seem to share data from a single source. It might take a while to crack that nut, but it will be cracked ;)

netmeg

2:45 pm on Nov 4, 2011 (gmt 0)

Well let me know when you do.

(I have Blekko blocked as well, but they consider my sites spam because I use NOARCHIVE anyway)

incrediBILL

12:59 am on Nov 6, 2011 (gmt 0)

Got a few more to block:

spyrush.com [webmasterworld.com...]
abouthedomain.com [webmasterworld.com...]
w3who.net [webmasterworld.com...]

I'm also seeing a lot of shared data from single sources. I've never tried it, but if you can opt out of Alexa, Compete and a bunch of the others, in addition to blocking the crawlers I've found, you might be able to get your AdSense and GA IDs off the 'net once your site is no longer available in the original aggregator sources.

You'll also want to use NOARCHIVE [noarchive.net] in your pages to stop search engine page caching. Cached pages are another source these sites can use to get your AdSense ID.

Additionally, block Archive.org in robots.txt. Alexa uses this data, I think, though I hear they often ignore robots.txt, who knows:

User-agent: ia_archiver
Disallow: /

Now you know why members in the spider forum do what they do: it's easier to keep your data private if you don't let it get scraped in the first place! Whitelist trusted spiders like Google, Bing, etc., use NOARCHIVE so they don't cache your pages, and it's pretty secure and you'll still get traffic.
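
NOARCHIVE is normally set with a robots meta tag, e.g. <meta name="robots" content="noarchive"> in the page head. A quick way to confirm that a template really emits it is sketched below using Python's standard html.parser. RobotsMetaParser, has_noarchive and the sample pages are made-up names for illustration, and the check only looks at the generic "robots" meta tag, not at engine-specific tags.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives listed in <meta name="robots" content="...">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noarchive(html: str) -> bool:
    """True if the page asks search engines not to keep a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


PROTECTED_PAGE = '<html><head><meta name="robots" content="index, follow, NOARCHIVE"></head></html>'
PLAIN_PAGE = "<html><head><title>No robots meta here</title></head></html>"

protected = has_noarchive(PROTECTED_PAGE)
plain = has_noarchive(PLAIN_PAGE)
```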

topr8

4:09 pm on Nov 6, 2011 (gmt 0)

>> Has anyone here figured out another way to thwart this intrusive Blekko feature?

well the obvious thing to do if you don't block blekko altogether is to just not serve the adsense ads to the blekko bot, pretty straightforward.

numnum

11:48 pm on Nov 6, 2011 (gmt 0)

>> well the obvious thing to do if you don't block blekko altogether is to just not serve the adsense ads to the blekko bot, pretty straightforward.

How do I do that? (I'm the OP.)

From what I've read among the responses here, it seems that the best course of action is to create a short whitelist in my .htaccess file. (Google and Bing-Hoo account for 99%+ of my search traffic, and so whitelisting just those few should suffice.)

But, if I block all crawlers but the few I've whitelisted, wouldn't Blekko and their ilk continue to make available what they've already archived, like the Wayback Machine does? Or would they remove a URL and any archived version of it from their index when they try crawling it the next time and can't?

As for inserting a string of no-crawl requests in my robots.txt file, aren't these requests increasingly ignored by the crawlers? If so, why bother? Just block them at the server level instead, right?

incrediBILL

1:36 am on Nov 7, 2011 (gmt 0)

numnum, security is like an onion: you implement it in layers and don't skip a layer just because some don't honor it.

For instance, the bad bots ignore robots.txt but the polite bots honor it, so we implement robots.txt to be nice to the good bots and then use .htaccess to enforce it. It's like posting speed limits and then setting cops in speed traps to make sure you really don't speed after all.

Once you've blocked them, it'll take quite some time for some sites to remove your URLs, as they may not revisit your site for weeks or even months before reading your updated robots.txt, and then it may take weeks or more before they update their index.

At this point, you need patience, because once the scraped genie is out of the bottle it takes a lot longer to get it back into the bottle than it would have if you never let it get scraped in the first place.

FYI, if they don't honor the robots.txt removal in the Wayback Machine, you can contact them directly via email and insist on having sites removed. They will do it; I've done it. Just make sure your robots.txt and whatever else they require is set properly before contacting them so they can see it's a valid request.

numnum

2:29 am on Nov 7, 2011 (gmt 0)

incrediBILL,

Yes, I understand that I'll need to be patient since the cat (my pub-ID) is out of the bag. Your understanding is that Blekko will delete all cached versions of my pages the next time their bot attempts but is blocked from crawling my site, right? I'm just a bit skeptical about this. After all, by continuing to make cached pages publicly available they provide an archival service for content scrapers.

And yes, I was thinking the same thing: that I should implement as many prophylactics as possible. And I suppose a three-bagger would be in order:

1. whitelist at the server level (.htaccess)
2. noindex (robots.txt or meta tag) and keep adding to the list as needed
3. noarchive (robots.txt or meta tag)

Regarding #3 above, is there any reason why a site owner might want any search engine to archive any URL at all -- assuming the Webmaster stays current with 301 redirects, etc.?

>> FYI, if they don't honor the robots.txt removal in the Wayback Machine, you can contact them directly via email and insist on having sites removed. They will do it; I've done it. Just make sure your robots.txt and whatever else they require is set properly before contacting them so they can see it's a valid request.

Yes, I'm aware of this. In fact, I've intentionally NOT requested that IA remove my site from its archives because my archived content there provides authorship evidence predating my registering that content with the U.S. Copyright Office. What's more, IA stopped archiving my site the very month I started including Adsense code on my pages. And I assume that is why they stopped: their bot is (or at least was at that time) allergic to javascript. So IA is not an issue with me as far as Adsense pub-ID scraping is concerned.

incrediBILL

3:20 am on Nov 7, 2011 (gmt 0)

>> After all, by continuing to make cached pages publicly available they provide an archival service for content scrapers.

Actually, Blekko implemented meta NOARCHIVE improperly.

Not only don't they cache your page, they totally remove it from the index!

Idiots.

Blekko is the least of your problems.

Now that I've been looking there are dozens of other sites exposing AdSense and GA info, and perhaps they're gleaning it from Blekko, hard to say.

All I know is some of the sites are 100% blocked from my servers yet still get the AdSense ID, so it's being obtained from a 3rd party service.

Still working on figuring out who that 3rd party service is as well, since I have Blekko blocked!

>> Regarding #3 above, is there any reason why a site owner might want any search engine to archive any URL at all -- assuming the Webmaster stays current with 301 redirects, etc.?

Archives, including the Wayback Machine, are a double-edged sword.

You can use them to your advantage at times, at other times you can get screwed severely when scrapers, lawyers, competitors, SEOs and others use them against you.

I recommend people use private site archival services instead. There are a few, and they can also be referenced when push comes to shove regarding copyright, etc., but they are private and not fodder for scrapers, lawyers, SEOs, etc.

numnum

9:12 pm on Nov 8, 2011 (gmt 0)

>> Actually, Blekko implemented meta NOARCHIVE improperly... Not only don't they cache your page, they totally remove it from the index!... Idiots!

I've just implemented NOARCHIVE across my entire site. (I don't see any reason not to.) If you're correct, incrediBILL, then the next time Blekko tries to crawl me, I'll disappear entirely from their index. Perfect!

>> All I know is some of the sites are 100% blocked from my servers yet still get the AdSense ID, so it's being obtained from a 3rd party service... Still working on figuring out who that 3rd party service is as well, since I have Blekko blocked!

Interesting. Perhaps a below-the-radar scraper solicits and sells the information. If so, they might be hard to identify unless you go undercover by posing as a bot/SE operator. Just a theory from someone who knows next to nothing about that business.

>> I recommend people use private site archival services instead. There are a few, and they can also be referenced when push comes to shove regarding copyright, etc., but they are private and not fodder for scrapers, lawyers, SEOs, etc.

I'm using a public archival service called the U.S. Copyright Office, which should serve my interests nicely should an infringement issue arise.