
Alternative Search Engines Forum

Blekko Does Not Honor NOARCHIVE?
topr8
msg:4227709 - 11:40 am on Nov 7, 2010 (gmt 0)

Many sites go to great lengths to prevent scrapers from stealing their content.
These same sites also generally prevent the major search engines from caching their pages, using the robots noarchive tag:
<meta name="robots" content="noarchive">

Notice how WebmasterWorld doesn't have a 'cached' link in the SERPs; this is an example of noarchive in use. All the major search engines support it.

The reason is that a search engine cache is a well-known backdoor for scrapers, who can scrape your content through the cache instead of directly from your site.

However, blekko, the new search engine, has decided that it will not respect the noarchive tag.

I approached blekko to ask them about this, and Robert Saliba of Blekko Inc. said:
"we think that the meta noarchive tag is counter to providing our users with transparent information regarding the ranking and display of search results."

Luckily, for web admins who do use the noarchive tag, he had a solution, as he also said this:
"We also want to respect the wishes of website administrators. Accordingly, we are making changes so that, in the future, we will treat the meta noarchive tag as a meta noindex tag."

 

buckworks
msg:4248356 - 8:28 pm on Jan 2, 2011 (gmt 0)

will treat any meta noarchive pages it encounters as meta noindex, and will not index them


Then there's not much value in allowing Blekko to crawl our sites, is there?

skrenta
msg:4248357 - 8:43 pm on Jan 2, 2011 (gmt 0)

Of course, it's up to you whether you want blekko crawling your site or not.

noarchive is not a panacea for scraping, and I doubt that whether or not you let blekko crawl your content will have any effect on the level of copying we see on the internet. There are myriad ways to get content from sites, and going through a rate-limited 3rd-party site to pull cached content would be pretty far down my list of techniques.

The question of whether we could support noarchive opens up a can of worms for us, given our position on data transparency and our SEO features. If we shut off the cache view, do we also have to suppress the duptext report? Backlink/anchortext info for outbound material? Term stats? We're all about making web data available to the public, so if a site wants to limit usage of its data, we'd prefer to exclude it entirely.

From our perspective, noarchive content is highly correlated with spam, bait & switch paywalls, and other user-unfriendly material. We believe that omitting this content will result in an overall quality boost in our index.

frontpage
msg:4248358 - 8:50 pm on Jan 2, 2011 (gmt 0)

Take, for example, the website NYTimes.com.

If you look at the header of that index page, it shows:

<title>The New York Times - Breaking News, World News &amp; Multimedia</title>
<meta name="robots" content="noarchive,noodp,noydir">
<meta name="description" content="Find breaking news, multimedia, reviews &amp; opinion on Washington, business, sports, movies, travel, books, jobs, education, real estate, cars &amp; more.">
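
You can check those tags yourself from a shell; a minimal sketch using curl and grep, assuming nytimes.com still serves that same head section:

curl -s http://www.nytimes.com/ | grep -i '<meta name="robots"'

If the tag is still there, that should print the noarchive line shown above.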


Go to the Blekko search result for the NYTimes and click on the cache function with this URL - [blekko.com...]

When you use your browser (e.g. Firefox) to "View Page Source", you get this:

<head>
<title>The New York Times - Breaking News, World News &amp; Multimedia</title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml">


As you can see, all the metadata has been stripped out of the page.

Only when you click on the Blekko link in the upper right - [blekko.com...] - and "view source" do you see the original source code.

Perhaps, that's what the other poster was talking about.

noarchive content is highly correlated with spam, bait & switch paywalls, and other user-unfriendly material.


Right, that explains why most content providers use it to prevent scraping; just ask the NYTimes.

wheel
msg:4248369 - 9:35 pm on Jan 2, 2011 (gmt 0)

From our perspective, noarchive content is highly correlated with spam, bait & switch paywalls, and other user-unfriendly material. We believe that omitting this content will result in an overall quality boost in our index.

It's also highly correlated with savvy webmasters. So you just removed all their content (including, for example, the 10-20K of unique content in competitive niches that I have scattered around my various sites).

Frankly, none of the other SEs seem to think that the use of noarchive is a unique indicator of spam. Perhaps they've found other, more reliable indicators.

Right now the only use Blekko will have is for my competitors and scrapers/spammers to grab my data and/or review info about my site. There's about zero chance that they'd send through any buying traffic. And for that reason, I've IP-blocked them at the server level. All downside, no upside.
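
For illustration only, an Apache 2.2 .htaccess sketch of that kind of server-level IP block; 192.0.2.0/24 is the documentation placeholder range, not Blekko's actual addresses, so substitute the crawler IPs you actually see in your logs:

# Deny a crawler's address range outright (placeholder range shown)
Order allow,deny
Deny from 192.0.2.0/24
Allow from all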

incrediBILL
msg:4248370 - 9:38 pm on Jan 2, 2011 (gmt 0)

This is not correct. blekko does not alter the cached view in any way. It shows the exact bytes that we received at the time of crawl.


That's not true at all - the AdSense code is stripped from sites in your cache

noarchive is not a panacea for scraping


Not a panacea, but it's another tool to be used.

I have all bots blocked except a few, everything whitelisted, browsers validated, and anti-scraping controls in place for those stealth bots pretending to be browsers; my content distribution is locked down like Fort Knox, except to real humans.

... until Blekko crawls. Then it's the Wild Wild West all over again, with a cache to be scraped.

I put tracking bugs in my pages and proved over and over again that scrapers that couldn't steal from my site were simply scraping the search engines instead, as have many others.

If you think we have no control over how our content is distributed, or that you have the right to display full copyrighted pages without permission, you're WRONG!

Saying only spammers use NOARCHIVE is idiotic; spammers use everything the white hats use, so blocking white hats from the tools we use to thwart scrapers and spammers is PUNISHING THE GOOD GUYS!

You're new to this game; we've been at it much longer, fighting spammers and scrapers. We need NOARCHIVE; that's just the way it is.

Not to mention that cached pages often cause people serious problems. People have been sued over cached pages. Lawyers tell someone to remove trademarks and copyrighted data from a site, the webmaster complies, yet the cached page and the copies in archive.org still exist beyond the webmaster's immediate control, so they end up in court. It's happened many times already.

Go read [noarchive.net...] and you'll learn why cached pages should be avoided.

Otherwise, I guess you're not serious about being a respected search engine that webmasters will want to be a part of. We're not about to give back our hard-earned gains in content control just because some new company comes along, without knowing the history of the situation, with some wacky ideas about why it should screw everyone that relies on NOARCHIVE.

Do whatever you want - until NOARCHIVE is resolved, BLOCKED!


frontpage
msg:4248374 - 9:52 pm on Jan 2, 2011 (gmt 0)

Will you have an opt-out function for webmasters whose sites were cached despite specifically requesting otherwise with the noarchive meta?

Or will you automatically remove our data once you respider, encounter the noarchive meta again, and treat it as a noindex?

skrenta
msg:4248375 - 10:03 pm on Jan 2, 2011 (gmt 0)

Incredibill, happy to chat here, but please stop shouting at me. Thanks.

Frontpage - the content will be pulled as soon as we've recrawled/reindexed your content. ScoutJet also honors robots.txt. You can also 403 us or IP-ban us, but the interval for recrawl/removal will be the same.
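
For anyone who wants to go the robots.txt route, a minimal sketch, assuming the crawler still identifies itself as ScoutJet:

User-agent: ScoutJet
Disallow: /

That disallows the entire site for our bot only; other crawlers are unaffected.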

incrediBILL
msg:4248379 - 10:06 pm on Jan 2, 2011 (gmt 0)

Incredibill, happy to chat here, but please stop shouting at me. Thanks.


SHOUTING? I'm emphasizing, not shouting.

Ask around, you'll know when I'm shouting. :)

frontpage
msg:4248380 - 10:09 pm on Jan 2, 2011 (gmt 0)

Thanks for the quick reply, Skrenta. I see that your server is running Nginx; that's a nice compliment to that system's ability to handle the traffic.

Brett_Tabke
msg:4248386 - 10:47 pm on Jan 2, 2011 (gmt 0)

We've corrected this, so going forward blekko will treat any meta noarchive pages it encounters as meta noindex, and will not index them. This will take a little time before it is pushed to our production servers and makes it into our indices, so please be patient.


Thanks so much, Rich. I can only imagine the number of exceptions, issues, and "oops" details you have had to deal with while building Blekko. It is nice to know you guys can take some time to address those issues when they pop up. This goes a long way toward long-term success.

incrediBILL
msg:4248405 - 11:44 pm on Jan 2, 2011 (gmt 0)

We've corrected this, so going forward blekko will treat any meta noarchive pages it encounters as meta noindex, and will not index them. This will take a little time before it is pushed to our production servers and makes it into our indices, so please be patient.


That's not a fix whatsoever.

Now you've bastardized the intent and use of the NOARCHIVE meta directive.

You're better off requiring blocking in robots.txt instead of muddying the waters and confusing webmasters on the use of that meta tag.

<shakes head in dismay>

Just change the name to blecch-o if you're going to be messing everything up.

BTW, implement "revisit-after" while you're at it...

In other words, if you aren't going to do it right, please don't do it at all.


Angonasec
msg:4248406 - 11:47 pm on Jan 2, 2011 (gmt 0)

skrenta said,
Q/
Frontpage - the content will be pulled as soon as we've recrawled/reindexed your content. ScoutJet also honors robots.txt. You can also 403 or ip ban, but the interval for recrawl/removal will be the same.
/Q

Kindly confirm for all of us, just so there's absolutely no room for misunderstandings.

Those of us who have chosen EITHER to ban your bot's IPs, OR to disallow your bots via robots.txt, or who have done BOTH... will ALL content gathered from our sites BEFORE the blocking methods were put in place be removed from your search engine?

Angonasec
msg:4248407 - 11:59 pm on Jan 2, 2011 (gmt 0)

Brett_Tabke said:
Q/
and archive.org offers you NO way of removing back content. Their system in place, simply doesn't work. The only way to remove it, is via a lawyer.
/Q

Greasy though they are, I was surprised to find it straightforward to remove ALL our material from archive.org, just by using the methods they outline on their site. I had about a dozen old sites pulled. Now we block them via robots.txt and by IP (I know, I know, but belt and braces :)

netmeg
msg:4248412 - 12:47 am on Jan 3, 2011 (gmt 0)

Wow. Just... wow.

incrediBILL
msg:4248434 - 3:15 am on Jan 3, 2011 (gmt 0)

Seriously, why do I care anyway?

I tried to help them make the right decision, but in the end they can fall by the wayside like Cuil, Mahalo, SearchMe, and dozens (hundreds?) of others.

Just a matter of time before the VCs pull the plug and my blood pressure returns to normal.

true_INFP
msg:4248551 - 1:32 pm on Jan 3, 2011 (gmt 0)

From our perspective, noarchive content is highly correlated with spam, bait & switch paywalls, and other user-unfriendly material. We believe that omitting this content will result in an overall quality boost in our index.

lol

Staffa
msg:4248654 - 6:58 pm on Jan 3, 2011 (gmt 0)

Seriously, why do I care anyway?


Exactly!
Any start-up SE only has to read through the boards here; there are numerous dos and don'ts to be found, written by webmasters who know what they are talking about. Those who don't heed them are sure to fall by the wayside, as many have before them.

I have blocked this particular bot since I first noticed it, as it had nothing to offer me in return at the time.
With their current attitude, it will remain blocked.

I already had a stroke and a heart attack a few years ago (fully recovered though) and no bot with attitude is ever going to raise my blood pressure again ;o)

netmeg
msg:4248727 - 10:13 pm on Jan 3, 2011 (gmt 0)

Actually, if noarchive = noindex, you don't even have to block it.

Brett_Tabke
msg:4248729 - 10:17 pm on Jan 3, 2011 (gmt 0)

> remove ALL our material from archive.org

It doesn't work. I've tried for five years to get two DOA sites removed that I just don't want on the web. There is no way to do it short of a lawyer.

> That's not a fix whatsoever.

Sure it is - it is perfect.

> Now you've #*$!ized the intent and use of the
> NOARCHIVE meta directive.

This was a command Google thought up out of thin air. Other SEs are under no obligation. Is it copyrighted? By placing a proprietary Google tag on your page, are you quietly entering into an agreement?

> You're better off requiring blocking in robots.txt instead of
> muddying the waters and confusing webmasters on the use of that meta tag.

There is no confusion. You use noarchive on Blekko and it is the same as noindex. If there are muddy waters here, they are on Google, which instituted the rebroadcasting[1] of websites with its own name at the top of the URL. Any SE that supports caching is opening itself up to the same copyright issues Google is still open to. Blekko is probably wise not to even deal with that mess and to treat noarchive as noindex. This may well be a decision made by the legal team.

[1] Google called it "caching" in an attempt to seek cover under the DMCA's ISP safe-harbor rules. Google's version of "caching" has never been tested in a court.

physics
msg:4248732 - 10:28 pm on Jan 3, 2011 (gmt 0)

Blocked with a 403 now; forget this meta and robots.txt stuff :)
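
In case anyone wants to do the same, a rough .htaccess sketch that 403s the bot by user agent; it assumes Apache with mod_rewrite enabled and that the crawler still announces itself as ScoutJet:

# Send 403 Forbidden to Blekko's crawler (user-agent match)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ScoutJet [NC]
RewriteRule .* - [F]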

incrediBILL
msg:4248761 - 11:54 pm on Jan 3, 2011 (gmt 0)

Blekko is probably wise not to even deal with that mess and treat noarchive as noindex. This may well be a decision made by the legal team.


Yeah, and let's start mixing the metaphors and create confusion.

NOINDEX means NO INDEX
NOARCHIVE means NO CACHE PAGE
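
In tag form (the second is the one Google introduced and the other majors adopted):

<meta name="robots" content="noindex">
<meta name="robots" content="noarchive">

The first keeps a page out of the index entirely; the second lets it be indexed but tells the engine to show no cached copy.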

Totally different things. And it's a complete BS argument, because a legal team with half a brain would tell them not to display a cache at all, not to confuse the matter by misinterpreting the meaning of commonly accepted meta tags!

So what if Google just made it up?

As I pointed out long ago in this thread, Bing, Yahoo, Ask, Gigablast, Nutch, and others adopted it as-is.

Why should these guys come along and do something different?

It's not a good engineering principle to muddy the waters with a non-standard implementation, but with such attitudes it appears they aren't interested in good engineering, which brings the quality of the SE and its standards into question, IMO.

wheel
msg:4248763 - 11:56 pm on Jan 3, 2011 (gmt 0)

There is no confusion. You use noarchive on Blekko and it is the same as noindex.

Fair enough. But then there's no halfway point, where we can have our content indexed (to both our benefit and the SE's) and not subject ourselves to a cache. They're throwing the baby out with the bathwater.

In my case, I've got a lot of sites on my server, and I'm not about to go through them one by one and change my htaccess or robots.txt over this. They got IP-blocked at the server level. How's that for noarchive? :)

The underlying problem here is the hubris that's being displayed. It's worthy of a Google employee. They've ignored the webmaster community and are seeing the reaction - you're blocked. They lose valuable content for their index. There are ramifications for ignoring the wants of those whose content you are taking. And I'm not like Incredibill - I let pretty much everyone scrape the bejeepers out of my content.

TheMadScientist
msg:4248898 - 11:17 am on Jan 4, 2011 (gmt 0)

Why should these guys come along and do something different?

Uh... Because they can?

IMO it's actually really amusing how people get bent out of shape over silly ish sometimes... The really interesting thing is that there's a bunch of people here up in arms who have them blocked, but somehow Blekko still works... It seems Blekko's ability to return results doesn't hinge on the sites that have them blocked... Go figure: there are billions of pages on the Internet, and they can show results without yours. Or mine, for that matter; not only do I run noarchive on my sites, I have them blocked at the server level, but it really doesn't bother me a bit if they decide to treat a tag differently than some others do... What about nofollow? Do they have to treat it the same way the other SEs do, or is it only the one directive they have to treat like everyone else?


londrum
msg:4248903 - 11:26 am on Jan 4, 2011 (gmt 0)

I blocked them with robots.txt back in November (at the beginning of the thread), and my entire site is still showing in their index; it's just that they haven't recrawled it. So all the pages are now two months out of date.

So maybe the way they honour noindex is a little bit different as well.

wheel
msg:4248909 - 11:56 am on Jan 4, 2011 (gmt 0)


IMO It's actually really amusing how people get bent out of shape over silly ish sometimes...

It's not amusing when people take my content for their personal gain. My content is my livelihood; I don't know why you find that amusing.

TheMadScientist
msg:4249054 - 7:23 pm on Jan 4, 2011 (gmt 0)

They're not showing it in the index (or say they're not) if you use a noarchive tag...
(It's in the process of being removed, or shortly will be, from what I've read.)

How is that using your content for their gain?

<amusing>
Blekko, you Bast****, you spidered my page, scanned my noarchive tag, and now you don't show the page in your index, or even a cache of it. It's like it's not even there when I search... STOP USING MY CONTENT FOR YOUR GAIN!
</amusing>

They're using your content Less Than Anyone Else when you put a noarchive tag on the page. This is starting to look like a "bash Blekko for being different" thread...

If you all don't like Blekko, block 'em and don't use 'em; and if you think you know how it should or has to be done to be successful, quit telling Blekko, get off your a**, and do it... My guess is you can't, even though you keep telling others how they have to.

I actually think their search features are cool...

incrediBILL
msg:4249069 - 7:59 pm on Jan 4, 2011 (gmt 0)

Did we just get our collective legs pulled?

Go check out WebmasterWorld's cache pages on blekko:
[blekko.com...]

I see snippets, and when I click "cache" I get "Error: No content". So for WebmasterWorld, it appears to be implemented to support NOARCHIVE while maintaining the snippets in the index.

This is what it's supposed to do.

So why not just come out and admit they did it right? Or are they changing it?

londrum
msg:4249074 - 8:04 pm on Jan 4, 2011 (gmt 0)

Doesn't work like that with my site. The link still shows a cached copy, and my pages have been noarchived since the dawn of time.

Angonasec
msg:4249207 - 2:33 am on Jan 5, 2011 (gmt 0)

Brett_Tabke said:
Q/
There is no way to do it short of a lawyer.
/Q

Flat wrong! Reading this thread alone, you'll see three of us had content removed quickly, simply by following the archive.org instructions on their site. No lawyers were involved in pulling any of our sites.

incrediBILL
msg:4249208 - 2:38 am on Jan 5, 2011 (gmt 0)

Found out why the WebmasterWorld listings in Blekko honor NOARCHIVE: Blekko has never crawled WebmasterWorld, and the listings are being pulled from some 3rd-party SE API.

They've never crawled WebmasterWorld at all (blocked by robots.txt), yet they still display listings for WebmasterWorld. Huh?

So the deal is: if you don't let them crawl your site, they'll honor NOARCHIVE while accessing your data anyway through a 3rd-party API.

How the hell do you opt out of Blekko if they use a 3rd-party API to get your data even when you block them via robots.txt? By pulling data from a 3rd party for sites that have blocked them in robots.txt, they give the surfers of their site the impression that you're permitting them to use your data, even when the crawler is blocked.


BLEKKO IS HOTEL CALIFORNIA! YOU CAN CHECK IN BUT YOU CAN NEVER LEAVE!

topr8
msg:4249376 - 2:55 pm on Jan 5, 2011 (gmt 0)

>>Go check out WebmasterWorld's cache pages on blekko:

Interestingly, the top two results are for

WebmasterWorld.com

and

www.WebmasterWorld.com

despite the fact that webmasterworld.com is redirected to www.WebmasterWorld.com.
