Blekko Does Not Honor NOARCHIVE? - Alternative Search Engines forum at WebmasterWorld - WebmasterWorld

Forum Moderators: bakedjake

Blekko Does Not Honor NOARCHIVE?

topr8

11:40 am on Nov 7, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Many sites go to great lengths to prevent scrapers from stealing their content.
These same sites also generally prevent the major search engines from caching their pages, using the robots noarchive tag:
<meta name="robots" content="noarchive">

notice how WebmasterWorld doesn't have a 'cached' link in the SERPS, this is an example of noarchive in use, all the major search engines support it.
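For readers who want to see how a crawler might read that tag, here is a minimal sketch using only the Python standard library. The class and function names (RobotsMetaParser, robots_directives, may_show_cached_copy) are made up for illustration and are not taken from any search engine's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_show_cached_copy(html):
    """A crawler that honours noarchive would skip the 'cached' link here."""
    return "noarchive" not in robots_directives(html)
```

Calling may_show_cached_copy() on a page that carries the tag shown above returns False; a page without any robots meta tag returns True.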

the reason is that a search engine cache is a well-known backdoor for scrapers, who can scrape your content through the cache instead of directly from your site.

however blekko, the new search engine, has decided that it will not respect the noarchive tag.

I approached blekko to ask them about this and Robert Saliba, of Blekko Inc said :
"we think that the meta noarchive tag is counter to providing our users with transparent information
regarding the ranking and display of search results."

luckily though, for web admins who do use the noarchive tag, he had a solution, as he also said this...
"We also want to respect the wishes of website administrators. Accordingly,
we are making changes so that in the future, we will treat the meta
noarchive tag as a meta noindex tag."

tangor

11:45 am on Nov 7, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Like the cutting-off-your-nose-to-spite-your-face attitude.

Then again most SEs these daze (sic) are not worthy of attention.

B, G and Y are top of the charts, all the rest are... all the rest...

Staffa

4:01 pm on Nov 7, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

we will treat the meta
noarchive tag as a meta noindex tag

Here we go again, their diaper isn't dry yet and already it's my way or no way.
A sure fire way to get blocked from the start. They need our content, not the other way around.

londrum

6:36 pm on Nov 7, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

blocked them already

topr8

11:16 pm on Nov 8, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

well i had allowed them for quite a while, so i was very disappointed about the failure to support noarchive as i imagine many others will be too!

oh well, i'm going to have to block them too now! mind you i'm not sure i'd want people analysing my site using their data anyway.

tedster

3:07 am on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Rich Skrenta, CEO of Blekko, has just clarified their policy for noarchive in a discussion on Twitter:

sorry for the confusion. we debated having noarchive mean noindex on blekko, but did not go that way in the end

[twitter.com...]

topr8

8:15 am on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

... but will noarchive be treated in the same way as all other significant search engines? or will it be ignored?

incrediBILL

7:45 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Blekko just doesn't get it:

we ignore noarchive and do nothing with it

[twitter.com...]

Webmasters should be able to opt into Blekko without allowing Blekko to display cached copyrighted pages.

Every serious search engine supports NOARCHIVE, and there are many valid reasons to not permit cached pages to be in a search engine.

They claim NOARCHIVE is used by spammers, when in reality they're going to allow SCRAPERS to grab anyone's content that permits Blekko to cache their pages.

We can block scrapers from our sites but we can't block them from taking cached pages elsewhere without controls like NOARCHIVE.

Besides, displaying the entire page without permission, i.e. ignoring NOARCHIVE, is a blatant copyright violation. Snippets are fair use; whole pages are copyright violations. I wonder how they'd respond to a DMCA notice?

tedster

9:54 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'm with you, incrediBILL. At least they dropped the ridiculous blackmail of noarchive=noindex, but ignoring a widely accepted standard is still not the way to go at all. That standard evolved for very solid reasons.

That Twitter discussion just keeps sliding all over the place. It's way too slippery for my taste.

netmeg

9:58 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I NOARCHIVE everything, for myself and for clients, and I'm not the only one by a long shot. I think they will have to change their policy on this, or dissolve as all the others have.

frontpage

9:59 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Just a reminder from my post in the other Blekko thread.

If you use ModSecurity 2.x, here is a rule to serve that ScoutJet user agent a 403 Forbidden page.

SecRule REQUEST_HEADERS:User-Agent "ScoutJet" "deny,log,status:403"

According to Blekko, ScoutJet crawls from the following IP ranges:

64.13.159.*
38.99.96.*, 38.99.97.*, 38.99.98.*, 38.99.99.*
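
If you would rather check addresses in application code or a log script, the ranges quoted above can be tested with Python's ipaddress module. This is only a sketch: the wildcard ranges are written out as /24 networks, and the names SCOUTJET_NETWORKS and is_scoutjet_ip are invented for this example.

```python
import ipaddress

# The ranges quoted in the post, with each "x.y.z.*" written as a /24 network.
SCOUTJET_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "64.13.159.0/24",
        "38.99.96.0/24",
        "38.99.97.0/24",
        "38.99.98.0/24",
        "38.99.99.0/24",
    )
]


def is_scoutjet_ip(address):
    """Return True if the address falls inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in SCOUTJET_NETWORKS)
```

A request whose User-Agent claims to be ScoutJet but whose address fails this check may be something else borrowing the name, so the two tests are worth combining.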

frontpage

10:01 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

At least they dropped the ridiculous blackmail of noarchive=noindex

I have had noarchive in place for years, yet Blekko still managed to cache/index 94 pages in one domain before I caught what they were doing.

So it looks like they did not even honor the noarchive=noindex to begin with.

jmccormac

10:17 pm on Jan 1, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

The most obvious question: If they are not to become yet another Cuil, how do they intend to monetise their operation?

Regards...jmcc

wheel

12:40 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thanks for raising this. I just issued the following on my webserver to block their crawler:
iptables -A INPUT -s 64.13.159.0/24 -j DROP
iptables -A INPUT -s 38.99.96.0/24 -j DROP
iptables -A INPUT -s 38.99.97.0/24 -j DROP
iptables -A INPUT -s 38.99.98.0/24 -j DROP
iptables -A INPUT -s 38.99.99.0/24 -j DROP

Smell ya later!
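
A side note on the rule list: the four adjacent 38.99.x ranges that frontpage listed earlier in the thread collapse into a single CIDR block, which keeps the firewall rules shorter. The sketch below uses ipaddress.collapse_addresses from the Python standard library; the helper names are hypothetical, and the generated lines simply mirror the iptables form used in this post.

```python
import ipaddress

# Ranges as listed earlier in the thread, written as /24 networks.
SCOUTJET_RANGES = [
    "64.13.159.0/24",
    "38.99.96.0/24",
    "38.99.97.0/24",
    "38.99.98.0/24",
    "38.99.99.0/24",
]


def collapse(ranges):
    """Merge adjacent or overlapping networks into the fewest CIDR blocks."""
    networks = (ipaddress.ip_network(r) for r in ranges)
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


def iptables_drop_rules(ranges):
    """Build one DROP rule per collapsed block (strings only, nothing is run)."""
    return [f"iptables -A INPUT -s {cidr} -j DROP" for cidr in collapse(ranges)]
```

Printing the result of iptables_drop_rules(SCOUTJET_RANGES) gives two rules instead of five. The function only builds the command strings; applying them is still up to you.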

Angonasec

1:14 am on Jan 2, 2011 (gmt 0)

They are blocked in our robots.txt and, so far, appear to honour that at least.

But I had a very unpleasant personal encounter with Mr. Skrenta when he was running Dmoz, now CEO of Blekko.

No thanks!

frontpage

1:38 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Wow, "DMOZ". I have not heard about it in a long time.

I am surprised it still exists. It is the fiefdom of the worst in human edited directories.

Mods who have a financial interest in topics they administer prevent legitimate websites from being listed or delete listings and poor webmasters have no recourse to get inclusion.

frontpage

1:41 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

When I read this, I learned all I needed to know about this company.

Blekko was the name of company CEO Rich Skrenta's first networked computer. Skrenta was 15 years old when he wrote the Elk Cloner virus that infected Apple II machines in 1982; it is believed to have been the first large-scale self-spreading personal computer virus ever created. Skrenta went on to work on the Amiga at Commodore, then at Sun Microsystems, then co-founded the Netscape-acquired Dmoz and the Tribune/Gannett/Knight Ridder-acquired local news search engine Topix.

Brett_Tabke

2:04 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Best Post Of The Month

Yes, Blekko honors robots.txt. You do not have to allow yourself to be indexed.

Rich Skrenta in another thread said: [webmasterworld.com...]

ScoutJet is me, it is a good robot. It has a 45-second min delay between fetches per-ipaddr. Of course you are free not to let it in, it obeys robots.txt of course.

incrediBILL, totally agree on the poor value from niche search engines. Not our intent. Full scale real web search is so much more interesting.

> noarchive

Was never endorsed or proposed by any standards body. New engines are not obligated to honor another search engine's proprietary commands.

Before you run over a cliff with wild bs, you really should check out Blekko, it has some awesome features. You plugged your nose when Google came around with all its own issues (like 'caching') - give Blekko the same chance.

There will be updates and changes to their engine as they progress. No engine is going to be without mistakes or oversights when they first get going.

> Skrenta

Rich [webmasterworld.com] is a very old friend of this community and long time user of WebmasterWorld.

incrediBILL

3:05 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

> noarchive

Was never endorsed or proposed by any standards body. New engines are not obligated to honor another search engine's proprietary commands.

That argument doesn't work because many de facto standards come about without ever being proposed by any standards body, they happen because of majority adoption, which later end up in standards.

Besides, x-no-archive used in a header is an actual RFC standard that was adopted by the search engines from usenet and mutated into the NOARCHIVE meta directive: [en.wikipedia.org...]

Before you run over a cliff with wild bs, you really should check out Blekko, it has some awesome features. You plugged your nose when Google came around with all its own issues (like 'caching') - give Blekko the same chance.

We were/are giving it a chance.

The BS started when the CEO came right out and said they wouldn't support NOARCHIVE.

You can either let them post full cached pages, which, as opposed to fair use snippets, is a violation of copyright, or opt out of Blekko completely. There is no other option.

If they force us to opt-out just to protect our content, how is that giving them a chance?

Supporting one simple NOARCHIVE command that's fairly universally supported solves this problem.

Google, Yahoo, Bing, Ask and even Gigablast support it: [gigablast.com...]

Even open source NUTCH supports it: [issues.apache.org...]

Though not strictly a bug, this issue is potentially serious for users of Nutch who deploy live systems who might be threatened with legal action for caching copies of copyrighted material. The major search engines all observe this directive (even though apparently it's not standard) so there's every reason why Nutch should too.

Such universal adoption pretty much spells standard IMO.

frontpage

4:46 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This article by Skrenta pretty much put the coffin nails in DMOZ back in 2006.

Skrenta: "Similarly I think the ODP is suffering from its closed, stultifying culture."

[skrenta.com...]

Whether a new search engine that relies on human editing to produce 'slashtags' will be any more successful than ODP in the long run, given their similar input style, remains to be seen.

true_INFP

9:57 am on Jan 2, 2011 (gmt 0)

10+ Year Member

What is Blekko?

What worries me more are major sites like the Wayback Machine archive.org ignoring the noarchive tag...

By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about. Individually, none of those search engines shows as a significant traffic source in your stats, but together they are an important and strong source. You may know the Russian Yandex and Chinese Baidu, but there are dozens of smaller countries, each with their own popular search engine. That is why we do not white-list, but black-list bots in robots.txt.

zdgn

11:17 am on Jan 2, 2011 (gmt 0)

10+ Year Member

Skrenta was 15 years old when he wrote the Elk Cloner virus that infected Apple II machines in 1982; it is believed to have been the first large-scale self-spreading personal computer virus ever created.

Gotta love the information age when things like this are almost boasted about and reported as a medal of honour, eh?

It's like a new barber in your neighbourhood known for saying he'd pricked customers with virus-infected needles at his dad's shop in his teenage years. :-)

[edited by: zdgn at 11:20 am (utc) on Jan 2, 2011]

Rosalind

11:18 am on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I find it interesting that the noarchive tag isn't merely ignored, it's also deleted from the cached copy.

I'm going to use robots.txt until Blekko changes its stance on this one.

Brett_Tabke

12:52 pm on Jan 2, 2011 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Best Post Of The Month

> I'm going to use robots.txt until Blekko changes its stance on this one.

Yep. That's the answer for now.

I will ask Rich to comment.

> What worries me more are major sites like the Wayback Machine
> archive.org ignoring the noarchive tag...

and archive.org offers you NO way of removing back content. The system they have in place simply doesn't work. The only way to remove it is via a lawyer.

incrediBILL

1:48 pm on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about.

Not really, because with either blacklisting or whitelisting you still have to keep an eye on what's new making requests on your server. The difference is that with whitelisting you get to decide how your data is exported, but with blacklisting it's too late: the data is out the gate, and trying to stop it from being used after the fact is a real problem. By monitoring what's asking for robots.txt you can find things that would actually honor robots.txt, and add anything useful to your whitelist.

Robots.txt is really toothless. What most don't do is also build a whitelisted .htaccess file, which is extremely important as it's a hard block to stop things from getting past robots.txt that aren't allowed.
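
To make the whitelisting idea concrete, here is a rough sketch of a whitelist-style robots.txt and a way to test it with Python's urllib.robotparser. The two allowed crawler names are examples only, not a recommendation, and the allowed() helper is hypothetical. As the post says, robots.txt is advisory, so this only shows what a well-behaved robot would conclude.

```python
from urllib.robotparser import RobotFileParser

# Whitelist style: named crawlers get an empty Disallow (allow everything),
# every other robot is told to stay out.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, path="/", robots_txt=WHITELIST_ROBOTS_TXT):
    """Return True if a compliant robot with this name may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this file, allowed("Googlebot") is True while allowed("ScoutJet") is False. A robot that ignores robots.txt is not affected at all, which is why the server-side block described above still matters.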

Trust me, monitoring new robots.txt requests is a lot less work than monitoring for bad activity that needs to be blocked.

It's doing work smart vs. doing it the hard way, as blacklisting is an infinite time suck and whitelisting is finite.

Besides, if you're currently getting referral traffic from these search engines you already know which ones to add to your white list.

If you aren't getting any referral traffic, they're just a drain on your resources.

I'd like to whitelist Blekko, I'm just a NOARCHIVE away from doing it! :)

and archive.org offers you NO way of removing back content.

I did everything they documented to stop archive.org from crawling or showing my sites on archive.org and it didn't work.

However, they provide an email address, I wrote to them, and now my content is blocked from searching in archive.org, they were very prompt about it too.

> I'm going to use robots.txt until Blekko changes its stance on this one.

Yep. That's the answer for now.

Not really, blekko doesn't check it very often.

For instance:

Crawled: 23h ago
Robots: http://www.webmasterworld.com/robots.txt (last fetched: 20d ago)

Crawled WebmasterWorld a day ago but hasn't checked robots.txt in 20 days. I saw other sites where robots.txt hadn't been checked in months, e.g. "(last fetched: 79d ago)", so changing robots.txt won't stop them anytime soon.

IMO, robots.txt should be checked every 24h, but that's a whole new thread.
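
As an illustration of that 24h suggestion, a crawler could compare the age of its stored robots.txt against a maximum before fetching any page. This is a sketch of the idea only; the 24-hour limit comes from the opinion above, and the names MAX_ROBOTS_AGE and robots_txt_is_stale are invented for this example.

```python
from datetime import datetime, timedelta

# Maximum age of a stored robots.txt before it must be fetched again.
# 24 hours is the figure suggested in the post, not a published standard.
MAX_ROBOTS_AGE = timedelta(hours=24)


def robots_txt_is_stale(last_fetched, now, max_age=MAX_ROBOTS_AGE):
    """Return True if the stored robots.txt is older than max_age."""
    return now - last_fetched > max_age
```

Under this rule, the "last fetched: 20d ago" case quoted above would trigger a fresh robots.txt fetch before the next page crawl, while a copy fetched 23 hours ago would still be accepted.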

frontpage

3:04 pm on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I did everything they documented to stop archive.org from crawling or showing my sites on archive.org and it didn't work.

I requested that archive.org delist our sites and they complied.

Blocked Site Error.

domain .com is not available in the Wayback Machine.

However, they still try to spider our sites after the fact.

So, again... I use the magic of Mod-Security to ban them and monitor their IP range.

SecRule REQUEST_HEADERS:User-Agent "ia_archiver" "deny,log,status:403"

An example from today:

Access denied with code 403 (phase 2). Pattern match "ia_archiver" at REQUEST_HEADERS:User-Agent. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "393"]

wheel

5:43 pm on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

People that write and distribute computer viruses don't get second chances in my book. How much damage did he cause with his first virus?

frontpage

7:40 pm on Jan 2, 2011 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The amazing thing is that as of 2007, Skrenta was still bragging about his exploit on his blog.

"The joy of the hack"
[skrenta.com...]

Here is another facet of the Blekko search engine that uses your Facebook data.

Blekko Makes Your Facebook Likes Searchable
[blog.searchenginewatch.com...]

skrenta

8:19 pm on Jan 2, 2011 (gmt 0)

10+ Year Member

Thanks everyone for their thoughts on this.

I dug into the issue on our side and learned that we had in fact decided to treat meta noarchive as equivalent to meta noindex, but that this had not been turned on in the code.

We've corrected this, so going forward blekko will treat any meta noarchive pages it encounters as meta noindex, and will not index them. This will take a little time before it is pushed to our production servers and makes it into our indices, so please be patient.

Thanks, and happy new year.
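
To keep the three positions discussed in this thread straight (honour noarchive, ignore it, or treat it as noindex), here is a small sketch of how each policy changes what happens to a page. It is an illustration of the policies as described in the posts, not Blekko's code; NoarchivePolicy and handle_page are hypothetical names.

```python
from enum import Enum


class NoarchivePolicy(Enum):
    HONOR = "honor"            # index the page, but show no cached copy
    IGNORE = "ignore"          # index the page and show a cached copy
    AS_NOINDEX = "as_noindex"  # drop the page from the index entirely


def handle_page(directives, policy):
    """Return (index_page, show_cached_copy) for a page's robots directives."""
    directives = {d.strip().lower() for d in directives}
    if "noindex" in directives:
        return (False, False)
    if "noarchive" not in directives:
        return (True, True)
    if policy is NoarchivePolicy.HONOR:
        return (True, False)
    if policy is NoarchivePolicy.AS_NOINDEX:
        return (False, False)
    return (True, True)  # NoarchivePolicy.IGNORE
```

For a page carrying only noarchive, HONOR gives (True, False), IGNORE gives (True, True), and AS_NOINDEX gives (False, False). Most of the webmasters in this thread are asking for the first of those.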

skrenta

8:22 pm on Jan 2, 2011 (gmt 0)

10+ Year Member

Also wanted to point out a correction:

I find it interesting that the noarchive tag isn't merely ignored, it's also deleted from the cached copy.

This is not correct. blekko does not alter the cached view in any way. It shows the exact bytes that we received at the time of crawl.

This 64 message thread spans 3 pages.