
Alternative Search Engines Forum

Blekko Does Not Honor NOARCHIVE?
topr8




msg:4227709
 11:40 am on Nov 7, 2010 (gmt 0)

Many sites go to great lengths to prevent scrapers from stealing their content.
These same sites also generally prevent the major search engines from caching their pages by using the robots noarchive tag:
<meta name="robots" content="noarchive">
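For non-HTML files such as PDFs, the same directive can also be sent as an HTTP response header - a minimal Apache sketch, assuming mod_headers is enabled (the .pdf pattern is just an example):

# send the noarchive directive as a response header for PDFs
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noarchive"
</FilesMatch>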

Notice how WebmasterWorld doesn't have a 'cached' link in the SERPs; this is noarchive in use, and all the major search engines support it.

The reason is that a search engine cache is a well-known backdoor for scrapers, who can lift your content from the cache instead of taking it directly from your site.

However, Blekko, the new search engine, has decided that it will not respect the noarchive tag.

I approached Blekko to ask them about this, and Robert Saliba of Blekko Inc said:
"we think that the meta noarchive tag is counter to providing our users with transparent information
regarding the ranking and display of search results."

Luckily, for web admins who do use the noarchive tag, he had a solution, as he also said this...
"We also want to respect the wishes of website administrators. Accordingly,
we are making changes so that In the future, we will treat the meta
noarchive tag as a meta noindex tag."

 

tangor




msg:4227713
 11:45 am on Nov 7, 2010 (gmt 0)

Like that cutting-off-your-nose-to-spite-your-face attitude.

Then again most SEs these daze (sic) are not worthy of attention.

B, G and Y are top of the charts; all the rest are... all the rest...

Staffa




msg:4227761
 4:01 pm on Nov 7, 2010 (gmt 0)

we will treat the meta
noarchive tag as a meta noindex tag


Here we go again, their diaper isn't dry yet and already it's my way or no way.
A sure fire way to get blocked from the start. They need our content, not the other way around.

londrum




msg:4227807
 6:36 pm on Nov 7, 2010 (gmt 0)

blocked them already

topr8




msg:4228237
 11:16 pm on Nov 8, 2010 (gmt 0)

Well, I had allowed them for quite a while, so I was very disappointed about the failure to support noarchive, as I imagine many others will be too!

Oh well, I'm going to have to block them too now! Mind you, I'm not sure I'd want people analysing my site using their data anyway.

tedster




msg:4248031
 3:07 am on Jan 1, 2011 (gmt 0)

Rich Skrenta, CEO of Blekko, has just clarified their policy for noarchive in a discussion on Twitter:

sorry for the confusion. we debated having noarchive mean noindex on blekko, but did not go that way in the end

[twitter.com...]

topr8




msg:4248047
 8:15 am on Jan 1, 2011 (gmt 0)

... but will noarchive be treated the same way it is by all other significant search engines, or will it be ignored?

incrediBILL




msg:4248116
 7:45 pm on Jan 1, 2011 (gmt 0)

Blekko just doesn't get it:

we ignore noarchive and do nothing with it

[twitter.com...]


Webmasters should be able to opt into Blekko without allowing Blekko to display cached copyrighted pages.

Every serious search engine supports NOARCHIVE, and there are many valid reasons to not permit cached pages to be in a search engine.

They claim NOARCHIVE is used by spammers, when in reality they're going to allow SCRAPERS to grab the content of anyone who permits Blekko to cache their pages.

We can block scrapers from our sites but we can't block them from taking cached pages elsewhere without controls like NOARCHIVE.

Besides, displaying the entire page without permission, i.e. ignoring NOARCHIVE, is a blatant copyright violation; snippets are fair use, whole pages are not. I wonder how they'd respond to a DMCA notice?

tedster




msg:4248137
 9:54 pm on Jan 1, 2011 (gmt 0)

I'm with you, incrediBILL. At least they dropped the ridiculous blackmail of noarchive=noindex, but ignoring a widely accepted standard is still not the way to go at all. That standard evolved for very solid reasons.

That Twitter discussion just keeps sliding all over the place. It's way too slippery for my taste.

netmeg




msg:4248138
 9:58 pm on Jan 1, 2011 (gmt 0)

I NOARCHIVE everything, for myself and for clients, and I'm not the only one by a long shot. I think they will have to change their policy on this, or dissolve as all the others have.

frontpage




msg:4248139
 9:59 pm on Jan 1, 2011 (gmt 0)

Just a reminder from my post in the other Blekko thread.


If you use ModSecurity 2.x, here is a rule to serve that ScoutJet user agent a 403 Forbidden page.


# return 403 to any request whose User-Agent header contains "ScoutJet" (ModSecurity 2.x variable syntax)
SecRule REQUEST_HEADERS:User-Agent "ScoutJet" "deny,log,status:403"

According to Blekko, ScoutJet crawls from the following IP ranges:

64.13.159.*
38.99.96.*, 38.99.97.*, 38.99.98.*, 38.99.99.*
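Since a user-agent string is easy to spoof, you could also hard-block those published ranges at the Apache level - a rough sketch using mod_authz_host (Apache 2.2 syntax), not a tested drop-in:

# deny requests from Blekko's published ScoutJet crawl ranges
Order Allow,Deny
Allow from all
Deny from 64.13.159.0/24
Deny from 38.99.96.0/22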

frontpage




msg:4248140
 10:01 pm on Jan 1, 2011 (gmt 0)

At least they dropped the ridiculous blackmail of noarchive=noindex


I have had noarchive in place for years, yet Blekko still managed to cache/index 94 pages on one domain before I caught what they were doing.

So it looks like they did not even honor the noarchive=noindex to begin with.

jmccormac




msg:4248146
 10:17 pm on Jan 1, 2011 (gmt 0)

The most obvious question: if they are not to become yet another Cuil, how do they intend to monetise their operation?

Regards...jmcc

wheel




msg:4248174
 12:40 am on Jan 2, 2011 (gmt 0)

Thanks for raising this. I just issued the following on my webserver to block their crawler:
iptables -A INPUT -s 64.13.159.0/24 -j DROP
iptables -A INPUT -s 38.99.96.0/24 -j DROP
iptables -A INPUT -s 38.99.97.0/24 -j DROP
iptables -A INPUT -s 38.99.98.0/24 -j DROP
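Note that Blekko's published list above also includes 38.99.99.*; if you want to cover all four contiguous /24s in a single rule, something like this should do it (untested sketch):

iptables -A INPUT -s 38.99.96.0/22 -j DROP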

Smell ya later!

Angonasec




msg:4248182
 1:14 am on Jan 2, 2011 (gmt 0)

They are blocked in our robots.txt and, so far, appear to honour that at least.
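For anyone doing the same, a minimal robots.txt block - assuming ScoutJet is the user-agent token Blekko's crawler goes by, as mentioned elsewhere in this thread:

User-agent: ScoutJet
Disallow: /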

But I had a very unpleasant personal encounter with Mr. Skrenta, now CEO of Blekko, back when he was running Dmoz.

No thanks!

frontpage




msg:4248184
 1:38 am on Jan 2, 2011 (gmt 0)

Wow, "DMOZ". I have not heard about it in a long time.

I am surprised it still exists. It is the fiefdom of the worst in human-edited directories.

Mods who have a financial interest in the topics they administer prevent legitimate websites from being listed, or delete listings, and poor webmasters have no recourse to get included.

frontpage




msg:4248185
 1:41 am on Jan 2, 2011 (gmt 0)

When I read this, I learned all I needed to know about this company.

Blekko was the name of company CEO Rich Skrenta's first networked computer. Skrenta was 15 years old when he wrote the Elk Cloner virus that infected Apple II machines in 1982; it is believed to have been the first large-scale self-spreading personal computer virus ever created. Skrenta went on to work on the Amiga at Commodore, then at Sun Microsystems, then co-founded the Netscape-acquired Dmoz and the Tribune/Gannett/Knight Ridder-acquired local news search engine Topix.

Brett_Tabke




msg:4248187
 2:04 am on Jan 2, 2011 (gmt 0)

Yes, Blekko honors robots.txt. You do not have to allow yourself to be indexed.

Rich Skrenta in another thread said: [webmasterworld.com...]


ScoutJet is me, it is a good robot. It has a 45-second min delay between fetches per-ipaddr. Of course you are free not to let it in, it obeys robots.txt of course.

incrediBILL, totally agree on the poor value from niche search engines. Not our intent. Full scale real web search is so much more interesting.


> noarchive

Was never endorsed or proposed by any standards body. New engines are not obligated to honor another search engine's proprietary commands.

Before you run over a cliff with wild BS, you really should check out Blekko; it has some awesome features. You plugged your nose when Google came around with all its own issues (like 'caching') - give Blekko the same chance.


There will be updates and changes to their engine as they progress. No engine is going to be without mistakes or oversights when they first get going.


> Skrenta

Rich [webmasterworld.com] is a very old friend of this community and long time user of WebmasterWorld.

incrediBILL




msg:4248204
 3:05 am on Jan 2, 2011 (gmt 0)

> noarchive

Was never endorsed or proposed by any standards body. New engines are not obligated to honor another search engine's proprietary commands.


That argument doesn't work, because many de facto standards come about without ever being proposed by any standards body; they happen through majority adoption and often only later end up in formal standards.

Besides, x-no-archive used in a header is an actual RFC standard that was adopted by the search engines from usenet and mutated into the NOARCHIVE meta directive: [en.wikipedia.org...]
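For reference, the Usenet version was simply a one-line header on the post, along the lines of:

X-No-Archive: Yes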

Before you run over a cliff with wild BS, you really should check out Blekko; it has some awesome features. You plugged your nose when Google came around with all its own issues (like 'caching') - give Blekko the same chance.


We were/are giving it a chance.

The BS started when the CEO came right out and said they wouldn't support NOARCHIVE.

You can either let them post full cached pages, which, as opposed to fair-use snippets, is a violation of copyright, or you can completely opt out of Blekko.

If they force us to opt-out just to protect our content, how is that giving them a chance?

Supporting one simple NOARCHIVE directive, which is all but universally adopted, solves this problem.

Google, Yahoo, Bing, Ask and even Gigablast support it: [gigablast.com...]

Even open source NUTCH supports it: https://issues.apache.org/jira/browse/NUTCH-167
Though not strictly a bug, this issue is potentially serious for users of Nutch who deploy live systems and who might be threatened with legal action for caching copies of copyrighted material. The major search engines all observe this directive (even though apparently it's not standard) so there's every reason why Nutch should too.


Such universal adoption pretty much spells standard IMO.

frontpage




msg:4248212
 4:46 am on Jan 2, 2011 (gmt 0)

This article by Skrenta pretty much put the nails in DMOZ's coffin back in 2006.

Skrenta: "Similarly I think the ODP is suffering from its closed, stultifying culture."


[skrenta.com...]

Yet whether a new search engine that relies on human editing to produce 'slashtags' is going to be any more successful than the ODP in the long run, given their similar input style, remains to be seen.

true_INFP




msg:4248243
 9:57 am on Jan 2, 2011 (gmt 0)

What is Blekko?

What worries me more are major sites like the Wayback Machine archive.org ignoring the noarchive tag...

By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about. Individually, none of those search engines shows as a significant traffic source in your stats, but together they are an important and strong source. You may know the Russian Yandex and Chinese Baidu, but there are dozens of smaller countries, each with their own popular search engine. That is why we do not white-list, but black-list bots in robots.txt.

zdgn




msg:4248256
 11:17 am on Jan 2, 2011 (gmt 0)

Skrenta was 15 years old when he wrote the Elk Cloner virus that infected Apple II machines in 1982; it is believed to have been the first large-scale self-spreading personal computer virus ever created.


Gotta love the information age, when things like this are practically boasted about and reported as a medal of honour, eh?

It's like a new barber in your neighbourhood known for saying he'd pricked customers with virus-infected needles at his dad's shop in his teenage years. :-)

[edited by: zdgn at 11:20 am (utc) on Jan 2, 2011]

Rosalind




msg:4248257
 11:18 am on Jan 2, 2011 (gmt 0)

I find it interesting that the noarchive tag isn't merely ignored, it's also deleted from the cached copy.

I'm going to use robots.txt until Blekko changes its stance on this one.

Brett_Tabke




msg:4248275
 12:52 pm on Jan 2, 2011 (gmt 0)

> I'm going to use robots.txt until Blekko changes its stance on this one.

Yep. That's the answer for now.

I will ask Rich to comment.

> What worries me more are major sites like the Wayback Machine
> archive.org ignoring the noarchive tag...

and archive.org offers you NO way of removing back content. The system they have in place simply doesn't work. The only way to remove it is via a lawyer.

incrediBILL




msg:4248286
 1:48 pm on Jan 2, 2011 (gmt 0)

By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about.


Not really, because with either blacklisting or whitelisting you still have to keep an eye on what new things are making requests on your server. The difference is that with whitelisting you get to decide how your data is exported, but with blacklisting it's too late: the data is out the gate, and trying to stop it from being used after the fact is a real problem. By monitoring what's asking for robots.txt you can find things that would actually honor robots.txt, and add anything useful to your whitelist.

Robots.txt is really toothless. What most people don't do is also build a whitelisted .htaccess file, which is extremely important, as it's a hard block that stops anything not allowed from getting past robots.txt (a rough sketch of the idea is further down this post).

Trust me, monitoring new robots.txt requests is a lot less work than monitoring for bad activity that needs to be blocked.

It's working smart vs. doing it the hard way: blacklisting is an infinite time sink, while whitelisting is finite.

Besides, if you're currently getting referral traffic from these search engines you already know which ones to add to your white list.

If you aren't getting any referral traffic, they're just a drain on your resources.

I'd like to whitelist Blekko, I'm just a NOARCHIVE away from doing it! :)
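As a rough illustration of the .htaccess side of a whitelist, here is a crude mod_rewrite sketch (the bot names are purely illustrative) that 403s anything identifying itself as a bot, crawler or spider unless it's on the allow list:

RewriteEngine On
# treat anything that calls itself a bot, crawler or spider as a robot...
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
# ...and only let whitelisted crawlers through
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
RewriteRule .* - [F,L]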

and archive.org offers you NO way of removing back content.


I did everything they documented to stop archive.org from crawling or showing my sites, and it didn't work.

However, they provide an email address; I wrote to them, and now my content is blocked from search on archive.org. They were very prompt about it, too.

> I'm going to use robots.txt until Blekko changes its stance on this one.

Yep. That's the answer for now.


Not really; Blekko doesn't check it very often.

For instance:
Crawled: 23h ago
Robots: http://www.webmasterworld.com/robots.txt (last fetched: 20d ago)


They crawled WebmasterWorld a day ago but hadn't checked robots.txt in 20 days, and I've seen other sites where robots.txt hadn't been checked in months ("last fetched: 79d ago"), so changing robots.txt won't stop them anytime soon.

IMO, robots.txt should be checked every 24h, but that's a whole new thread.

frontpage




msg:4248301
 3:04 pm on Jan 2, 2011 (gmt 0)

I did everything they documented to stop archive.org from crawling or showing my sites, and it didn't work.


I requested that archive.org delist our sites and they complied.

Blocked Site Error.

domain .com is not available in the Wayback Machine.



However, they still try to spider our sites after the fact.

So, again... I use the magic of ModSecurity to ban them and monitor their IP range.

# return 403 to any request whose User-Agent header contains "ia_archiver" (ModSecurity 2.x variable syntax)
SecRule REQUEST_HEADERS:User-Agent "ia_archiver" "deny,log,status:403"

An example from today:

Access denied with code 403 (phase 2). Pattern match "ia_archiver" at REQUEST_HEADERS:User-Agent. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "393"]

wheel




msg:4248323
 5:43 pm on Jan 2, 2011 (gmt 0)

People that write and distribute computer viruses don't get second chances in my book. How much damage did he cause with his first virus?

frontpage




msg:4248344
 7:40 pm on Jan 2, 2011 (gmt 0)

The amazing thing is that as of 2007, Skrenta was still bragging about his exploit on his blog.

"The joy of the hack"
[skrenta.com...]

Here is another facet of the Blekko search engine that uses your Facebook data.

Blekko Makes Your Facebook Likes Searchable
[blog.searchenginewatch.com...]

skrenta




msg:4248354
 8:19 pm on Jan 2, 2011 (gmt 0)

Thanks everyone for their thoughts on this.

I dug into the issue on our side and learned that we had in fact decided to treat meta noarchive as equivalent to meta noindex, but that this had not been turned on in the code.

We've corrected this, so going forward blekko will treat any meta noarchive pages it encounters as meta noindex, and will not index them. This will take a little time before it is pushed to our production servers and makes it into our indices, so please be patient.

Thanks, and happy new year.

skrenta




msg:4248355
 8:22 pm on Jan 2, 2011 (gmt 0)

Also wanted to point out a correction:

I find it interesting that the noarchive tag isn't merely ignored, it's also deleted from the cached copy.


This is not correct. blekko does not alter the cached view in any way. It shows the exact bytes that we received at the time of crawl.
