Forum Moderators: bakedjake
sorry for the confusion. we debated having noarchive mean noindex on blekko, but did not go that way in the end
[twitter.com...]
we ignore noarchive and do nothing with it
[twitter.com...]
If you use ModSecurity 2.x, here is a rule to serve that ScoutJet user agent a 403 Forbidden page.
SecRule HTTP_User-Agent "ScoutJet" "deny,log,status:403"
64.13.159.*, 38.99.96.*, 38.99.97.*, 38.99.98.*, 38.99.99.*
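If you would rather check addresses in your own log scripts than in ModSecurity, here is a rough Python sketch that tests an IP against the wildcard ranges listed above. The thread does not spell out which crawler those ranges belong to, so treat the list as an assumption; the function name `ip_is_listed` is made up for illustration.

```python
from fnmatch import fnmatchcase

# Wildcard patterns copied from the post above. Which crawler they
# belong to is an assumption; verify before blocking anything.
LISTED_PATTERNS = [
    "64.13.159.*",
    "38.99.96.*",
    "38.99.97.*",
    "38.99.98.*",
    "38.99.99.*",
]


def ip_is_listed(ip, patterns=LISTED_PATTERNS):
    """Return True if the dotted-quad string matches any wildcard pattern."""
    return any(fnmatchcase(ip, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ("38.99.97.5", "38.99.100.5", "64.13.159.200"):
        print(candidate, ip_is_listed(candidate))
```

This is plain string matching, so it assumes the input is already a well-formed IPv4 address.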
Blekko was the name of company CEO Rich Skrenta's first networked computer. Skrenta was 15 years old when he wrote the Elk Cloner virus that infected Apple II machines in 1982; it is believed to have been the first large-scale self-spreading personal computer virus ever created. Skrenta went on to work on the Amiga at Commodore, then at Sun Microsystems, then co-founded the Netscape-acquired Dmoz and the Tribune/Gannett/Knight Ridder-acquired local news search engine Topix.
ScoutJet is me; it is a good robot. It has a 45-second minimum delay between fetches per IP address. You are free not to let it in, of course; it obeys robots.txt.
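For anyone wondering what a per-IP minimum delay looks like in practice, here is a minimal Python sketch. It is not Blekko's code; only the 45-second figure comes from the post above, and the class and method names are hypothetical.

```python
import time

MIN_DELAY_SECONDS = 45.0  # the figure quoted in the post above


class PerHostThrottle:
    """Tracks the last fetch time per IP address (illustrative only)."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock        # injectable so the logic can be tested
        self.last_fetch = {}      # ip -> clock reading at the last fetch

    def seconds_until_allowed(self, ip):
        """How long a polite crawler should still wait before hitting ip."""
        last = self.last_fetch.get(ip)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, ip):
        """Remember that ip was fetched just now."""
        self.last_fetch[ip] = self.clock()
```

A crawler loop would call `seconds_until_allowed(ip)` before each request, sleep for that long, fetch, and then call `record_fetch(ip)`.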
incrediBILL, totally agree on the poor value from niche search engines. Not our intent. Full scale real web search is so much more interesting.
> noarchive
Was never endorsed or proposed by any standards body. New engines are not obligated to honor another search engine's proprietary commands.
Before you run over a cliff with wild bs, you really should check out Blekko; it has some awesome features. You plugged your nose when Google came around with all its own issues (like 'caching') - give Blekko the same chance.
Though not strictly a bug, this issue is potentially serious for Nutch users who deploy live systems and might be threatened with legal action for caching copies of copyrighted material. The major search engines all observe this directive (even though apparently it's not standard), so there's every reason why Nutch should too.
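To make the directive concrete, here is a hedged Python sketch of how a crawler could check a page for a `noarchive` robots meta tag before keeping a cached copy. It is not how Nutch or any search engine actually implements it; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def page_allows_caching(html):
    """Return False if the page carries a robots meta 'noarchive' directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

A crawler that honors the directive would still index the page but skip storing or serving the cached copy when this returns False.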
Skrenta: "Similarly I think the ODP is suffering from its closed, stultifying culture."
[edited by: zdgn at 11:20 am (utc) on Jan 2, 2011]
By the way, we considered white-listing bots in robots.txt (thus banning all unknown robots). However, we concluded that we would ban many important search engines in countries we know nothing about.
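A quick way to see that trade-off is to run a whitelist-style robots.txt through Python's standard `urllib.robotparser`. This is only a sketch: the whitelist below names a single bot as an example, and the helper name `can_crawl` is made up.

```python
from urllib.robotparser import RobotFileParser

# Example whitelist: one named bot is allowed, everything else is banned.
# The choice of Googlebot here is purely illustrative.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url="http://example.com/page.html"):
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "ScoutJet", "SomeRegionalEngineBot"):
        print(bot, can_crawl(WHITELIST_ROBOTS_TXT, bot))
```

Any well-behaved bot you did not think to list, including a legitimate regional search engine, lands in the `User-agent: *` group and is shut out.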
and archive.org offers you NO way of removing back content.
> I'm going to use robots.txt until Blekko changes its stance on this one.
Yep. That's the answer for now.
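If you go the robots.txt route, it is worth checking that the file does what you expect before deploying it. Below is a small Python sketch using the standard `urllib.robotparser`. It assumes the crawler's robots.txt token is `ScoutJet`, the same string the ModSecurity rule earlier in the thread matches on; the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the crawler answers to the robots.txt token "ScoutJet".
BLOCK_SCOUTJET = """\
User-agent: ScoutJet
Disallow: /
"""


def allowed(robots_txt, user_agent, url="http://example.com/"):
    """Return True if user_agent may fetch url under robots_txt.

    Pass the short robots.txt token (e.g. "ScoutJet"), not the full
    User-Agent header string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("ScoutJet:", allowed(BLOCK_SCOUTJET, "ScoutJet"))
    print("Googlebot:", allowed(BLOCK_SCOUTJET, "Googlebot"))
```

Unlike a whitelist, this only shuts out the one named bot and leaves everyone else alone.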
Crawled: 23h ago
Robots: http://www.webmasterworld.com/robots.txt (last fetched: 20d ago)
I did everything they documented to stop archive.org from crawling or showing my sites on archive.org and it didn't work.
Blocked Site Error.
domain .com is not available in the Wayback Machine.
SecRule HTTP_User-Agent "ia_archiver" "deny,log,status:403"
Access denied with code 403 (phase 2). Pattern match "ia_archiver" at REQUEST_HEADERS:User-Agent. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "393"]