In a move with potentially far-reaching implications for the search market, Alexa Internet is opening up its huge web crawler to any programmer who wants paid access to its rich trove of internet data.
From Alexa Web Search Platform [websearch.alexa.com]
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools.
The pricing scheme is confusing, but it looks like it would be fairly cheap for what it's offering.
Anyone in here know what this is really going to accomplish?
$1 per 50 GB processed
I assume they mean 50 GB of raw uncompressed data. It follows that it's $1 per roughly 2.5 mln web pages (about 20 KB per page), or $2,000 per 5 bln pages processed (which appears to be about two months' worth of their crawling).
Getting your data out will cost more: $1 per GB is not cheap if you are building a full-text index. At 2 KB per page (10 times less than the raw size), 5 bln pages come to 10 TB, or $10,000. It will cost the same if you keep the data on their system for a year.
All in all it is not THAT expensive - spammers will certainly like the idea of paying just $2K to process 5 bln pages for email addresses...
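For anyone who wants to check the arithmetic above, here is a rough sketch in Python. The 20 KB raw page size and 2 KB index size are the assumptions from the post, not figures published by Alexa, and the function names are made up for illustration. Decimal units (1 GB = 1,000,000 KB) are assumed throughout.

```python
# Back-of-the-envelope check of the pricing figures quoted in the post.
# All page sizes are the poster's assumptions, not Alexa's published numbers.

KB_PER_GB = 1_000_000            # decimal units assumed
GB_PROCESSED_PER_DOLLAR = 50     # "$1 per 50 GB processed"
TRANSFER_USD_PER_GB = 1          # "$1 per GB" to get data out
RAW_KB_PER_PAGE = 20             # assumed: 50 GB / 2.5 mln pages
INDEX_KB_PER_PAGE = 2            # assumed: 10 times less than raw size


def pages_per_dollar():
    """How many raw pages $1 of processing covers."""
    return GB_PROCESSED_PER_DOLLAR * KB_PER_GB / RAW_KB_PER_PAGE


def processing_cost(pages):
    """Dollars to process the raw data for the given number of pages."""
    raw_gb = pages * RAW_KB_PER_PAGE / KB_PER_GB
    return raw_gb / GB_PROCESSED_PER_DOLLAR


def transfer_cost(pages):
    """Dollars to pull a 2 KB-per-page index off their system."""
    index_gb = pages * INDEX_KB_PER_PAGE / KB_PER_GB
    return index_gb * TRANSFER_USD_PER_GB


if __name__ == "__main__":
    crawl = 5_000_000_000  # 5 bln pages
    print("pages per $1:", pages_per_dollar())
    print("processing $:", processing_cost(crawl))
    print("transfer   $:", transfer_cost(crawl))
```

Changing the two page-size constants is the quickest way to see how sensitive the totals are to those guesses.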
Does Alexa consider itself a competitor in the search market (MSN, Yahoo, Google)? I always thought they were just using Google's data for search, and their own data from the Alexa bar to "rank" sites.
Now that I talk it out, I really don't get Alexa. What are they?
I have the same question as internets ... why are they "POWERED BY GOOGLE" if they do their own crawls and can process the data?
One reason I can think of off the top of my head is simply that they know they cannot compete with Google's or the other search engines' results at this point. But they can make some $$ on the raw data and bank it for possible future projects, be it a search engine or whatever they decide to spend it on at the time.
Well, frankly because any kid with some hard drives, some bandwidth, and a free script can crawl the web. Ranking results is what Google does really well. At least it's pretty dang hard to do it just as well as them.
It's two different things, that's all.
---
And, unlike John Battelle, I'm pretty sure you can find an aged post by me somewhere that mentions this exact thing. I'm probably not even the first to mention it, as I recall having the discussion of an open source crawler with other members here - most likely more than a year ago. But never mind, as I haven't got a blog.
This shows how important "vertical search" will become, where users will go to their "engineering search engine" vs. their "summer European travel search engine"
That was my first idea of what to do with it. Problem is, I just can't see a vertical engine adding much value to what is already available via the horizontal engines. To work, a vertical engine would have to leverage its knowledge of its particular area of expertise to produce results that are better than Google's. Other than a few niches I can't see how that would work well.
I can however see SEOs jumping on this for performing competitive analysis. Want to know an accurate backlink count for sites linking to another site? Well, that would be fairly painless and cheap with this tool.
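To make the backlink idea concrete, here is a minimal sketch in Python. It assumes you have already extracted (source URL, target URL) link pairs from the crawl data; that input format and the function names are hypothetical, not anything Alexa's platform is documented to provide.

```python
# Sketch: count external backlinks to one host from a list of link pairs.
# The (source_url, target_url) input format is an assumption for illustration.
from urllib.parse import urlsplit


def host(url):
    """Lower-cased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def backlink_counts(links, target_host):
    """Return (linking_pages, linking_hosts) for target_host.

    Links from the target host to itself are ignored, since they
    are internal links rather than backlinks.
    """
    target_host = target_host.lower()
    pages = set()
    hosts = set()
    for source_url, target_url in links:
        if host(target_url) != target_host:
            continue
        source_host = host(source_url)
        if source_host == target_host:
            continue
        pages.add(source_url)
        hosts.add(source_host)
    return len(pages), len(hosts)


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", "http://target.example/"),
        ("http://a.example/2", "http://target.example/x"),
        ("http://b.example/", "http://target.example/"),
        ("http://target.example/about", "http://target.example/"),
        ("http://a.example/1", "http://other.example/"),
    ]
    print(backlink_counts(sample, "target.example"))
```

Counting distinct linking hosts as well as distinct pages matters for competitive analysis, since a thousand links from one site tell you much less than a handful from different sites.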
One reason I can think of off the top of my head is simply that they know they cannot compete with Google's or the other search engines' results at this point.
Dang.
We need a new Internet. Hey guys, want to get together and start Internet 3.0? New protocols. Get rid of old baggage.
*No more "www".
*No more spoofed email addresses.
*You can use any .end domain you want, not just .com and the official few.
*The .end part of the domain will be built into the protocol in such a way that you can programmatically detect the difference between a subdomain and the "end part".
*So much more.
OK, dream sequence over.
Still, it's pretty neat to have access to a relatively large search index that's already created. People can test their spiders on Alexa's index... but then you're sorta trapped in Alexa's way of doing things -- which is the point, I think.