In a move with potentially far-reaching implications for the search market, Alexa Internet is opening up its huge web crawler to any programmer who wants paid access to its rich trove of internet data.
From Alexa Web Search Platform [websearch.alexa.com]
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools.
The pricing scheme is confusing, but it looks like it would be fairly cheap for what it's offering.
Anyone in here know what this is really going to accomplish?
If you want a spider, that's no problem. You'll find lots of spiders as free scripts all around the web. However, getting those to spider as much as Alexa has done already might be a challenge.
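To give an idea of how small a basic spider can be, here is a rough sketch in Python using only the standard library. It is not any particular free script and has nothing to do with how Alexa's crawler works. The names `extract_links` and `crawl` are made up for illustration, and `fetch` is a function you supply yourself (for example one that downloads a page, or a stub for testing).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Writing this part is easy. The hard part, as noted above, is running it politely across billions of pages and storing what comes back.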
Does anyone see some benefit for me in what they're doing that I might be missing here?
Perhaps they will help someone create a much more efficient search engine that will de-monopolise the SE industry and make it unnecessary to create multi-thousand-post threads on this BBS every time Google makes an update. Is that not worth some of the traffic that you in all probability pay a fixed sum for anyway?
Well, sure, you can create your own spiders, but I thought there might be some advantage in having your own spiders hosted on a "better" system. Alexa's spiders may not be targeting the things you want to find, but they may be spread out and gathering stuff faster than you could if you were just starting out.
Say you wanted to create a blog search engine, for example. You could (1) just write something to pick out the "blog" stuff from Alexa's existing index or (2) you may want to have Alexa actively try to find more "blog" related stuff for you to combine with (1) so you get "fresh" results.
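Roughly what I have in mind for (1) and (2), as a Python sketch. This does not use Alexa's actual API, which I haven't seen. The document shape (dicts with `url` and `text`), the hint list, and the function names are all assumptions made up for illustration.

```python
# Hypothetical hints that a page is blog-like; a real classifier would need more.
BLOG_HINTS = ("blog", "weblog", "permalink", "/archives/")


def looks_like_blog(doc):
    """Crude check: does the URL or the text contain any blog-ish hint?"""
    haystack = (doc["url"] + " " + doc["text"]).lower()
    return any(hint in haystack for hint in BLOG_HINTS)


def blog_subset(docs):
    """Option (1): pick the blog pages out of an existing index."""
    return [doc for doc in docs if looks_like_blog(doc)]


def merge_fresh(existing, fresh):
    """Option (2): combine with newer crawl results; fresh copies win per URL."""
    by_url = {doc["url"]: doc for doc in existing}
    for doc in fresh:
        by_url[doc["url"]] = doc
    return list(by_url.values())
```

Whether the platform actually lets you direct the crawl for (2) is exactly the part I'm unsure about.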
Maybe I just don't understand how Alexa's spiders work.
As I understand it, three indexes are available at any one time:
1. The index currently being created
2. The most recently completed index
3. The index before that
Since each index is a snapshot over a 2 month window, the farthest back you will be able to look is 6 months.
(BTW, Alexa claims each index to be 100 terabytes of data for a total of 300 TB available at a time.)
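The arithmetic behind those two figures, spelled out in Python. The constants are just the numbers quoted above (three indexes, a 2 month window each, 100 TB each as claimed by Alexa). The variable names are mine.

```python
INDEXES_AVAILABLE = 3   # current, most recently completed, and the one before
MONTHS_PER_INDEX = 2    # each index is a snapshot over a 2 month window
TB_PER_INDEX = 100      # size per index, as claimed by Alexa

max_lookback_months = INDEXES_AVAILABLE * MONTHS_PER_INDEX
total_tb = INDEXES_AVAILABLE * TB_PER_INDEX

print(max_lookback_months, "months of history,", total_tb, "TB in total")
```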
What I'm not sure of is whether, when a site gets blocked by robots.txt, the API will still let you access older versions of the site from before the block. On the Internet Archive, I believe a new robots.txt block removes all previous copies from the index.
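If you want to check whether a given robots.txt currently blocks a crawler, Python's standard `urllib.robotparser` can do it. This is only a sketch: `is_blocked` is a made-up helper, and the `ia_archiver` user-agent in the sample file is my assumption about the name Alexa's crawler goes by.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Sample robots.txt; the user-agent name is an assumption for illustration.
SAMPLE_ROBOTS = """User-agent: ia_archiver
Disallow: /
"""

print(is_blocked(SAMPLE_ROBOTS, "ia_archiver", "http://example.com/page"))
```

This only tells you what the file says today. It does not answer whether older snapshots taken before the block stay reachable through the API.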
I wonder if Alexa will one day open up the Toolbar data?