Alexa opens up crawler to the public

Forum Moderators: bakedjake

For a fee....

grelmar

3:05 pm on Dec 13, 2005 (gmt 0)

From Wired [wired.com]:

In a move with potentially far-reaching implications for the search market, Alexa Internet is opening up its huge web crawler to any programmer who wants paid access to its rich trove of internet data.

From Alexa Web Search Platform [websearch.alexa.com]

The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools.

The pricing scheme is confusing, but it looks like it would be fairly cheap for what it's offering.

Anyone in here know what this is really going to accomplish?

claus

10:38 pm on Dec 15, 2005 (gmt 0)

mhhfive, I thought the whole point was that you had access to a body of content that was already spidered? Why do you want to change the spidering?

If you want a spider, that's no problem. You'll find lots of spiders as free scripts all around the web. However, getting those to spider as much as Alexa has done already might be a challenge.

surfin2u

3:40 pm on Dec 16, 2005 (gmt 0)

I'm thinking of banning Alexa's bot from my site. They're crawling hundreds of pages each day. I don't see the benefit to me in having them cache my content, possibly for some questionable uses. Does anyone see some benefit for me in what they're doing that I might be missing here?
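For anyone weighing the same decision, here is a minimal sketch of what a robots.txt ban would look like and how to check it before deploying. It assumes Alexa's crawler identifies itself as ia_archiver; the example.com URL and the is_allowed helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: block Alexa's crawler (user agent assumed to be
# "ia_archiver") while leaving every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # hypothetical URL
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked")
```

This only tells you what a well-behaved bot would do with the file; it does not stop a crawler that ignores robots.txt.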

Lord Majestic

4:06 pm on Dec 16, 2005 (gmt 0)

Does anyone see some benefit for me in what they're doing that I might be missing here?

Perhaps they will help someone create a much more efficient search engine that will de-monopolise the SE industry and make redundant the need to create multi-thousand-post threads on this BBS every time Google makes an update. Is this not worth some of the traffic that you, in all probability, pay a fixed sum for anyway?

surfin2u

4:16 pm on Dec 16, 2005 (gmt 0)

Good point! I'll let them continue. I would like to wish all you inventors of better search engines well!

mhhfive

4:35 pm on Dec 16, 2005 (gmt 0)

claus,

well, sure you can create your own spiders, but I thought there might be some advantage in having your own spiders hosted on a "better" system -- Alexa's spiders may not be targeting things that you might want to find, but they may be spread out and gathering stuff faster than you could if you were just starting out.

Say you wanted to create a blog search engine, for example. You could (1) just write something to pick out the "blog" stuff from Alexa's existing index or (2) you may want to have Alexa actively try to find more "blog" related stuff for you to combine with (1) so you get "fresh" results.
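Option (1) could be sketched roughly like this. It says nothing about Alexa's actual tools or API; it just filters a plain list of URLs with a crude keyword heuristic. The BLOG_HINTS list, the function names, and the sample URLs are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Hypothetical hints that a URL belongs to a blog; a real filter would
# need far better signals (feeds, page structure, and so on).
BLOG_HINTS = ("blog", "journal")


def looks_like_blog(url: str) -> bool:
    """Crude check: does the host or path contain a blog-like keyword?"""
    parts = urlparse(url.lower())
    haystack = parts.netloc + parts.path
    return any(hint in haystack for hint in BLOG_HINTS)


def pick_blog_urls(urls):
    """Keep only the URLs that look like blog pages, preserving order."""
    return [u for u in urls if looks_like_blog(u)]


if __name__ == "__main__":
    sample = [
        "http://example.com/blog/post-1",
        "http://example.com/products",
        "http://myblog.example.org/",
    ]
    print(pick_blog_urls(sample))
```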

Maybe I just don't understand how Alexa's spiders work.

RonS

4:57 pm on Dec 18, 2005 (gmt 0)

What everyone seems to be missing is the "Internet Archive" aspect.

Alexa not only stores the current page, but all previous versions it has spidered...

howiejs

3:01 pm on Dec 19, 2005 (gmt 0)

"What everyone seems to be missing is the "Internet Archive" aspect."

Will the archived versions of pages be included in their new platform?

That would be interesting . . .
Let's use 24 month old content . . .

RonS

9:07 am on Dec 21, 2005 (gmt 0)

I don't know. If it is, I see what seems like an interesting opportunity to create a portal for researchers.

One problem might be that the Archive's data seems to be quite spotty... but maybe they have a better version than they offer up to the general public.

surfin2u

5:18 pm on Dec 22, 2005 (gmt 0)

What if other search engines decided to purchase pages from Alexa? Would that spell doom for cloaking? Google or whoever could compare results from their own bot with what's on file at Alexa, and if there's a discrepancy, then cloaking could be exposed.

Is this farfetched or a possibility?
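To make the idea concrete, here is a rough sketch of the comparison step, assuming you already hold two copies of the same page (one from your own bot, one from another crawl). Nothing here reflects how Google or Alexa actually work; the 0.8 threshold and the function names are arbitrary assumptions.

```python
import difflib
import re


def normalize(html: str) -> str:
    """Strip tags, lowercase, and collapse whitespace so markup noise is ignored."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def similarity(copy_a: str, copy_b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 for the visible text of two copies."""
    matcher = difflib.SequenceMatcher(None, normalize(copy_a), normalize(copy_b))
    return matcher.ratio()


def looks_cloaked(copy_a: str, copy_b: str, threshold: float = 0.8) -> bool:
    """Flag a page when the two copies differ more than the assumed threshold."""
    return similarity(copy_a, copy_b) < threshold


if __name__ == "__main__":
    own_bot = "<p>Cheap widgets for sale</p>"
    other_crawl = "<p>keyword keyword keyword casino pills</p>"
    print(looks_cloaked(own_bot, other_crawl))
```

A real check would also have to allow for legitimate differences such as rotating ads, dates, or geo-targeted content, so a low score is a reason to look closer rather than proof of cloaking.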

instinct

6:16 am on Dec 27, 2005 (gmt 0)

According to the docs, Alexa has three copies of the index at any one time:

1. The index currently being created
2. The most recently completed index
3. The index before that

Since each index is a snapshot over a 2 month window, the farthest back you will be able to look is 6 months.

(BTW, Alexa claims each index to be 100 terabytes of data for a total of 300 TB available at a time.)
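The arithmetic behind those two figures, spelled out as a tiny sketch using only the numbers quoted above (the variable names are mine). Note that 6 months is an upper bound, since the index currently being created is still incomplete.

```python
# Figures as stated above: three index copies, each covering a
# 2 month window and claimed to hold about 100 TB.
INDEXES_KEPT = 3
MONTHS_PER_INDEX = 2
TB_PER_INDEX = 100

max_lookback_months = INDEXES_KEPT * MONTHS_PER_INDEX
total_tb = INDEXES_KEPT * TB_PER_INDEX

print(f"Farthest lookback: up to {max_lookback_months} months")
print(f"Data available at a time: about {total_tb} TB")
```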

What I'm not sure of is whether, when a site gets blocked by robots.txt, the API will still let you access older versions of the site from before the block. On the Internet Archive, I believe a new robots.txt block removes all previous copies from the index.

Next Question:

I wonder if Alexa will one day open up the Toolbar data?

alaska2

6:44 pm on Jan 8, 2006 (gmt 0)

Alexa can see all their customers' search code, and maybe find in it the bright idea that indexes better than Google.
