
Forum Moderators: bakedjake


Alexa opens up crawler to the public

For a fee....

3:05 pm on Dec 13, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:683
votes: 0


From Wired [wired.com]:

In a move with potentially far-reaching implications for the search market, Alexa Internet is opening up its huge web crawler to any programmer who wants paid access to its rich trove of internet data.

From Alexa Web Search Platform [websearch.alexa.com]

The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools.

The pricing scheme is confusing, but it looks like it would be fairly cheap for what it's offering.

Anyone in here know what this is really going to accomplish?

10:38 pm on Dec 15, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 15, 2003
posts:2395
votes: 0


mhhfive, I thought the whole point was that you had access to a body of content that was already spidered? Why do you want to change the spidering?

If you want a spider, that's no problem. You'll find lots of spiders as free scripts all around the web. However, getting those to spider as much as Alexa has done already might be a challenge.

3:40 pm on Dec 16, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 19, 2004
posts:394
votes: 0


I'm thinking of banning Alexa's bot from my site. They're crawling hundreds of pages each day. I don't see the benefit to me in having them cache my content, possibly for some questionable uses. Does anyone see some benefit for me in what they're doing that I might be missing here?
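For anyone considering the same ban, Alexa's crawler has historically identified itself as "ia_archiver" (the same agent used for the Internet Archive); assuming that user-agent string, the robots.txt block would look like:

```
User-agent: ia_archiver
Disallow: /
```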
4:06 pm on Dec 16, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 8, 2004
posts:1679
votes: 0


Does anyone see some benefit for me in what they're doing that I might be missing here?

Perhaps they will help someone create a much more efficient search engine that de-monopolises the SE industry and makes it unnecessary to start multi-thousand-post threads on this BBS every time Google makes an update. Isn't that worth some of the traffic, which you in all probability pay a fixed sum for anyway?

4:16 pm on Dec 16, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 19, 2004
posts:394
votes: 0


Good point! I'll let them continue. I would like to wish all you inventors of better search engines well!
4:35 pm on Dec 16, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 9, 2004
posts:169
votes: 0


claus,

Well, sure, you can create your own spiders, but I thought there might be some advantage to having your spiders hosted on a "better" system -- Alexa's spiders may not be targeting the things you want to find, but they may be spread out and gathering material faster than you could if you were just starting out.

Say you wanted to create a blog search engine, for example. You could (1) just write something to pick out the "blog" stuff from Alexa's existing index or (2) you may want to have Alexa actively try to find more "blog" related stuff for you to combine with (1) so you get "fresh" results.
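The filter-the-existing-index idea in (1) could be sketched roughly like this, assuming you can iterate over the crawl as (url, html) pairs — the data shape and the "blog" heuristics here are illustrative guesses, not Alexa's actual schema:

```python
# Sketch: pick out "blog" pages from an existing crawl.
# The crawl is assumed to be iterable as (url, html) pairs;
# the heuristics below are illustrative, not Alexa's real format.

def looks_like_blog(url: str, html: str) -> bool:
    """Cheap heuristics for blog-like pages."""
    url_hints = ("blog", "wordpress", "typepad", "livejournal")
    html_hints = ("permalink", "trackback", "<rss", "comments")
    lowered_url, lowered_html = url.lower(), html.lower()
    return (any(h in lowered_url for h in url_hints)
            or any(h in lowered_html for h in html_hints))

def filter_blogs(crawl):
    """Yield only the blog-like documents from a crawl."""
    for url, html in crawl:
        if looks_like_blog(url, html):
            yield url, html

sample = [
    ("http://example.com/blog/post-1", "<html>hello</html>"),
    ("http://example.com/products", "<html>buy now</html>"),
]
print([url for url, _ in filter_blogs(sample)])
```

Option (2) — having Alexa actively re-crawl for fresher blog content — would sit on top of something like this, feeding the same filter with newer snapshots.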

Maybe I just don't understand how Alexa's spiders work.

4:57 pm on Dec 18, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 28, 2005
posts:552
votes: 0


What everyone seems to be missing is the "Internet Archive" aspect.

Alexa not only stores the current page, but all previous versions it has spidered...

3:01 pm on Dec 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 3, 2003
posts:1092
votes: 0


"What everyone seems to be missing is the "Internet Archive" aspect."

Will the archived versions of pages be included in their new platform?

That would be interesting . . .
Let's use 24-month-old content . . .

9:07 am on Dec 21, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 28, 2005
posts:552
votes: 0


I don't know. If it is, I see what seems like an interesting opportunity to create a portal for researchers.

One problem might be that the Archive's data seems to be quite spotty... but maybe they have a better version than they offer up to the general public.

5:18 pm on Dec 22, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 19, 2004
posts:394
votes: 0


What if other search engines decided to purchase pages from Alexa? Would that spell doom for cloaking? Google or whoever could compare results from their own bot with what's on file at Alexa, and if there's a discrepancy, then cloaking could be exposed.

Is this farfetched or a possibility?
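The comparison itself would be the easy part; the hard part is normalizing away legitimate differences (timestamps, rotating ads) so they don't look like cloaking. A minimal sketch, assuming both engines can supply the HTML they saw for the same URL:

```python
import hashlib
import re

def normalize(html: str) -> str:
    """Strip volatile bits (whitespace runs, digits) before comparing,
    so timestamps and counters don't trigger false positives."""
    html = re.sub(r"\s+", " ", html)
    html = re.sub(r"\d+", "#", html)
    return html.strip().lower()

def fingerprint(html: str) -> str:
    return hashlib.sha256(normalize(html).encode()).hexdigest()

def possibly_cloaked(google_copy: str, alexa_copy: str) -> bool:
    """Flag a URL when the two crawlers saw materially different pages."""
    return fingerprint(google_copy) != fingerprint(alexa_copy)

# Same page, different timestamp: not flagged.
print(possibly_cloaked("<p>Hello at 10:32</p>", "<p>Hello at 11:05</p>"))
# Genuinely different content served to different crawlers: flagged.
print(possibly_cloaked("<p>Hello</p>", "<p>Buy cheap widgets</p>"))
```

In practice the two crawls would also differ in timing, so a real detector would need crawl-date-aware thresholds, not just a hash comparison.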

6:16 am on Dec 27, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 30, 2004
posts:146
votes: 0


According to the docs, Alexa has three copies of the index at any one time:

1. The index currently being created
2. The most recently completed index
3. The index before that

Since each index is a snapshot over a 2 month window, the farthest back you will be able to look is 6 months.

(BTW, Alexa claims each index to be 100 terabytes of data for a total of 300 TB available at a time.)
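The three-copies arithmetic above can be sketched as a simple rotation; the window and size figures come from the post, while the oldest-dropped-first rotation logic is my assumption about how "most recent" and "the one before" would work:

```python
from collections import deque

INDEX_WINDOW_MONTHS = 2   # each snapshot covers ~2 months of crawling
COPIES_KEPT = 3           # in-progress + two completed
TB_PER_INDEX = 100        # claimed size of each index

# Keep at most the three most recent snapshots; oldest is dropped first.
snapshots = deque(maxlen=COPIES_KEPT)
for label in ["Jan-Feb", "Mar-Apr", "May-Jun", "Jul-Aug"]:
    snapshots.append(label)

print(list(snapshots))  # the Jan-Feb snapshot has rotated out
print(f"max lookback: {COPIES_KEPT * INDEX_WINDOW_MONTHS} months")
print(f"total size: {COPIES_KEPT * TB_PER_INDEX} TB")
```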

What I'm not sure of is this: when a site gets blocked by robots.txt, will the API still let you access older versions of the site from before the block? On the Internet Archive, I believe a new robots.txt block removes all previous copies from the index.
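The difference between the two possible policies — retroactive (Internet Archive style) versus prospective (only hide captures made after the block) — can be modeled like so; the data shapes here are hypothetical:

```python
from datetime import date

# Hypothetical archive: (capture_date, url) snapshots of one site.
captures = [
    (date(2004, 3, 1), "http://example.com/page"),
    (date(2005, 6, 1), "http://example.com/page"),
]

def visible(captures, blocked_since=None, retroactive=True):
    """Retroactive policy: a current robots.txt block hides every
    past capture. Prospective policy: only captures made after the
    block appeared are hidden."""
    if blocked_since is None:
        return list(captures)
    if retroactive:
        return []
    return [c for c in captures if c[0] < blocked_since]

block = date(2005, 1, 1)  # say the block appeared Jan 2005
print(len(visible(captures, blocked_since=block, retroactive=True)))   # 0
print(len(visible(captures, blocked_since=block, retroactive=False)))  # 1
```

Which policy the Alexa platform follows is exactly the open question in the post above.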

Next Question:

I wonder if Alexa will one day open up the Toolbar data?

6:44 pm on Jan 8, 2006 (gmt 0)

New User

10+ Year Member

joined:Oct 8, 2003
posts:2
votes: 0


Alexa can see all their customers' search code, and maybe they'll find in it the bright idea that indexes better than Google does.