mhhfive, I thought the whole point was that you had access to a body of context that was already spidered? Why do you want to change the spidering?
If you want a spider, that's no problem. You'll find lots of spiders as free scripts all around the web. However, getting those to spider as much as Alexa has done already might be a challenge.
I'm thinking of banning Alexa's bot from my site. They're crawling hundreds of pages each day. I don't see the benefit to me in having them cache my content, possibly for some questionable uses. Does anyone see some benefit for me in what they're doing that I might be missing here?
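If you do decide to block them, a minimal robots.txt sketch would look like this (assuming Alexa's crawler still announces itself as ia_archiver, the user-agent it has historically shared with the Internet Archive; check your logs to confirm which agent string is actually hitting you):

```
User-agent: ia_archiver
Disallow: /
```

Note this relies on the bot honoring robots.txt; it isn't enforcement.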
|Does anyone see some benefit for me in what they're doing that I might be missing here? |
Perhaps they will help someone create a much more efficient search engine that will de-monopolise the SE industry and make redundant the need to create multi-thousand-post threads on this BBS every time Google makes an update. Isn't that worth some of the traffic that you, in all probability, pay a fixed sum for anyway?
Good point! I'll let them continue. I would like to wish all you inventors of better search engines well!
Well, sure, you can create your own spiders, but I thought there might be some advantage in having your spiders hosted on a "better" system -- Alexa's spiders may not be targeting the things you want to find, but they may be spread out and gathering stuff faster than you could if you were just starting out.
Say you wanted to create a blog search engine, for example. You could (1) just write something to pick out the "blog" stuff from Alexa's existing index or (2) you may want to have Alexa actively try to find more "blog" related stuff for you to combine with (1) so you get "fresh" results.
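To make option (1) concrete, here is a rough sketch of filtering "blog" pages out of an existing index. The record format, field names, and heuristics are all invented for illustration; Alexa's actual API isn't shown here:

```python
# Hypothetical sketch: picking "blog" pages out of a crawled index.
# The index format (url, html) pairs and the BLOG_HINTS heuristics
# are assumptions, not Alexa's real data model.

BLOG_HINTS = ("blog", "weblog", "/archives/", "wp-content")

def is_blog_page(url: str, html: str) -> bool:
    """Crude heuristic: the URL or the markup contains blog-ish signals."""
    text = (url + html).lower()
    return any(hint in text for hint in BLOG_HINTS)

# Toy stand-in for records pulled from the existing index.
index = [
    ("http://example.com/blog/post-1", "<html>...</html>"),
    ("http://example.com/products", "<html>...</html>"),
]

blog_pages = [url for url, html in index if is_blog_page(url, html)]
```

A real version would need much better signals (RSS autodiscovery links, known blog platforms), but the shape is the same: a cheap classifier run over pages someone else already fetched.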
Maybe I just don't understand how Alexa's spiders work.
What everyone seems to be missing is the "Internet Archive" aspect.
Alexa not only stores the current page, but all previous versions it has spidered...
"What everyone seems to be missing is the "Internet Archive" aspect."
Will the archived versions of pages be included in their new platform?
That would be interesting . . .
Let's use 24-month-old content . . .
I don't know. If it is, I see what seems like an interesting opportunity to create a portal for researchers.
One problem might be that the Archive's data seems to be quite spotty... but maybe they have a better version than they offer up to the general public.
What if other search engines decided to purchase pages from Alexa? Would that spell doom for cloaking? Google or whoever could compare results from their own bot with what's on file at Alexa, and if there's a discrepancy, then cloaking could be exposed.
Is this farfetched or a possibility?
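The comparison itself is simple in principle. A toy sketch of the discrepancy check, assuming you already hold the two copies as strings (fetching and tolerating legitimately dynamic content is the hard part, and is not handled here):

```python
import hashlib
import re

def normalize(html: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting
    differences don't trigger a false cloaking flag."""
    return re.sub(r"\s+", " ", html).strip().lower()

def looks_cloaked(bot_copy: str, archive_copy: str) -> bool:
    """Flag a discrepancy between the page a crawler saw and an
    archived copy by comparing hashes of the normalized markup.
    Real detection would have to allow for ads, timestamps, and
    other dynamic content; this is only the basic idea."""
    h1 = hashlib.sha256(normalize(bot_copy).encode()).hexdigest()
    h2 = hashlib.sha256(normalize(archive_copy).encode()).hexdigest()
    return h1 != h2
```

In practice a naive hash comparison would flag nearly every dynamic site, so any real system would compare extracted text or structure instead.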
According to the docs, Alexa has three copies of the index at any one time:
1. The index currently being created
2. The most recently completed index
3. The index before that
Since each index is a snapshot over a 2 month window, the farthest back you will be able to look is 6 months.
(BTW, Alexa claims each index to be 100 terabytes of data for a total of 300 TB available at a time.)
What I'm not sure of is: when a site gets blocked by robots.txt, will the API still let you access older versions of the site from before the block? On the Internet Archive, I believe a new robots.txt block removes all previous copies from the index.
I wonder if Alexa will one day open up the Toolbar data?
Alexa can see all their customers' search code, and maybe they'll find in it the bright idea that indexes better than Google.