*Feel free to chime in*
What I imagine is that a simple check is run on pages submitted to the engine. This would target machine submission tools and help weed out the mass-submission-on-an-hourly-basis type of situation. Based on state information such as cookies and IP tracking, it would let them shuttle a mass of automated requests into the round file (/dev/null).
Then, if your page gets past this stage, your request would be put into a spidering queue, probably split among spiders in some way. In Alta's case I'm sure that it just goes out and gets the page at that moment and saves it for indexing.
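To make the idea concrete, here is a minimal sketch of such a check in Python. Everything in it is an assumption for illustration: the class name `SubmissionFilter`, the five-per-hour limit, and the choice to key on an (IP, cookie) pair are guesses, not anything an engine has published.

```python
from collections import defaultdict, deque


class SubmissionFilter:
    """Hypothetical pre-check that drops clients submitting too often."""

    def __init__(self, max_per_window=5, window_seconds=3600):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        # (ip, cookie) -> timestamps of recently accepted submissions
        self._history = defaultdict(deque)

    def accept(self, ip, cookie, now):
        """Return True if the submission may go on to the spidering queue."""
        stamps = self._history[(ip, cookie)]
        # Forget submissions that have fallen out of the time window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_per_window:
            return False  # straight into the round file
        stamps.append(now)
        return True


checker = SubmissionFilter(max_per_window=2)
results = [checker.accept("10.0.0.1", "abc", t) for t in (0, 10, 20)]
print(results)
```

A real engine would presumably look at far more than a simple count, but a sliding window like this is enough to separate a person submitting one URL from a tool hammering the form every hour.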
Here is where the next algo comes in. The indexing algo hacks apart your page based on its content and analyses its links. Links are put into two lists: future spidering and Internet 'cartography'. The Cartographer also reports back any links that your site has to it. I would imagine that links are in a separate DB from page content. Page content is hacked apart and weighed for ranking assessment. In the case of AV I imagine the spidering of links on your site to be pretty quick. This is the only way they could build a theme if you submit only the root URL. (Aside: What if you no followed all your pages with AV? Maybe this is a hot tip, anyone want to try:) )
So now your site is mapped, sliced and weighed, themed and stored. What is next?
The Search Algo: I guess there is another algo that looks for the best matches in the index when a search is done. First it must look at the search to figure out which words are most important, then it has to look at the phrase (if the search is more than one word). It would dip into the indexes based on this interpretation of the search phrase and look for top matches. Now there is a chance here to figure out what is going to weigh the highest in a broader sense. For instance, are links going to be more important than page makeup? What about its position in the 'map' of the net?
The search algo would probably pull the matches and display one page of listings. It would then cache the rest of the listings based on how many pages you could directly navigate to (the up to 10 pages on the bottom, like the ol' goooooogle-> thing).
In summary, I believe that the DBs are pretty separate, that weighting can be modified on any given DB, and that the search itself is subject to weighting.
PS: File this under the know thy enemy category
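A rough Python sketch of that split, using only the standard library. The names `PageSplitter` and `index_page` are made up for illustration, and the rule that only same-host links go onto the spidering list (while every link goes onto the map) is an assumption, not known engine behaviour.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageSplitter(HTMLParser):
    """Hypothetical indexer step: pull links and words out of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []          # absolute URLs found in <a href="...">
        self.words = Counter()   # word -> number of occurrences

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def index_page(base_url, html):
    """Return (to_spider, link_map, words) for one fetched page."""
    parser = PageSplitter(base_url)
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    # Assumption: links on the same host are queued for quick spidering.
    to_spider = [u for u in parser.links if urlparse(u).netloc == host]
    # Every link becomes an edge (from_url, to_url) for the 'cartography' DB.
    link_map = [(base_url, u) for u in parser.links]
    return to_spider, link_map, parser.words


page = '<p>Blue widgets</p><a href="/about.html">About widgets</a>' \
       '<a href="http://other.example/x">Other</a>'
to_spider, link_map, words = index_page("http://site.example/", page)
print(to_spider, len(link_map), words["widgets"])
```

Keeping the link edges apart from the word counts mirrors the guess that links live in a separate DB from page content, so each could be weighted on its own.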
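Here is one way that kind of weighting could look in Python. This is a sketch under stated assumptions: the function names, the rarity-based word weight (an IDF-style formula swapped in for "figure out which words are most important"), and the `content_weight` / `link_weight` knobs are all hypothetical, and phrase matching is left out.

```python
import math


def term_weight(term, doc_freq, total_docs):
    """Rarer words count for more; common words count for less."""
    return math.log((1 + total_docs) / (1 + doc_freq.get(term, 0)))


def score_page(query_terms, page, doc_freq, total_docs,
               content_weight=0.7, link_weight=0.3):
    """Blend an on-page score with a link score using adjustable weights."""
    content = sum(
        term_weight(t, doc_freq, total_docs) * page["words"].get(t, 0)
        for t in query_terms
    )
    links = math.log1p(page["inbound_links"])
    return content_weight * content + link_weight * links


def rank(query_terms, pages, doc_freq, total_docs, **weights):
    """Return page names ordered from best match to worst."""
    return sorted(
        pages,
        key=lambda name: score_page(query_terms, pages[name],
                                    doc_freq, total_docs, **weights),
        reverse=True,
    )


pages = {
    "on_topic": {"words": {"blue": 2, "widgets": 3}, "inbound_links": 1},
    "popular": {"words": {"widgets": 1}, "inbound_links": 200},
}
doc_freq = {"blue": 50, "widgets": 5}
query = ["blue", "widgets"]

print(rank(query, pages, doc_freq, 100))
print(rank(query, pages, doc_freq, 100, content_weight=0.1, link_weight=0.9))
```

With the default weights the page with more matching text comes first; turn the link weight up and the heavily linked page overtakes it, which is the "links versus page makeup" question in miniature.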
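The paging guess can be sketched in a few lines of Python. The function name `build_result_cache` and the figures of 10 listings per page and 10 navigable pages are assumptions taken from the description, not measured behaviour.

```python
def build_result_cache(ranked_urls, per_page=10, max_pages=10):
    """Cut a ranked list into pages, keeping only the navigable ones."""
    capped = ranked_urls[: per_page * max_pages]
    return [capped[i:i + per_page] for i in range(0, len(capped), per_page)]


ranked = ["http://site.example/page%d.html" % n for n in range(250)]
cached_pages = build_result_cache(ranked)
first_page = cached_pages[0]   # shown right away
print(len(cached_pages), len(first_page))
```

Anything ranked beyond the last navigable page is simply never cached here, so a later request would have to run the search again.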
joined:June 27, 2000
>(Aside: What if you no followed all your pages with AV? Maybe this is a hot tip, anyone want to try:) )
I followed you up to the Aside. I don't understand what you mean by the "What if"...
Everything before that sounded very plausible, though, and I would like to understand the rest of your theory.
I assume that any substantial SE must have a parallel scheme just to deal with the sheer volume of information on the web. But Alta's is particularly sharp.
Alta also offers a specialized media search: images, audio and video -- and I believe they do some kind of dedicated spidering for each of these categories. For instance, you can filter an image-search for "color or b&w", and "buttons, photos, or graphics". I'd think they would have a pretty sophisticated image-reading spider to automate that database.
I've never read anyone's ideas about the "ins and outs" of AV media search -- but I'd be all ears, especially since it looks like I'll be taking on a music client soon with all kinds of mixed media on the site.