"Our index contains every word found on more than 350 million unique Web pages"
Hmmm... their index contains words, not pages. Maybe term vectors?
"Text relevance searches every Web page for exactly the words you enter. Many factors enter into text relevance, such as how important the words are on the page, how many times the words appear, where on the page they appear, and how many other pages contain those words."
I smell themes. It's surprising that they give away this much information!
This certainly is interesting. I guess it's a good thing that my pages are doing better there (if this is the way they are moving for the future).
Maybe they test out their advanced technology at Raging Search to see how it works, or to see how users like it based on feedback. Then, possibly, they start implementing some of those things on regular AV. Maybe it's a beta version of the future Alta?
Any other speculations?
A term vector database is all about words. That's what they index, and that's how they store your pages:
"Although we use the usual TF-IDF weighting to select terms for vectors, we do not store these weights in vectors. Instead, we store just the term frequency, that is, the number of times the term appears in the page"
"In addition to the term counts themselves, this raw data includes the lengths of pages, both in bytes and in terms."
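To make those two quotes a bit more concrete, here is a rough Python sketch of what "raw term counts plus page lengths" could look like. The function name `build_term_vector`, the whitespace tokenizer, and the dictionary layout are all made up for illustration; the quoted material does not say how the data is actually laid out.

```python
from collections import Counter


def build_term_vector(page_text):
    """Return raw term counts plus page lengths (hypothetical layout)."""
    # Naive whitespace tokenizer: an assumption, not the real tokenizer.
    terms = page_text.lower().split()
    return {
        # Raw term frequency only; no TF-IDF weights are stored.
        "term_counts": Counter(terms),
        # Page length in bytes and in terms, as the second quote describes.
        "length_bytes": len(page_text.encode("utf-8")),
        "length_terms": len(terms),
    }


vector = build_term_vector("raging search is a search engine")
print(vector["term_counts"]["search"])  # 2
print(vector["length_terms"])           # 6
```

Keeping only the counts means any weighting can be recomputed later, at query time, from the counts and the page lengths.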
I came across another thing that's quite puzzling.
Look at the Inverse Document Frequency part of the TF*IDF equation.
log (Number of documents/Number of documents containing keyword)
Assume the keyword is on every page (you would think this was a good thing). When you do the division you get 1. Take the log of 1 and you get zero. Multiply any Term Frequency by 0 and you come up with zero. Maybe they have some kind of catch for when this happens?
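Here is that arithmetic as a small Python sketch. The base-10 log, the document counts, and the function names are assumptions for illustration. The `smoothed_idf` variant is one common way people avoid the zero; nothing in the quoted material says whether AltaVista does anything like it.

```python
import math


def idf(num_docs, docs_with_term):
    """Plain IDF as written above: log(N / df)."""
    return math.log10(num_docs / docs_with_term)


def smoothed_idf(num_docs, docs_with_term):
    """One possible 'catch' (an assumption): add 1 so the weight never hits zero."""
    return math.log10(num_docs / docs_with_term) + 1.0


num_docs = 1000        # made-up collection size
term_frequency = 7     # made-up count of the keyword on one page

# Keyword on every page: 1000 / 1000 = 1, log(1) = 0, so the weight is 0.
print(term_frequency * idf(num_docs, 1000))           # 0.0

# Keyword on only 10 pages: the weight is well above zero.
print(term_frequency * idf(num_docs, 10))

# With the smoothed variant the every-page keyword keeps some weight.
print(term_frequency * smoothed_idf(num_docs, 1000))  # 7.0
```

Note that log is a function applied to the ratio, so log(1) is the number 0 no matter which base is used.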
I think in that case you wouldn't even include the 1, because 1 x log would = log. I think you would need to just multiply the term frequency with log. (I'll email a friend about this to make sure I'm right)
btw - isn't it log2 not log?
Edited by: seth_wilde
It is possible that log * 1 = log. That would make sense. It's been a few years since I had any algebra. I just open up my Windows calculator and press "1" then "log". That's all I know.
"Tell me about it, I think after about the 5th or 6th read I achieved a moment of clarity and now I have it pretty much figured out."
I guess I have a lot more reading to do. I keep my dictionary right next to my desk here (BTW, what...exactly...is a vector?).
"isn't it log2 not log?"
From what I can see it is log
"BTW, what...exactly...is a vector?"
vector = A one-dimensional array
""isn't it log2 not log?" From what I can see it is log"
Are you using the formula from the link that james gave in the other thread? If so, I found one derived from Salton (who they mention in the original article) and he uses log2 [instruct.uwo.ca...]
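For what it's worth, the choice between log2 and log10 only rescales every IDF value by the same constant, so the ordering of terms comes out the same either way. A quick Python check, using made-up document counts and term names:

```python
import math

num_docs = 1000
# Hypothetical document frequencies, for illustration only.
doc_freqs = {"altavista": 10, "search": 400, "the": 1000}

idf_log10 = {t: math.log10(num_docs / df) for t, df in doc_freqs.items()}
idf_log2 = {t: math.log2(num_docs / df) for t, df in doc_freqs.items()}

# Rank terms from most to least distinctive under each base.
rank10 = sorted(doc_freqs, key=idf_log10.get, reverse=True)
rank2 = sorted(doc_freqs, key=idf_log2.get, reverse=True)
print(rank10 == rank2)  # True

# Every log2 value is the log10 value times the same constant, log2(10).
print(idf_log2["altavista"] / idf_log10["altavista"])
```

So the two formulas disagree on the raw numbers but not on which keyword is weighted more heavily.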
Edited by: seth_wilde
I will put it all together eventually, after I find time to read those documents a few more times each. There just aren't enough hours in a day!
Here is the dictionary definition (Funk & Wagnalls '76) -
Vector: n. A physical quantity that has magnitude and direction in space, as velocity and acceleration.
I guess we are shooting for the most quantity and the greatest magnitude!
Is no one using Raging? It feels like Alta with its head chopped off to me...