Forum Moderators: open
My queries are as follows:
1. What sort of table structures are used with what sort of DB (mySQL, oracle...)
2. How big are these db's. Clearly google will be massive.
3. Do they employ indexes and what happens when a db is updated regularly, Reindex? In the case of a regular crawler/updater does this affect performance.
4. Are there any studies on this sort of thing so I can get a good feel for this area.
5. Do they have controlled lists of keywords and relate them to particular URL.
6. Ranking algo, altogether a different beast which could take up 10 PHD's.
To be honest the questions could be endless, limited only by how much thought I give the subject.
Any pointers appreciated.
Regards
UKGimp
First, there are many different ways of storing information in a database. None of the the large search engines I am aware of use off-the-shelf databases. Even most of the midrange search engines and distributed programs don't either. They prefer to use optimized custom index file systems with multiple key indexes.
> 1. What sort of table structures
That is completely unique and independent. I've never seen two se's (that use sql like databases) that are alike.
> 2. How big are these db's.
Not as big as you would think. It's been my findings that the average size of text on the page is between 2 and 3k net wide. If you limit your upper size of acceptable html to say 64k or 100k max, then you can even drop below 2k per page. If you strip a healthy set of stop words out, the average size can drop down to about 1.25 - 1.75k.
Next, throw in some sort of compression via a dictionary/index file, and the size of a stored page on disk can fall below 500bytes per page. That is with every major word stored as a two byte pointer into the index file.
To build a key file, you just split the entire index on the words, then assign a number to each word. The number is then a pointer back into the index file telling you where to look up the real word. That file of numbers to words resolution is often called a key file. Any unrecognized word, or word outside the maximum 65,536 (two eight bit bytes), can trigger a flag to store the word on-the-fly in the index.
Using that system, I was able to build a 500k page se in 500meg (before compression).
There are other ways that you can get the index size down even smaller with multiple index files and key files by using a dictionary technique borrowed from the famous Lempel-Ziv compression algo.
If you consider that you could take the top 255 words found in the index and store them as merely one byte (character) - you can reduce the index file down by the order of 50%.
Taking it to the next level, you can use a sliding dictionary technique. Not only do you store a key index file, you set it up with multiple index files for different locations in the database. With that, you can achieve awe inspiring compression rates on the order of 10 to 1 (or HIGHER).
Using the above techniques, it is possible to build a 1 million page se, that is stored in a small 50-60 MEG database.
When Google was at around 1 billion pages, there was quote by Google tech, that the entire index was stored in 80 gig.
>Reindex?
Yes, reindexing is mandatory. The tough part is in rebuilding the index. That takes an awesome amount of time to recalculate that. That's just to build the index itself. Then there is the time taken to calculate rankings just as the PageRank calculation. It becomes clear, why Google only updates once a month. It takes the rest of the month to spider, then index, then the pr calc. If anything goes wrong in the loop, they are set back a good amount of time.
>Are there any studies
There are about 3 really good documents out there, and right now, I can find a url for any of the three. The best is one by Altavista circa 1997. (help)...
>Do they have controlled lists of keywords
Very individual to each se. Once you build up a set of search queries over time, those query words have a bottomless pit of data in them. Try labs.google.com and look at the google sets. Those were most likely built out of query data and show what you can do with word associations based on query data. Keyword data related to the page is very easy to build due to page density checks.
>Ranking algo, altogether a different beast which could take up 10 PHD's.
In some circles, yes it is complicated. In other circles, its rudimentary math for the lego crowd. The most intimidating factor is the shear scale that a index can grow too.
First, start by trying to define "relevance". It's all a matter of choices. I can find more than 50 prime variables that go into a "relevance" factor based on page entities (title, url, density, etc). Then comes the intangibles such as off-the-page entities (links, who's the link from, what's the link do)...etc. Stir all those factors together and it still comes down to a personal preference.
You can start with fundamentals such as a query for "yahoo.com". Should the first result return "yahoo.com", or should it return all pages but yahoo.com? I maintain, that under any domain name query, the se should never return pages from that site. After all, if the user wanted that domain, they could type it into the browser. However, in the real world, there is the courtesy factor and se's return it for the user as the number one result.
If there is that type of disagreement on a fundamental search, what happens down deeper into algos and queries? It still becomes an educated guess.
In the end, a successful search engine is going to be the one that serves the widest number of visitors what they are looking for. To that end, most of us feel that the next big battle ground for se algo's will be personalization and intelligent filtering of results.
If I do two searches for "ships" and "boats", and I see "rudder.com, sail.com, oar.com", then the next search would filter out those same pages seen before. That is intelligent filtering. It takes a great deal of hardware and software to do that. As we have seen, the downward trend ahead of the curve in hardware prices is going to continue.
Se's such as Google and Teoma have been successful partly because of the mass quantity of hardware they were able to purchase at reasonable prices. Google has data centers with an excess of 10k boxes. Excite spent more in 1997 on 100 work stations. That trend is continuing unabated.
“URL”
UrlId (PK)
URL
Date Created
Date of last crawl
Others, title desc
“KEYINDEX”
KeywordID (PK)
Keyword
Stopword (YES/NO)
“OCCURENCEWORD” one for <title>, <meta>, <h1>, <p>
UrlID (FK)
KeywordId (FK)
Count
Density
“RANKING”
RankID (PK)
HTMLType (eg H1 from “Occuranceword”
Score
“URLSUBMITTED”
URLSID
URLS
Now here come my first major dilemma. Should I use one huge table for each of the HTML elements from the “occurrenceword” table or produce a comma delimited array for each URLID.
The Process:
As of now I don’t really know how to build an agent or do the serious parsing out required, so I am on a process mission. Maybe steps missing or slightly out of order, bear with me.
Request page
Remove JS
Look for HREF=”xyz”
Query URL to see if already present, if yes, ignor
Compare to URLSubmitted to see if already in queue
Assign page title / desc / body/ meta / alt to a variable for later inserting.
For each variable remove HTML
Remove stopwords
Stripout characters (+,-,(,) ect) and duplicate words for title and H1
Submit into table.
So assuming all that goes ok and a user does a search for a term.
Submit word
Find ID for word and look for the highest density of that word in the occurenceword table.
Try to find highest density in <p> table, then look through H1, title, meta, alt and multiply by ranking factors, EG, if H1 X 1.1, if H1+meta X 1.2 and then rank accordingly.
Display results
My next step is finding financial backers, any takers
Cheers
Richard
I have also located these documents on the general topic which may be of ineterest. I have not read them all yet. :)
Authorative sources in a hyperlinked Environment
[cs.cornell.edu...]
Hilltop: A Search Engine based on Expert Documents
[cs.toronto.edu...]
The Anatomy of a Large-Scale Hypertextual Web Search Engine
[www-db.stanford.edu...]
When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics
[cs.toronto.edu...]
Improved Algorithms for Topic Distillation in a Hyperlinked Envronment.
ftp://ftp.digital.com/pub/DEC/SRC/publications/monika/sigir98.pdf
Automatic resource compilation by analyzing hyperlink structure and associated text
[decweb.ethz.ch...]
A Scalable Distributed Search Engine to Look for Images by Content on the Web
[cs.bris.ac.uk...]
<ADDED> A couple more, watch the wrap.
Efficient Crawling through URL Ordering
[citeseer.nj.nec.com...]
The PageRank Citation Ranking: Bringing Order to the web
[citeseer.nj.nec.com...]
</ADDED>
Excellent posts so far....
In regards to the "occurrence" bit, I think I'm along the same lines of you, that bit can be at least compressed a little.
I see there are no real computational parts to this? I mean.....the occurrence bit, you have numbers..maybe the values can be passed on to a small C script that processes some sort of algorithm wizardry ;) I'll not elaborate-its def not my area (yet)
the "scores" for each page could be put on an index with the URL ID.
A diagram with a key and notes would look much clearer :)
IMHO the normalization needs to be almost perfect for any sort of scalability. Then db size wont be a problem. You are obviously working hard on the fetching of pages so that leaves the ranking mechanism.
What exactly are you finding the biggest barrier preventing you from cracking this?
Thanks for your response.
the "scores" for each page could be put on an index with the URL ID.
I think I like this idea, it would mean that the process of ranking the page could be done away from the front end and would not require the accessing of the occuranceword and ranking tables. Anychanges in the algo values would require a complete update if the significance of the H1 was altered for example.
What exactly are you finding the biggest barrier preventing you from cracking this?
LOL, how long have you got? I am not a database designer by trade, not even close. My background is Minerals Engineering but I love this technology arena. I know some of the underlying priciples and if I know I am going in the right direction I will increase my study of the subject. Suppose I am looking for a nod to say "along the right lines, but you need to consider this and that" or "no that will never work in practice due to this"
Cheers
Richard
I still ponder this problem in my spare time and would hope that one day I can crack the muther. I do have a dilemma and hope that someone can suggest the best route.
For a page that has the following text after all the stop words have been trimmed out (forgetting stemming at this stage)
“House structure brick”
Where the key index for the 3 words is 12, 356, and 85 respectively.
Option 1 (above idea)
“URL”
UrlId (PK)
url
“KEYINDEX”
KeywordID (PK)
Keyword
and a linking table containing the URLId and KeywordID for every page that has been indexed.
Option 2
“URL”
UrlId (PK)
text
“KEYINDEX”
KeywordID (PK)
Keyword
and then insert all the keyword ID’s for the text into a varchar field so in this example would contains all the values eg. 12 356 85 and if the document contained the words “House structure brick house” it would look like 12 356 85 12.
Sorry to be a pain on this
Which method 1, 2 or ?
Cheers
Brett, months later your post makes much more sense to me.
There are other ways that you can get the index size down even smaller with multiple index files and key files by using a dictionary technique borrowed from the famous Lempel-Ziv compression algo.
After playing around with all of the above ideas for a SE/Directory, this is the bit that gets me.
Yeah, so after everything has been spidered, stopwords stripped, we have a "word-list" that contains terms that will be searched for.
So when someone performs a query, I assume that the words in the query will be converted into numbers where number=word.
If there is a match, 10 results may be passed on to the page......but would it not be "expensive" to convert every number in each SERP back to a word?
I was thinking about having a text title and description field for human consumption, with the lempel-ziv sleek-type encoding being used against title/description fields that converted into numbers for part of the algo.
Is this a sacrifice? Perhaps even why google need so many servers perhaps?
G "homes in" on a "ransom note" as they are called, which can be on any part of the page (including the stopwords) - which are displayed on screen. i.e. this is variable to the query entered and presents descriptions that clearly have stopwords and all text indexed somewhere
With my above "sacrifice" of using a plain text title and description to be used, that wouldn't be happening.
Perhaps they store a word-by-word cache of the <body> of a document with all tags stripped...that is only accessed when the page is going to be one of the SERP's?
In various machine language sets you'll often find a type of addressing known as indirect, or indexed indirect. You take a "pointer" (any byte) and load it into an address register, that in turn points to the real destination address.
With a table of word=numbers, you can just load up the number in an address register and it points to the real word in the table via indexed indirect addressing.
Other languages such as C have similar styles of pointers that can be used quite rapidly to extract data from dictionary tables such as word lists. It is pretty amazing how fast these machines can execute optimized code like that.