Welcome to WebmasterWorld Guest from 184.108.40.206
You will find much more useful links when searching at Google for keywords such as: rank +weight +sqrt +log AND ("TF/IDF" OR IDF).
The very basic concept of any expandable search engine is as follows:
1. Crawl and retrieve documents, strip html/xml markup while remembering text contained between special entities such as href (also for PageRank calculations), alt, title, bold, etc. You may also store word positions in the document.
2. Convert each document's text into a vector space model: filter out stop words (a long, carefully crafted list is preferable), analyze term frequency (TF) and inverse document frequency (IDF), maybe also document similarity using a cosine square algorithm. Finally, use log() to assign a weight value between 0 and 1 to each term and document. TF-IDF is one of the most successful and well-tested techniques in Information Retrieval.
3. Assign an unique ID for each document and store the ID as key and the value (containing URL + other useful document info) in a hashed binary tree database, such as Berkeley DBM (SQL databases are too slow for this context). Do the same for each term in another database where each unique keyword is stored as a key together with its value that holds info such as the IDs of documents incorporating that term, total term frequency and weight, etc.
4. Build a search interface. Split a user query into keywords, lookup each keyword in the term database and retrieve the associated document IDs (which could be pre-sorted by weight). Lookup the IDs in the document database depending on user preference:
a) ANDed search - lookup only the intersection of keyword IDs (duplicate IDs appearing for each user keyword)
b) ORed search: lookup all IDs
c) Phrase search: aa) use the ANDed technique then do a live phrase search in the retrieved documents, ab) if a word position matrix is available for each document, calculate the term offsets to check for a match.
5. Serve the results, which may also be temporarily cached, sorted by term/document weight to the user.
This really is just a very rough outline. The crawler alone could give you months to develop if it's for a large-scale operation.