Search Engine Development in C/C++, Perl, CGI

Forum Moderators: coopster & phranque

shaijujose

5:09 pm on Jun 24, 2003 (gmt 0)

Hi,

Are there any tutorials, articles, or white papers on search engine algorithm development in C/C++, Perl, CGI, ASP, or XML?

Help greatly appreciated.

Regards,
Shaiju Jose

stevegpan2

5:44 pm on Jun 24, 2003 (gmt 0)

I'd like to know if there are any, too. Thanks.

sugarkane

3:15 pm on Jun 26, 2003 (gmt 0)

Shaiju, I've *still* not found any decent tutorials on this despite asking a similar question [webmasterworld.com] 18 months ago... If anyone knows of one, please share!

kmarcus

5:57 am on Jun 28, 2003 (gmt 0)

I have documentation on the site in my profile. In particular, "my site"/tech-overview.php may be useful as well as "my site"/seo-stuff.php

sugarkane

10:00 am on Jun 28, 2003 (gmt 0)

Thanks kmarcus, that's interesting stuff.

Birdman

10:21 am on Jun 28, 2003 (gmt 0)

Yes, very nice write-up. Thanks for sharing it. I like the photos :)

SwissGuy

12:10 pm on Jun 29, 2003 (gmt 0)

Here are some resources which provided me with useful input to implement a local search engine as well as a nifty meta-search engine including a world-class topic distillation and result clustering back-end (can I post links here at all?):

[clis2.umd.edu...]
[www10.org...]
[cs.kun.nl...]
[javelina.cet.middlebury.edu...]
[fieldmethods.net...]
[cs.mu.oz.au...]

You will find many more useful links by searching Google for keywords such as: rank +weight +sqrt +log AND ("TF/IDF" OR IDF).

The very basic concept of any expandable search engine is as follows:

1. Crawl and retrieve documents, then strip the HTML/XML markup while remembering the text found in special tags and attributes such as href (also useful for PageRank calculations), alt, title, bold, etc. You may also store word positions in the document.

2. Convert each document's text into a vector space model: filter out stop words (a long, carefully crafted list is preferable), analyze term frequency (TF) and inverse document frequency (IDF), and maybe also document similarity using a cosine similarity measure. Finally, use log() to assign a weight value between 0 and 1 to each term and document. TF-IDF is one of the most successful and well-tested techniques in Information Retrieval.

3. Assign a unique ID to each document and store the ID as the key and the value (containing the URL + other useful document info) in a hash or B-tree key/value database, such as Berkeley DB (SQL databases are too slow for this context). Do the same for each term in another database, where each unique keyword is stored as a key together with a value that holds info such as the IDs of the documents containing that term, total term frequency and weight, etc.

4. Build a search interface. Split a user query into keywords, look up each keyword in the term database and retrieve the associated document IDs (which could be pre-sorted by weight). Look up the IDs in the document database depending on user preference:
a) ANDed search: look up only the intersection of keyword IDs (the IDs that appear for every user keyword)
b) ORed search: look up all IDs
c) Phrase search: aa) use the ANDed technique, then do a live phrase search in the retrieved documents; ab) if a word position matrix is available for each document, calculate the term offsets to check for a match.

5. Serve the results to the user, sorted by term/document weight. The results may also be cached temporarily.
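
To make step 2 a bit more concrete, here is a minimal Python sketch of TF-IDF weighting with cosine similarity. It is only one common variant, not necessarily what any particular engine uses: the tiny STOP_WORDS list, the function names, and the exact formula, (1 + log tf) * log(N / df) followed by length normalization, are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: dampened term frequency times log-scaled IDF.
        vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        # Dividing by the vector length keeps every weight between 0 and 1.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Calling tfidf_vectors(["the cat sat", "the cat ran", "dogs bark"]) gives three weight dictionaries; in the first one, "sat" gets a higher weight than "cat" because "cat" occurs in more documents. A term that occurs in every document ends up with a weight of 0 under this formula.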

This really is just a very rough outline. The crawler alone could take months to develop if it's for a large-scale operation.
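
For steps 3 and 4 of the outline, the term database and the three search modes could be sketched roughly as follows. This is an illustration under assumptions: a plain in-memory Python dict stands in for the on-disk key/value database, the function names are made up, and query terms are expected to be lowercase already.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [word positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, terms):
    """ANDed search: IDs of documents that contain every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, terms):
    """ORed search: IDs of documents that contain at least one term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.union(*postings) if postings else set()

def search_phrase(index, terms):
    """Phrase search: AND first, then check for consecutive word positions."""
    hits = set()
    for doc_id in search_and(index, terms):
        for start in index[terms[0]][doc_id]:
            # The i-th query term must sit exactly i words after the first.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits
```

With docs = {1: "fast search engine", 2: "search the fast engine", 3: "slow crawler"}, an ANDed search for "fast" and "engine" matches documents 1 and 2, while a phrase search for "fast engine" matches only document 2. Sorting the returned IDs by weight, as in step 5, is left out here.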

andreasfriedrich

10:46 pm on Jul 1, 2003 (gmt 0)

Interesting links and nice outline, SwissGuy. Easy enough for even me to understand.

Andreas