
Search Engine Development in C/C++, Perl, CGI


     
5:09 pm on Jun 24, 2003 (gmt 0)

New User

10+ Year Member

joined:May 29, 2003
posts:15
votes: 0


Hi,

Can anyone point me to a tutorial, article, or white paper on search engine algorithm development in C/C++, Perl, CGI, ASP, or XML?

Help greatly appreciated.

Regards,
Shaiju Jose

5:44 pm on June 24, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 24, 2003
posts:186
votes: 0


I'd like to know if there are any. Thanks,
3:15 pm on June 26, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 6, 2000
posts:904
votes: 0


Shaiju, I've *still* not found any decent tutorials on this despite asking a similar question [webmasterworld.com] 18 months ago... If anyone knows of one, please share!
5:57 am on June 28, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:July 27, 2002
posts:75
votes: 0


I have documentation on the site in my profile. In particular, "my site"/tech-overview.php may be useful as well as "my site"/seo-stuff.php
10:00 am on June 28, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 6, 2000
posts:904
votes: 0


Thanks kmarcus, that's interesting stuff.
10:21 am on June 28, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 22, 2002
posts:2546
votes: 0


Yes, very nice write up. Thanks for sharing it. I like the photos :)
12:10 pm on June 29, 2003 (gmt 0)

New User

10+ Year Member

joined:Aug 12, 2002
posts:8
votes: 0


Here are some resources that gave me useful input when implementing a local search engine, as well as a nifty meta-search engine with a world-class topic-distillation and result-clustering back-end (can I post links here at all?):

[clis2.umd.edu...]
[www10.org...]
[cs.kun.nl...]
[javelina.cet.middlebury.edu...]
[fieldmethods.net...]
[cs.mu.oz.au...]

You will find many more useful links by searching Google for keywords such as: rank +weight +sqrt +log AND ("TF/IDF" OR IDF).

The very basic concept of any expandable search engine is as follows:

1. Crawl and retrieve documents, strip HTML/XML markup while remembering text contained in special elements such as href (also for PageRank calculations), alt, title, bold, etc. You may also store each word's position within the document.
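Step 1 might be sketched like this in Python (a simplified regex-based approach for illustration only — a production crawler would use a real HTML parser; the tag patterns here are assumptions):

```python
import re

def extract_text(html):
    """Strip markup but remember text from 'special' elements
    (anchor text, alt text) and record word positions."""
    # Collect anchor text (useful for link-based ranking) and alt text.
    anchor_texts = re.findall(r'<a\s[^>]*href=[^>]*>(.*?)</a>', html,
                              re.IGNORECASE | re.DOTALL)
    alt_texts = re.findall(r'alt="([^"]*)"', html, re.IGNORECASE)
    # Remove scripts/styles entirely, then all remaining tags.
    html = re.sub(r'<(script|style)[^>]*>.*?</\1>', ' ', html,
                  flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<[^>]+>', ' ', html)
    # Record each word's positions for later phrase searching (step 4c).
    words = text.split()
    positions = {}
    for pos, word in enumerate(words):
        positions.setdefault(word.lower(), []).append(pos)
    return words, positions, anchor_texts, alt_texts
```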

2. Convert each document's text into a vector space model: filter out stop words (a long, carefully crafted list is preferable), analyze term frequency (TF) and inverse document frequency (IDF), and maybe also measure document similarity using the cosine measure. Finally, use log() to assign a weight value between 0 and 1 to each term and document. TF-IDF is one of the most successful and well-tested techniques in Information Retrieval.

3. Assign a unique ID to each document and store the ID as key, with a value containing the URL plus other useful document info, in a hashed or B-tree database such as Berkeley DB (SQL databases are too slow in this context). Do the same for each term in a second database, where each unique keyword is stored as a key whose value holds info such as the IDs of documents containing that term, total term frequency, weight, etc.
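The two key/value stores in step 3 could look like this with Python's stdlib dbm module (a hash-file database in the same family as Berkeley DB); file names, URLs, and the JSON value layout are made-up examples:

```python
import dbm
import json

# Document store: ID -> URL plus metadata.
# Term store: keyword -> posting list of (doc ID, weight) pairs.
with dbm.open("docs.db", "c") as doc_db, dbm.open("terms.db", "c") as term_db:
    doc_db["1"] = json.dumps({"url": "http://example.com/", "title": "Example"})
    term_db["search"] = json.dumps([["1", 0.82]])

    # A lookup is a single hashed key fetch — no SQL query planning involved.
    info = json.loads(doc_db["1"])
    postings = json.loads(term_db["search"])
```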

4. Build a search interface. Split the user query into keywords, look up each keyword in the term database, and retrieve the associated document IDs (which could be pre-sorted by weight). Then look up the IDs in the document database according to the user's preference:
a) ANDed search: look up only the intersection of the keyword IDs (IDs that appear for every user keyword).
b) ORed search: look up all IDs.
c) Phrase search: either use the ANDed technique and then do a live phrase search in the retrieved documents, or, if a word-position matrix is available for each document, compare term offsets to check for a match.
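The AND/OR lookups in step 4 reduce to set intersection and union over the posting lists; a toy sketch with a hypothetical in-memory term index:

```python
# Toy term index: keyword -> set of document IDs (made-up data).
index = {
    "search": {"1", "2", "5"},
    "engine": {"2", "5", "9"},
    "perl":   {"3", "5"},
}

def lookup(query, mode="AND"):
    """Resolve a query against the term index (steps 4a/4b above)."""
    keyword_ids = [index.get(word, set()) for word in query.lower().split()]
    if not keyword_ids:
        return set()
    if mode == "AND":
        return set.intersection(*keyword_ids)  # docs containing every keyword
    return set.union(*keyword_ids)             # docs containing any keyword
```

For phrase search (4c), the per-document word positions recorded in step 1 would then be checked: two terms form a phrase when their offsets differ by exactly one.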

5. Serve the results, which may also be temporarily cached, sorted by term/document weight to the user.
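Step 5 is then just an ordering of the scored hits before rendering (the scores here are placeholders):

```python
# Hypothetical scored results from the lookup stage: (doc ID, weight) pairs.
results = [("7", 0.41), ("2", 0.93), ("5", 0.67)]

# Present results ordered by descending term/document weight.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
```

In practice this ranked list is what you would cache temporarily, keyed by the query string, so repeated queries skip the lookup stage entirely.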

This really is just a very rough outline. The crawler alone could take you months to develop if it's for a large-scale operation.

10:46 pm on July 1, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 22, 2002
posts:1782
votes: 0


Interesting links and nice outline SwissGuy. Easy enough for even me to understand.

Andreas