
Perl Server Side CGI Scripting Forum

    
Search Engine Development in C/C++, Perl, CGI
shaijujose

Msg#: 3094 posted 5:09 pm on Jun 24, 2003 (gmt 0)

Hi,

Is there any tutorial, article, or white paper on search engine algorithm development in C/C++, Perl, CGI, ASP, or XML?

Help greatly appreciated.

Regards,
Shaiju Jose

 

stevegpan2

Msg#: 3094 posted 5:44 pm on Jun 24, 2003 (gmt 0)

I'd like to know if there are any, too. Thanks.

sugarkane

Msg#: 3094 posted 3:15 pm on Jun 26, 2003 (gmt 0)

Shaiju, I've *still* not found any decent tutorials on this despite asking a similar question [webmasterworld.com] 18 months ago... If anyone knows of one, please share!

kmarcus

Msg#: 3094 posted 5:57 am on Jun 28, 2003 (gmt 0)

I have documentation on the site in my profile. In particular, "my site"/tech-overview.php may be useful as well as "my site"/seo-stuff.php

sugarkane

Msg#: 3094 posted 10:00 am on Jun 28, 2003 (gmt 0)

Thanks kmarcus, that's interesting stuff.

Birdman

Msg#: 3094 posted 10:21 am on Jun 28, 2003 (gmt 0)

Yes, very nice write up. Thanks for sharing it. I like the photos :)

SwissGuy

Msg#: 3094 posted 12:10 pm on Jun 29, 2003 (gmt 0)

Here are some resources that gave me useful input when implementing a local search engine, as well as a nifty meta-search engine with a world-class topic-distillation and result-clustering back-end (can I post links here at all?):

[clis2.umd.edu...]
[www10.org...]
[cs.kun.nl...]
[javelina.cet.middlebury.edu...]
[fieldmethods.net...]
[cs.mu.oz.au...]

You will find many more useful links by searching Google for keywords such as: rank +weight +sqrt +log AND ("TF/IDF" OR IDF).

The very basic concept of any expandable search engine is as follows:

1. Crawl and retrieve documents, strip HTML/XML markup, and remember text associated with special markup such as href attributes (also for PageRank calculations), alt, title, bold, etc. You may also store word positions in the document.

2. Convert each document's text into a vector-space model: filter out stop words (a long, carefully crafted list is preferable), analyze term frequency (TF) and inverse document frequency (IDF), and maybe also document similarity using a cosine-similarity measure. Finally, use log() to assign a weight value between 0 and 1 to each term and document (a Perl sketch of this step follows the outline). TF-IDF is one of the most successful and well-tested techniques in Information Retrieval.

3. Assign a unique ID to each document and store the ID as key, with a value containing the URL and other useful document info, in a hash or B-tree database such as Berkeley DB (SQL databases are too slow for this context). Do the same for each term in another database, where each unique keyword is stored as a key whose value holds info such as the IDs of documents containing that term, total term frequency and weight, etc.

4. Build a search interface. Split a user query into keywords, look up each keyword in the term database, and retrieve the associated document IDs (which could be pre-sorted by weight). Then look up the IDs in the document database depending on user preference (a query-side sketch also follows the outline):
a) ANDed search: look up only the intersection of keyword IDs (the IDs that appear for every user keyword)
b) ORed search: look up all IDs
c) Phrase search: either use the ANDed technique and then do a live phrase search in the retrieved documents, or, if a word-position matrix is available for each document, calculate the term offsets to check for a match.

5. Serve the results to the user, sorted by term/document weight; they may also be cached temporarily.
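
To make steps 2 and 3 concrete, here is a minimal Perl sketch of TF-IDF weighting over a toy in-memory corpus. Everything here is illustrative: the %index hash stands in for the term database (in production you would tie it to a Berkeley DB file, e.g. via the DB_File module), the stop list is a short placeholder rather than the long list recommended above, and weights are left as raw log-scaled TF-IDF rather than normalized into the 0..1 range.

#!/usr/bin/perl
use strict;
use warnings;

# Toy corpus standing in for crawled, markup-stripped documents (step 1).
my %docs = (
    1 => 'the quick brown fox jumps over the lazy dog',
    2 => 'a quick tutorial on search engine design',
    3 => 'the dog chased the fox around the engine room',
);

# Placeholder stop list; a real one should be much longer.
my %stop = map { $_ => 1 } qw(a an and around on over the);

my $N = scalar keys %docs;
my (%tf, %df);    # %tf: term => {doc_id => count}; %df: term => doc count

for my $id (keys %docs) {
    my %seen;
    for my $term (grep { length && !$stop{$_} } split /\W+/, lc $docs{$id}) {
        $tf{$term}{$id}++;
        $df{$term}++ unless $seen{$term}++;
    }
}

# Term database (step 3): term => "doc_id:weight doc_id:weight ...".
# A flat string value like this ties cleanly to Berkeley DB via DB_File.
my %index;
for my $term (keys %tf) {
    my $idf = log($N / $df{$term});                 # 0 if the term is in every doc
    my @postings;
    for my $id (sort { $tf{$term}{$b} <=> $tf{$term}{$a} } keys %{ $tf{$term} }) {
        my $w = (1 + log($tf{$term}{$id})) * $idf;  # log-scaled TF-IDF weight
        push @postings, sprintf '%d:%.4f', $id, $w;
    }
    $index{$term} = join ' ', @postings;
}

# To persist: tie %index, 'DB_File', 'terms.db' before filling it.
print "$_ => $index{$_}\n" for sort keys %index;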

This really is just a very rough outline. The crawler alone could take months to develop if it's for a large-scale operation.
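
The query side (step 4) is compact enough to sketch as well. This assumes the "doc_id:weight" posting format from the indexing sketch above; the hard-coded %index entries are just example values.

#!/usr/bin/perl
use strict;
use warnings;

# Term database as built in the indexing sketch: term => "doc_id:weight ...".
# Hard-coded here so the example runs standalone; normally tied to DB_File.
my %index = (
    fox    => '1:0.4055 3:0.4055',
    dog    => '3:0.4055 1:0.4055',
    engine => '2:0.4055 3:0.4055',
);

# ANDed search: keep only documents matched by every query term,
# ranked by the sum of their per-term weights.
sub search_and {
    my @terms = map { lc } @_;
    my (%score, %hits);
    for my $term (@terms) {
        my $postings = $index{$term} or return;   # unknown term: empty result
        for my $posting (split ' ', $postings) {
            my ($id, $w) = split /:/, $posting;
            $score{$id} += $w;
            $hits{$id}++;
        }
    }
    my @ids = grep { $hits{$_} == @terms } keys %hits;   # the intersection
    return sort { $score{$b} <=> $score{$a} } @ids;
}

print "best matches first: ", join(', ', search_and('fox', 'dog')), "\n";

An ORed search is the same loop minus the intersection filter, and a phrase search would re-check term offsets afterwards, as in step 4c.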

andreasfriedrich

Msg#: 3094 posted 10:46 pm on Jul 1, 2003 (gmt 0)

Interesting links and nice outline SwissGuy. Easy enough for even me to understand.

Andreas
