Forum Moderators: coopster
I couldn't decipher htdig since I'm not familiar with Linux/Unix command-line tools and don't have shell access at my host.
Google search requires your site to be in Google's index, I believe. If half of your site isn't spidered, it may not be the best solution.
There were a few others I spotted but never got around to trying: iSearch (not the toolbar), which is PHP and free for non-commercial/non-profit sites, PHP-Pheonix, Interspire FastFind, and TSEP (The Search Engine Project). I only list these since they seem to be some of the few that actually have a spider portion. Others only do filesystem indexing, or just do the search/display part based on your own database (which may not hold all pages of the site).
If you run your whole site off a database with a common table/row structure, then you can just use the database's own search features.
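As a rough illustration of the database route, here is a minimal sketch in Python using an in-memory SQLite database. The `products` table, its columns, and the sample rows are all made up for the example; a real site would have its own schema, and the same idea applies to MySQL from PHP.

```python
import sqlite3

def search_products(conn, term):
    """Return (id, title) rows whose title or description contains term."""
    # Escape LIKE wildcards so that user input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    pattern = "%" + escaped + "%"
    sql = (
        "SELECT id, title FROM products "
        "WHERE title LIKE ? ESCAPE '\\' OR description LIKE ? ESCAPE '\\' "
        "ORDER BY id"
    )
    # Parameter binding keeps the search term out of the SQL text itself.
    return conn.execute(sql, (pattern, pattern)).fetchall()

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO products (id, title, description) VALUES (?, ?, ?)",
    [
        (1, "Blue Widget", "A small blue widget"),
        (2, "Red Gadget", "Pairs well with the widget range"),
        (3, "Green Gizmo", "100% recycled"),
    ],
)

widget_hits = search_products(conn, "widget")
```

A plain `LIKE` scan is fine for a small catalogue. For larger tables, most databases offer some form of full-text index, which is worth checking before writing a spider at all.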
If you want to cook your own, then roll up your sleeves :) There are a few ways to actually download the page (cURL, HTTP methods, PHP's file functions with URL wrappers), but you will also need error handling, cookie/session handling, and regular expressions for parsing data such as titles, content, and new links. On top of that, you have to store and flag links so you don't visit them more than once unless specified (reindexing), handle relative links and malformed hrefs, and respect exclusions (robots.txt, meta directives like nofollow and noindex). There is probably more that I don't recall offhand.
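To make a few of those pieces concrete (resolving relative links, flagging visited URLs, honouring robots.txt), here is a small sketch in Python rather than PHP. It swaps the regular expressions mentioned above for the standard library's HTML parser, and it takes the download step as a `fetch` function so the example runs without a network. The `example.test` site, the `crawl` and `extract` names, and the `max_pages` limit are assumptions made for the illustration, not part of any of the tools listed earlier.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser

class LinkParser(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract(html, base_url):
    """Return (title, absolute http/https links) for one page."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        # Resolve relative hrefs and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return parser.title.strip(), links

def crawl(start_url, fetch, robots_lines=(), max_pages=50):
    """Breadth-first crawl of one host; fetch(url) returns HTML or None."""
    robots = RobotFileParser()
    robots.parse(list(robots_lines))
    host = urlparse(start_url).netloc
    seen = {start_url}            # URLs already queued, so none is visited twice
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue              # excluded by robots.txt
        html = fetch(url)
        if html is None:
            continue              # download failed; skip the page
        title, links = extract(html, url)
        pages[url] = title
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# A made-up three-page site standing in for real HTTP requests.
site = {
    "http://example.test/": '<title>Home</title><a href="about.html">About</a>'
                            '<a href="/private/x.html">P</a><a href="#top">Top</a>'
                            '<a href="mailto:a@example.test">Mail</a>'
                            '<a href="http://other.test/">Other</a>',
    "http://example.test/about.html": '<title>About</title><a href="/">Home</a>'
                                      '<a href="./about.html#team">Team</a>',
    "http://example.test/private/x.html": "<title>Secret</title>",
}
pages = crawl("http://example.test/", site.get,
              robots_lines=["User-agent: *", "Disallow: /private/"])
```

This leaves out cookies, sessions, meta nofollow/noindex handling, and any delay between requests, all of which a real spider on a shared host would still need.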
You could probably skip some of these if you want brute force spidering without regard to your bandwidth allocation.
Then you need a way to search the data efficiently and display it nicely. I haven't researched this far yet, since I'm somewhat overwhelmed by the spider part: how to do it efficiently so the system doing the spidering and indexing isn't pegged at 100% CPU usage for a week :P I'm sure many shared hosts won't like the CPU usage either.
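One common way to make the search side efficient is an inverted index, which maps each word to the set of pages containing it, so a query only touches the pages that mention its words. The sketch below is in Python and is only one possible approach, not something the tools above are known to use; the page paths and texts are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # Copy the first word's set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical page texts, as a spider might have stored them.
docs = {
    "/a": "Blue widgets on sale",
    "/b": "Red widgets and blue gadgets",
    "/c": "Contact us",
}
index = build_index(docs)
hits = search(index, "blue widgets")
```

Building the index once after each spider run, instead of scanning every page on every query, moves most of the CPU cost out of the search request.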
"SWISH-E is a fast, powerful, flexible, free, and easy to use system for indexing collections of Web pages or other files. See the article How to Index Anything by Josh Rabinowitz in the Linux Journal for more information."
It's very fast for indexing records from a database. It also has a couple of helper scripts for spidering and displaying results.
[swish-e.org...]
Thank you to all of you who contributed to this topic. I am not a web expert yet, but I would like to create a search engine for my website, which displays information about products on sale.
I have spent a bit of time going through [htdig.org...], but it seems complicated.
Please advise the simplest way to create a Web Search Engine.
I would highly appreciate any advice.
Cheers
It indexes all of your site's pages into MySQL, and then you can search by keyword.
I use the same search software, but it is integrated into the directory software by the same company.