A search facility is an external program which communicates with the web site through "CGI". That is, it receives (via CGI) the search request from a web page form filled in by a reader, does a search, creates a results page, and sends (via CGI) the result page to the web server.
There are lots of free Perl scripts for this widely available on the web; you can adapt one to suit your own needs.
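To make the CGI flow concrete, here is a minimal sketch in Python (the classic scripts were Perl, but the shape is the same). The page names and text are made up for illustration; a real CGI script would read `os.environ["QUERY_STRING"]` and print a `Content-Type: text/html` header before the page.

```python
# Sketch of the CGI search flow: the server hands the script the form
# data, the script searches, and the script returns an HTML results page.
from urllib.parse import parse_qs

# Hypothetical stand-in for the site's content.
PAGES = {
    "/about.html": "About our widget company",
    "/faq.html": "Frequently asked questions about widgets",
}

def handle_search(query_string):
    """Build an HTML results page from a CGI QUERY_STRING like 'q=widget'."""
    params = parse_qs(query_string)
    term = params.get("q", [""])[0].lower()
    hits = [url for url, text in PAGES.items() if term and term in text.lower()]
    body = "".join(f'<li><a href="{u}">{u}</a></li>' for u in hits)
    return f"<html><body><ul>{body}</ul></body></html>"
```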
Matt
I'd tried using a script installed on my own server, but it overloaded the server. The script itself was fine; there were simply too many people (13K or so daily) trying to use it at once.
If you have a small site, using your own script should be fine, but if your site is big or has heavy traffic, you might want to consider putting that "search" traffic on somebody else's server.
Eliz.
When it comes to searching, you basically have two search types and four search methods to choose between:
Search Types:
HTTP Search
An HTTP Search is when the search actually uses your browser (or a simulacrum of it) and records the URI of your page. It does not access the files directly. This is important if you have a site full of active content, like PHP, Perl or ASP. This is usually the best and most secure way to search, but can be slower than a Filesystem Search.
One of the chief disadvantages of this search is that it can use a fair bit of bandwidth as it spiders through your site.
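A rough sketch of how an HTTP search spiders a site: it only ever sees URLs and rendered HTML, never the files behind them. To keep the example self-contained, the "web server" here is just a dict; for a real crawl you would fetch each URL with `urllib.request` instead.

```python
# Toy HTTP-style spider: follow links from page to page, recording the
# rendered HTML for each URI. FAKE_SITE stands in for a live server.
from html.parser import HTMLParser

FAKE_SITE = {
    "/": '<a href="/a.html">A</a> home page text',
    "/a.html": '<a href="/">back</a> page A text',
}

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

def crawl(start="/"):
    """Breadth-first crawl; returns {url: page HTML}."""
    seen, queue, index = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in FAKE_SITE:
            continue
        seen.add(url)
        html = FAKE_SITE[url]
        parser = LinkParser()
        parser.feed(html)
        index[url] = html
        queue += parser.links
    return index
```

Every fetch in the loop is a page request against your server, which is exactly where the bandwidth cost mentioned above comes from.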
Filesystem Search
A Filesystem search actually goes through the files themselves (or your database), and constructs the URIs directly from your files. This can be the most powerful and efficient way to do it, but it can be very dangerous unless you do it correctly. I tend to use this type of search, but I'm a geek, so I can do it right. Here be dragons. It also needs to be run (usually) by code running on the server. It's not (usually) something that can be imposed by a third party.
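The core of a filesystem search is mapping file paths to the URIs they are served at, and the "dragons" are paths that escape the document root (symlinks, "..", config files). A minimal sketch of that mapping, using a throwaway temp directory as the docroot:

```python
# Sketch of filesystem-search URI construction: walk the document root
# and turn each file path into the URL it would be served at. The
# docroot here is a temp dir created just for the example.
import os, tempfile

docroot = tempfile.mkdtemp()
os.makedirs(os.path.join(docroot, "docs"))
for rel in ("index.html", "docs/faq.html"):
    with open(os.path.join(docroot, rel), "w") as f:
        f.write("widget text")

def path_to_uri(docroot, path):
    # Refuse anything that resolves outside the docroot -- skipping
    # this check is what makes naive filesystem searches dangerous.
    real = os.path.realpath(path)
    assert real.startswith(os.path.realpath(docroot) + os.sep)
    return "/" + os.path.relpath(real, docroot).replace(os.sep, "/")

uris = sorted(
    path_to_uri(docroot, os.path.join(dirpath, name))
    for dirpath, _, names in os.walk(docroot) for name in names
)
```

Note that for PHP/Perl/ASP sites this approach searches the source, not the rendered output, which is another reason the HTTP approach above can be the safer choice for active content.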
Search Methods:
Realtime Search
The search is performed at the time the request is made. No data is indexed, and the searcher is made to wait for the results. Many blog packages use this.
Advantages: This is the easiest to implement and requires the least amount of involvement from you, the Webmaster. It can be the most "up to date" search, as well as the most complete, since all the data is being searched, not just a digest.
Disadvantages: However, this is the least efficient and slowest search. It is most useful on very small sites.
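A realtime search boils down to scanning everything at request time, which is why it is both the most current and the slowest option. A sketch (the page content is made up):

```python
# Realtime search in a nutshell: nothing is precomputed; every request
# scans all the content, so cost grows with site size.
PAGES = {
    "/a.html": "fresh post about widgets",
    "/b.html": "older post about gadgets",
}

def realtime_search(term):
    """Case-insensitive substring scan over every page, per request."""
    term = term.lower()
    return [url for url, text in PAGES.items() if term in text.lower()]
```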
Indexed Search
At some regular interval, the search package actually scans through the site and builds an index, which is kept in some sort of database. There are all sorts of indexing methods that can vary in the size and specificity of the indexed data.
When a search request is made, the search package consults the index, not the site. For example, if the visitor asks for all pages containing "Frank N. Furter," then the package will return pages containing "Frank," "Furter" or "Frank N. Furter" (usually one-letter tokens are ignored). How far you can take this depends upon how fancy the search package lets you get.
This is basically how most of the search services like Google and Yahoo operate.
Advantages: This is much faster and more efficient than Realtime Search, and can be scaled to very large sites. Searches take a predictable amount of time, and consume a predictable amount of resources.
Disadvantages: However, this can result in not so timely results. The indexer runs on a regular schedule (or occasionally, as necessary). If the data for which you are searching was introduced after the last index, it will not show in the search. This method is also generally a great deal more complex than Realtime. Often, the indexer needs to be written in a different scripting language, and needs to be triggered at the server level (such as crontab). It also limits you as to what you can search for, as the index is, by definition, a digest of the data.
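The usual structure behind an indexed search is an inverted index: word → set of pages. A sketch of one, including the "one-letter tokens are ignored" behavior and the OR-style matching from the "Frank N. Furter" example above (pages and text are invented):

```python
# Sketch of an indexed search: a cron-style indexer builds an inverted
# index ahead of time; queries consult only the index, never the site.
import re
from collections import defaultdict

PAGES = {
    "/rocky.html": "Frank N. Furter sings",
    "/deli.html": "best furter in town",
}

def build_index(pages):
    """The 'nightly cron' step: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if len(word) > 1:          # one-letter tokens are ignored
                index[word].add(url)
    return index

def search(index, query):
    """OR-style query: return every page containing any query word."""
    words = [w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 1]
    hits = set()
    for w in words:
        hits |= index.get(w, set())
    return sorted(hits)
```

Anything added to PAGES after `build_index` runs is invisible until the next index pass, which is exactly the staleness trade-off described above.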
Hybrid Search
Some of the big blogs and communities have this. It is often a pretty big deal, as you need to structure your entire architecture in a manner that affords these searches. I have seen a "two stage" system, kinda like our brain, with a "short term" area, and a "long term, archive" area. The short term area is searched in realtime, while the archive is indexed. After a given period of time, data is moved from the short term area into the long term area, and the long term index is updated.
When you search, it will perform both a realtime search on the short term area and an indexed search on the archive.
Advantages: Best of both worlds. You get the advantages of each type of search.
Disadvantages: Complex as all git-go. You need to insinuate the search into your entire site structure, and test like there is no tomorrow. Also, because you are doing two kinds of searches, you can get inconsistent results if they use different search methods. For example, you may find a page while it is in short term, but it "disappears" in the archive because the indexer doesn't index the terms you used to find it the first time.
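The two-stage structure can be sketched in a few lines: recent items are scanned in realtime, and items moved to the archive are indexed on the way in. All names and content here are illustrative, and a real system would be vastly more involved.

```python
# Sketch of a hybrid search: a realtime scan of the "short term" area
# plus an inverted-index lookup over the "long term" archive.
import re
from collections import defaultdict

short_term = {"/new.html": "fresh widget news"}
archive = {}
archive_index = defaultdict(set)

def archive_page(url, text):
    """Move a page into the archive and index it as it goes in."""
    archive[url] = short_term.pop(url, text)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) > 1:
            archive_index[word].add(url)

def hybrid_search(term):
    """Realtime scan of short term + index lookup over the archive."""
    term = term.lower()
    live = [u for u, t in short_term.items() if term in t.lower()]
    old = sorted(archive_index.get(term, set()))
    return live + old
```

The inconsistency described above falls straight out of this design: the realtime half matches any substring, while the archive half only matches indexed whole words.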
Third-Party Search
Specifically, the various Google [code.google.com] options [google.com]. You integrate a commercial search engine into your site. This is a very popular option, and many sites, even big ones, use this.
Advantages: It is usually quite easy. They make it so. You can also assure that your pages are all in the Google Index.
Disadvantages: Ads, restrictive EULAs and loss of the ability to customize.
I tend to use indexed searches on my sites. Some of the sites are frequently updated, so I need to run a nightly cron index, which means that the indexer is a Perl script. Other sites have fairly static content, so I re-run the index when I update. I'm a control freak, so I don't like to give up the autonomy that using the Google API requires.
Just my $0.02. This is a HUGE topic, and this is merely scratching the surface.
When thinking about adding search to a website, there are two main approaches you may wish to take into account. The first is self-hosted software, where the actual search engine is installed on your own website/server. When you follow this path there are advantages and disadvantages. The main advantage is that you have total control over the system: you manage how it works, the layout, and so on.
The main disadvantage is that you are responsible for its operation and maintenance.
The other main option is a hosted solution, FreeFind for example. Using this method allows you to simply plug the results into your site. This solution gives you a little more freedom, because the maintenance side of things is dealt with by the search provider.
Mack.