Looking for advanced site-search script

Need lots of unusual features, not sure if I can get it all or not

         

MatthewHSE

12:35 pm on Oct 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm trying to set up a site search for my largest site. I want to do this one right, so I thought I'd ask if anyone here has ever used a site search with the following features:

  • Automatically crawls and indexes all files in specified directories, with options to include or exclude sub-directories and specific file types
  • Won't index content between special HTML comments, e.g.,
    <!--No Search-->This won't be indexed<!--End No Search-->

  • Sorts results by relevance (and does a good job at determining relevance)
  • Returns <title> of page as result heading, with snippets of text surrounding search terms
  • Phrase search
  • Obeys robots.txt
  • Heavily-customizable results pages, preferably as a standard HTML page with special "replacement" tags for dynamic page elements
  • Nice but not required: Displays how long search took, highlights or "bolds" search terms, shows how many total results, shows relevancy for each result, configurable length for results "snippets", configurable number of results per page, configurable layout for results, ability to use images for links to more results.
  • Keeps statistics (not necessarily graphical) on search terms, clicked results for searches, searches with no results, or with no clicked results, and similar informational data
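To make a few of the items above concrete, here is a minimal sketch of how an indexer might skip the "No Search" blocks, pull the <title> for the result heading, and cut a snippet around a search term. The function names are made up for illustration, the comment markers are taken from the example above, and regex-based HTML handling is an assumption that only suits well-formed pages; it is not how Perlfect or any other particular script does it.

```python
import re

# Marker text comes from the example in the feature list; matching it
# case-insensitively and with optional whitespace is an assumption.
NO_SEARCH = re.compile(
    r"<!--\s*No Search\s*-->.*?<!--\s*End No Search\s*-->",
    re.IGNORECASE | re.DOTALL,
)
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def extract_title(html):
    """Return the page <title> with whitespace collapsed, or '' if absent."""
    match = TITLE.search(html)
    return " ".join(match.group(1).split()) if match else ""


def indexable_text(html):
    """Drop excluded blocks, the title, and all tags; return plain text."""
    html = NO_SEARCH.sub(" ", html)
    html = TITLE.sub(" ", html)
    return " ".join(TAG.sub(" ", html).split())


def snippet(text, term, width=30):
    """Return up to `width` characters either side of the first match."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return text[start:end]


page = (
    "<html><head><title>Widget  Guide</title></head><body>"
    "<p>Blue widgets are great.</p>"
    "<!--No Search-->This won't be indexed<!--End No Search-->"
    "<p>Red ones too.</p></body></html>"
)
text = indexable_text(page)
print(extract_title(page))
print(text)
print(snippet(text, "red", width=10))
```

A real indexer would more likely use a proper HTML parser, but the order of operations (remove the excluded blocks first, then strip tags) would stay the same.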

That last one is what's held up the show so far. Perlfect and other free site searches will do most of the rest pretty well, but so far I haven't been able to find anything that keeps decent search stats. There's nothing worse than a site search that returns no results or no good results, so I'd like to be able to keep tabs at least on the searches that return no results or don't get a clicked result. Obviously, knowing which results have been clicked would also be very useful information.
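For what it's worth, the statistics part could be bolted on to almost any search script with two small tables. The sketch below is only an illustration: the table and function names are invented, and it uses SQLite so that it runs on its own, whereas on a site like this the same tables could presumably live in the existing MySQL database.

```python
import sqlite3


def open_stats(path=":memory:"):
    """Open (or create) the stats database. SQLite is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS searches ("
        " id INTEGER PRIMARY KEY,"
        " term TEXT NOT NULL,"
        " result_count INTEGER NOT NULL,"
        " searched_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS clicks ("
        " search_id INTEGER NOT NULL REFERENCES searches(id),"
        " url TEXT NOT NULL)"
    )
    return db


def log_search(db, term, result_count):
    """Record one search; the returned id is carried in the result links."""
    cur = db.execute(
        "INSERT INTO searches (term, result_count) VALUES (?, ?)",
        (term.strip().lower(), result_count),
    )
    db.commit()
    return cur.lastrowid


def log_click(db, search_id, url):
    """Record that a result of the given search was clicked."""
    db.execute(
        "INSERT INTO clicks (search_id, url) VALUES (?, ?)", (search_id, url)
    )
    db.commit()


def searches_without_results(db):
    """Terms that returned nothing, most frequent first."""
    return db.execute(
        "SELECT term, COUNT(*) FROM searches WHERE result_count = 0"
        " GROUP BY term ORDER BY COUNT(*) DESC, term"
    ).fetchall()


def searches_without_clicks(db):
    """Terms that returned results, none of which were clicked."""
    return db.execute(
        "SELECT s.term, COUNT(*) FROM searches s"
        " LEFT JOIN clicks c ON c.search_id = s.id"
        " WHERE s.result_count > 0 AND c.search_id IS NULL"
        " GROUP BY s.term ORDER BY COUNT(*) DESC, s.term"
    ).fetchall()
```

The idea is that each result link passes through a small redirect script carrying the search id, so a click can be logged before the visitor is sent on to the page.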

Thanks in advance for any suggestions.

Matthew

webdoctor

10:28 am on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm trying to set up a site search for my largest site

Can you tell us how your site is built?

Static or dynamic pages?

If dynamic, what kind of db backend?

IIS or Apache (or something else)?

MatthewHSE

11:26 am on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The site is mostly dynamic, with Perl CGI scripts pulling data from a MySQL database. However, I can schedule a script that will build static pages from the dynamic ones. While I wouldn't allow other search engines or crawlers into the directories where those static pages are stored, they would work just fine for a local site search. So in the end, I have a choice: either use a crawler that will index .cgi pages, or use an indexer that will go through specified directories indexing regular .html pages. While either option would work, I think I prefer using the static .html pages, since the tool used to build them will ignore pages that wouldn't be useful for a search anyway (such as "review this article" pages, which have no real content).

The site is running on Linux and Apache 1.3.33, with MySQL 4.1.13.

fish_eye

12:58 pm on Oct 7, 2005 (gmt 0)

10+ Year Member



I'm interested in this too - especially if it's PHP and therefore I can modify it (a Perlfect written in PHP would be Pherphect!).

I've been messing about with corzoogler, but it doesn't (really) do relevance, and it works on the raw files rather than an index (which is good and bad, but it probably won't serve your purpose, Matthew).

The site is mostly dynamic etc etc

a) Does this mean you have a regular, predictable update regime (of the database itself) or you are just prepared to wear some inaccuracies / misses / losses? A loss would be a bit annoying or does the site just keep growing?

b) Isn't redirecting from the static name to the dynamic page a bit (too) fiddly or would this all be stored in the index? If so, I'm pretty sure you'd have to hack the code or it would have to be specifically written to do this (and not a bad selling point either).

Nice topic!

MatthewHSE

1:30 pm on Oct 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



a) Does this mean you have a regular, predictable update regime (of the database itself) or you are just prepared to wear some inaccuracies / misses / losses? A loss would be a bit annoying or does the site just keep growing?

Articles are added weekly; forum posts and other content are added daily by our visitors. Several times per day, I can have an automated script go through the database and build static HTML pages of all the actual content, storing them in a directory of my choosing. I would seldom if ever remove content, so losses shouldn't be an issue. The search index wouldn't be right up to the minute, but it would be close enough to suit me.

b) Isn't redirecting from the static name to the dynamic page a bit (too) fiddly or would this all be stored in the index? If so, I'm pretty sure you'd have to hack the code or it would have to be specifically written to do this (and not a bad selling point either).

Actually, the indexer would only be indexing the static pages. When someone performed a search, the results would link to the static HTML files - no redirection at all. This is okay because the script that builds the static pages still preserves all the links on the page so they point back to the dynamic site. In other words, although the static page is where the searcher will be taken after clicking a result, everything will still behave exactly as though they were on a dynamic page. To put it another way, imagine viewing a dynamic page in your browser that uses absolute URLs. Now view the page source and save it as a new, static file in a new location. You'll be able to view the new page and it will look and act just like the original, dynamic version, and clicking a link on the new page will take you to one of the normal, dynamic pages. That's about the same thing as what's going on here.
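As a rough illustration of the "absolute URLs" point, this is the kind of rewrite a snapshot script could apply when saving a page somewhere else. The function name and the example.com addresses are placeholders, and the regex only handles quoted href, src and action attributes, so treat it as a sketch rather than what my build script actually does.

```python
import re
from urllib.parse import urljoin

# Matches quoted href/src/action attributes; unquoted attribute values
# are not handled in this sketch.
ATTR = re.compile(
    r"""(?P<attr>\b(?:href|src|action))=(?P<q>["'])(?P<url>.*?)(?P=q)""",
    re.IGNORECASE,
)


def absolutize(html, base_url):
    """Rewrite relative links so a saved copy still points at the live site."""

    def fix(match):
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("q")
        return f"{match.group('attr')}={quote}{absolute}{quote}"

    return ATTR.sub(fix, html)


# Hypothetical address of the dynamic page being snapshotted.
base = "http://www.example.com/cgi-bin/article.cgi?id=7"
html = (
    '<a href="article.cgi?id=8">Next</a> '
    '<img src="/img/logo.gif"> '
    '<a href="http://other.example/x">Elsewhere</a>'
)
print(absolutize(html, base))
```

Links that are already absolute come through unchanged, which is why the saved copy behaves the same no matter which directory it ends up in.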

Although this approach would work well for this site, it would also be fine if I could only find a site search that would crawl and index the dynamic pages themselves. Actually this might be much better and simpler in the long run. I'd have to configure either the indexer or robots.txt to disallow access to certain non-content files, such as the login and new forum post scripts. But once that was done, indexing right from the dynamic pages would probably be best after all.
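If the crawler honours robots.txt, keeping it out of the non-content scripts could look something like the sketch below, which uses Python's standard urllib.robotparser. The "sitesearch" user-agent name, the script paths and the example.com URLs are all made up for illustration; the real ones would depend on the crawler and the site layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a crawler that identifies itself
# as "sitesearch"; the paths are placeholders, not the site's real ones.
ROBOTS_TXT = """\
User-agent: sitesearch
Disallow: /cgi-bin/login.cgi
Disallow: /cgi-bin/newpost.cgi
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="sitesearch"):
    """True if the given agent may crawl the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
for url in (
    "http://www.example.com/cgi-bin/article.cgi?id=7",
    "http://www.example.com/cgi-bin/login.cgi",
    "http://www.example.com/cgi-bin/newpost.cgi?forum=3",
):
    print(url, allowed(parser, url))
```

Using a dedicated user-agent section means the rules for the local search crawler can differ from what the outside search engines are told.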

Mokita

9:50 pm on Oct 7, 2005 (gmt 0)

10+ Year Member



You might like to investigate iSearch:

[isearchthenet.com...]

There is a Pro and a free version, and it is written in PHP. It does most, if not all, of what you require and is well supported.

bill

8:36 am on Oct 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Atomz product will do this for you. The downside is that they were purchased by WebSideStory a while back and no longer offer a lot of services for free. However, if you spend a bit of money, they still have that great Atomz back-end in place. It's highly customizable.