One of the nice things about CMSs is that a pretty decent site search typically goes along for the ride - since your content is already in a database.
That said, search requires a different kind of indexing than the typical database provides. FWIW, MySQL *does* have full-text indexing capability. Of course, a full-text index takes extra space, and your database may well double in size when you add one.
What I meant by an 'indexing script' is this: a script that is pointed at certain directories, sucks the content (text) out of the files, and sticks it into a searchable format (either a db or a flat file) while associating it with the file name. The script would run as a periodic cron job.
For example, check this out:
//example_1.php
<?php
echo "<h1>Pancakes</h1>";
?>
//example_2.php
<?php
echo "<h1>Hotdogs</h1>";
?>
These two files would be indexed as follows:
| content  | file          |
|----------|---------------|
| Pancakes | example_1.php |
| Hotdogs  | example_2.php |
The hard part would be getting to the actual content, i.e. stripping the PHP and HTML while preserving the content.
So my question is: does this seem silly or smart? Has anyone attempted it, and does anyone have any tips?
Has anyone ever done this?
Yes. <g>
And it works rather well, though I say so myself.
I have a large web site with many pages and much data (all reference material), which is searchable by various scripts: a plain "search" script that lets the user look for "widgets", and research scripts that let users retrieve ALL data recorded about "widgets", either site-wide or within a restricted scope.
I did encounter issues while developing these searches, the main one being server overloads caused by site suckers and the like running numerous simultaneous searches that took too long.
It should also be noted that it's very unlikely any shared hosting service will allow such CPU-intensive work to be done; I use a dedicated server.
The scripts are written in Perl, which is designed for extracting text from within text files (in this case HTML pages).
The simplest basic premise is to sequentially open each HTML page in turn, seek the required text string, and if found return the contents of that web page.
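As a rough illustration of that premise, here is a minimal sketch. It is written in Python purely for illustration (the scripts described above are Perl and are not shown in this thread), and the function name `naive_search` and the `*.html` file pattern are assumptions, not anything taken from the real site.

```python
from pathlib import Path


def naive_search(root, needle):
    """Open each HTML page under `root` in turn and return the paths
    of the pages whose raw text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    hits = []
    for page in sorted(Path(root).rglob("*.html")):
        # errors="ignore" keeps one badly encoded page from aborting the scan
        text = page.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(str(page))
    return hits
```

Note that every search re-reads every file, which is exactly the load problem described above when many searches run at once.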
From this simple premise one can add sophistication to increase the speed at which results are returned, restrict the returned data to just parts of a web page, &c.
You will find that the web site is a database, and the HTML pages are records, and as such that modifying the basic structure of the web pages can cause the searches to fail (sheepish grin <g>)
Good luck.
Matt
The reason I don't want my content in a db is that I don't want to go through the db to edit my content. I use a lot of scripting that gets tweaked. I want files that open in code editors.
You still want to store your search data in a database; it can easily be 100 times faster than searching text files. You shouldn't have to use the DB to edit your pages.
The speed improvements are twofold. What you do is write a script (or get a canned script and modify it) that indexes the pages once or multiple times a day via a cron job. When it does the indexing, it strips out all the HTML and stores the raw content in the DB, associated with the URL of the originating page. Instead of searching through HTML files every time someone searches the site, it does it once, or twice, or however many times you think the search database needs updating.
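A minimal sketch of such an indexing script might look like the following. It is Python rather than PHP or Perl, and it uses SQLite from the standard library as a stand-in for MySQL so that it runs without a server. The table name `pages`, its columns, and the `base_url` parameter are all assumptions made for illustration.

```python
import sqlite3
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_index(root, base_url, conn):
    """Store the stripped text of every *.html file under `root`,
    keyed by its URL. Returns the number of pages indexed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    count = 0
    for page in sorted(Path(root).rglob("*.html")):
        html = page.read_text(encoding="utf-8", errors="ignore")
        url = base_url.rstrip("/") + "/" + page.relative_to(root).as_posix()
        # INSERT OR REPLACE lets the cron job re-run without duplicating rows
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            (url, strip_html(html)),
        )
        count += 1
    conn.commit()
    return count
```

This only handles static HTML. For `.php` pages like the examples earlier in the thread, one option is to index the rendered output rather than the source file, though that is a design choice this sketch does not cover.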
The other advantage is that select statements on a MySQL DB are faster and far more flexible than regexps. It's also not as hard on the server; opening and closing plain text files for a search every time a visitor runs one is pretty hard on a server disk.
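To give a feel for the select side, here is a small self-contained sketch. Again it is Python with SQLite standing in for MySQL, and the `pages` table, its columns, and the `search_index` function are hypothetical names chosen for the example. The sample rows reuse the Pancakes/Hotdogs example from earlier in the thread.

```python
import sqlite3


def search_index(conn, term, limit=20):
    """Return the URLs of indexed pages whose content contains `term`."""
    # Escape LIKE wildcards so the visitor's input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT url FROM pages WHERE content LIKE ? ESCAPE '\\' "
        "ORDER BY url LIMIT ?",
        ("%" + escaped + "%", limit),
    )
    return [url for (url,) in rows]


# Tiny in-memory index for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("/example_1.php", "Pancakes"), ("/example_2.php", "Hotdogs")],
)
print(search_index(conn, "pancake"))  # prints ['/example_1.php']
```

The search term is passed as a bound parameter rather than pasted into the SQL string, which matters for anything a visitor can type into a search box.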
If you're good at scripting, you can even develop a few subs to use this clean content for generating meta tags for keywords and descriptions that are truly unique and specific to the pages rather than having to hand code them.
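One possible shape for such a sub, sketched in Python under some loud assumptions: keywords are simply the most frequent non-trivial words, the description is the start of the clean content cut at a word boundary, and the stopword list, the 155-character limit, and the name `make_meta` are all made up for this example.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "are", "for", "with", "that", "this", "from", "you"}


def make_meta(content, max_keywords=5, max_description=155):
    """Derive (keywords, description) from already-stripped page content."""
    words = re.findall(r"[a-z0-9']+", content.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    keywords = [word for word, _ in counts.most_common(max_keywords)]

    # Collapse whitespace, then truncate at the last full word that fits.
    description = " ".join(content.split())
    if len(description) > max_description:
        description = description[:max_description].rsplit(" ", 1)[0] + "..."
    return keywords, description
```

Plain word frequency is crude, so treat the output as a starting point for the meta tags rather than a finished result.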
It's a thought, and much more flexible than plain page searches.
That is exactly what I started this thread for. Check out my explanation in my second post of this thread:
What I meant by an 'indexing script' is this: a script that is pointed at certain directories, sucks the content (text) out of the files, and sticks it into a searchable format (either a db or a flat file) while associating it with the file name. The script would run as a periodic cron job.
But specifically I was looking for confirmation that this isn't a silly idea, and for recommendations on existing scripts (I could do it myself if I had to).
So judging by the responses here, it doesn't seem like a silly idea. I'm more of a PHP guy, but people tell me Perl is good at this kind of thing.
Your insistence on not using a database caused some confusion. It's always useful when asking a question here to state what you are trying to accomplish, not how you think it might be accomplished. :)
...without putting all content in a database?
...stick into a searchable format (either a db or a flat file) while associating the file name. The script will run as a periodic cron job.
What I really mean is I don't want the original content (files) to live in a database, but I do want to put the results of the indexing script in a searchable format, probably a database.
Sorry for the confusion.
That is exactly what I started this thread for. Check out my explanation in my second post of this thread:
Bill sez "duh" and sorry, it was about 4 AM when I posted my reply. Yeah that's the way to do it, it's not all that difficult to code either. Once you have raw content in the DB, there's all sorts of cool stuff you can do with the data.