joined:Jan 18, 2008
So I've had a search script I've been honing for a few years on a custom content management system I've built. It started out really basic but it's gotten pretty good. I just need to get a few things figured out for usability and speed.
Right now it searches through all areas of the site (blogs and their comments, products/categories, static pages, polls, etc.) using UNION. The only fields it needs to return are table, id, name, teaser, and last edit date. So pretty basic there. I also have the logic that looks through a predefined set of fields for each table (blogs would be name/title, content.... products would be name/title, content, partnumber, manufacturer, etc) in the WHERE statement for each UNION.
I explode out the words of a query to do a broad search so 'blue shirt' will match anything with 'blue' OR 'shirt'. No problem.
Next was ordering the results with the most relevant first. This was where it got a little more tricky. But I was able to set up a calculated 'relevance' field that was weighted so that exact matches have a higher weight, matches in the title field have more weight than matches buried somewhere in the content, and number of appearances of search terms adds weight. I got it to work beautifully. The way it works is:
suppose a search for 'blue shirt' against a table with 2 fields (name, content). substr_count() is a function that just counts the appearances of one string in another. The weight variables are just integers I change until I get the right balance.
(substr_count($table.name, 'blue shirt') * $name_field_weight * $exact_match_weight)
(substr_count($table.content, 'blue shirt') * $content_field_weight * $exact_match_weight)
(substr_count($table.name, 'blue') * $name_field_weight)
(substr_count($table.content, 'blue') * $content_field_weight)
(substr_count($table.name, 'shirt') * $name_field_weight)
(substr_count($table.content, 'shirt') * $content_field_weight)
The sum of all these are the relevance (higher numbers show first)
I've also extended the functionality so that remote items will trigger a result show for it's parent (a match in a blog comment will show the blog in the search result). However the way I've set this up isn't very optimized (I don't think). I created an additional UNION statement for each sub section. So that the same blog might end up as a result in the blogs UNION as in the comments UNION. To prevent the result from being shown twice, I wrapped all of the unions in an outer query like so
SUM(rel) AS rel2
) AS combined
LIMIT $limit_start, $limit_span
This adds up the weight fields for items that might show in multiple UNION sets (AS rel2). It seems like this isn't the best way to do it, but I can't think of a reason why it would be slower than any other way. I feel like I should be doing JOINs in the UNIONs. Ex
'blogs' AS table,
blogs.blg_id AS id,
($weight_calculation_sql) AS rel,
LEFT JOIN comments ON comments.blg_id = blogs.blg_id
blogs.name LIKE '%QUERY%'
blogs.content LIKE '%QUERY%'
comments.message LIKE '%QUERY%'
This code works fine for blogs since there is only 1 join. But when it comes to tables that have multiple unrelated joins, it doesn't calculate the rel field right. For instance suppose you are searching products and joining on an attributes table and a features table. If the search term is matched 2 times in linked attributes and 3 times in the linked features, the function that counts the appearance of search term will return 6 instead of 5. I always feel like there is an SQL feature that would allow for this type of scenario, but I've never come across a good solution. It would be so helpful for other things too.... to have two or more sets of aggregate data per query that don't interfere with each other.
It seems like I need to do joins instead of sub-queries because of how I have the weight field generated. Unless anyone has any brilliant ideas?
Any I was just hoping for some feedback/suggestions or if anyone else wants to share their logic for in depth searches, I'd like to see it.