
Home / Forums Index / Code, Content, and Presentation / Databases

Databases Forum

MySQL Match Against + a very expensive join. How do I optimize this?
This seems like an impossible problem

 1:02 am on Nov 12, 2010 (gmt 0)

oy vey, do I ever have a problem.

I have one table, called "items", with about 800,000 rows. It holds various spec data about products.

Each product can be described in multiple languages, so I have another table, called "itemtext", keyed on items.id plus a "language" column (those two make up a unique index), plus two more columns holding that item's product name and description in that language.
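For reference, here is a minimal sketch of the schema as described; column types, engine, and anything beyond items.id, itemid, language, name, and description are assumptions:

```sql
-- Hypothetical DDL matching the description above; types are guesses.
CREATE TABLE items (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  active  TINYINT(1)   NOT NULL DEFAULT 1,
  deleted TINYINT(1)   NOT NULL DEFAULT 0,
  -- ...other product spec columns...
  PRIMARY KEY (id)
) ENGINE=MyISAM;

CREATE TABLE itemtext (
  itemid      INT UNSIGNED NOT NULL,   -- references items.id
  language    CHAR(2)      NOT NULL,   -- e.g. 'en', 'de'
  name        VARCHAR(255) NOT NULL,
  description TEXT         NOT NULL,
  UNIQUE KEY (itemid, language),
  FULLTEXT KEY ft_name_desc (name, description)
) ENGINE=MyISAM;  -- FULLTEXT indexes required MyISAM in 2010-era MySQL
```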

I used to have a search function that used a FULLTEXT index on the items table; that was back when the name and description lived in that table. It was slow but tolerable, usually returning results in 1 to 3 seconds.

Now you'd suppose that it would be easy to apply the fulltext index to the itemtext table, and it is. But my problem is that there are columns in `items` that also need to be in the WHERE clause, notably: deleted (indicating that a product is queued for deletion), active, and some other flags that affect whether a product should be visible in the search.

What I end up with is a query like this:

SELECT itemtext.*,
MATCH (itemtext.name,itemtext.description) AGAINST ('word') as score
from items,itemtext
WHERE items.id = itemtext.itemid
AND MATCH (itemtext.name,itemtext.description) AGAINST( 'word' IN BOOLEAN MODE)
AND items.active = 1
AND items.deleted = 0

I may have the syntax wrong there... I'm recalling it from memory (the kind in my head, not the kind that crashed when I ran that query).

Now, as I was writing it, I knew this was going to be a nastily expensive query. It's already using a fulltext index - it has to - but on top of that it's doing a huge JOIN between items (800K rows) and itemtext (over 1M rows). Since MySQL will generally use only one index per table in a query, you can guess that the join, even though it's on a primary key, is going to be insane.

Indeed, when I tried it, MySQL crashed. It crashed slowly and agonizingly, over 5 minutes of despair and anguish. It could only be resuscitated with a KILL and RESTART.

I need a viable search option for this site. But because of this JOIN, the fulltext index is probably not going to be possible.

What are my options?

Are there any good denormalizing solutions that will provide search functionality without relying on a real-time query of the two tables?

Reluctantly, I could maintain a duplicate, pre-filtered itemtext table that only includes rows where items.deleted = 0 and items.active = 1... that would work, and it would eliminate the need for the JOIN. But how do I keep it in sync? Must I write this secondary index maintenance into the application code?
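One way to keep such a pre-filtered copy in sync without touching application code is a trigger on `items`. This is only a sketch: the shadow table `itemtext_live` (assumed to have the same columns as `itemtext`) and the trigger name are hypothetical.

```sql
-- Hypothetical: drop an item's rows from the searchable copy when it
-- becomes inactive or deleted, and re-add them when it becomes visible.
DELIMITER //
CREATE TRIGGER items_visibility AFTER UPDATE ON items
FOR EACH ROW
BEGIN
  IF NEW.active = 1 AND NEW.deleted = 0 THEN
    REPLACE INTO itemtext_live
      SELECT * FROM itemtext WHERE itemtext.itemid = NEW.id;
  ELSE
    DELETE FROM itemtext_live WHERE itemid = NEW.id;
  END IF;
END//
DELIMITER ;
```

Inserts and updates on itemtext itself would need matching triggers, so this only pays off if the visibility flags change far more often than the text does.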



 10:03 am on Nov 12, 2010 (gmt 0)

Well, it seems to me the query optimizer should already be doing this, but since it crashes I'm gonna go out on a limb and suggest it anyway:

How about putting your fulltext search in a nested query, with a LIMIT on the number of rows returned (assuming you want to page your search results), and THEN performing the join? Then you would only have to join X rows against your 800K items...

Now, as I said, the query optimizer might already be doing this and you'll need another approach, but just the same, I think it's worth a try.

Hope it helps


 12:14 pm on Nov 12, 2010 (gmt 0)

DWarp9, thanks!

I tried something similar and it works great.

query #1:

select distinct itemtext.itemid,
MATCH (`name`,`description`) AGAINST ('".mysql_real_escape_string($q)."') as score
FROM itemtext
WHERE MATCH(`name`,`description`) AGAINST ('".mysql_real_escape_string($q)."' IN BOOLEAN MODE)
ORDER BY score DESC

then I put those results in an array named $temp

query #2:

SELECT itemid
FROM items
WHERE itemid IN (".implode(",",$temp).")
AND deleted = 0
AND active = 1

The "WHERE itemid IN" clause feeds the temp results back in; the other two clauses weed out the rows that are deleted or inactive.

then I cache that result and use it to paginate without making the query repeatedly.

The result is that I don't necessarily get a full 500 hits, but I can cope with that emotionally.
Importantly, they're filtered the way I want and still ordered by their score, I can easily paginate the list, and the queries are fast.


 1:35 pm on Nov 12, 2010 (gmt 0)

Glad to hear it!

You could save yourself a round-trip to the db by nesting the first statement like so:
SELECT itemid
FROM items
WHERE itemid IN
(select distinct itemtext.itemid
FROM itemtext
WHERE MATCH(`name`,`description`) AGAINST ('".mysql_real_escape_string($q)."' IN BOOLEAN MODE)
LIMIT 500)
AND deleted = 0
AND active = 1

(Note the subquery selects only itemid - an IN subquery must return a single column, so the score can't ride along here.)

If you want a little more flexibility, assuming you have a variable $pagenum holding the current page number, you could compute the limit like this:

LIMIT ".(($pagenum - 1) * $pagesize).", ".$pagesize

It might be slower, since you don't cache the query, but you could probably work around that as well.


 6:11 pm on Nov 14, 2010 (gmt 0)

I found after a couple of test searches that the ORDER of relevance is lost when I do that "WHERE IN" query. The results come back in primary-key order, not in the order of the items in the "IN" clause.
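If you want the second query to honor the order of the IN list, MySQL's FIELD() function can do it in the ORDER BY - a sketch, reusing the $temp array from the two-query version above:

```sql
SELECT itemid
FROM items
WHERE itemid IN (".implode(",",$temp).")
AND deleted = 0
AND active = 1
ORDER BY FIELD(itemid, ".implode(",",$temp).")
```

FIELD() returns the position of itemid within the listed values, so rows come back in the same order as $temp, at the cost of a filesort on the (small) result set.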

I fiddled with it a little last night, and found that doing a weighted match against `name` and `description` gave better results. This way the word frequency in the name counts for more than in the description.

The new query uses a derived table (a subquery in the FROM clause) instead of an array in PHP-space.

It looks like this:

select items.itemid, serps.score
from items,
(select distinct itemtext.itemid,
1.5 * (MATCH (`name`) AGAINST ('".mysql_real_escape_string($q)."'))
+ 0.5 * (MATCH (`description`) AGAINST ('".mysql_real_escape_string($q)."')) as score
FROM itemtext
WHERE MATCH(`name`,`description`) AGAINST ('".mysql_real_escape_string($q)."' IN BOOLEAN MODE)
LIMIT 15000) AS serps
where serps.itemid = items.itemid
AND items.deleted = 0
AND items.active = 1
ORDER BY serps.score DESC

The results are good! Some sample searches came up with nice, relevant results; it's quick, and the caching makes all subsequent searches and pagination very fast.


 2:30 pm on Nov 15, 2010 (gmt 0)

The trick is the derived table, aliased "serps", built within the query.

In my first attempt I was joining the two full tables first, then filtering that JOIN with a WHERE clause. No wonder it was inefficient.

In the current version (above), the `serps` derived table contains only matching items, so JOINing it to the items table is a much lighter operation - it joins only as many rows as there are matches. The weighted scores on the two columns do a better job of ranking relevant matches. It's done with one query instead of two, and doesn't need an array to hold the intermediate matches. The query optimizer does a rather good job on this one.

A lot better than what I started with.

© Webmaster World 1996-2014 all rights reserved