
Forum Moderators: coopster & jatar k


related stories/text algorithm/code?

     
12:10 am on Oct 26, 2007 (gmt 0)

Junior Member

5+ Year Member

joined:Aug 29, 2007
posts:82
votes: 0


Hey guys, I haven't been able to figure this out. Does anyone have code that compares, let's say, a paragraph of description text to the other descriptions in the db and selects the records whose descriptions are related?

I have been thinking about stripping out certain words so that only subject-type words are left, but I haven't been able to get close.

Any help would be greatly appreciated, thanks!

7:47 am on Oct 26, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 9, 2005
posts:817
votes: 0


1 - store the posted description in a variable
2 - use SELECT * FROM db_table WHERE description LIKE '%storedVar%'

If you want to do it yourself, follow the steps above. If you want us to do it for you, keep waiting ;)
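
For illustration, here is a minimal sketch of the two steps above. It is written in Python with an in-memory SQLite database rather than PHP/MySQL, and the `stories` table, its columns, and the `find_by_substring` helper are all made up for the example. It binds the stored text as a query parameter instead of pasting it into the SQL string, and it escapes the LIKE wildcards so the text is matched literally.

```python
import sqlite3

def find_by_substring(conn, fragment):
    """Return (id, description) rows whose description contains `fragment`."""
    # Escape LIKE wildcards so the stored text is matched literally.
    escaped = (fragment.replace("\\", "\\\\")
                       .replace("%", "\\%")
                       .replace("_", "\\_"))
    cur = conn.execute(
        "SELECT id, description FROM stories "
        "WHERE description LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),  # bound parameter, not string concatenation
    )
    return cur.fetchall()

# Hypothetical sample data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO stories (description) VALUES (?)",
    [("Mayor opens new bridge downtown",),
     ("New bridge opens after two years of work",),
     ("Local team wins championship",)],
)

for row in find_by_substring(conn, "new bridge"):
    print(row)
```

Note that the whole stored text has to appear as one contiguous substring, so a full paragraph will rarely match anything except a copy of itself.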

7:15 pm on Oct 29, 2007 (gmt 0)

Junior Member

5+ Year Member

joined:Aug 29, 2007
posts:82
votes: 0


Well, how does that find related stories? It will only find stories whose description contains the stored variable exactly. That seems limited... I want to find stories that are related by topics, names, events, etc. Thanks!

8:30 pm on Oct 29, 2007 (gmt 0)

Preferred Member

5+ Year Member

joined:Jan 16, 2007
posts:477
votes: 0


This is kind of slow, but you can try splitting your description paragraph into an array of words with explode() or split(), filtering out prepositions ('on', 'in', 'up', etc.) and conjunctions ('and', 'for', 'nor', 'yet', etc.). Loop through this array and compare each word to the db with a LIKE '%word%' query, as phparion suggested. You're bound to find a lot of results. You can improve your results by coding it so that the results with the most matches go on top.
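
A rough sketch of that idea, in Python with SQLite standing in for PHP and MySQL. The `stories` table, the sample rows, the tiny stop-word list, and the `keywords` and `related` helpers are assumptions made for the example, not anything from a real library.

```python
import re
import sqlite3

# Tiny illustrative stop-word list; a usable one would be much longer.
STOP_WORDS = {"a", "an", "the", "on", "in", "up", "of", "to",
              "and", "for", "nor", "yet", "but", "or"}

def keywords(text):
    """Lowercase the text, split it into words, drop stop words and repeats."""
    result = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word not in STOP_WORDS and word not in result:
            result.append(word)
    return result

def related(conn, text, limit=5):
    """Return (story_id, match_count) pairs, most keyword matches first."""
    counts = {}
    for word in keywords(text):
        # One LIKE query per keyword: simple, but slow on large tables.
        rows = conn.execute(
            "SELECT id FROM stories WHERE description LIKE ?", (f"%{word}%",)
        )
        for (story_id,) in rows:
            counts[story_id] = counts.get(story_id, 0) + 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

# Hypothetical sample data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO stories (description) VALUES (?)",
    [("Mayor opens new bridge downtown",),
     ("New bridge opens after two years of work",),
     ("Local team wins championship",),
     ("Mayor praises local team",)],
)

print(related(conn, "The mayor and the new bridge"))
```

Because LIKE matches substrings, a keyword such as "new" would also hit "news" or "renewal", so the counts are only a rough signal.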
9:03 pm on Oct 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 4, 2002
posts:1314
votes: 0


Use similarity metrics.

One such metric is edit distance: essentially, a count of the number of edits needed to change each database paragraph into the target paragraph. The paragraphs with the lowest counts are "most similar".

You may need to calculate using several different metrics and combine the scores in a weighted way that meets your application's quirks.

It's what I do, and it works very well.

Look up the Jaccard index on Wikipedia to get yourself started.
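
To make this concrete, here is a small sketch in Python (not PHP, and not the code referred to above). It computes a Jaccard index over word sets and a word-level Levenshtein edit distance, then blends the two with weights. The function names and the 0.7/0.3 weights are arbitrary assumptions for the example; the right mix depends on the application.

```python
import re

def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def jaccard(a, b):
    """Size of the shared word set divided by the size of the combined word set."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Word-level edit distance rescaled to 0..1, where 1.0 means identical."""
    words_a, words_b = tokens(a), tokens(b)
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(words_a, words_b) / longest

def combined(a, b, w_jaccard=0.7, w_edit=0.3):
    """Weighted blend of the two scores; the weights are illustrative only."""
    return w_jaccard * jaccard(a, b) + w_edit * edit_similarity(a, b)

def most_similar(target, candidates):
    """Sort candidate paragraphs from most to least similar to the target."""
    return sorted(candidates, key=lambda c: combined(target, c), reverse=True)

print(most_similar("mayor opens new bridge",
                   ["local team wins championship",
                    "new bridge opens downtown"]))
```

Jaccard ignores word order while edit distance is sensitive to it, which is one reason to combine them rather than rely on either alone.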

6:38 am on Oct 30, 2007 (gmt 0)

Junior Member

5+ Year Member

joined:Aug 29, 2007
posts:82
votes: 0


Thanks for the replies, guys. Do you think you could share some possible code for it? Does anyone out there have any? Thanks!

4:38 pm on Oct 31, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 4, 2002
posts:1314
votes: 0


I generally use REBOL rather than PHP for server-side scripting, so the similarity metrics code I use is in that language.

I am not aware of any canned code in PHP. To find the code I use, search Google for
rebol simetrics

You may be able to recast the algorithms into PHP given the above reference implementation.

5:50 pm on Oct 31, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:July 12, 2007
posts:766
votes: 0


There is something on SourceForge that may help -

[sourceforge.net...]

1:29 pm on Nov 1, 2007 (gmt 0)

Preferred Member

5+ Year Member

joined:July 31, 2006
posts:629
votes: 0


I think full-text search would help.

[dev.mysql.com...]

It will match even if some words are missing or rearranged. The result set can be sorted by relevance. It works much faster than LIKE, but it uses more disk space than regular indices.