

Checking article uniqueness

using php to compare two articles for uniqueness


msykes

6:20 pm on Jun 17, 2010 (gmt 0)

10+ Year Member



I'm trying to develop a PHP script that will compare two articles and generate a uniqueness percentage. A good example of software that does this is [dupecop.com...]

I want a PHP script that does this and runs on my own server. Although I've tried combinations of similar_text() and the like, I haven't found a method that gives a percentage close to the sites and services out there that are geared towards making sure your spun articles are unique enough.
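For reference, this is roughly what I've been trying so far (just a sketch; the filenames are made up):

<?php
// PHP's built-in similar_text() sets $percent (by reference) to a
// similarity percentage; uniqueness is then 100 minus that figure.
$article1 = file_get_contents('article1.txt'); // made-up filenames
$article2 = file_get_contents('article2.txt');

similar_text($article1, $article2, $percent);
printf("Similarity: %.1f%%, uniqueness: %.1f%%\n", $percent, 100 - $percent);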

There are a couple more sites that give numbers similar to dupecop's for the same comparisons, so I figured there must be something out there already... but alas, I can't find it.

Can anyone point me in the right direction?

Readie

6:26 pm on Jun 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a few ideas. The easiest to implement that I can think of is to split both articles into one-character pieces and compare them position by position (rough sketch below the example).

The problem with that, though, is that if the following two strings were compared, it would say "50% unique":

abcdefgh
abcdxefg

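To illustrate, a rough sketch of that position-by-position comparison (the function name is just made up):

<?php
// Compare two strings position by position and count matching characters.
// One inserted character shifts everything after it, which is why
// "abcdefgh" vs "abcdxefg" scores only 50% similar here.
function positionalSimilarity($a, $b)
{
    $len = max(strlen($a), strlen($b));
    if ($len === 0) {
        return 100.0;
    }
    $matches = 0;
    $shorter = min(strlen($a), strlen($b));
    for ($i = 0; $i < $shorter; $i++) {
        if ($a[$i] === $b[$i]) {
            $matches++;
        }
    }
    return 100 * $matches / $len;
}

echo positionalSimilarity('abcdefgh', 'abcdxefg'); // 50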

[edit]

Actually, another thought:

Split both up into arrays of one-character pieces, then count how often each character appears (you could do this with explode(" ", $article) for a by-word version too). Something like the sketch below.
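A rough sketch of the counting idea, done by word (the helper names are made up):

<?php
// Count how often each word appears in each article, then score by how
// much the two frequency tables overlap.
function wordFrequencies($text)
{
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

function frequencyOverlap($a, $b)
{
    $freqA = wordFrequencies($a);
    $freqB = wordFrequencies($b);
    $shared = 0;
    foreach ($freqA as $word => $count) {
        $shared += min($count, isset($freqB[$word]) ? $freqB[$word] : 0);
    }
    // Share of the longer article's words that also appear in the other
    $total = max(array_sum($freqA), array_sum($freqB));
    return $total > 0 ? 100 * $shared / $total : 100.0;
}

echo frequencyOverlap('the quick brown fox', 'the quick red fox'); // 75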

msykes

7:13 pm on Jun 17, 2010 (gmt 0)

10+ Year Member



Hmm... that's a possibility actually. I'll try that out and see.

However, I'm pretty sure something more complex is needed. I want something with accuracy similar to Google's content checking system (not that we know how accurate it is) at a document-to-document level. I'm betting they throw some LSA (latent semantic analysis) techniques in there, or something like that.

Readie

7:34 pm on Jun 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google's content checking system

I suspect Google has been working to perfect that system for several years, and since they can hire the best of the best, it seems to me you won't be able to achieve the same level of accuracy.

My advice? Start basic, and work at it here and there as ideas occur to you.

[edit]

Oooh, I just turned senior member...

Anyways, idea again :)

Have multiple systems (the two I suggested above, plus any more you can think of), take a value from each, then combine the values to work out an average. You could possibly configure it to give greater weight to one system than another, etc. Something like this:
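(A rough sketch; the method names and weights here are only placeholders.)

<?php
// Combine percentage scores from several comparison systems into one
// figure, weighting some systems more heavily than others.
function weightedAverage($scores, $weights)
{
    $sum = 0;
    $weightTotal = 0;
    foreach ($scores as $method => $score) {
        $w = isset($weights[$method]) ? $weights[$method] : 1;
        $sum += $score * $w;
        $weightTotal += $w;
    }
    return $weightTotal > 0 ? $sum / $weightTotal : 0.0;
}

$scores  = array('positional' => 50.0, 'wordFrequency' => 75.0);
$weights = array('positional' => 1, 'wordFrequency' => 2); // made-up weights
echo weightedAverage($scores, $weights); // 66.67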

penders

10:20 pm on Jun 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What about splitting the articles up into words and comparing that way? Just my 2c

I would have thought there would be a few recognised mathematical/computational algorithms for calculating similarity/uniqueness, so you may need to search for the method rather than the implementation. Sorry I can't be more help, I'm afraid.
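Actually, one recognised measure that might fit is the Jaccard index: the number of distinct words the two articles share, divided by the number of distinct words across both. A rough sketch:

<?php
// Jaccard index over the sets of distinct words in two articles:
// |intersection| / |union|, expressed as a percentage.
function jaccardSimilarity($a, $b)
{
    $wordsA = array_unique(preg_split('/\s+/', strtolower($a), -1, PREG_SPLIT_NO_EMPTY));
    $wordsB = array_unique(preg_split('/\s+/', strtolower($b), -1, PREG_SPLIT_NO_EMPTY));
    $intersection = count(array_intersect($wordsA, $wordsB));
    $union = count(array_unique(array_merge($wordsA, $wordsB)));
    return $union > 0 ? 100 * $intersection / $union : 100.0;
}

echo jaccardSimilarity('the quick brown fox', 'the quick red fox'); // 60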

msykes

8:31 am on Jun 18, 2010 (gmt 0)

10+ Year Member



I was afraid that mentioning Google would get that response :). In no way do I expect (or want to pay for!) something as accurate as Google at checking for duplicate content as a whole. After all, they check billions of pages against each other for duplication, with a lot of contextual checks too.

I'm more interested in direct comparison, and I'm pretty sure from what I've read in their press releases that they don't have something that elaborate (or at least not so elaborate that we can't reproduce it to some degree). After all, the computing power needed to compare every page online against every other page is a tough job even for Google!

penders, the idea about splitting into words might be a good one too. I'll try these two solutions and see how well they fare, though I fear more will be needed. Thanks!