Forum Moderators: coopster

NickCoons

5:03 pm on May 8, 2004 (gmt 0)

10+ Year Member



This is probably less a PHP question and more a conceptual one.

I have a site about a particular region. The site has a growing amount of content, and the "vision" is to cover everything imaginable about this region. Obviously a journey more than a destination :-).

But I'd like to build some interlinking. For instance, if I have a review of a restaurant in this region, and an article about it is posted elsewhere on the site, I'd like the two to link together.. but not manually, and there is the dilemma.

I don't want to have to remember everything that's on the site so that when something new is added that I link the old page with the new one. I'd like the site to know, based on the content of the pages, that they are related in content and that they should link together.

This is just one of several features already on the site for easy accessibility. One that I currently have in place is a real-time glossary. While a page is loading, its content is scanned against the glossary to see if any glossary word exists.. if so, it's hyperlinked to the definition. This means that all I have to do is add a word to the glossary, and every page with that word changes instantly sitewide.
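A minimal sketch of how such a glossary pass might look in PHP (the array format, URL scheme, and function name here are my own assumptions, not NickCoons's actual code):

```php
<?php
// Hypothetical real-time glossary pass: scan page content for
// glossary terms and hyperlink the first occurrence of each.
function link_glossary_terms($html, $glossary)
{
    foreach ($glossary as $term => $definitionUrl) {
        // \b keeps "widget" from matching inside "widgets-r-us".
        $pattern = '/\b(' . preg_quote($term, '/') . ')\b/i';
        $replacement = '<a href="' . htmlspecialchars($definitionUrl) . '">$1</a>';
        $html = preg_replace($pattern, $replacement, $html, 1); // first occurrence only
    }
    return $html;
}

echo link_glossary_terms(
    'A widget is a small thing.',
    array('widget' => '/glossary.php?term=widget')
);
// A <a href="/glossary.php?term=widget">widget</a> is a small thing.
```

Linking only the first occurrence per page keeps the output readable; a production version would also need to avoid re-linking text that already sits inside an existing tag.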

Adding a related-links section to the bottom of each page that links that page to every other related page on the site would be another way of making the site "smart" and easy to use for my visitors.

Ideas?

henry0

8:50 pm on May 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do about the same on a site that has business members. Here is what I do through my CMS:

In the add/edit form I have a drop-down box labeled "select a category". For example (to follow your idea), the options would be restaurant, bakery, winery, etc. Every time you add or edit, the selected option (that is, the category) is sent to the DB. To keep it simple: add a column "category" to your table, and put options such as resto, bake, wine, and so on in the drop-down. Next, create a query that is processed into links at the bottom of the page by doing a

.....where category=resto....
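A rough PHP sketch of that query (the articles table and its columns are made up for illustration; the era-appropriate mysql_* calls would fetch and print the rows):

```php
<?php
// Build the "...where category=resto..." related-links query,
// excluding the page being viewed. Names are illustrative only.
function related_links_sql($category, $currentId)
{
    // addslashes() stands in for proper escaping in this sketch.
    return "SELECT id, title FROM articles"
         . " WHERE category = '" . addslashes($category) . "'"
         . " AND id <> " . (int)$currentId
         . " LIMIT 10";
}

// Each row returned by mysql_query() on this SQL would then be
// printed as a link at the bottom of the page.
echo related_links_sql('resto', 42);
```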

Hope this helps,

Henry

NickCoons

6:42 am on May 9, 2004 (gmt 0)

10+ Year Member



I can't say I know for sure what you're suggesting, but if I had to guess, I'd guess you mean I should categorize each type of article and then use that to interlink.

That's not really the problem, though. Everything is already in its own category. What I'd like is for pages to link together automatically because they have related content. If I have a review of a restaurant, and then I write an article where the restaurant is mentioned, I'd like those two to link together.. or perhaps the article to link to the review, but not the other way around.

This could either be done in real-time (when a page loads, it scans the site for other pages with similar content and links to them, though that's very processor-intensive and time-consuming), or with a cron job that goes through the entire site looking for pages that are related by content and links them together.

Here's the dilemma I've run into so far. If I try a straight text match, words like "is" and "the" match, and so virtually every page links to every other page. I've even tried creating a list of such words to be removed, but that list grew very large very quickly, because I don't want most common words included either (run, jump, fly, computer, car, etc.), since they don't give a real feel for the content.

Instead of creating a list of words to exclude, I've thought about creating a list of words to include, but I think that would be equally tedious. I'm looking for a method to use so that my server can look at articles A and B and say, "These two articles have a similar subject, so I will link them together," or, "Article A mentions something that article B focuses on, so I will link article A to article B."

Seems almost like AI, but I'm hoping someone will be able to point out a possibility that I've overlooked.

lildemon

7:27 am on May 9, 2004 (gmt 0)

10+ Year Member



From the sounds of things, processor usage isn't a problem if you can run it all on cron jobs, etc... So have you thought of using the same idea, matching words with stopwords left out, and requiring a 90% match or something like that? Possibly just match a certain number of words (10 words = related)? Better still, only top 5/10/15 scorers linked, in order of relevance?
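A sketch of that scoring idea, assuming a stopword array and a set of documents keyed by id (all names here are illustrative, not anyone's actual code):

```php
<?php
// Strip stopwords, then score every other document by how many
// keywords it shares with the current one, keeping only the top
// scorers. Purely a sketch.
function keywords($text, $stopwords)
{
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_diff(array_unique($words), $stopwords);
}

function top_related($docKeywords, $otherDocs, $stopwords, $limit = 5)
{
    $scores = array();
    foreach ($otherDocs as $id => $text) {
        $shared = array_intersect($docKeywords, keywords($text, $stopwords));
        if (count($shared) > 0) {
            $scores[$id] = count($shared); // score = shared keyword count
        }
    }
    arsort($scores);                       // highest score first
    return array_slice($scores, 0, $limit, true);
}

$stop = array('the', 'a', 'is', 'in', 'of');
$kw = keywords('The widget factory is in town', $stop);
print_r(top_related($kw, array(
    'review' => 'A review of the widget factory',
    'news'   => 'Unrelated town gossip',
), $stop));
// "review" scores 2 (widget, factory); "news" scores 1 (town)
```

The `$limit` parameter implements the "only top 5/10/15 scorers" cutoff, so a looser stopword list is survivable: random low-scoring overlaps simply fall off the bottom.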

If you cron job it, it would be like having your own search engine crawling the site for you... The AI end of things would be fairly simple.

And you could have the first run through catalogue words for you. Then you can just select the words from that list that you want to exclude, and you would have a much larger starting point than by simply adding words as they occur to you.

This would be a fairly intensive project (Far beyond my scope of talent) but that is probably the concept I would go for, if you were going to want it self-updating.

Bleh. Probably an unrealistic plan, but I can figure out a good chunk of it as I sit here, and I think it should work.

Sorry for using your post as a brainstorming session, I think you gave me several ideas for new toys to program though lol

NickCoons

1:48 pm on May 17, 2004 (gmt 0)

10+ Year Member



lildemon,

I think this is the route I will take. As far as creating a list of stopwords, I'll either gather it from current content or use a dictionary. It's mainly proper names that I want to associate.

One thing I'm trying to work out right now is multiple-word terms. For instance, imagine a popular topic on my site is a place called the "Great Widget." Any time the term "Great Widget" is found in two documents, that should count towards relevancy.

However, I would not want the words "great" or "widget" on their own to mean anything; those would be stopwords. So when comparing my article against my list of stopwords, how would I prevent "Great Widget" from being removed, since both "great" and "widget" appear in the stopword list? It almost seems that I'd have to have a list of words to keep, which I don't want to do.

One thing I could do is make it case-sensitive. It would match "great" but not "Great." But then I have the issue of irrelevant words at the beginning of a sentence, which start with a capital letter. Or people posting in forums without proper capitalization: someone could post about the topic and type "great widget", and while that would be relevant, it won't match a case-sensitive search.
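One possible compromise, sketched below: before stopword removal, pull out runs of two or more capitalized words as candidate proper-noun phrases and protect them. It won't catch lowercase forum spellings like "great widget", and the sentence-start guard is crude, but it covers the common case (everything here is an assumption, not tested against the real site):

```php
<?php
// Find candidate proper-noun phrases: runs of two or more
// consecutive capitalized words, e.g. "Great Widget".
function capitalized_phrases($text)
{
    preg_match_all('/\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b/', $text, $m);
    $phrases = array();
    foreach ($m[0] as $phrase) {
        // Crude guard against sentence-initial capitals: skip a
        // phrase whose first word is a very common sentence opener.
        $first = strtolower(strtok($phrase, ' '));
        if (!in_array($first, array('the', 'this', 'that', 'it'))) {
            $phrases[] = $phrase;
        }
    }
    return array_unique($phrases);
}

print_r(capitalized_phrases('We visited the Great Widget. The Great Widget is old.'));
// keeps "Great Widget"; the sentence-initial "The Great Widget" is skipped
```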

lildemon

5:47 am on May 22, 2004 (gmt 0)

10+ Year Member



Sorry it took so long for me to come back to this post, I was out of town for some time here...

And even more sorry because, due to my (complete) lack of experience in programming search engines, I really don't have an answer for you...

Possible directions to look in:
1 - Dual-pass searching... Perhaps the second pass could look for phrases, using a second stopword list containing only the truly common words (a, the, it, is, etc.) and requiring both words to be over a certain length (3 letters or greater would probably work). I would suggest limiting phrase searching to two words, however, or the complexity (and runtime) goes up immensely.

2 - Sort and only allow the top 5-10 matches (or whatever number suits your design concept) to be accepted... Which should (and I stress _should_) allow you to use fewer stopwords, and count on the likelihood of related topics having a higher relevance to each other.

3 - There is no 3, I'm out of ideas. But again, your questions have given me ideas for new toys, so thanks!
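Point 1 above could be sketched roughly like this: a second pass that keeps two-word phrases, with a much smaller stopword list and a minimum word length (all names are illustrative):

```php
<?php
// Extract two-word phrases, skipping truly common words and
// anything shorter than $minLen letters, per the dual-pass idea.
function two_word_phrases($text, $commonWords, $minLen = 3)
{
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $phrases = array();
    for ($i = 0; $i < count($words) - 1; $i++) {
        $a = $words[$i];
        $b = $words[$i + 1];
        // Keep the pair only if both words are long enough and
        // neither is a truly common word.
        if (strlen($a) >= $minLen && strlen($b) >= $minLen
            && !in_array($a, $commonWords) && !in_array($b, $commonWords)) {
            $phrases[] = "$a $b";
        }
    }
    return array_unique($phrases);
}

print_r(two_word_phrases('It is the great widget of the town', array('the', 'it', 'is', 'of')));
// only "great widget" survives
```

Because only adjacent pairs are considered, the pass stays linear in document length, which is the point of capping phrases at two words.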

PS: Good luck, and please do keep us updated here on how it goes.

NickCoons

3:06 am on May 23, 2004 (gmt 0)

10+ Year Member



I think I've figured out what I wanted to do. First, I thought I'd start off as you suggested; I created a quick script that runs through every piece of content on my site, creates an array (one element for each word), removes duplicates, and sorts them alphabetically. At this point, I have the beginnings of an English dictionary. However, it's very obvious that there are many gaps.

I've decided that I don't mind if the words "great" and "widget" trip the similar-content sensor.. I'll just set the threshold high enough that random occurrences won't be able to trigger the creation of a link.

Now that I have a very long, yet horribly incomplete, list of words that needs to be manually edited, I find myself looking for another solution. The problem is that I'd spend a great deal of time removing words from the list that I don't want as stop words, and then, as new content is added, find new words that should be stop words but are not, so they'd have to be added.. not a task I look forward to.

So I've decided that the topic of a page depends on words of a certain part of speech -- nouns. So I'd like to start my stop words list with the entire English dictionary, and then remove all of the nouns. Obviously, I don't want to do this by hand. Which means that I am now in search of a list of English words, hopefully labeled with their parts of speech. I'll write a quick script that will remove all of the nouns, and voila!

And the rest should be easy.. now I just need to find such a list.
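Assuming a part-of-speech-tagged list can be found in a simple "word<TAB>pos" format (real lists vary, so this is only a sketch), stripping the nouns is a few lines:

```php
<?php
// Build a stopword list from a POS-tagged word list by dropping
// the nouns. The "word<TAB>pos" line format is an assumption.
function nounless_stopwords($lines)
{
    $stopwords = array();
    foreach ($lines as $line) {
        list($word, $pos) = explode("\t", trim($line));
        if ($pos !== 'noun') {            // keep everything except nouns
            $stopwords[] = strtolower($word);
        }
    }
    return $stopwords;
}

print_r(nounless_stopwords(array("run\tverb", "widget\tnoun", "quickly\tadverb")));
// "run" and "quickly" stay; "widget" (a noun) is dropped
```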

NickCoons

6:41 am on May 23, 2004 (gmt 0)

10+ Year Member



Okay, I did all of this. I created my dictionary sans nouns, and wrote the script to run and find comparisons. The dictionary contains about 133,000 words in it. Each word in the dictionary needs to be checked against each word in the document. In the test document, we have 300 words.

That's:

133000 * 300 = 39,900,000 textual checks.

Not pretty. This process takes too long.. I didn't let it run all the way through, but I calculate that it would have taken about 11 minutes to run. That's for one document.. I certainly can't do that for each and every 300 words worth of documents on my site.

Instead, I concatenated the 133,000 terms into one long string using a tab character as a delimiter, and then used strpos to see if each word in the document existed in this string. Obviously this isn't 100% accurate, since a word in the content may exist within another word in the long string but not on its own. Fortunately, this project doesn't require 100% accuracy to be successful. And even better, the time it took to check the same document was 6 seconds, compared to 11 minutes.. well worth it.
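For what it's worth, another option (not from the thread) is to load the word list once as array keys and test membership with isset(); PHP hash lookups are exact, so the substring false positives disappear, and each of the 300 document words costs a single constant-time check:

```php
<?php
// Stopword check via hash lookup instead of strpos on a huge
// string. In real use the set would be built once from the
// 133,000-word file, e.g.
// $set = array_flip(file('stopwords.txt', FILE_IGNORE_NEW_LINES));
// (the file name is hypothetical).
function content_words($text, $stopwordSet)
{
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $keep = array();
    foreach (array_unique($words) as $w) {
        if (!isset($stopwordSet[$w])) {   // O(1) exact membership test
            $keep[] = $w;
        }
    }
    return $keep;
}

print_r(content_words('The widget is great', array_flip(array('the', 'is', 'great'))));
// only "widget" survives
```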

Now I need to make a few additions to my nounless dictionary, have it create the table with each document's "keywords", and then create the script to include at the bottom of each page to actually create the related links on the fly.. then make this run as a cron job.

jamesa

7:22 am on May 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like a workable solution, NickCoons. Here are some other ideas:

>> strpos

If you're using MySQL, check out fulltext indexing. It's very fast, and you can sort by relevancy.
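For the record, the fulltext route looks roughly like this (table and column names are invented; the index is added once with ALTER TABLE articles ADD FULLTEXT (title, body)):

```php
<?php
// Sketch of a fulltext "related links" query. Feeding a document's
// keyword list into AGAINST() lets MySQL do the stopword removal
// and relevance ranking. Table and column names are hypothetical.
function fulltext_related_sql($keywords, $currentId)
{
    $terms = addslashes($keywords); // placeholder for real escaping
    return "SELECT id, title,"
         . " MATCH (title, body) AGAINST ('$terms') AS score"
         . " FROM articles"
         . " WHERE MATCH (title, body) AGAINST ('$terms')"
         . " AND id <> " . (int)$currentId
         . " ORDER BY score DESC LIMIT 10";
}

echo fulltext_related_sql('widget history review', 7);
```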

>> dictionary

Depending on the amount of content you have, and how narrow your topic range is, it could make sense to create a topic list. Think META tags. What you'd normally put in a META description tag could instead go in the db. That's assuming the content is hand generated initially.

Step two, put a search box on the site and add the successful queries to the topic table.

NickCoons

7:58 am on May 23, 2004 (gmt 0)

10+ Year Member



jamesa,

<If you're using MySQL, check out fulltext indexing. It's very fast, and you can sort by relevancy.>

Looks interesting.. this may be useful.

<Depending on the amount of content you have, and how narrow your topic range is, it could make sense to create a topic list. Think META tags. What you'd normally put in a META description tag could instead go in the db. That's assuming the content is hand generated initially.>

The topic is very broad and the content is ever increasing. I did think about this possibility before, but I've come to the conclusion that this would be way too much work. I don't want to have to hand-select keywords for each article posted, and I want this to cover things that are not hand-written (like forum posts).

The code I've written so far seems to work well.. but I do seem to have discovered one flaw. My earlier assumption that nouns are what's needed to describe the topic of a document doesn't hold up. Some nouns are so vague that they don't lend anything to the related-links cause (like the word "place"). In addition, certain non-nouns turn out to be important: "camping" or "cliff-diving" could be crucial to determining the meaning of a document, but my current setup will strip these words out.

So I've figured out the code, and how to have it scan through and create the database. I'm now left with coming up with a valid list of stop-words, and doing so based on some sort of "rule", because doing it by hand (creating a list of 100,000+ words) would be very tedious.

Looking for ideas...

FknBlazed

8:22 am on May 23, 2004 (gmt 0)



I may be way off here, but could you not do what you were referring to earlier: create your script to crawl through all the content, log all the words, remove redundant entries, and insert into the db only if the word is not already there? That would be your word list, and it would keep growing as the site grew. Then you could compare your word list to all the content (possibly stripped of one- and two-letter words) and rate it for relevancy.

Also, make it so everything has a headline. What I mean is: if it's an article, it has a headline (you could force the user to input that in your data-entry form), so only search that. But if it's, say, an excerpt or an informal piece of writing, usually the first couple of lines or the first paragraph sum up the rest of the story. This way you are only logging the most pertinent information.

I think that would shrink the size of what you are trying to do immensely, especially if it gets as big as you describe.
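The headline/first-paragraph idea might look something like this (the function name and the "first paragraph = everything before the first newline" rule are assumptions for the sketch):

```php
<?php
// Index only the most pertinent text: the headline plus the first
// paragraph, capped at a word limit. Details are illustrative.
function summary_text($title, $body, $maxWords = 50)
{
    $firstPara = strtok($body, "\n");  // everything before the first newline
    $words = preg_split('/\s+/', $title . ' ' . $firstPara, -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $maxWords));
}

echo summary_text('Widget News', "First line sums it up.\nThe rest is detail.");
// Widget News First line sums it up.
```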

Hope I made sense there. If I didn't I am sorry.

~FknBlazed

NickCoons

7:45 pm on May 23, 2004 (gmt 0)

10+ Year Member



FknBlazed,

Maybe I'm not clear on exactly what it is that I'm trying to do. Inevitably, a site with a particular topic will end up having one page of content that is related to another page.

Let's say that this site focuses on widgets. In one section, you have the history of widgets, and this is broken down further by types of widgets and the specific history of each. Then, in an entirely different section, let's say widget reviews, you're reviewing a particular type of widget. In this article, you mention how this widget has evolved from how it was originally made, and the historical value of this. Suddenly, you have an article on your site that is somehow related to another article. I would like the site to know that on its own (i.e. without me going in there and saying that these two articles are related), and have them link together at the bottom of the page in a section called "Related Links."

I've got the code that will go through and get rid of stop-words, that will create the table of keywords, and that will search through and find relevancy between documents to link them together (well, that part is mostly done). What I'm stuck on now is where to get a list of stop-words to remove the words that don't matter. I'd like to start with the English dictionary as a list of stop-words, and then have a rule to remove words from it that should not be on this list (because I don't want to do it manually). But I don't know what that rule would be.

<...could you not do what you were referring to earlier: create your script to crawl through all the content, log all the words, remove redundant entries, and insert into the db only if the word is not already there? That would be your word list... Then you could compare your word list to all the content (possibly stripped of one- and two-letter words) and rate it for relevancy.>

This is good in theory.. and I've tried it. Unfortunately, there are WAY too many words longer than one or two letters (even eight-letter words) that are completely irrelevant to the topic of the page, or that don't give any hint about it. "Irrelevant", for instance, is one of those words :-).