Forum Moderators: skibum

Rolling your own contextual engine

Any potential problems?

         

trillianjedi

10:50 am on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



With plenty of word density counting scripts available open-source, I'm considering building a contextual advertising engine to work with my affiliate IDs.

Essentially, one site we have is a very niche topic forum. I could probably condense the main keywords into a list of about 200 words, phrases and widget product names and IDs.

What I would like to happen is, if Bob starts a new thread about a SuperWidgetCompany X135, contextual ads pulled from a database of my affiliate IDs associated with that product are displayed.

It seems to me to be a fairly straightforward bit of coding. The database I'll build by hand with all the relevant associations of words and phrases to affiliate URLs. It's a basic SQL search to pull the results.

It's similar to any other form of contextual advertising, except that, unlike Google AdSense or others, I'm not fighting against people trying to fool the system, as it's all for my own benefit. That keeps the search side fairly simple, as I only need a simple word search - it doesn't need to be "intelligent".

Rather than building a crawler to do the word count, given that the entire forum has a static header, footer and navigation, I'm thinking of running a word count application each time someone makes a post on an existing thread, or creates a new one.

So the word count is only done each time the page content changes.

The top five scoring elements, either by density, or product names/specifics known to the DB, will be stored in the DB record along with the thread data.

Each time the thread is displayed, the SQL query will then pull the ads depending on that top five word/phrase set, or default to something preset (or AdSense).
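As a rough sketch of the scoring step described above, the following Python counts one-, two- and three-word phrases in a thread and keeps the top five that appear in a hand-built keyword list. The function names and the sample keywords are assumptions made for illustration only; nothing here comes from real forum software.

```python
import re
from collections import Counter


def ngrams(words, max_len=3):
    """Yield every run of 1..max_len consecutive words as a single string."""
    for size in range(1, max_len + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def top_terms(thread_text, known_terms, limit=5):
    """Return up to `limit` known words/phrases, most frequent first.

    `known_terms` is a set of lower-case keywords and phrases
    (hypothetical; in practice it would be loaded from the ads database).
    """
    words = re.findall(r"[a-z0-9]+", thread_text.lower())
    counts = Counter(g for g in ngrams(words) if g in known_terms)
    return [term for term, _ in counts.most_common(limit)]
```

The returned list is what would be saved alongside the thread record each time a post is added, so the count only runs when the content changes.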

Has anyone here done this? Any foreseeable problems with such a design?

Thanks,

TJ

linear

6:53 pm on Apr 29, 2005 (gmt 0)

10+ Year Member



There may be efficiency gains by doing it at the database level rather than the page.

I haven't done precisely what you describe, but I have used MySQL's FULLTEXT index to automagically build internal links among pages. A fulltext index on the forum posts would allow you to do the matching against your kw array much more efficiently.

The other bonus is that MATCH() returns a relevance score, saving you from having to define and compute one. You would just need to set a relevance threshold for matching your ad inventory, and show a fallback ad below that threshold.

Going this one better, you could even build a SOUNDEX table for your keywords and catch misspellings too.
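To illustrate the misspelling idea, here is a minimal sketch of the standard four-character American Soundex code in Python, plus a lookup table keyed by that code. It is an assumption that this variant is good enough for the purpose; MySQL's own SOUNDEX() function may return different (and longer) codes, and digits in product IDs such as "X135" are simply ignored here.

```python
from collections import defaultdict

# Consonant groups of standard American Soundex.
_CODES = {}
for _group, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for _letter in _group:
        _CODES[_letter] = _digit


def soundex(word):
    """Return the 4-character Soundex code for `word` ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(letter, "")  # vowels map to '' and reset `previous`
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


def build_soundex_table(keywords):
    """Map each Soundex code to the keywords that share it."""
    table = defaultdict(list)
    for keyword in keywords:
        table[soundex(keyword)].append(keyword)
    return dict(table)
```

A misspelled word from a post can then be looked up with `build_soundex_table(keywords).get(soundex(word), [])`.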

Order your keyword array by value (most expensive first) then MATCH() each keyword AGAINST() your fulltext index until you hit something above a relevance threshold you set.
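The loop described above might look something like this sketch. The relevance lookup is passed in as a plain function so the logic can be shown without a database; in MySQL it would wrap a query along the lines of the `RELEVANCE_SQL` string. The table name `threads`, the columns `body` and `id`, and the function name `pick_ad` are all hypothetical.

```python
# Illustrative only: the kind of query a real `score` function might run,
# assuming a FULLTEXT index on a hypothetical threads.body column.
RELEVANCE_SQL = "SELECT MATCH (body) AGAINST (%s) AS score FROM threads WHERE id = %s"


def pick_ad(keywords, score, threshold, fallback=None):
    """Return the highest-paying keyword whose relevance exceeds `threshold`.

    keywords:  iterable of (keyword, payout) pairs
    score:     callable taking a keyword and returning its relevance
    fallback:  value returned when nothing clears the threshold
    """
    by_value = sorted(keywords, key=lambda pair: pair[1], reverse=True)
    for keyword, _payout in by_value:
        if score(keyword) > threshold:
            return keyword  # stop at the first (most valuable) hit
    return fallback
```

Because the loop stops at the first hit, a well-paying keyword that barely clears the threshold wins over a cheaper one with a much higher score.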

Just some thoughts, I'm sure that can be improved upon also.

jchampliaud

9:05 pm on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting idea, I’m sure you will have to do some tweaking but it seems worth a try.

trillianjedi

10:43 pm on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the info linear - I hadn't even considered using some SQL built-in functionality. That could save a lot of work.

I don't think this idea would scale - it suits me as I'm looking at it as a solution for a very niche subject matter. The word "index" will consequently be very low.

I'll come back and let you know how I get on, and I might post up some code and SQL statements in the PHP forum.

TJ

linear

4:00 am on Apr 30, 2005 (gmt 0)

10+ Year Member



That's the gotcha with FULLTEXT indexes--they are tuned to large bodies of text. The corollary gotcha is that they can violate your expectations badly on small bodies of text.

The more typical way to use one is to do something like a forum search, "all posts that contain foo." So it might be tempting to index the keywords associated with your ads and match against that, but the size of the text body is too small to make it viable (and worse, probably a decent amount of kw repetition).

The real secret sauce is that it a) is natural language search, and b) returns a relevancy score you can use.

trillianjedi

8:38 am on Apr 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks.

I think you're suggesting that FULLTEXT probably won't work in this context?

My original idea was to use a basic word counting script, get a top 5 list of words/2 and 3 word phrases, and match against that, or display a default AdSense etc ad instead in the event of no matches (or a combination of those).

You're right in that this should really be natural language searching, but when I view word counts in threads, there is a natural correlation between basic word count and the ads I would instinctively pick by hand. This is due to the nature of the niche subject matter and the industry keywords used. It wouldn't work on every site or in every niche, but in doing the analysis manually, I notice myself using a very straightforward formula in this particular case.

When I noticed that I immediately thought "automation".

I can adapt a keyword density script to work direct against the database thread text easily enough. I think from that point it really is a case of directly matching keyword density to an advert keyword table. In essence, that's all I need to do manually.
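A minimal sketch of that direct match, using Python's built-in sqlite3 module purely so the example is self-contained. The table `ad_keywords`, its columns, the example URLs and the `FALLBACK_AD` placeholder are all assumptions; the real schema and affiliate links would differ.

```python
import sqlite3

FALLBACK_AD = "adsense"  # placeholder for the default / AdSense slot


def demo_connection():
    """Build a throwaway in-memory advert keyword table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ad_keywords (keyword TEXT, affiliate_url TEXT)")
    conn.executemany(
        "INSERT INTO ad_keywords VALUES (?, ?)",
        [
            ("x135", "https://example.com/x135?aff=HYPOTHETICAL"),
            ("gasket", "https://example.com/gaskets?aff=HYPOTHETICAL"),
        ],
    )
    return conn


def ads_for_terms(conn, terms, limit=3):
    """Return affiliate URLs for the thread's stored top terms, best term first."""
    urls = []
    for term in terms:  # terms are assumed to arrive ranked by density
        rows = conn.execute(
            "SELECT affiliate_url FROM ad_keywords WHERE keyword = ?", (term,)
        )
        for (url,) in rows:
            if url not in urls:
                urls.append(url)
    # No match at all: fall back to the preset ad.
    return urls[:limit] or [FALLBACK_AD]
```

Calling `ads_for_terms(demo_connection(), ["x135", "gasket"])` returns the two example URLs in that order, while a thread with no matching terms gets the fallback.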

It will have flaws, but if I can achieve topicality even 50/60% of the time, it will be better than what I can do manually, simply because it's impossible for me to keep up with new and changing threads on a daily basis due to activity on the forum.

TJ

linear

12:43 pm on Apr 30, 2005 (gmt 0)

10+ Year Member



I guess I was suggesting that the usual way people use a FULLTEXT index doesn't map well onto your project. But I think it is still worth a look.

I would prefer to let the MATCH() find my best match, but there's not a way to do that in a single query, unfortunately. My proposed algorithm above would favor a high-paying match over a more relevant match, so that's an obvious drawback versus your approach.