Forum Moderators: coopster
When breaking webpages up into elements, I've been doing it a rather slow way (at least I think so), and it seems there's a lot of room for improvement in terms of speed.
This is the way I've been doing it up to now, though for the next batch of text I want it done a bit faster:
1) Fetch 1 page from database
2) Remove all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings, etc.
3a) For each paragraph, for each word, check if the word exists in the db, and if not, insert it.
This means walking through each paragraph word by word and inserting words into the db if they're not already there.
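Step 3a above can be sketched like this (a minimal Python/SQLite sketch; the one-column `words` table is a made-up schema for illustration, not your actual setup):

```python
import sqlite3

# Hypothetical schema just for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")

def index_paragraph_slow(paragraph):
    """One SELECT round-trip per word, then an INSERT if it's missing."""
    cur = conn.cursor()
    for word in paragraph.lower().split():
        cur.execute("SELECT 1 FROM words WHERE word = ?", (word,))
        if cur.fetchone() is None:
            cur.execute("INSERT INTO words (word) VALUES (?)", (word,))
    conn.commit()

index_paragraph_slow("the quick brown fox jumps over the lazy dog")
```

Every word costs at least one query, duplicates included -- that's where the time goes.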
A possible alternative I've been pondering is iterating through the words and just keeping a tab on each word's place in a paragraph, and which paragraph it's in, and then weeding out all duplicate words. This requires flagging where each word appears, but it saves on the number of queries to the database by only querying each word once. Something like:
1) Fetch 1 page from database
2) Remove all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings, etc.
3) For every unique word in document, check if word exists in db, and if not, insert it.
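The dedup pass in those steps might look like this (Python sketch; the position bookkeeping is just a dict of lists I made up, not a particular library):

```python
from collections import defaultdict

def collect_unique_words(paragraphs):
    """Walk the document once, recording (paragraph, position) for each
    word, so each distinct word only hits the database a single time."""
    occurrences = defaultdict(list)
    for p_no, paragraph in enumerate(paragraphs):
        for pos, word in enumerate(paragraph.lower().split()):
            occurrences[word].append((p_no, pos))
    return occurrences

occ = collect_unique_words(["the cat sat", "the dog ran"])
# "the" appears twice in the text but is only one key to query/insert
```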
As an add-on to that alternative, perhaps assigning DOCIDs to the words and processing 10 pages at a time might be more efficient, instead of doing one at a time.
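That batching idea could be sketched like so, again assuming SQLite and a made-up two-table schema (`words` plus a `postings` table); `INSERT OR IGNORE` stands in for the per-word existence check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word TEXT PRIMARY KEY);
    CREATE TABLE postings (word TEXT, docid INTEGER,
                           paragraph INTEGER, position INTEGER);
""")

def index_batch(pages):
    """pages maps docid -> list of paragraphs. Dedupe across the whole
    batch in memory, then push everything in two bulk statements."""
    unique_words = set()
    postings = []
    for docid, paragraphs in pages.items():
        for p_no, paragraph in enumerate(paragraphs):
            for pos, word in enumerate(paragraph.lower().split()):
                unique_words.add(word)
                postings.append((word, docid, p_no, pos))
    # No SELECT round-trips: OR IGNORE skips words already in the table
    conn.executemany("INSERT OR IGNORE INTO words (word) VALUES (?)",
                     [(w,) for w in unique_words])
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", postings)
    conn.commit()

index_batch({1: ["the cat sat"], 2: ["the dog ran fast"]})
```

Two `executemany` calls replace hundreds of single-word queries, and the postings table keeps the per-paragraph word positions you'd otherwise lose by deduplicating.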
Safe to say that checking word1 > word2 > word3 in each paragraph of every single page is one of the less efficient ways of doing this.
Has anyone got a suggestion/formula/practice that could make this process a bit speedier? Maybe there's even a formula out there that says "this is the best way to do it", but I'd be quite satisfied with "the better ways to do it" for now :)
If your pages have good headlines/subheads then the job is actually a lot easier than it first appears. The biggest problem is that you would end up indexing all words, when in reality you only need to index the important ones. Some of the open-source search engines come with tables of blocked words (stop words). These lists would be a good way to clean your search index after you have compiled it.
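Filtering against such a stop-word list is nearly a one-liner (the list here is a tiny made-up sample; the real lists shipped with open-source engines run to hundreds of words):

```python
# Tiny illustrative stop-word list -- real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def important_words(text):
    """Keep only the words worth indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

important_words("the quick brown fox is in the garden")
# -> ['quick', 'brown', 'fox', 'garden']
```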
Regards...jmcc