
Efficient Text Analysis with PHP or otherwise

Trying not to waste processing time.


brotherhood of LAN

10:16 pm on Feb 18, 2003 (gmt 0)


//intro
I'm breaking up the webpages to (at least attempt to) build a good site search engine without wasting space etc.
//

When breaking up webpages into elements, I have been doing it a rather slow way (at least I think so), and it seems there is a lot of room for improvement in terms of speed.

This is the way I've been doing it up to now, though for the next batch of text I want it done a little faster:

1) Fetch 1 page from database
2) Replace all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings etc.
3) For each paragraph, for each word, check whether the word exists in the db, and if not, insert it.

This involves walking the paragraph word by word and querying the db for each one, inserting it if it's not already there.
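
For illustration, a minimal sketch of that word-by-word loop (assuming a MySQL `words` table with a unique `word` column, and using PDO purely as an example API; the DSN, credentials and $paragraphs array are placeholders):

<?php
// Minimal sketch of the word-by-word approach described above.
// Assumes a MySQL `words` table with a unique `word` column.
$pdo = new PDO('mysql:host=localhost;dbname=siteindex', 'user', 'pass');

$check  = $pdo->prepare('SELECT 1 FROM words WHERE word = ?');
$insert = $pdo->prepare('INSERT INTO words (word) VALUES (?)');

foreach ($paragraphs as $paragraph) {           // $paragraphs comes from step 2 above
    $words = preg_split('/\W+/', strtolower($paragraph), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $check->execute(array($word));          // one SELECT per occurrence...
        if ($check->fetchColumn() === false) {
            $insert->execute(array($word));     // ...plus an INSERT for new words
        }
    }
}

The cost is the query-per-occurrence pattern, which is what the alternative below avoids.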

A possible alternative I've been pondering is iterating through the words and just keeping a tab of each word's place in a paragraph, and which paragraph it's in... and then weeding out all duplicate words. This requires flagging where each word appears, but saves on the number of queries to the database by only querying each word once. Something like the following (sketched in code after the list):

1) Fetch 1 page from database
2) Replace all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings etc.
3) For every unique word in the document, check whether it exists in the db, and if not, insert it.
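
A minimal sketch of that idea, reusing the `words` table and $pdo handle from the sketch above:

<?php
// Sketch of the "unique words first" idea: tab each word's position in
// memory, then hit the database once per distinct word rather than once
// per occurrence. $pdo and the `words` table are as in the first sketch.
$positions = array();   // word => list of array(paragraph #, word #)

foreach ($paragraphs as $p => $paragraph) {
    $words = preg_split('/\W+/', strtolower($paragraph), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $i => $word) {
        $positions[$word][] = array($p, $i);    // duplicates collapse onto one key
    }
}

$check  = $pdo->prepare('SELECT 1 FROM words WHERE word = ?');
$insert = $pdo->prepare('INSERT INTO words (word) VALUES (?)');
foreach (array_keys($positions) as $word) {     // one pass per distinct word only
    $check->execute(array($word));
    if ($check->fetchColumn() === false) {
        $insert->execute(array($word));
    }
}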

As an add-on to that alternative, perhaps assigning DOCIDs to the words and processing 10 pages at a time might be more efficient than doing one page at a time.
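
And a rough sketch of the batched version, assuming a UNIQUE index on words.word so MySQL's INSERT IGNORE can quietly skip words that are already indexed (extract_words() here is a hypothetical stand-in for steps 1-2):

<?php
// Sketch of batching: gather the unique words from a handful of pages,
// then push them to the database in a single statement.
// extract_words() is hypothetical; $pdo is the connection from above.
$batch = array();   // word => true, accumulated across e.g. 10 pages

foreach ($pages as $docId => $page) {
    foreach (extract_words($page) as $word) {
        $batch[$word] = true;                  // dedupes across the whole batch
    }
}

if (!empty($batch)) {
    $placeholders = implode(',', array_fill(0, count($batch), '(?)'));
    $stmt = $pdo->prepare("INSERT IGNORE INTO words (word) VALUES $placeholders");
    $stmt->execute(array_keys($batch));        // one query for the whole batch
}

That gets the per-batch word inserts down to a single query; the word-to-DOCID mapping would still need its own table and inserts.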

Safe to say that checking word1 > word2 > word3 in each paragraph of every single page is one of the less efficient ways of doing this.

Has anyone got a suggestion/formula/practice that could make this process a bit speedier? Maybe there is even a formula out there that says "this is the best way to do it", but I'd be quite satisfied with "the better ways to do it" just now :)

brotherhood of LAN

1:38 am on Feb 19, 2003 (gmt 0)


Oh well... formulas aside, the alternative is faster than what I was doing before :)

2 million words across 5,000 pages in 30 seconds through PHP... nice.

jmccormac

6:10 am on Feb 19, 2003 (gmt 0)


Cutting the small words from the database would probably give some improvement. These would be the common ones like 'the', 'a', 'an', 'or', 'and' etc. Another aspect would be to apply weightings based on the positioning of the words: if a word appeared in a headline/subheadline/title it would get a higher weighting.

If your pages have good headlines/subheads then the job is actually a lot easier than it first appears. The biggest problem is that you would end up indexing all words when in reality you only need to index the important ones. Some of the open-source search engines come with tables of blocked words, and these lists would be a good thing to use to clean your search index after you have compiled it.
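
For illustration, a rough sketch of both ideas together (the stop word list, element tags and weights below are placeholders, not recommendations):

<?php
// Sketch of the two suggestions above: drop common stop words before
// indexing, and weight a word more heavily when it appears in a title,
// headline or subheadline.
$stopWords = array('the', 'a', 'an', 'or', 'and', 'of', 'to', 'in');
$weights   = array('title' => 10, 'h1' => 8, 'h2' => 5, 'p' => 1);

$scores = array();   // word => accumulated weight for this page
foreach ($elements as $element) {   // each element: array('tag' => ..., 'text' => ...)
    $tag    = $element['tag'];
    $weight = isset($weights[$tag]) ? $weights[$tag] : 1;
    $words  = preg_split('/\W+/', strtolower($element['text']), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (in_array($word, $stopWords)) {
            continue;               // skip the common words entirely
        }
        $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $weight;
    }
}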

Regards...jmcc