Forum Moderators: coopster
When breaking webpages up into elements, I've been doing it a rather slow way (at least I think so), and it seems there's a lot of room for improvement in terms of speed.
This is the way I've been doing it up to now, though for the next batch of text I want it done a bit faster:
1) Fetch 1 page from database
2) Remove all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings, etc.
3a) For each paragraph, for each word, check if the word exists in the db, and if not, insert it.
This means walking through each paragraph word by word and inserting words into the db if they're not already there.
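Step 3a above can be sketched like this (a minimal Python/SQLite sketch; the one-column `words` table is a made-up schema for illustration, not your actual setup):

```python
import sqlite3

# Hypothetical schema just for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")

def index_paragraph_slow(paragraph):
    """One SELECT round-trip per word, then an INSERT if it's missing."""
    cur = conn.cursor()
    for word in paragraph.lower().split():
        cur.execute("SELECT 1 FROM words WHERE word = ?", (word,))
        if cur.fetchone() is None:
            cur.execute("INSERT INTO words (word) VALUES (?)", (word,))
    conn.commit()

index_paragraph_slow("the quick brown fox jumps over the lazy dog")
```

Every word costs at least one query, duplicates included -- that's where the time goes.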
A possible alternative I've been pondering is iterating through the words and just keeping a tab on each word's place in a paragraph, and which paragraph it's in, and then weeding out all duplicate words. This requires flagging where each word appears, but it saves on the number of queries to the database by only querying each word once. Something like:
1) Fetch 1 page from database
2) Remove all unnecessary tags and split the page into "paragraph elements", e.g. nav bars, paragraphs, headings, etc.
3) For every unique word in document, check if word exists in db, and if not, insert it.
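The dedup pass in those steps might look like this (Python sketch; the position bookkeeping is just a dict of lists I made up, not a particular library):

```python
from collections import defaultdict

def collect_unique_words(paragraphs):
    """Walk the document once, recording (paragraph, position) for each
    word, so each distinct word only hits the database a single time."""
    occurrences = defaultdict(list)
    for p_no, paragraph in enumerate(paragraphs):
        for pos, word in enumerate(paragraph.lower().split()):
            occurrences[word].append((p_no, pos))
    return occurrences

occ = collect_unique_words(["the cat sat", "the dog ran"])
# "the" appears twice in the text but is only one key to query/insert
```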
As an add-on to that alternative, perhaps assigning DOCIDs to the words and processing 10 pages at a time might be more efficient, instead of doing one at a time.
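That batching idea could be sketched like so, again assuming SQLite and a made-up two-table schema (`words` plus a `postings` table); `INSERT OR IGNORE` stands in for the per-word existence check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word TEXT PRIMARY KEY);
    CREATE TABLE postings (word TEXT, docid INTEGER,
                           paragraph INTEGER, position INTEGER);
""")

def index_batch(pages):
    """pages maps docid -> list of paragraphs. Dedupe across the whole
    batch in memory, then push everything in two bulk statements."""
    unique_words = set()
    postings = []
    for docid, paragraphs in pages.items():
        for p_no, paragraph in enumerate(paragraphs):
            for pos, word in enumerate(paragraph.lower().split()):
                unique_words.add(word)
                postings.append((word, docid, p_no, pos))
    # No SELECT round-trips: OR IGNORE skips words already in the table
    conn.executemany("INSERT OR IGNORE INTO words (word) VALUES (?)",
                     [(w,) for w in unique_words])
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", postings)
    conn.commit()

index_batch({1: ["the cat sat"], 2: ["the dog ran fast"]})
```

Two `executemany` calls replace hundreds of single-word queries, and the postings table keeps the per-paragraph word positions you'd otherwise lose by deduplicating.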
Safe to say that checking word1 > word2 > word3 in each paragraph of every single page is one of the less efficient ways of doing this.
Has anyone got a suggestion/formula/practice that could make this process a bit speedier? Maybe there's even a formula out there that says "this is the best way to do it", but I'd be quite satisfied with "the better ways to do it" for now :)
If your pages have good headlines/subheads then the job is actually a lot easier than it first appears. The biggest problem is that you would end up indexing all words, when in reality you only need to index the important ones. Some of the open-source search engines come with tables of blocked words (stop words). These lists would be a good way to clean your search index after you have compiled it.
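Filtering against such a stop-word list is nearly a one-liner (the list here is a tiny made-up sample; the real lists shipped with open-source engines run to hundreds of words):

```python
# Tiny illustrative stop-word list -- real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def important_words(text):
    """Keep only the words worth indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

important_words("the quick brown fox is in the garden")
# -> ['quick', 'brown', 'fox', 'garden']
```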
Regards...jmcc