It's always an academic exercise; I always try to do it myself first. The only thing I've given up on so far is an HTML parser!
OK, I've had a play. So far I've got it to search the entire site for HTML files, strip out the bare text, and note important things like headers and bullets. It then saves all of this in a cache. I might change this to actually spider the site first instead of reading from disk; as it stands, it picked up a few files I didn't want it to index and I had to code around them.
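In case it helps, here's roughly the shape of what I mean, sketched in Python with the standard library. The class and function names (`TextStripper`, `strip_site`) are just ones I made up for the sketch, and the real version does a bit more filtering:

```python
import os
from html.parser import HTMLParser

class TextStripper(HTMLParser):
    """Strip bare text from HTML, flagging text inside headers/bullets."""
    NOTEWORTHY = {"h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.chunks = []   # list of (text, important?) pairs
        self.depth = 0     # nesting level inside noteworthy tags

    def handle_starttag(self, tag, attrs):
        if tag in self.NOTEWORTHY:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOTEWORTHY and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((text, self.depth > 0))

def strip_site(root):
    """Walk a local copy of the site and cache stripped text per file."""
    cache = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".html"):
                continue   # skip files we don't want indexed
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                parser = TextStripper()
                parser.feed(f.read())
            cache[path] = parser.chunks
    return cache
```

The cache is just a dict keyed by path, so swapping the `os.walk` loop for a spider later only means changing where the HTML comes from.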
Sean, I like your way of indexing each word; it's better than the way I was going to do it! I'd prefer to do this page by page, though. There doesn't seem to be any problem with that, and it saves having to join and split the indexes, or invent a more complicated way of storing them.
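To show what I mean by page by page, here's a rough sketch (again Python, and `index_page`/`search` are just names for the sketch): each page gets its own small word-to-positions index, so a page can be re-indexed or dropped without touching any other page's index.

```python
import re
from collections import defaultdict

def index_page(text):
    """Build one small index for a single page: word -> word positions."""
    index = defaultdict(list)
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index[word].append(pos)
    return dict(index)

def search(page_indexes, word):
    """page_indexes: {page_name: index_page(...)} built one page at a time."""
    word = word.lower()
    return [page for page, idx in page_indexes.items() if word in idx]
```

Searching is just a lookup in each per-page dict, and there's never a big combined index to join or split.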
Incidentally, how did you get your code to appear as code?