all leading to this page: [labs.google.com...]
|I am putting together an automatic classification system using around half a gigabyte of text as the raw data; I expect the computations to run for several days. |
Just out of interest, what's the implementation language, and do you keep all the data structures in memory? 0.5 GB of raw data is very small and should fit in an all-in-memory index on most current machines, so the problem should be mainly CPU-bound. While I admit I have not done anything practical yet in the area of LSI, it sounds a bit strange that you expect the algorithm to run for days.
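To make the "fits all in memory" point concrete, here is a minimal sketch of an all-in-memory inverted index. The function names and the toy documents are my own illustration, not anything from the posters' systems; a real index over 0.5 GB of text would also want tokenization and stop-word handling, but the memory footprint of this structure stays comfortably within RAM on the machines being discussed.

```python
from collections import defaultdict

def build_index(docs):
    """Build a simple in-memory inverted index: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return the ids of documents containing the term."""
    return index.get(term.lower(), set())

docs = ["Latent semantic indexing finds hidden structure",
        "Brute force classification is CPU bound",
        "Indexing half a gigabyte fits in memory"]
idx = build_index(docs)
print(sorted(search(idx, "indexing")))  # documents 0 and 2
```

Once the index is resident in memory, lookups are cheap, which is why indexing itself is not where the days of computation would go.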
Once again, it appears current SEO techniques are out the window, which, like Bourbon, will leave us scrambling for our sites' lives. Yes, hopefully it will make things better, but that's what Bourbon was supposed to do.
Inbound and Lord Majestic,
A half a GB of words is a lot of data.
Approx 7,200 words in 1,300 stories takes close to 6 minutes of raw CPU power on a 2.0 GHz Xeon box to do a simple classification using a simple-minded brute-force system.
That is about 1.2 MB, or 0.0012 GB, of data.
So the figure is reasonable.
Lots of CPU cycles will be burned doing that on a large scale.
The results can however be "interesting" even when the domain is restricted.
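A "simple-minded brute-force system" of the kind described could look something like the sketch below: score an unlabeled story against every labeled story by raw word overlap and take the best match. This is my own illustration of the approach, not the poster's actual code; the names and example data are invented.

```python
def word_overlap(a, b):
    """Count the distinct words two texts share -- a crude similarity measure."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def classify(story, labeled_stories):
    """Assign the label of the labeled story with the largest word overlap."""
    best_label, best_score = None, -1
    for label, text in labeled_stories:
        score = word_overlap(story, text)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [("sports", "the team won the final match"),
            ("finance", "the market fell as rates rose")]
print(classify("rates rose again as the market slid", examples))  # finance
```

Comparing every story against every labeled story is quadratic in the number of documents, which is exactly why the CPU time mounts so quickly, and why the "interesting" results: a few shared common words can outweigh the words that actually carry meaning.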
|A half a GB of words is a lot of data. |
Perhaps for LSI it is; certainly not so for "normal indexing" for search purposes. I will soon start working on the LSI bits for the search engine, so it will be interesting to see if I get similar performance numbers. I hope you are wrong, as your figures are way too slow :(
A friend of a friend of mine is a PhD linguist who started working at Google a few years before the IPO. I understand she is one of many.
One of Google's stated goals is to use artificial intelligence to actually understand the content of pages, in large part to help distinguish meaningful text from copy 'n paste scraper sites.
I found a well-designed page that had copied a couple of paragraphs from one of my top pages. It had also copied a few paragraphs from several other related pages. Each paragraph was (if I may brag) well written and made sense on its own. But the whole "essay" made no sense whatsoever.
The "essay" credited its "author" as being some esteemed academic and even gave a link to his homepage.
Now imagine the daunting task google faces in distinguishing such pages from legitimate content. That's why they hired my friend the linguist.
Of course there was adsense there.
I could report them but I have found so many like these that it would be a full-time job. If I could afford employees maybe I'd have one do it. But for me alone it is better to develop new content on my site and to try to devise affordable yet effective marketing strategies.
Something like this:
There are others of course.
There are methods used to help detect program code copying.
They basically produce reports of places for a human to look.
Big bucks are riding on the outcome of some of the code comparisons that have been done.
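The standard trick behind such tools is fingerprinting: hash overlapping k-word (or k-token) shingles of each document and flag pairs with a large fingerprint overlap for human review. The sketch below is a simplified illustration of that idea, not any particular commercial tool; production systems (MOSS-style winnowing, for instance) select only a subset of the hashes to keep fingerprint sets small.

```python
def fingerprints(text, k=5):
    """Hash every k-word shingle of the text into a fingerprint set."""
    words = text.lower().split()
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

def overlap_report(a, b, k=5):
    """Fraction of a's shingles also found in b -- high values flag human review."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa:
        return 0.0
    return len(fa & fb) / len(fa)

original = "each paragraph was well written and made sense on its own"
copied   = "each paragraph was well written and made sense to the scraper"
print(overlap_report(original, copied))
```

The output is only a score; as the post says, the tool produces a report of places to look, and a human still makes the call.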
Half a gigabyte is indeed small fry when you are talking about standard indexing, but LSI is a different ball game.
Setting up the matrices for a specific task is the key to getting the best result with LSI. We are working for a client that needs to classify huge amounts of documents and can't afford to do it by hand. Even Verity K2 with all the bells and whistles can't handle the task (and that runs into 6 figures just for the license).
So we need to take a dataset of 500,000 1k hand-edited document headers (which include keywords, titles, subject areas & authors) and create a way to classify a set of documents that do not have such data.
The reason we think it will take a few days (it's being written in C++) is that we need to create a few sets of data: one that classifies any given document into the most likely area, and then we have to generate keywords based on the subject of the document and the text contained within it (this includes extracting important data from the document and also adding other subject-specific keywords that are not in the document but would make future non-LSI searches more accurate). We estimate that there will be over 10,000 categories when we are done (not bad when you consider the UDC has 56,000 main categories).
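One plausible way to bootstrap classification from hand-edited headers is to aggregate the header keywords into a per-subject profile and score unlabeled documents against each profile. The sketch below is an assumption on my part about the general shape of such a step (the real system is in C++ and uses LSI); the record format and names here are invented for illustration.

```python
from collections import Counter, defaultdict

def build_profiles(headers):
    """Aggregate keyword counts per subject area from hand-edited headers.

    `headers` is a list of (subject_area, keywords) pairs -- stand-ins for
    the hand-edited 1 KB header records described above."""
    profiles = defaultdict(Counter)
    for area, keywords in headers:
        profiles[area].update(k.lower() for k in keywords)
    return profiles

def classify(text, profiles):
    """Score the document's words against each profile; return the best area."""
    words = Counter(text.lower().split())
    def score(profile):
        return sum(count * profile[w] for w, count in words.items())
    return max(profiles, key=lambda area: score(profiles[area]))

headers = [("chemistry", ["acid", "reaction", "catalyst"]),
           ("history", ["medieval", "empire", "archive"])]
profiles = build_profiles(headers)
print(classify("the catalyst sped up the acid reaction", profiles))  # chemistry
```

With 500,000 headers and 10,000-plus categories, even this linear scoring pass is substantial work per document, before the LSI matrix computations are added on top, which is consistent with a multi-day run.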
If this goes well we expect to be given access to a dataset 3 times the size of the entire contents of the British Library. That would be interesting (and would certainly take longer than a few days to run). Sadly, the full text of many of the documents would not be available electronically, but the data still weighs in at the multi-terabyte level, possibly a little more than we would need.