|Flat File Database|
Are lots of files or one large one better?
I built a flat file search engine in Perl as a learning tool over a decade ago.
It gets so much traffic that I am trying to optimize it a bit.
The database file is about 1 megabyte in size and has some small compression applied.
I also built a flat file database made up of separate files, one per search term (or set of terms), each holding the search results for that term.
This is not as big as you might think, but it is about 6000 files ...
There is also a dictionary file that lists these files.
This keeps the file that has to be opened generally under 80k.
Now to the question.
Is it better to have one large file to open and search or one of the 6000 smaller files to open and search?
That depends. Would a search open thousands of files if you have 6K smaller files?
The file open operation is actually one of the most expensive (time consuming) functions in the operating system, so one file has a very significant advantage over many files in that aspect alone.
When you open a very large file, the key to speedy processing is the size of the cache buffer your code uses, because disk hardware tends to read 10K or more at one time. If you take all the data the disk controller reads from a track while it has it, you save multiple accesses to the drive, and it gets really speedy to process.
Compare that to reading line by line using the programming language's gets() function, which uses its own default buffering that doesn't take into account the actual hardware performance capabilities, and it's going to be much slower.
I could go on, but you probably get the idea: a custom read function, or at least specifying the right amount of data to cache per read so as to minimize HD head movement, can make such a search screaming fast.
A 1MB file shouldn't really challenge modern hardware for speed either in reading the whole thing from the disk in one shot or in searching through it in memory.
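As a rough illustration of the two approaches described above, here is a minimal sketch in Python (the same idea carries over to Perl). The function names, the 1 MB buffer size, and the assumption that the database is a plain newline-separated text file are all hypothetical; the compression mentioned earlier in the thread is ignored here.

```python
import os
import tempfile

def search_whole_file(path, term):
    """Read the entire file with one read() call, then search it in memory."""
    with open(path, "rb") as fh:
        data = fh.read()  # a single read pulls the whole file into memory
    needle = term.encode("utf-8")
    return [line.decode("utf-8") for line in data.splitlines() if needle in line]

def search_buffered(path, term, buffer_size=1024 * 1024):
    """Scan line by line, but with an explicitly sized read buffer.

    buffer_size is an assumed value; tune it for the actual hardware.
    """
    needle = term.encode("utf-8")
    matches = []
    with open(path, "rb", buffering=buffer_size) as fh:
        for line in fh:
            if needle in line:
                matches.append(line.rstrip(b"\r\n").decode("utf-8"))
    return matches

if __name__ == "__main__":
    # Build a small throwaway database file to search.
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "database.txt")
        with open(db, "w", encoding="utf-8") as fh:
            fh.write("apple pie\nbanana bread\napple tart\n")
        print(search_whole_file(db, "apple"))
        print(search_buffered(db, "apple"))
```

For a file of about 1 MB, the whole-file version is probably the simpler choice, since the data fits comfortably in memory.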
If possible, move it to a RAM disk.
It only has to open one of the smaller files per search term.
RAM disk... wow...
There are a lot of sites on the server and wow, that is an idea. That would be ... fast.
I can write a function to cache like you recommend; I think that is my best option.
But, do you have some ideas about a ram disk?
incrediBILL is right. Your problem is the constant opening and closing of thousands of files. These have to be located, opened, closed, located, opened, closed.
No need to go for a RAMDISK with just 1MB. The file is so small it will be cached in various places anyway: the O/S, the disk hardware cache. Even the amount of space used on the disk will require very little head movement with one file, whilst thousands of files (albeit small) could be all over the place.
Presumably you have some kind of identifier for these files, possibly a sequential name? file0001, file0002 ... file6000
Copy them all into one file and add the name to the end of the existing database.
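One way to read the suggestion above is to concatenate the small files into a single data file and keep an index of where each one starts. The sketch below is in Python and is only one possible interpretation: the JSON index, the function names, and the file layout are assumptions, not the poster's exact scheme.

```python
import json
import os

def pack_files(src_dir, data_path, index_path):
    """Concatenate every file in src_dir into data_path.

    Records [offset, length] for each file name in a JSON index.
    data_path and index_path should live outside src_dir.
    """
    index = {}
    with open(data_path, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            full = os.path.join(src_dir, name)
            if not os.path.isfile(full):
                continue  # skip subdirectories
            with open(full, "rb") as fh:
                payload = fh.read()
            index[name] = [out.tell(), len(payload)]
            out.write(payload)
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
    return index

def read_record(data_path, index, name):
    """Open the combined file once and read only the bytes for one record."""
    offset, length = index[name]
    with open(data_path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

With this layout a search opens one data file and seeks straight to the record it needs, rather than locating and opening one of 6000 separate files.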
...answered before reading the replies.
In general, there is huge overhead associated with each call to disk. For example, 10 files of 100k each will load significantly slower than 1 file of 1 meg. However, it is going to be the number of hits you make to those files that is going to be the determining factor. (WebmasterWorld is all flat files - 1 per thread - about 500k flat files on here now - about 8000% faster than a *sql db.)
If you are not maxing out the ram on your machine, then loading 1 moderately sized file is going to be much faster than loading dozens of smaller files. I think it is going to depend on your operating system and its disk caching methods. If you open those smaller files very often, then they are going to be cached at a higher rate than one larger file and the overhead of the file open hit will be small. A meg file is nothing.
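Since the answer depends on the operating system and its caching, it may be easiest to measure it on the actual server. Here is a minimal timing sketch in Python; the file names, counts, and sizes are made up for illustration. Note that freshly written files will almost certainly sit in the OS cache, so this mostly measures per-file open overhead rather than physical disk access.

```python
import os
import tempfile
import time

def make_test_files(directory, count, size):
    """Write `count` files of `size` bytes each, plus one combined file of the same total size."""
    small_paths = []
    for i in range(count):
        path = os.path.join(directory, f"part{i:04d}.dat")
        with open(path, "wb") as fh:
            fh.write(b"x" * size)
        small_paths.append(path)
    big_path = os.path.join(directory, "combined.dat")
    with open(big_path, "wb") as fh:
        fh.write(b"x" * (count * size))
    return small_paths, big_path

def timed_read(paths):
    """Open and fully read each path; return (total bytes read, elapsed seconds)."""
    start = time.perf_counter()
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return total, time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        small, big = make_test_files(tmp, count=10, size=100 * 1024)
        print("10 small files:", timed_read(small))
        print("1 large file:  ", timed_read([big]))
```

Running each case many times and comparing the totals should give a steadier picture than a single pass.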
The question I have is how many hits per hour are you talking to this db? Is this a case of 100k hits, or 100 hits? If it is the former, then you probably should consider using some type of ram disk. Ram is so cheap these days for servers, that rethinking ram disks is starting to be popular again. I'm toying with putting ALL of WebmasterWorld in a ramdisk (it is less than 4gig).
Thanks, that goes to the heart of what I was wanting to know.
I get a few thousand searches a day and all is well ...