Forum Moderators: phranque

Performance with a large number of files in a directory

Results of my testing


MichaelBluejay

10:52 am on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm developing an app that could have hundreds of thousands of files, and I wondered whether there would be a performance penalty for having too many files in a directory, and if so, what that magic "too many" number would be. Unfortunately, a Google search of both the web in general and WebmasterWorld in particular didn't turn up much. But it occurred to me that I could just run tests on my own server. For each test, I used a Perl script to run some operation against directories containing 100, 1000, and 10,000 files respectively.

My server is a Pentium 4, 2.8 GHz, with about 40 MB of free RAM and a server load between 0.07 and 0.20 when I ran these tests.

File creation
This one isn't so relevant, because my app will never create 10,000 files all at once! Still, since I had to create the files to run the test, I timed it. It took "0" seconds to create the files for the 100- and 1000-file directories, and 7 seconds for the 10,000 files. (Perl's built-in time() only resolves to whole seconds.) Incidentally, the files are empty, since all I really need to test is access time; I don't need any actual contents in the files.
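In case anyone wants to reproduce this, the empty files can be created straight from the shell too. A quick sketch (the /tmp path and the count of 100 are just placeholders; scale the count up to 10,000 to match my test):

```shell
# Create a scratch directory full of empty, numbered files.
# /tmp/dirtest/100 is a hypothetical path, not my real one.
mkdir -p /tmp/dirtest/100 && cd /tmp/dirtest/100
for i in $(seq 1 100); do
    : > "$i.txt"    # ':' is a no-op; the redirection creates an empty file
done
ls | wc -l          # sanity check the file count
```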

ls (directory listing) from the command line
One of the many (unhelpful) comments I found on a message board while researching file counts and performance was something like "with a lot of files in a directory, even an 'ls' command can bring a server to its knees." So I decided to test this one. My command line is Terminal in Mac OS X, logged into my server remotely. (I'm in Austin; the server is in California.) Anyway, the ls command took less than two seconds to run on the 10,000-file directory. The real bottleneck seemed to be spitting all the output to the window, not reading the directory.
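If you want to separate the directory-read cost from the terminal-drawing cost, you can throw the output away. A sketch using a scratch directory (the path and count are placeholders, not my real setup):

```shell
# Build a scratch directory (hypothetical path) with 1000 empty files,
# then time the listing with its output discarded, so only the
# directory read itself is measured, not the terminal rendering.
mkdir -p /tmp/lstest && cd /tmp/lstest
for i in $(seq 1 1000); do : > "$i.txt"; done
time ls > /dev/null     # directory read only; no terminal drawing
ls | wc -l              # sanity check: 1000 files
```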

Open 100 random files
Now we get down to business! I opened and closed 100 random files from each directory. For each directory, it took "0" seconds. I tried it again with 1000 opens: still 0 seconds across the board. Once more with *10,000* opens, and the results were 2, 1, and 2 seconds respectively. (I assume the 1000-file directory scoring best is just rounding, since Perl's time() doesn't measure fractions of a second.)

Get 100 random files via HTTP
Okay, so UNIX clearly doesn't slow down if there are 10,000 files in a directory, but will Apache care? I doubt it, since Apache is getting the files via UNIX, but I might as well be thorough. Anyway, for fetching 100 random files from each directory via HTTP, it took all of 2 seconds for each directory, even the 10,000-file directory.

Conclusion
Man, you can easily have 10,000 files in a directory and the system just doesn't blink. No measurable performance problems at all, at least on this server and filesystem.

By the way, here was the code I used for "Open 100 to 10000 random files".


#!/usr/bin/perl
use strict;
use warnings;

print "\n";
testFiles(100);
testFiles(1000);
testFiles(10000);

# Open and close random files from the directory named for its file count.
# Adjust the loop bound for the 100-, 1000-, and 10,000-open runs.
sub testFiles {
    my ($count) = @_;
    my $start = time;
    for my $counter (0 .. 100000) {
        my $x = int(rand($count)) + 1;
        open(my $fh, '<', "$count/$x.txt") or die "Couldn't open file $x. $!";
        close($fh);
    }
    print "$count: " . (time - $start) . " seconds\n";
}

And here's the code for "Get 100 random files via HTTP".


#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

print "\n";
testFiles(100);
testFiles(1000);
testFiles(10000);

# Fetch 100 random files over HTTP from the directory named for its file count.
sub testFiles {
    my ($count) = @_;
    my $start = time;
    for my $counter (1 .. 100) {
        my $x = int(rand($count)) + 1;
        my $content = get("http://example.com/$count/$x.txt");
    }
    print "$count: " . (time - $start) . " seconds\n";
}

jdMorgan

5:37 pm on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It would be interesting to try this on a shared server, or on a server with marginally sufficient memory. I suspect the results would be different in that case, since the cached directory structure would likely not persist in memory for very long as things got swapped in and out to service requests for the other hosted sites.

This is one likely reason for the common approach of using URL-to-filespace mappings such as
/apples.html --> /a/apples.html
/apricots.html --> /a/apricots.html
/carrots.html --> /c/carrots.html
/cucumbers.html --> /c/cucumbers.html
and other similar partitioning approaches used to reduce physical directory size.
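As a concrete sketch of that kind of mapping (the helper name and filenames are purely hypothetical), the bucket directory is just the first character of the filename:

```shell
# Hypothetical helper: derive the one-letter bucket path from a filename,
# so /apples.html would be served from /a/apples.html on disk.
bucket() {
    first=$(printf '%s' "$1" | cut -c1)
    printf '%s/%s\n' "$first" "$1"
}
bucket apples.html     # -> a/apples.html
bucket carrots.html    # -> c/carrots.html
```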

Jim