
Most efficient method for counting files

     
9:54 am on Jul 4, 2011 (gmt 0)

5+ Year Member



Hey all.

I have a little system set up that receives incoming requests from my client sites. The results of these requests are then cached to the file system, to reduce the load on the server the next time the same request is made.

I use the following method to generate the cache filename:
- I generate an MD5 hash of the requested URL.
- I then create a directory at /tmp/cache/{0}/{1}/{2}/{3}, where each {n} is the character at position n of the hash.
- I then strip those first four characters from the hash and save the file under the remaining characters in that directory.

For example, the hash for http://www.google.co.za is 260289fb0e63d27a83fb63a1f5449806, so the cache file for this request would be /tmp/cache/2/6/0/2/89fb0e63d27a83fb63a1f5449806.
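
To make that concrete, here is a rough shell sketch of the naming scheme (illustrative only; the system's actual code isn't shown in this thread, and $response_body below is just a placeholder for the fetched content):

url='http://www.google.co.za'
hash=$(printf '%s' "$url" | md5sum | awk '{print $1}')    # MD5 of the requested URL
dir="/tmp/cache/${hash:0:1}/${hash:1:1}/${hash:2:1}/${hash:3:1}"
mkdir -p "$dir"                                           # one directory level per leading hash character
printf '%s' "$response_body" > "$dir/${hash:4}"           # filename is the hash minus its first four characters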

I need to keep track of the number of these cache files. I don't care which directories they're in; I just need a total count of the files under /tmp/cache/.

Currently, I'm using
find /tmp/cache -type f | wc -l
to generate this count of files. However, I've noticed that it's taking longer and longer to find these files (and uses more processing power than I'd like).

At the moment there are about 370,000 files in the cache, and the file system is ext4.

Does anyone have a better and more efficient method than this for finding the count of files? Please? :)
1:01 pm on Jul 11, 2011 (gmt 0)

phranque (WebmasterWorld Administrator)



try this:

ls -R /tmp/cache | egrep -v '^($|/)' | wc -l
10:20 am on Oct 7, 2011 (gmt 0)



You still have to traverse all those directories (a linear-time operation), so it will necessarily take longer and longer as the cache grows.

The most efficient way is still "find"; here is how to (vastly) improve the performance of your command:


find /tmp/cache -mindepth 5 -maxdepth 5 -type f | awk 'END { print NR }'


Explanation:
  • you specify the depth to search at: all your files are four subdirectories deep, so you set both the lower and upper limit to 5
  • you count with awk to produce purely numeric output


Find is usually fast, especially when run at close intervals (the VFS caches the directory structure), up to 500k files or so. If you exceed that, consider introducing an explicit counting mechanism in your caching layer: +1 when you add a file, -1 when you remove one.
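
A minimal sketch of such a counter (assuming shell, flock(1) from util-linux, and a counter file at /tmp/cache/.count; the names are placeholders, so adapt this to whatever language your caching code is written in):

count_file=/tmp/cache/.count

bump() {                              # call "bump 1" after adding a file, "bump -1" after removing one
    (
        flock -x 9                    # serialise concurrent updates
        n=$(cat "$count_file" 2>/dev/null || echo 0)
        echo $((n + $1)) > "$count_file"
    ) 9>"$count_file.lock"
}

bump 1
bump -1
cat "$count_file"                     # current total, no directory traversal needed

Reading the count is then constant-time, no matter how many files are in the cache.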
 
