
Linux, Unix, and *nix like Operating Systems Forum

    
Most efficient method for counting files
ffoeg
9:54 am on Jul 4, 2011 (gmt 0)

Hey all.

I have a little system set up that receives incoming requests from my client sites. The results of these requests are then cached to the file system, to reduce the load on the server the next time the same request is performed.

I use the following method to generate the cache filename:
- I generate an MD5 hash of the requested URL.
- I then create a directory at
/tmp/cache/{0}/{1}/{2}/{3}, where {0} through {3} are the first four characters of the hash.
- I then strip those four characters from the hash and save the file, named with the remainder, in that directory.

For example, the hash for http://www.google.co.za is 260289fb0e63d27a83fb63a1f5449806. The cache file for this request would be
/tmp/cache/2/6/0/2/89fb0e63d27a83fb63a1f5449806.
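For illustration, a minimal bash sketch of that scheme (the URL, the variable names, and the use of GNU md5sum are assumptions for the example):

#!/bin/bash
# Sketch of the cache-path scheme described above (illustrative only).
url="http://www.google.co.za"

# Hash the URL; md5sum prints "<hash>  -", so keep only the first field.
hash=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)

# The first four hex characters become the nested directory names.
dir="/tmp/cache/${hash:0:1}/${hash:1:1}/${hash:2:1}/${hash:3:1}"
mkdir -p "$dir"

# The remainder of the hash is the cache filename.
cachefile="$dir/${hash:4}"
echo "$cachefile"    # e.g. /tmp/cache/2/6/0/2/89fb0e63d27a83fb63a1f5449806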

I need to keep track of the number of these cache files. I don't care which directories they're in; I just need a total count of the files under /tmp/cache/.

Currently, I'm using
find /tmp/cache -type f | wc -l
to generate this count. However, I've noticed that it's taking longer and longer to run (and uses more processing power than I'd like).

At the moment, there are about 370,000 files in the cache, on an ext4 file system.

Does anyone have a better and more efficient method than this for finding the count of files? Please? :)

 

phranque
1:01 pm on Jul 11, 2011 (gmt 0)

try this:

ls -R /tmp/cache | egrep -v '^($|/)' | wc -l

tizs
10:20 am on Oct 7, 2011 (gmt 0)

You still have to traverse all those directories (a linear-time operation), so the count will necessarily take longer and longer as the cache grows.

The most efficient way is still find; here is how to (vastly) improve the performance of your command:


find /tmp/cache -mindepth 5 -maxdepth 5 -type f | awk 'END { print NR }'

Explanation:
  • you bound the depth of the search: all your files are four subdirectories in, so relative to /tmp/cache they sit at depth 5, and you set both -mindepth and -maxdepth to 5
  • you count with awk to produce a purely numeric output


find is usually fast, especially when run at short intervals (the VFS caches the directory structure), up to 500k files or so. If you exceed that, consider introducing an explicit counting mechanism in your caching layer: +1 when you add a file, -1 when you remove one (a sketch follows below).
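A minimal sketch of such a counter, assuming the cache is written from shell scripts; the .count file, the lock file, and the bump helper are all hypothetical names, with flock(1) serializing concurrent updates:

#!/bin/bash
# Hypothetical running counter kept next to the cache (names are illustrative).
COUNT_FILE=/tmp/cache/.count
LOCK_FILE=/tmp/cache/.count.lock

bump() {                  # usage: bump 1 (on add), bump -1 (on remove)
    (
        flock -x 9        # serialize concurrent writers on fd 9
        n=$(cat "$COUNT_FILE" 2>/dev/null)
        echo $(( ${n:-0} + $1 )) > "$COUNT_FILE"
    ) 9> "$LOCK_FILE"
}

bump 1                    # call after writing a new cache file
bump -1                   # call after expiring one
cat "$COUNT_FILE"         # current total in O(1), no directory traversal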
