Which is faster, 'find -exec' or 'find | xargs -0'? - shell

In my web application I render pages using a PHP script, and then generate static HTML files from them. The static HTML files are served to the users to improve performance. The HTML files eventually become stale and need to be deleted.
I am debating between two ways to write the eviction script.
The first is using a single find command, like
find /var/www/cache -type f -mmin +10 -exec rm \{} \;
The second form is by piping through xargs, something like
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 rm
The first form invokes rm for each file it finds, while the second form just sends all the file names to a single rm (but the file list might be very long).
Which form would be faster?
In my case, the cache directory is shared between a few web servers, so this is all done over NFS, if that matters for this issue.

With a lot of files, the xargs version is dramatically faster than the -exec version as you posted it, because rm is executed once for each file found, while xargs lumps as many files as possible together into a single rm command.
With tens or hundreds of thousands of files, that can be the difference between a minute or less and the better part of an hour.
You can get the same behavior with -exec by finishing the command with a "+" instead of "\;". This option is only available in newer versions of find.
The following two are roughly equivalent:
find . -print0 | xargs -0 rm
find . -exec rm \{} +
Note that the xargs version will still run slightly faster (by a few percent) on a multi-processor system, because some of the work can be parallelized. This is particularly true if a lot of computation is involved.

I expect the xargs version to be slightly faster, as you aren't spawning a process for each filename. But I would be surprised if there was actually much difference in practice. If you're worried about the long list xargs sends to each invocation of rm, you can use -n with xargs to limit the number of arguments per invocation. However, xargs already knows the maximum command-line length and won't go beyond it.
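For illustration, the batching behavior of -n (which caps the number of arguments per invocation) can be seen directly; echo stands in for rm so the batches are visible:

```shell
# Five names, at most two per invocation: xargs runs echo
# three times, with batches of 2, 2 and 1 arguments.
printf '%s\n' a b c d e | xargs -n 2 echo
```

Each output line corresponds to one spawned echo process.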

The find command has a -delete action built in; perhaps that could be useful as well?
http://lists.freebsd.org/pipermail/freebsd-questions/2004-July/051768.html
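A sketch of what that would look like for the eviction in the question, demonstrated here on a throwaway directory rather than the real /var/www/cache path:

```shell
# Create a scratch directory holding one stale file.
dir=$(mktemp -d)
touch "$dir/stale.html"

# -delete removes matches inside find itself; no rm process
# is spawned at all (GNU and BSD find support this action).
find "$dir" -type f -name '*.html' -delete

ls "$dir"      # prints nothing: the file is gone
rmdir "$dir"
```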

Using xargs is faster than -exec with find.
I tried counting the number of lines in the *.js files under a node_modules folder using both xargs and -exec. The output is below.
time find . -type f -name "*.js" -exec wc -l {} \;
real 0m0.296s
user 0m0.133s
sys 0m0.038s
time find . -type f -name "*.js" | xargs wc -l
real 0m0.019s
user 0m0.005s
sys 0m0.006s
Here xargs executed approximately 15 times faster than -exec. Note, though, that the -exec run went first, so the page cache was cold for it and warm for the xargs run, and with total times this small the exact ratio is only indicative.

Related

Which is best way to grep on exec from find command?

I'm just curious which of these statements would be most resource intensive.
I expect at times to match thousands of files, and I want to make sure I implement the "safest" execution. The files themselves will be relatively small, but the number of files might be substantially large.
The two alternatives:
sudo find /home/users -name '*.sh' -type f -exec grep -n 'rm.*tmp.*7z$' {} \+
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} \;
As you can see, the difference is whether I use the + or the ; (plus the -H flag I added to the second grep).
The first one is going to run grep fewer times than the second, which will launch one instance per file. grep's startup time is pretty fast, so it might not be much of a visible improvement, but the first one will be more efficient, the second one more resource intensive.
(You'll want to add -H to the first grep's options too, just in case it gets run with a single filename argument at some point.)
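A quick way to see why -H matters: with exactly one file argument, grep omits the filename prefix unless -H forces it (the scratch file below is hypothetical):

```shell
dir=$(mktemp -d)
printf 'rm /tmp/x.tmp.7z\n' > "$dir/only.sh"

# One file argument: no filename prefix without -H.
grep -n 'rm.*tmp.*7z$' "$dir/only.sh"
# With -H the filename is always printed, matching the output
# you get when grep receives several files at once.
grep -Hn 'rm.*tmp.*7z$' "$dir/only.sh"
rm -r "$dir"
```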

find seems to be much slower with -print0 option

I am trying to resize photos larger than specific dimensions among hundreds of thousands of photos collected by a system over the past 10 years. I am using find and ImageMagick.
I wrote this script to do it.
#!/bin/bash
ResizeSize="1080^>"
Processing=0
find . -type f -iname '*JPG' -print0 | \
while IFS= read -r -d '' image; do
    ((Processing++))
    echo "Processing file: $Processing"
    echo "Resizing $image"
    if convert "$image" -resize "$ResizeSize" "${image}___"; then
        if rm "$image"; then
            mv "${image}___" "$image"
        else
            echo "something went wrong removing the original"
            exit 1
        fi
    else
        echo "something went wrong with the resize"
        exit 1
    fi
done
The script works on a small number of files, but it takes a long time to start when there are lots of files. I have compared find . -type f -iname '*JPG' -print0 with find . -type f -iname '*JPG' on the command line. The latter finds files within a few seconds, but the former takes minutes before anything is found. Unfortunately -print0 is required for dealing with filenames containing special characters (mainly spaces, in my case). How can I make this script more efficient?
I can not reproduce the behavior you're experiencing, but can think of two possible explanations.
First, you might be experiencing positive effects of page (disk) caching.
When you call find for the first time, it traverses file metadata (inodes), actually reading from the storage medium (HDD) via kernel syscalls. But the kernel (transparently to find or any other application) also stores that data in unused areas of memory, which act as a cache. If the same data is read again later, it can be read quickly from this cache in memory. This is called page caching.
So, your second call to find (no matter what output separator is used) will be a lot faster, assuming you are searching over the same files, with the same criteria.
Second, since find's output might be buffered, if your files are in many different locations, it might take some time before the actual first output to the while command. Also if the output is line-buffered, that would explain why -print0 variant takes longer to produce the first output (since there are no lines at all).
You can try running find with unbuffered output, via stdbuf command:
stdbuf -o0 find . -iname '*.jpg' -type f -print0 ...
One more thing, unrelated to this; to speed-up your find search, you might want to consider calling it like this:
find . -iname '*.jpg' -type f -print0
Here we put the -iname test before the -type test in order to avoid having to call stat(2) on every file. Even better would be to remove the -type test altogether, if possible.

Ignore spaces in Solaris 'find' output

I am trying to remove all empty files that are older than 2 days. Also I am ignoring hidden files, starting with dot. I am doing it with this code:
find /u01/ -type f -size 0 -print -mtime +2 | grep -v "/\\." | xargs rm
It works fine until there are spaces in the name of the file. How could I make my code ignore them?
OS is Solaris.
Option 1
Install GNU find and GNU xargs in an appropriate location (not /usr/bin) and use:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -print0 | xargs -0 rm
(Note that I removed (what I think is) a stray -print from your find options. The options shown remove empty files modified more than 2 days ago whose names do not start with a ., which is the condition your original grep seemed to deal with.)
Option 2
The problem is primarily that xargs is defined to split its input at spaces. An alternative is to write your own xargs surrogate that behaves sensibly with spaces in names; I've done that. You then only run into problems if the file names contain newlines — which the file system allows. Using a NUL ('\0') terminator is guaranteed safe; it is the only character that can't appear in a path name (which is why GNU chose to use it with -print0 etc).
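The NUL-terminator guarantee can be sketched like this (GNU find/xargs assumed, scratch directory hypothetical): a name containing a space survives the NUL-delimited pipeline intact, where a plain newline pipeline would split it.

```shell
dir=$(mktemp -d)
touch "$dir/with space.txt"

# NUL-delimited: the whole name travels as one argument,
# so rm sees "with space.txt" intact rather than two words.
find "$dir" -type f -print0 | xargs -0 rm

ls "$dir"      # empty: the file was removed correctly
rmdir "$dir"
```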
Option 3
A final option, which is perhaps better:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -exec rm {} \;
This avoids using xargs at all and handles all file names (path names) correctly — at the cost of executing rm once for each file found. That's not too painful if you're only dealing with a few files on each run.
POSIX (since the 2001 edition of the standard) specifies the notation + in place of the \;; find then behaves rather like xargs, collecting as many arguments as will conveniently fit in the space it allocates for the command line before running the command:
find /u01/ -type f -size 0 -mtime +2 -name '[!.]*' -exec rm {} +
The versions of Solaris I've worked on do not support that notation, but I know I work on antique versions of Solaris. GNU find does support the + marker and therefore renders the -print0 and xargs -0 workaround unnecessary.

Explain how many processes created?

Could someone explain how many processes are created in each case for the commands below, as I don't understand it:
The following three commands have roughly the same effect:
rm $(find . -type f -name '*.o')
find . -type f -name '*.o' | xargs rm
find . -type f -name '*.o' -exec rm {} \;
Exactly 2 processes: one for find (run via the command substitution) and one for rm. Note that this form fails with an "Argument list too long" error if the expanded file list exceeds the system's ARG_MAX limit.
3 or more processes: 1 for find, another for xargs, and one or more for rm. xargs reads standard input, and if it accumulates more arguments than can be passed to a single program (there is a maximum total size, ARG_MAX), it runs rm multiple times.
Many processes: 1 for find, plus one rm for each file ending in .o.
In my opinion, option 2 is the best, because it handles the maximum parameter limit correctly and doesn't spawn too many processes. However, I prefer to use it like this (with GNU find and xargs):
find . -type f -name '*.o' -print0 | xargs -0 rm
This terminates each filename with a \0 instead of a newline, since filenames in UNIX can legally contain newlines. This also handles spaces in filenames (much more common) correctly.
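You can watch the batching directly. In this sketch each output line corresponds to one spawned process, so with seven inputs and -n 3 the wrapped command runs three times:

```shell
# sh -c lets us print once per process rather than once per
# argument; the trailing _ fills $0 so the inputs land in $@.
printf '%s\n' 1 2 3 4 5 6 7 | xargs -n 3 sh -c 'echo "one process, $# args"' _
```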

improve find performance

I have a bash script that zips up filenames based on user input. It is working fine albeit slowly since I have, at times, to parse up to 50K files.
find "$DIR" -name "$USERINPUT" -print | /usr/bin/zip -1 SearchResult -@
The -@ option means that zip accepts the file names from STDIN. Is there a way to make it go faster?
I am thinking of creating a cron job to update the locate database every night, but I am not root, so I don't even know if it is worth it.
Any suggestions welcome.
I suggest you make use of parallel processing via xargs to speed up the process. Use a command like this:
find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
The above makes xargs run up to 10 parallel zip sub-processes (with xargs supplying the names as arguments, the -@ flag is no longer needed). Be aware, though, that several zip processes writing to the same SearchResult archive at once can conflict, so check the resulting archive carefully.
Please record the timing of the above command, like this:
time find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
and see if this makes any performance improvement.
As Mattias Ahnberg pointed out, this use of find will generate the entire list of matching files before zip gets invoked. If you're doing this over 50,000 files, that will take some time. Perhaps a more suitable approach would be to use find's -exec <cmd> {} \; feature:
find "$DIR" -name "$USERINPUT" -exec /usr/bin/zip -1 SearchResult {} \;
This way, find invokes zip itself on each matching file. You should achieve the same end result as your original version, but if the sheer number of files is your bottleneck (which, if the files are all small, is most likely), this will kick off running zip as soon as it starts finding matches, rather than when all matches have been found.
NB: I recommend reading the man page for find for details of this option. Note that the semi-colon must be escaped to prevent your shell interpreting it rather than passing it to find.
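If your find supports it, the + terminator gives you the batching without a pipeline. A self-contained sketch (sh/echo stands in for zip here) shows all matches arriving in a single invocation:

```shell
dir=$(mktemp -d)
touch "$dir/a.log" "$dir/b.log" "$dir/c.log"

# With '+', find collects the names and runs the command once
# for the whole batch instead of once per file.
find "$dir" -type f -name '*.log' -exec sh -c 'echo "batch of $# files"' _ {} +
rm -r "$dir"
```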
Sounds like you're trawling through the filesystem, running a find for each of the 50,000 files.
Why not do one run of find to log the names of all files in the filesystem, and then pluck their locations straight from this log file?
Alternatively, break the work down into separate jobs, particularly if you have multiple filesystems and multiple CPUs. There's no need to be single-threaded in your approach.
