find seems to be much slower with -print0 option - bash

I am trying to resize photos that are larger than specific dimensions, for hundreds of thousands of photos collected by a system over the past 10 years. I am using find and ImageMagick.
I wrote this script to do it:
#!/bin/bash
ResizeSize="1080^>"
Processing=0
find . -type f -iname '*JPG' -print0 | \
while IFS= read -r -d '' image; do
    ((Processing++))
    echo "Processing file: $Processing"
    echo "Resizing $image"
    convert "$image" -resize "$ResizeSize" "${image}___"
    if [ $? -eq 0 ] ; then
        rm "$image"
        if [ $? -eq 0 ] ; then
            mv "${image}___" "$image"
        fi
    else
        echo "something wrong with resize"
        exit 1
    fi
done
The script works on a small number of files, but it takes a long time to start when there are lots of files. I have tested on the command line: find . -type f -iname '*JPG' -print0 vs find . -type f -iname '*JPG'. The latter finds files within a few seconds, but the former takes minutes before anything is found. Unfortunately, -print0 is required for dealing with filenames containing special characters (which are mainly spaces in my case). How can I make this script more efficient?

I cannot reproduce the behavior you're experiencing, but I can think of two possible explanations.
First, you might be experiencing positive effects of page (disk) caching.
When you call find for the first time, it traverses the files (metadata in inodes), actually reading from the storage medium (HDD) via kernel syscalls. But the kernel (transparently to find and other applications) also stores that data in unused areas of memory, which act as a cache. If this data is read again later, it can be read quickly from this cache in memory. This is called page caching.
So, your second call to find (no matter what output separator is used) will be a lot faster, assuming you are searching over the same files, with the same criteria.
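If you want to verify this, compare a cold run against a warm one. The sketch below assumes a Linux system where you have root access to drop the caches; exact behaviour varies by kernel:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
time find . -type f -iname '*JPG' > /dev/null    # cold cache: reads metadata from disk
time find . -type f -iname '*JPG' > /dev/null    # warm cache: served from memory, much faster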
Second, since find's output might be buffered, if your files are in many different locations it might take some time before the first actual output reaches the while command. Also, if the output is line-buffered, that would explain why the -print0 variant takes longer to produce its first output (since there are no lines at all).
You can try running find with unbuffered output, via stdbuf command:
stdbuf -o0 find . -iname '*.jpg' -type f -print0 ...
One more thing, unrelated to this: to speed up your find search, you might want to consider calling it like this:
find . -iname '*.jpg' -type f -print0
Here we put the -iname test before the -type test in order to avoid having to call stat(2) on every file. Even better would be to remove the -type test altogether, if possible.
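Putting both suggestions together with your original loop, the pipeline might look something like this (a sketch only; adjust the convert arguments to your needs):
stdbuf -o0 find . -iname '*.jpg' -print0 | \
while IFS= read -r -d '' image; do
    # resize in place via a temporary file next to the original
    convert "$image" -resize '1080^>' "${image}___" && mv "${image}___" "$image"
done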

Related

Which is best way to grep on exec from find command?

I'm just curious which of these statements would be more resource intensive.
I expect the search to match thousands of files at times, and want to make sure I implement the "safest" execution. The files themselves will be relatively small, but the number of files might be substantially large.
The two alternatives:
sudo find /home/users -name '*.sh' -type f -exec grep -n 'rm.*tmp.*7z$' {} \+
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} \;
As you can see the only difference is whether I should use the + or the ;
The first one is going to run grep fewer times than the second, which will launch one instance per file. grep's startup time is pretty fast, so it might not be much of a visible improvement, but the first one will be more efficient, the second one more resource intensive.
(You'll want to add -H to the first grep's options too, just in case it gets run with a single filename argument at some point.)
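With that flag added, the batched form would look like:
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} +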

Faster iteration over thousands of files

I'm trying to do something on ~200,000 files in a single folder.
When I do this:
for i in *; do /bin/echo -n "."; done
One dot is printed every few seconds. The same operation on a folder with a hundred files works blazingly fast.
Why is this so? How to accelerate the process for folders with thousands of files?
Try this with GNU find:
find . -maxdepth 1 -type f -printf "."
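If the real task is more than printing a dot, the same idea still applies: let find batch the filenames instead of looping in the shell. A sketch, with /bin/echo standing in for whatever command you actually need to run:
# -exec ... + passes many filenames to each invocation instead of starting one process per file
find . -maxdepth 1 -type f -exec /bin/echo {} +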

How can I convert JPG images recursively using find?

Essentially what I want to do is search the working directory recursively, then use the paths given to resize the images. For example, find all *.jpg files, resize them to 300x300 and rename to whatever.jpg.
Should I be doing something along the lines of $(find | grep *.jpg) to get the paths? When I do that, the output is directories not enclosed in quotation marks, meaning that I would have to insert them before it would be useful, right?
I use mogrify with find.
Let's say I need everything inside my nested folder/another/folder/*.jpg to be converted to *.png:
find . -name "*.jpg" -print0 | xargs -0 -I{} mogrify -format png {}
And with a bit of explanation:
find . -name "*.jpg" -- finds all the JPGs inside the nested folders.
-print0 -- prints each filename without any nasty surprises (e.g. filenames containing spaces), separating them with NUL characters.
xargs -0 -I{} -- processes the files one by one with mogrify.
And lastly, {} is just a placeholder for each filename coming from find.
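If you drop the -I{} placeholder, xargs can pass many filenames to a single mogrify call instead of starting one process per file, which is noticeably faster on large trees (a sketch, assuming GNU xargs and ImageMagick's mogrify):
find . -name "*.jpg" -print0 | xargs -0 mogrify -format png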
You can use something like this with GNU find:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} +
This will be safer in terms of quoting, and spawn fewer processes. As long as your script can handle the length of the argument list, this solution should be the most efficient option.
If you need to handle really long file lists, you may have to pay the price and spawn more processes. You can modify find to handle each file separately. For example:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} \;
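For reference, /your/image/conversion/script.sh above is just a placeholder. A hypothetical version matching the 300x300 requirement, written to cope with the batched argument list that -exec ... + produces, might look like this (the output naming is an arbitrary choice):
#!/bin/bash
# Hypothetical conversion script: find -exec ... + may pass many paths at once, so loop over "$@"
for image in "$@"; do
    convert "$image" -resize 300x300 "${image%.*}_300.jpg"
done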

improve find performance

I have a bash script that zips up filenames based on user input. It is working fine albeit slowly since I have, at times, to parse up to 50K files.
find "$DIR" -name "$USERINPUT" -print | /usr/bin/zip -1 SearchResult -#
The # sign here means that zip will be accepting file names from STDIN. Is there a way to make it go faster?
I am thinking of creating a cron job to update the locate database every night, but I am not root, so I don't even know if it is worth it.
Any suggestions welcome.
I suggest you make use of the parallel processing option of xargs to speed up the entire process. Use a command like this:
find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
The above command will make xargs run 10 parallel zip sub-processes.
Record the timing of the above command like this:
time find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
and see if this makes any performance improvements.
As Mattias Ahnberg pointed out, this use of find will generate the entire list of matching files before zip gets invoked. If you're doing this over 50,000 files, that will take some time. Perhaps a more suitable approach would be to use find's -exec <cmd> {} \; feature:
find "$DIR" -name "$USERINPUT" -exec /usr/bin/zip -1 {} \;
This way, find invokes zip itself on each matching file. You should achieve the same end result as your original version, but if the sheer number of files is your bottleneck (which, if the files are all small, is most likely), this will kick off running zip as soon as it starts finding matches, rather than when all matches have been found.
NB: I recommend reading the man page for find for details of this option. Note that the semi-colon must be escaped to prevent your shell interpreting it rather than passing it to find.
Sounds like you're trawling through the filesystem running a find for each of the 50,000 files.
Why not do one run of find, to log names of all files in the filesystem, and then pluck the locations of them straight from this log file ?
Alternatively, break the work down into separate jobs, particularly if you have multiple filesystems and multiple CPUs. There's no need to be single-threaded in your approach.
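A rough sketch of the log-file idea (the index path and the rebuild schedule are up to you; /tmp/file-index.txt is just an assumed location):
# Build the index once (e.g. nightly); later searches read it instead of re-walking the disk
find "$DIR" -type f > /tmp/file-index.txt
# Note: grep matches regexes, so a -name glob like '*.log' becomes something like '\.log$'
grep '\.log$' /tmp/file-index.txt | zip -1 SearchResult -@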

Which is faster, 'find -exec' or 'find | xargs -0'?

In my web application I render pages using PHP script, and then generate static HTML files from them. The static HTML are served to the users to speed up performance. The HTML files become stale eventually, and need to be deleted.
I am debating between two ways to write the eviction script.
The first is using a single find command, like
find /var/www/cache -type f -mmin +10 -exec rm \{} \;
The second form is by piping through xargs, something like
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 rm
The first form invokes rm for each file it finds, while the second form just sends all the file names to a single rm (but the file list might be very long).
Which form would be faster?
In my case, the cache directory is shared between a few web servers, so this is all done over NFS, if that matters for this issue.
With a lot of files, the xargs version is dramatically faster than the -exec version as you posted it. This is because rm is executed once for each file you want to remove, while xargs lumps as many files as possible together into a single rm command.
With tens or hundreds of thousands of files, it can be the difference between a minute or less versus the better part of an hour.
You can get the same behavior with -exec by finishing the command with a "+" instead of "\;". This option is only available in newer versions of find.
The following two are roughly equivalent:
find . -print0 | xargs -0 rm
find . -exec rm \{} +
Note that the xargs version will still run slightly faster (by a few percent) on a multi-processor system, because some of the work can be parallelized. This is particularly true if a lot of computation is involved.
I expect the xargs version to be slightly faster, as you aren't spawning a process for each filename. But I would be surprised if there was actually much difference in practice. If you're worried about the long list xargs sends to each invocation of rm, you can use -n with xargs to limit the number of arguments per invocation. However, xargs already knows the maximum command-line length and won't go beyond that.
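For example (1000 is an arbitrary cap, purely to illustrate the option):
# pass at most 1000 filenames to each rm invocation
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 -n 1000 rm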
The find command has a -delete option built in; perhaps that could be useful as well?
http://lists.freebsd.org/pipermail/freebsd-questions/2004-July/051768.html
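For this particular case that would simply be:
# find removes the matches itself; no external rm processes at all
find /var/www/cache -type f -mmin +10 -delete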
Using xargs is faster than using exec with find.
I tried counting the number of lines in the files with a .js extension under a node_modules folder, using both xargs and exec. The output is below.
time find . -type f -name "*.js" -exec wc -l {} \;
real 0m0.296s
user 0m0.133s
sys 0m0.038s
time find . -type f -name "*.js" |xargs wc -l
real 0m0.019s
user 0m0.005s
sys 0m0.006s
xargs runs approximately 15 times faster than exec here.
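Note that the \; form starts one wc process per file, which accounts for most of the measured difference. As mentioned above, the batched -exec form closes most of that gap; a fairer comparison would be something like (timings will vary with cache state and file count):
time find . -type f -name "*.js" -exec wc -l {} +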
