Which is the best way to grep on exec from the find command? - bash

I'm just curious which of these statements would be most resource intensive.
I expect the criteria to match 1000s of files at times, and want to make sure I implement the "safest" execution. The files themselves will be relatively small, but the number of files might be substantially large.
The two alternatives:
sudo find /home/users -name '*.sh' -type f -exec grep -n 'rm.*tmp.*7z$' {} \+
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} \;
As you can see the only difference is whether I should use the + or the ;

The first one is going to run grep fewer times than the second, which will launch one instance per file. grep's startup time is pretty fast, so it might not be much of a visible improvement, but the first one will be more efficient, the second one more resource intensive.
(You'll want to add -H to the first grep's options too, just in case it gets run with a single filename argument at some point.)
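To illustrate the -H remark, here is a minimal sketch on a throwaway directory (the file name only.sh and its contents are made up for the demo). It assumes a grep that supports -H, as GNU and BSD grep do. With a single matching file, grep gets one filename argument and leaves the name out of its output unless -H is given:

```shell
#!/bin/sh
# Throwaway directory with exactly one script (hypothetical name and content).
dir=$(mktemp -d)
printf 'rm /tmp/old.7z\n' > "$dir/only.sh"

# Same find, once without -H and once with it.
without_h=$(find "$dir" -name '*.sh' -type f -exec grep -n 'rm.*tmp.*7z$' {} +)
with_h=$(find "$dir" -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} +)

printf 'without -H: %s\n' "$without_h"   # line number and text only
printf 'with -H:    %s\n' "$with_h"      # file name, line number and text

rm -rf "$dir"
```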

Related

Find Command Exclude Hidden files when using empty flag

I am looking for a way to use the find command to tell if a folder has no files in it. I have tried using the -empty flag, but since I am on macOS the system files the OS places in the directory such as .DS_Store cause find to not consider the directory empty. I have tried telling find to ignore .DS_Store but it still considers the directory not empty because that file is present.
Is there a way to have find exclude certain files from what it considers -empty? Also is there a way to have find return a list of directories with no visible files?
The -empty predicate is rather simple: it's true for a directory only if it has no entries other than . and .. (so a hidden file such as .DS_Store makes the directory non-empty).
Kind of an ugly solution, but you can use -exec to run another find in each directory which will implement your criteria for deciding what directories you want to include.
Below:
the outer find will execute sh -c for each directory in /starting/point
sh will execute another find with different criteria.
the inner find will print the first match and then quit
read will consume the output (if any) of the inner find. read will have an exit status of 0 only if the inner find printed at least one line, non-zero otherwise
if there was no output from the inner find, the outer find's -exec predicate will evaluate to false
since -exec is followed by -o, the following -print action will be executed only for those directories which do not match the inner find's criteria
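find /starting/point \
-type d \( \
-exec sh -c \
'find "$1" -mindepth 1 -maxdepth 1 ! -name ".*" -print -quit | read -r line' \
sh {} \; \
-o -print \
\)
(read is given a variable name here because some sh implementations, such as dash, reject read with no arguments.)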
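The nested find can be tried out on a scratch tree. This is only a sketch: the directory names (visible, hidden_only, empty) are invented for the demo, and it assumes a find that supports -mindepth, -maxdepth and -quit, as GNU and BSD find do.

```shell
#!/bin/sh
# Scratch tree: one directory with a visible file, one with only a dot file,
# and one with nothing in it at all.
root=$(mktemp -d)
mkdir -p "$root/visible" "$root/hidden_only" "$root/empty"
touch "$root/visible/file.txt" "$root/hidden_only/.DS_Store"

# Print every directory that has no visible (non-dot) entries.
result=$(find "$root" -type d \( \
    -exec sh -c 'find "$1" -mindepth 1 -maxdepth 1 ! -name ".*" -print -quit | read -r line' sh {} \; \
    -o -print \) | sort)

printf '%s\n' "$result"
rm -rf "$root"
```

The scratch root itself is not listed, because its subdirectories count as visible entries.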
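Also note that 'find FOLDER -empty' is somewhat tricky. The -empty test matches empty regular files as well as empty directories, so the command also lists any zero-length files inside FOLDER, which can make it look as if something in FOLDER is empty when it is not the directory itself. Add -type d to restrict the test to directories.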
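A small sketch of what -empty matches, on a scratch directory with made-up names. It assumes a find that supports -empty, as GNU and BSD find do.

```shell
#!/bin/sh
# Scratch tree: a truly empty directory, a directory holding only a dot file,
# and a zero-length regular file.
dir=$(mktemp -d)
mkdir "$dir/really_empty" "$dir/has_hidden"
touch "$dir/has_hidden/.DS_Store" "$dir/zero_bytes.txt"

# -empty alone matches empty files AND empty directories.
all_empty=$(find "$dir" -empty | sort)

# With -type d only directories are considered.
empty_dirs=$(find "$dir" -type d -empty)

printf 'everything -empty matches:\n%s\n' "$all_empty"
printf 'empty directories only:\n%s\n' "$empty_dirs"
rm -rf "$dir"
```

Note that has_hidden is absent from the second list: the dot file inside it still counts as an entry.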
Maybe not exactly what was asked, but I prefer the brute force approach if I want to avoid a no-match error on using FOLDER/*. In tcsh:
ls -d FOLDER/* >& /dev/null
if !($status) COMMANDS FOLDER/* ...
A variation of this might be usable here (like also using
ls -d FOLDER/.* | wc -l
and drawing the desired conclusions from the combined results).

find seems to be much slower with -print0 option

I am trying to resize photos larger than specific dimensions, for hundreds of thousands of photos collected by a system over the past 10 years. I am using find and ImageMagick.
I wrote this script to do it.
#!/bin/bash
ResizeSize="1080^>"
Processing=0
find . -type f -iname '*JPG' -print0 | \
while IFS= read -r -d '' image; do
    ((Processing++))
    echo "Processing file: $Processing"
    echo "Resizing $image"
    # Resize into a temporary file next to the original
    if convert "$image" -resize "$ResizeSize" "${image}___"; then
        # Replace the original only after a successful resize
        if rm "$image"; then
            mv "${image}___" "$image"
        fi
    else
        echo "something wrong with resize"
        exit 1
    fi
done
The script works on a small number of files, but it takes a long time to start when there are lots of files. I have tested find . -type f -iname '*JPG' -print0 against find . -type f -iname '*JPG' on the command line. The latter finds files within a few seconds, but the former takes minutes before anything is found. Unfortunately, -print0 is required for dealing with filenames with special characters (which are mainly spaces in my case). How can I make this script more efficient?
I can not reproduce the behavior you're experiencing, but can think of two possible explanations.
First, you might be experiencing positive effects of page (disk) caching.
When you call find for the first time, it traverses the files (the metadata in their inodes), actually reading from the storage medium (HDD) via kernel syscalls. But the kernel also stores that data in unused areas of memory, which act as a cache, and it does so transparently to find and other applications. If the data is read again later, it can be served quickly from this in-memory cache. This is called page caching.
So, your second call to find (no matter what output separator is used) will be a lot faster, assuming you are searching over the same files, with the same criteria.
Second, since find's output might be buffered, if your files are in many different locations, it might take some time before the actual first output to the while command. Also if the output is line-buffered, that would explain why -print0 variant takes longer to produce the first output (since there are no lines at all).
You can try running find with unbuffered output, via stdbuf command:
stdbuf -o0 find . -iname '*.jpg' -type f -print0 ...
One more thing, unrelated to this; to speed-up your find search, you might want to consider calling it like this:
find . -iname '*.jpg' -type f -print0
Here we put the -iname test before the -type test in order to avoid having to call stat(2) on every file. Even better would be to remove the -type test altogether, if possible.
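If the null-delimited pipe itself turns out to be the problem, one alternative (a swapped-in technique, not what the script above does) is to let find hand the names straight to a small shell loop with -exec ... {} +. The names arrive as arguments, so spaces and other special characters survive without -print0. The sketch below only prints what it would resize. The file names are invented, and -iname is assumed to be available, as in GNU and BSD find.

```shell
#!/bin/sh
# Scratch directory with one name containing a space (hypothetical files).
dir=$(mktemp -d)
touch "$dir/a b.JPG" "$dir/c.jpg" "$dir/notes.txt"

# find passes batches of matching names as arguments to the inner shell;
# "for image do" iterates over those arguments one at a time.
out=$(find "$dir" -iname '*.jpg' -type f -exec sh -c '
    for image do
        printf "would resize: %s\n" "${image##*/}"
    done' sh {} + | sort)

printf '%s\n' "$out"
rm -rf "$dir"
```

In a real run the printf would be replaced by the convert and mv steps.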

Ever try to delete files with unix shell find? Use the -delete option

This has come up a number of times in posts, so I'm mentioning it as a thank-you to all the helpful people on Stack Overflow.
Have you ever wanted to do a bunch of deletes from the command line/terminal in Unix? Perhaps you used a construct like
find . -name '*.pyc' -exec rm {} \;
Look to the answer for an elegant way to do this.
Here's how to do it with the -delete option!
Use the find command option -delete:
find . -name '*.pyc' -delete
Of course, do try a dry run without the -delete, to see if you are going to delete what you want!!! Those computers do run so darn fast! ;-)
+1 for taking the initiative and finding the solution to your issue yourself. A couple of rather minor notes:
I would recommend getting into the habit of using the -type f flag when you're wanting to delete files. This restricts find to files that are actually files (i.e. not directories or links). Otherwise you might inadvertently delete a directory, which is probably not what you wanted to do. (That said, unless you have a directory named 'something.pyc', that wouldn't be an issue for your example command. It's just a good habit in general.)
Also, just to let you know, if you decide to use the -exec rm version, it would run faster if you did this instead:
find . -type f -name '*.pyc' -exec rm {} \+
This version adds as many arguments to a single invocation of rm as it can, thereby reducing the total number of calls to rm. It works pretty much like the default behavior in xargs.
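A sketch of the dry-run-then-delete habit on a scratch tree. The names (pkg, cache.pyc and so on) are invented, and -delete is assumed to be available, as in GNU and BSD find. The directory deliberately named cache.pyc shows what -type f protects against:

```shell
#!/bin/sh
# Scratch tree with two .pyc files and a directory whose name ends in .pyc.
dir=$(mktemp -d)
mkdir "$dir/pkg" "$dir/cache.pyc"
touch "$dir/pkg/a.pyc" "$dir/pkg/a.py" "$dir/b.pyc"

# Dry run: same tests, but -print instead of -delete.
find "$dir" -type f -name '*.pyc' -print

# Real run: only regular files are removed.
find "$dir" -type f -name '*.pyc' -delete

# The directory cache.pyc is the only *.pyc entry left.
remaining=$(find "$dir" -name '*.pyc' | wc -l | tr -d ' ')
echo "remaining .pyc entries: $remaining"
[ -f "$dir/pkg/a.py" ] && py_kept=yes || py_kept=no

rm -rf "$dir"
```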

improve find performance

I have a bash script that zips up filenames based on user input. It is working fine albeit slowly since I have, at times, to parse up to 50K files.
find "$DIR" -name "$USERINPUT" -print | /usr/bin/zip -1 SearchResult -@
The -@ option here means that zip will accept file names from STDIN. Is there a way to make it go faster?
I am thinking of creating a cron job to update the locate database every night, but I am not root, so I don't even know if it is worth it.
Any suggestions welcome.
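I suggest you make use of parallel processing in the xargs command to speed up the whole process. Use a command like this:
find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
The above command lets xargs run up to 10 parallel sub-processes. Because xargs passes the file names as arguments, zip does not need to read names from STDIN here. Be aware that several zip processes updating the same archive at once may interfere with each other, so check the resulting archive.
Please record the timing of the above command like this:
time find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult
and see whether it makes any performance improvement.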
As Mattias Ahnberg pointed out, this use of find will generate the entire list of matching files before zip gets invoked. If you're doing this over 50,000 files, that will take some time. Perhaps a more suitable approach would be to use find's -exec <cmd> {} \; feature:
find "$DIR" -name "$USERINPUT" -exec /usr/bin/zip -1 SearchResult {} \;
This way, find invokes zip itself on each matching file. You should achieve the same end result as your original version, but if the sheer number of files is your bottleneck (which, if the files are all small, is most likely), this will kick off running zip as soon as it starts finding matches, rather than when all matches have been found.
NB: I recommend reading the man page for find for details of this option. Note that the semi-colon must be escaped to prevent your shell interpreting it rather than passing it to find.
Sounds like you're trawling through the filesystem running a find for each of the 50,000 files.
Why not do one run of find, to log names of all files in the filesystem, and then pluck the locations of them straight from this log file ?
Alternatively, break the work down into separate jobs, particularly if you have multiple filesystems and multiple CPUs. No need to be single-threaded in your approach.
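The "one run of find, then pluck from the log" idea could look roughly like this. It is a sketch under assumptions: the index file name and the sample files are made up, and the line-based index breaks on file names that contain newlines.

```shell
#!/bin/sh
# Scratch tree standing in for the real filesystem (hypothetical files).
dir=$(mktemp -d)
mkdir -p "$dir/a" "$dir/b"
touch "$dir/a/report.txt" "$dir/b/report.txt" "$dir/b/other.log"

# One traversal: record every regular file, one path per line.
# The index lives outside the tree so it does not list itself.
index="$dir.index"
find "$dir" -type f > "$index"

# Later lookups read the index instead of walking the tree again.
matches=$(grep -c '/report\.txt$' "$index")
echo "report.txt found $matches times"
grep '/report\.txt$' "$index"

rm -rf "$dir" "$index"
```

Each user query then costs one grep over a text file rather than a fresh traversal, at the price of the index going stale between rebuilds.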

Which is faster, 'find -exec' or 'find | xargs -0'?

In my web application I render pages using PHP script, and then generate static HTML files from them. The static HTML are served to the users to speed up performance. The HTML files become stale eventually, and need to be deleted.
I am debating between two ways to write the eviction script.
The first is using a single find command, like
find /var/www/cache -type f -mmin +10 -exec rm \{} \;
The second form is by piping through xargs, something like
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 rm
The first form invokes rm for each file it finds, while the second form just sends all the file names to a single rm (but the file list might be very long).
Which form would be faster?
In my case, the cache directory is shared between a few web servers, so this is all done over NFS, if that matters for this issue.
The xargs version is dramatically faster with a lot of files than the -exec version as you posted it. This is because -exec ... \; executes rm once for each file you want to remove, while xargs lumps as many files as possible together into a single rm command.
With tens or hundreds of thousands of files, it can be the difference between a minute or less versus the better part of an hour.
You can get the same behavior with -exec by finishing the command with a "+" instead of "\;". This option is only available in newer versions of find.
The following two are roughly equivalent:
find . -print0 | xargs -0 rm
find . -exec rm \{} +
Note that the xargs version will still run slightly faster (by a few percent) on a multi-processor system, because some of the work can be parallelized. This is particularly true if a lot of computation is involved.
I expect the xargs version to be slightly faster, as you aren't spawning a process for each filename. But I would be surprised if there was actually much difference in practice. If you're worried about the long list xargs sends to each invocation of rm, you can use -n with xargs to limit the number of arguments it passes per invocation. However, xargs knows the maximum command-line length and won't go beyond that.
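As a small illustration of capping the batch size, the sketch below uses echo in place of rm, so each output line corresponds to one invocation. The five input tokens are arbitrary.

```shell
#!/bin/sh
# Five tokens, at most two per invocation: echo runs three times.
printf '%s\n' a b c d e | xargs -n 2 echo

# Count the invocations by counting the output lines.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo | wc -l | tr -d ' ')
echo "invocations: $batches"
```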
The find command has a -delete option built in; perhaps that could be useful as well?
http://lists.freebsd.org/pipermail/freebsd-questions/2004-July/051768.html
Using xargs is faster than using -exec with find.
I counted the number of lines in the .js files in a node_modules folder using both xargs and -exec. The output is below.
time find . -type f -name "*.js" -exec wc -l {} \;
real 0m0.296s
user 0m0.133s
sys 0m0.038s
time find . -type f -name "*.js" |xargs wc -l
real 0m0.019s
user 0m0.005s
sys 0m0.006s
xargs executes approx 15 times faster than exec.
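The timings above compare xargs with the one-process-per-file form of -exec. A rough way to see where the difference comes from is to count how many processes each form starts. In this sketch the five .js files are made up, the helper prints one line per invocation, and -print0 and xargs -0 are assumed to be available, as in the GNU and BSD tools:

```shell
#!/bin/sh
# Scratch directory with five small files (hypothetical names).
dir=$(mktemp -d)
for i in 1 2 3 4 5; do touch "$dir/f$i.js"; done

# Each invocation of the helper prints one line, so the number of lines
# equals the number of processes that were started.
per_file=$(find "$dir" -type f -name '*.js' -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
batched=$(find "$dir" -type f -name '*.js' -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')
via_xargs=$(find "$dir" -type f -name '*.js' -print0 | xargs -0 sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "-exec ... \\; started $per_file processes"
echo "-exec ... +  started $batched"
echo "xargs -0     started $via_xargs"
rm -rf "$dir"
```

With so few files, both the + form and xargs fit everything into a single invocation, so a fairer benchmark would time -exec wc -l {} + against xargs as well.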

Resources