I have a bash script that zips up filenames based on user input. It works fine, albeit slowly, since I sometimes have to process up to 50K files.
find "$DIR" -name "$USERINPUT" -print | /usr/bin/zip -1 SearchResult -#
The -@ option means that zip will accept file names from STDIN. Is there a way to make it go faster?
I am thinking of creating a cron job to update the locate database every night, but I am not root, so I don't even know if it is worth it.
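Roughly what I have in mind for the locate route is something like this (untested sketch, assuming mlocate's updatedb/locate and an illustrative database path in my home directory):
# Build a private locate database covering only $DIR (run nightly from a user cron job)
updatedb -l 0 -o "$HOME/search.db" -U "$DIR"
# Query the database instead of walking the filesystem, then zip the matches
locate -d "$HOME/search.db" "$USERINPUT" | /usr/bin/zip -1 SearchResult -@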
Any suggestions welcome.
I suggest you make use of the parallel processing feature of xargs to speed up the process. Use a command like this:
find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult -#
The above command will make xargs run up to 10 parallel sub-processes; since xargs passes the file names as arguments, zip no longer needs to read them from STDIN.
Please record the timing of the above command like this:
time find "$DIR" -name "$USERINPUT" -print0 | xargs -0 -P10 zip -1 SearchResult -#
and see if this makes any performance improvements.
As Mattias Ahnberg pointed out, this use of find will generate the entire list of matching files before zip gets invoked. If you're doing this over 50,000 files, that will take some time. Perhaps a more suitable approach would be to use find's -exec <cmd> {} \; feature:
find "$DIR" -name "$USERINPUT" -exec /usr/bin/zip -1 {} \;
This way, find invokes zip itself on each matching file. You should achieve the same end result as your original version, but if the sheer number of files is your bottleneck (which, if the files are all small, is most likely), this will kick off running zip as soon as it starts finding matches, rather than when all matches have been found.
NB: I recommend reading the man page for find for details of this option. Note that the semi-colon must be escaped to prevent your shell interpreting it rather than passing it to find.
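If per-file zip startup ever becomes the bottleneck instead, the batched + form of -exec (described in the find man page alongside \;) passes many names to each zip invocation; a sketch:
find "$DIR" -name "$USERINPUT" -exec /usr/bin/zip -1 SearchResult {} +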
Sounds like you're trawling through the filesystem running a find for each of the 50,000 files.
Why not do one run of find to log the names of all files in the filesystem, and then pluck their locations straight from that log file?
Alternatively, break the work down into separate jobs, particularly if you have multiple filesystems and multiple CPUs. No need to be single-threaded in your approach.
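A minimal sketch of that log-file approach (the index location is an illustrative assumption, and the user's glob would need translating into something grep understands):
# Rebuild the index once (e.g. nightly) instead of walking the tree per search
find "$DIR" -type f > "$HOME/filelist.txt"
# Serve each search from the index; a glob such as '*.txt' would need converting to a regex such as '\.txt$'
grep "$USERINPUT" "$HOME/filelist.txt" | /usr/bin/zip -1 SearchResult -@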
So, I have this simple script which converts videos in a folder into a format which the R4DS can play.
#!/bin/bash
scr='/home/user/dpgv4/dpgv4.py';mkdir -p 'DPG_DS'
find '../Exports' -name "*1080pnornmain.mp4" -exec python3 "$scr" {} \;
The problem is, some of the videos are invalid and won't play, and I've moved those videos to a different directory inside the Exports folder. What I want to do is check to make sure the files are in a folder called new before running the python script on them, preferably within the find command. The path should look something like this:
../Exports/(anything here)/new/*1080pnornmain.mp4
Please note that (anything here) text does not indicate a single directory, it could be something like foo/bar, foo/b/ar, f/o/o/b/a/r, etc.
You cannot use -name because the match is now against the path, not just the file name. My first solution was:
find ./Exports -path '**/new/*1080pnornmain.mp4' -exec python3 "$scr" {} \;
But, as @dan pointed out in the comments, it is wrong because it uses the globstar wildcard (**) unnecessarily:
This checks if /new/ is somewhere in the preceding path; it doesn't have to be a direct parent.
So, the star is not enough here. Another possibility, using find only, could be this one:
find ./Exports -regex '.*/new/[^\/]*1080pnornmain.mp4' -exec python3 "$scr" {} \;
This regex matches:
any number of nested folders before new with .*/new
any character (except / to leave out further subpaths) + your filename with [^\/]*1080pnornmain.mp4
Performance could degrade somewhat, given that it uses regular expressions.
Generally, instead of using the -exec ... \; option of find, you can opt for piping the find output to xargs, which can batch arguments and spawn fewer processes, like:
find ./Exports -regex '.*/new/[^\/]*1080pnornmain.mp4' -print0 | xargs -0 -I '{}' python3 "$scr" '{}'
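Note that -I '{}' still runs one python3 process per file, so there is no batching gain there; if dpgv4.py happens to accept several input files at once (an assumption about the script), the + form of -exec batches them without xargs:
find ./Exports -regex '.*/new/[^\/]*1080pnornmain.mp4' -exec python3 "$scr" {} +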
I'm just curious which of these statements would be most resource intensive.
I expect the criteria at times to be 1000s of files, and want to make sure I implement the "safest" execution. Files themselves will be relatively small, but the amount of files might be substantially large.
The two alternatives:
sudo find /home/users -name '*.sh' -type f -exec grep -n 'rm.*tmp.*7z$' {} \+
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} \;
As you can see, the only real difference is whether I end with + or ; (plus the extra -H in the second).
The first one is going to run grep fewer times than the second, which will launch one instance per file. grep's startup time is pretty fast, so it might not be much of a visible improvement, but the first one will be more efficient, the second one more resource intensive.
(You'll want to add -H to the first grep's options too, just in case it gets run with a single filename argument at some point.)
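So the first form, with -H added as suggested, would be:
sudo find /home/users -name '*.sh' -type f -exec grep -Hn 'rm.*tmp.*7z$' {} +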
First I asked a question here: Unzip a file and then display it in the console in one step
The answer works and helped me a lot (please read it first).
Now I have a second issue. I do not have a single zipped log file; I have a lot of them in different folders, which I need to find first. The files have the same names. For example:
/somedir/server1/log.gz
/somedir/server2/log.gz
/somedir/server3/log.gz
and so on...
What I need is a way to:
find all the files like: find /somedir/server* -type f -name log.gz
unzip the files like: gunzip -c log.gz
use grep on the content of the files
Important! The whole thing should be done in one step.
I cannot first store the extracted files in the filesystem, because it is a read-only filesystem. I need to somehow connect, with pipes, the output of one command to the input of the next.
Before, the log files were in text format (.txt), so I did not have to unzip them first. In that case it was easy, for example:
find /somedir/server* -type f -name log.txt | xargs grep "term"
Now I have to deal with gzipped files. That means that after I find the files, I first need to unzip them somehow and then send the contents to grep.
With one file I do:
gunzip -c /somedir/server1/log.gz | grep term
But for multiple files I don't know how to do it. For example, how do I pass the output of find to gunzip and then to grep?!
Also, if there is another way / "best practice" to do that, it is welcome :)
find lets you invoke a command on the files it finds:
find /somedir/server* -type f -name log.gz -exec gunzip -c '{}' + | grep ...
From the man page:
-exec command {} +
This variant of the -exec action runs the specified command on
the selected files, but the command line is built by appending
each selected file name at the end; the total number of
invocations of the command will be much less than the number
of matched files. The command line is built in much the same
way that xargs builds its command lines. Only one instance of
{} is allowed within the command, and (when find is being
invoked from a shell) it should be quoted (for example, '{}')
to protect it from interpretation by shells. The command is
executed in the starting directory. If any invocation with
the + form returns a non-zero value as exit status, then
find returns a non-zero exit status. If find encounters an
error, this can sometimes cause an immediate exit, so some
pending commands may not be run at all. This variant of -exec
always returns true.
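An equivalent pipeline, closer to the old .txt workflow from the question and using NUL-delimited names with xargs, would be (with the same placeholder search term):
find /somedir/server* -type f -name log.gz -print0 | xargs -0 gunzip -c | grep "term"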
I've never come to SO asking "Do my homework" but I really don't know where to start with this one.
I have a load of documents which are dumped in a directory after being auto-signed using JSignPdf (the --output-directory option seemingly has no ability to output to the same directory as the input):
/some/dir/Signed/PDF1_signed.pdf
/some/dir/Signed/PDF2_signed.pdf
/some/dir/Signed/PDF3_signed.pdf
I'd like to then find their source/unsigned counterparts:
/some/dir/with/docs/PDF1.pdf
/some/dir/where/is/PDF2.pdf
/some/dir/why/this/PDF3.pdf
...and move the signed PDFs into the respective directories.
I use this command to find all the PDFs in the various directories:
find . -name '*.pdf' -exec sh -c 'exec java -jar jsignpdf-1.4.3/JSignPdf.jar ... ' sh {} +
...and I've tried things like capturing the find output in a variable and then using if/then to match, with no success. Would the find output need to be split into multiple variables? I'm so lost :(
I'd like to accomplish this in some shell, but if there are Perl junkies out there or anything else, I am more than happy for another portable solution.
I've tried to break it down, but still don't understand how to accomplish it...
find files matching VarName without _signed
move _signed file with matching name to the directory of found file
Thanks for any help/guidance.
Use a while loop to read each file found by find and move it to the correct place:
find /some/dir -name "*.pdf" ! -name "*_signed.pdf" -print0 | while IFS= read -d '' -r file
do
f="${file##*/}"
mv "/some/dir/Signed/${f%.*}_signed.pdf" "${file%/*}"
done
I have a similar problem I've been working on. Since the path manipulation required to convert /some/dir/where/is/PDF2.pdf to /some/dir/Signed/PDF2_signed.pdf is fairly simple but more involved than can be done in a simple one-liner, I've been using find to locate the first set, and using a simple loop to process them one at a time. You did mention homework, so I'll try not to give you too much code.
find /dir/containing/unsigned -name '*.pdf' -print0 | while IFS= read -r -d '' path; do
fetch_signed_version "$path"
done
where fetch_signed_version is a shell function you write that, given a path such as /some/dir/where/is/PDF2.pdf, extracts the directory (/some/dir/where/is), computes the signed PDF's name (PDF2_signed.pdf), and then executes the necessary move (mv /some/dir/Signed/$signed_pdf /some/dir/where/is).
fetch_signed_version is actually pretty simple:
fetch_signed_version () {
dir=${1%/*}
fname=${1##*/}
signed_name=${fname%.pdf}_signed.pdf
mv "/some/dir/Signed/$signed_name" "$dir"
}
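Tracing the parameter expansions with the worked example path from above:
fetch_signed_version /some/dir/where/is/PDF2.pdf
# dir         -> /some/dir/where/is
# fname       -> PDF2.pdf
# signed_name -> PDF2_signed.pdf
# so it runs:    mv "/some/dir/Signed/PDF2_signed.pdf" "/some/dir/where/is"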
In my web application I render pages using PHP script, and then generate static HTML files from them. The static HTML are served to the users to speed up performance. The HTML files become stale eventually, and need to be deleted.
I am debating between two ways to write the eviction script.
The first is using a single find command, like
find /var/www/cache -type f -mmin +10 -exec rm \{} \;
The second form is by piping through xargs, something like
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 rm
The first form invokes rm for each file it finds, while the second form just sends all the file names to a single rm (but the file list might be very long).
Which form would be faster?
In my case, the cache directory is shared between a few web servers, so this is all done over NFS, if that matters for this issue.
The xargs version is dramatically faster with a lot of files than the -exec version as you posted it. This is because rm is executed once for each file you want to remove, while xargs lumps as many files as possible together into a single rm command.
With tens or hundreds of thousands of files, it can be the difference between a minute or less versus the better part of an hour.
You can get the same behavior with -exec by finishing the command with a "+" instead of "\;". This option is only available in newer versions of find.
The following two are roughly equivalent:
find . -print0 | xargs -0 rm
find . -exec rm \{} +
Note that the xargs version will still run slightly faster (by a few percent) on a multi-processor system, because some of the work can be parallelized. This is particularly true if a lot of computation is involved.
I expect the xargs version to be slightly faster, as you aren't spawning a process for each filename. But I would be surprised if there was actually much difference in practice. If you're worried about the long list xargs sends to each invocation of rm, you can use -n with xargs to limit the number of arguments per invocation. However, xargs knows the maximum command-line length and won't go beyond that anyway.
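For example, capping each rm at a fixed number of arguments would look like this (1000 is an arbitrary illustrative value):
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 -n 1000 rm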
The find command has a -delete option built in; perhaps that could be useful as well?
http://lists.freebsd.org/pipermail/freebsd-questions/2004-July/051768.html
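For this particular job it would look something like this (note that -delete is a GNU/BSD extension, not POSIX):
find /var/www/cache -type f -mmin +10 -delete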
Using xargs is faster than -exec with find.
I tried counting the number of lines in the files with a .js extension in the node_modules folder, using both xargs and -exec. The output is below.
time find . -type f -name "*.js" -exec wc -l {} \;
real 0m0.296s
user 0m0.133s
sys 0m0.038s
time find . -type f -name "*.js" |xargs wc -l
real 0m0.019s
user 0m0.005s
sys 0m0.006s
In this test, xargs ran approximately 15 times faster than -exec.
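For a fairer comparison, it may also be worth timing the batched -exec form discussed above (a sketch; results will vary with caching and file count):
time find . -type f -name "*.js" -exec wc -l {} +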