Is there a way to parallelize find without piping to xargs / parallel?

I am trying to parallelize the following command:
find . -type f -name "{file_regex}" -exec zgrep -cH "{match_regex}" {} \;
I'm hoping to run this across 100 cores. I currently run the find separately in Python and capture the stdout, and then create a thread pool and run zgrep -cH ${match_regex} {file_path} to get the matches. Is there a simpler way of doing this aside from just piping to xargs/parallel?
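For reference, the xargs route this question is trying to avoid would look roughly like this (a sketch only; -P 100 mirrors the 100 cores mentioned above, and the regex placeholders are left as-is):
find . -type f -name "{file_regex}" -print0 | xargs -0 -n 1 -P 100 zgrep -cH "{match_regex}"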

Related

awk for many compressed files

The following command calculates the GC content for each fastq file
identified with the find command. Briefly, a fastq file stores a large number of datapoints, each with 4 lines of information, and the second line, which is the only one I'm interested in, contains only the characters ATGC. For testing, (identical) example files can be found here.
find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need the regex option currently used with find.
find's '-exec' can be used to invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. There are two possible paths: construct a shell command, OR use the more flexible xargs (a rough xargs sketch follows the two examples below).
# Using 'sh -c' to run a shell pipeline
find . -iname '*.fastq.gz' -exec sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" \;
# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat {})" \;
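For the second path (xargs), a rough, untested sketch that hands each filename to a small sh -c wrapper as "$1" rather than substituting {} into the command string:
# Using xargs with a sh -c wrapper
find . -iname '*.fastq.gz' -print0 | xargs -0 -n 1 \
sh -c 'zcat "$1" | awk "(NR%4==2) {N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' _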
See the many references to find/xargs on Stack Overflow.
If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file, called, say, script.awk, like this:
(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}
Now you can simply process them all in parallel with GNU Parallel:
find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
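GNU Parallel runs one job per CPU core by default; if you want to cap or raise that, add -j (the job count below is only an example):
find . -iname \*fastq.gz -print0 | parallel -0 -j 16 gzcat {} \| awk -f ./script.awk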

How to convert a find command to instead use grep to filter and then `exec` commands on output?

I have a scenario where I need to execute a series of commands on each file that's found. This normally would work great, except I have over 100 files and folders to exclude from find's results for execution. This becomes unwieldy and non-executable from the shell directly. It seems like it would be optimal to use an "exclusion file" similar to how tar or grep allows for such files.
Since find does not accept a file for exclusion, but grep does, I want to know: how can the following be converted into a command that replaces find's exclusion (prune) and exec functions with grep and an exclusion file (grep -v -f excludefile) to exclude the folders and files, and then executes a series of commands on the result, as the current command does:
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f \
-exec sh -c "( cmd -with_args 1 '{}'; cmd -args2 '{}'; cmd3 '{}') \
| cmd4 | cmd5 | cmd6; cmd7 '{}'" \; \
> output
As a side note (not critical), I've read that this process becomes much less efficient if you don't use exec. It already takes over 100 minutes to execute each time it's run, so I don't want to slow it down any more than necessary.
The best way I can think of to handle your scenario is to split the one-liner into two steps and introduce xargs with parallel execution (-P).
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f -print > /tmp/full_file_list
grep -v -f excludefile /tmp/full_file_list | xargs -d '\n' -n 1 -P <nr_procs> sh -c 'command here "$1"' _ > output
See Bash script processing limited number of commands in parallel and Doing parallel processing in bash? to learn more about parallelism in bash.
Running the find and the per-file commands together in a single one-liner makes them compete for disk I/O; splitting the one-liner can speed the process up a little.
Hint: remember to add full_file_list/excludefile/output themselves to your exclude rules, and always debug your command on a smaller directory first to reduce waiting time.
Why not simply:
find . -type f |
grep -v -f excludefile |
xargs whatever
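If you also want the parallelism suggested in the previous answer, xargs -P can be bolted on (a sketch assuming GNU xargs; the -P and -n values are arbitrary, and whatever is still the placeholder for your actual command):
find . -type f | grep -v -f excludefile | xargs -d '\n' -P 8 -n 100 whatever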
With respect to "this process is already consuming over 100 minutes to execute": that's almost certainly a problem with whatever command line you put in place of "whatever" above, and we could probably help you improve it if you post a separate question.

Explain how many processes created?

Could someone explain how many processes are created in each case for the commands below? I don't understand it:
The following three commands have roughly the same effect:
rm $(find . -type f -name '*.o')
find . -type f -name '*.o' | xargs rm
find . -type f -name '*.o' -exec rm {} \;
1. Exactly 2 processes: 1 for find, the other for rm.
2. 3 or more processes: 1 for find, another for xargs, and one or more for rm. xargs reads standard input, and if it reads more than can be passed as parameters to a single program (there is a maximum value named ARG_MAX), it invokes rm again with the next batch of arguments.
3. Many processes: 1 for find and another rm for each file ending in .o.
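To see the limit mentioned in point 2 on your own system, getconf can report it (the exact value varies by OS):
getconf ARG_MAX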
In my opinion, option 2 is the best, because it handles the maximum parameter limit correctly and doesn't spawn too many processes. However, I prefer to use it like this (with GNU find and xargs):
find . -type f -name '*.o' -print0 | xargs -0 rm
This terminates each filename with a \0 instead of a newline, since filenames in UNIX can legally contain newlines. This also handles spaces in filenames (much more common) correctly.

Which is faster, 'find -exec' or 'find | xargs -0'?

In my web application I render pages using a PHP script, and then generate static HTML files from them. The static HTML files are served to the users to speed up performance. The HTML files become stale eventually, and need to be deleted.
I am debating between two ways to write the eviction script.
The first is using a single find command, like
find /var/www/cache -type f -mmin +10 -exec rm \{} \;
The second form is by piping through xargs, something like
find /var/www/cache -type f -mmin +10 -print0 | xargs -0 rm
The first form invokes rm for each file it finds, while the second form just sends all the file names to a single rm (but the file list might be very long).
Which form would be faster?
In my case, the cache directory is shared between a few web servers, so this is all done over NFS, if that matters for this issue.
The xargs version is dramatically faster with a lot of files than the -exec version as you posted it. This is because rm is executed once for each file you want to remove, while xargs lumps as many files as possible together into a single rm command.
With tens or hundreds of thousands of files, it can be the difference between a minute or less versus the better part of an hour.
You can get the same behavior with -exec by finishing the command with a "+" instead of "\;". This option is only available in newer versions of find.
The following two are roughly equivalent:
find . -print0 | xargs -0 rm
find . -exec rm \{} +
Note that the xargs version will still run slightly faster (by a few percent) on a multi-processor system, because some of the work can be parallelized: find and rm run concurrently on either side of the pipe. This is particularly true if a lot of computation is involved.
I expect the xargs version to be slightly faster as you aren't spawning a process for each filename. But I would be surprised if there was actually much difference in practice. If you're worried about the long list xargs sends to each invocation of rm, you can use -n with xargs to limit the number of arguments it passes per invocation. However, xargs knows the longest command-line length and won't go beyond that.
The find command has a -delete action built in; perhaps that could be useful as well?
http://lists.freebsd.org/pipermail/freebsd-questions/2004-July/051768.html
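For the cache-eviction case in the question, that would look something like this (assuming your find supports -delete, as GNU and BSD find do):
find /var/www/cache -type f -mmin +10 -delete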
Using xargs is faster than -exec with find.
I tried counting the number of lines in the .js files under a node_modules folder using both xargs and -exec. The output is below.
time find . -type f -name "*.js" -exec wc -l {} \;
real 0m0.296s
user 0m0.133s
sys 0m0.038s
time find . -type f -name "*.js" |xargs wc -l
real 0m0.019s
user 0m0.005s
sys 0m0.006s
In this test, xargs ran approximately 15 times faster than -exec.
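For completeness, the batching -exec form from the earlier answer could be timed the same way (not part of the original measurement):
time find . -type f -name "*.js" -exec wc -l {} +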

How do I use a pipe in the exec parameter for a find command?

I'm trying to construct a find command to process a bunch of files in a directory using two different executables. Unfortunately, -exec on find doesn't allow the use of a pipe, or even \|, because the shell interprets that character first.
Here is specifically what I'm trying to do (which doesn't work because the pipe ends the find command):
find /path/to/jpgs -type f -exec jhead -v {} | grep 123 \; -print
Try this
find /path/to/jpgs -type f -exec sh -c 'jhead -v {} | grep 123' \; -print
Alternatively you could try to embed your exec statement inside a sh script and then do:
find -exec some_script {} \;
A slightly different approach would be to use xargs:
find /path/to/jpgs -type f -print0 | xargs -0 jhead -v | grep 123
which I always found a bit easier to understand and to adapt (the -print0 and -0 arguments are necessary to cope with filenames containing blanks)
This might (not tested) be more efficient than using -exec, because it pipes the list of files to xargs, and xargs makes sure that the jhead command line does not get too long.
With -exec you can only run a single executable with some arguments, not arbitrary shell commands. To circumvent this, you can use sh -c '<shell command>'.
Do note that the use of -exec is quite inefficient. For each file that is found, the command has to be executed again. It would be more efficient if you can avoid this. (For example, by moving the grep outside the -exec or piping the results of find to xargs as suggested by Palmin.)
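A slightly safer variant of that sh -c pattern passes the filename as a positional parameter instead of splicing {} into the quoted command (a sketch based on the question's jhead/grep example):
find /path/to/jpgs -type f -exec sh -c 'jhead -v "$1" | grep 123' _ {} \; -print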
Using the find command for this type of task is maybe not the best alternative. I use the following command frequently to find files that contain the requested information:
for i in dist/*.jar; do echo ">> $i"; jar -tf "$i" | grep BeanException; done
As this outputs a list, would you not just do:
find /path/to/jpgs -type f -exec jhead -v {} \; | grep 123
or
find /path/to/jpgs -type f -print -exec jhead -v {} \; | grep 123
Put your grep on the results of the find -exec.
There is kind of another way you can do it but it is also pretty ghetto.
Using the shell option extquote you can do something similar to this in order to make find exec stuff and then pipe it to sh.
root@ifrit findtest # find -type f -exec echo ls $"|" cat \;|sh
filename
root@ifrit findtest # find -type f -exec echo ls $"|" cat $"|" xargs cat \;|sh
h
I just figured I'd add that because, at least the way I visualized it, it was closer to the OP's original question of using pipes within exec.
