How to use GNU parallel with find -exec? - bash

I want to unzip multiple files.
Using this answer, I found the following command:
find -name '*.zip' -exec sh -c 'unzip -d "${1%.*}" "$1"' _ {} \;
How do I use GNU Parallel with the above command to unzip multiple files?
Edit 1:
In response to questions by user Mark Setchell:
Where are the files?
All the zip files are generally in a single directory.
But, as I understand it, the command finds the files recursively or non-recursively according to the depth options given to the find command.
How are the files named?
abcd_sdfa_fasfasd_dasd14.zip
How do you normally unzip a single one?
unzip abcd_sdfa_fasfasd_dasd14.zip -d abcd_sdfa_fasfasd_dasd14

You can first use find with the -print0 option to NUL-delimit the file names, then read them back into GNU parallel with the matching -0 option and apply the unzip:
find . -type f -name '*.zip' -print0 | parallel -0 unzip -d {/.} {}
The replacement string {/.} expands to the basename of the file with the extension removed, as described in the GNU parallel documentation (see "7. Get basename, and remove last ({.}) or any ({:}) extension"). You can further set the number of parallel jobs with the -j flag, e.g. -j8 or -j64.
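To see what the replacement strings expand to, here is a quick --dry-run illustration (the file name is the one from your question, placed in a hypothetical subdirectory):
parallel --dry-run 'echo {} {.} {/.}' ::: ./some_dir/abcd_sdfa_fasfasd_dasd14.zip
# prints: echo ./some_dir/abcd_sdfa_fasfasd_dasd14.zip ./some_dir/abcd_sdfa_fasfasd_dasd14 abcd_sdfa_fasfasd_dasd14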

You could also use the + variant of -exec. It starts parallel only after find has completed, but it also lets you keep using -print/-printf/-ls/etc. and possibly abort the find before the command is executed:
find . -type f -name '*.zip' -ls -exec parallel unzip -d {.} ::: {} \+
Note that GNU Parallel also uses {} to refer to the input arguments. In this case, however, we use {.} to strip the extension, as shown in your example. You can override GNU Parallel's replacement string {} with -I (for example, -I## lets you use ## instead of {}).
I recommend using GNU Parallel's --dry-run flag, or prepending unzip with an echo, to test the command first and see what would be executed.
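For instance, a dry run of the pipeline above prints the unzip commands without executing anything:
find . -type f -name '*.zip' -print0 | parallel -0 --dry-run unzip -d {/.} {}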

Related

awk for many compressed files

The following command calculates the GC content for each fastq file
identified with the find command. Briefly, a fastq file holds a large number of data points with 4 lines of information each, and the second line, the only one I'm interested in, contains only the characters (ATGC). For testing, (identical) example files can be found here.
find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need the name-matching option (-iname) currently used with find.
find's '-exec' can be used to invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. Two possible paths: construct a shell command, OR use the more flexible xargs.
# Using 'sh -c' to construct the shell pipeline
find . -iname '*.fastq.gz' -exec sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" \;
# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat {})" \;
See the many references to find/xargs on Stack Overflow.
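A sketch of the xargs path, assuming an xargs with -0 support; the awk program is escaped the same way as in the sh -c variant above:
find . -iname '*.fastq.gz' -print0 |
  xargs -0 -n 1 sh -c 'zcat "$1" | awk "(NR%4==2) {N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' _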
If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file, called, say, script.awk, like this:
(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}
Now you can simply process them all in parallel with GNU Parallel:
find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk

Bash: Fail to use find -exec

When using scp or rsync I often run into the 'Argument list too long' error. When I have to mv or rm, I have no problem using find and xargs, but I fail to understand how to use find with -exec despite all the SE posts on the subject. Consider the following issue...
I tried
$scp /Path/to/B* Me@137.92.4.152:/Path/to/
-bash: /usr/bin/scp: Argument list too long
So I tried
$find . -name "/Path/to/B*" -exec scp "{}" Me@137.92.4.152:/Path/to/ '\;'
find: -exec: no terminating ";" or "+"
so I tried
$find . -name "/Path/to/B*" -exec scp "{}" Me@137.92.4.152:/Path/to/ ';'
find: ./.gnupg: Permission denied
find: ./.subversion/auth: Permission denied
So I tried
$sudo find . -name "/Path/to/B*" -exec scp "{}" Me@137.92.4.152:/Path/to/ ';'
and nothing happens once I enter my password.
I am on Mac OSX version 10.11.3, Terminal version 2.6.1
R. Saban's helpful answer solves your primary problem:
-name only accepts a filename pattern, not a path pattern.
Alternatively, you could simply use the -path primary instead of the -name primary.
As for using as few invocations of scp as possible - each of which requires specifying a password by default:
As an alternative, consider bypassing the use of scp altogether, as suggested in Eric Renouf's helpful answer.
While find's -exec primary allows using terminator + in lieu of ; (which must be passed as ';' or \; to prevent the shell from interpreting ; as a command terminator) so as to pass as many filenames as will fit on a single command line (a built-in xargs, in a manner of speaking), that is NOT an option here, because use of + requires that the placeholder {} come last on the command line, immediately before the +.
However, since you're on macOS, you can use BSD xargs's nonstandard -J option for placing the placeholder anywhere on the command line, while still passing as many arguments as possible at once (using BSD find's nonstandard -print0 option in combination with xargs's nonstandard -0 option ensures that all filenames are passed as-is, even if they have embedded spaces, for instance):
find . -path "/Path/to/B*" -print0 | xargs -0 -J {} scp {} Me@137.92.4.152:/Path/to/
Now you will be prompted at most a few times: one prompt per batch of arguments, with batches sized to fit as many arguments as possible within the maximum command-line length, resulting in the fewest calls possible.
EDIT after your update:
find "/Path/to" -maxdepth 1 -name "B*" -exec scp {} Me#137.92.4.152:/Path/to/ \;
A solution that wouldn't require multiple scp connections (and therefore password entries) would be to tar on one side and untar on the other like:
find /Path/to -maxdepth 1 -name 'B*' -print0 | tar -c --null -T - | ssh Me@137.92.4.152 tar -x -C /Path/to
assuming your version of find supports -print0 and the like. It works by printing a null-terminated list of files from find and telling tar to read its list of files from stdin (-T -), treating the list as null-terminated (--null), and to create a new archive (-c). By default, tar writes to stdout.
So then we'll pipe that archive to an ssh command to the target host. It will read the output of the previous command on its stdin, so we use tar there to extract (-x) the archive into the given directory (-C /Path/to).

Command to empty many files recursively

I would like to clear the content of many log files of a given directory recursively, without deleting every file. Is that possible with a simple command?
I know that I can do > logs/logfile.log one by one, but there are lots of logs in that folder, and that is not straightforward.
I am using macOS Sierra by the way.
Thanks to @chepner for showing me the better way to protect against double quotes in the file names:
You can use find to do it
find start_dir -type f -exec sh -c '> "$1"' _ {} \;
And you could add extra restrictions if you don't want all files, like if you want only files ending in .log you could do
find start_dir -type f -name '*.log' -exec sh -c '> "$1"' _ {} \;
As macOS includes Perl anyway:
perl -e 'for(<logs/*log>){truncate $_,0}'
Or, more succinctly, if you use Homebrew and have installed GNU Parallel (which is just a Perl script), you can do:
parallel '>' ::: logs/*log
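If the logs are nested in subdirectories, a sketch combining the two approaches above (find supplies the recursion, parallel does the truncation):
find logs -type f -name '*.log' -print0 | parallel -0 '> {}'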

help using xargs to pass multiple filenames to shell script

Can someone show me how to use xargs properly? Or if not xargs, what unix command should I use?
I basically want to input more than one file name for the <localfile> input, the second of the three parameters.
For example:
1. use `find` to get list of files
2. use each filename as input to shell script
Usage of shell script:
test.sh <localdir> <localfile> <projectname>
My attempt, but not working:
find /share1/test -name '*.dat' | xargs ./test.sh /staging/data/project/ '{}' projectZ \;
Edit:
After some input from everybody and trying -exec, I am finding that my <localfile> filename input from find is also giving me the full path, /path/filename.dat, instead of filename.dat. Is there a way to get just the basename from find? I think this will have to be a separate question.
I'd just use find -exec here:
% find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ {} projectZ \;
This will invoke ./test.sh with your three arguments once for each .dat file under /share1/test.
xargs would pack up all of these filenames and pass them into one invocation of ./test.sh, which doesn't look like your desired behaviour.
If you want to execute the shell script for each file (as opposed to execute in only once on the whole list of files), you may want to use find -exec:
find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ '{}' projectZ \;
Remember (see the sketch after this list):
find -exec runs the command once per file, for each file.
xargs instead runs the command as few times as possible, using many files as arguments at once.
xargs stuffs as many files as it can onto the end of the command line.
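A minimal illustration of the difference, with echo standing in for the real command:
# one invocation per file:
find /share1/test -name '*.dat' -exec echo {} \;
# as few invocations as possible, many files per call:
find /share1/test -name '*.dat' -print0 | xargs -0 echo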
Do you want to execute the script on one file at a time, or once for all files? For one at a time, use find's -exec, whose syntax it looks like you're already using, and which xargs doesn't use:
find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ '{}' projectZ \;
xargs does not have to combine arguments; that is just its default behavior. The following uses xargs properly to execute the command once per file, as intended:
find /share1/test -name '*.dat' -print0 | xargs -0 -I'{}' ./test.sh /staging/data/project/ '{}' projectZ
When piping find to xargs, NUL termination is usually preferred, so I recommend appending the -print0 option to find. You must then add -0 to xargs so that it expects NUL-terminated arguments. This ensures proper handling of filenames that contain whitespace or other special characters. It's not POSIX proper, but it is considered well supported. You can always drop the NUL-terminating options if your commands lack support.
Remember: while find's purpose is finding files, xargs is much more generic. I often use xargs to process non-filename arguments.
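As for the basename question in your edit: one possible sketch, assuming the paths from the question, is to let a small sh wrapper strip the directory part before calling the script (basename is POSIX):
find /share1/test -name '*.dat' -exec sh -c './test.sh /staging/data/project/ "$(basename "$1")" projectZ' _ {} \;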

Unix find: list of files from stdin

I'm working in Linux & bash (or Cygwin & bash).
I have a huge--huge--directory structure, and I have to find a few needles in the haystack.
Specifically, I'm looking for these files (20 or so):
foo.c
bar.h
...
quux.txt
I know that they are in a subdirectory somewhere under ..
I know I can find any one of them with
find . -name foo.c -print, but this command takes a few minutes to execute.
How can I print the names of these files with their full directory name? I don't want to execute 20 separate finds--it will take too long.
Can I give find the list of files from stdin? From a file? Is there a different command that does what I want?
Do I have to first assemble a command line for find with -o using a loop or something?
If your directory structure is huge but not changing frequently, it is good to run
cd /to/root/of/the/files
find . -type f -print > ../LIST_OF_FILES.txt # the next one is sometimes handy too
find . -type d -print > ../LIST_OF_DIRS.txt
After that you can find anything really fast (with grep, sed, etc.) and update the file lists only when the tree changes. (It is a simplified replacement for when you don't have locate.)
So,
grep '/foo.c$' LIST_OF_FILES.txt # list all foo.c in the tree
When you want to find a list of files, you can try the following:
fgrep -f wanted_file_list.txt < LIST_OF_FILES.txt
or directly with the find command
find . -type f -print | fgrep -f wanted_file_list.txt
The -f option for fgrep means 'read the patterns from the given file', so you can easily grep the input for multiple patterns...
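For example, wanted_file_list.txt would simply hold the names from the question, one per line (fgrep treats each line as a fixed string):
foo.c
bar.h
quux.txt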
You shouldn't need to run find twenty times.
You can construct a single command with multiple filename specifiers:
find . \( -name 'file1' -o -name 'file2' -o -name 'file3' \) -exec echo {} \;
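If you would rather assemble that expression from a list (the loop you asked about), a bash sketch, assuming the names live in a file called names.txt, one per line:
# build \( -name 'n1' -o -name 'n2' ... \) from names.txt
args=()
while IFS= read -r name; do
  args+=(-o -name "$name")
done < names.txt
find . \( "${args[@]:1}" \) -print   # "${args[@]:1}" drops the leading -o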
Is the locate(1) command an acceptable answer? It builds an index nightly, and you can query the index quite quickly:
$ time locate id_rsa
/home/sarnold/.ssh/id_rsa
/home/sarnold/.ssh/id_rsa.pub
real 0m0.779s
user 0m0.760s
sys 0m0.010s
I gave up executing a similar find command in my home directory at 36 seconds. :)
If the nightly run doesn't work for you, you could run the updatedb(8) program by hand once before running locate(1) queries. /etc/updatedb.conf (updatedb.conf(5)) lets you select specific directories or filesystem types to include or exclude.
Yes, assemble your command line.
Here's a way to process a list of files from stdin and assemble your (FreeBSD) find command to use extended regular expression matching (n1|n2|n3).
For GNU find you may have to use one of the following options to enable extended regular expression matching:
-regextype posix-egrep
-regextype posix-extended
echo '
foo\\.c
bar\\.h
quux\\.txt
' | xargs bash -c '
IFS="|";
find -E "$PWD" -type f -regex "^.*/($*)$" -print
echo find -E "$PWD" -type f -regex "^.*/($*)$" -print
' arg0
# note: "$*" uses the first character of the IFS variable as array item delimiter
(
IFS='|'
set -- 1 2 3 4 5
echo "$*" # 1|2|3|4|5
)
