awk for many compressed files - shell

The following command calculates the GC content for each fastq file identified with the find command. Briefly, a fastq file stores a large number of data points as records of 4 lines each, where the second line, the only one I'm interested in, contains just the sequence (ATGC). For testing, (identical) example files can be found here.
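For illustration, here is a hypothetical 4-line fastq record (read name, sequence, separator, quality string); for this single read the command below would report 2/7 ≈ 0.29, i.e. G+C over total sequence length:
@read1
GATTACA
+
IIIIIII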
find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need to keep the filename-matching option currently used with find.

find's '-exec' can be used to invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. There are two possible paths: construct a shell command, OR use the more flexible xargs.
# Using the 'shell -c' command
find . -iname '*.fastq.gz' -exec sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" \;
# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat {})" \;
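A third sketch, if you would rather not splice {} into the quoted shell command (which can misbehave with unusual filenames): pass the filename to sh as a positional parameter instead. The awk program is unchanged, only the quoting differs:
find . -iname '*.fastq.gz' -exec sh -c \
'zcat "$1" | awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0)}END{print N2/N1}"' sh {} \;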
See the many references to find/xargs on Stack Overflow.

If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file called, say, script.awk, like this:
(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}
Now you can simply process them all in parallel with GNU Parallel:
find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
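If you also want to see which file each ratio belongs to, GNU Parallel's --tag option prefixes every output line with the corresponding input argument:
find . -iname \*fastq.gz -print0 | parallel -0 --tag gzcat {} \| awk -f ./script.awk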

Related

How to use GNU parallel with find -exec?

I want to unzip multiple files.
Using this answer, I found the following command.
find -name '*.zip' -exec sh -c 'unzip -d "${1%.*}" "$1"' _ {} \;
How do I use GNU Parallel with the above command to unzip multiple files?
Edit 1:
As per questions by user Mark Setchell
Where are the files?
All the zip files are generally in a single directory.
But, as I understand it, the command finds the files recursively or non-recursively according to the depth options given to the find command.
How are the files named?
abcd_sdfa_fasfasd_dasd14.zip
How do you normally unzip a single one?
unzip abcd_sdfa_fasfasd_dasd14.zip -d abcd_sdfa_fasfasd_dasd14
You can first use find with the -print0 option to NUL-delimit the filenames, then read them back in GNU Parallel with the NUL delimiter (-0) and apply unzip:
find . -type f -name '*.zip' -print0 | parallel -0 unzip -d {/.} {}
The {/.} part applies string substitution to get the basename of the file with its extension removed, as described in the GNU Parallel documentation - see "7. Get basename, and remove last ({.}) or any ({:}) extension". You can further set the number of parallel jobs with the -j flag, e.g. -j8 or -j64.
You could also use the + variant of -exec. It starts parallel after find has completed, but still allows you to use -print/-printf/-ls/etc. and possibly abort the find before executing the command:
find . -type f -name '*.zip' -ls -exec parallel unzip -d {.} ::: {} \+
Note that GNU Parallel also uses {} to specify the input arguments. In this case, however, we use {.} to strip the extension, as shown in your example. You can override GNU Parallel's replacement string {} with -I (for example, -I## lets you use ## instead of {}).
I recommend using GNU Parallel's --dry-run flag or prepending unzip with an echo to test the command first and see what would be executed.
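For example, with the -exec ... + form above, something along these lines should print the generated unzip commands without running them:
find . -type f -name '*.zip' -exec parallel --dry-run unzip -d {.} ::: {} +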

How to convert a find command to instead use grep to filter and then `exec` commands on output?

I have a scenario where I need to execute a series of commands on each file that's found. This normally would work great, except I have over 100 files and folders to exclude from find's results for execution. This becomes unwieldy and non-executable from the shell directly. It seems like it would be optimal to use an "exclusion file" similar to how tar or grep allows for such files.
Since find does not accept a file for exclusion, but grep does, I want to know how the following can be converted into a command that replaces find's exclusion (-prune) and -exec with grep filtering against an exclusion file (grep -v -f excludefile), and then executes the same series of commands on the result, like the current command does:
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f \
-exec sh -c "( cmd -with_args 1 '{}'; cmd -args2 '{}'; cmd3 '{}') \
| cmd4 | cmd5 | cmd6; cmd7 '{}'" \; \
> output
As a side note (not critical), I've read that if you don't use exec this process becomes much less efficient and this process is already consuming over 100 minutes to execute each time that it's run, so I don't want to slow it down any more than is necessary.
The best way I can think of to fulfill your scenario is to split the one-liner into two lines and introduce xargs with parallel execution (-P):
find $IN_PATH -regextype posix-extended \
-regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
-o -type f -print > /tmp/full_file_list
cat /tmp/full_file_list | grep -v -f excludefile | xargs -d '\n' -n 1 -P <nr_procs> sh -c 'command here "$1"' sh > output
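For reference, excludefile here is an ordinary text file with one pattern per line; grep -f treats each line as a regular expression matched against the listed paths, so literal dots should be escaped. A purely hypothetical example:
/data/project/logs/
/data/project/cache/
\.tmp$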
See Bash script processing limited number of commands in parallel and Doing parallel processing in bash? to learn more about parallel processing in bash.
Finding files and running commands on them in a single one-liner compete for disk I/O; splitting the one-liner can speed the process up a little.
Hint: remember to add full_file_list/excludefile/output to your exclude rules, and always debug your command on a smaller directory to reduce waiting time.
Why not simply:
find . -type f |
grep -v -f excludefile |
xargs whatever
With respect to "this process is already consuming over 100 minutes to execute" - that's almost certainly a problem with whatever command line you wrote to replace whatever above, and we could probably help you improve it if you post a separate question.
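If filenames might contain newlines, a hedged variant of the same pipeline (assuming GNU grep, whose -z option switches to NUL-delimited records, and GNU xargs; cmd1/cmd2 stand in for your real commands):
find "$IN_PATH" -type f -print0 | grep -zv -f excludefile | xargs -0 -n 1 -P <nr_procs> sh -c 'cmd1 "$1"; cmd2 "$1"' sh > output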

When to use xargs when piping?

I am new to bash and I am trying to understand the use of xargs, which is still not clear to me. For example:
history | grep ls
Here I am searching for the command ls in my history. In this command, I did not use xargs and it worked fine.
find /etc -name "*.txt" | xargs ls -l
In this one, I had to use xargs, but I still cannot understand the difference, and I am not able to decide correctly when to use xargs and when not.
xargs can be used when you need to take the output from one command and use it as an argument to another. In your first example, grep takes the data from standard input, rather than as an argument. So, xargs is not needed.
xargs takes data from standard input and executes a command. By default, the data is appended to the end of the command as an argument. It can be inserted anywhere however, using a placeholder for the input. The traditional placeholder is {}; using that, your example command might then be written as:
find /etc -name "*.txt" | xargs -I {} ls -l {}
If you have 3 text files in /etc you'll get a full directory listing of each. Of course, you could just as easily have written ls -l /etc/*.txt and saved the trouble.
Another example lets you rename those files, and requires the placeholder {} to be used twice.
find /etc -name "*.txt" | xargs -I {} mv {} {}.bak
These are both bad examples, and will break as soon as you have a filename containing whitespace. You can work around that by telling find to separate filenames with a null character.
find /etc -name "*.txt" -print0 | xargs -0 -I {} mv {} {}.bak
My personal opinion is that there are almost always alternatives to using xargs (such as the -exec argument to find) and you will be better served by learning those.
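For instance, the rename above can be done with -exec alone; a sketch using a small sh -c wrapper so the filename can be reused safely, whitespace and all:
find /etc -name "*.txt" -exec sh -c 'mv "$1" "$1.bak"' sh {} \;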
When you use piping without xargs, the actual data is fed into the next command. On the other hand, when using piping with xargs, the actual data is viewed as a parameter to the next command. To give a concrete example, say you have a folder with a.txt and b.txt. a.txt contains just a single line 'hello world!', and b.txt is just empty.
If you do
ls | grep txt
you would end up getting the output:
a.txt
b.txt
Yet, if you do
ls | xargs grep txt
you would get nothing since neither file a.txt nor b.txt contains the word txt.
If the command is
ls | xargs grep hello
you would get:
hello world!
That's because with xargs, the two filenames given by ls are passed to grep as arguments, rather than the actual content.
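In other words, since a.txt and b.txt are the only entries listed, ls | xargs grep hello is roughly equivalent to:
grep hello a.txt b.txt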
Short answer: Avoid xargs for now. Return to xargs when you have written dozens or hundreds of scripts.
Commands can get their input from parameters (like rm bad_example) or from stdin (not just the y answering the prompt after rm -i is_this_bad_too, but also read answer). Other commands, like grep and sed, will look for parameters, and when the parameters don't name the input, they switch to reading stdin.
Your grep example works fine reading from stdin, nothing special needed.
Your ls needs the output of find as a parameter. xargs is just one way to turn things around. Use man xargs for more about xargs. Alternatives:
find /etc -name "*.txt" -exec ls -l {} \;
find /etc -name "*.txt" -ls
ls -l $(find /etc -name "*.txt" )
ls /etc/*.txt
First try to see which of these commands is best when you have a nasty filename with spaces.txt in /etc.
xargs(1) is dangerous (broken, exploitable, etc.) when reading non-NUL-delimited input.
If you're working with filenames, use find's -exec [command] {} + instead.
If you can get NUL-delimited output, use xargs -0.
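For example, applied to the earlier find/ls case, either of these handles awkward filenames safely (the second assumes GNU-style -print0/-0 support):
find /etc -name "*.txt" -exec ls -l {} +
find /etc -name "*.txt" -print0 | xargs -0 ls -l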
GNU Parallel can do the same as xargs, but does not have the broken and exploitable "features".
You can learn GNU Parallel by looking at examples http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-xargs--n1.-Argument-appending and walking through the tutorial http://www.gnu.org/software/parallel/parallel_tutorial.html

xargs with multiple commands

In the current directory, I'd like to print the filename and contents in it.
I can print filenames or contents separately by
find . | grep "file_for_print" | xargs echo
find . | grep "file_for_print" | xargs cat
but what I want is printing them together like this:
file1
line1 inside file1
line2 inside file1
file2
line1 inside file2
line2 inside file2
I read xargs with multiple commands as argument
and tried
find . | grep "file_for_print" | xargs -I % sh -c 'echo; cat;'
but it doesn't work.
I'm not familiar with xargs, so I don't know what exactly "-I % sh -c" means.
could anyone help me? thank you!
find . | grep "file_for_print" | xargs -I % sh -c 'echo %; cat %;'
(The OP's version was missing the %s.)
To start with, there is virtually no difference between:
find . | grep "file_for_print" | xargs echo
and
find . -name "file_for_print*"
except that the second one will not match filenames like this_is_not_the_file_for_print, and it will print the filenames one per line. It will also be a lot faster, because it doesn't need to generate and print the entire recursive directory structure just in order for grep to toss most of it away.
find . -name "file_for_print*"
is actually exactly the same as
find . -name "file_for_print*" -print
where the -print action prints each matched filename followed by a newline. If you don't provide find with any actions, it assumes you wanted -print. But it has more tricks up its sleeve than that. For example:
find . -name "file_for_print*" -exec cat {} \;
The -exec action causes find to execute the following command, up to the \;, replacing {} with each matching file name.
find does not limit itself to a single action. You can tell it to do however many you want. So:
find . -name "file_for_print*" -print -exec cat {} \;
will probably do pretty well what you want.
For lots more information on this very useful utility, type:
man find
or
info find
and read all about it.
Since it's not been said yet: -I % tells xargs to replace '%' in the command you give it with each argument. The sh -c '...' just means run the commands '...' in a new shell.
So
xargs -I % sh -c 'echo %; cat %;'
will run echo [filename] followed by cat [filename] for every filename given to xargs. The echo and cat commands will be executed inside a different shell process but this usually doesn't matter. Your version didn't work because it was missing the % signs inside the command passed to xargs.
For what it's worth I would use this command to achieve the same thing:
find -name "*file_for_print*" | parallel 'echo {}; cat {};'
because it's simpler (parallel automatically uses {} as the substitution character and can take multiple commands by default).
In this specific case, each command is executed for each individual file anyway, so there's no advantage in using xargs. You may just append -exec twice to your 'find':
find . -name "*file_for_print*" -exec echo {} \; -exec cat {} \;
In this case, -print could be used instead of the first echo, as pointed out by rici, but this example shows the ability to execute two arbitrary commands with a single find.
What about writing your own bash function?
#!/bin/bash
myFunction() {
    while read -r file; do
        echo "$file"
        cat "$file"
    done
}
find . -name "file_for_print*" | myFunction

help using xargs to pass multiple filenames to shell script

Can someone show me to use xargs properly? Or if not xargs, what unix command should I use?
I basically want to pass more than one file name as the <localfile> input parameter to my shell script.
For example:
1. use `find` to get list of files
2. use each filename as input to shell script
Usage of shell script:
test.sh <localdir> <localfile> <projectname>
My attempt, but not working:
find /share1/test -name '*.dat' | xargs ./test.sh /staging/data/project/ '{}' projectZ \;
Edit:
After some input from everybody and trying -exec, I am finding that my <localfile> filename input from find is giving me the full path, /path/filename.dat, instead of filename.dat. Is there a way to get the basename from find? I think this will have to be a separate question.
I'd just use find -exec here:
% find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ {} projectZ \;
This will invoke ./test.sh with your three arguments once for each .dat file under /share1/test.
xargs would pack up all of these filenames and pass them into one invocation of ./test.sh, which doesn't look like your desired behaviour.
If you want to execute the shell script for each file (as opposed to execute in only once on the whole list of files), you may want to use find -exec:
find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ '{}' projectZ \;
Remember:
find -exec is for when you want to run a command on one file, for each file.
xargs instead runs a command only once, using all the files as arguments.
xargs stuffs as many files as it can onto the end of the command line.
Do you want to execute the script on one file at a time or on all files at once? For one at a time, use find's -exec, which it looks like you're already using the syntax for, and which xargs doesn't use:
find /share1/test -name '*.dat' -exec ./test.sh /staging/data/project/ '{}' projectZ \;
xargs does not have to combine arguments; that's just the default behavior. The following properly uses xargs to execute the commands, as intended:
find /share1/test -name '*.dat' -print0 | xargs -0 -I'{}' ./test.sh /staging/data/project/ '{}' projectZ
When piping find to xargs, NUL termination is usually preferred, so I recommend appending the -print0 option to find. After that you must add -0 to xargs, so it expects NUL-terminated arguments. This ensures proper handling of filenames. It's not POSIX proper, but it is considered well supported. You can always drop the NUL-terminating options if your commands lack support.
Remember that while find's purpose is finding files, xargs is much more generic. I often use xargs to process non-filename arguments.
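Regarding the edit about the basename: one sketch is a small sh -c wrapper around basename (this assumes test.sh really only needs the bare filename, for example because it joins it with <localdir> itself); with GNU find you could get the same effect from -printf '%f\n':
find /share1/test -name '*.dat' -exec sh -c './test.sh /staging/data/project/ "$(basename "$1")" projectZ' sh {} \;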

Resources