Direct xargs output to file, using multiple arguments - xargs

Given a set of files, I need to pass 2 arguments and direct the output to a newly named file, based on either input filename. The input list follows a defined format: S1_R1.txt, S1_R2.txt; S2_R1.txt, S2_R2.txt; S3_R1.txt, S3_R2.txt, etc. The first numeric is incremented by 1 and each has an R1 and corresponding R2.
The output file is a combination of each S# pair and should be named accordingly, e.g. S1_interleave.txt, S2_interleave.txt, S3_interleave.txt, etc.
The following works to print to screen
find S*R*.txt -maxdepth 0 | xargs -n 2 python interleave.py
How can I utilize the input filenames for use as output?

Just to make it a bit more fun: let us assume the files are gzipped (as paired-end reads often are) and you want the result gzipped, too:
parallel --xapply 'python interleave.py <(zcat {1}) <(zcat {2}) |gzip > {=1 s/_R1.txt.gz/_interleave.txt.gz/=}' ::: *R1.txt.gz ::: *R2.txt.gz
You need the pre-release of GNU Parallel to do this: http://git.savannah.gnu.org/cgit/parallel.git/snapshot/parallel-1a1c0ebe0f79c0ada18527366b1eabeccd18bdf5.tar.gz (or wait for release 20140722).
As asked, it is even simpler (you still need the pre-release, though):
parallel --xapply 'python interleave.py {1} {2} > {=1 s/_R1.txt/_interleave.txt/=}' ::: *R1.txt ::: *R2.txt
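If you would rather not depend on the pre-release, a plain bash loop gives the same per-pair renaming (a sketch that assumes, as in the question, that interleave.py writes its result to stdout):
for r1 in S*_R1.txt; do
    r2=${r1/_R1/_R2}                                   # matching second read
    python interleave.py "$r1" "$r2" > "${r1/_R1.txt/_interleave.txt}"
done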

Related

How do you pass two variables over a pipe in bash?

I want to run a program that accepts two inputs; however, the inputs must be unzipped first. The problem is that the files are so large that unzipping them to disk is not a good solution, so I need to stream the unzipped data straight into the program instead. For example:
gunzip myfile.gz | runprog > hurray.txt
That's a perfectly fine thing, but the program I want to run requires two inputs, both of which must be unzipped. So
gunzip file1.gz
gunzip file2.gz
runprog -1 file1_unzipped -2 file2_unzipped
What I need is some way to unzip the files and pass them over a pipe; I imagine something like this:
gunzip f1.gz, f2.gz | runprog -1 f1_input -2 f2_input
Is this doable? Is there any way to unzip two files and pass the output across the pipe?
GNU gunzip has a --stdout option (aka -c) for just this purpose, and there's also zcat, as @slim pointed out. The resulting output will be concatenated into a single stream, though, because that's how pipes work. One way you can get around this is to create two input streams and handle them separately in runprog. For example, here's how you would make the first file input stream 8, and the second input stream 9:
runprog 8< <(zcat f1.gz) 9< <(zcat f2.gz)
Another alternative is to pass the two streams as filename arguments, using process substitution:
runprog <(zcat f1.gz) <(zcat f2.gz)
The two arguments can now be treated just like two file arguments.
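If you want to see what the program actually receives, process substitution simply expands to readable paths on /dev/fd (the exact descriptor numbers vary), so the two arguments behave like ordinary filenames:
echo <(zcat f1.gz) <(zcat f2.gz)
# prints something like: /dev/fd/63 /dev/fd/62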
Alternatively, your program would need to understand that there are two inputs in the stream, with a delimiter placed between the two files. When your program reaches the delimiter, it can split the input into two parts, because the pipe delivers both files as a single concatenated stream.
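For what it's worth, that delimiter idea might look like the following sketch; the marker string is made up here, and runprog would have to know to split on it:
{ zcat f1.gz; echo '===FILE-BOUNDARY==='; zcat f2.gz; } | runprog > hurray.txt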

Merging a large number of files into one

I have around 30K files and I want to merge them into one. I used cat but I am getting this error:
cat *.n3 > merged.n3
-bash: /usr/bin/cat: Argument list too long
How can I increase the limit for the cat command? Or is there an iterative method I could use to merge such a large number of files?
Here's a safe way to do it, without the need for find:
printf '%s\0' *.n3 | xargs -0 cat > merged.txt
(I've also chosen merged.txt as the output file, as @MichaelDautermann soundly advises; rename to merged.n3 afterward.)
Note: The reason this works is:
printf is a bash shell builtin, whose command line is not subject to the length limitation of command lines passed to external executables.
xargs is smart about partitioning the input arguments (passed via a pipe and thus also not subject to the command-line length limit) into multiple invocations so as to avoid the length limit; in other words: xargs makes as few calls as possible without running into the limit.
Using \0 as the delimiter paired with xargs' -0 option ensures that all filenames - even those with, e.g., embedded spaces or even newlines - are passed through as-is.
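You can watch the partitioning at work by capping the number of arguments per invocation and counting how many calls xargs makes (the cap of 1000 is arbitrary, and echo stands in for cat here):
printf '%s\0' *.n3 | xargs -0 -n 1000 echo | wc -l    # number of cat invocations that would be needed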
The traditional way
> merged.n3
for file in *.n3
do
    [ "$file" = merged.n3 ] && continue   # skip the output file itself
    cat "$file" >> merged.n3
done
Try using "find":
find . -name \*.n3 -exec cat {} > merged.txt \;
This "finds" all the files with the "n3" extension in your directory and then passes each result to the "cat" command.
And I set the output file name to be "merged.txt", which you can rename to "merged.n3" once you're done, since you likely do not want your new merged file to be found and appended into itself.
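A small refinement, if your find supports the + terminator (GNU and BSD find both do): batching many filenames into each cat call avoids spawning one cat per file:
find . -name '*.n3' -exec cat {} + > merged.txt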

Bash: Trouble when using cat for a lot of files

For each set of parameters, I try to merge 400 data files of 13 lines each into one large file, like this:
folders=(*/)
for folder in ${folders[@]}; do
    #FIND SIGMA
    sig0=${folder#*sig}
    sig=${sig0%amax*}
    #MERGE
    cat sig${sig}amax0.6_incr0.1/tau*.dat > merged_sigma${sig}amax0.6_incr0.1.dat
done
It's easy math that a merged file should have 5200 lines, but it doesn't.
Instead, each merged file has a different number of lines, varying between about 3100 and 5000.
I've checked that all the tau*.dat files exist, are not empty and have exactly 13 lines.
There is no problem with missing line breaks at the ends of the small files. In the merged file all lines have the same length. Just some - and it seems to me in a random manner - are missing.
I've read somewhere that the total number of characters in all the file names together mustn't exceed 32767 characters. However, even taking into account that the file names are not tau*.dat but sig0.10amax0.1_incr0.1/tau27.0_sigma0.10__-0.6-0.6_0-0_0.1.dat, I still end up with no more than about 25000 characters per cat command.
What I would do:
folders=(*/)
for folder in "${folders[@]}"; do
    #FIND SIGMA
    sig0=${folder#*sig}
    sig=${sig0%amax*}
    #MERGE
    for file in sig${sig}amax0.6_incr0.1/tau*.dat; do
        cat "$file" >> "merged_sigma${sig}amax0.6_incr0.1.dat"
    done
done
Note: This answer explains how to avoid the problem of the command line getting too long when using globbing; however, the command-line length limit appears not to be the source of the OP's problem.
To reliably process globs that expand to argument lists of arbitrary size - without worrying about running into the command-line length limit - you can use the following approach:
printf '%s\0' * | xargs -0 ...
This is based on the following assumptions:
printf is (also) a shell builtin and therefore not subject to the command-line length limit (the limit reported by getconf ARG_MAX, minus the size of the environment; see http://www.in-ulm.de/~mascheck/various/argmax/) - true at least for bash (run type printf to verify that printf is a builtin in your shell).
xargs supports the -0 option to accept null char.-separated input; note that a core feature of xargs is to respect the command-line length limit and partition the argument list into multiple invocations of the specified command, if necessary.
Caveat: -0 is a nonstandard (not POSIX-compliant) option, but it's supported on both Linux and BSD-like platforms such as OSX.
(Only if you know that your filenames contain no spaces (and no newlines) and do not start with - can you use the simplified form echo * | xargs ..., similarly assuming that echo is a shell builtin.)
If we apply this approach to the OP's code, we get:
printf '%s\0' sig${sig}amax0.6_incr0.1/tau*.dat |
xargs -0 cat > merged_sigma${sig}amax0.6_incr0.1.dat
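Independently of how the concatenation is done, a quick sanity check against the numbers in the question (400 files of 13 lines each, so 5200 lines expected) will show whether files are really being dropped; run it inside the loop, where sig is set:
ls sig${sig}amax0.6_incr0.1/tau*.dat | wc -l          # expect 400
wc -l < merged_sigma${sig}amax0.6_incr0.1.dat         # expect 5200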

Using GNU Parallel With Split

I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each) and then load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start to load one file per core. What I need is a way to tell split to print the file name to standard output each time it finishes writing a file, so I can pipe it to Parallel and have Parallel start loading each file as soon as split has finished writing it. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the man page recommends using --block rather than -N; this still splits the input at record separators (\n by default), e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
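To convince yourself that --block still cuts on line boundaries, feed it a known number of lines and count what each chunk receives; every per-chunk count is a whole number of lines and the counts sum to the input total (the block size here is just illustrative):
seq 200000 | parallel --pipe --block 256k wc -l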
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option:
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script that writes each chunk to a file and then starts carga_postgres.sh on it in the background:
#! /bin/sh
# $FILE is set by split --filter to the name of the current output chunk
cat > "$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv
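Keeping the output prefix from the question so the chunks land where carga_postgres.sh expects them, the full call would look something like this (assuming filter.sh is the script above and has been made executable):
chmod +x filter.sh
split -l 50000000 --filter=./filter.sh 2011.psv carga/2011_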

Change text in argument for xargs (or GNU Parallel)

I have a program that I can run in two ways: single-end or paired-end mode. Here's the syntax:
program <output-directory-name> <input1> [input2]
Where the output directory and at least one input is required. If I wanted to run this on three files, say, sample A, B, and C, I would use something like find with xargs or parallel:
user@host:~/single$ ls
sampleA.txt sampleB.txt sampleC.txt
user@host:~/single$ find . -name "sample*" | xargs -i echo program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
user@host:~/single$ find . -name "sample*" | parallel --dry-run program {}-out {}
program ./sampleA.txt-out ./sampleA.txt
program ./sampleB.txt-out ./sampleB.txt
program ./sampleC.txt-out ./sampleC.txt
But when I want to run the program in "paired-end" mode, I need to give it two inputs. These are related files, but they can't simply be concatenated - you have to run the program with both as inputs. Files are named sensibly, e.g., sampleA_1.txt and sampleA_2.txt.
I want to be able to create this easily on the command line with something like xargs (or preferably parallel):
user@host:~/paired$ ls
sampleA_1.txt sampleB_1.txt sampleC_1.txt
sampleA_2.txt sampleB_2.txt sampleC_2.txt
user@host:~/paired$ find . -name "sample*_1.txt" | sed/awk? | parallel ?
program ./sampleA-out ./sampleA_1.txt ./sampleA_2.txt
program ./sampleB-out ./sampleB_1.txt ./sampleB_2.txt
program ./sampleC-out ./sampleC_1.txt ./sampleC_2.txt
Ideally, the command would strip off the _1.txt to create the output directory name (sampleA-out, etc), but I really need to be able to take that argument and change the _1 to a _2 for the second input.
I know this is dead simple with a script - I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Thanks in advance.
I did this in Perl with a quick regular expression substitution. But I would love to be able to do this with a quick one-liner.
Perl has one-liners, too, just as sed and awk do. You can write:
find . -name "sample*_1.txt" | perl -pe 's/_1\.txt$//' | parallel program {}-out {}_1.txt {}_2.txt
(The -e flag means "the next argument is the program text"; the -p flag means "run the program in a loop: for each line of input, set $_ to that line, run the program, then print $_".)
With sed and xargs you could do something like this:
find . -name "sample*_1.txt" | sed -n 's/_1\..*$//;h;s/$/_out/p;g;s/$/_1.txt/p;g;s/$/_2.txt/p' | xargs -L 3 echo program
I.e., sed creates the three arguments and xargs -L 3 composes command lines with three arguments each.
Assuming you always have exactly two files in your directory for each pair and that find lists them in the right order (you can ensure this by piping the results of find through sort), xargs -L 2 might do the job. This tells xargs to place two consecutive incoming arguments on each command line it executes.
A shorter version:
parallel --xapply program {1.}.out {1} {2} :::: <(ls *_1.txt) <(ls *_2.txt)
but this only works if every _1.txt has a matching _2.txt and vice versa.
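If you would rather not rely on --xapply pairing the two lists, a plain bash loop that derives everything from the _1 file avoids the matching requirement entirely (a sketch using the question's naming scheme):
for r1 in sample*_1.txt; do
    r2=${r1%_1.txt}_2.txt
    program "${r1%_1.txt}-out" "$r1" "$r2"
done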
