How can you use multiple input streams for input parameters? - bash

I want to run a command line script that requires several parameters. Specifically:
perl prinseq-lite.pl -fastq file1.fq.gz -fastq2 file2.fq.gz \
  -out_good goodfile.out -out_bad badfile.out -log prin.log \
  -ns_max_n 5 ... more_params ...
The problem is that the files are zipped, and must be processed without first unzipping and storing them, because the unzipped file sizes are very large and this command will be run on a large number of files.
So what I need to do is unzip the input on the fly. Previously, user l0b0 suggested that multiple input streams might be a solution. I have tried the following, but I seem to be passing empty input streams, as the program claims the input files are empty.
perl prinseq-lite.pl -fastq <(zcat f1.gz) -fastq2 <(zcat f2.gz) ...
perl prinseq-lite.pl -fastq 1< <(zcat f1.gz) -fastq2 2< <(zcat f2.gz) ...
So what I need to do, in short, is provide unzipped input for multiple parameters to this program.
Can someone tell me the proper way to do this, and/or what I'm doing wrong with my current attempts? Thanks in advance for your input.

Well, I think the easiest might be to make named pipes for the output of gunzip, then use those named pipes in the command:
mkfifo file1.fq file2.fq file3.fq ...
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
gunzip -c file3.fq.gz > file3.fq &
Then call your program with those pipes as file names:
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq -fastq3 file3.fq ...
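If you script this, remember that the FIFOs persist as filesystem entries afterwards, so it is worth cleaning them up. A minimal sketch of the whole sequence for two inputs (assuming prinseq-lite.pl reads each input exactly once):
mkfifo file1.fq file2.fq
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq ... more_params ...
wait                      # let both background gunzip jobs finish
rm -f file1.fq file2.fq   # the named pipes are ordinary directory entries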

Related

How to redirect to separate files and to a combined file?

It's easy to redirect standard output and standard error to the same file or to separate files. What if I want to do both at the same time? That is, I'd like three files as output: standard output and standard error mixed together in order, plus standard output and standard error in separate files. Maybe something involving the "tee" command?
Thanks!
Following the ideas in the comments, use tee to write stdout and stderr each to its own file as well as to the combined file:
rm -f both.log
some-command 2> >(tee err.log >>both.log) | tee out.log >> both.log
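A quick way to check the behaviour, using a stand-in command that writes one line to each stream (note that the relative order of stdout and stderr inside both.log is not strictly guaranteed, since the two tee processes write concurrently):
rm -f both.log
{ echo out; echo err >&2; } 2> >(tee err.log >> both.log) | tee out.log >> both.log
sleep 1        # the >(...) process substitution finishes asynchronously
cat out.log    # out
cat err.log    # err
cat both.log   # both lines, in whichever order the writes landed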

Writing data to a zip file

I have a script which I am running in the Ubuntu terminal (bash). Currently I am appending the output of the script directly to a file using the command below:
./run.sh > a.txt
But for some input files run.sh may produce output which is very large uncompressed. Is it possible to write this output directly to a zip file, without going through an intermediate dump file?
I know it is possible in Java and Python, but I wanted a general way of doing it in bash so that I could keep run.sh the same even if the program it runs changes.
I have tried searching the web but haven't come across anything useful.
In this case, a gzip file would be more appropriate. Unlike zip, which is an archive format, gzip is just a compressed data format and can easily be used in a pipe:
./run.sh | gzip > a.txt.gz
The resulting file can be uncompressed in place using the gunzip command (producing a file a.txt), viewed with zmore, or streamed with zcat, which lets you pipe the decompressed output through a filter without writing the whole decompressed file anywhere.
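For example, to scan the compressed output without touching the disk (grep and head here are just stand-ins for whatever filter you need):
zcat a.txt.gz | grep -i error | head -n 20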
The 'zip' format is for archiving. The 'zip' program can take an existing file and put a compressed version of it into an archive. For example:
./run.sh > a.txt
zip a.zip a.txt
However, your question asks specifically for a 'streaming' solution (given the file sizes). There are a few utilities whose formats are streaming-friendly: gz, bz2, and xz. Each excels on different kinds of data, but in many cases any of them will work.
./run.sh | gzip > a.txt.gz
./run.sh | bzip2 > a.txt.bz2
./run.sh | xz > a.txt.xz
If you are looking for widest compatibility, gzip is usually your friend.
In bash you can use process substitution.
zip -FI -r file.zip <(./run.sh)
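One caveat: the process substitution expands to a path like /dev/fd/63, and zip will most likely record the archive entry under that name (the -FI flag tells zip to read from FIFOs rather than skip them). You can check what was actually stored with:
unzip -l file.zip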

xargs -a [file] mv -t [new-directory] gives me mv: cannot stat `filename*': No such file or directory error

I have been trying to run this command (that I have run before in a different directory), and everything I've read on the message boards has not solved my unknown issue.
Of note:
1) the files exist in this directory
2) I have proper permissions to move these files around
3) I have run this exact line of code before and it has worked
4) I tried listing files with and without '*' to capture all the files (see below)
5) I also tried to list each file as 'Sample1', but that did not work
xargs -a [filename.txt] mv -t [new-directory]
I have file beginnings (I have ~5 file for each beginning), and I want to move all the files associated with that beginning.
Example: Sample1.bam Sample1.sorted.bam, etc
The lines in the file are listed as such:
Sample1*
Sample2*
Sample3* ...etc.
What am I doing incorrectly and how can I fix it?
TIA!
When you execute a command using xargs, the arguments are passed directly to the called program ('mv' in your case). Wildcard patterns in the input are not expanded: 'Sample1*' is passed as-is to mv, which then issues an error message about not having a file named 'Sample1*'.
To get file name expansion, you want to use the shell. One way to handle this situation is
xargs -a FILENAME.TXT -I__ sh -c "mv -t NEW-FOLDER -- __"
Security note: the code provides some protection against command-line injection (e.g., a file name starting with '-'). However, other attacks are still possible. A safer version is
cat FILENAME.txt | grep '^[A-Za-z0-9][A-Za-z0-9._-]*$' | xargs -I__ sh -c "mv -t NEW-FOLDER -- __"
which limits the input to names made of alphanumeric characters (plus '.', '_', and '-'). The grep pattern can be extended as needed; for the wildcard lines in the question you would also need to allow '*' in the character class.
With GNU Parallel you would do something like:
cat FILENAME.txt | parallel mv {} NEW-FOLDER
One of the benefits of GNU Parallel is that it deals correctly with file names like:
My brother's 12" records cost > $1000.txt
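If you'd rather avoid extra tools entirely, a plain read loop also works, since leaving $pattern unquoted lets the shell expand the glob (a sketch, assuming the patterns contain no whitespace that must be preserved):
while IFS= read -r pattern; do
    mv -t NEW-FOLDER -- $pattern   # unquoted on purpose: let the glob expand
done < FILENAME.txt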

How to download URLs in a csv and name outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) csv (as fast/simultaneously as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly two approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it will process them in parallel for maximum speed. It does have a --force-sequential option to force downloading in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into smaller files and run a script like the following to process each:
#!/bin/bash
INPUT=$1
while IFS=, read -r serino url
do
    aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means aria2c is restarted for every line, which seems to cost time and lower the speed.
One can run the script multiple times to get 'shell-level' parallelism, but that doesn't seem to be the best way.
Any suggestions?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=<FILE>
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
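If your csv might contain such URLs, one option is to filter them out first. A rough prefilter (it simply keeps lines ending in a dot followed by alphanumerics, which is an assumption about what counts as an extension):
grep -E '\.[A-Za-z0-9]+$' file.csv | sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' | aria2c -i -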
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells xargs to run as many commands in parallel as possible, rather than one per core.
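If you prefer not to round-trip through tr, a plain read loop over the csv achieves the same parallelism by backgrounding each download (a sketch; it launches every download at once, so for a very large csv you would want to throttle it):
mkdir -p outputs
while IFS=, read -r id url; do
    curl -s "$url" -o "outputs/$id.jpg" &
done < file.csv
wait    # block until all background downloads finish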

How do you pass two variables over a pipe in bash?

I want to run a program that accepts two inputs, but the inputs must be unzipped first. The problem is that the files are so large that unzipping them is not a good solution, so I need to unzip just the input stream. For example:
gunzip myfile.gz | runprog > hurray.txt
That's a perfectly fine thing, but the program I want to run requires two inputs, both of which must be unzipped. So
gunzip file1.gz
gunzip file2.gz
runprog -1 file1_unzipped -2 file2_unzipped
What I need is some way to unzip the files and pass them over a pipe, I imagine something like this:
gunzip f1.gz, f2.gz | runprog -1 f1_input -2 f2_input
Is this doable? Is there any way to unzip two files and pass the output across the pipe?
GNU gunzip has a --stdout option (aka -c) for just this purpose, and there's also zcat, as @slim pointed out. The resulting output will be concatenated into a single stream, though, because that's how pipes work. One way to get around this is to create two input streams and handle them separately in runprog. For example, here's how you would make the first file input stream 8 and the second input stream 9:
runprog 8< <(zcat f1.gz) 9< <(zcat f2.gz)
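To illustrate what runprog sees, here is a hypothetical stand-in script that reads the two inherited descriptors back (the numbers 8 and 9 match the redirections above):
#!/bin/bash
# readtwo.sh - a stand-in for runprog; print both inherited streams
echo "=== stream 8 ==="
cat <&8
echo "=== stream 9 ==="
cat <&9
Invoked as: ./readtwo.sh 8< <(zcat f1.gz) 9< <(zcat f2.gz)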
Another alternative is to pass two file descriptors as parameters to the command:
runprog <(zcat f1.gz) <(zcat f2.gz)
The two arguments can now be treated just like two file arguments.
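You can see the mechanism with cat standing in for runprog: each substitution expands to a /dev/fd path that the program opens like an ordinary file:
cat <(echo first) <(echo second)     # prints "first" then "second"
echo <(echo first) <(echo second)    # prints something like /dev/fd/63 /dev/fd/62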
If your program insists on reading a single stream, another option is to concatenate both files with a known delimiter between them and have the program split its input at that delimiter, since a single pipe delivers both inputs as one undifferentiated stream.
