I want to run a program that accepts two inputs; however, the inputs must be unzipped first. The problem is that the files are so large that unzipping them to disk is not a good solution, so I need to unzip them straight into the program's input. For example:
gunzip myfile.gz | runprog > hurray.txt
That's a perfectly fine thing, but the program I want to run requires two inputs, both of which must be unzipped. So currently I do:
gunzip file1.gz
gunzip file2.gz
runprog -1 file1_unzipped -2 file2_unzipped
What I need is some way to unzip the files and pass them over a pipe; I imagine something like this:
gunzip f1.gz, f2.gz | runprog -1 f1_input -2 f2_input
Is this doable? Is there any way to unzip two files and pass the output across the pipe?
GNU gunzip has a --stdout option (aka -c) for just this purpose, and there's also zcat, as @slim pointed out. The resulting output will be concatenated into a single stream, though, because that's how pipes work. One way around this is to create two input streams and handle them separately in runprog. For example, here's how you would make the first file input stream 8 and the second input stream 9:
runprog 8< <(zcat f1.gz) 9< <(zcat f2.gz)
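If runprog were a shell script, a minimal sketch of reading those inherited descriptors might look like this (runprog itself is hypothetical, so this only illustrates the mechanism):
#!/bin/bash
# read the first stream from descriptor 8, then the second from descriptor 9
while IFS= read -r -u 8 line; do
    printf 'from f1: %s\n' "$line"
done
while IFS= read -r -u 9 line; do
    printf 'from f2: %s\n' "$line"
done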
Another alternative is to pass two file descriptors as parameters to the command:
runprog <(zcat f1.gz) <(zcat f2.gz)
The two arguments can now be treated just like two file arguments.
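Each process substitution expands to a path such as /dev/fd/63, so runprog only needs to open its arguments as ordinary files. A minimal hypothetical sketch:
#!/bin/bash
# $1 and $2 are the /dev/fd/NN paths produced by process substitution;
# paste reads them side by side, just like two regular files
paste "$1" "$2"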
Your program would need to understand that there are two inputs in the stream, so you should put a delimiter between the two files. When your program sees the delimiter, it can split the input into two parts, since the pipe may deliver both input files in one buffer. A sketch of the idea follows.
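For example, assuming runprog were modified to split its standard input at a NUL byte (an illustration only; runprog does not support this out of the box):
# send both files down one pipe, separated by a single NUL byte
{ zcat f1.gz; printf '\0'; zcat f2.gz; } | runprog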
Related
I want to run a command line script that requires several parameters. Specifically:
perl prinseq-lite.pl -fastq file1.fq.gz -fastq2 file2.fq.gz \
    -out_good goodfile.out -out_bad badfile.out -log prin.log \
    -ns_max_n 5 ... more_params ...
The problem is that the files are zipped, and must be processed without first unzipping and storing them, because the unzipped file sizes are very large and this command will be run on a large number of files.
So what I need to do is unzip the input on the fly. Previously, user l0b0 suggested that multiple input streams might be a solution. I have tried the following, but I seem to be passing an empty input stream here, as the program claims the input files are empty.
perl prinseq-lite.pl -fastq <(zcat f1.gz) -fastq2 <(zcat f2.gz) ...
perl prinseq-lite.pl -fastq 1< <(zcat f1.gz) -fastq2 2< <(zcat f2.gz) ...
So what I need to do, in short, is provide unzipped input for multiple parameters to this program.
Can someone tell me the proper way to do this, and/or what I'm doing wrong with my current attempts? Thanks in advance for your input.
Well, I think the easiest might be to make named pipes for the output of gunzip, then use those named pipes in the command:
mkfifo file1.fq file2.fq file3.fq ...
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
gunzip -c file3.fq.gz > file3.fq &
Then call your program with those pipes as file names:
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq -fastq3 file3.fq ...
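Putting it together, a hedged sketch of a whole session, including cleanup of the pipes afterwards (same file names as above):
mkfifo file1.fq file2.fq
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq
wait                    # let the background gunzip processes finish
rm file1.fq file2.fq    # named pipes are ordinary filesystem entries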
My mails are stored in a file under /var/mail/.
I want to write a bash script which will loop over each message.
I do not know how to split the messages apart within the file.
Thanks
Edit: typical case of "when all you have is a hammer, everything looks like a nail"; check out formail from the procmail mail-processing package, as suggested by twalberg.
For something quick and dirty, you should be able to use the following to separate the records with NUL bytes, which makes iterating over them easier:
sed '/^\n/N;s/^From/\x0&/' /var/mail/targetMailbox
For example, you can combine this command with split to break your mailbox into multiple files of manageable size:
sed '/^\n/N;s/^From/\x0&/' /var/mail/targetMailbox | split -l 100 -t'\0' - /tmp/mailbox
This command will split the mailbox into chunks of 100 messages, each written to its own file under /tmp/; check split's options if you're interested in other ways of splitting the file, as it supports quite a few.
A lot of (recent?) GNU tools have a -0 or -z option to make them handle NUL-separated records, for example:
-z for grep, head and tail
-0 for xargs
To iterate over them directly from bash, the easiest way is a while read loop using read's -d option to specify NUL as the separator, as sketched below.
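For example, a sketch of such a loop over the NUL-separated output of the sed command above (the || test catches the last message, which has no NUL after it):
sed '/^\n/N;s/^From/\x0&/' /var/mail/targetMailbox |
while IFS= read -r -d '' message || [ -n "$message" ]; do
    [ -n "$message" ] || continue    # skip the empty chunk before the first From
    # "$message" now holds one complete mail, headers and body
    printf 'got a message %d characters long\n' "${#message}"
done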
For a more permanent solution, you should look into an existing mbox parser.
From this question, I found the split utility, which takes a file and splits it into evenly sized chunks. By default, it outputs these chunks to new files, but I'd like to get it to output them to stdout, separated by a newline (or an arbitrary delimiter). Is this possible?
I tried:
cat testfile.txt | split -b 128 - /dev/stdout
which fails with the error split: /dev/stdoutaa: Permission denied.
Looking at the help text, it seems this tells split to use /dev/stdout as a prefix for the filename, not to write to /dev/stdout itself. It does not indicate any option to write directly to a single file with a delimiter. Is there a way I can trick split into doing this, or is there a different utility that accomplishes the behavior I want?
It's not clear exactly what you want to do, but perhaps the --filter option to split will help out:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
Maybe you can use that directly. For example, this will read a file 10 bytes at a time, passing each chunk through the tr command:
split -b 10 --filter "tr '[:lower:]' '[:upper:]'" afile
If you really want to emit a stream on stdout that has separators between chunks, you could do something like:
split -b 10 --filter 'dd 2> /dev/null; echo ---sep---' afile
If afile is a file in my current directory that looks like:
the quick brown fox jumped over the lazy dog.
Then the above command will result in:
the quick ---sep---
brown fox ---sep---
jumped ove---sep---
r the lazy---sep---
dog.
---sep---
From the info page:
`--filter=COMMAND'
With this option, rather than simply writing to each output file,
write through a pipe to the specified shell COMMAND for each
output file. COMMAND should use the $FILE environment variable,
which is set to a different output file name for each invocation
of the command.
split -b 128 --filter='cat ; echo ' inputfile
Here is one way of doing it. You will get each chunk of 128 characters into the variable var.
You may use your preferred delimiter when printing, or use the variable for further processing.
#!/bin/bash
cat yourTextFile | while read -r -n 128 var ; do
printf '\n%s' "$var"
done
You may use it at the command line as below:
while read -r -n 128 var ; do printf '\n%s' "$var" ; done < yourTextFile
No, the split utility will not write anything to standard output. Its POSIX specification says specifically that standard output is not used.
If you used split, you would need to concatenate the created files, inserting a delimiter in between them.
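A hedged sketch of that approach (chunk. is an arbitrary prefix):
split -b 128 testfile.txt chunk.
for f in chunk.*; do cat "$f"; printf -- '---sep---'; done
rm chunk.*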
If you just want to insert a delimiter every N th line, you may use GNU sed:
$ sed '0~3a\-----' file
This inserts a line containing ----- every 3rd line.
To divide the file into chunks, separated by newlines, and write to stdout, use fold:
cat yourfile.txt | fold -w 128
...will write to stdout in "chunks" of 128 chars.
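If you also want an arbitrary delimiter rather than a bare newline, one option is to post-process fold's output, e.g.:
fold -w 128 yourfile.txt | sed 's/$/---sep---/'
This appends ---sep--- to the end of each 128-character chunk.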
I have a .tar.gz file. It contains one 20GB-sized text file with 20.5 million lines. I cannot extract this file as a whole and save to disk. I must do either one of the following options:
Specify a number of lines for each file - say, 1 million - and get 21 files. This would be the preferred option.
Extract a part of that file based on line numbers, say from 1000001 to 2000000, to get a file with 1M lines. I would have to repeat this step 21 times with different parameters, which is very bad.
Is it possible at all?
This answer - bash: extract only part of tar.gz archive - describes a different problem.
To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:
tar Oxzf f.tar.gz | split -l1000000
The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:
tar Oxzf f.tar.gz | split -dl1000000 - prefix.
Under this approach:
The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split, which divides it up into pieces before writing the pieces to disk.
The .tar.gz file is read only once.
split, through its many options, has a great deal of flexibility.
Explanation
For the tar command:
O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.
x tells tar to extract the file (as opposed to, say, creating an archive).
z tells tar that the archive is in gzip format; on modern tars this is optional, as the compression is detected automatically.
f tells tar to use, as input, the file name specified.
For the split command:
-l tells split to split files limited by number of lines (as opposed to, say, bytes).
-d tells split to use numeric suffixes for the output files.
- tells split to get its input from stdin.
You can use the --to-stdout (or -O) option in tar to send the output to stdout.
Then use sed to specify which set of lines you want.
#!/bin/bash
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
    e=$(($l + $inc - 1))
    # note: each pass re-reads the whole archive, so this is slow
    tar -xzf myfile.tar.gz --to-stdout file-to-extract.txt |
        sed -n -e "$l,$e p" > part$p.txt
    l=$(($l + $inc))
    p=$(($p + 1))
done
Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.
#!/usr/bin/env bash
set -eu
filenum=1
chunksize=1000000
ii=0
while IFS= read -r line    # -r and IFS= keep each line intact
do
    if [ $ii -ge $chunksize ]
    then
        ii=0
        filenum=$(($filenum + 1))
        > out/file.$filenum
    fi
    printf '%s\n' "$line" >> out/file.$filenum
    ii=$(($ii + 1))
done
This will take any lines from stdin and create files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need is to feed the input to the above script, like this:
tar xfzO big.tar.gz | ./split.sh
This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
You can use:
sed -n 1,20p /Your/file/Path
Here you give the first line number and the last line number.
For example, it could look like this:
sed -n 1,20p /Your/file/Path >> file1
You can also keep the start and end line numbers in variables and use them accordingly, as sketched below.
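For example, a small sketch with the boundaries in variables (the path and output name are placeholders):
start=1000001
end=2000000
sed -n "${start},${end}p" /Your/file/Path > part2.txt
Appending ${end}q, as in sed -n "${start},${end}p;${end}q", makes sed quit as soon as the chunk has been printed instead of reading the rest of the file.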
Given a set of files, I need to pass 2 arguments and direct the output to a newly named file based on either input filename. The input list follows a defined format: S1_R1.txt, S1_R2.txt; S2_R1.txt, S2_R2.txt; S3_R1.txt, S3_R2.txt, etc. The sample number increments by 1, and each sample has an R1 file and a corresponding R2 file.
The output file is a combination of each S#-pair and should be named accordingly, e.g. S1_interleave.txt, S2_interleave.txt, S3_interleave.txt, etc.
The following works to print to the screen:
find S*R*.txt -maxdepth 0 | xargs -n 2 python interleave.py
How can I use the input filenames to build the output filenames?
Just to make it a bit more fun: let us assume the files are gzipped (as paired-end reads often are) and you want the result gzipped, too:
parallel --xapply 'python interleave.py <(zcat {1}) <(zcat {2}) |gzip > {=1 s/_R1.txt.gz/_interleave.txt.gz/=}' ::: *R1.txt.gz ::: *R2.txt.gz
You need the pre-release of GNU Parallel to do this http://git.savannah.gnu.org/cgit/parallel.git/snapshot/parallel-1a1c0ebe0f79c0ada18527366b1eabeccd18bdf5.tar.gz (or wait for the release 20140722).
As asked, it is even simpler (you still need the pre-release, though):
parallel --xapply 'python interleave.py {1} {2} > {=1 s/_R1.txt/_interleave.txt/=}' ::: *R1.txt ::: *R2.txt
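If you'd rather not depend on the pre-release, a plain bash loop over the question's naming scheme would also work (a sketch, without the gzip handling):
for r1 in S*_R1.txt; do
    r2=${r1/_R1/_R2}                      # matching second-read file
    out=${r1/_R1.txt/_interleave.txt}     # e.g. S1_interleave.txt
    python interleave.py "$r1" "$r2" > "$out"
done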