I am looking to solve the problem of writing a series of very large streams, concatenated, to stdout, and then reading those streams back from stdin and splitting them into their original parts. The limitation I face is that at no time can I create any temporary files on disk.
I tried to use the unxz --single-stream option, but this isn't having the effect I'm expecting.
To demonstrate what I am trying to achieve, I have two scripts:
user#localhost:~# cat test-source.sh
#!/bin/bash
echo "one" | xz
echo "two" | xz
echo "three" | xz
The first script above is then piped into the second script, which is intended to reverse the effect:
user#localhost:~# cat test-sink.sh
#!/bin/bash
unxz --single-stream
unxz --single-stream
unxz --single-stream
The above script is expected to output the following:
one
two
three
Instead I see the following:
user#localhost:~# ./test-source.sh | ./test-sink.sh
one
unxz: (stdin): File format not recognized
unxz: (stdin): File format not recognized
The xz above was just one option I tried; I am open to other suggestions. gzip wants to uncompress the whole stream at once, while I need to preserve the boundaries between the streams.
I understand that tar is no good, as it cannot read the data to be archived from stdin.
Is there any other tool out there that can be used to script this?
I don't know if this will solve your problem or not (since it would require installing some software, which, given the nature of this question, may not be an option), but you inspired me to hack together something that does exactly what you were describing:
https://github.com/larsks/muxdemux
You can iteratively produce an output stream from several chunks, as in:
echo "one" | xz | mux
echo "two" | xz | mux
echo "three" | xz | mux
And then pass that to a demux command on the other side to extract the individual components. E.g., a trivial example:
$ (
echo "one" | xz | mux
echo "two" | xz | mux
echo "three" | xz | mux
) | demux -v
INFO:demux:processing stream 0 to stream-0.out
INFO:demux:processing stream 1 to stream-1.out
INFO:demux:processing stream 2 to stream-2.out
This takes the input streams and produces three files in your current directory.
It does other things, too, like optionally adding a sha256 hash to each stream for data integrity verification.
As an alternative tool I came up with tarmux, which provides a multiplexer / demultiplexer written in C and based on the tar file format provided by libarchive.
https://github.com/minfrin/tarmux
The test scripts now look like this:
Little-Net:trunk minfrin$ cat ./test-source.sh
#!/bin/bash
echo "one" | tarmux
echo "two" | tarmux
echo "three" | tarmux
And this:
Little-Net:trunk minfrin$ cat ./test-sink.sh
#!/bin/bash
tardemux
tardemux
tardemux
The output of tardemux can be piped into other commands, and at no point does a file touch a disk.
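For example, a sink along these lines (just a sketch, assuming tardemux consumes exactly one stream per invocation as in the scripts above) hands each original stream to a different consumer, still without touching the disk:
#!/bin/bash
# each tardemux call peels off one stream and writes its contents to stdout
tardemux | tr 'a-z' 'A-Z'   # transform the first stream
tardemux | wc -c            # measure the second stream
tardemux | md5sum           # checksum the third stream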
Given your source script, if I run:
sh test-source.sh | unxz
I get as output:
one
two
three
That seems to be the behavior you're asking for. Your attempt at running unxz --single-stream several times doesn't work because the first unxz process consumes all of the input, even though it only extracts the first stream.
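You can see that consumption directly (a sketch; the exact leftover byte count depends on how much unxz buffers, but for streams this small it typically swallows everything, leaving nothing for the next reader):
./test-source.sh | { unxz --single-stream; wc -c; }
# prints "one", then wc -c reports how many bytes were left unread on stdin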
I'm running a simulation with lots of terminal output, which would exceed my disk space if I saved it all to a logfile (e.g. with "cmd > logfile"). I would like to follow the entire terminal output, but at the same time save specific data/lines/values from this output to a file.
1) Is there a way to do that in bash?
2) Or as an alternative: is it possible to save the logfile, extract the needed data, and then delete the processed lines to avoid generating a huge logfile?
If you want to save into logfile only the output containing mypattern, but you want to see all the output at the terminal, you could issue:
cmd 2>&1 | tee /dev/tty | grep 'mypattern' > logfile
I have also assumed that cmd may write to standard error as well as to standard output, which is why 2>&1 is added after cmd.
What criteria are you using to decide which lines to keep?
1
One common filter is to just store stderr.
cmdlist 2>logfile # stdout still to console
2
For a more sophisticated filter, if you have specific patterns you want to save to the log, you can use sed. Here's a simplistic example -
seq 1 100 | sed '/.*[37]$/w tmplog'
This will generate numbers from 1 to 100 and send them all to the console, but capture all numbers that end with 3 or 7 to tmplog. It can also accept more complex lists of commands to help you be more comprehensive -
seq 1 100 | sed '/.*[37]$/w 37.log
/^2/w 37.log'
c.f. the sed manual for more detailed breakdowns.
You probably also want error output, so it might be a good idea to save that too.
seq 1 100 2>errlog | sed '/.*[37]$/w patlog'
3
For a more complex space-saving plan, create a named pipe, and compress the log from that in a background process.
$: mkfifo transfer               # creates a named pipe (FIFO) on disk
$: gzip < transfer > log.gz &    # background gzip reads from the FIFO, compresses to log.gz
$: seq 1 100 2>&1 | tee transfer # tee writes one copy to stdout, one to the FIFO
This will show all the output as it comes, but also duplicate a copy to the named pipe; gzip will read it from the named pipe and compress it.
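Once the producer finishes, it is worth waiting for the background gzip and cleaning up the FIFO (a small housekeeping sketch in the same vein):
$: wait            # let the background gzip flush and exit
$: rm transfer     # the FIFO holds no data itself and can be removed
$: zcat log.gz     # inspect what was captured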
3b
You could replace the tee with the sed for double-dipping space reduction if required -
$: mkfifo transfer
$: gzip < transfer > log.gz &
$: seq 1 100 2>&1 | sed '/.*[37]$/w transfer'
I don't really recommend this, as you might filter out something you didn't realize you would need.
I'm trying to produce two files from one command: in one file I keep only a single entry, and in the other the complete list. This is the example:
I tried various commands, such as:
#!/bin/bash
for i in {1..4}
do
echo "test" >one >>list
done
What I need is for "one" to contain only the last loop iteration, and for "list" to contain every iteration.
You can use tee for this; since tee still writes to stdout, you can do something like:
#!/bin/bash
for i in {1..4}
do
echo "test" | tee one >>list
done
or this, if you want to see the echoes when you run it; the -a flag tells tee to append rather than truncate:
#!/bin/bash
for i in {1..4}
do
echo "test" | tee one | tee -a list
done
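A quick check of the second variant's result (a sketch, calling the script test.sh here; note that tee -a and >>list keep appending across runs, so remove list first if you want per-run counts):
$ rm -f list && ./test.sh
test
test
test
test
$ cat one        # only the last iteration survives the truncating tee
test
$ wc -l < list   # every iteration was appended
4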
I want to run a program that accepts two inputs, but the inputs must be unzipped first. The problem is that the files are so large that unzipping them to disk is not a good solution, so I need to unzip just the input on the fly. For example:
gunzip myfile.gz | runprog > hurray.txt
That's a perfectly fine thing, but the program I want to run requires two inputs, both of which must be unzipped. So
gunzip file1.gz
gunzip file2.gz
runprog -1 file1_unzipped -2 file2_unzipped
What I need is some way to unzip the files and pass them over a pipe, I imagine something like this:
gunzip f1.gz, f2.gz | runprog -1 f1_input -2 f2_input
Is this doable? Is there any way to unzip two files and pass the output across the pipe?
GNU gunzip has a --stdout option (a.k.a. -c) for just this purpose, and there's also zcat, as @slim pointed out. The resulting output will be concatenated into a single stream, though, because that's how pipes work. One way to get around this is to create two separate input streams and handle them separately in runprog. For example, here's how you would make the first file available on file descriptor 8 and the second on file descriptor 9:
runprog 8< <(zcat f1.gz) 9< <(zcat f2.gz)
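If runprog were itself a shell script, it could read those two descriptors like this (a sketch, assuming it is invoked with the 8< and 9< redirections above):
#!/bin/bash
# fd 8 carries the decompressed f1.gz, fd 9 the decompressed f2.gz
echo "=== first input ==="
cat <&8
echo "=== second input ==="
cat <&9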
Another alternative is to pass two file descriptors as parameters to the command:
runprog <(zcat f1.gz) <(zcat f2.gz)
The two arguments can now be treated just like two file arguments.
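To see what the process substitutions expand to, you can try the same pattern with a stand-in you already have; here paste plays the role of the hypothetical runprog:
# paste receives two /dev/fd/NN paths as arguments and reads the
# decompressed contents of f1.gz and f2.gz through them
paste <(zcat f1.gz) <(zcat f2.gz) | head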
Your program would need to understand that there are two inputs in the stream, so you would need a delimiter between the two files. When your program encounters the delimiter, it can split the input into two parts, since the pipe may deliver both input files in a single buffer.
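A rough sketch of that delimiter idea, assuming text data and a marker line (___SPLIT___ here is a made-up value) that never occurs in either file:
# writer side: concatenate both decompressed files with a marker in between
{ zcat f1.gz; echo '___SPLIT___'; zcat f2.gz; } |
# reader side: split the combined stream back into part1 and part2
awk 'BEGIN { part = 1 }
     /^___SPLIT___$/ { part = 2; next }
     { print > ("part" part) }'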
In cygwin, the following code works fine
$ cat junk
bat
bat
bat
$ cat junk | sort -k1,1 |tr 'b' 'z' > junk
$ cat junk
zat
zat
zat
But in the Linux shell (GNU/Linux), it seems that overwriting doesn't work:
[41] othershell: cat junk
cat
cat
cat
[42] othershell: cat junk |sort -k1,1 |tr 'c' 'z'
zat
zat
zat
[43] othershell: cat junk |sort -k1,1 |tr 'c' 'z' > junk
[44] othershell: cat junk
Both environments run BASH.
I am asking because sometimes, after doing text manipulation, this caveat forces me to create a temporary file. But I know that in Perl you can give the -i flag to overwrite the original file after some operations/manipulations. I just want to ask whether there is any foolproof method in a Unix pipeline for overwriting a file that I am not aware of.
Four main points here:
"Useless use of cat." Don't do that.
You're not actually sorting anything with sort. Don't do that.
Your pipeline doesn't say what you think it does. Don't do that.
You're trying to over-write a file in-place while reading from it. Don't do that.
One of the reasons you are getting inconsistent behavior is that you are piping to a process that has redirection, rather than redirecting the output of the pipeline as a whole. The difference is subtle, but important.
What you want is to create a compound command with Command Grouping, so that you can redirect the input and output of the whole pipeline. In your case, this should work properly:
{ sort -k1,1 | tr 'c' 'z'; } < junk > sorted_junk
Please note that without anything to sort, you might as well skip the sort command too. Then your command can be run without the need for command grouping:
tr 'c' 'z' < junk > sorted_junk
Keep redirections and pipelines as simple as possible. It makes debugging your scripts much easier.
However, if you still want to abuse the pipeline for some reason, you could use the sponge utility from the moreutils package. The man page says:
sponge reads standard input and writes it out to the specified
file. Unlike a shell redirect, sponge soaks up all its input before
opening the output file. This allows constructing pipelines that read
from and write to the same file.
So, your original command line can be re-written like this:
cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
and since junk will not be overwritten until sponge receives EOF from the pipeline, you will get the results you were expecting.
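With the junk file from the Linux transcript above, the expected result would look like this:
$ cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
$ cat junk
zat
zat
zat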
In general this can be expected to break. The processes in a pipeline are all started up in parallel, so the > junk at the end of the line will usually truncate your input file before the process at the head of the pipeline has finished (or even started) reading from it.
Even if bash under Cygwin lets you get away with this, you shouldn't rely on it. The general solution is to redirect to a temporary file and then rename it when the pipeline is complete.
If you want to edit that file, you can just use the editor:
ex junk << EOF
%!(sort -k1,1 |tr 'b' 'z')
x
EOF
Overwriting the same file in a pipeline is not advised, because when you make a mistake you can't get it back (unless you have a backup or it's under version control).
This happens because the input and output in a pipeline are automatically buffered (which gives you the impression that it works), but the commands actually run in parallel. Different platforms buffer the output in different ways (depending on the settings), so on some you end up with an empty file (because the output file is created right at the start), and on others with a half-finished file.
The solution is to use a method where the file is only overwritten once EOF is reached and the input has been fully buffered and processed.
This can be achieved by:
Using a utility which soaks up all its input before opening the output file.
This can be done by sponge (as opposed to unbuffer from the expect package).
Avoiding the I/O redirection syntax (which creates the empty output file before the command even starts).
For example, using tee (which buffers its standard streams):
cat junk | sort | tee junk
This only works with sort, because sort needs to read all of its input before it can produce any output. So if your command doesn't use sort, add one.
Another tool which can be used is stdbuf, which modifies the buffering of its standard streams and lets you specify the buffer size.
Using a text processor which can edit files in-place (such as sed or ex).
Example:
$ ex -s +'%!sort -k1' -cxa myfile.txt
$ sed -i '' s/foo/bar/g myfile.txt
Using the following simple script, you can make it work like you want to:
$ cat junk | sort -k1,1 |tr 'b' 'z' | overwrite_file.sh junk
overwrite_file.sh
#!/usr/bin/env bash
OUT=$(cat -)                    # soak up all of stdin before touching the file
FILENAME="$*"                   # the target file name, taken from the arguments
echo "$OUT" | tee "$FILENAME"   # only now overwrite the file (and echo it to stdout)
Note that if you don't want the updated file to be sent to stdout, you can use this approach instead:
overwrite_file_no_output.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" > "$FILENAME"
So my question is whether I can somehow send data to my program and then send the same data AND its result to another program, without having to create a temporary file (in my case outputdata.txt).
Preferably using linux pipes/bash.
I currently do the following:
cat inputdata.txt | ./MyProg > outputdata.txt
cat inputdata.txt outputdata.txt | ./MyProg2
Here is another way, which can be extended to put the output of two programs together:
( Prog1; Prog2; Prog3; ... ) | ProgN
That at least works in Bash.
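Applied to the question above, that could look like this (a sketch reusing the file and program names from the question; no temporary file is created):
# send the original data, then MyProg's output on that same data, into MyProg2
( cat inputdata.txt; ./MyProg < inputdata.txt ) | ./MyProg2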
Choice 1 - fix MyProg to write the merged output of its input and its own output. Then you can do this:
./MyProg <inputdata.txt | ./MyProg2
Choice 2 - If you can't fix MyProg to write both input and output, you need to merge.
./MyProg <inputdata.txt | cat inputdata.txt - | ./MyProg2