How to stream one file to multiple pipelines efficiently - bash

I have a script that wants to run several programs / pipelines over a very large file. Example:
grep "ABC" file > file.filt
md5sum file > file.md5
The kernel will try to cache the file in RAM, so if it is read again soon it may be served from RAM. However, the files are large and the programs run at wildly different speeds, so this caching is unlikely to be effective. To minimise I/O I want to read the file only once.
I know of 2 ways to duplicate the data using tee and moreutils' pee:
<file tee >(md5sum > file.md5) | grep "ABC" > file.filt
<file pee 'md5sum > file.md5' 'grep "ABC" > file.filt'
Is there another 'best' way? Which method will make the fewest copies? Does it make a difference which program is >() or |-ed to? Will any of these approaches attempt to buffer data in RAM if one program is too slow? How do they scale to many reader programs?

tee (command) opens each file using fopen, but sets _IONBF (unbuffered) on each. It reads from stdin, and fwrites to each FILE*.
pee (command) popens each command, sets each to unbuffered, reads from stdin, and fwrites to each FILE*.
popen uses pipe(2), which has a capacity of 65536 bytes. Writes to a full buffer will block. pee also uses /bin/sh to interpret the command, but I think that will not add any buffering/copying.
mkfifo (the command) uses mkfifo (the libc function), which uses pipes underneath; opening the file/pipe blocks until the other end is opened.
bash's >() / <() syntax (subst.c:5712) uses either pipe or mkfifo: pipe if /dev/fd is supported. It does not use the C stdio fopen calls, so it does not set any buffering.
So all three variants (pee, tee >(), mkfifo ...) should end up with identical behaviour: reading from stdin and writing to pipes without buffering. The data is duplicated at each read (from kernel to user), and then again at each write (user back to kernel); I think tee's fwrites will not cause an extra layer of copying (as there is no stdio buffer). Memory usage could increase to a maximum of 65536 * num_readers + 1 * read_size (if no one is reading). tee writes to stdout first, then to each file/pipe in order.
Given this, pee mainly works around other shells' (fish!) lack of a >() equivalent, so there seems to be no need for it with bash. I prefer tee when you have bash, but pee is nice when you don't. Bash's <() is not replaced by pee, of course. Manually creating FIFOs and redirecting is tricky and unlikely to handle errors nicely.
pee could probably be changed to be implemented with the tee(2) call (instead of fwrite). I think this would cause the input to be read at the speed of the fastest reader, and potentially fill up the kernel pipe buffers.
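To see the blocking behaviour described above in practice, here is a rough sketch (assuming bash and GNU coreutils; the test file size and the sleep are arbitrary):
head -c 10M /dev/zero > bigfile
time (<bigfile tee >(wc -c > /dev/null) | { sleep 5; wc -c > /dev/null; })
# takes roughly 5 seconds: once the ~64 KiB pipe to the sleeping reader fills,
# tee blocks, so the whole pipeline proceeds at the pace of the slowest consumer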

AFAIK, there is no "best way" to achieve this. But I can give you another approach: more verbose, not a one-liner, but maybe clearer because each command is written on its own line. Use named pipes:
mkfifo tmp1 tmp2
tee tmp1 > tmp2 < file &
cat tmp1 | md5sum > file.md5 &
cat tmp2 | grep "ABC" > file.filt &
wait
rm tmp1 tmp2
Create as many named pipes as commands to be run.
tee the input file to the named pipes (tee copies its input to standard output, so the last named pipe must be a redirection); let it run in the background.
Use the different named pipes as input to the different commands to run. Let them run in the background.
Finally, wait for the jobs to finish and remove the temporary named pipes.
The drawback of this approach, when the programs have great variability in their speeds, is that all of them read the file at the same pace (the limit is the pipe buffer size: once it is full for one of the pipes, the others have to wait too), so if one of them is resource-hungry (memory-hungry, say), those resources stay in use for the whole lifespan of all the processes.
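A variation on the same idea (a rough sketch, not the recipe above verbatim): keep the FIFOs in a temporary directory, clean them up with a trap, and drop the extra cat processes. Starting the readers in the background before tee opens the FIFOs for writing avoids deadlocking on the opens:
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
mkfifo "$dir/p1" "$dir/p2"
md5sum < "$dir/p1" > file.md5 &
grep "ABC" < "$dir/p2" > file.filt &
tee "$dir/p1" > "$dir/p2" < file
wait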

Related

Understanding tty + bash

I see that I can use one bash session to print text in another as follows
echo './myscript' > /dev/pts/0 # assuming session 2 is using this tty
# or
echo './myscript' > /proc/1500/fd/0 # assuming session 2's pid is 1500
But why does the text ./myscript only print and not execute? Is there anything that I can do to execute my script this way?
(I know that this will attract a lot of criticism which will perhaps fill any answers that follow with "DON'T DO THAT!" but the real reason I wish to do this is to automatically supply a password to sshfs. I'm working with a local WDMyCloud system, and it deletes my .authorized_keys file every night when I turn off the power.)
why does the text ./myscript only print and not execute?
Input and output are two different things.
Writing to a terminal puts data on the screen. Reading from a terminal reads input from the keyboard. In no way does writing to the terminal simulate keyboard input.
There's no inherent coupling between input and output, and the fact that keys you press show up on screen at all is a conscious design decision: the shell simply reads a key, appends it to its internal command buffer, and writes a copy to the screen.
This is purely for your benefit so you can see what you're typing, and not because the shell in any way cares what's on the screen. Since it doesn't, writing more stuff to screen has no effect on what the shell executes.
Is there anything that I can do to execute my script this way?
Not by writing to a terminal, no.
Here is an example using a FIFO:
#!/usr/bin/bash
FIFO="$(mktemp)"
rm -fv "$FIFO"
mkfifo "$FIFO"
( echo testing123 > "$FIFO" ) &
cat "$FIFO" | sshfs -o password_stdin testing#localhost:/tmp $HOME/tmp
How you store the password and send it to the FIFO is up to you
You can accomplish what you want by using an ioctl system call:
The ioctl() system call manipulates the underlying device parameters of special files. In particular, many operating characteristics of character special files (e.g., terminals) may be controlled with ioctl() requests.
For the 'request' argument of this system call, you'll want TIOCSTI, which is defined as 0x5412 in my header files. (grep -r TIOCSTI /usr/include to verify for your environment.)
I accomplish this as follows in ruby:
fd = IO.sysopen("/proc/#{$$}/fd/0", 'wb')
io = IO.new(fd, 'wb')
"puts 9 * 16\n".chars.each { |c| io.ioctl 0x5412, c };

How to pipe two separate outputs into a single program's input without a buffer

I have an executable Python script, process-data.py, that reads live input on stdin and processes it in real time. I want to feed it two types of data: images and raw text. Both are generated by other Python scripts.
Processing text works when using unbuffer and a pipe, like so:
unbuffer ./text-output.py | ./process-data.py
Doing the same for the image data also works:
unbuffer ./image-output.py | ./process-data.py
How would I run both image-output.py and text-output.py at the same time and process the data without a delay from a buffer? I have tried using cat, but it doesn't work in real time (both "output" scripts generate their data over time and do so indefinitely).
You can try to use named pipes:
mkfifo /tmp/pipe
./text-output.py > /tmp/pipe &
./image-output.py > /tmp/pipe &
./process-data.py < /tmp/pipe
rm /tmp/pipe
But remember that pipes (named and unnamed) still use buffers internally.
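If bash is available, the same thing can also be written without a named pipe, by pointing the consumer's stdin at a single process substitution that both producers write into (a minimal sketch, reusing the unbuffer calls from the question):
./process-data.py < <(unbuffer ./text-output.py & unbuffer ./image-output.py)
As with the named pipe, output from the two producers can interleave arbitrarily, since both write into the same pipe.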

Displaying stdout on screen and a file simultaneously

I'd like to log standard output from a script of mine to a file, but also have it display on screen for realtime monitoring. The script outputs something about 10 times every second.
I tried to redirect stdout to a file and then tail -f that file from another terminal, but for some reason tail is updating the screen significantly slower than the script is writing to the file.
What's causing this lag? Is there an alternate method of getting one standard output stream both on my terminal and into a file for later examination?
I can't say why tail lags, but you can use tee:
Redirect output to multiple files: copy standard input to standard output and also to any files given as arguments. This is useful when you want not only to send some data down a pipe, but also to save a copy.
Example: <command> | tee <outputFile>
How much of a lag do you see? A few hundred characters? A few seconds? Minutes? Hours?
What you are seeing is buffering. Almost all file reads and writes are buffered. This includes input and output, and there is also some buffering taking place within pipes. It's simply more efficient to pass a packet of data around rather than a byte at a time. I believe file names on HFS+ file systems are stored in UTF-16, while Mac OS X normally uses UTF-8 as a default. (NTFS also stores file names in UTF-16, while Windows uses code pages for character data by default.)
So, if you run tail -f from another terminal, you may be seeing buffering from tail; but when you use a pipe and then tee, you may have a buffer in the pipe and in the tee command, which may be why you see the lag.
By the way, how do you know there's a lag? How do you know how quickly your program is writing to the disk? Do you print out something in your program to help track the writes to the file?
In that case, you might not be lagging as much as you think. File writes are also buffered. So, it is very possible that the lag isn't from the tail -f, but from your script writing to the file.
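If the lag comes from the writing side's own stdio buffering, one common workaround (a sketch assuming GNU coreutils' stdbuf and a program that uses ordinary C stdio buffering; the script and log names here are placeholders) is to force line-buffered output:
stdbuf -oL ./myscript | tee output.log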
Use the tee command:
tail -f /path/logFile | tee outfile

What is a simple explanation for how pipes work in Bash?

I often use pipes in Bash, e.g.:
dmesg | less
Although I know what this does (it takes the output of dmesg and lets me scroll through it with less), I do not understand what the | is doing. Is it simply the opposite of >?
Is there a simple, or metaphorical explanation for what | does?
What goes on when several pipes are used in a single line?
Is the behavior of pipes consistent everywhere it appears in a Bash script?
A Unix pipe connects the STDOUT (standard output) file descriptor of the first process to the STDIN (standard input) of the second. What happens then is that when the first process writes to its STDOUT, that output can be immediately read (from STDIN) by the second process.
Using multiple pipes is no different than using a single pipe. Each pipe is independent, and simply links the STDOUT and STDIN of the adjacent processes.
Your third question is a little bit ambiguous. Yes, pipes, as such, are consistent everywhere in a bash script. However, the pipe character | can represent different things. The double pipe (||) represents the "or" operator, for example.
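A quick way to see the difference (a small sketch):
printf 'a\nb\n' | wc -l      # |  : wc reads printf's output and prints 2
false || echo "fallback"     # || : echo runs only because the command before it failed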
In Linux (and Unix in general) each process has three default file descriptors:
fd #0 Represents the standard input of the process
fd #1 Represents the standard output of the process
fd #2 Represents the standard error output of the process
Normally, when you run a simple program these file descriptors are by default configured as follows:
Standard input is read from the keyboard
Standard output is configured to be the monitor
Standard error is configured to be the monitor also
Bash provides several operators to change this behaviour (take a look at the >, >> and < operators, for example). Thus, you can redirect the output to something other than the standard output, or read your input from a stream other than the keyboard. Especially interesting is the case when two programs collaborate in such a way that one uses the output of the other as its input. To make this collaboration easy, Bash provides the pipe operator |. Please note the use of collaboration instead of chaining. I avoided that term because a pipe is not in fact sequential. A normal command line with pipes has the following aspect:
> program_1 | program_2 | ... | program_n
The above command line is a little bit misleading: a user could think that program_2 gets its input once program_1 has finished executing, which is not correct. In fact, what bash does is launch ALL the programs in parallel and configure the inputs and outputs so that every program gets its input from the previous one and delivers its output to the next one (in the order established on the command line).
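One way to convince yourself of this is the following sketch: the right-hand side prints the date immediately, well before the left-hand side's sleep finishes, because both sides are started at the same time.
{ sleep 5; echo done; } | { date; cat; }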
Following is a simple example from Creating pipe in C of creating a pipe between a parent and a child process. The important parts are the call to pipe() and how the parent closes fd[1] (the writing side) while the child closes fd[0] (the reading side). Please note that the pipe is a unidirectional communication channel; data can only flow in one direction, from fd[1] towards fd[0]. For more information take a look at the manual page of pipe().
#include <stdio.h>
#include <stdlib.h>     /* for exit() */
#include <string.h>     /* for strlen() */
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    int fd[2], nbytes;
    pid_t childpid;
    char string[] = "Hello, world!\n";
    char readbuffer[80];

    pipe(fd);

    if ((childpid = fork()) == -1)
    {
        perror("fork");
        exit(1);
    }

    if (childpid == 0)
    {
        /* Child process closes up input side of pipe */
        close(fd[0]);

        /* Send "string" through the output side of pipe */
        write(fd[1], string, strlen(string) + 1);
        exit(0);
    }
    else
    {
        /* Parent process closes up output side of pipe */
        close(fd[1]);

        /* Read in a string from the pipe */
        nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
        printf("Received string: %s", readbuffer);
    }

    return 0;
}
Last but not least, when you have a command line in the form:
> program_1 | program_2 | program_3
The exit status of the whole line is that of the last command, in this case program_3. If you would like to get an intermediate exit status, you have to set the pipefail option or read it from the PIPESTATUS array, as in the sketch below.
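A minimal sketch of both options:
false | true; echo "${PIPESTATUS[@]}"   # prints "1 0": one status per pipeline member
set -o pipefail
false | true; echo "$?"                 # prints 1: the pipeline now reports the failure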
Every standard process in Unix has at least three file descriptors, which are sort of like interfaces:
Standard output, which is the place where the process prints its data (most of the time the console, that is, your screen or terminal).
Standard input, which is the place it gets its data from (most of the time it may be something akin to your keyboard).
Standard error, which is the place where errors and sometimes other out-of-band data goes. It's not interesting right now because pipes don't normally deal with it.
The pipe connects the standard output of the process to the left to the standard input of the process of the right. You can think of it as a dedicated program that takes care of copying everything that one program prints, and feeding it to the next program (the one after the pipe symbol). It's not exactly that, but it's an adequate enough analogy.
Each pipe operates on exactly two things: the standard output coming from its left and the input stream expected at its right. Each of those could be attached to a single process or another bit of the pipeline, which is the case in a multi-pipe command line. But that's not relevant to the actual operation of the pipe; each pipe does its own.
The redirection operator (>) does something related, but simpler: by default it sends the standard output of a process directly to a file. As you can see it's not the opposite of a pipe, but actually complementary. The opposite of > is unsurprisingly <, which takes the content of a file and sends it to the standard input of a process (think of it as a program that reads a file byte by byte and types it in a process for you).
In short, as described, there are three key 'special' file descriptors to be aware of. By default the shell sends the keyboard to stdin and sends stdout and stderr to the screen.
A pipeline is just a shell convenience which attaches the stdout of one process directly to the stdin of the next.
There are a lot of subtleties to how this works; for example, the stderr stream is not piped unless you ask for it, as the sketch below shows.
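A quick illustration (a small sketch):
ls /nonexistent | grep "No such"        # the error message bypasses the pipe and goes to the terminal
ls /nonexistent 2>&1 | grep "No such"   # redirecting stderr into stdout sends it through the pipe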
I have spent quite some time trying to write a detailed but beginner friendly explanation of pipelines in Bash. The full content is at:
https://effective-shell.com/docs/part-2-core-skills/7-thinking-in-pipelines/
A pipe takes the output of a process (by output I mean the standard output, stdout on UNIX) and passes it to the standard input (stdin) of another process. It is not the opposite of the simple right redirection >, whose purpose is to redirect output to a destination such as a file.
For example, take the echo command on Linux, which simply prints a string passed as a parameter to the standard output. If you use a simple redirect like:
echo "Hello world" > helloworld.txt
the shell will redirect the output, initially intended for stdout, and write it directly into the file helloworld.txt.
Now, take this example, which involves the pipe:
ls -l | grep helloworld.txt
The standard output of the ls command is sent to the input of grep, so how does this work?
Programs such as grep, when used without a file argument, simply read and wait for something to be passed on their standard input (stdin). When they catch something, like the output of the ls command, grep acts normally, finding occurrences of what you're searching for.
Pipes are very simple like this.
You have the output of one command. You can provide this output as the input into another command using pipe. You can pipe as many commands as you want.
ex:
ls | grep my | grep files
This first lists the files in the working directory. This output is checked by the grep command for the word "my". The output of this is then fed into the second grep command, which finally searches for the word "files". That's it.
The pipe operator takes the output of the first command, and 'pipes' it to the second one by connecting stdin and stdout.
In your example, instead of the output of the dmesg command going to stdout (and being printed on the console), it goes right into your next command.
| puts the STDOUT of the command at left side to the STDIN of the command of right side.
If you use multiple pipes, it's just a chain of pipes. The first command's output becomes the second command's input. The second command's output becomes the next command's input. And so on.
It's available in every Linux- and Windows-based command interpreter.
All of these answers are great. Something that I would just like to mention is that a pipe in bash (the same concept as a Unix/Linux or Windows named pipe) is just like a pipe in real life.
If you think of the program before the pipe as a source of water, the pipe as a water pipe, and the program after the pipe as something that uses the water (with the program output as water), then you pretty much understand how pipes work.
And remember that all apps in a pipeline run in parallel.
Regarding the efficiency of pipes:
A command can start processing the data on its input before the previous command in the pipe has completed, which means better use of the available computing power.
A pipe does not require the output of one command to be saved to a file before the next command can read it (there is no disk I/O between the two commands), which means fewer costly I/O operations and less disk space used.
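For example (a sketch with made-up file names), the piped form streams the data and lets the stages overlap, while the unpiped form has to materialise an intermediate file on disk first:
gzip -dc big.log.gz | grep ERROR | wc -l    # streams: no intermediate file, stages run in parallel
gzip -dc big.log.gz > big.log               # versus: decompress everything to disk first,
grep ERROR big.log | wc -l                  # then scan it in a second pass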
If you treat each unix command as a standalone module,
but you need them to talk to each other using text as a consistent interface,
how can it be done?
cmd                         input                  output
echo "foobar"               string                 "foobar"
cat "somefile.txt"          file                   string inside the file
grep "pattern" "a.txt"      pattern, input file    matched string
You can say | is a metaphor for passing the baton in a relay marathon.
It's even shaped like one!
cat -> echo -> less -> awk -> perl is analogous to cat | echo | less | awk | perl.
cat "somefile.txt" | echo
cat passes its output for echo to use.
What happens when there is more than one input?
cat "somefile.txt" | grep "pattern"
There is an implicit rule that says "pass it as the input rather than the pattern" for grep.
You will slowly develop an eye for knowing which parameter is which through experience.
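Concretely (a small sketch):
grep "pattern" a.txt          # the pattern and the file name both come from the arguments
cat a.txt | grep "pattern"    # the pattern comes from the arguments, the data arrives on stdin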

Can using cat create problems when passing the output to other commands?

In bash, there are multiple ways to direct input and output. For example, these commands do the same thing:
sort <input_file >output_file
cat input_file | sort >output_file
Generally I'd prefer the second way, because I prefer my commands to read left-to-right.
But the answer to this question says:
"sort" can use temporary files to work with input files larger than memory
Which makes me wonder, when sorting a huge file, if cat would short-circuit that process.
Can using cat create problems when passing the output to other commands?
There is a term I throw around a lot called Useless Use of Cat (UUoC), and the second option is exactly that. When a utility can take input on STDIN (such as sort), using redirection not only saves you a call to an external process such as cat, but also avoids the overhead of a pipeline.
Other than the extra process and pipeline, the only other "problem" I see is that you would be subject to pipeline buffering.
Update
Apparently, there is even a website dedicated to giving out a UUoC Award
"I prefer my commands to read left-to-right"
<input_file sort >output_file
(The canonical way to write this is of course sort input_file >output_file.)
The 'sort' command handles large files regardless of whether the input arrives via a pipe on standard input, via I/O redirection, or by being named directly on the command line.
Note that you could (and probably should) write:
sort -o output_file input_file
That will work correctly even if the input and output files are the same (or if you have multiple input files, one of which is also the output file).
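To see why -o matters when the input and output are the same file (a small sketch; the file name is just an example):
sort -o data.txt data.txt    # safe: sort reads all the input before it opens the output
sort data.txt > data.txt     # unsafe: the shell truncates data.txt before sort even runs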
I see that SiegeX has already taken you to task for abusing cat -- feline abuse, as it is also known. I'll support his efforts. There are times when it is appropriate to use cat. There are fewer times when it is appropriate than is often recognized.
One example of appropriate use is with the tr command and multiple sources of data:
cat "$#" | tr ...
That is necessary because tr only reads its standard input and only writes to its standard output - the ultimate in 'pure filter' programs.
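For example, a tiny filter script that uppercases all of its file arguments, or its standard input when no arguments are given (a sketch):
#!/bin/sh
cat "$@" | tr '[:lower:]' '[:upper:]'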
The authors of Unix have also noted that the general purpose 'cat inputs | command' construct is used instead of the more specialized input redirection (citation missing - books needed not at hand).
