What is a simple explanation for how pipes work in Bash?

I often use pipes in Bash, e.g.:
dmesg | less
Although I know what this does (it takes the output of dmesg and lets me scroll through it with less), I do not understand what the | itself is doing. Is it simply the opposite of >?
Is there a simple, or metaphorical explanation for what | does?
What goes on when several pipes are used in a single line?
Is the behavior of pipes consistent everywhere it appears in a Bash script?

A Unix pipe connects the STDOUT (standard output) file descriptor of the first process to the STDIN (standard input) of the second. What happens then is that when the first process writes to its STDOUT, that output can be immediately read (from STDIN) by the second process.
Using multiple pipes is no different than using a single pipe. Each pipe is independent, and simply links the STDOUT and STDIN of the adjacent processes.
Your third question is a little ambiguous. Yes, pipes, as such, behave consistently everywhere in a bash script. However, the pipe character | can represent different things: the double pipe (||), for example, represents the "or" operator.
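As a quick sketch of the difference (any two commands would do here):
cat /etc/hosts | less       # |  : less reads cat's output through a pipe
false || echo "fallback"    # || : echo runs only because false exited non-zero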

In Linux (and Unix in general) each process has three default file descriptors:
fd #0 Represents the standard input of the process
fd #1 Represents the standard output of the process
fd #2 Represents the standard error output of the process
Normally, when you run a simple program these file descriptors are configured as follows by default:
Standard input is read from the keyboard
Standard output is written to the terminal
Standard error is also written to the terminal
Bash provides several operators to change this behavior (take a look at the >, >> and < operators, for example). Thus, you can redirect the output to something other than the terminal, or read your input from a stream other than the keyboard. Especially interesting is the case when two programs collaborate in such a way that one uses the output of the other as its input. To make this collaboration easy, Bash provides the pipe operator |. Please note the use of collaboration instead of chaining; I avoided that term because a pipe is in fact not sequential. A normal command line with pipes looks like this:
> program_1 | program_2 | ... | program_n
The above command line is a little misleading: users might think that program_2 gets its input once program_1 has finished its execution, which is not correct. In fact, what bash does is launch ALL the programs in parallel and configure their inputs and outputs so that every program gets its input from the previous one and delivers its output to the next one, in the order established on the command line.
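You can see the parallelism for yourself with a small experiment (yes prints "y" forever, so the pipeline could never finish if the commands ran one after another):
yes | head -n 3   # prints three lines of "y" and exits immediately:
                  # head quits after 3 lines and yes dies on the broken pipe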
Following is a simple example, adapted from Creating pipe in C, of creating a pipe between a parent and a child process. The important parts are the call to pipe(), how the child closes fd[0] (the reading side), and how the parent closes fd[1] (the writing side). Please note that a pipe is a unidirectional communication channel: data can only flow in one direction, from fd[1] towards fd[0]. For more information take a look at the manual page of pipe().
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <string.h>     /* strlen() */
#include <unistd.h>     /* pipe(), fork(), read(), write(), close() */
#include <sys/types.h>

int main(void)
{
    int fd[2];
    ssize_t nbytes;
    pid_t childpid;
    char string[] = "Hello, world!\n";
    char readbuffer[80];

    pipe(fd);

    if ((childpid = fork()) == -1)
    {
        perror("fork");
        exit(1);
    }

    if (childpid == 0)
    {
        /* Child process closes the input (reading) side of the pipe */
        close(fd[0]);
        /* Send "string", including its terminating NUL, through the output side */
        write(fd[1], string, strlen(string) + 1);
        exit(0);
    }
    else
    {
        /* Parent process closes the output (writing) side of the pipe */
        close(fd[1]);
        /* Read the string from the pipe */
        nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
        printf("Received string: %s", readbuffer);
    }
    return 0;
}
Last but not least, when you have a command line in the form:
> program_1 | program_2 | program_3
The return code of the whole line is that of the last command, in this case program_3. If you would like to get an intermediate return code, you have to set the pipefail option (set -o pipefail) or read it from the PIPESTATUS array.
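A short demonstration, using false and true, which exit with status 1 and 0 respectively:
false | true
echo $?                    # 0: the status of the last command (true)
echo "${PIPESTATUS[@]}"    # 1 0: per-command statuses of the last pipeline
set -o pipefail
false | true
echo $?                    # 1: under pipefail, the rightmost non-zero status wins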

Every standard process in Unix has at least three file descriptors, which are sort of like interfaces:
Standard output, which is the place where the process prints its data (most of the time the console, that is, your screen or terminal).
Standard input, which is the place it gets its data from (most of the time your keyboard).
Standard error, which is the place where errors and sometimes other out-of-band data goes. It's not interesting right now because pipes don't normally deal with it.
The pipe connects the standard output of the process on the left to the standard input of the process on the right. You can think of it as a dedicated program that takes care of copying everything that one program prints, and feeding it to the next program (the one after the pipe symbol). It's not exactly that, but it's an adequate enough analogy.
Each pipe operates on exactly two things: the standard output coming from its left and the input stream expected at its right. Each of those could be attached to a single process or to another bit of the pipeline, which is the case in a multi-pipe command line. But that's not relevant to the actual operation of the pipe; each pipe does its own job.
The redirection operator (>) does something related, but simpler: by default it sends the standard output of a process directly to a file. As you can see it's not the opposite of a pipe, but actually complementary. The opposite of > is unsurprisingly <, which takes the content of a file and sends it to the standard input of a process (think of it as a program that reads a file byte by byte and types it in a process for you).
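To make the complementarity concrete, here is a sketch reusing the dmesg example from the question (boot.log is just a made-up file name):
dmesg | less          # pipe: stdout of dmesg feeds stdin of less
dmesg > boot.log      # >   : stdout of dmesg goes into a file
less < boot.log       # <   : stdin of less comes from that file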

In short, as described above, there are three key 'special' file descriptors to be aware of. By default the shell feeds the keyboard to stdin and sends stdout and stderr to the screen.
A pipeline is just a shell convenience which attaches the stdout of one process directly to the stdin of the next.
There are a lot of subtleties to how this works; for example, the stderr stream is not piped as you might expect, as shown below.
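A minimal illustration of that pitfall, using ls on a path that (presumably) does not exist, so it is guaranteed to write to stderr:
ls /nonexistent | wc -l        # prints 0: the error message bypasses the pipe
ls /nonexistent 2>&1 | wc -l   # merge stderr into stdout first: now it is counted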
I have spent quite some time trying to write a detailed but beginner-friendly explanation of pipelines in Bash. The full content is at:
https://effective-shell.com/docs/part-2-core-skills/7-thinking-in-pipelines/

A pipe takes the output of a process (by output I mean the standard output, or stdout on UNIX) and passes it to the standard input (stdin) of another process. It is not the opposite of the simple right redirection >, whose purpose is to redirect an output to a file.
For example, take the echo command on Linux, which simply prints a string passed as a parameter on the standard output. If you use a simple redirect like:
echo "Hello world" > helloworld.txt
the shell will redirect the output, initially intended for stdout, and write it directly into the file helloworld.txt.
Now, take this example, which involves the pipe:
ls -l | grep helloworld.txt
The standard output of the ls command is fed to the input of grep, so how does this work?
Programs such as grep, when used without a file argument, simply read and wait for something to arrive on their standard input (stdin). When they catch something, like the output of the ls command, grep acts normally and finds the occurrences of what you're searching for.
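A sketch of both modes, reusing the helloworld.txt file created above:
grep Hello helloworld.txt     # pattern plus a file operand: grep reads the file
ls -l | grep helloworld.txt   # no file operand: grep reads its stdin (the pipe)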

Pipes are very simple like this.
You have the output of one command. You can provide this output as the input into another command using pipe. You can pipe as many commands as you want.
ex:
ls | grep my | grep files
This first lists the files in the working directory. That output is checked by the grep command for the word "my". The output of that is then the input to the second grep command, which finally searches for the word "files". That's it.

The pipe operator takes the output of the first command, and 'pipes' it to the second one by connecting stdin and stdout.
In your example, instead of the output of the dmesg command going to stdout (and being thrown out on the console), it goes right into your next command.

| puts the STDOUT of the command on the left side into the STDIN of the command on the right side.
If you use multiple pipes, it's just a chain of pipes. The first command's output is set as the second command's input, the second command's output is set as the next command's input, and so on.
It's available in all Linux/Windows based command interpreters.

All of these answers are great. Something that I would just like to mention is that a pipe in bash (which follows the same concept as a Unix/Linux or Windows named pipe) is just like a pipe in real life.
If you think of the program before the pipe as a source of water, the pipe as a water pipe, and the program after the pipe as something that uses the water (with the program output as the water), then you pretty much understand how pipes work.
And remember that all apps in a pipeline run in parallel.

Regarding the efficiency of pipes:
A command can access and process the data at its input before the previous command in the pipeline has completed, which means better utilization of computing power if the resources are available.
A pipe does not require the output of one command to be saved to a file before the next command can access it (there is no disk I/O between two commands), which means a reduction in costly I/O operations and in disk-space usage.
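A sketch of the difference, reusing dmesg from the question (grep -c counts matching lines; /tmp/out is a made-up temporary file):
# without a pipe: write everything to disk, then read it all back
dmesg > /tmp/out
grep -c usb /tmp/out
rm /tmp/out
# with a pipe: no intermediate file, and both commands run concurrently
dmesg | grep -c usb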

If you treat each unix command as a standalone module,
but you need them to talk to each other using text as a consistent interface,
how can it be done?
cmd                       input                  output
echo "foobar"             string                 "foobar"
cat "somefile.txt"        file                   contents of the file
grep "pattern" "a.txt"    pattern, input file    matched lines
You can say | is a metaphor for passing the baton in a relay race.
It's even shaped like one!
cat -> grep -> less -> awk -> perl is analogous to cat | grep | less | awk | perl.
cat "somefile.txt" | less
cat passes its output for less to use. (echo is deliberately left out of this relay: it ignores its standard input and only looks at its command-line arguments.)
What happens when a command accepts more than one kind of input?
cat "somefile.txt" | grep "pattern"
There is an implicit rule here: since grep is given no file operand, the piped data arrives on its standard input, while the command-line argument is treated as the pattern. You will slowly develop an eye for knowing which parameter is which through experience.

Related

Can the pipe operator be used with the stdout redirection operator?

We know that:
The pipe operator | is used to take the standard output of the left-side command as the standard input for the right-side process.
The stdout redirection operator > is used to redirect the stdout to a file
And the question is, why can't ls -la | > file redirect the output of ls -la to file? (I tried, and the file is empty.)
Is it because that the stdout redirection operator > is not a process?
In short, yes.
In a bit more detail, stdout, stderr and stdin are special file descriptors (FDs), but these remarks apply to every FD: each FD refers to exactly one resource. It can be a file, a directory, a pipe, a device (such as a terminal, a hard drive, etc.) and more. One FD, one resource. It is not possible for stdout to output to both a pipe and a file at the same time. What tee does is take stdin (typically from a pipe, but not necessarily), open a new FD associated with the filename provided as its argument, and write whatever it gets from stdin to both stdout and the new FD. This copying of content from one FD to two is not available from bash directly.
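A short sketch of tee in action ('file' is an arbitrary name):
ls -la | tee file           # the listing goes to the screen AND into 'file'
ls -la | tee file | wc -l   # ...or continues down the pipeline as well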
EDIT: I tried answering the question as originally posted. As it stands now, DevSolar's comment is actually more on point: why does > file, without a command, make an empty file in bash?
The answer is in the Shell Command Language specification, under 2.9.1 Simple Commands. In the first step, the redirection is detected. In the second step, no fields remain, so there is no command to be executed. In step 3, redirections are performed in a subshell; however, since there is no command to produce any output, the redirection simply creates (or truncates) the file, leaving it empty.

Both pipes and redirection exist in the shell

How to explain the output of cat /etc/passwd | cat </etc/issue?
In this case, the second cat receives the contents of /etc/passwd on its STDIN, and then /etc/issue is redirected to its STDIN as well. Why is only /etc/issue left?
What's more, cat </etc/passwd </etc/issue only outputs the contents of /etc/issue. Is /etc/passwd overwritten?
I am not looking for a solution how to cat two files, but confused with how pipeline works.
Piping and redirection are processed from left to right.
So first the input of cat is redirected to the pipe. Then it is redirected to /etc/issue. Then the program is run, using the last redirection, which is the file.
When you do cat <file1 <file2, stdin is first redirected to file1, then it is redirected to file2. Then the program is run, and it gets its input from the last redirection.
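A minimal demonstration of this last-one-wins rule (the /tmp file names are arbitrary):
printf 'from file1\n' > /tmp/file1
printf 'from file2\n' > /tmp/file2
cat </tmp/file1 </tmp/file2   # prints only: from file2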
It's like variable assignments. If you do:
stdin=passwd
stdin=issue
The value of stdin at the end is the last one assigned.
This is explained in the bash documentation, in the first paragraph of the section on Redirection:
Before a command is executed, its input and output may be redirected using a special notation interpreted by the shell. Redirection may also be used to open and close files for the current shell execution environment. The following redirection operators may precede or appear anywhere within a simple command or may follow a command. Redirections are processed in the order they appear, from left to right.
(emphasis mine). I assume it's also in the POSIX shell specification, I haven't bothered to look it up. This is how Unix shells have always behaved.
The pipe is created first: the standard output of cat /etc/passwd is sent to the write side of the pipe, and the standard input of cat </etc/issue is set to the read side of the pipe. Then the command on each half of the pipe is processed. There's no other I/O redirection on the LHS, but on the RHS the standard input is redirected again, so it comes from /etc/issue. That means nothing actually reads the read end of the pipe, so the LHS cat is terminated with a SIGPIPE (probably; alternatively, it writes its data to the pipe but no process ever reads it). The RHS cat never sees the pipe input; it only has the file as its standard input.

When does piping work? Does an application have to adhere to some standard format? What are stdin and stdout in Unix?

I am using a program that allows me to do
echo "Something" | app outputfile
But a similar program doesn't do that (and it's a bash script that runs java -jar internally).
Both work with
app input output
This leads me to this question: why do some programs support piping and some don't?
I am basically trying to understand, in a larger sense, how programs inter-operate so fluently in *nix, the idea behind it, and what stdin and stdout are in simple layman's terms.
A simple way of writing a program that takes an input file and writes an output file is:
Write the code in such a manner that the first two positional arguments are interpreted as the input and output strings, where the input should be a file that is available in the file system and the output is the path where it is going to write back the data.
But this is not how it is. It seems I can stream it. That's a real paradigm shift for me. I believe it's the file descriptor abstraction that makes it possible? That is, you normally write code to expect an FD rather than a real file path? Which in turn means the output file gets opened and the FD is handed to the program once I execute the command in bash?
A program can read from the terminal and send its display to the screen or to another application. What makes this possible? I think there is some concept of file descriptors that I am missing here.
Do applications 'talk' in terms of file descriptors and not file names as strings? In Unix everything is a file, and that means FDs are used?
Few other related reads :
http://en.wikipedia.org/wiki/Pipeline_(Unix)
What is a simple explanation for how pipes work in BASH?
confused about stdin, stdout and stderr?
Here's a very non-technical description of a relatively technical topic:
A file descriptor, in Unix parlance, is a small number that identifies a given file or file-like thingy. So let's talk about file-like-thingies in the Unix sense.
What's a Unix file-like-thingy? It's something that you can read from and/or write to. So standard files that live in a directory on your hard disk certainly can qualify as files. So can your terminal session – you can type into it, so it can be read, and you can read output printed on it. So can, for that matter, network sockets. So can (and we'll talk about this more) pipes.
In many cases, an application will read its data from one (or more) file descriptors, and write its results to one (or more) file descriptors. From the point of view of the core code of the application, it doesn't really care which file descriptors it's using, or what they're "hooked up" to. (Caveat: different file descriptors can be hooked up to file-like-thingies with different capabilities, like read-only-ness; I'm ignoring this deliberately for now.) So I might have a trivial program which looks like (ignoring error checking):
#include <unistd.h>  /* read(), write() */

void zcrew_up_zpelling(int in_fd, int out_fd) {
    char c;
    while (read(in_fd, &c, 1) > 0) {
        if (c == 's') c = 'z';
        write(out_fd, &c, 1);
    }
}
Don't worry too much about what this code does (please!); instead, just notice that it's copying-and-modifying from one file descriptor to another.
So, what file descriptors are actually used here? Well, that's up to the code that calls zcrew_up_zpelling(). There are, however, some vague conventions. Many programs that need a single source of input default to using stdin as the file descriptor they'll read from; many programs that need a single source of output default to using stdout as the file descriptor they'll write to. Many of these programs provide ways to use a different file descriptor instead, often one hooked up to a named file.
Let's write a program like this:
#include <fcntl.h>  /* open() and the O_* flags */

int main(int argc, char **argv) {
    int in_fd = 0;  // Descriptor of standard input
    int out_fd = 1; // Descriptor of standard output
    if (argc >= 2) in_fd = open(argv[1], O_RDONLY);
    if (argc >= 3) out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    zcrew_up_zpelling(in_fd, out_fd);
    return 0;
}
So, let's run our program:
./our_program
Hmm, it's waiting for input. We didn't pass any arguments, so it's just using stdin and stdout. What if we type "Using stdin and stdout"?
Uzing ztdin and ztdout
Interesting. Let's try something different. First, we create a file containing "Hello worlds" named, let's say, hello.txt.
./our_program hello.txt
What do we get?
Hello worldz
And one more run:
./our_program hello.txt output.txt
Our program returns immediately, but creates a file called output.txt containing... our output!
Deep breath. At this point, I'm hoping that I've successfully explained how a program is able to have behavior independent of the type of file-like-thingy hooked up to a file descriptor, and also to choose what file-like-thingy gets hooked up.
What about that pipe thing I mentioned? What about streaming? Why does it work when I say:
echo Tessting | ./our_program | grep -o z | wc -l
Well, each of these programs follows some form of the conventions above. our_program, as we know, by default reads from stdin and writes to stdout. grep does the same thing. wc by default reads from stdin and writes to stdout, so it likes to live at the end of pipelines. And echo doesn't read from a file descriptor at all (it just reads its arguments, like we did in main()), but writes to stdout, so it likes to live at the front of pipelines.
How does this all work? Well, to get much deeper we have to talk about the shell. The shell is the program that starts other command-line programs, and it gets to choose what the file descriptors are already hooked up to when a program starts. Those magic numbers of 0 and 1 for stdin and stdout we used earlier? That's a Unix convention, and the shell hooks up a file-like-thingy to each of those file descriptors before starting your program. When the shell sees you asking for a pipeline by entering a command with | characters, it hooks the stdout of one program directly into the stdin of the next program, using a file-like-thingy called a pipe. A file-like-thingy pipe, just like a plumbing pipe, takes whatever is put in one end and puts it out the other.
So, we've combined three things:
Code that deals with file descriptors, without worrying about what they're hooked to
Conventions for default file descriptors to use for normal tasks
The shell's ability to set up a program's file descriptors to "pipe" to other programs'
Together, these give us the ability to write programs that "play nice" with streaming and pipelines, without each program having to understand where it sits in the pipeline and what's happening around it.
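Putting it together with the hypothetical our_program from above, the same binary can be wired up three different ways without changing a line of its code:
./our_program                           # fd 0 and fd 1: the terminal
./our_program hello.txt output.txt      # fd 0 and fd 1: files it opened itself
echo Tessting | ./our_program | wc -c   # fd 0 and fd 1: pipes set up by the shell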

How do find and -printf work when using pipes in bash scripting?

Suppose I use -printf in the find command like this:
find ./folder -printf "%f\n" | other command which uses the result of printf
In the 'other command' part, I may have a sort or something similar.
What exactly does -printf do in this case? Where does it print the file names before the part after | runs?
If I sort the filenames, for example, the pipeline will first sort them and then print them sorted on the monitor. But before that, how exactly does the part after | get the unsorted file names in order to sort them? Does -printf in this case give the filenames as input to the part after |, which then prints them sorted?
Sorry for my English :(
Your shell calls pipe(), which creates two file descriptors. Writing into one buffers data in the kernel, where it is available to be read from the other. Then it calls fork() to make a new process for the find command. In the child, after the fork(), it closes stdout (always fd 1) and uses dup2() to copy the write end of the pipe to stdout. Then it uses exec() to run find (replacing the copy of the shell in the subprocess with find). When find runs it just prints to stdout as normal, but it has inherited the pipe as its stdout from the shell. Meanwhile the shell is doing the same thing for other command..., with stdin, so that it is created with fd 0 connected to the read end of the pipe.
Yes, that is how pipes work. The output from the first process is the input to the second. In terms of implementation, the shell creates a pipe; the first process writes its standard output to it, and the second process reads its standard input from it.
... You should perhaps read an introduction to Unix shell programming if you have this type of question.

When the input is from a pipe, does STDIN.read run until EOF is reached?

Sorry if this is a naïve question, but let's say I have a Ruby program called processor.rb that begins with data = STDIN.read. If I invoke this program like this
cat textfile.txt | processor.rb
Does STDIN.read wait for cat to pipe the entire textfile.txt in? Or does it assign some indeterminate portion of textfile.txt to the data variable?
I'm asking this because I recently saw a strange bug in one of my programs that suggests that the latter is the case.
The read method will consume the entire stream, and it returns only when the process producing the output has closed its end of the pipe, which signals end-of-file. With cat's output, a subsequent call to read will return 0 bytes.
In simple terms, a process is allowed to append to its output at any time, which is what things like tail -f do, so you can't be sure that you have read all the data from STDIN without actually checking.
Your OS may implement cat or shell pipes slightly differently, though. I'm not familiar with what POSIX dictates for behavior here.
It is probably line-buffered, and reads until it encounters a newline or EOF.
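One way to observe the EOF behavior from the shell (the braces group two commands, so the pipe's write end stays open until both have finished):
{ echo one; sleep 1; echo two; } | cat   # cat prints "one", waits, prints "two",
                                         # and sees EOF only when the group exits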
