Can using cat create problems when passing the output to other commands? - bash

In bash, there are multiple ways to direct input and output. For example, these commands do the same thing:
sort <input_file >output_file
cat input_file | sort >output_file
Generally I'd prefer the second way, because I prefer my commands to read left-to-right.
But the answer to this question says:
"sort" can use temporary files to work with input files larger than memory
Which makes me wonder, when sorting a huge file, if cat would short-circuit that process.
Can using cat create problems when passing the output to other commands?

There is a term I throw around a lot called Useless Use of Cat (UUoC), and the 2nd option is exactly that. When a utility can take input on STDIN (such as sort), using redirection not only saves you an extra call to an external process such as cat, it also avoids the overhead of a pipeline.
Other than the extra process and pipeline, the only other "problem" that I see would be you would be subject to the pipeline buffering.
Update
Apparently, there is even a website dedicated to giving out a UUoC Award
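If you want to see the cost for yourself, a rough comparison like this (the file name is made up; timings vary by system and file size) makes the extra process and pipe visible:
time sort <big_file >/dev/null
time cat big_file | sort >/dev/null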

"I prefer my commands to read left-to-right"
<input_file sort >output_file
(The canonical way to write this is of course sort input_file >output_file.)

The sort command handles large files the same way regardless of whether the input arrives via a pipe on standard input, via I/O redirection, or from files named directly on the command line.
Note that you could (and probably should) write:
sort -o output_file input_file
That will work correctly even if the input and output files are the same (or if you have multiple input files, one of which is also the output file).
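For instance (hypothetical file names), this merges three files even though one of them is also the destination:
sort -o merged.txt merged.txt part2.txt part3.txt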
I see that SiegeX has already taken you to task for abusing cat -- feline abuse, as it is also known. I'll support his efforts. There are times when it is appropriate to use cat; there are fewer of them than is often recognized.
One example of appropriate use is with the tr command and multiple sources of data:
cat "$#" | tr ...
That is necessary because tr only reads its standard input and only writes to its standard output - the ultimate in 'pure filter' programs.
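For example (a sketch, assuming the script's arguments are all file names), this uppercases the contents of every file passed to a script:
cat "$@" | tr 'a-z' 'A-Z'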
The authors of Unix have also noted that the general-purpose 'cat inputs | command' construct is often used instead of the more specialized input redirection (citation missing - the books I'd need are not at hand).


multiple sed commands: when semicolon, when pipeline?

When I construct a complicated operation in sed, I often start with
cat infile | sed 'expression1' | sed 'expr2' ...
and then optimize that into
cat infile | sed 'expr1;expr2;expr3' | sed 'expr4' | sed 'expr5;expr6' ...
What guidelines are there for which expressions can be combined with semicolons into a single command?
So far, I just ad hoc combine s///'s, and don't combine //d's.
(The optimization is for running it tens of millions of times. Yes, it's measurably faster.)
(Posted here instead of on superuser.com, because that has 20x fewer questions about sed.)
The operation that you're carrying out is fundamentally different in each case.
When you "combine" sed commands using a pipe, the whole file is processed by every invocation of sed. This incurs the cost of launching a separate process for every part of your pipeline.
When you use a semicolon-separated list of commands, each command is applied in turn to every line in the file, using a single instance of sed.
Depending on the commands you're using, the output of these two things could be very different!
If you don't like using semicolons to separate commands, I would propose another option: use sed -e 'expr1' -e 'expr2' -e 'expr3' file. Alternatively, many tools including sed support -f to pass a file containing commands; there you can put each command on its own line instead of using semicolons, for clarity.
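For instance, these three invocations are equivalent (the expressions and file names are placeholders):
sed 's/foo/bar/; s/baz/qux/' file
sed -e 's/foo/bar/' -e 's/baz/qux/' file
sed -f commands.sed file   # commands.sed holds one command per line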
Generally, s and d can coexist peacefully. It's when the different commands interact with each other that you may have to break down and use separate scripts, or switch to a richer language with variables etc.
For example, a sed script which adds thousands separators to numbers which lack them should probably be kept completely separate from other processing. The modularity is probably more important than any possible efficiency gains in the long run.
What guidelines are there for which expressions can be combined with semicolons into a single command? So far, I just ad hoc combine s///'s, and don't combine //d's.
sed has many more commands than just s and d. If those are the only ones you're using, though, then you can join as many as you like in the same sed run. The result will be the same as for a pipeline of multiple single-command seds. If you're going to do that, however, then consider either using a command file, as @anubhava suggested, or giving each independent expression via its own -e argument; either one is much clearer than a single expression consisting of multiple semicolon-separated commands.
Even if you use other commands, for the most part you will get the same result from performing a sequence of commands via a single sed process as you do by performing the same commands in the same order via separate sed processes. The main exceptions I can think of involve commands that are necessarily dependent on one another, such as labels and branches; commands manipulating the hold space and those around them; commands grouped within braces ({}); and the p command under sed -n.
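A small illustration of the p under -n exception (not from the original answers):
printf 'a\n' | sed -n 'p' | sed -n 'p'   # prints "a" once
printf 'a\n' | sed -n 'p;p'              # prints "a" twice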
With that said, sed programs get very cryptic very fast. If you're writing a complicated transformation then consider carefully taking @EdMorton's advice and writing the whole thing as a (single) awk program instead of one or several sed programs.
Better to avoid multiple seds with
sed -f mycmd.awk
where mycmd.awk lists each sed command on a separate line.
As per man sed:
-f command_file
Append the editing commands found in the file command_file to the list of commands. The editing commands should each be listed on a separate line.
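A minimal sketch (the commands themselves are made up):
$ cat mycmd.awk
s/foo/bar/
/baz/d
$ sed -f mycmd.awk infile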

Do I need to generate a second file to sort a file?

I want to sort a bunch of files. I can do
sort file.txt > foo.txt
mv foo.txt file.txt
but do I need this second file?
(I tried sort file.txt > file.txt of course, but then I just ended up with an empty file.)
Try:
sort -o file.txt file.txt
See http://ss64.com/bash/sort.html
`-o OUTPUT-FILE'
Write output to OUTPUT-FILE instead of standard output. If
OUTPUT-FILE is one of the input files, `sort' copies it to a
temporary file before sorting and writing the output to
OUTPUT-FILE.
Part of the philosophy of classic Unix tools like sort is that you can build a pipeline with them. Every little tool reads from STDIN and writes to STDOUT, so the next little tool down the pipe can read the output of the first as its input and act on it.
So I'd say that this is a bug and not a feature.
Please also read about Pipes, Redirection, and Filters in the very nice book by ESR.
Because you're writing back to the same file, you'll always run into the problem of the redirection truncating the output file before sort is done reading the original. So yes, you need to use a separate file.
Now, having said that, there are ways to buffer the whole file into the pipe stream first, but generally you wouldn't want to do that, although it is possible if you insert special tools at the beginning and end of the pipeline to do the buffering. Bash itself, however, will open the output file too soon if you use its > redirect.
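One such ready-made buffering tool is sponge from the moreutils package (assuming you have it installed); it soaks up all of its input before opening the output file:
sort file.txt | sponge file.txt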
Yes, you do need a second file! The command
sort file.txt > file.txt
causes bash to set up the redirection of stdout before it starts executing sort. This is a certain way to clobber your input file.
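You can watch the clobbering happen with a throwaway file:
$ printf '3\n1\n2\n' > file.txt
$ sort file.txt > file.txt   # bash truncates file.txt before sort reads it
$ cat file.txt               # empty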
If you want to sort many files try :
cat *.txt | sort > result.txt
If you are dealing with sorting fixed-length records in a single file, then the sort algorithm can swap records within the file itself. There are a few algorithms available; your choice depends on the file's randomness properties. Generally, quicksort tends to swap the fewest records and is usually the sort that completes first, compared to other sorting algorithms.

How does find and printf work when using pipes in bash scripting

Suppose I use the printf in the find command like this:
find ./folder -printf "%f\n" | other command which uses the result of printf
In the "other command" part, I might have a sort or something similar.
What exactly does -printf do in this case? Where does it print the file names before the part after the "|" runs?
If I sort the file names, for example, they are printed to the monitor sorted; but how does the part after the | get the unsorted file names in order to sort them? Does -printf give the file names as input to the part after the |, which then prints them sorted?
Your shell calls pipe() which creates two file descriptors. Writing into one buffers data in the kernel which is available to be read by the other. Then it calls fork() to make a new process for the find command. After the fork() it closes stdout (always fd 1) and uses dup2() to copy one end of the pipe to stdout. Then it uses exec() to run find (replacing the copy of the shell in the subprocess with find). When find runs it just prints to stdout as normal, but it has inherited it from the shell which made it the pipe. Meanwhile the shell is doing the same thing for other command... with stdin so that it is created with fd 0 connected to the other end of the pipe.
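On Linux you can observe this wiring from inside a pipeline (a rough illustration; the /proc paths are Linux-specific):
ls -l /proc/self/fd/1         # fd 1 points at your terminal
ls -l /proc/self/fd/1 | cat   # fd 1 now points at 'pipe:[...]'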
Yes, that is how pipes work. The output from the first process is the input to the second. In terms of implementation, the shell creates a pipe; the first process's standard output is connected to its write end, and the second process's standard input to its read end.
... You should perhaps read an introduction to Unix shell programming if you have this type of question.

why does redirect (<) not create a subshell

I wrote the following code
var=0
cat $file | while read line; do
var=$line
done
echo $var
Now, as I understand it, the pipe (|) will cause a subshell to be created, and therefore the variable var assigned on line 1 will still have that value on the last line.
However this will solve it:
var=0
while read line; do
var=$line
done < $file
echo $var
My question is why does the redirect not cause a subshell to be created, or if you like why does pipe cause one to be created?
Thanks
The cat command is an external command, which means it needs its own process and has its own STDIN and STDOUT. You're basically taking the STDOUT produced by the cat command and redirecting it into the while loop; and because each segment of a pipeline runs in its own subshell, any variables assigned inside the loop are lost when that subshell exits.
When you use redirection, you're not using a separate process. Instead, you're merely redirecting the STDIN of the while loop from the console to the lines of the file, and the loop runs in the current shell.
Needless to say, the second way is more efficient. In the old Usenet days, before all of you little whippersnappers got ahold of our Internet (Hey you kids! Get off of my Internet!) and destroyed it with your fancy graphics and all them web pages, some people used to give out the Useless Use of Cat award to people who posted to the comp.unix.shell group with a spurious cat command, because the use of cat is almost never necessary and is usually less efficient.
If you're using a cat in your code, you probably don't need it. The cat command comes from concatenate and is supposed to be used only to concatenate files together. For example, back when we used SneakerNet on 800K floppies, we would have to split up long files with the Unix split command and then use cat to merge them back together.
A pipe is there to hook the stdout of one program to the stdin of another one. Two processes, possibly two shells. When you do redirection (> and <), all you're doing is remapping stdin (or stdout) to a file. Reading/writing a file can be done without another process or shell.
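As an aside not in the original answers: bash 4.2 and later have a lastpipe option that runs the final segment of a pipeline in the current shell instead of a subshell (it only takes effect when job control is off, as in a non-interactive script), so the pipe version keeps the variable too:
#!/bin/bash
shopt -s lastpipe   # bash 4.2+; works here because scripts run without job control
file=input.txt      # hypothetical input file
var=0
cat "$file" | while read -r line; do
    var=$line
done
echo "$var"   # now prints the last line of input.txt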

Diff output from two programs without temporary files

Say I have two programs a and b that I can run with ./a and ./b.
Is it possible to diff their outputs without first writing to temporary files?
Use <(command) to pass one command's output to another program as if it were a file name. Bash connects the program's output to a pipe and passes a file name like /dev/fd/63 to the outer command.
diff <(./a) <(./b)
Similarly you can use >(command) if you want to pipe something into a command.
This is called "Process Substitution" in Bash's man page.
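For example (a sketch; the file names are made up), >(command) lets tee feed a copy of a stream to another command while the pipeline continues:
./a | tee >(gzip -c > a_output.gz) | less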
Adding to both the answers, if you want to see a side by side comparison, use vimdiff:
vimdiff <(./a) <(./b)
One option would be to use named pipes (FIFOs):
mkfifo a_fifo b_fifo
./a > a_fifo &
./b > b_fifo &
diff a_fifo b_fifo
... but John Kugelman's solution is much cleaner.
For anyone curious, this is how you perform process substitution in the Fish shell:
Bash:
diff <(./a) <(./b)
Fish:
diff (./a | psub) (./b | psub)
Unfortunately the implementation in fish is currently deficient; fish will either hang or use a temporary file on disk. You also cannot use psub for output from your command.
Adding a little more to the already good answers (helped me!):
The docker command outputs its help to STDERR (i.e. file descriptor 2)
I wanted to see if docker attach and docker attach --help gave the same output
$ docker attach
$ docker attach --help
Having just typed those two commands, I did the following:
$ diff <(!-2 2>&1) <(!! 2>&1)
!! is the same as !-1, which means run the command one before this one - the last command
!-2 means run the command two before this one
2>&1 means send file descriptor 2 output (STDERR) to the same place as file descriptor 1 output (STDOUT)
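Without the history expansion, that is equivalent to:
$ diff <(docker attach 2>&1) <(docker attach --help 2>&1)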
Hope this has been of some use.
For zsh, using =(command) automatically creates a temporary file and replaces =(command) with the path of the file itself. (With ordinary command substitution, $(command) is replaced with the output of the command.)
This zsh feature is very useful and can be used like so to compare the output of two commands using a diff tool, for example Beyond Compare:
bcomp =(ulimit -Sa | sort) =(ulimit -Ha | sort)
For Beyond Compare, note that you must use bcomp for the above (instead of bcompare) since bcomp launches the comparison and waits for it to complete. If you use bcompare, that launches comparison and immediately exits due to which the temporary files created to store the output of the commands disappear.
Read more here: http://zsh.sourceforge.net/Intro/intro_7.html
Also notice this:
Note that the shell creates a temporary file, and deletes it when the command is finished.
and the following, which explains the difference between <(...) and =(...):
If you read zsh's man page, you may notice that <(...) is another form of process substitution which is similar to =(...). There is an important difference between the two. In the <(...) case, the shell creates a named pipe (FIFO) instead of a file. This is better, since it does not fill up the file system; but it does not work in all cases. In fact, if we had replaced =(...) with <(...) in the examples above, all of them would have stopped working except for fgrep -f <(...). You can not edit a pipe, or open it as a mail folder; fgrep, however, has no problem with reading a list of words from a pipe. You may wonder why diff <(foo) bar doesn't work, since foo | diff - bar works; this is because diff creates a temporary file if it notices that one of its arguments is -, and then copies its standard input to the temporary file.
