Process substitution into grep missing expected outputs

Process substitution into grep missing expected outputs - bash

Let’s say I have a program which outputs:
abcd
l33t
1234
which I will simulate with printf 'abcd\nl33t\n1234\n'. I would like to give this output to two programs at the same time. My idea would be to use process substitution with tee. Let’s say I want to give a copy of the output to grep:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' >&2) | grep '[0-9]'
I get the following with Bash 4.1.2 (Linux, CentOS 6.5), which is fine:
l33t
1234
abcd
l33t
But if the process substitution is not redirected to stderr (i.e. without >&2), like this:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]') | grep '[0-9]'
Then I get:
l33t
1234
l33t
It’s like the stdout from process substitution (the first grep) is used by the process after the pipe (the second grep). Except the second grep is already reading things by itself, so I guess it's not supposed to take into account things from the first grep. Unless I’m mistaken (which I surely am).
What am I missing?

Explanation
As far as the command line is concerned, process substitution is just a way of making a special filename. (See also the docs.) So the second pipeline actually looks like:
printf 'abcd\nl33t\n1234\n' | tee /dev/fd/nn | grep '[0-9]'
where nn is some file-descriptor number. The full output of printf goes to /dev/fd/nn, and also goes to the grep '[0-9]'. Therefore, only the numerical values are printed.
As for the process inside the >(), it inherits the stdout of its parent. In this case, that stdout is inside the pipe. Therefore, the output of grep '[a-z]' goes through the pipeline just like the standard output of tee does. As a result, the pipeline as a whole only passes lines that include numbers.
When you write to stderr instead (>&2), you are bypassing the last pipeline stage. Therefore, the output of grep '[a-z]' on stderr goes to the terminal.
A fix
To fix this without using stderr, you can use another alias for your screen. E.g.:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' >/dev/tty ) | grep '[0-9]'
# ^^^^^^^^^
which gives me the output
l33t
1234
abcd
l33t
Testing this
To sort this out, I ran echo >(ps). The ps process was a child of the bash process running the pipeline.
I also ran
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]')
without the | grep '[0-9]' at the end. On my system, I see
abcd <--- the output of the tee
l33t ditto
1234 ditto
abcd <-- the output of the grep '[a-z]'
l33t ditto
All five lines go into the grep '[0-9]'.

After the tee you have two streams of
abcd
l33t
1234
The 1st grep (>(grep '[a-z]' >&2) filters out the
abcd
l33t
and prints the result to its(!!!) stderr - which is still connected to your terminal...
So, another simple demo:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' >&2) | grep '[0-9]'
this prints
l33t
1234
abcd
l33t
now add the wc -l
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' >&2) | grep '[0-9]' | wc -l
and you will get
abcd
l33t
2
where you clearly can see: the
abcd
l33t
is the stderr of the 1st grep but the 2nd grep's stdout is redirected to the wc and prints the
2
Now another test:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' ) | cat -
output
abcd
l33t
1234
abcd
l33t
e.g. the two lines from the grep and the full input from the print
count:
printf 'abcd\nl33t\n1234\n' | tee >(grep '[a-z]' ) | cat - | wc -l
output
5

Related

Why doesn't this sed command put a newline

I have a file, ciao.py thas has only one line in it: print("ciao")
I want to do this: I want to do that via pipe stream, and als, if I do cat ciao.py | sed 's/.*/&\n&/' it would work, but I want to do this in two separated parts, simulating the case where I want to print it and then pass that to further commands.
If I do this:
cat ciao.py | sed 's/.*/&\n/' |tee >(xargs echo) | xargs echo
it does not work. It prints print("ciao") print("ciao") in the same line. I don't understand why, since I am putting \n with sed.

I'd guess print cia is appearing twice on the same line because xargs is calling echo with multiple strings since xargs calls the command you provide it with groups of input lines at a time by default.
Is this what you're trying to do?
$ cat ciao.py | sed 's/.*/&\n/' |tee >(xargs -n 1 echo) | xargs -n 1 echo
print(ciao)
print(ciao)
or:
$ cat ciao.py | sed 's/.*/&\n/' |tee >(cat) | xargs -n 1 echo
print(ciao)
print(ciao)
There are, of course, better ways to get that output from that input, e.g.:
$ sed 'p' ciao.py
print("ciao")
print("ciao")

The flow of stdout from combined commands

I need to edit a bash script that sorts .vcf files. vcf files are roughly structured as shown below:
## header line
## header line
…
Data line
Data line
…
The script is called vcfsort and is part of a library for manipulating vcf files. It looks like this:
head -1000 $1 | grep "^#"; cat $# | grep -v "^#" | sort -k1,1d -k2,2n
And it is run by writing vcfsort input.vcf > output.vcf.
I understand roughly what it does: since sorting should only be done on the data lines, it gets the header lines:
head -1000 $1 | grep "^#";
And combines it with sorted data lines:
cat $# | grep -v "^#" | sort -k1,1d -k2,2n
I need the head command to read more lines. Instead of calling vcfsort like above, I thought I could just edit the script myself and write it out directly as a command like this:
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
This does not work as expected. My attempt above writes the correct output to stdout, if I leave out > output.vcf. However, if I include it, only the data lines are written to file and the header lines are written to stdout. So, I have a couple of questions:
In this stack overflow answer, it is said that to combine
semicolon-separated commands, they should be enclosed in parentheses. Why is that not the case in the vcfsort script?
Why is $# used in the cat command instead of $1? $# should refer to all of a shell scripts arguments, but since only one is given (the input file), why not just use $1? If there is a reason for this, how can I transfer that to my command line expression?
Why do I only get part of the stdout when I send it to a file?
Could you show me the edits I need to make to get my command to work as intended?

So the script gets first 1000 lines of first file!
Separates header, and basically just copy all comments in those first 1000 lines to output.
Next, it filters all comments lines (leaving only data lines) for all files, and does sorting.
so if you use
vcfsort file1 file2 file3
$1 = "file1" and header from file1 only will be presented in output.
while $# referring to all files: "file1 file2 file3"
if you need to get headers from all files and merge it - I would recommend to use loop.
for file in $#; do
head -1000 $file | grep "^#";
done
cat $# | grep -v "^#" | sort -k1,1d -k2,2n
Why do I only get part of the stdout when I send it to a file?
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
Each command executing separatelly (divided by semicolon ";"). So in example above you just redirecting data lines output after sorting. It doesn't redirect to file header part.
I would recommend to delete redirecting to file and just use:
vcfsort input.vcf > output.vcf
This does not work as expected
May I know what was expected?

There are two command lists, separated by a ;, inside vcfsort:
head -1000 $1 | grep "^#"
cat $# | grep -v "^#" | sort -k1,1d -k2,2n
Each list is a single pipeline. The final two commands in each pipeline inherit their standard output from vcfsort, so that when you run
vcfsort input.vcf > output.vcf
both grep and sort write to output.vcf.
The equivalent using braces would be (replacing ; with a newline for readability)
# Quoting the parameter expansions is important, to protect
# against word-splitting and pathname expansion of the original arguments.
{ head -1000 "$1" | grep "^#"
cat "$#" | grep -v "^#" | sort -k1,1d -k2,2n
} > output.vcf
Output redirections apply only to a single command, not a command list. Here, a command group serves as that single command:
the standard output of the command group is output.vcf, and the two lists in the group inherit that just as before.
Your attempt
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
only opened output.vcf to use as the standard output for sort; the standard output of grep remains whatever standard output it inherits from its parent, namely your terminal.

How can I send the out of the last pipe to two different commands?

So, I have text file with a a bunch of numbers, one number per line to be specific, so I do:-
cat filename.txt|sort -n|head -1 to get the top number and I can do cat filename.txt|sort -n|tail -1 to get the bottom number.
Just to be sure is there a way to send cat filename.txt|sort -n| and its output to two different commands in one line and have the out put (the highest number and the lowest number next to each other)

You can do interesting things with tee and process substitutions, but the order of the output may not be stable (due to timing of processes)
sort -n filename.txt | tee >(tail -1 >/dev/tty) | head -1
In this case, I'd use sed to print the first and last line:
sort -n filename.txt | sed -n '1p; $p'
As #chepner suggests
... | sed -n '1p; $p' | paste - - # tab separated
or
... | awk 'NR == 1 {first = $0} END {print first, $0}' # space separated

There is a useful command tee syntax tee second.txt will output to second.txt
You can combine that with bash executive pipes eg tee >(wc),
So you can do 2 or more commands by eg tee >(wc) | head

Need help writing this specific bash script

Construct the pipe to execute the following job.
"Output of ls should be displayed on the screen and from this output the lines
containing the word ‘poem’ should be counted and the count should be
stored in a file.”

If bash is allowed, use a process substitution as the receiver for tee
ls | tee >( grep -c poem > number.of.poetry.files)

Your attempt was close:
ls | tee /dev/tty | grep poem | wc -l >number_of_poems
The tee /dev/tty copies all ls output to the terminal. This satisfies the requirement that "Output of ls should be displayed on the screen." while also sending ls's output to grep's stdin.
This can be further simplified:
ls | tee /dev/tty | grep -c poem >number_of_poems
Note that neither of these solutions require bash. Both will work with lesser shells and, in particular, with dash which is the default /bin/sh under debian-like systems.

This sounds like a homework assignment :)
#!/bin/bash
ls
ls -l | grep -c poem >> file.txt
The first ls will display the output on the screen
The next line uses a series of pipes to output the number of files/directories containing "poem"
If there were 5 files with poem in them, file.txt would read 5. If file.txt already exists, the new count will be appended to the end. If you want overwrite file each time, change the line to read ls -l | grep -c poem > file.txt

Getting head to display all but the last line of a file: command substitution and standard I/O redirection

I have been trying to get the head utility to display all but the last line of standard input. The actual code that I needed is something along the lines of cat myfile.txt | head -n $(($(wc -l)-1)). But that didn't work. I'm doing this on Darwin/OS X which doesn't have the nice semantics of head -n -1 that would have gotten me similar output.
None of these variations work either.
cat myfile.txt | head -n $(wc -l | sed -E -e 's/\s//g')
echo "hello" | head -n $(wc -l | sed -E -e 's/\s//g')
I tested out more variations and in particular found this to work:
cat <<EOF | echo $(($(wc -l)-1))
>Hola
>Raul
>Como Esta
>Bueno?
>EOF
3
Here's something simpler that also works.
echo "hello world" | echo $(($(wc -w)+10))
This one understandably gives me an illegal line count error. But it at least tells me that the head program is not consuming the standard input before passing stuff on to the subshell/command substitution, a remote possibility, but one that I wanted to rule out anyway.
echo "hello" | head -n $(cat && echo 1)
What explains the behavior of head and wc and their interaction through subshells here? Thanks for your help.

head -n -1 will give you all except the last line of its input.

head is the wrong tool. If you want to see all but the last line, use:
sed \$d
The reason that
# Sample of incorrect code:
echo "hello" | head -n $(wc -l | sed -E -e 's/\s//g')
fails is that wc consumes all of the input and there is nothing left for head to see. wc inherits its stdin from the subshell in which it is running, which is reading from the output of the echo. Once it consumes the input, it returns and then head tries to read the data...but it is all gone. If you want to read the input twice, the data will have to be saved somewhere.

Using sed:
sed '$d' filename
will delete the last line of the file.
$ seq 1 10 | sed '$d'
1
2
3
4
5
6
7
8
9

For Mac OS X specifically, I found an answer from a comment to this Q&A.
Assuming you are using Homebrew, run brew install coreutils then use the ghead command:
cat myfile.txt | ghead -n -1
Or, equivalently:
ghead -n -1 myfile.txt
Lastly, see brew info coreutils if you'd like to use the commands without the g prefix (e.g., head instead of ghead).

cat myfile.txt | echo $(($(wc -l)-1))
This works. It's overly complicated: you could just write echo $(($(wc -l)-1)) <myfile.txt or echo $(($(wc -l <myfile.txt)-1)). The problem is the way you're using it.
cat myfile.txt | head -n $(wc -l | sed -E -e 's/\s//g')
wc consumes all the input as it's counting the lines. So there is no data left to read in the pipe by the time head is started.
If your input comes from a file, you can redirect both wc and head from that file.
head -n $(($(wc -l <myfile.txt) - 1)) <myfile.txt
If your data may come from a pipe, you need to duplicate it. The usual tool to duplicate a stream is tee, but that isn't enough here, because the two outputs from tee are produced at the same rate, whereas here wc needs to fully consume its output before head can start. So instead, you'll need to use a single tool that can detect the last line, which is a more efficient approach anyway.
Conveniently, sed offers a way of matching the last line. Either printing all lines but the last, or suppressing the last output line, will work:
sed -n '$! p'
sed '$ d'

Here is a one-liner that can get you the desired output, and it can be used more generally for getting all lines from a file except the last n lines.
grep -n "" myfile.txt \ # output the line number for each line
| sort -nr \ # reverse the file by using those line numbers
| sed '1,4d' \ # delete first 4 lines (last 4 of the original file)
| sort -n \ # reverse the reversed file (correct the line order)
| sed 's/^[0-9]*://' # remove the added line numbers
Here is the above command in an actual single line and runnable (can't execute the above due to the added comments):
grep -n "" myfile.txt | sort -nr | sed '1,4d' | sort -n | sed 's/^[0-9]*://'
It's a little cumbersome, and this problem can be solved with more comprehensive commands like ghead, but when you can't or don't want to download such tools, it's nice to be able to do this with the more basic options. I've been in situations where it's simply not an option to get better tools.

awk 'NR>1{print p}{p=$0}'
For this job, an awk one-liner is a bit longer than a sed one.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio