I want to pipe both script and data over stdin to a shell and its child process. Basically, I'm starting a shell and sending a string like "exec wc -c;\n<data>" to the shell over stdin.
I'm looking to start a sub-process and pass data to said sub-process.
Using exec I fully expect wc -c to replace my shell and count the number of bytes sent over stdin.
Examples:
echo -ne 'exec wc -c;\nabc' | dash
echo -ne 'exec wc -c;\nabc' | bash
echo -ne 'exec wc -c;\nabc' | bash --posix
echo -ne 'exec wc -c;\nabc' | busybox sh
It seems to work consistently with bash, but not with dash or busybox sh. They both seem to fail intermittently. If I sleep for 100ms before sending <data> over stdin, then it works. But sleeping is not a reliable fix.
In practice I'm sending a non-trivial amount of data, so I don't want to encode it or somehow store it in memory. Any ideas?
Note: I'm sure there are many use cases where one could work around this. But I'm looking to find out why this doesn't work consistently. And/or what shell magic I could do to make it work reliably, preferably across platforms :)
Update: to clarify I have a remote system where I run sh and expose stdin/stdout/stderr, I would like to prove that such a setup can do anything. By sending "exec cat - > /myfile;\n<data>" one could imagine I could stream in /myfile so that it contains <data>. Again imagine <data> being large.
Essentially, I'm looking to prove that a system can be controlled using just stdin for sh. Something else very simple instead of sh that is readily available across platforms as a static binary might also work.
I'll admit this might be totally crazy and that I should employ a protocol like SSH or something, but then I would likely have to implement that.
This appears to work for me with bash, but not at all with dash or busybox sh. I'm surprised it ever works at all. I would expect the shell to read a big chunk of input from its stdin, including both the command on the first line and the data after it, before processing anything. That would leave nothing remaining on the exec'ed command's stdin.
In practice, bash seems to read 1 byte at a time until it sees a newline, and then immediately execs, so everything is good.
dash and busybox sh do exactly what I expected, reading the whole input first.
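You can watch the difference by tracing the shells' read() calls (a sketch, assuming strace is available; the grep just keeps the reads from fd 0):
echo -ne 'exec wc -c;\nabc' | strace -f -e trace=read dash 2>&1 | grep 'read(0'
echo -ne 'exec wc -c;\nabc' | strace -f -e trace=read bash 2>&1 | grep 'read(0'
dash shows one large read that swallows the data along with the command, while bash shows a run of single-byte reads up to the newline before it execs.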
Can you do
echo -ne "exec wc -c <<'END'\nabc" | sh
instead? Possibly replacing END with END_adjfioe38999f3jf_END if you're worried about it appearing in the input?
Update: though come to think of it, that fails your "don't buffer in memory" criterion. Err... ok, that's hard. If you can guarantee the command to run fits within 4k, then something horrible like this would "work":
echo -ne "exec wc -c\nabc" | perl -e 'sysread(STDIN, $buf, 4096); $buf =~ s/([^\n]*)\n//; open(my $pipe, "|-", $1); syswrite($pipe, $buf); open(STDOUT, ">&", $pipe); exec("cat")'
but that's hardly the portable shell-based solution you were looking for. (It reads a 4k chunk of input, grabs the command from it, forks and spawns off a shell running that command connected by a pipe, sends the rest of the initial chunk down the pipe, dups its own stdout to the pipe, and then execs cat to copy its stdin a piece at a time to the pipe, which will receive it on its stdin. I think that's what you want to end up happening, just with simpler shell-based syntax.)
Related
I am trying to write a script to slice a 13 GB file into smaller parts to launch a split computation on a cluster. What I wrote so far works in the terminal if I copy and paste it, but stops at the first cycle of the for loop.
set -ueo pipefail
NODES=8
READS=0days_rep2.fasta
Ntot=$(cat $READS | grep 'read' | wc -l)
Ndiv=$(($Ntot/$NODES))
for i in $(seq 0 $NODES)
do
echo $i
start_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+1)) | tail -n 1)
echo ${start_read}
end_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+$Ndiv)) | tail -n 1)
echo ${end_read}
done
If I run the script:
(base) [andrea@andrea-xps data]$ bash cluster.sh
0
>baa12ba1-4dc2-4fae-a989-c5817d5e487a runid=314af0bb142c280148f1ff034cc5b458c7575ff1 sampleid=0days_rep2 read=280855 ch=289 start_time=2019-10-26T02:42:02Z
(base) [andrea@andrea-xps data]$
it seems to stop abruptly after the command "echo ${start_read}" without raising any sort of error. If I copy and paste the script in terminal it runs without problems.
I am using Manjaro linux.
Andrea
The problem:
The problem here (as @Jens suggested in a comment) has to do with the use of the -e and pipefail options; -e makes the shell exit immediately if any simple command gets an error, and pipefail makes a pipeline fail if any command in it fails.
But what's failing? Take a look at the command here:
start_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+1)) | tail -n 1)
Which, clearly, runs the cat, grep, head, and tail commands in a pipeline (which runs in a subshell so the output can be captured and put in the start_read variable). So cat starts up and begins reading from the file, shoving it down the pipe to grep. grep reads that, picks out the lines containing 'read', and feeds them on toward head. head reads the first line of that (note that on the first pass i is 0, so Ndiv*i+1 is 1 and it's running head -n 1) from its input, feeds it on toward the tail command, and then exits. tail passes on the one line it got, then exits as well.
The problem is that when head exited, it hadn't read everything grep had to give it; that left grep trying to shove data into a pipe with nothing on the other end, so the system sent it a SIGPIPE signal to tell it that wasn't going to work, and that caused grep to exit with an error status. And since grep exited, cat was similarly left stuffing data into an orphaned pipe, so it got a SIGPIPE as well and also exited with an error status.
Since both cat and grep exited with errors, and pipefail is set, that subshell will also exit with an error status, which means the parent shell considers the whole assignment command to have failed, and aborts the script on the spot.
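You can reproduce the same mechanism in isolation with a tiny pipeline (a sketch; 141 is 128 + SIGPIPE, though the exact status can vary by system):
set -o pipefail
seq 1 1000000 | head -n 1 >/dev/null
echo "pipeline exit status: $?"   # typically 141, even though head itself succeeded
With set -e also in effect, that nonzero status is what aborts the script on the spot.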
Solutions:
So, one possible solution is to remove the -e option from the set command. -e is kind of janky in what it considers an exit-worthy error and what it doesn't, so I don't generally like it anyway (see BashFAQ #105 for details).
Another problem with -e is that (as we've seen here) it doesn't give much of any indication of what went wrong, or even that something went wrong! Error checking is important, but so's error reporting.
(Note: the danger in removing -e is that your script might get a serious error partway through... and then blindly keep running, in a situation that doesn't make sense, possibly damaging things in the process. So you should think about what might go wrong as the script runs, and add manual error checking as needed. I'll add some examples to my script suggestion below.)
Anyway, removing -e is just papering over the fact that this isn't a really good approach to the problem. You're reading (or trying to read) the entire file multiple times, and processing it through multiple commands each time. You really should only be reading through the thing twice: once to figure out how many reads there are, and once to break it into chunks. You might be able to write a program to do the splitting in awk, but most unix-like systems already have a program specifically for this task: split. There's also no need for cat everywhere, since the other commands are perfectly capable of reading directly from files (again, @Jens pointed this out in a comment).
So I think something like this would work:
#!/bin/bash
set -uo pipefail # I removed the -e 'cause I don't trust it
nodes=8 # Note: lower- or mixed-case variables are safer to avoid conflicts
reads=0days_rep2.fasta
splitprefix=0days_split_
Ntot=$(grep -c 'read' "$reads") || { # grep can both read & count in a single step
# The || means this'll run if there was an error in that command.
# A normal thing to do is print an error message to stderr
# (with >&2), then exit the script with a nonzero (error) status
echo "$0: Error counting reads in $reads" >&2
exit 1
}
Ndiv=$((($Ntot+$nodes-1)/$nodes)) # Force it to round *up*, not down
grep 'read' "$reads" | split -l $Ndiv -a1 - "$splitprefix" || {
echo "$0: Error splitting fasta file" >&2
exit 1
}
This'll create files named "0days_split_a" through "0days_split_h". If you have the GNU version of split, you could add its -d option (use numeric suffixes instead of letters) and/or --additional-suffix=.fasta (to add the .fasta extension to the split files).
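For instance, with GNU split (an assumption; check split --version) that last pipeline could become:
grep 'read' "$reads" | split -l "$Ndiv" -d -a1 --additional-suffix=.fasta - "$splitprefix"
which would produce 0days_split_0.fasta through 0days_split_7.fasta.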
Another note: if only a small fraction of that big file consists of 'read' lines, it might be faster to run grep 'read' "$reads" >sometempfile first, and then run the rest of the script on the temp file, so you don't have to read & filter it twice. But if most of the file is 'read' lines, this won't help much.
Alright, we have found the troublemaker: set -e in combination with set -o pipefail.
Gordon Davisson's answer provides all the details. I provide this answer for the sole purpose of reaping an upvote for my debugging efforts in the comments to your answer :-)
I have a few commands I run between braces, which I then redirect to a named pipe and tail the pipe. However, it looks like the redirection happens only after the block has finished executing: I don't see any output from the tail command for a while, and when I do, it only shows the last command's output. Any ideas how to view the output of the block in real time?
Example Script
#!/usr/bin/env bash
mkfifo /tmp/why_you_no_out;
trap "rm /tmp/why_you_no_out" 0;
{
for ((i=1;i<=100;i++)); do
printf "$i";
done
sleep 10s;
printf "\n12356";
} >> /tmp/why_you_no_out &
printf "here";
tail -n 1 -f /tmp/why_you_no_out
Sounds like the issue is buffering. Most shells don't want to write data a byte at a time because it's wasteful. Instead, they wait until they have a sizable chunk of data before committing it unless the output is connected to your terminal.
If you're looking to unbuffer the output of an arbitrary command, you may find the "unbuffer" utility helpful or any of the solutions mentioned in this question: How to make output of any shell command unbuffered?
If you're dealing with specific applications, they may have options to reduce buffering. For example, GNU's grep includes the --line-buffered option.
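For instance, either of these keeps the FIFO fed in (near) real time, assuming GNU coreutils' stdbuf or the expect package's unbuffer is installed; ./slow_producer is just a placeholder for your command, and note that stdbuf only affects external commands that use C stdio, not shell builtins:
stdbuf -oL ./slow_producer >> /tmp/why_you_no_out &   # force line-buffered stdout
unbuffer ./slow_producer >> /tmp/why_you_no_out &     # or: run it under a pseudo-terminal
tail -n 1 -f /tmp/why_you_no_out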
Let me present my findings first and put my questions at the end. (1) applies to zsh only and (2), (3) apply to both zsh and bash.
1. stdin of command substitution
ls | echo $(cat)
ls | { echo $(cat) }
The first one prints cat: -: Input/output error; while the second one produces the output of ls.
2. chained commands after pipe
ls | { head -n1; cat}
ls | { read a; cat}
The first command doesn't work properly: cat encounters EOF and exits immediately. But the second form works: the first line is read into a and cat gets the rest.
3. mixed stdin
ls | { python -c 'import sys; print(sys.argv)' $(head -n1) }
ls | { python -c 'import sys; print(sys.argv); print(input())' $(head -n1) }
Inside the {} in the first line, the command just prints its command-line arguments; in the second form, it also reads a line from stdin.
The first command runs successfully, while the second form throws an exception because input() hits EOF.
My questions are:
(as in section 1) What is the difference between the form with {} and the form without?
(as in section 2) Is it possible for head and cat to read the same stdin sequentially? Why does the second form succeed while the first form fails?
(as in section 3) How is the stdin of the command inside a command substitution connected to the stdin of the original command (echo here)? Which one reads first? And how can stdin be kept open so that both commands (python and head) can read the same stdin sequentially?
You are not taking input buffering into account and it explains most of your observations.
head reads several kilobytes of input each time it needs data, which makes it much more efficient. So it is likely that it will read all of stdin before any other process has a chance to. That's obvious in case 2, where the execution order is perhaps clearer.
If input were coming from a regular file, head could seek back to the end of the lines it used before terminating. But since a pipe is not seekable, it cannot do that. If you use "here-strings" -- the <<< syntax, then stdin will turn out to be seekable because here-strings are implemented using a temporary file. I don't know if you can rely on that fact, though.
read does not buffer input, at least not beyond the current line (and even then, only if it has no other line end delimiter specified on the command line). It carefully only reads what it needs precisely because it is generally used in a context where its input comes from a pipe and seeking wouldn't be possible. That's extremely useful -- so much so that the fact that it works is almost invisible -- but it's also one of the reasons shell scripting can be painfully slow.
You can see this more clearly by sending enough data into the pipe to satisfy head's initial read. Try this, for example:
seq 1 10000 | { head -n1; head -n2; }
(I changed the second head to head -n2 because the first head happens to leave stdin positioned exactly at the end of a line, so that the second head sees a blank line as the first line.)
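That asymmetry is also why the usual idiom for peeling off the first line of a pipe and handing the rest to another command is built on read rather than head (a minimal sketch):
ls | { IFS= read -r first; printf 'first: %s\n' "$first"; cat; }
Because read stops at the newline, everything after the first line is still sitting in the pipe when cat starts.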
The other thing you need to understand is what command substitution does, and when it does it. Command substitution reads the entire output of a command and inserts it into the command line. That happens even before the command has been identified, never mind started execution.
Consider the following little snippet:
$(printf %cc%co e h) hello, world
It should be clear from that that the command substitution is fully performed before the echo utility (or builtin) is started.
Your first scenario triggers an oddity of zsh which is explained by Stéphane Chazelas in this answer on Unix.SE. Effectively, zsh does the command substitution before the pipeline is set up, so cat is reading from the main zsh's standard input. (Stéphane explains why this is and how it leads to an EIO error. Although I think it is dependent on the precise zsh configuration and option settings, since on my default zsh install, it just locks up my terminal. At some point I'll have to figure out why.) If you use braces, then the redirection is set up before the command substitution is performed.
In the use-case of having the output of a singular command being consumed by only one other, is it better to use | (pipelines) or <() (process substitution)?
Better is, of course, subjective. For my specific use case I am after performance as the primary driver, but also interested in robustness.
I already know about the benefits of while read; do ...; done < <(cmd) and have switched over to it.
I have several var=$(cmd1|cmd2) instances that I suspect might be better replaced as var=$(cmd2 < <(cmd1)).
I would like to know what specific benefits the latter case brings over the former.
tl;dr: Use pipes, unless you have a convincing reason not to.
Piping and redirecting stdin from a process substitution is essentially the same thing: both will result in two processes connected by an anonymous pipe.
There are three practical differences:
1. Bash defaults to creating a fork for every stage in a pipeline.
Which is why you started looking into this in the first place:
#!/bin/bash
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"
This script won't work by default with a pipeline, because unlike ksh and zsh, bash forks a subshell for each stage.
If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior and works just fine.
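A sketch of the fixed script (lastpipe only takes effect when job control is off, which is the default in non-interactive scripts):
#!/bin/bash
shopt -s lastpipe
cat "$1" | while IFS= read -r last; do true; done
echo "Last line of $1 is $last"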
2. Bash does not wait for process substitutions to finish.
POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.
This makes a notable difference when you have a slow producer, like in a /dev/random password generator:
tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10 # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random) # Fast?
The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.
Since bash waits for both head and tr to finish, it will appear slower.
In the procsub version, bash only waits for head, and lets tr finish in the background.
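A rough way to compare the two on your own system (a sketch; absolute timings depend heavily on how quickly /dev/random produces data):
time bash -c "tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10 > /dev/null"
time bash -c "head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random) > /dev/null"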
3. Bash does not currently optimize away forks for single simple commands in process substitutions.
If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.
Since forks are expensive, bash optimizes the cases that it can. For example, the command:
bash -c 'sleep 1'
Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.
( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.
$ strace -f bash -c '/bin/true | /bin/true' 2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)' 2>&1 | grep -c clone
3
Given the above, you can create a benchmark favoring whichever position you want, but since the number of forks is generally the more relevant factor, pipes would be the best default.
And obviously, it doesn't hurt that pipes are the POSIX-standard, canonical way of connecting the stdin/stdout of two processes, and work equally well on all platforms.
I'd like to redirect the stdout of process proc1 to two processes proc2 and proc3:
         proc2 -> stdout
        /
proc1
        \
         proc3 -> stdout
I tried
proc1 | (proc2 & proc3)
but it doesn't seem to work, i.e.
echo 123 | (tr 1 a & tr 1 b)
writes
b23
to stdout instead of
a23
b23
Editor's note:
- >(…) is a process substitution that is a nonstandard shell feature of some POSIX-compatible shells: bash, ksh, zsh.
- As originally posted, this answer accidentally sent one process substitution's output through the pipeline as well: echo 123 | tee >(tr 1 a) | tr 1 b.
- Output from the process substitutions will be unpredictably interleaved, and, except in zsh, the pipeline may terminate before the commands inside >(…) do.
On Unix (or on a Mac), use the tee command:
$ echo 123 | tee >(tr 1 a) >(tr 1 b) >/dev/null
b23
a23
Usually you would use tee to redirect output to multiple files, but using >(...) you can redirect to another process. So, in general,
$ proc1 | tee >(proc2) ... >(procN-1) >(procN) >/dev/null
will do what you want.
Under Windows, I don't think the built-in shell has an equivalent. Microsoft's Windows PowerShell has a tee command though.
Like dF said, bash allows you to use the >(…) construct, running a command in place of a filename. (There is also the <(…) construct, which substitutes the output of another command in place of a filename, but that is not relevant here; I mention it just for completeness.)
If you don't have bash, or are running on a system with an older version of bash, you can do manually what bash does, by making use of FIFO files.
The generic way to achieve what you want is:
decide how many processes should receive the output of your command, and create that many FIFOs, preferably in a global temporary folder:
subprocesses="a b c d"
mypid=$$
for i in $subprocesses # this way we are compatible with all sh-derived shells
do
mkfifo /tmp/pipe.$mypid.$i
done
start all your subprocesses, each waiting for input from its FIFO:
for i in $subprocesses
do
tr 1 $i </tmp/pipe.$mypid.$i & # background!
done
execute your command teeing to the FIFOs:
proc1 | tee $(for i in $subprocesses; do echo /tmp/pipe.$mypid.$i; done)
finally, remove the FIFOs:
for i in $subprocesses; do rm /tmp/pipe.$mypid.$i; done
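Put together as one script, with a wait added so nothing after it runs until all the background readers have finished (proc1 stands in for your real command):
subprocesses="a b c d"
mypid=$$
for i in $subprocesses; do mkfifo /tmp/pipe.$mypid.$i; done
for i in $subprocesses; do tr 1 $i </tmp/pipe.$mypid.$i & done
proc1 | tee $(for i in $subprocesses; do echo /tmp/pipe.$mypid.$i; done)
wait   # don't proceed until the background readers are done
for i in $subprocesses; do rm /tmp/pipe.$mypid.$i; done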
NOTE: for compatibility with very old shells, you could write the $(…) with backquotes instead. Normally, $(…) is old enough to work even in old versions of ksh, but if it doesn't, enclose the … part in backquotes.
Unix (bash, ksh, zsh)
dF.'s answer contains the seed of an answer based on tee and output process substitutions (>(...)) that may or may not work, depending on your requirements:
Note that process substitutions are a nonstandard feature that (mostly) POSIX-features-only shells such as dash (which acts as /bin/sh on Ubuntu, for instance) do not support. Shell scripts targeting /bin/sh should not rely on them.
echo 123 | tee >(tr 1 a) >(tr 1 b) >/dev/null
The pitfalls of this approach are:
unpredictable, asynchronous output behavior: the output streams from the commands inside the output process substitutions >(...) interleave in unpredictable ways.
In bash and ksh (as opposed to zsh - but see exception below):
output may arrive after the command has finished.
subsequent commands may start executing before the commands in the process substitutions have finished - bash and ksh do not wait for the output process substitution-spawned processes to finish, at least by default.
jmb puts it well in a comment on dF.'s answer:
be aware that the commands started inside >(...) are dissociated from the original shell, and you can't easily determine when they finish; the tee will finish after writing everything, but the substituted processes will still be consuming the data from various buffers in the kernel and file I/O, plus whatever time is taken by their internal handling of data. You can encounter race conditions if your outer shell then goes on to rely on anything produced by the sub-processes.
zsh is the only shell that does by default wait for the processes run in the output process substitutions to finish, except if it is stderr that is redirected to one (2> >(...)).
ksh (at least as of version 93u+) allows use of argument-less wait to wait for the output process substitution-spawned processes to finish.
Note that in an interactive session that could result in waiting for any pending background jobs too, however.
bash v4.4+ can wait for the most recently launched output process substitution with wait $!, but argument-less wait does not work, making this unsuitable for a command with multiple output process substitutions.
However, bash and ksh can be forced to wait by piping the command to | cat, but note that this makes the command run in a subshell. Caveats:
ksh (as of ksh 93u+) doesn't support sending stderr to an output process substitution (2> >(...)); such an attempt is silently ignored.
While zsh is (commendably) synchronous by default with the (far more common) stdout output process substitutions, even the | cat technique cannot make them synchronous with stderr output process substitutions (2> >(...)).
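A sketch of the | cat technique mentioned above, applied to the running example (the braces group the command so that the process substitutions' output also feeds the final cat):
{ echo 123 | tee >(tr 1 a) >(tr 1 b) >/dev/null; } | cat; echo AFTER
cat doesn't see EOF until both tr processes have exited and closed their ends of the pipe, so AFTER is only printed after their output.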
However, even if you ensure synchronous execution, the problem of unpredictably interleaved output remains.
The following command, when run in bash or ksh, illustrates the problematic behaviors (you may have to run it several times to see both symptoms): The AFTER will typically print before output from the output substitutions, and the output from the latter can be interleaved unpredictably.
printf 'line %s\n' {1..30} | tee >(cat -n) >(cat -n) >/dev/null; echo AFTER
In short:
Guaranteeing a particular per-command output sequence:
Neither bash nor ksh nor zsh support that.
Synchronous execution:
Doable, except with stderr-sourced output process substitutions:
In zsh, they're invariably asynchronous.
In ksh, they don't work at all.
If you can live with these limitations, using output process substitutions is a viable option (e.g., if all of them write to separate output files).
Note that tzot's much more cumbersome, but potentially POSIX-compliant solution also exhibits unpredictable output behavior; however, by using wait you can ensure that subsequent commands do not start executing until all background processes have finished.
See bottom for a more robust, synchronous, serialized-output implementation.
The only straightforward bash solution with predictable output behavior is the following, which, however, is prohibitively slow with large input sets, because shell loops are inherently slow.
Also note that this alternates the output lines from the target commands.
while IFS= read -r line; do
tr 1 a <<<"$line"
tr 1 b <<<"$line"
done < <(echo '123')
Unix (using GNU Parallel)
Installing GNU parallel enables a robust solution with serialized (per-command) output that additionally allows parallel execution:
$ echo '123' | parallel --pipe --tee {} ::: 'tr 1 a' 'tr 1 b'
a23
b23
parallel by default ensures that output from the different commands doesn't interleave (this behavior can be modified - see man parallel).
Note: Some Linux distros come with a different parallel utility, which won't work with the command above; use parallel --version to determine which one, if any, you have.
Windows
Jay Bazuzi's helpful answer shows how to do it in PowerShell. That said: his answer is the analog of the looping bash answer above, it will be prohibitively slow with large input sets and also alternates the output lines from the target commands.
bash-based, but otherwise portable Unix solution with synchronous execution and output serialization
The following is a simple, but reasonably robust implementation of the approach presented in tzot's answer that additionally provides:
synchronous execution
serialized (grouped) output
While not strictly POSIX compliant, because it is a bash script, it should be portable to any Unix platform that has bash.
Note: You can find a more full-fledged implementation released under the MIT license in this Gist.
If you save the code below as a script named fanout, make it executable and put it in your PATH, the command from the question works as follows:
$ echo 123 | fanout 'tr 1 a' 'tr 1 b'
# tr 1 a
a23
# tr 1 b
b23
fanout script source code:
#!/usr/bin/env bash
# The commands to pipe to, passed as a single string each.
aCmds=( "$#" )
# Create a temp. directory to hold all FIFOs and captured output.
tmpDir="${TMPDIR:-/tmp}/$kTHIS_NAME-$$-$(date +%s)-$RANDOM"
mkdir "$tmpDir" || exit
# Set up a trap that automatically removes the temp dir. when this script
# exits.
trap 'rm -rf "$tmpDir"' EXIT
# Determine the number padding for the sequential FIFO / output-capture names,
# so that *alphabetic* sorting, as done by *globbing* is equivalent to
# *numerical* sorting.
maxNdx=$(( $# - 1 ))
fmtString="%0${#maxNdx}d"
# Create the FIFO and output-capture filename arrays
aFifos=() aOutFiles=()
for (( i = 0; i <= maxNdx; ++i )); do
printf -v suffix "$fmtString" $i
aFifos[i]="$tmpDir/fifo-$suffix"
aOutFiles[i]="$tmpDir/out-$suffix"
done
# Create the FIFOs.
mkfifo "${aFifos[#]}" || exit
# Start all commands in the background, each reading from a dedicated FIFO.
for (( i = 0; i <= maxNdx; ++i )); do
fifo=${aFifos[i]}
outFile=${aOutFiles[i]}
cmd=${aCmds[i]}
printf '# %s\n' "$cmd" > "$outFile"
eval "$cmd" < "$fifo" >> "$outFile" &
done
# Now tee stdin to all FIFOs.
tee "${aFifos[#]}" >/dev/null || exit
# Wait for all background processes to finish.
wait
# Print all captured stdout output, grouped by target command, in sequence.
cat "${aOutFiles[@]}"
Since @dF. mentioned that PowerShell has tee, I thought I'd show a way to do this in PowerShell.
PS > "123" | % {
$_.Replace( "1", "a"),
$_.Replace( "2", "b" )
}
a23
1b3
Note that each object coming out of the first command is processed before the next object is created. This can allow scaling to very large inputs.
You can also save the output in a variable and use that for the other processes:
out=$(proc1); echo "$out" | proc2; echo "$out" | proc3
However, that works only if
proc1 terminates at some point :-)
proc1 doesn't produce too much output (don't know what the limits are there but it's probably your RAM)
But it is easy to remember and leaves you with more options for what to do with the output of the processes you spawned there, e.g.:
out=$(proc1); echo $(echo "$out" | proc2) / $(echo "$out" | proc3) | bc
I had difficulties doing something like that with the | tee >(proc2) >(proc3) >/dev/null approach.
Another way to do it would be:
eval `echo '&& echo 123 |'{'tr 1 a','tr 1 b'} | sed -n 's/^&&//gp'`
output:
a23
b23
No need to create a subshell here.