I have a file a.txt in which each line contains a parameter. I want to use mpiexec to call my program, a.out, to run a calculation with each parameter, so I am using a Linux shell script to handle this. The code is simple:
cat a.txt | while read line
do
mpiexec -v -hostfile hosts -np 16 ./a.out ${line}
done
Unexpectedly, the script ends after processing only one line of a.txt. Is this because of a wrong use of the pipe? How can I tackle this problem?
#!/bin/bash
for LINE in `cat a.txt | xargs -r`; do
mpiexec -v -hostfile hosts -np 16 ./a.out $LINE
done
I had this issue too. Claudio's solution helped set me on the right path to understanding why the loop exits after the first iteration. First off, here is a solution which is pretty close to what you wrote:
cat a.txt | while read line; do
</dev/null mpiexec -np 16 ./a.out ${line}
done
Note that I am just using mpiexec on a local computer (Python's threading situation is bad enough to need this), so I can't test whether this works with separate hosts. You can try adding the hostfile back in yourself.
The reason your script didn't work is that mpiexec seems to gobble up whatever is attached to the standard input. I assume it does this so that, in case a.out needs that input, it can collect it all and send it along with the command to run a.out to the other servers. The result is that on the first iteration, read reads the first line from your file; then mpiexec reads the rest of the lines, even though a.out probably doesn't use them in your case. On the second iteration, read tries to read more lines, but since mpiexec already consumed the rest, read is told that the end of file has been reached, and the loop exits.
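You can reproduce the effect without mpiexec at all; here cat stands in as the stdin gobbler (a minimal sketch, nothing in it is specific to mpiexec):
printf 'a\nb\nc\n' | while read line; do
    echo "got: $line"
    cat >/dev/null    # consumes the rest of the pipe, just like mpiexec
done
Only "got: a" is printed: cat drains "b" and "c", so the second read hits end of file.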
Since we want to prevent mpiexec from reading the standard input, we redirect mpiexec's standard input to come from /dev/null. Since /dev/null always contains nothing, mpiexec will read nothing and leave the loop's standard input alone.
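If you would rather leave mpiexec's standard input untouched entirely, another sketch (my own variation, using bash's read -u, and again untested with real hosts) is to feed the loop through a separate file descriptor so the pipe never competes with mpiexec:
while read -u 3 line; do
    mpiexec -np 16 ./a.out ${line}    # mpiexec keeps the script's own stdin
done 3< a.txt
Here read takes its lines from descriptor 3, so whatever mpiexec does with descriptor 0 cannot disturb the loop.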
I am trying to write a script to slice a 13 GB file into smaller parts to launch a split computation on a cluster. What I wrote so far works if I copy and paste it into a terminal, but stops at the first cycle of the for loop when run as a script.
set -ueo pipefail
NODES=8
READS=0days_rep2.fasta
Ntot=$(cat $READS | grep 'read' | wc -l)
Ndiv=$(($Ntot/$NODES))
for i in $(seq 0 $NODES)
do
echo $i
start_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+1)) | tail -n 1)
echo ${start_read}
end_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+$Ndiv)) | tail -n 1)
echo ${end_read}
done
If I run the script:
(base) [andrea#andrea-xps data]$ bash cluster.sh
0
>baa12ba1-4dc2-4fae-a989-c5817d5e487a runid=314af0bb142c280148f1ff034cc5b458c7575ff1 sampleid=0days_rep2 read=280855 ch=289 start_time=2019-10-26T02:42:02Z
(base) [andrea#andrea-xps data]$
it seems to stop abruptly after the command "echo ${start_read}" without raising any sort of error. If I copy and paste the script in terminal it runs without problems.
I am using Manjaro linux.
Andrea
The problem:
The problem here (as @Jens suggested in a comment) has to do with the use of the -e and pipefail options; -e makes the shell exit immediately if any simple command gets an error, and pipefail makes a pipeline fail if any command in it fails.
But what's failing? Take a look at the command here:
start_read=$(cat $READS | grep 'read' | head -n $(($Ndiv*${i}+1)) | tail -n 1)
Which, clearly, runs the cat, grep, head, and tail commands in a pipeline (which runs in a subshell so the output can be captured and put in the start_read variable). So cat starts up and begins reading from the file, shoving data down the pipe to grep. grep reads that, picks out the lines containing 'read', and feeds them on toward head. head reads the first line of that (note that on the first pass, i is 0, so it's running head -n 1), feeds it on toward the tail command, and then exits. tail passes on the one line it got, then exits as well.
The problem is that when head exited, it hadn't read everything grep had to give it; that left grep trying to shove data into a pipe with nothing on the other end, so the system sent it a SIGPIPE signal to tell it that wasn't going to work, which caused grep to exit with an error status. And since grep exited, cat was similarly left stuffing data into an orphaned pipe, so it got a SIGPIPE as well and also exited with an error status.
Since both cat and grep exited with errors, and pipefail is set, that subshell exits with an error status as well, which means the parent shell considers the whole assignment command to have failed and aborts the script on the spot.
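You can reproduce the effect without the script (a minimal sketch; run it in bash):
set -o pipefail
seq 1 1000000 | head -n 1 >/dev/null
echo "exit status: $?"    # typically 141, i.e. 128 + SIGPIPE (signal 13)
head exits after one line, seq gets SIGPIPE trying to write the rest, and with pipefail set the whole pipeline reports the failure; add -e and the shell would abort right there.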
Solutions:
So, one possible solution is to remove the -e option from the set command. -e is kind of janky in what it considers an exit-worthy error and what it doesn't, so I don't generally like it anyway (see BashFAQ #105 for details).
Another problem with -e is that (as we've seen here) it doesn't give much of any indication of what went wrong, or even that something went wrong! Error checking is important, but so's error reporting.
(Note: the danger in removing -e is that your script might get a serious error partway through... and then blindly keep running, in a situation that doesn't make sense, possibly damaging things in the process. So you should think about what might go wrong as the script runs, and add manual error checking as needed. I'll add some examples to my script suggestion below.)
Anyway, removing -e just papers over the fact that this isn't a really good approach to the problem. You're reading (or trying to read) the entire file multiple times, and processing it through multiple commands each time. You really should only be reading through the thing twice: once to figure out how many reads there are, and once to break it into chunks. You could write a program to do the splitting in awk, but most unix-like systems already have a program specifically for this task: split. There's also no need for cat everywhere, since the other commands are perfectly capable of reading directly from files (again, @Jens pointed this out in a comment).
So I think something like this would work:
#!/bin/bash
set -uo pipefail # I removed the -e 'cause I don't trust it
nodes=8 # Note: lower- or mixed-case variables are safer to avoid conflicts
reads=0days_rep2.fasta
splitprefix=0days_split_
Ntot=$(grep -c 'read' "$reads") || { # grep can both read & count in a single step
# The || means this'll run if there was an error in that command.
# A normal thing to do is print an error message to stderr
# (with >&2), then exit the script with a nonzero (error) status
echo "$0: Error counting reads in $reads" >&2
exit 1
}
Ndiv=$((($Ntot+$nodes-1)/$nodes)) # Force it to round *up*, not down
grep 'read' "$reads" | split -l $Ndiv -a1 - "$splitprefix" || {
echo "$0: Error splitting fasta file" >&2
exit 1
}
This'll create files named "0days_split_a" through "0days_split_h". If you have the GNU version of split, you could add its -d option (to use numeric suffixes instead of letters) and/or --additional-suffix=.fasta (to add the .fasta extension to the split files).
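For example, with GNU split (an assumption; check split --version) the last pipeline could become:
grep 'read' "$reads" | split -l $Ndiv -a1 -d --additional-suffix=.fasta - "$splitprefix"
which would produce "0days_split_0.fasta" through "0days_split_7.fasta" instead.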
Another note: if only a little bit of that big file is read lines, it might be faster to run grep 'read' "$reads" >sometempfile first, and then run the rest of the script on the temp file, so you don't have to read & thin it twice. But if most of the file is read lines, this won't help much.
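A sketch of that variant (the temp-file handling is my addition, not part of the script above):
tmp=$(mktemp) || exit 1
grep 'read' "$reads" >"$tmp" || { echo "$0: Error extracting reads" >&2; exit 1; }
Ntot=$(wc -l <"$tmp")                # count once, from the thinned file
Ndiv=$((($Ntot+$nodes-1)/$nodes))
split -l $Ndiv -a1 "$tmp" "$splitprefix"
rm -f "$tmp"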
Alright, we have found the troublemaker: set -e in combination with set -o pipefail.
Gordon Davisson's answer provides all the details. I provide this answer for the sole purpose of reaping an upvote for my debugging efforts in the comments to your answer :-)
Let me present my findings first and put my questions at the end. Point (1) applies to zsh only; points (2) and (3) apply to both zsh and bash.
1. stdin of command substitution
ls | echo $(cat)
ls | { echo $(cat) }
The first one prints cat: -: Input/output error; while the second one produces the output of ls.
2. chained commands after pipe
ls | { head -n1; cat; }
ls | { read a; cat; }
The first command doesn't work properly: cat encounters EOF and exits immediately. But the second form works: the first line is read into a, and cat gets the rest of them.
3. mixed stdin
ls | { python -c 'import sys; print(sys.argv)' $(head -n1); }
ls | { python -c 'import sys; print(sys.argv); print(input())' $(head -n1); }
Inside the {} on the first line, the command just prints its command-line arguments; in the second form, the command also reads a line from stdin.
The first command runs successfully, while the second form fails because input() hits EOF.
My questions are:
(as in section 1) What is the difference between the form with {} and the form without?
(as in section 2) Is it possible for head and cat to read the same stdin sequentially? Why does the second form succeed while the first fails?
(as in section 3) How is the stdin of the command inside a command substitution connected to the stdin of the original command (echo here)? Who reads first? And how can the stdin be kept open so that both commands (python and head) can read the same stdin sequentially?
You are not taking input buffering into account, and that explains most of your observations.
head reads several kilobytes of input each time it needs data, which makes it much more efficient. So it is likely that it will read all of stdin before any other process has a chance to. That's obvious in case 2, where the execution order is perhaps clearer.
If input were coming from a regular file, head could seek back to the end of the lines it used before terminating. But since a pipe is not seekable, it cannot do that. If you use "here-strings" (the <<< syntax), then stdin will turn out to be seekable, because here-strings are implemented using a temporary file. I don't know if you can rely on that fact, though.
read does not buffer input, at least not beyond the current line (and even then, only if it has no other line end delimiter specified on the command line). It carefully only reads what it needs precisely because it is generally used in a context where its input comes from a pipe and seeking wouldn't be possible. That's extremely useful -- so much so that the fact that it works is almost invisible -- but it's also one of the reasons shell scripting can be painfully slow.
You can see this more clearly by sending enough data into the pipe to satisfy head's initial read. Try this, for example:
seq 1 10000 | { head -n1; head -n2; }
(I changed the second head to head -n2 because the first head happens to leave stdin positioned exactly at the end of a line, so that the second head sees a blank line as the first line.)
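By contrast, swapping the first head for read shows the unbuffered, line-at-a-time behavior (a variation on the snippet above):
seq 1 10000 | { read a; echo "read got: $a"; head -n1; }
read consumes only the line "1", so head sees the stream positioned right at "2" and prints that.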
The other thing you need to understand is what command substitution does, and when it does it. Command substitution reads the entire output of a command and inserts it into the command line. That happens even before the command has been identified, never mind started execution.
Consider the following little snippet:
$(printf %cc%co e h) hello, world
It should be clear from this that the command substitution is fully performed before the echo utility (or builtin) is started: the substitution expands to the word echo, and only then is that word identified as the command to run, so the snippet prints "hello, world".
Your first scenario triggers an oddity of zsh which is explained by Stéphane Chazelas in this answer on Unix.SE. Effectively, zsh performs the command substitution before the pipeline is set up, so cat reads from the main zsh process's standard input. (Stéphane explains why this is and how it leads to an EIO error. I think it depends on the precise zsh configuration and option settings, though, since on my default zsh install it just locks up my terminal; at some point I'll have to figure out why.) If you use braces, then the redirection is set up before the command substitution is performed.
When I pipe two commands, it seems the first command must finish before the second command can parse the output.
For example,
$ ping -c 5 10.11.12.13 | while read line; do echo $line; done
I expected it to generate output every second, but it does not. Is this true, or am I missing something (e.g., a buffering effect)?
The problem is: the first command runs over a long period of time, and I want to parse its output in real time. How can I do that in the shell?
Thanks.
You can force the first command to become unbuffered by using a tool like unbuffer (from the expect package) or stdbuf. This way, it will flush its output after each line, rather than after it has accumulated e.g. 4096 bytes.
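For example (a sketch: stdbuf comes from GNU coreutils and unbuffer from expect, so availability depends on your system):
stdbuf -oL ping -c 5 10.11.12.13 | while read -r line; do echo "$line"; done
or, with expect installed:
unbuffer ping -c 5 10.11.12.13 | while read -r line; do echo "$line"; done
With -oL, stdbuf forces ping's standard output to be line-buffered, so each reply line reaches the while loop as soon as it is printed rather than when a buffer fills.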
I have been looking around and couldn't find clear clues to verify what I am deducing from a script given to me.
So, file.txt is an open file (on file descriptor 3), and a script that records timestamps is constantly appending new lines to it. Does the following piece enter the while loop each time a new line is added to the file?
exec 3< /path/file.txt
while read <&3
do
command1
command2
done
So as long as I don't close the file descriptor, a new line added to my file.txt will always activate the while loop, right?
Please help me clear this up. Thanks
To read from file descriptor 3, use read -u 3 (see Bash builtins). Don't forget to specify the variable name into which the value should be read.
Once read detects EOF, it stays at EOF; it won't spot additions to the file after that. So, if the code adding lines to the file is slower than the code in this script, you will reach an end point and the loop will terminate. If you don't want that, consider using tail -f /path/file.txt, and maybe process substitution too:
while read -u 3 line
do
command1
command2
done 3< <(tail -f /path/file.txt)
Or, if you want to do the exec:
exec 3< <(tail -f /path/file.txt)
while read -u 3 line
do
command1
command2
done
Note that the tail -f loops will never finish until you interrupt the script in some way.
So as long as I don't close the file descriptor, a new line added to my file.txt will always activate the while loop, right?
Answer: wrong.
Redirecting with exec 3< /path/file.txt gives you the ability to read from /path/file.txt using the file descriptor, but it does nothing to allow any kind of triggering from /path/file.txt to your code. Think about it this way: if there is a new line in /path/file.txt, you can read it, but the redirection provides no way of knowing whether or not a new line has been added to the file for your code to respond to. It's still up to your code to check.
The netcat manpage indicates that, in the absence of the -c and -e options, a shell can be served via nc using the following commands.
$ rm -f /tmp/f; mkfifo /tmp/f
$ cat /tmp/f | /bin/sh -i 2>&1 | nc -l 127.0.0.1 1234 > /tmp/f
Now, as I understand it, both reads from and writes to FIFOs are blocking operations. For example, if I run
$ mkfifo foo
$ cat foo
bash will block, because nothing has been written to foo. How does the pipeline in the example from the nc manpage not block? I assume I am misunderstanding how pipelines are executed.
All the commands in the pipeline run concurrently, not sequentially. So cat /tmp/f will indeed block, but /bin/sh and nc will still be started while that happens. nc will write to the FIFO when a client connects to the port and sends a command, and this will allow cat to unblock.
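You can see the concurrent start-up directly with a toy pipeline (a sketch; the date calls are just illustration):
{ date +'%T first stage up' >&2; sleep 2; echo done; } |
{ date +'%T second stage up' >&2; cat; }
Both "up" messages appear immediately, and "done" arrives about two seconds later: the second stage was already running, blocked in cat, while the first one slept.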
The pipe character in bash does nothing more than connect the output stream of the first command to the input stream of the second. echo "123" | cat is essentially the same as cat < <(echo 123) (the latter only starts one subshell, while the former starts one for each command, but this can be ignored here; also, it's a bashism and does not work in plain sh).
$ mkfifo foo
$ cat foo
This does indeed block, but it does not freeze: the moment any other program writes anything to foo, cat will display it.
What you are doing in your netcat call is essentially creating a circle: anything written into the FIFO will be read by cat and, since cat is connected to sh, sent to the latter. sh then executes the code (as sh just executes anything written to its input stream) and sends the output to nc. nc sends it to the client.
Anything the client sends to nc will be written into the FIFO, and our circle is complete.
The mistake you made (I think) is to assume that the second process of a pipe only reads the data once, rather than continuously, and therefore has to wait for the first process to end. This is not true: every process in a pipeline is started in a subshell, so they all run independently of each other.
You should also be able to change the order of all commands in your pipeline. As long as the first one reads from the FIFO and the last one writes to it (to complete the circle), it should work.
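For instance, this rotation should behave the same way (an untested sketch, reusing the flags from the manpage example):
rm -f /tmp/f; mkfifo /tmp/f
nc -l 127.0.0.1 1234 < /tmp/f | cat | /bin/sh -i 2>&1 > /tmp/f
Now nc reads from the FIFO and sh writes back to it, but the circle itself is unchanged: client to nc, through cat into sh, and sh's output back through the FIFO to nc.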