Why is the output from these parallel processes not messed up? - bash

Everything is executing perfectly.
The words.dict file contains one word per line:
$ cat words.dict
cat
car
house
train
sun
today
station
kilometer
house
away
The chapter.txt file contains plain text:
$ cat chapter.txt
The cars are very noisy today.
The train station is one kilometer away from his house.
The script below uses grep to add to result.txt every word from words.dict that is not found in chapter.txt, running up to 10 greps in parallel:
$ cat psearch.sh
#!/bin/bash --
> result.txt
max_parallel_p=10
while read line ; do
    while [ $(jobs | wc -l) -gt "$max_parallel_p" ]; do sleep 1; done
    fgrep -q "$line" chapter.txt || printf "%s\n" "$line" >> result.txt &
done < words.dict
wait
A test:
$ ./psearch.sh
$ cat result.txt
cat
sun
I thought the test would produce mixed-up words in result.txt, something like:
csat
un
But it really seems to work.
Please have a look and explain to me why.

Background jobs are not threads. With a multi-threaded process you can get that effect: each process has just one standard output stream (stdout), and in a multi-threaded program all threads share it, so an unprotected write to stdout can lead to the garbled output you describe. But you do not have a multi-threaded program.
When you use the & operator, bash creates a new child process with its own stdout stream. Generally (it depends on implementation details) that stream is flushed on a newline. So even though the file is shared, the granularity of the writes is one line.
There is a slim chance that two processes could flush to the file at exactly the same time, but your code, with subprocesses and a sleep, makes it highly unlikely.
You could try taking out the newline from the printf, but given the inefficiency of the rest of the code and the small dataset, you are still unlikely to see interleaving. It is quite possible that each process completes before the next one starts.
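If you want to see the line-level granularity for yourself, here is a minimal sketch (the file name demo.txt and the count of 100 jobs are made up for the demo): it starts many background jobs that each append a single line to a shared file, and every line comes out intact even though their order is arbitrary.
# hypothetical demo: many parallel writers appending whole lines to one file
> demo.txt
for i in $(seq 1 100); do
    printf "job %d says hello\n" "$i" >> demo.txt &
done
wait
wc -l demo.txt    # 100 lines, each one intact; only the ordering varies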

Related

does output from LHS of pipe become an arg for RHS of pipe

I'm having difficulty grasping how pipes work. Initially I thought of them as per the title but I couldn't get a simple example to work e.g.
mkdir temp
cd temp
echo "rubbish" > txtfile
ls | cat
I'm wondering why it returns the output from 'ls' rather than the output of 'cat txtfile' (i.e. "rubbish"). I've read many pipe tutorials but none of them seem to go beyond "STDOUT of LHS becomes STDIN for RHS", and I'm left wondering what the STDIN of the RHS actually is. Does it become the first argument? Where does it slot in when the RHS of the pipe has options or more than one argument? Is there any kind of macro substitution taking place, or is my thinking wide of the mark?
Edit: I'm still none the wiser 5 comments later. I'll certainly take a look at Roadowl's pv utility but for now if I type
ls | cut -c 2-4
I get
xtf
which I'd expect. So, does cut take its input from stdin but cat doesn't?
Edit2: I stuck the question up on askubuntu (I originally put it up here by mistake). The answer there https://askubuntu.com/questions/1316848/does-output-from-lhs-of-pipe-become-an-arg-for-rhs-of-pipe throws a bit more light on it.
Edit3: While reading the answers here and on askubuntu, and the links therein, it struck me (again) how woeful bash (& cohorts) are. It's almost like they're designed to trip you up. I only started using bash a couple of months back and every time I write a script I have to read endless web pages to get it to work or discover where I'm going wrong. Take a simple [[ $1=="..." ]] condition. You forget the spaces round the operator and the else branch might wipe some files you want without so much as a warning. Yes, you can do great things with it without a lot of typing, but at times it's like using a tightrope to get from skyscraper A to skyscraper B to avoid using 2 lifts. What's wrong with good old C-style code like cat(ls())? That said, thanks to everyone who contributed.
I guess you mean that when you run
ls | cat
ls should output txtfile, and that name should then be passed to the cat command as a file argument.
But what actually happens behind the scenes is different:
First, your shell creates a pipe using the pipe(int pipefd[2]) system call. This pipe has two ends, one for reading and one for writing.
While the ls command executes, it writes its output to the write end of the pipe, and cat simultaneously reads from the read end.
So here the stdout of ls is the write end of the pipe, whereas the stdin of cat is the read end.
When reading from the pipe, cat treats the data as a stream of bytes, not as the name of a file.
So basically, cat prints whatever bytes arrive on that stream.
Read about pipe() here: pipe(2) — Linux manual page
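A quick way to convince yourself that cat only sees bytes, not a filename, using the txtfile from the question:
printf 'txtfile\n' | cat    # prints the word "txtfile" -- cat never opens the file
cat txtfile                 # prints "rubbish" -- here txtfile is an argument, so cat opens it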
ls | cut -c 2-4
Here, cut reads its standard input, gets the line txtfile, takes characters 2 to 4 from it, producing xtf, and prints that on standard output. That's what the command line option tells it to do.
ls | cat
Here, cat reads its standard input, gets the line txtfile, and prints that on standard output, unchanged. That's what cat does. If there were further lines, it would do the same for those.
Both read standard input unless one or more file names are given as arguments. That standard input is connected to the terminal (the same one where you enter the command line), unless you use pipes or redirections to change that.
So, run the command cut -c 2-4, and enter the line abcdefghijkl, and it will print out bcd. Because without any arguments, it reads its standard input, which is the terminal, by default. Similarly for running just cat, you'll get back the same line you entered.
Running ls | cut -c 2-4 changes where the standard input comes from, but it doesn't create any new command line arguments (other than the -c and 2-4 you gave). Command line arguments are not the same as the standard input.
So, echo txtfile | cat is not the same as running cat txtfile, any more than running echo txtfile | cut -c 2-4 is the same as running cut -c 2-4 txtfile. For some reason, you seem to expect the pipe should work differently for cat than it does for cut.
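If you really do want the left-hand side's output to become arguments of the right-hand side, that is what xargs is for. A short sketch, reusing the txtfile example from the question:
ls | cat          # prints "txtfile": cat just copies the byte stream it reads on stdin
ls | xargs cat    # prints "rubbish": xargs turns the stream into arguments, so cat opens txtfile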

Monitoring a log file until it is complete

I am a high school student attempting to write a bash script that will submit jobs using the "qsub" command on a supercomputer, with each job using a different number of cores. The script will then take, from each of the generated log files (called "log.lammps"), the number of cores and the time it took the supercomputer to complete the simulation, and store this data in a separate file.
Because it will take each log file a different amount of time to be completely generated, I followed the steps from
https://superuser.com/questions/270529/monitoring-a-file-until-a-string-is-found
to have my script proceed when the last line of the log file with the string "Total wall time: " was generated.
Currently, I am using the following code in a loop so that this can be run for all the specified number of cores:
( tail -f -n0 log.lammps & ) | grep -q "Total wall time:"
However, running the script with this piece of code resulted in the log.lammps file being truncated and the script not completing even when the log.lammps file was completely generated.
Is there any other method for my script to only proceed when the submitted job is completed?
One way to do this is to touch a marker file once the work is complete, and wait for that:
#start process:
rm -f finished.txt;
( sleep 3 ; echo "scriptdone" > log.lammps ; true ) && touch finished.txt &
# wait for the above to complete
while [ ! -e finished.txt ]; do
    sleep 1;
done
echo safe to process log.lammps now...
You could also use inotifywait, or a flock if you want to avoid busy waiting.
EDIT:
To handle the case where one of the first commands might fail, I grouped the first commands and added true at the end so that the group always returns true, and then did && touch finished.txt. This way finished.txt gets created even if one of the first commands fails, and the loop below does not wait forever.
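If you would rather avoid the polling loop entirely, here is a rough sketch using inotifywait (this assumes the inotify-tools package is installed and that log.lammps already exists when you start waiting):
# block until log.lammps changes, then re-check for the marker line
while ! grep -q "Total wall time:" log.lammps; do
    inotifywait -qq -e modify -e close_write log.lammps
done
echo safe to process log.lammps now...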
Try the following approach
# run tail -f in background
(tail -f -n0 log.lammps | grep -q "Total wall time:") > out 2>&1 &
# process id of the background pipeline (not of tail itself)
tailpid=$!
# wait for some time, or until the out file has data
sleep 10
# now kill the background pipeline
kill $tailpid
I tend to do this sort of thing with:
http://stromberg.dnsalias.org/~strombrg/notify-when-up2.html
and
http://stromberg.dnsalias.org/svn/age/trunk/
So something like:
notify-when-up2 --greater-than-or-equal-to 0 'age /etc/passwd' 10
This doesn't look for a specific pattern in your file - it looks for when the file stops changing for 10 seconds. You can look for a pattern by replacing the age check with a grep:
notify-when-up2 --true-command 'grep root /etc/passwd'
notify-when-up2 can do things like e-mail you, give a popup, or page you when a state changes. It's not a pretty approach in some cases, compared to using wait or whatever, but I find myself using it several times a day.
HTH.

Bash split stdin by null and pipe to pipeline

I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do seq $I; echo -ne '\0';
done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer does for that question. Given the general support of streams and null delimiters in bash, I'm hoping to do something that's a little more "native". In particular, if I want to go this route, I'll have to escape the pipeline (paste -sd+ | bc) in a string instead of just letting the same shell interpret it. There's nothing too inherently bad about this, but it's a little ugly and will require a bunch of somewhat error prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it buffered only a single byte at a time, which would be impractical, this will never work. Thus, it seems like the only solution is to read each section into memory, or to write a dedicated program.
The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
    bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do    # read up to the next NUL into an array
    sum=0                             # initialize an accumulator
    for number in "${numbers[@]}"; do # iterate over that array
        (( sum += number ))           # ...using an arithmetic context for our math
    done
    printf '%s\n' "$sum"
done < <(build_a_stream)
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
    local i j IFS=$'\n'
    local -a numbers
    for ((i=3; i<=5; i++)); do
        numbers=( )
        for ((j=0; j<=i; j++)); do
            numbers+=( "$j" )
        done
        printf '%s\0' "${numbers[*]}"
    done
}
As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to run the program in bash, so it still needs escaping if the pipeline is complicated. Also, it currently only supports single byte delimiters.

How to count number of forked (sub-?)processes

Somebody else has written (TM) some bash script that forks very many sub-processes. It needs optimization. But I'm looking for a way to measure "how bad" the problem is.
Can I / How would I get a count that says how many sub-processes were forked by this script all-in-all / recursively?
This is a simplified version of what the existing, forking code looks like - a poor man's grep:
#!/bin/bash
file=/tmp/1000lines.txt
match=$1
let cnt=0
while read line
do
    cnt=`expr $cnt + 1`
    lineArray[$cnt]="${line}"
done < $file
totalLines=$cnt
cnt=0
while [ $cnt -lt $totalLines ]
do
    cnt=`expr $cnt + 1`
    matches=`echo ${lineArray[$cnt]}|grep $match`
    if [ "$matches" ] ; then
        echo ${lineArray[$cnt]}
    fi
done
It takes the script 20 seconds to look for $1 in 1000 lines of input. This code forks way too many sub-processes. In the real code, there are longer pipes (e.g. progA | progB | progC) operating on each line using grep, cut, awk, sed and so on.
This is a busy system with lots of other stuff going on, so a count of how many processes were forked on the entire system during the run-time of the script would be of some use to me, but I'd prefer a count of processes started by this script and descendants. And I guess I could analyze the script and count it myself, but the script is long and rather complicated, so I'd just like to instrument it with this counter for debugging, if possible.
To clarify:
I'm not looking for the number of processes under $$ at any given time (e.g. via ps), but the number of processes run during the entire life of the script.
I'm also not looking for a faster version of this particular example script (I can do that). I'm looking for a way to determine which of the 30+ scripts to optimize first to use bash built-ins.
You can count the forked processes simply by trapping the SIGCHLD signal. If you can edit the script file, then you can do this:
set -o monitor # or set -m
trap "((++fork))" CHLD
The fork variable will then contain the number of forks. At the end you can print this value:
echo $fork FORKS
For a 1000-line input file it will print:
3000 FORKS
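In context, the instrumentation is just a couple of lines at the top of the script plus one echo at the end. A minimal sketch (the middle comment stands in for the real script body):
#!/bin/bash
set -m                   # job control on, so the CHLD trap fires for each child reaped
fork=0
trap '((++fork))' CHLD

# ... original script body goes here ...

echo $fork FORKS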
The original script forks for two reasons: once for each `expr ...` and once for each `echo ...|grep ...`. In the reading while-loop it forks every time a line is read, and in the processing while-loop it forks twice per line (once for `expr ...` and once for `echo ...|grep ...`). So for a 1000-line file it forks 3000 times.
But this is not exact! It only counts the forks done by the calling shell. There are more forks, because `echo ...|grep ...` first forks a subshell to run the command substitution, and that subshell then forks twice more: once for echo and once for grep. So it is actually 3 forks, not one, and the real total is closer to 5000 FORKS, not 3000.
If you need to count the forks of the forks (of the forks...) as well, or you cannot modify the bash script, or you want to do it from another script, a more exact solution is to use
strace -fo s.log ./x.sh
It will print lines like this:
30934 execve("./x.sh", ["./x.sh"], [/* 61 vars */]) = 0
Then you need to count the unique PIDs, using something like this (the first number on each line is the PID):
awk '{n[$1]}END{print length(n)}' s.log
For this script I got 5001 (the +1 is the PID of the original bash script itself).
COMMENTS
Actually in this case all forks can be avoided:
Instead of
cnt=`expr $cnt + 1`
Use
((++cnt))
Instead of
matches=`echo ${lineArray[$cnt]}|grep $match`
if [ "$matches" ] ; then
    echo ${lineArray[$cnt]}
fi
You can use bash's internal pattern matching:
[[ ${lineArray[cnt]} =~ $match ]] && echo ${lineArray[cnt]}
Mind that bash's =~ uses ERE, not the BRE that grep uses by default, so it behaves like egrep (or grep -E), not plain grep.
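A small illustration of the difference, with a made-up pattern:
match='so+n'
[[ soon =~ $match ]] && echo "=~ matches"    # prints "=~ matches": '+' is an ERE quantifier
echo soon | grep "$match"                    # no output: in BRE a plain '+' is a literal character
echo soon | grep -E "$match"                 # prints "soon"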
I assume that lineArray is not pointless (otherwise the matching could be done in the reading loop and the array would not be needed at all) and that it is used for some other purpose as well. In that case I suggest a slightly shorter version:
readarray -t lineArray <infile
for line in "${lineArray[@]}"; { [[ $line =~ $match ]] && echo "$line"; }
The first line reads the complete infile into lineArray without any explicit loop. The second line processes the array element by element.
MEASURES
Original script for 1000 lines (on Cygwin):
$ time ./test.sh
3000 FORKS
real 0m48.725s
user 0m14.107s
sys 0m30.659s
Modified version
FORKS
real 0m0.075s
user 0m0.031s
sys 0m0.031s
Same on Linux:
3000 FORKS
real 0m4.745s
user 0m1.015s
sys 0m4.396s
and
FORKS
real 0m0.028s
user 0m0.022s
sys 0m0.005s
So this version uses no fork (or clone) at all. I suggest using this version only for small (<100 KiB) files; in other cases grep, egrep, or awk outperforms the pure bash solution. But this should be checked with a performance test.
For a thousand lines on Linux I got the following:
$ time grep Solaris infile # Solaris is not in the infile
real 0m0.001s
user 0m0.000s
sys 0m0.001s

BASH Script - Safe limits for string from command output

Good day,
I am writing a relatively simple bash script that performs an svn up command, captures the console output, and then does some post-processing on the text.
For example:
#!/bin/bash
# A script to alter SVN logs a bit
# Update and get output
echo "Waiting for update command to complete..."
TEST_TEXT=$(svn up --set-depth infinity)
echo "Done"
# Count number of lines in output and report it
NUM_LINES=$(echo $TEST_TEXT | grep -c '.*')
echo "Number of lines in output log: $NUM_LINES"
# Print out only lines containing Makefile
echo $TEST_TEXT | grep Makefile
This works as expected (ie: as commented in the code above), but I am concerned about what would happen if I ran this on a very large repository. Is there a limit on the maximum buffer size BASH can use to hold the output of a console command?
I have looked for similar questions, but nothing quite like what I'm searching for. I've read up on how certain scripts need to use xargs in cases of large intermediate buffers, and I'm wondering if something similar applies here with respect to capturing console output.
eg:
# Might fail if we have a LOT of results
find -iname *.cpp | rm
# Shouldn't fail, regardless of number of results
find -iname *.cpp | xargs rm
Thank you.
Using
var=$(hexdump /dev/urandom | tee out)
bash didn't complain; I killed it at a bit over 1G and 23.5M lines. You don't need to worry as long as your output fits in your system's memory.
I see no reason not to use a temporary file here.
tmp_file=$(mktemp XXXXX)
svn up --set-depth=infinity > "$tmp_file"
echo "Done"
# Count number of lines in output and report it
NUM_LINES=$(wc -l < "$tmp_file")
echo "Number of lines in output log: $NUM_LINES"
# Print out only lines containing Makefile
grep Makefile "$tmp_file"
rm "$tmp_file"
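One refinement worth considering: if the script might exit early (or svn might fail), an EXIT trap guarantees the temporary file is removed. A minimal sketch of the same idea:
tmp_file=$(mktemp XXXXX)
trap 'rm -f "$tmp_file"' EXIT    # clean up the temp file however the script exits
svn up --set-depth=infinity > "$tmp_file"
# ... same processing as above ...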
