Bash subshell to file - bash

I'm looping over a large file and running some commands on each line; when they finish, I want the entire output appended to a file.
Since there's nothing stopping me from running multiple commands at once, I tried running each iteration in the background with &.
It doesn't work as expected: output is appended to the file as each job finishes, not in the order the lines appear in the input file.
#!/bin/bash
while read -r line; do
    (
        echo -e "$line\n-----------------"
        trivy image --severity CRITICAL $line
        # or any other command that might take 1-2 seconds
        echo "============="
    ) >> vulnerabilities.txt &
done < images.txt
Where am I wrong?

Consider using GNU Parallel to get lots of things done in parallel. In your case:
parallel -k -a images.txt trivy image --severity CRITICAL > vulnerabilities.txt
The -k keeps the output in order. Add --bar or --eta for progress reports. Add --dry-run to see what it would do without actually doing anything. Add -j ... to control the number of parallel jobs running at any one time; by default it runs one job per CPU core, so it will keep all your cores busy until the jobs are done.
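For example, a sketch combining those options (the -j value here is arbitrary; adjust it to your machine):
parallel -k --bar -j 4 -a images.txt trivy image --severity CRITICAL > vulnerabilities.txt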
If you want to do more processing on each line, you can declare a bash function and call that with each line as its parameter... see here.
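As a rough sketch of that approach (assuming your shell is bash, so the exported function is visible to the jobs parallel starts):
doit() {
    echo -e "$1\n-----------------"
    trivy image --severity CRITICAL "$1"
    echo "============="
}
export -f doit
parallel -k -a images.txt doit > vulnerabilities.txt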

Related

Command Substitution - order of evaluation in Bash

I was trying to run this seemingly simple script which should demonstrate the functionality of the -a flag of touch: diff <(stat file.o) <(touch -a file.o; stat file.o). The output of this command is sporadic - apparently touch sometimes gets executed after everything else has been evaluated - but in an example such as diff <(echo first) <(echo second; echo third) the order is kept. So why doesn't the first command work as well?
The <( command-list ) syntax does the following:
Run command-list asynchronously
Store output of command-list in a temporary file
Replace itself on the command line with the path to that temporary file
See Process Substitution.
The first point is likely what is tripping you up. There is no guarantee that your first process substitution will run before your second process substitution, therefore touch -a might be executed before either call to stat.
Your second example will always work as expected, because the output of each individual process substitution will be serialized. Even if echo second happens before echo first, they'll still be written to their respective temporary files and echo third will always happen after echo second so they will appear in the correct order in their file. The overall order of the two process substitutions doesn't really matter.
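If the goal is simply a deterministic before/after comparison of the access time, capturing the "before" output first removes the race entirely (a small sketch, not from the original answer):
before=$(stat file.o)
touch -a file.o
diff <(printf '%s\n' "$before") <(stat file.o)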
Both commands happen at the same time.
That is to say, touch -a file.o; stat file.o from one process substitution and stat file.o from the other are happening concurrently.
So sometimes the touch happens before the process substitution that only has a stat; that means that both the stat commands see the effect of the touch, because (in that instance) the touch happened first.
As an (ugly, bad-practice) example, you can observe that it no longer happens when you add a delay:
diff <(stat file.o) <(sleep 1; touch -a file.o; stat file.o)

Monitoring a log file until it is complete

I am a high school student attempting to write a script in bash that will submit jobs using the "qsub" command on a supercomputer utilizing a different number of cores. This script will then take the data on the number of cores and the time it took for the supercomputer to complete the simulation from each of the generated log files, called "log.lammps", and store this data in a separate file.
Because it will take each log file a different amount of time to be completely generated, I followed the steps from
https://superuser.com/questions/270529/monitoring-a-file-until-a-string-is-found
to have my script proceed when the last line of the log file with the string "Total wall time: " was generated.
Currently, I am using the following code in a loop so that this can be run for all the specified number of cores:
( tail -f -n0 log.lammps & ) | grep -q "Total wall time:"
However, running the script with this piece of code resulted in the log.lammps file being truncated and the script not completing even when the log.lammps file was completely generated.
Is there any other method for my script to only proceed when the submitted job is completed?
One way to do this is to touch a marker file once you're complete, and wait for that:
# start process:
rm -f finished.txt
( sleep 3 ; echo "scriptdone" > log.lammps ; true ) && touch finished.txt &
# wait for the above to complete
while [ ! -e finished.txt ]; do
    sleep 1
done
echo safe to process log.lammps now...
You could also use inotifywait, or a flock if you want to avoid busy waiting.
EDIT:
To handle the case where one of the first commands might fail, I grouped the first commands and added true at the end so that the group always returns true, and then did && touch finished.txt. This way finished.txt gets created even if one of the first commands fails, and the loop below does not wait forever.
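For completeness, a minimal sketch of the inotifywait suggestion above (assumes inotify-tools is installed and that the job's last output line contains "Total wall time:"):
# block on filesystem events instead of polling
until grep -q "Total wall time:" log.lammps 2>/dev/null; do
    inotifywait -qq -e create -e modify -e moved_to .
done
echo log.lammps is complete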
Try the following approach
# run tail -f in background
(tail -f -n0 log.lammps | grep -q "Total wall time:") > out 2>&1 &
# process id of the background pipeline
tailpid=$!
# wait for some time, or until the out file has data
sleep 10
# now kill the background pipeline
kill $tailpid
I tend to do this sort of thing with:
http://stromberg.dnsalias.org/~strombrg/notify-when-up2.html
and
http://stromberg.dnsalias.org/svn/age/trunk/
So something like:
notify-when-up2 --greater-than-or-equal-to 0 'age /etc/passwd' 10
This doesn't look for a specific pattern in your file - it looks for when the file stops changing for 10 seconds. You can look for a pattern by replacing the age with a grep:
notify-when-up2 --true-command 'grep root /etc/passwd'
notify-when-up2 can do things like e-mail you, give a popup, or page you when a state changes. It's not a pretty approach in some cases, compared to using wait or whatever, but I find myself using it several times a day.
HTH.

Why is the output from these parallel processes not messed up?

Everything is executing perfectly.
words.dict file contains one word per line:
$ cat words.dict
cat
car
house
train
sun
today
station
kilometer
house
away
chapter.txt file contains plain text:
$ cat chapter.txt
The cars are very noisy today.
The train station is one kilometer away from his house.
The script below appends to result.txt every word from words.dict that is not found (with grep) in chapter.txt, using up to 10 parallel greps:
$ cat psearch.sh
#!/bin/bash --
> result.txt
max_parallel_p=10
while read line ; do
    while [ $(jobs | wc -l) -gt "$max_parallel_p" ]; do sleep 1; done
    fgrep -q "$line" chapter.txt || printf "%s\n" "$line" >> result.txt &
done < words.dict
wait
A test:
$ ./psearch.sh
$ cat result.txt
cat
sun
I thought the tests would generate mixed-up words in result.txt, something like:
csat
un
But it really seems to work.
Please have a look and explain why.
Background jobs are not threads. With a multi-threaded process you can get that effect. The reason is that each process has just one standard output stream (stdout). In a multi-threaded program all threads share the same output stream, so an unprotected write to stdout can lead to garbled output as you describe. But you do not have a multi-threaded program.
When you use the & operator, bash creates a new child process with its own stdout stream. Generally (it depends on implementation details) this is flushed on a newline or when the process exits. So even though the file might be shared, the granularity is by line.
There is a slim chance that two processes could flush to the file at exactly the same time, but your code, with subprocesses and a sleep, makes it highly unlikely.
You could try taking out the newline from the printf, but given the inefficiency of the rest of the code, and the small dataset, it is still unlikely. It is quite possible that each process is complete before the next starts.
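A quick way to convince yourself (a sketch, not part of the original answer): start many background jobs that each append one short line to the same file. The order is arbitrary, but every line arrives intact, because each job does a single small write to a file opened in append mode:
#!/bin/bash
rm -f demo.txt
for i in $(seq 1 200); do
    printf 'line from job %03d\n' "$i" >> demo.txt &
done
wait
wc -l demo.txt                                   # expect 200
grep -c '^line from job [0-9]\{3\}$' demo.txt    # expect 200 intact lines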

Looping files in bash

I want to loop over these kinds of files, where files with the same Sample_ID have to be used together:
Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz
Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz
e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.
Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.
I want to write a for loop in bash to iterate over the files and create the output.
sourcedir=/sourcepath/
destdir=/destinationpath/
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz > $destdir/Sample_52412_R1_R2.sam
How should I pattern-match the file names Sample_ID_R1 and Sample_ID_R2 so they can be used in one command?
Thanks,
for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
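For reference, ${fname%_R1*} removes the shortest trailing match of _R1* from the file name, for example:
fname=Sample_51770BL1_R1.fastq.gz
echo "${fname%_R1*}"    # prints Sample_51770BL1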
In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:
#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3
for fname in *_R1.fastq.gz
do
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait    # let the last background jobs finish before the script exits
The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is what /bin/sh points to by default on Debian-like distributions.
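On bash 4.3 or newer, wait -n can replace the one-second polling loop; a sketch (not from the original answer), assuming the same file layout as above:
maxproc=3
for fname in *_R1.fastq.gz
do
    # once the limit is reached, block until any one background job exits
    while [ "$(jobs -pr | wc -l)" -ge "$maxproc" ]; do
        wait -n
    done
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait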

running parallel bash background

I have a script that I want to run on a number of files:
my_script file_name
but I have many of them, so I wrote some code that is meant to process several at the same time, by first creating 5 roughly equal lists of the files I want to process and then doing this:
my_function() {
    while read i; do
        my_script $i
    done < $1
}
my_function list_1 &
my_function list_2 &
my_function list_3 &
my_function list_4 &
my_function list_5 &
wait
This works for the first file in each list but then finishes. If I change the function to a simple echo it works fine
my_function() {
    while read i; do
        echo $i
    done < $1
}
It prints all the files in each list, as I would expect.
Why does it not work if I use my_script? And is there a nicer way of doing this?
GNU Parallel is made for this:
parallel my_script ::: files*
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
wget -O - pi.dk/3 | sh
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Edit:
If the reason for not installing GNU Parallel is not covered by
http://oletange.blogspot.dk/2013/04/why-not-install-gnu-parallel.html
would you then be kind enough to elaborate why?
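Since your five list files already exist, they can also be fed to parallel directly; :::: tells parallel to read its arguments from files rather than from the command line (a small sketch):
parallel my_script :::: list_1 list_2 list_3 list_4 list_5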
There must be an exit statement in my_script. Replace the exit statement(s) with return statement(s).
Another thing to check is the possibility that the same file is contained in more than one list. There may be contention issues in processing - the file is already being processed while another process attempts to open the same file. Check for duplicate entries with:
sort list_[1-5] | uniq -d
As an alternative to GNU parallel, there is https://github.com/mauvilsa/run_parallel which is simply a function in bash, so it does not require root access or compiling.
To use it, first source the file
source run_parallel.inc.sh
Then in your example, execute it as
run_parallel -T 5 my_function 'list_{%}'
It could also do the splitting of the lists for you as
run_parallel -T 5 -l full_list -n split my_function '{#}'
To see the usage explanation and some examples, execute run_parallel without any arguments.
