Synchronized Output With Bash's Process Substitution

I have to repeatedly call an inflexible external tool that takes as arguments some input data and an output file to which it writes the processed data, for example:
some_prog() { echo "modified_$1" > "$2"; }
For varying input, I want to call some_prog, filter the output and write the output of all calls into the same file "out_file". Additionally, I want to add a header line to the output file before each call of some_prog. Given the following dummy filter:
slow_filter() {
    read input; sleep "0.000$(($RANDOM % 10))"; echo "filtered_$input"
}
I wrote the following code:
rm -f out_file
for input in test_input{1..8}; do
    echo "#Header_for_$input" >> "out_file"
    some_prog $input >( slow_filter >> "out_file" )
done
However, this will produce an out_file like this:
#Header_for_test_input1
#Header_for_test_input2
#Header_for_test_input3
#Header_for_test_input4
#Header_for_test_input5
#Header_for_test_input6
#Header_for_test_input7
#Header_for_test_input8
filtered_modified_test_input4
filtered_modified_test_input1
filtered_modified_test_input2
filtered_modified_test_input5
filtered_modified_test_input6
filtered_modified_test_input3
filtered_modified_test_input8
filtered_modified_test_input7
The output I expected was:
#Header_for_test_input1
filtered_modified_test_input1
#Header_for_test_input2
filtered_modified_test_input2
#Header_for_test_input3
filtered_modified_test_input3
#Header_for_test_input4
filtered_modified_test_input4
#Header_for_test_input5
filtered_modified_test_input5
#Header_for_test_input6
filtered_modified_test_input6
#Header_for_test_input7
filtered_modified_test_input7
#Header_for_test_input8
filtered_modified_test_input8
I realized that the >( ) process substitution forks the shell. Is there a way to synchronize the output of the subshells? Or is there another elegant solution to this problem? I want to avoid the obvious approach of writing to a different file in each iteration because, in my code, the for loop runs for a few hundred thousand iterations.

Write the header inside the process substitution, specifically in a command group with the filter so that the concatenated output is written to out_file as one stream.
rm -f out_file
for input in test_input{1..8}; do
    some_prog "$input" >( { echo "#Header_for_$input"; slow_filter; } >> "out_file" )
done
As process substitution is truly asynchronous and there doesn't appear to be a way to wait for it to complete before executing the next iteration of the loop, I would use an explicit named pipe.
rm -f out_file pipe
mkfifo pipe
for input in test_input{1..8}; do
    some_prog "$input" pipe &
    echo "#Header_for_$input" >> out_file
    slow_filter < pipe >> out_file
done
(If some_prog doesn't work with a named pipe for some reason, you can use a regular file. In that case, you shouldn't run the command in the background.)

Since chepner's approach using a named pipe seems to be very slow in my "real world script" (about 10 times slower than this solution), the easiest and safest way to achieve what I want seems to be a temporary file:
rm -f out_file
tmp_file="$(mktemp --tmpdir my_temp_XXXXX.tmp)"
for input in test_input{1..8}; do
    some_prog "$input" "$tmp_file"
    {
        echo "#Header_for_$input"
        slow_filter < "$tmp_file"
    } >> out_file
done
rm "$tmp_file"
This way, the temporary file tmp_file is simply overwritten in each iteration, and it can be kept in memory if the system's temp directory is a RAM disk.
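If your system's temp directory is not already on a RAM disk, you can point mktemp at one explicitly. A sketch assuming a tmpfs mount at /dev/shm, which is common on Linux but worth checking on your system:
tmp_file="$(mktemp --tmpdir=/dev/shm my_temp_XXXXX.tmp)"    # same template as above, just placed on tmpfs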

Related

Reading filenames from a structured file to a bash script

I have a file with a structured list of filenames (file1.sh, file2.sh, ...) and would like to read and loop over the file names inside a bash script.
cat /home/flora/logs/9681-T13:17:07.091363777.org
%rec: dynamic
Ptrn: Gnu
File: /home/flora/comint.rc
+ /home/flora/engine.rc
+ /home/flora/playa.rc
+ /home/flora/edva.rc
+ /home/flora/dyna.rc
+ /home/flora/lin.rc
I have started with:
while read -r fl; do
    echo "$fl" | grep -oE '[/].+'
done < "$logfl"
But I want to be more specific: match the File: line, then continue reading the following lines, which use + as a continuation character.
bash doesn't impose a limit on variables (other than available memory). That said, I would start by processing the list of lines one by one:
#!/bin/bash
while read _ f
do
    process "$f"
done
where process is whatever function you need to implement.
If you want the values in variables, use an array like this:
#!/bin/bash
while read _ f
do
    files+=("$f")
done
In either case, pass the input file to the script with:
your_script < /home/flora/logs/27043-T13:09:44.893003954.log
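If you specifically want the File: entry plus its + continuation lines, one way is a case statement on the first field. A minimal sketch based on the sample record above ($logfl is the log file, as in the question):
#!/bin/bash
files=()
while read -r key value; do
    case "$key" in
        File:|+) files+=("$value") ;;   # keep the path; skip %rec:, Ptrn: and blank lines
    esac
done < "$logfl"
printf '%s\n' "${files[@]}"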

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
    (
    #find duplicated read names and write to file $f.txt
    sort $f | uniq -d | grep ^# > "$f".txt
    #loop over each duplicated readname
    while read in; do
        rname=$in
        i=1
        #while this readname still exists in the file, increment and replace
        while grep -q "$rname" $f; do
            replace=${rname/0:0/$i:0}
            sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
            let "i+=1"
        done
    done < "$f".txt
    rm "$f".txt
    rm "$f".bu
    echo "done" >> progress.txt
    )&
    background=( $(jobs -p) )
    if (( ${#background[@]} == 40 )); then
        wait -n
    fi
done
The problem with it is that it's impractically slow. I ran it on a 48-core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
    rname=$in
    i=1
    while grep -q "$rname" $f; do
        replace=${rname/0:0/$i:0}
        sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
        let "i+=1"
    done
done < "$f".txt
To give you an idea of what the files look like, below are a few lines from one of them. The thing is that even though it works for the one file, it's slow: multiple hours for a single file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when the jobs were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
It is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
  #dbg if (dbg) print "#dbg:prev=" prev
  if (/^#/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
  if (/^#/ && (prev==$1 || NR==1) ) {
    prev=$1
    n=split($1,tmpArr,":") ; n++
    #dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
    fixNum++;tmpArr[6]=fixNum;
    # magic to rebuild $1 here
    for (i=1;i<n;i++) {
      tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
    }
    $1=tmpFix ; $0=$0
    print $0
  }
  else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh shell, you should be able to copy/paste the code block into a terminal command line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (in fact, any ...file > file) truncates the existing file before the program can read it, so you can lose all the data of a file by trying to use the same name.
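One common way around that is to write to a new name and only move it over the original once the command has succeeded, e.g. (a sketch with a trivial placeholder program):
awk '{ print }' file > file.fixed && mv file.fixed file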
There are probably some syntax improvements that would reduce the size of this code, and there might be one or two things that could make it faster, but this should run very quickly. If not, please post the output of the time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^# > dups.txt
while read in; do
    # read-name headers occur every 4 lines; i starts at 4 so the first line matches
    if [ "$((i % 4))" -eq 0 ] && grep -q "$in" dups.txt; then
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> $1"fixed.txt"
    else
        echo "$in" >> $1"fixed.txt"
    fi
    let "i+=1"
done < $1

How do I prepend to a stream in Bash?

Suppose I have the following command in bash:
one | two
one runs for a long time producing a stream of output and two performs a quick operation on each line of that stream, but two doesn't work at all unless the first value it reads tells it how many values to read per line. one does not output that value, but I know what it is in advance (let's say it's 15). I want to send a 15\n through the pipe before the output of one. I do not want to modify one or two.
My first thought was to do:
echo "$(echo 15; one)" | two
That gives me the correct output, but nothing streams through the pipe until the command one finishes. I want the output to start streaming through the pipe right away, since one takes a long time to execute (months).
I also tried:
echo 15; one | two
Which, of course, outputs 15, but doesn't send it through the pipe to two.
Is there a way in bash to pass '15\n' through the pipe and then start streaming the output of one through the same pipe?
You just need the shell grouping construct:
{ echo 15; one; } | two
The spaces around the braces and the trailing semicolon are required.
To test:
one() { sleep 5; echo done; }
two() { while read line; do date "+%T - $line"; done; }
{ printf "%s\n" 1 2 3; one; } | two
16:29:53 - 1
16:29:53 - 2
16:29:53 - 3
16:29:58 - done
Use command grouping:
{ echo 15; one; } | two
Done!
You could do this with sed:
Example 'one' script, emits one line per second to show it's line buffered and running.
#!/bin/bash
while [ 1 ]; do
    echo "TICK $(date)"
    sleep 1
done
Then pipe that through this sed command. Note that for your specific example, 'ArbitraryText' would be the number of fields; I used ArbitraryText so that it's obvious which line is the inserted text. On OS X, -l makes sed line-buffered; with GNU sed I believe the equivalent is -u.
$ ./one | sed -l '1i\
> ArbitraryText
> '
What this does is instruct sed to insert one line before processing the rest of the input; everything else passes through untouched.
The end result is processed line by line without chunk buffering (that is, without waiting for the input script to finish).
ArbitraryText
TICK Fri Jun 28 13:26:56 PDT 2013
...etc
You should be able to then pipe that into 'two' as you would normally.
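For the original question (prepending 15), the GNU sed equivalent would look something like this; -u requests unbuffered output, and putting the text on the same line as the i command is a GNU extension (a sketch, not tested against every sed):
./one | sed -u '1i 15' | two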

Bash - Redirection with wildcards

I'm trying to do redirection with wildcards. Something like:
./TEST* < ./INPUT* > OUTPUT
Anyone have any recommendations? Thanks.
Say you have the following 5 files: TEST1, TEST2, INPUT1, INPUT2, and OUTPUT. The command line
./TEST* < ./INPUT* > OUTPUT
will expand to
./TEST1 ./TEST2 < ./INPUT1 ./INPUT2 > OUTPUT.
In other words, you will run the command ./TEST1 with 2 arguments (./TEST2, ./INPUT2), with its input redirected from ./INPUT1 and its output redirected to OUTPUT.
To address what you are probably trying to do, you can only specify a single file using input redirection. To send input to TEST from both of the INPUT* files, you would need to use something like the following, using process substitution:
./TEST1 < <(cat ./INPUT*) > OUTPUT
To run each of the programs that matches TEST* on all the input files that match INPUT*, use the following loop. It collects the output of all the commands and puts them into a single file OUTPUT.
for test in ./TEST*; do
    cat ./INPUT* | $test
done > OUTPUT
There is a program called TEST* that has to get various input redirected into it from files called INPUT*, but the thing is there are many TEST programs and they all have a different number, e.g. TEST678. What I'm trying to do is push all the INPUT files into all the TEST programs.
You can write:
for program in TEST* # e.g., program == 'TEST678'
do
    suffix="${program#TEST}" # e.g., suffix == '678'
    input="INPUT$suffix" # e.g., input == 'INPUT678'
    "./$program" < "$input" # e.g., run './TEST678 < INPUT678'
done > OUTPUT
for test in ./TEST*; do
    for inp in ./INPUT*; do
        $test < $inp >> OUTPUT
    done
done

Capturing multiple line output into a Bash variable

I've got a script 'myscript' that outputs the following:
abc
def
ghi
in another script, I call:
declare RESULT=$(./myscript)
and $RESULT gets the value
abc def ghi
Is there a way to store the result either with the newlines, or with '\n' character so I can output it with 'echo -e'?
Actually, RESULT contains what you want — to demonstrate:
echo "$RESULT"
What you show is what you get from:
echo $RESULT
As noted in the comments, the difference is that (1) the double-quoted version of the variable (echo "$RESULT") preserves internal spacing of the value exactly as it is represented in the variable — newlines, tabs, multiple blanks and all — whereas (2) the unquoted version (echo $RESULT) replaces each sequence of one or more blanks, tabs and newlines with a single space. Thus (1) preserves the shape of the input variable, whereas (2) creates a potentially very long single line of output with 'words' separated by single spaces (where a 'word' is a sequence of non-whitespace characters; there needn't be any alphanumerics in any of the words).
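A quick way to see the difference for yourself, using a literal value in place of the script output:
RESULT=$'abc\ndef\nghi'
echo "$RESULT"   # three lines, shape preserved
echo $RESULT     # one line: abc def ghi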
Another pitfall with this is that command substitution — $() — strips trailing newlines. Probably not always important, but if you really want to preserve exactly what was output, you'll have to use another line and some quoting:
RESULTX="$(./myscript; echo x)"
RESULT="${RESULTX%x}"
This is especially important if you want to handle all possible filenames (to avoid undefined behavior like operating on the wrong file).
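A small demonstration of what gets kept, using a stand-in myscript function whose output ends in a blank line (the function and its output are just for illustration):
myscript() { printf 'abc\ndef\n\n'; }
plain="$(myscript)"                 # $() strips both trailing newlines
RESULTX="$(myscript; echo x)"
RESULT="${RESULTX%x}"               # the trailing newlines are still there
printf '%s' "$plain" | od -c
printf '%s' "$RESULT" | od -c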
In case you're interested in specific lines, use a result array:
declare RESULT=($(./myscript)) # (..) = array
echo "First line: ${RESULT[0]}"
echo "Second line: ${RESULT[1]}"
echo "N-th line: ${RESULT[N]}"
In addition to the answer given by @l0b0, I just had the situation where I needed to both keep any trailing newlines output by the script and check the script's return code.
And the problem with l0b0's answer is that the 'echo x' was resetting $? back to zero... so I managed to come up with this very cunning solution:
RESULTX="$(./myscript; echo x$?)"
RETURNCODE=${RESULTX##*x}
RESULT="${RESULTX%x*}"
Parsing multiple output
Introduction
So your myscript outputs 3 lines; it could look like:
myscript() { echo $'abc\ndef\nghi'; }
or
myscript() { local i; for i in abc def ghi ;do echo $i; done ;}
OK, this is a function, not a script (no need for the ./ path), but the output is the same:
myscript
abc
def
ghi
Considering result code
To check the result code, the test function becomes:
myscript() { local i;for i in abc def ghi ;do echo $i;done;return $((RANDOM%128));}
1. Storing multiple output in a single variable, showing newlines
Your operation is correct:
RESULT=$(myscript)
To capture the result code, you could add:
RCODE=$?
or even on the same line:
RESULT=$(myscript) RCODE=$?
Then
echo $RESULT $RCODE
abc def ghi 66
echo "$RESULT"
abc
def
ghi
echo ${RESULT@Q}
$'abc\ndef\nghi'
printf '%q\n' "$RESULT"
$'abc\ndef\nghi'
but to show the variable definition, use declare -p:
declare -p RESULT RCODE
declare -- RESULT="abc
def
ghi"
declare -- RCODE="66"
2. Parsing multiple output into an array, using mapfile
Storing the answer in the myvar variable:
mapfile -t myvar < <(myscript)
echo ${myvar[2]}
ghi
Showing $myvar:
declare -p myvar
declare -a myvar=([0]="abc" [1]="def" [2]="ghi")
Considering result code
In case you have to check the result code, you could:
RESULT=$(myscript) RCODE=$?
mapfile -t myvar <<<"$RESULT"
declare -p myvar RCODE
declare -a myvar=([0]="abc" [1]="def" [2]="ghi")
declare -- RCODE="40"
3. Parsing multiple output with consecutive reads in a command group
{ read firstline; read secondline; read thirdline;} < <(myscript)
echo $secondline
def
Showing variables:
declare -p firstline secondline thirdline
declare -- firstline="abc"
declare -- secondline="def"
declare -- thirdline="ghi"
I often use:
{ read foo;read foo total use free foo ;} < <(df -k /)
Then
declare -p use free total
declare -- use="843476"
declare -- free="582128"
declare -- total="1515376"
Considering result code
Same prepended step:
RESULT=$(myscript) RCODE=$?
{ read firstline; read secondline; read thirdline;} <<<"$RESULT"
declare -p firstline secondline thirdline RCODE
declare -- firstline="abc"
declare -- secondline="def"
declare -- thirdline="ghi"
declare -- RCODE="50"
After trying most of the solutions here, the easiest thing I found was the obvious - using a temp file. I'm not sure what you want to do with your multiple line output, but you can then deal with it line by line using read. About the only thing you can't really do is easily stick it all in the same variable, but for most practical purposes this is way easier to deal with.
./myscript.sh > /tmp/foo
while read line ; do
    echo 'whatever you want to do with $line'
done < /tmp/foo
Quick hack to make it do the requested action:
result=""
./myscript.sh > /tmp/foo
while read line ; do
    result="$result$line\n"
done < /tmp/foo
echo -e $result
Note this adds an extra line. If you work on it you can code around it, I'm just too lazy.
EDIT: While this case works perfectly well, people reading this should be aware that you can easily squash your stdin inside the while loop, giving you a script that runs one line, clears stdin, and exits. Commands like ssh will do that, I think. I just saw it recently; other code examples are here: https://unix.stackexchange.com/questions/24260/reading-lines-from-a-file-with-bash-for-vs-while
One more time! This time with a different filehandle (stdin, stdout, stderr are 0-2, so we can use &3 or higher in bash).
result=""
./test>/tmp/foo
while read line <&3; do
    result="$result$line\n"
done 3</tmp/foo
echo -e $result
You can also use mktemp, but this is just a quick code example. Usage for mktemp looks like:
filenamevar=`mktemp /tmp/tempXXXXXX`
./test > $filenamevar
Then use $filenamevar like you would the actual name of a file. Probably doesn't need to be explained here but someone complained in the comments.
How about this? It will read each line into a variable that can be used subsequently.
Say the myscript output is redirected to a file called myscript_output:
awk '{while ( (getline var < "myscript_output") >0){print var;} close ("myscript_output");}'
