Reading numbers in scientific notation using bash

As part of an annotation pipeline for de novo fish genomes I need to compare e-values from BLAST to see whether they are lower than a certain threshold.
To get the semantics right I first evaluated one of the other columns in the BLAST output, and it works fine like this:
for f in FOLDER/*; do
myVar=$(head -1 "$f" | awk '{print $4}')
if [[ $myVar -gt 50 ]]; then echo ..... ; fi
done
$4 here is a column in the BLAST output with whole-number values (hit length or something).
However, when I try to change the script to work with the e-values, there are problems with the interpretation of the scientific notation:
What I WOULD like is this:
for f in FOLDER/*; do
myVar=$(head -1 "$f" | awk '{print $11}')
if [[ $myVar -gt 1.0e-10 ]]; then echo ..... ; fi
done
where $11 points to the e-value for each hit.
Could this be done in a not too cumbersome manner in bash?

With awk, it is possible:
for f in FOLDER/*; do awk '$11 < 1e-10 {print $11}' "$f"; done
This doesn't need the variable to be defined first.
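If you need the comparison to drive a bash if, as in the loops above, one option is to let awk do the floating-point test and branch on its exit status, since bash itself cannot compare numbers in scientific notation. A minimal sketch (the echo body is illustrative):
for f in FOLDER/*; do
e=$(head -1 "$f" | awk '{print $11}')
# awk understands scientific notation; exit 0 (success) when below the threshold
if awk -v v="$e" 'BEGIN { exit !(v < 1e-10) }'; then
echo "$f: e-value $e is below 1e-10"
fi
done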

Related

How to filter text data in bash more efficiently

I have data file which I need to filter with bash script, see data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution, which works but is very slow; it takes 5 minutes to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done
I'm aware that this kind of task is not very suitable for bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Do not run commands in subshells (every $(...) forks a new process); it slows your script a lot. You can do everything in the current shell.
#! /bin/bash
while IFS='=' read -r k v ; do
if [[ $k == name ]] ; then
name=$v
elif [[ $k == value ]] ; then
printf '%s=%s\n' "$name" "$v"
fi
done
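This script reads standard input and writes standard output, so it would be run with redirections, e.g. (group.sh being a hypothetical name for the script above):
./group.sh < input.txt > output.txt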
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}"
fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter from the variable $IFS (internal field separator).
while IFS='=' read -r key value; do
case "$key" in
name) name="$value";;
value) echo "$name=$value";;
esac
done < input.txt > output.txt
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I have IFS= read -r in the first two versions of the loop.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls; see the sketch after this list.
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
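To illustrate the BashFAQ/024 pitfall from the list above: a loop on the right-hand side of a pipe runs in a subshell, so assignments made inside it are lost. A minimal sketch:
count=0
printf '%s\n' a b c | while read -r _; do
(( count++ ))   # increments a copy inside the pipeline's subshell
done
echo "$count"   # prints 0, not 3
count=0
while read -r _; do
(( count++ ))   # runs in the current shell this time
done < <(printf '%s\n' a b c)
echo "$count"   # prints 3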
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as the character =. The first block is executed only if the first field of a line ($1) equals the string value; it prints the variable name followed by the character = and the second field ($2). The second block is executed on every line and stores the second field ($2) in the variable name.
Normally, if your input resembles what you show, this automatically skips the first line. Otherwise we can exclude it explicitly using a test on the NR variable, whose value is the current line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show, but not on inputs that contain other types of lines or several consecutive value=... lines. If you really want to test that the name/value pair is on two consecutive lines, we need something more: for instance, test whether the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the two blocks in a slightly more intuitive order (the opposite order would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
With awk there might be a more elegant solution, but you can use (note that a regular-expression record separator is a GNU awk extension):
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=
FS="\nvalue=" says that the field separator for each record is value=
if($2) says to only proceed the printf is the second field exists

How to display number to two decimal places, even zero .00 using BC or DC

Greetings!
I use bc to do some calculations in my script. For example:
bc
scale=6
1/2
.500000
For further usage in my script I need "0.500000" instead of ".500000".
Could you help me please to configure bc output number format for my case?
In one line:
printf "%0.6f\n" $(bc -q <<< scale=6\;1/2)
Just do all your calculations and output in awk:
float_scale=6
result=$(awk -v scale="$float_scale" 'BEGIN { printf "%.*f\n", scale, 1/2 }')
As an alternative, if you'd prefer to use bc rather than awk (alone or combined with bc), Bash's printf supports floating-point numbers even though the rest of Bash doesn't.
result=$(echo "scale=$float_scale; $*" | bc -q 2>/dev/null)
result=$(printf '%*.*f' 0 "$float_scale" "$result")
The second line above could instead be:
printf -v result '%*.*f' 0 "$float_scale" "$result"
which works rather like sprintf would and doesn't create a subshell.
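For instance, a small sketch of printf -v (the variable names are illustrative):
float_scale=6
# printf -v assigns the formatted string to the named variable instead of printing it
printf -v result '%.*f' "$float_scale" .5
echo "$result"   # 0.500000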
Quick and dirty, since scale only applies to the decimal digits and bc does not seem to have a sprintf-like function:
$ bc
scale = 6
result = 1 / 2
if (0 <= result && result < 1) {
print "0"
}
print result;
echo "scale=3;12/7" | bc -q | sed 's/^\\./0./;s/0*$//;s/\\.$//'
Here is a modified version of the function:
float_scale=6
function float_eval()
{
local stat=0
local result=0.0
if [[ $# -gt 0 ]]; then
result=$(echo "scale=$float_scale; $*" | bc -q | awk '{printf "%f\n", $0}' 2>/dev/null)
stat=$?
if [[ $stat -eq 0 && -z "$result" ]]; then stat=1; fi
fi
echo $result
return $stat
}
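Usage would then look something like this (a small sketch, assuming the function above has been defined):
float_eval "1/2"       # prints 0.500000
x=$(float_eval "22/7")
echo "$x"              # 3.142857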
Can you put the bc usage into a little better context? What are you using the results of bc for?
Given the following in a file called some_math.bc
scale=6
output=1/2
print output
on the command line I can do the following to add a zero:
$ bc -q some_math.bc | awk '{printf "%08f\n", $0}'
0.500000
If I only needed the output string to have a zero for formatting purposes, I'd use awk.

Extracting a substring from a variable using bash script

I have a bash variable with value something like this:
10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
There are no spaces within the value. This value can be very long or very short. It consists of pairs such as 65:3.0. I know the number in the first part of a pair, say 65, and I want to extract the number 3.0, or the whole pair 65:3.0. I am not aware of the position (offset) of 65.
I will be grateful for a bash-script that can do such extraction. Thanks.
Probably awk is the most straight-forward approach:
awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
3.0
Or to get the pair:
$ awk -F: -v RS=',' '$1==65' <<< "$var"
65:3.0
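To capture the result in a shell variable for later use (a small usage sketch; the exit stops awk at the first match):
val=$(awk -F: -v RS=',' '$1==65{print $2; exit}' <<< "$var")
echo "$val"   # 3.0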
Here's a pure Bash solution:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
while read -r -d, i; do
[[ $i = 65:* ]] || continue
echo "$i"
done <<< "$var,"
You may use break after echo "$i" if there's only one 65:... in var, or if you only want the first one.
To get the value 3.0: echo "${i#*:}".
Other (pure Bash) approach, without parsing the string explicitly. I'm assuming you're only looking for the first 65 in the string, and that it is present; prepending a comma makes the pattern match even when 65: is the first pair:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
tmpvar=,$var
value=${tmpvar#*,65:}
value=${value%%,*}
echo "$value"
This will be very slow for long strings!
Same as above, but will output all the values corresponding to 65 (or none if there are none):
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
tmpvar=,$var
while [[ $tmpvar = *,65:* ]]; do
tmpvar=${tmpvar#*,65:}
echo "${tmpvar%%,*}"
done
Same thing, this will be slow for long strings!
The fastest I can obtain in pure Bash is my original answer (and it's fine with 10000 fields):
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
IFS=, read -ra ary <<< "$var"
for i in "${ary[@]}"; do
[[ $i = 65:* ]] || continue
echo "$i"
done
In fact, no, the fastest I can obtain in pure Bash is with this regex:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
[[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"
Test of this vs awk,
where the 65:3.0 is at the end:
printf -v var '%s:3.0,' {100..11000}
var+=65:42.0
time awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
shows 0m0.020s (rough average) whereas:
time { [[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"; }
shows 0m0.008s (rough average too).
where the 65:3.0 is not at the end:
printf -v var '%s:3.0,' {1..10000}
time awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
shows 0m0.020s (rough average) and with early exit:
time awk -F: -v RS=',' '$1==65{print $2;exit}' <<< "$var"
shows 0m0.010s (rough average) whereas:
time { [[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"; }
shows 0m0.002s (rough average).
With grep:
grep -o '\b65\b[^,]*' <<<"$var"
65:3.0
Or
grep -oP '\b65\b:\K[^,]*' <<<"$var"
3.0
\K drops everything matched so far from the reported match, so grep prints only what follows it. It requires grep's Perl-compatible regex mode (-P).
Here is a GNU awk version (a regular-expression record separator is a gawk extension):
awk -vRS="(^|,)65:" -F, 'NR>1{print $1}' <<< "$var"
3.0
try
echo "$var" | tr , '\n' | awk '/65/'
where
tr , '\n' turns each comma into a newline
awk '/65/' picks the lines containing 65 (beware: this also matches e.g. 165 or 652; the next variant is exact)
or
echo "$var" | tr , '\n' | awk -F: '$1 == 65 {print $2}'
where
-F: use : as separator
$1 == 65 pick line with 65 as first field
{ print $2} print second field
Using sed
sed -e 's/^.*,\(65:[0-9.]*\),.*$/\1/' <<<",$var,"
output:
65:3.0
There are two different ways to protect against 65:3.0 being the first-in-line or last-in-line. Above, commas are added to surround the variable providing for an occurrence regardless. Below, the Gnu extension \? is used to specify zero-or-one occurrence.
sed -e 's/^.*,\?\(65:[0-9.]*\),\?.*$/\1/' <<<$var
Both handle 65:3.0 regardless of where it appears in the string.
Try egrep like below:
echo "$myvar" | egrep -o '\b65:[0-9]+\.[0-9]+'

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use a scripting language for the counting, like awk or perl. An awk solution has already been posted; here is a perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell pipelining:
filter out the right rows
extract the column with the count
sum the numbers
e.g some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
the grep filter out the right rows
the cut get the last column
the paste joins the numbers into a line like 2+2, which is evaluated by
the bc to produce the sum
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
the sed filter and extract
another example - different way of counting
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
the numbers are summed by the RPN calculator called dc; it works like 0 2 +, i.e. the values come first and the operation comes last.
the first echo pushes 0 onto the stack
the sed creates a stream of numbers like 2+ 2+
the last echo p prints the top of the stack
Many other possibilities exist for summing a stream of numbers,
e.g. summing in bash:
sum=0
while read -r num
do
sum=$(( sum + num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
echo "$sum"
and pure bash:
while IFS=: read -r maker model year color count
do
if [[ "$maker" == "Ford" && "$color" == "Red" ]]
then
(( sum += $count ))
fi
done < carfile
echo $sum
