Iterating over a text file in bash and rounding each number - bash

My file looks like this
0 0 1 0.2 1 1
1 1 0.8 0.1 1
0.2 0.4 1 0 1
And I need to create a new output file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
i.e. if the number is greater than 0.5, it is rounded up to 1, and if it is less than 0.5, it is rounded down to 0, with the results written to a new file.
The file is quite large, with ~ 1400000000 values. I would quite like to write a bash script to do this.
I am guessing the best way to do this would be to iterate over each value in a for loop, with an if statement inside which tests whether the number is greater or less than 0.5 and then prints 0 or 1 accordingly.
The pseudocode would look like this (my bash isn't great, so before you tell me it isn't syntactically correct: I already know):
#!/bin/bash
#reads in each line
while read -r p; do
    #loops through each number in the line
    for i in $p; do
        #tests if each number is greater than or equal to 0.5 and prints accordingly
        #(bash has no float arithmetic, so bc does the comparison)
        if [ "$(echo "$i >= 0.5" | bc -l)" -eq 1 ]; then
            printf '1 '
        else
            printf '0 '
        fi
    done
    echo
done < test.txt > output.txt
I'm not really sure how to do this. Can anyone help? Thanks.

awk '{
for( i=1; i<=NF; i++ )
$i = $i<0.5 ? 0 : 1
}1' input_file > output_file
$i = $i<0.5 ? 0 : 1 changes each field to 0 or 1, and the 1 after {...} prints the line with the changed values afterwards.
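The trailing 1 is a pattern that is always true, for which awk's default action is to print the record; spelled out explicitly, the program is equivalent to
awk '{ for (i=1; i<=NF; i++) $i = $i<0.5 ? 0 : 1 } { print }' input_file > output_file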

another awk without loops (this one relies on GNU awk, for the regex-valued RS and the RT variable)...
$ awk -v RS='[ \n]' '{printf ($1>=0.5) RT}' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
if the values are not between 0 and 1, you may want to change to
$ awk -v RS='[ \n]' '{printf "%.0f%s", $1, RT}' file
note that printf's default rounding is round-half-to-even (i.e. 0.5 -> 0, but 1.5 -> 2). If you want halves always rounded up
$ awk -v RS='[ \n]' '{i=int($1); printf "%d%s", i+(($1-i)>=0.5), RT}' file
should take care of non-negative numbers. For negatives there are again two alternatives: round towards zero or towards negative infinity.
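For instance, a sketch that floors first so negatives work too, giving round-half-up everywhere (halves go towards positive infinity, e.g. -1.5 -> -1; again GNU awk):
$ awk -v RS='[ \n]' '{i=int($1); if($1<i) i--; printf "%d%s", i+(($1-i)>=0.5), RT}' file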

Here's one in Perl using regex and look-ahead:
$ perl -p -e 's/0(?=\.[6789])/1/g;s/\.[0-9]+//g' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
I went with the "if it is less than 0.5, it is rounded down to 0" part. Note that the look-ahead only inspects the first decimal digit, so a value such as 0.55 also rounds down to 0; that matches the sample data, where every value has a single decimal digit.

Related

Extract compound data from SDF file using IDNUMBER and write to a new file

I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.
My first file idnumber.txt looks like this:
4323-7584
K8933-4943
L2837-0493
The file I am attempting to filter the molecule blocks from has entries as follows:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-2738)
4323-2738
> <SALT>
NaCl
$$$$
The file repeats in this fashion, each block starting with the -ISIS- -- StrEd -- and ending with the $$$$. I need to extract the entire block for each string in idnumber.txt. So the expected output would be every block from above, from -ISIS- to $$$$, whose ID matches one in idnumber.txt.
Each entry is a different length, and I am trying to extract the entire block, from the -ISIS- -- StrEd -- line down to the terminating $$$$.
I have tried a few options of sed trying to recognise the first line to the IDNUMBER and extracting around it but that didn't work. My current iteration of the code is as follows:
#!/bin/bash
cat idnumbers.txt | while read line
do
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done
The logic behind what I was attempting was to find the block that starts at the -ISIS- phrase and ends with the relevant ID number, and copy it to a file. I realise now that this logic would skip the $$$$ that terminates each block.
But I have a feeling I am missing something as it is not actually writing anything to filtered.sdf.
Expected output:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
Edit:
So I have tried a different approach based on another question, but I have not been able to figure out how to derive the key for each awk record from the line containing the IDNUMBER, because it sits at a different field position in every record.
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
I assume it would be a matter of changing the field reference in the array $1 to an expression that recognizes the line after > <IDNUMBER>(xyz), but I am unsure how to go about achieving that.
I am missing something
In this command
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
you are using the following regular expressions:
^-ISIS-$
^$line$
^ denotes the start of the line, $ denotes the end of the line.
The 1st is looking for -ISIS- spanning a whole line, whilst your file has
-ISIS- -- StrEd --
that is, -ISIS- as part of a line; therefore you should use the regular expression without anchors, i.e. just -ISIS-.
The 2nd never becomes the ID at all: because the sed script is in single quotes, the shell does not expand $line, so sed sees the literal text ^$line$. That is $ followed by further characters (line), implying characters after the end of the line, which is impossible, so the end address never matches and your code keeps printing until the end of the file. I have no idea if this is the desired behavior, but be warned that the more common use of $ in GNU sed is as an address meaning the last line. For example, if you want to print the first line holding a digit and all following lines, you could do
sed -n '/[0-9]/,$p' file.txt
Maybe this is what you are looking for. Some explanation:
[[:blank:]] -> space or tab only, not newline characters
The first regex looks for the start pattern -ISIS- -- StrEd -- you mentioned (with a variable amount of spaces/tabs in between); on a match, the variable found is set to 1 and the buffer text is cleared, because a new block begins.
The second regex looks for the ID line > <IDNUMBER> (xxxx-xxxx) (also with a variable amount of spaces/tabs), where xxxx-xxxx comes from the file idnumber.txt; on a match, found is set to 2.
So found being 2 means we are inside a block that carries the wanted "idnumber" and should be printed.
Whenever found is at least 1, the current input line of compound_library.sdf is appended to the variable text, so the whole block is collected as it is read.
The third regex looks for the terminating $$$$. This is the "real" endpoint of a block: at that point, if found is 2, the buffered block (the $$$$ line included) is printed; either way found is reset to 0 so the next block starts fresh.
while IFS= read -r IdNumber; do
    awk '
    BEGIN {
        found=0
    }
    /^[[:blank:]]*-ISIS-[[:blank:]]*--[[:blank:]]*StrEd[[:blank:]]*--/ {
        # a new block starts: reset the state and the buffer
        found=1
        text=""
    }
    /^>[[:blank:]]*<IDNUMBER>[[:blank:]]*\('"${IdNumber}"'\)/ {
        # this block carries the ID we are looking for
        found=2
        #print "IdNumber='"${IdNumber}"', found=" found >>"/dev/stderr"
    }
    found >= 1 {
        # collect every line of the current block, the $$$$ line included
        text=sprintf("%s%s\n", text, $0)
    }
    /^\$\$\$\$$/ {
        # end of block: print it if it matched, then start over
        if (found == 2)
            printf "%s", text
        found=0
    }' \
    compound_library.sdf
    #compound_library.sdf > "${IdNumber}.sdf"
done < idnumber.txt
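For completeness, the two-file idea from the edit can also be made to work by extracting the ID with match() instead of relying on a fixed field position. A sketch, assuming GNU awk (regex-valued RS and the three-argument match()):
awk '
NR==FNR { want[$1]; next }                # first file: remember the wanted IDs
{
    # each record is one block; pull the ID out of its "> <IDNUMBER> (...)" line
    if (match($0, /<IDNUMBER>[[:blank:]]*\(([^)]+)\)/, m) && (m[1] in want))
        printf "%s$$$$\n", $0             # re-append the terminator that RS consumed
}' idnumber.txt RS='\\$\\$\\$\\$\n' compound_library.sdf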

Increment last number of first line in file

I want to write a shell script which can increment the last value of the first line of a certain file structure:
File-structure:
p cnf integer integer
integer integer ... 0
For Example:
p cnf 11 9
1 -2 0
3 -1 5 0
To:
p cnf 11 10
1 -2 0
3 -1 5 0
The dots should stay the same.
If you can use perl:
perl -pe 's/(-*\d+)$/$1+1/e if $. == 1' inputfile
Here (-*\d+)$ captures the integer value (optionally negative) at the end of the line, the e flag evaluates the replacement as Perl code, so the value is incremented, and $. == 1 restricts the substitution to the first line.
With GNU awk:
awk 'NR==1{$NF++} {print}' file
or
awk 'NR==1{$NF++}1' file
Output:
p cnf 11 10
1 -2 0
3 -1 5 0
$NF contains the last column; NR==1 limits the change to the first line.
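One caveat: assigning to $NF makes awk rebuild the line with single spaces between the fields. If the original spacing must be kept, a sketch of an alternative that edits the line text directly (assuming the last value is non-negative):
awk 'NR==1{sub(/[0-9]+$/, $NF+1)} 1' file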

Optimally finding the index of the maximum element in BASH array

I am using bash in order to process software responses on-the-fly and I am looking for a way to find the
index of the maximum element in the array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
And so I create two arrays. There is
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quick, optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would this make the process easier?
I have also found [1] (from user jhnc) but I don't quite think it is what I want.
My brute-force approach is the following:
function MAX {
    arr1=( 25 72 3 0 0 )
    arr2=( 9 0 3 4 7 )
    local i indx=0
    local max=${arr1[0]}
    local flag
    for ((i=1; i<${#arr1[@]}; i++)); do
        #To avoid invalid arithmetic operators when items are floats/doubles
        flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
        if [ "$flag" == "True" ]; then
            indx=${i}
            max=${arr1[${i}]}
        fi
    done
    echo "MAX:INDEX = ${max}:${indx}"
    echo "${arr1[${indx}]}"
    echo "${arr2[${indx}]}"
}
This approach obviously will work, BUT, is it the optimal one? Is there a faster way to perform the task?
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, if an array contains floats then I would get a
syntax error: invalid arithmetic operator (error token is ".97)
So, I am using
flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
In order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
split(ARGV[1], a1);
split(ARGV[2], a2);
max=a1[1];
indx=1;
for (i in a1) {
if (a1[i] > max) {
indx = i;
max = a1[i];
}
}
print "MAX:INDEX = " max ":" (indx - 1)
print a1[indx]
print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays. (for (i in a1) visits the indices in unspecified order, which is fine when scanning for a maximum; indx - 1 converts awk's 1-based index back to the 0-based index bash uses.)
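With the example arrays this prints:
MAX:INDEX = 72:1
72
0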
It's difficult to do it efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash is not necessarily needed.
Here is a fast, pure-bash, integer-only solution, using comparison:
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
# Get the maximum, and also save its index(es)
for i in "${!arr1[#]}"; do
if ((arr1[i]>arr1_max)); then
arr1_max=${arr1[i]}
max_indexes=($i)
elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
max_indexes+=($i)
fi
done
# Print the results
printf '%s\n' \
"Array1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Here is another optimal method, one that can handle floats. Comparison in bash is avoided altogether; instead the much faster sort(1) is used, and only once, rather than starting a new python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0)
arr2=( 9 0 3 4 7)
arr1_max=$(printf '%s\n' "${arr1[@]}" | sort -n | tail -1)
for i in "${!arr1[@]}"; do
[[ "${arr1[i]}" == "$arr1_max" ]] &&
max_indexes+=($i)
done
# Print the results
printf '%s\n' \
"Array 1 max is $arr1_max" \
"The index(s) of the maximum are:" \
"${max_indexes[#]}" \
"The corresponding values from array 2 are:"
for i in "${max_indexes[#]}"; do
echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed the output of your input script directly into something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk (GNU awk, for asort), including sorting:
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
    col1[a++] = $1
    line[a-1] = $0
}
END {
    asort(col1)            # sorts the values, re-indexing them from 1 to a
    col1_max = col1[a]     # so the maximum ends up at index a, the element count
    print "Max is "col1_max"\nCorresponding numbers are:"
    for (i = 0; i < a; i++) {                 # input order, unlike "for (i in line)"
        if (line[i] ~ "^"col1_max"\\s") {     # anchored, so 72 will not match 172
            split(line[i], max_line)
            print max_line[2]
        }
    }
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, as simply as possible, to get just the maximum of column 1 and any single number from column 2 that corresponds with it:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0

If operator inside for loop

I have an input file as below and need to apply the conversion col1*0 + col2*1 + col3*2 to every 3-column triplet.
input.txt - all positive numbers, can be decimals; the real file has 1000s of columns.
0 0 0 1 0 0
0 1 0 0 0 1
0 0 1 0 0 0
I have the below gawk line that does that:
gawk '{for(i=1;i<=NF;i+=3)x=(x?x FS:"")(($(i+1))+($(i+2)*2));print x;x=y}' input.txt
0 0
1 2
2 0
Additionally, I need to check if the 3 numbers are all zeros; if they are, then the conversion result should be -9.
Pseudo code:
if($i==0 && $(i+1)==0 && $(i+2)==0) {-9} else {$(i+1)+$(i+2)*2}
#or, as all numbers are positive:
if(($i+$(i+1)+$(i+2))==0) {-9} else {$(i+1)+$(i+2)*2}
Expected output:
-9 0
1 2
2 -9
Data description:
This data is output from the IMPUTE2 software - a genotype imputation and haplotype phasing program. Rows are SNPs, columns are samples. Every SNP is represented by 3 columns, i.e. 3 numbers per SNP in the range 0-1 (the probabilities of the alleles AA, AB and BB). So in the above example we have 3 SNPs and 2 samples. Imputation can also be represented as a dosage value: 1 number per SNP in the range 0-2. We are trying to convert the probability format into the dosage format. When IMPUTE2 can't assign probabilities to any of the alleles, it outputs 0 0 0, which we should convert to the no-call value -9.
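For instance, the first SNP of a sample with probabilities 0.1 0.2 0.7 converts to a single dosage value like this:
$ echo '0.1 0.2 0.7' | awk '{print $1*0 + $2*1 + $3*2}'
1.6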
You want the sum to be different if the three given columns are 0. For this, you can expand the ternary operator to something like:
gawk '{ for(i=1;i<=NF;i+=3) {
x=$(i+1) + $(i+2)*2; # the sum
res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)
}
print res; res="" # print stored line and empty for next loop
}' file
That is, append the value -9 if all three elements are 0, and otherwise the calculated x:
res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ? -9 : x)
If all values are positive, the check can be reformatted to just compare if the sum is 0 or not.
($i + $(i+1) + $(i+2)) ? x : -9
Testing with your file, it apparently works:
$ gawk '{for(i=1;i<=NF;i+=3) {x=$(i+1) + $(i+2)*2; res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)} print res; res=""}' file
-9 0
1 2
2 -9
another awk one-liner (assuming non-negative input values; the two triplets of the sample are hard-coded)
$ awk '{c1=$2+2*$3;c2=$5+2*$6; print c1||$1?c1:-9,c2||$4?c2:-9}' lop
-9 0
1 2
2 -9

awk for loop to break up file into chunks

I have a large file that I would like to break into chunks by field 2. Field 2 ranges in value from about 0 to about 250 million.
1 10492 rs55998931 C T 6 7 3 3 - 0.272727272727273 0.4375
1 13418 . G A 6 1 2 3 DDX11L1 0.25 0.0625
1 13752 . T C 4 4 1 3 DDX11L1 0.153846153846154 0.25
1 13813 . T G 1 4 0 1 DDX11L1 0.0357142857142857 0.2
1 13838 rs200683566 C T 1 4 0 1 DDX11L1 0.0357142857142857 0.2
I want field 2 to be broken up into intervals of 50,000, but overlapping by 2,000. For example, the first three awk commands would look like:
awk '$1=="1" && $2>=0 && $2<=50000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.0kb.50kb
awk '$1=="1" && $2>=48000 && $2<=98000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.48kb.98kb
awk '$1=="1" && $2>=96000 && $2<=146000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.96kb.146kb
I know that there's a way I can do this using a for loop with variables like i and j. Can someone help me out?
awk '$1=="1"{n=int($2/48000); print>("chr1." (48*n) "kb." (48*n+50) "kb");n--; if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb");}' Highalt.Lowalt.allelecounts.filteredformissing.freq
Or spread out over multiple lines:
awk '$1=="1"{
n=int($2/48000)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
n--
if (n>=0 && $2/1000<=48*n+50)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
}' Highalt.Lowalt.allelecounts.filteredformissing.freq
How it works
$1=="1"{
This selects all lines whose first field is 1. (You didn't mention this in the text, but your code applied this restriction.)
n=int($2/48000)
This computes which bucket the line belongs in.
print>("chr1." (48*n) "kb." (48*n+50) "kb")
This writes the line to the appropriate file
n--
This decrements the bucket number
if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb")
If this line also fits within the overlapping range of the previous bucket, then write it to that bucket also.
}
This closes the group started by selecting $1=="1".
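As a quick check with a hypothetical one-line input: position 49000 sits inside the 2,000-wide overlap between the first two buckets, so it is written to both files.
$ echo '1 49000 rs0 C T' > tiny.txt
$ awk '$1=="1"{n=int($2/48000); print>("chr1." (48*n) "kb." (48*n+50) "kb"); n--; if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb")}' tiny.txt
$ grep -l 49000 chr1.*
chr1.0kb.50kb
chr1.48kb.98kb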
