Find lowest value and index of floating point array with awk, sed, sort

I have the following array:
echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
I have written code to sort the values and also get the index numbers:
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1
0.2 11
0.2 19
0.3 1
0.3 10
0.3 2
0.3 9
0.4 12
0.4 13
0.4 14
0.4 15
0.4 18
0.4 3
0.4 4
0.4 5
0.4 6
0.4 7
0.4 8
0.5 16
0.5 17
I am having a difficult time extracting only the rows which have the lowest value in the first column (i.e., the lowest values in the array overall). The desired final product for the above example would be:
0.2 11
0.2 19
It should be able to handle cases with one or multiple lowest-value indices. The solution does not need to use awk, sort, sed, or any particular commands; anything could work (this is just as far as I have gotten with the final task).

Print the output until the number in the first column does not change.
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1 |
awk 'length(last) == 0 || last == $1 { last=$1; print; }'
Notes:
It's best to always quote variable expansions: echo "$array".
If you don't quote $array, you could just use printf "%s\n" $array.
You could use nl to number lines (but the columns order would be different).
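Putting those notes together, a rough sketch (it keeps the unquoted $array on purpose, relying on word splitting as the question does, and reorders nl's columns back to value-then-index):
printf "%s\n" $array |                        # one value per line (relies on word splitting)
    nl -ba |                                  # prefix every line with its line number
    sort -g -k2,2 |                           # sort numerically by the value column
    awk 'NR==1{min=$2} $2==min{print $2, $1}' # keep only minimum rows, printed as "value index"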

Using the asort() function in GNU awk
awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]} '
Demo:
$ echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
$ echo $array | awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]}'
0.2 11
0.2 19
$
Explanation:
{split($0,a); -- initialize array a from the input record
for (i in a) a[i]=a[i]" "i; -- append each element's original position to its value
n=asort(a); -- call the array sort function and store the number of elements in variable n
for (i = 1; i <= 2; i++) -- loop over the first 2 elements of the sorted array (2 is hardcoded for this demo)
print a[i]}
Documentation on asort()
P.S. Storing the number of elements in n was not required.
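If the number of tied minimum values is not known in advance, a possible variation (just a sketch; still GNU awk, and still sorting the combined "value index" strings exactly like the original) keeps printing while the numeric value equals the smallest one:
echo $array | awk '{
    split($0, a)                                  # a[i] = i-th value
    for (i in a) a[i] = a[i] " " i                # append the original position
    n = asort(a)                                  # sort; n = number of elements
    for (i = 1; i <= n && a[i]+0 == a[1]+0; i++)  # stop once the value exceeds the minimum
        print a[i]
}'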

Related

Merge many tab separated files based on first column

I have many TSV files in a directory that have only three columns. I want to merge all of them based on the first-column value (the columns have headers that I need to maintain); if that value is present, the corresponding second- and third-column values must be added, and if the value is missing in a file, NA should be added instead, and so on (see example). Files might have different numbers of lines and are not ordered by the first column, although this can easily be fixed with sort.
I have tried join, but that works nicely for only two files. Can join be expanded to all files in a directory? Here is an example with just three files:
S01.tsv
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
S02.tsv
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
S03.tsv
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
Outfile should be:
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3
OK, I've tried with awk by taking help from here, but was not successful:
BEGIN { OFS="\t" } # tab separated columns
FNR==1 { f++ } # counter of files
{
a[0][$1]=$1 # reset the key for every record
for(i=2;i<=NF;i++) # for each non-key element
a[f][$1]=a[f][$1] $i ( i==NF?"":OFS ) # combine them to array element
}
END { # in the end
for(i in a[0]) # go thru every key
for(j=0;j<=f;j++) # and all related array elements
printf "%s%s", a[j][i], (j==f?ORS:OFS)
} # output them, nonexistent will output empty
I would harness GNU AWK for this task the following way. Let S01.tsv content be
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
and S02.tsv content be
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
and S03.tsv content be
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
then
awk 'BEGIN{OFS="\t"}NR==1{title=$1 OFS $2}{arr[$1 OFS $2][FILENAME]=$3}END{print title,arr[title]["S01.tsv"],arr[title]["S02.tsv"],arr[title]["S03.tsv"];delete arr[title];for(i in arr){print i,"S01.tsv" in arr[i]?arr[i]["S01.tsv"]:"NA","S02.tsv" in arr[i]?arr[i]["S02.tsv"]:"NA","S03.tsv" in arr[i]?arr[i]["S03.tsv"]:"NA"}}' S01.tsv S02.tsv S03.tsv
gives output
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
EU236327 1 NA 0.14 0.5
AM930424 1 0.3 0.25 NA
EU434346 2 NA 0.2 0.3
AY664038 2 0.5 NA NA
Explanation: I store the data in the 2D array arr, using the values from the 1st and 2nd columns concatenated with the output field separator as the first dimension and the filename as the second dimension. The values stored in the array are the values from the 3rd column. After the data is collected, I start by printing the title (header) row, which I then delete from the array; then I iterate over the first dimension of the array, and for each element I print the key followed by the value from each file, or NA if there was no value. Observe that I use the in check rather than testing the truthiness of the value itself, as that would turn 0 values into NAs. Disclaimer: this solution assumes you accept any order of output rows beyond the header; if this does not hold, do not use this solution.
(tested in GNU Awk 5.0.1)
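For readability, here is the one-liner spelled out with comments (the same logic, only reformatted; the file names are still hardcoded, as in the original):
awk '
BEGIN { OFS = "\t" }
NR == 1 { title = $1 OFS $2 }        # key of the header row, e.g. Accesion OFS Val
{ arr[$1 OFS $2][FILENAME] = $3 }    # 2D array: (col1 OFS col2) x filename -> col3
END {
    print title, arr[title]["S01.tsv"], arr[title]["S02.tsv"], arr[title]["S03.tsv"]
    delete arr[title]                # the header row has been printed, drop it
    for (i in arr)
        print i, ("S01.tsv" in arr[i] ? arr[i]["S01.tsv"] : "NA"), ("S02.tsv" in arr[i] ? arr[i]["S02.tsv"] : "NA"), ("S03.tsv" in arr[i] ? arr[i]["S03.tsv"] : "NA")
}' S01.tsv S02.tsv S03.tsv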
Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
FNR == 1 {                              # header line of each input file
    if ( NR == 1 ) {
        numCols = split($0,hdrs)        # first file: keep all of its header fields
    }
    else {
        hdrs[++numCols] = $3            # later files: append only their sample column header
    }
    next
}
{
    accsValsCols2ss[$1][$2][numCols] = $3   # this file's value for (accession, val)
}
END {
    for ( colNr=1; colNr<=numCols; colNr++ ) {
        printf "%s%s", hdrs[colNr], (colNr<numCols ? OFS : ORS)
    }
    PROCINFO["sorted_in"] = "#ind_str_asc"      # accessions in string order
    for ( acc in accsValsCols2ss ) {
        PROCINFO["sorted_in"] = "#ind_num_asc"  # values in numeric order
        for ( val in accsValsCols2ss[acc] ) {
            printf "%s%s%s", acc, OFS, val
            for ( colNr=3; colNr<=numCols; colNr++ ) {
                s = ( colNr in accsValsCols2ss[acc][val] ? accsValsCols2ss[acc][val][colNr] : "NA" )
                printf "%s%s", OFS, s
            }
            print ""
        }
    }
}
$ awk -f tst.awk S01.tsv S02.tsv S03.tsv
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
And call it "0_0.0" (it should contain "2"), and do this for all combinations, only considering the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
done
done
rm 0_0 #this 0_0 output is badly done, the good way is with the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that for example, line
0.0142 0
will not be recognized by the iteration "0.0 0", since there are digits after the "0.0". Removing the -F option from grep in order to consider all numbers that start with "0.0" will not work either, since the dot would be treated as a wildcard, and therefore, for example, in the iteration "0.1 0" the line
0.0142 0
would be counted, because 0.0142 would match 0"anything"1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the guys who invented shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here's what it'll be outputting to the files it creates:
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
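If you do want to stick with grep, a possible sketch (it drops -F and uses an anchored basic regex instead, escaping the dots so they stay literal; it assumes the two columns are separated by a single space, as in the sample, and keeps the nested loops the answer above advises against):
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
    for j in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
        pi=${i//./\\.}; pj=${j//./\\.}                 # escape the dots for the regex
        grep -c "^${pi}[0-9]* ${pj}[0-9]*$" file > "${i}_${j}"
    done
done
grep -c prints the number of matching lines, so the separate wc -l and the pcregrep special case for 0_0 are no longer needed.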

Sum values of specific column in multiple files, considering ranges defined in another file

I have a file (let say file B) like this:
File B:
A1 3 5
A1 7 9
A2 2 5
A3 1 3
The first column defines a filename and the other two define a range in that specific file. In the same directory, I also have three more files (files A1, A2 and A3). Here are 10 sample lines from each file:
File A1:
1 0.6
2 0.04
3 0.4
4 0.5
5 0.009
6 0.2
7 0.3
8 0.2
9 0.15
10 0.1
File A2:
1 0.2
2 0.1
3 0.2
4 0.4
5 0.2
6 0.3
7 0.8
8 0.1
9 0.9
10 0.4
File A3:
1 0.1
2 0.2
3 0.5
4 0.3
5 0.7
6 0.3
7 0.3
8 0.2
9 0.8
10 0.1
I need to add a new column to file B which, for each line, gives the sum of the values of column two of the specified file over the specified line range. For example, row 1 of file B means: calculate the sum of the values in lines 3 to 5 of column two of file A1.
The desired output is something like this:
File B:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
All files are in tabular text format. How can I perform this task? I have access to bash (Ubuntu 14.04) and R but am not an expert bash or R programmer.
Any help would be greatly appreciated.
Thanks in advance
Given the first file fileB and 3 input files A1, A2 and A3, each with two columns, this produces the output that you want:
#!/bin/bash
while read -r file start end; do
    sum=$(awk -vs="$start" -ve="$end" 'NR==s,NR==e{sum+=$2}END{print sum}' "$file")
    echo "$file $start $end $sum"
done < fileB
This uses awk to sum up the column-two values for the lines in the range specified by the variables s and e. It's not particularly efficient, as it reads $file once per line of fileB, but depending on the size of your inputs, that may not be a problem.
Output:
A1 3 5 0.909
A1 7 9 0.65
A2 2 5 0.9
A3 1 3 0.8
To redirect the output to a file, just add > output_file to the end of the loop. To overwrite the original file, you will need to first write to a temporary file, then overwrite the original file (e.g. > tmp && mv tmp fileB).
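If the inputs are large, a possible single-pass alternative (a sketch; it assumes the A files are listed on the awk command line and are named exactly as in column one of fileB) loads the ranges first and then scans each data file once:
awk '
NR == FNR { f[NR]=$1; s[NR]=$2; e[NR]=$3; n=NR; next }    # load the ranges from fileB
{
    for (i = 1; i <= n; i++)                              # add this line to every matching range
        if (f[i] == FILENAME && FNR >= s[i] && FNR <= e[i])
            sum[i] += $2
}
END { for (i = 1; i <= n; i++) print f[i], s[i], e[i], sum[i]+0 }
' fileB A1 A2 A3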

Grep a pattern and ignore others

I have an output with this pattern :
Auxiliary excitation energy for root 3: (variable value)
It appears a fair number of times in the output, but I only want to grep the last one.
I'm a beginner in bash, so I haven't understood the "tail" function yet...
Here is what I wrote:
for nn in 0.00000001 0.4 1.0; do
for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
a=`grep ' Auxiliary excitation energy for root 3: ' $nn"_"$w.out`
echo $w" "${a:47:16} >> data_$nn.dat
done
done
With $nn and $w parameters.
But with this grep I only get the first pattern. How can I get only the last one?
Data example:
line 1 Auxiliary excitation energy for root 3: 0.75588889
line 2 Auxiliary excitation energy for root 3: 0.74981555
line 3 Auxiliary excitation energy for root 3: 0.74891111
line 4 Auxiliary excitation energy for root 3: 0.86745155
My command greps line 1; I would like to grep the last line that has my pattern: here, line 4 in my example.
To get the last match, you can use:
grep ... | tail -n 1
Where ... are your grep parameters. So your script would read (with a little cleanup):
for nn in 0.00000001 0.4 1.0; do
    for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
        a=$( grep ' Auxiliary excitation energy for root 3: ' $nn"_"$w.out | tail -n 1 )
        echo $w" "${a:47:16} >> data_$nn.dat
    done
done
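An alternative sketch that avoids reading the whole file when only the last match matters (assuming GNU tac is available and the same file naming as in the question):
a=$( tac $nn"_"$w.out | grep -m 1 ' Auxiliary excitation energy for root 3: ' )
tac prints the file last line first, so grep -m 1 stops at the first match it sees, which is the last occurrence in the original order.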

Matching numbers in two different files using awk

I have two files (f1 and f2), both made of three columns, of different lengths. I would like to create a new file of four columns in the following way:
f1          f2
1 2 0.2     1 4 0.3
1 3 0.5     1 5 0.2
1 4 0.2     2 3 0.6
2 2 0.5
2 3 0.9
If the numbers in the first two columns are present in both files, then we print the first two numbers and the third number from each file (e.g. 1 4 is in both, so f3 should contain 1 4 0.2 0.3); otherwise, if the first two numbers are missing from f2, just print a zero in the fourth column.
The complete results of these example should be
f3
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
The script that I wrote is the following:
awk '{str1=$1; str2=$2; str3=$3;
getline < "f2";
if($1==str1 && $2==str2)
print str1,str2,str3,$3 > "f3";
else
print str1,str2,str3,0 > "f3";
}' f1
but it only checks whether the same two numbers are in the same row (it does not go through the whole of f2), giving as results:
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0
2 2 0.5 0
2 3 0.9 0
This awk should work:
awk 'FNR==NR{a[$1,$2]=$3;next} {print $0, (a[$1,$2])? a[$1,$2]:0}' f2 f1
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
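For readability, the same logic spelled out with comments (a sketch; it additionally uses the in operator, so a value of 0 stored from f2 would not be mistaken for a missing pair):
awk '
FNR == NR {        # first file on the command line (f2): remember its third column per (col1, col2)
    a[$1,$2] = $3
    next
}
{                  # second file (f1): print the row, then the matching f2 value or 0
    print $0, ((($1,$2) in a) ? a[$1,$2] : 0)
}' f2 f1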
