Compare values of each record in field 1 to find min and max values (AWK/bash)

I am new to text preprocessing and the AWK language.
I am trying to loop through each record in a given field (field 1), find the max and min values, and store them in variables.
Algorithm:
1) Set Min = 0 and Max = 0
2) Loop through $1 (field 1)
3) Compare each record's value in field 1 and update Max and Min
4) Finally, print Max and Min
This is what I tried:
BEGIN{max = 0; min = 0; NF = 58}
{
    for(i = 0; i < NF-57; i++)
    {
        for(j = 0; j < NR; j++)
        {
            min = (min < $j) ? min : $j
            max = (max > $j) ? max : $j
        }
    }
}
END{print max, min}
#Dataset
f1 f2 f3 f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7
Here, f1, f2, ..., f58 are the fields (columns) in the dataset.
I need to loop through column one (f1) and find its min and max.
Output Required:
Min = 0.2
Max = 0.9
What I get as a result:
Min = '' (I don't get any result)
Max = 9.3 (I get the max of all the fields instead of just field 1)
This is for learning purposes, so I asked about one column so that I can try multiple columns on my own.
This is what I have:
This for loop would only loop 4 times, as there are only four fields. Will the code inside the for loop execute for each record, that is, 5 times?
for(i = 0; i < NF; i++)
{
    if (min[i]=="") min[i]=$i
    if (max[i]=="") max[i]=$i
    if ($i<min[i]) min[i]=$i
    if ($i>max[i]) max[i]=$i
}
END
{
    OFS="\t";
    print "min","max";
    # If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
    for(i=0; i < NF; i++;)
    {
        print min[i], max[i]
    }
}

Here is a working solution, which is really much simpler than what you are doing:
sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'
The pattern /^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is indeed a valid number; lines that don't match are discarded.
If you don't use this, then min and max need to be initialized, for example with the first value:
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'
Readable versions:
sort -n | awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    a[c++]=$1
}
END {
    OFS="\t"
    print "min","max"
    print a[0],a[c-1]
}'
and
awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    if (min=="") min=$1
    if (max=="") max=$1
    if ($1<min) min=$1
    if ($1>max) max=$1
}
END {
    OFS="\t"
    print "min","max"
    print min, max
}'
On your input, it outputs:
min max
0.2 0.9
EDIT (replying to the comment asking for more information on how awk works):
Awk loops through lines (called records), and within each line the columns (called fields) are available. Each awk iteration reads one line and provides, among others, the NR and NF variables. In your case you are only interested in the first column, so you only use $1, the first field. For each record where $1 matches /^-?[0-9]*(\.[0-9]*)?$/, a regex matching positive and negative integers or floats, we either store the value in an array a (in the first version) or update the min/max variables if needed (in the second version).
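To see this record/field model in action, here is a throwaway illustration (the sample lines are made up):
printf 'a b c\nd e\n' | awk '{print "record", NR, "has", NF, "fields; first field is", $1}'
which prints:
record 1 has 3 fields; first field is a
record 2 has 2 fields; first field is d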
Here is the explanation for the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ (a quick test of the regex follows the list):
$1 ~ means we are checking if the first field $1 matches the regex between slashes
^ means we start matching from the beginning of the $1 field
-? means an optional minus sign
[0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
()? means an optional block which can be present or not
\.[0-9]* if that optional block is present, it should start with a dot and contain zero or more digits (so -. or . can be matched! adapt the regex if you have uncertain input)
$ means we are matching until the last character from the $1 field
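To test which values the regex accepts (the sample values here are made up for illustration):
printf '0.3\n-1.2\n.5\nabc\n1e3\n' | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {print $1, "matches"}'
which prints:
0.3 matches
-1.2 matches
.5 matches
(abc and 1e3 are rejected; note that scientific notation is not covered by this regex.)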
If you wanted to loop through fields, you would have to use a for loop from 1 to NF (inclusive), like this:
echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'
(note that for simplicity I have not validated the input here)
Which outputs:
min max
1 4
If your input had more lines, awk would also process them after reading the first record. Example with this input:
1 2 3 4
5 6 7 8
Outputs:
min max
1 8
To prevent this and only work on the first line, you can either add a condition like NR == 1, or add an exit statement after the for loop to stop reading input after the first record, as sketched below.
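A minimal sketch of the exit variant (based on the paragraph above):
printf '1 2 3 4\n5 6 7 8\n' | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$i; if (max=="") max=$i; if ($i<min) min=$i; if ($i>max) max=$i}; exit} END {OFS="\t"; print "min","max"; print min, max}'
which now outputs:
min max
1 4
because exit jumps straight to the END block after the first record.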

If you're only looking at column 1, you may try this:
awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset
The script only considers lines starting with a digit, and sets min or max whenever it finds a smaller or larger value (or none was set before).
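For the dataset in the question, this prints:
0.2 0.9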

Related

Bash iterate through fields of a TSV file and divide it by the sum of the column

I have a TSV file with several columns, and I would like to iterate through each field and divide it by the sum of its column:
Input:
A 1 2 1
B 1 0 3
Output:
A 0.5 1 0.25
B 0.5 0 0.75
I have the following to iterate through the fields, but I am not sure how I can find the sum of the column that the field is located in:
awk -v FS='\t' -v OFS='\t' '{for(i=2;i<=NF;i++){$i=$i/SUM_OF_COLUMN}} 1' input.tsv
You may use this 2-pass awk:
awk '
BEGIN {FS=OFS="\t"}
NR == FNR {                      # first pass: FNR resets per file, NR does not
    for (i=2; i<=NF; ++i)
        sum[i] += $i             # accumulate each column sum
    next
}
{                                # second pass over the same file
    for (i=2; i<=NF; ++i)
        $i = (sum[i] ? $i/sum[i] : 0)
}
1' file file
A 0.5 1 0.25
B 0.5 0 0.75
With your shown samples, please try the following awk code, which works in a single pass over the input file. It creates two arrays: one holding the sum of each column by column index, and one holding every field value keyed by line and field number. In the END block it traverses all FNR lines and prints the stored values, dividing each value by the sum of its column.
awk '
BEGIN{ FS=OFS="\t" }
{
    arr[FNR,1]=$1                  # keep the label column as-is
    for(i=2;i<=NF;i++){
        sum[i]+=$i                 # running sum per column
        arr[FNR,i]=$i              # remember every value for the END block
    }
}
END{
    for(i=1;i<=FNR;i++){
        printf("%s\t",arr[i,1])
        for(j=2;j<=NF;j++){
            printf("%s%s",sum[j]?(arr[i,j]/sum[j]):"N/A",j==NF?ORS:OFS)
        }
    }
}
' Input_file

Average of first ten numbers of text file using bash

I have a file of two columns. The first column contains dates and the second a corresponding number; the two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
    x = $2;                   # current value
    i = NR % n;               # slot in a circular buffer q of the last n values
    ma += (x - q[i]) / n;     # rolling mean: new value comes in, oldest drops out
    q[i] = x;                 # overwrite the oldest value with the current one
    if(NR>=n) print ma;       # print only once a full window has been seen
}' file.txt
2
2
4
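As a sanity check against the sample: the three-value windows are (1,1,4), (1,4,1) and (4,1,7), whose means are 2, 2 and 4, matching the output above.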
Or the one below, which is useful for plotting: it keeps the reference axis (in your case the date) at the center of each averaging window.
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
    m = int((n+1)/2)           # offset of the middle element of the window
}
{ L[NR]=$2; sum+=$2 }          # remember each value and keep a running sum
NR>=m { d[++i]=$1 }            # collect the date sitting at the window center
NR>n  { sum-=L[NR-n] }         # drop the value that just left the window
NR>=n { a[++k]=sum/n }         # store the average of the current window
END {
    for (j=1; j<=k; j++)
        print d[j],a[j]        # remove d[j] if you just want the values
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
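Note the labels: m = int((n+1)/2) makes d[] collect the date in the middle of each window, so the average of dates 1-3 is printed next to date2, and so on.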
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little bit of math trickery here: each record's $2 is stored in a[NR%3], so the values in the three array elements are updated cyclically, and the sum of a[0], a[1] and a[2] is always the sum of the past 3 numbers.
Updated based on the helpful feedback from Ed Morton.
Here's a quick and dirty script that does what you asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
    Numbers[NR]=$2;
    if ( NR >= 3 ) {
        printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
    }
}
BEGIN {
    FS=","
}
Explanation:
// { ... }: "/" is the match operator, and an empty match means "do this on every line".
Numbers[NR]=$2: use the record number (NR) as the key and store the value from column 2.
if ( NR >= 3 ): once 3 or more values have been read from the file...
printf(...): ...do the math and print the result as an integer.
BEGIN block: change the field separator to a comma ",".

improve bash loop with awk split

The awk below, improved by @hek2mgl, runs, but it takes ~15 hours to complete. It is basically matching input files of 21-259 records against a file of 11,137,660 records. That is a lot, but hopefully it can be made faster. Maybe splitting $5 on the hyphen (AGRN-6|gc=75 into AGRN and 6|gc=75) could speed up the process; not sure if the below is a start or not. Essentially, it uses the input files (there are 4 of them) to search and match in a large 11,000,000-record file. Thank you :).
input
AGRN
CCDC39
CCDC40
CFTR
file that is searched in
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
output ($4, $5, average of $7)
chr1:955543 AGRN-6|gc=75 1.3
awk
BEGIN{FS="[\t| -]+"}
# Read search terms from file1 into 's'
FNR==NR {
    s[$0=1]
    next
}
{
    # Check if $5 matches one of the search terms
    for(i in s) {
        if($5 ~ i) {
            # check for match: if s[$5] exists
            s[$5] {
                # Store first two fields for later usage
                a[$5]=$1
                b[$5]=$2
                # Add $9 to total of $9 per $5
                t[$5]+=$8
                # Increment count of occurrences of $5
                c[$5]++
                next
            }
        }
    }
}
END {
    # Calculate average and print output for all search terms
    # that have been found
    for( i in t ) {
        avg = t[i] / c[i]
        printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
    }
}
Simplify:
awk '
NR == FNR { input[$0]; next }      # first file: remember search terms as array keys
{
    split($5, a, "-")              # a[1] is the part of $5 before the hyphen, e.g. AGRN
    if (a[1] in input) {
        key = $4 OFS $5
        n[key]++                   # count occurrences per key
        sum[key] += $7             # total of $7 per key
    }
}
END {
    for (key in n)
        printf "%s %.1f\n", key, sum[key]/n[key]
}
' input file
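The main speed win here (an observation, not a benchmark): a[1] in input is a single hash lookup per line, instead of looping over every search term and running a regex match for each, and the averages are accumulated in one pass over the big file.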
Your code is broken because of the over-use of arrays, but mainly this:
FNR==NR {
s[$0=1]
# ^^^^^
next
}
Array s will only ever have a single key, the string "1", because $0=1 first assigns 1 to $0 and then uses that as the array key. You should write
s[$0] = 1
I'd be interested to hear how fast the following is. I'm not sure it will be much slower, since it doesn't require awk to do anything clumsy, but it still needs one pass over the search file per input selection. If you want to optimize it, I think you need to use associative arrays and hash each input selection into its own array; that way it can be done in one pass over the file. Unless you can skip searching after the first match, you may only be slightly quicker.
Input file: select.txt
Search file: search_file.txt
while IFS= read -r a; do
    awk "BEGIN {cnt=0;var=0} { if (\$5 ~ \"${a}\") { var=var+\$7; field4=\$4; cnt+=1; field5=\$5; }; } END {print field4\" \"field5\" \"var/cnt}" search_file.txt
done < select.txt

Hi, trying to obtain the mean from the array values using awk?

I'm new to bash programming. Here I'm trying to obtain the mean of the array values.
Here's what I'm trying:
${GfieldList[@]} | awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n; }'
Using $1 I'm not able to get all the values. Please help me out with this...
For each non-empty line of input, this will sum everything on the line and print the mean:
$ echo 21 20 22 | awk 'NF {sum=0;for (i=1;i<=NF;i++)sum+=$i; print "mean=" sum / NF; }'
mean=21
How it works
NF
This serves as a condition: the statements which follow will only be executed if the number of fields on this line, NF, evaluates to true, meaning non-zero.
sum=0
This initializes sum to zero. This is only needed if there is more than one line.
for (i=1;i<=NF;i++)sum+=$i
This sums all the fields on this line.
print "mean=" sum / NF
This prints the sum of the fields divided by the number of fields.
The bare
${GfieldList[@]}
will not print the array to the screen. You want this:
printf "%s\n" "${GfieldList[@]}"
All those quotes are definitely needed.
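Putting the two together, the asker's pipeline works once the array is actually printed; a minimal sketch, with made-up sample values:
GfieldList=(21 20 22)   # hypothetical values for illustration
printf "%s\n" "${GfieldList[@]}" | awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n }'
mean: 21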

Bash First Element in List Recognition

I'm very new to Bash, so I'm sorry if this question is actually very simple. I am dealing with a text file that contains many vertical lists of the numbers 2-32, counting up by 2, with each number followed by other text on its line. The problem is that some of the lists are missing numbers. Any pointers for code that could go through, check whether each number is there, and if not, add a line with the missing number?
One list might look like:
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
All the way down to 32. How would I, in this case, make it so the 10 is recognized as missing and then added in above the 12? Thanks!
Here's one awk-based solution (formatted for readability, not necessarily how you would type it):
awk ' { value[0 + $1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i]
}' input.txt
It basically just records the existing lines in a key/value pair (associative array), then at the end, prints all the records you care about, along with the (possibly empty) value saved earlier.
Note: if the first column needs to be seen as a string instead of an integer, this variant should work:
awk ' { value[$1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i ""]
}' input.txt
You can use awk to figure out the missing lines and add them back (i tracks the next expected even number; gaps before the current line are filled by the while loop, and the END block pads out to 32):
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
Testing:
cat file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
20 daflgkdsakfjhasdlkjhfasdjkhf
24 dlsagflakdjshgflksdhflksdahfl
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8
10
12
14
16
18
20 daflgkdsakfjhasdlkjhfasdjkhf
22
24 dlsagflakdjshgflksdhflksdahfl
26
28
30
32
