How to count the occurrence of negative and positive values in a column using awk? - bash

I have a file that looks like this:
FID IID data1 data2 data3
1 RQ00001-2 1.670339 -0.792363849 -0.634434791
2 RQ00002-0 -0.238737767 -1.036163943 -0.423512414
3 RQ00004-9 -0.363886913 -0.98661685 -0.259951265
3 RQ00004-9 -9 -0.98661685 0.259951265
I want to count the number of positive numbers in column 3 (data1) versus negative numbers, excluding -9, which stands for missing data. So for column 3 it will be 1 positive vs 2 negative. For data2 this would be 4 negative versus 0 positive, and for the last column 3 negative versus 1 positive.
I would prefer to use awk, but since I am new to it I need help. The command below counts all the negative values, but I need it to exclude -9. Is there a more sophisticated way of doing this?
awk '$3 ~ /^-/{cnt++} END{print cnt}' filename.txt

Assumptions:
determine the number of negative and positive values for the 3rd thru Nth columns
One awk idea:
awk '
NR>1 { for (i=3;i<=NF;i++) {
           if ($i == -9) continue
           else if ($i < 0) neg[i]++
           else pos[i]++
       }
     }
END  { printf "Neg/Pos"
       for (i=3;i<=NF;i++)
           printf "%s%s/%s",OFS,neg[i]+0,pos[i]+0
       print ""
     }
' filename.txt
This generates:
Neg/Pos 2/1 4/0 3/1
NOTE: The OP hasn't provided an example of the expected output; all of the counts are stored in the arrays, so modifying the output format should be straightforward once a sample output is provided.
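For example, swapping in this END block (a sketch; the "column i" labels are just one possible format) prints one labelled line per column:
END  { for (i=3;i<=NF;i++)
           printf "column %d: %d negative, %d positive\n", i, neg[i]+0, pos[i]+0
     }
On the sample file this would print "column 3: 2 negative, 1 positive" and so on.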

You can use this awk solution:
awk -v c=3 '
NR > 1 && $c != -9 {
    if ($c < 0)
        ++neg
    else
        ++pos
}
END {
    printf "Positive: %d, Negative: %d\n", pos, neg
}' file
Positive: 1, Negative: 2
Running it with c=5:
awk -v c=5 'NR > 1 && $c != -9 {if ($c < 0) ++neg; else ++pos} END {printf "Positive: %d, Negative: %d\n", pos, neg}' file
Positive: 1, Negative: 3
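If you want counts for every data column, one option (a sketch reusing the same one-liner, assuming columns 3 through 5 as in the sample file) is a small shell loop:
for c in 3 4 5; do
    printf 'column %s: ' "$c"
    awk -v c="$c" 'NR > 1 && $c != -9 {if ($c < 0) ++neg; else ++pos} END {printf "Positive: %d, Negative: %d\n", pos, neg}' file
done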

$ awk '
NR == 1 {
    for(i = 3; i <= NF; i++) header[i] = $i
}
NR > 1 {
    for(i = 3; i <= NF; i++) {
        pos[i] += ($i >= 0); neg[i] += (($i != -9) && ($i < 0))
    }
}
END {
    for(i in pos) {
        if (header[i] == "") header[i] = "column " i
        printf("%-10s: %d positive, %d negative\n", header[i], pos[i], neg[i])
    }
}' file
data1 : 1 positive, 2 negative
data2 : 0 positive, 4 negative
data3 : 1 positive, 3 negative

awk '
NR > 1 && $3 != -9 {$3 >= 0 ? ++p : ++n}
END {print "pos: "p+0, "neg: "n+0}'
Gives:
pos: 1 neg: 2
You can change ++n to --p to get a single number p, equal to number of positive minus number of negative.
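A sketch of that single-number variant (for the sample file it prints -1, i.e. 1 positive minus 2 negative):
awk '
NR > 1 && $3 != -9 {$3 >= 0 ? ++p : --p}
END {print "pos - neg: " p+0}' file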

Below are some examples of how you can achieve this:
Note: we assume that -0.0 and 0.0 are positive.
Count negative numbers in column n (passing n in with -v; here n=3):
$ awk -v n=3 '(FNR>1){c+=($n<0)}END{print "pos:",(NR-1-c),"neg:"c+0}' file
Count negative numbers in column n, but ignore -9 (since -9 is negative, it is already excluded from the positive count NR-1-c):
$ awk -v n=3 '(FNR>1){c+=($n<0);d+=($n==-9)}END{print "pos:",(NR-1-c),"neg:"c-d}' file
Count negative numbers in columns m to n:
$ awk -v m=3 -v n=5 '(FNR>1){for(i=m;i<=n;++i) c[i]+=($i<0)}
       END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]+0}' file
Count negative numbers in columns m to n, but ignore -9:
$ awk -v m=3 -v n=5 '(FNR>1){for(i=m;i<=n;++i) {c[i]+=($i<0);d[i]+=($i==-9)}}
       END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]-d[i]}' file
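Run against the sample file (with m=3 and n=5 as above), the last variant prints:
3 pos: 1 neg:2
4 pos: 0 neg:4
5 pos: 1 neg:3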

Related

Bash iterate through fields of a TSV file and divide it by the sum of the column

I have a tsv file with several columns, and I would like to iterate through each field, and divide it by the sum of that column:
Input:
A 1 2 1
B 1 0 3
Output:
A 0.5 1 0.25
B 0.5 0 0.75
I have the following to iterate through the fields, but I am not sure how I can find the sum of the column that the field is located in:
awk -v FS='\t' -v OFS='\t' '{for(i=2;i<=NF;i++){$i=$i/SUM_OF_COLUMN}} 1' input.tsv
You may use this 2-pass awk:
awk '
BEGIN {FS=OFS="\t"}
NR == FNR {
    for (i=2; i<=NF; ++i)
        sum[i] += $i
    next
}
{
    for (i=2; i<=NF; ++i)
        $i = (sum[i] ? $i/sum[i] : 0)
}
1' file file
A 0.5 1 0.25
B 0.5 0 0.75
With your shown samples, please try the following awk code, which works in a single pass of the input file. It builds two arrays: one holding the sum of each column, and one holding every field value keyed by line number and field number. The END block then walks all FNR lines and prints each stored value divided by the sum of its column.
awk '
BEGIN{ FS=OFS="\t" }
{
    arr[FNR,1]=$1
    for(i=2;i<=NF;i++){
        sum[i]+=$i
        arr[FNR,i]=$i
    }
}
END{
    for(i=1;i<=FNR;i++){
        printf("%s\t",arr[i,1])
        for(j=2;j<=NF;j++){
            printf("%s%s",sum[j]?(arr[i,j]/sum[j]):"N/A",j==NF?ORS:OFS)
        }
    }
}
' Input_file
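Running this on the same input produces the same output as the two-pass version:
A 0.5 1 0.25
B 0.5 0 0.75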

How to calculate the mean of row from csv file from nth column?

This may look like a duplicate, but I could not solve the issue I'm having.
I'm trying to find the average of each row (from the 5th column onwards) of a CSV/TSV file; the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error,
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
The code below also threw a syntax error:
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Help me resolve the issue and calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
    # missing the closing ); this prints the 1st column
    #printf("%s\t", $1
    printf("%s\t", $1 )
    # missing the closing ), and averaging only 3 columns
    #printf("%.2f\n", ($5 + $6 + $7)/3
    printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second attempt is not easy to work with: lots of subshells (backticks) and a shell loop. But most of all, expr handles only integer values, and your loop feeds it every field of the line rather than just fields 5 to 9. Forget it unless you really don't want awk for this.
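For example, expr only does integer arithmetic, so it fails outright on the decimal values in this file (the exact message varies by implementation):
$ expr 0.9 + 0.5
expr: non-integer argument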
For fun:
awk 'NR==1{
    # Header
    print $0 OFS "Avg"
    Count = NF - 4
    next
}
{
    # print each element of the line and sum the fields after col 4
    Avg = 0
    for( i=1; i<=NF; i++ ) {
        if( i >= 5 ) Avg += $i
        printf( "%s ", $i )
    }
    # print the average
    printf( "%.2f\n", Avg/Count )
}
' input.tsv
This assumes every line carries the full set of values. If some lines have fewer values (and empty fields should not count), compute Count per line as (NF - 4) inside the main block instead of once from the header.
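A minimal sketch of that per-line variant (an assumption on my part, since the sample input has no ragged lines):
awk 'NR==1{ print $0 OFS "Avg"; next }
{
    Avg = 0
    Count = NF - 4                  # recomputed for every line
    for (i=5; i<=NF; i++) Avg += $i
    print $0, (Count > 0 ? sprintf("%.2f", Avg/Count) : "NA")
}' input.tsv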
You could use this awk script:
awk 'NR>1{
    for(i=5;i<=NF;i++)
        sum+=$i
}
{
    print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
    sum=0
}' file | column -t
The first block computes the sum of all fields starting from the 5th.
The second block prints the header line and the average value.
column -t aligns the result in columns.
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
    sum = cnt = 0
    for (i=5; i<=NF; i++) {
        sum += $i
        cnt++
    }
    avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0; foreach (@F[4..8]) { $s += $_ } $F[4] = $s==0 ? "Avg" : $s/5; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" }' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404

Compare values of each records in field 1 to find min and max values AWK

I am new to text preprocessing and the AWK language.
I am trying to loop through each record in a given field (field1), find the max and min values, and store them in variables.
Algorithm :
1) Set Min = 0 and Max = 0
2) Loop through $1(field 1)
3) Compare FNR of the field 1 and set Max and Min
4) Finally print Max and Min
This is what I tried:
BEGIN{max = 0; min = 0; NF = 58}
{
for(i = 0; i < NF-57; i++)
{
for(j =0; j < NR; j++)
{
min = (min < $j) ? min : $j
max = (max > $j) ? max : $j
}
}
}
END{print max, min}
#Dataset
f1 f2 f3 f4 .... f58
0.3 3.3 0.5 3.6
0.9 4.7 2.5 1.6
0.2 2.7 6.3 9.3
0.5 3.6 0.9 2.7
0.7 1.6 8.9 4.7
Here, f1,f2,..,f58 are the fields or columns in Dataset.
I need to loop through column one(f1) and find Min-Max.
Output Required:
Min = 0.2
Max = 0.9
What I get as a result:
Min = '' (I don't get any result)
Max = 9.3 (I get the max of all the fields instead of just field1)
This is for learning purposes, so I asked about one column so that I can try multiple columns on my own.
This is what I have:
This for loop would only loop 4 times, as there are only four fields. Will the code inside the for loop execute for each record, that is, 5 times?
for(i = 0; i < NF; i++)
{
if (min[i]=="") min[i]=$i
if (max[i]=="") max[i]=$i
if ($i<min[i]) min[i]=$i
if ($i>max[i]) max[i]=$i
}
END
{
OFS="\t";
print "min","max";
#If I am not wrong, I saved the data in an array and I guess this would be the right way to print all min and max?
for(i=0; i < NF; i++;)
{
print min[i], max[i]
}
}
Here is a working solution which is really much easier than what you are doing:
/^-?[0-9]*(\.[0-9]*)?$/ checks that $1 is indeed a valid number, otherwise it is discarded.
sort -n | awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {a[c++]=$1} END {OFS="\t"; print "min","max";print a[0],a[c-1]}'
If you don't use this, then min and max need to be initialized, for example with the first value:
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {if (min=="") min=$1; if (max=="") max=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {OFS="\t"; print "min","max";print min, max}'
Readable versions:
sort -n | awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    a[c++]=$1
}
END {
    OFS="\t"
    print "min","max"
    print a[0],a[c-1]
}'
and
awk '
$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    if (min=="") min=$1
    if (max=="") max=$1
    if ($1<min) min=$1
    if ($1>max) max=$1
}
END {
    OFS="\t"
    print "min","max"
    print min, max
}'
On your input, it outputs:
min max
0.2 0.9
EDIT (replying to the comment requesting more information on how awk works):
Awk loops through lines (called records), and within each line you have columns (called fields) available. Each awk iteration reads a line and provides, among others, the NR and NF variables. In your case, you are only interested in the first column, so you will only use $1, the first field. For each record where $1 matches /^-?[0-9]*(\.[0-9]*)?$/, a regex matching positive and negative integers or floats, we either store the value in an array a (in the first version) or update the min/max variables as needed (in the second version).
Here is the explanation of the condition $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ (a quick self-check follows the list):
$1 ~ means we are checking if the first field $1 matches the regex between slashes
^ means we start matching from the beginning of the $1 field
-? means an optional minus sign
[0-9]* is any number of digits (including zero, so .1 or -.1 can be matched)
()? means an optional block which can be present or not
\.[0-9]* if that optional block is present, it should start with a dot and contain zero or more digits (so -. or . can be matched! adapt the regex if you have uncertain input)
$ means we are matching until the last character from the $1 field
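A quick way to sanity-check that regex (a sketch; note that 1e3 fails because the regex has no exponent part, while -. matches, as warned above):
printf '%s\n' 0.5 -0.5 -. 42 1e3 abc |
awk '$1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {print $1, "matches"}'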
If you wanted to loop through fields, you would have to use a for loop from 1 to NF (included) like this:
echo "1 2 3 4" | awk '{for (i=1; i<=NF; i++) {if (min=="") min=$(i); if (max=="") max=$(i); if ($(i)<min) min=$(i); if ($(i)>max) max=$(i)}} END {OFS="\t"; print "min","max";print min, max}'
(please note that I have not checked the input here for simplicity purposes)
Which outputs:
min max
1 4
If you had more lines as an input, awk would also process them after reading the first record, example with this input:
1 2 3 4
5 6 7 8
Outputs:
min max
1 8
To prevent this and only work on the first line, you can add a condition like NR == 1 to process only the first line or add an exit statement after the for loop to stop processing the input after the first line.
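For instance, a sketch of the exit variant on the two-line input above:
printf '1 2 3 4\n5 6 7 8\n' | awk '{
    for (i=1; i<=NF; i++) {
        if (min=="") min=$i
        if (max=="") max=$i
        if ($i<min) min=$i
        if ($i>max) max=$i
    }
    exit   # stop after the first record; the END block still runs
}
END {OFS="\t"; print "min","max"; print min, max}'
which again prints 1 and 4.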
If you're looking at only column 1, you may try this:
awk '/^[[:digit:]].*/{if($1<min||!min){min=$1};if($1>max){max=$1}}END{print min,max}' dataset
The script looks at lines starting with a digit and updates min or max whenever the current value beats the one found so far (or none was found yet).
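On the dataset above, this prints:
0.2 0.9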

awk: extracting columns based on column values

I have a file that looks somewhat like this:
C1 C2 C3 C4 C5
0 0 0 0 0
0 1 0 0 0
0 0 0 1 0
0 0 0 0 0
but much larger...
I want to extract only the columns that have all 0's in them, so my output file should look like this:
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0
Can this be done with a simple awk one-liner (similar to awk: print columns based on values of another column for example)? If no, is there another way to do this effectively using bash?
Try the following awk:
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) sum[i]+=$i; next } { for(i=1;i<=NF;i++) if (sum[i]==0) printf " %s", $i; print "" }' file{,}
Output
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0
The idea here is to iterate over the file twice. The first pass calculates the sum of every column, and the second pass prints only the columns whose sum is zero.
This assumes all column entries are non-negative.
Another, perhaps better, approach is to set a flag if any entry in a column is non-zero, and then print only those columns whose flag is unset.
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) if ($i) flag[i]=1; next } { for(i=1;i<=NF;i++) if (!flag[i]) printf " %s", $i; print "" }' file{,}
This approach handles positive as well as negative numbers, removing that restriction.
Or, as suggested by @fedorqui in a comment:
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) if ($i) flag[i]=1; next } { for(i=1;i<=NF;i++) if (flag[i]) $i="" } 1' file{,}
This works for data with negative numbers or other strings like 'foo' or 'bar'.
One-liner:
awk 'NR==1{next}NR==FNR{while(++i<=NF)if($i!="0")k[i];i=0;next}{while(++x<=NF)if(!(x in k))printf "%s ",$x;x=0;print ""}' file file
more readable:
awk 'NR==1{next}
NR==FNR{while(++i<=NF)if($i!="0")k[i];i=0;next}
{
    while(++x<=NF)
        if(!(x in k)) printf "%s ",$x
    x=0
    print ""
}' file file
A loooong solution.
Convert column to row
awk '{
    for (f = 1; f <= NF; f++) { a[NR, f] = $f }
}
NF > nf { nf = NF }
END {
    for (f = 1; f <= nf; f++) {
        for (r = 1; r <= NR; r++) {
            printf a[r, f] (r==NR ? RS : FS)
        }
    }
}' file >tmp1
Print only rows with only 0
awk '{for (i=2;i<=NF;i++) f+=$i} !f; {f=0}' tmp1 >tmp2
Convert back
awk '{
    for (f = 1; f <= NF; f++) { a[NR, f] = $f }
}
NF > nf { nf = NF }
END {
    for (f = 1; f <= nf; f++) {
        for (r = 1; r <= NR; r++) {
            printf a[r, f] (r==NR ? RS : FS)
        }
    }
}' tmp2
Gives
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0

How can I remove selected lines with an awk script?

I'm piping a program's output through some awk commands, and I'm almost where I need to be. The command thus far is:
myprogram | awk '/chk/ { if ( $12 > $13) printf("%s %d\n", $1, $12 - $13); else printf("%s %d\n", $1, $13 - $12) } ' | awk '!x[$0]++'
The last bit is a poor man's uniq, which isn't available on my target. The command above produces output such as this:
GR_CB20-chk_2, 0
GR_CB20-chk_2, 3
GR_CB200-chk_2, 0
GR_CB200-chk_2, 1
GR_HB20-chk_2, 0
GR_HB20-chk_2, 6
GR_HB20-chk_2, 0
GR_HB200-chk_2, 0
GR_MID20-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 0
GR_MID200-chk_2, 2
What I'd like to have is this:
GR_CB20-chk_2, 3
GR_CB200-chk_2, 1
GR_HB20-chk_2, 6
GR_HB200-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 2
That is, I'd like to print only the line that has the maximum value for a given tag (the first 'field'). The above example is representative of the actual data in that the output will be sorted (as though it had been piped through a sort command).
Based on my answer to a similar need, this script keeps things in order and doesn't accumulate a big array. It prints the line with the highest value from each group.
#!/usr/bin/awk -f
{
    s = substr($0, 0, match($0, /,[^,]*$/))
    if (s != prevs) {
        if ( FNR > 1 ) print prevline
        prevval  = $2
        prevline = $0
    }
    else if ( $2 > prevval ) {
        prevval  = $2
        prevline = $0
    }
    prevs = s
}
END {
    print prevline
}
If you don't need the items to be in the same order they were output from myprogram, the following works:
... | awk '{ if ($2 > x[$1]) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k]+0 }'
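Since for (k in x) visits keys in an unspecified order, one option (a sketch) is to pipe the result through sort if you do want the sorted order back; for data like this, plain lexical sorting happens to reproduce the desired output order:
... | awk '{ if ($2 > x[$1]) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k]+0 }' | sort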
