awk: extracting columns based on column values - bash

I have a file that looks somewhat like this:
C1 C2 C3 C4 C5
0 0 0 0 0
0 1 0 0 0
0 0 0 1 0
0 0 0 0 0
but much larger...
I want to extract only the columns that have all 0's in them, so my output file should look like this:
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0
Can this be done with a simple awk one-liner (similar to awk: print columns based on values of another column, for example)? If not, is there another way to do this effectively using bash?

Try the following awk:
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) sum[i]+=$i; next } { for(i=1;i<=NF;i++) if (sum[i]==0) printf " %s", $i; print "" }' file{,}
Output
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0
The idea here is to iterate over the file twice (file{,} is bash brace expansion and simply expands to file file, so awk reads the same file two times). The first pass computes the sum of each column; the second pass prints only the columns whose sum is zero.
This assumes all column entries are non-negative: a mix of positive and negative values could sum to zero in a column that is not all zeros.
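To see the failure mode, consider a column holding -1 and 1: it sums to zero and would wrongly be kept. A contrived demo (negdemo.txt is a hypothetical file used only for this illustration):
$ printf 'C1 C2\n-1 0\n1 0\n' > negdemo.txt
$ awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) sum[i]+=$i; next } { for(i=1;i<=NF;i++) if (sum[i]==0) printf " %s", $i; print "" }' negdemo.txt{,}
 C1 C2
 -1 0
 1 0
Column C1 is not all zeros, yet it survives because its sum is zero.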
Another, perhaps better, approach is to set a flag when any entry in a column is non-zero, and then print only those columns whose corresponding flag is unset.
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) if ($i) flag[i]=1; next } { for(i=1;i<=NF;i++) if (!flag[i]) printf " %s", $i; print "" }' file{,}
This approach handles negative as well as positive numbers, removing that restriction.
Or, as suggested by @fedorqui in a comment:
awk 'NR==1 {next} NR==FNR { for(i=1;i<=NF;i++) if ($i) flag[i]=1; next } { for(i=1;i<=NF;i++) if (flag[i]) $i="" } 1' file{,}

This works for data with negative numbers or other strings like 'foo' or 'bar'. (The bare statement k[i] is enough to create key i in the array; membership is later tested with (x in k).)
one-liner:
awk 'NR==1{next}NR==FNR{while(++i<=NF)if($i!="0")k[i];i=0;next}{while(++x<=NF)if(!(x in k))printf "%s ",$x;x=0;print ""}' file file
more readable:
awk 'NR==1{next}
NR==FNR{ while(++i<=NF) if($i!="0") k[i]; i=0; next }
{
    while(++x<=NF)
        if(!(x in k)) printf "%s ",$x
    x=0
    print ""
}' file file

A loooong solution.
Convert columns to rows:
awk '{
    for (f = 1; f <= NF; f++) { a[NR, f] = $f }
}
NF > nf { nf = NF }
END {
    for (f = 1; f <= nf; f++) {
        for (r = 1; r <= NR; r++) {
            printf a[r, f] (r==NR ? RS : FS)
        }
    }
}' file >tmp1
Print only the rows whose data fields (from field 2 on) sum to zero; the bare pattern !f prints a line when the running sum f is zero:
awk '{for (i=2;i<=NF;i++) f+=$i} !f; {f=0}' tmp1 >tmp2
Convert back
awk '{
    for (f = 1; f <= NF; f++) { a[NR, f] = $f }
}
NF > nf { nf = NF }
END {
    for (f = 1; f <= nf; f++) {
        for (r = 1; r <= NR; r++) {
            printf a[r, f] (r==NR ? RS : FS)
        }
    }
}' tmp2
Gives
C1 C3 C5
0 0 0
0 0 0
0 0 0
0 0 0
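For what it's worth, the temp files can be avoided by chaining the three steps into one pipeline; assuming the transpose program above is saved as transpose.awk (a hypothetical name), this sketch does the same thing:
awk -f transpose.awk file | awk '{for (i=2;i<=NF;i++) f+=$i} !f; {f=0}' | awk -f transpose.awk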

Related

How to count the occurrence of negative and positive values in a column using awk?

I have a file that looks like this:
FID IID data1 data2 data3
1 RQ00001-2 1.670339 -0.792363849 -0.634434791
2 RQ00002-0 -0.238737767 -1.036163943 -0.423512414
3 RQ00004-9 -0.363886913 -0.98661685 -0.259951265
3 RQ00004-9 -9 -0.98661685 0.259951265
I want to count the number of positive numbers in column 3 (data 1) versus negative numbers excluding -9. Therefore, for column 3 it will be 1 positive vs 2 negative. I didn't include -9 as this stands for missing data. For data2, this would be 3 negative versus 1 positive. For the last column it will be 3 negative versus 1 positive.
I would prefer to use awk, but since I am new to it I need help. I use the command below, but this just counts all the negative values; I need it to exclude -9. Is there a more sophisticated way of doing this?
awk '$3 ~ /^-/{cnt++} END{print cnt}' filename.txt
Assumptions:
determine the number of negative and positive values for the 3rd thru Nth columns
One awk idea:
awk '
NR>1 { for (i=3;i<=NF;i++) {
           if ($i == -9) continue
           else if ($i < 0) neg[i]++
           else pos[i]++
       }
     }
END  { printf "Neg/Pos"
       for (i=3;i<=NF;i++)
           printf "%s%s/%s",OFS,neg[i]+0,pos[i]+0
       print ""
     }
' filename.txt
This generates:
Neg/Pos 2/1 4/0 3/1
NOTE: OP hasn't provided an example of the expected output; all of the counts are located in the arrays so modifying the output format should be relatively easy once OP has provided a sample output
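For instance, if the desired output were one labeled line per column, a hypothetical variation could store the header names in NR==1 and use them in the END block (a sketch, assuming the header row labels the columns):
awk '
NR==1 { for (i=3;i<=NF;i++) hdr[i]=$i; next }
      { for (i=3;i<=NF;i++) {
            if ($i == -9) continue
            else if ($i < 0) neg[i]++
            else pos[i]++
        }
      }
END   { for (i=3;i<=NF;i++)
            printf "%s: %d negative / %d positive\n", hdr[i], neg[i]+0, pos[i]+0
      }
' filename.txt
With the sample input this should print "data1: 2 negative / 1 positive" and so on for data2 and data3.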
You can use this awk solution:
awk -v c=3 '
NR > 1 && $c != -9 {
    if ($c < 0)
        ++neg
    else
        ++pos
}
END {
    printf "Positive: %d, Negative: %d\n", pos, neg
}' file
Positive: 1, Negative: 2
Running it with c=5:
awk -v c=5 'NR > 1 && $c != -9 {if ($c < 0) ++neg; else ++pos} END {printf "Positive: %d, Negative: %d\n", pos, neg}' file
Positive: 1, Negative: 3
$ awk '
NR == 1 {
for(i = 3; i <= NF; i++) header[i] = $i
}
NR > 1 {
for(i = 3; i <= NF; i++) {
pos[i] += ($i >= 0); neg[i] += (($i != -9) && ($i < 0))
}
}
END {
for(i in pos) {
if (header[i] == "") header[i] = "column " i
printf("%-10s: %d positive, %d negative\n", header[i], pos[i], neg[i])
}
}' file
data1 : 1 positive, 2 negative
data2 : 0 positive, 4 negative
data3 : 1 positive, 3 negative
awk '
NR > 1 && $3 != -9 {$3 >= 0 ? ++p : ++n}
END {print "pos: "p+0, "neg: "n+0}'
Gives:
pos: 1 neg: 2
You can change ++n to --p to get a single number p, equal to number of positive minus number of negative.
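That suggested change would look like this (a sketch of the same one-liner with the single-counter variant):
awk '
NR > 1 && $3 != -9 {$3 >= 0 ? ++p : --p}
END {print "pos - neg: " p+0}'
With the sample data this gives: pos - neg: -1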
Below are some examples of how you can achieve this.
Note: we assume that -0.0 and 0.0 are positive.
Count negative numbers in column n:
$ awk '(FNR>1){c+=($n<0)}END{print "pos:",(NR-1-c),"neg:"c+0}' file
Count negative numbers in column n, but ignore -9:
$ awk '(FNR>1){c+=($n<0);d+=($n==-9)}END{print "pos:",(NR-1-c),"neg:"c-d}' file
Count negative numbers columns m to n:
$ awk '(FNR>1){for(i=m;i<=n;++i) c[i]+=($i<0)}
END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]+0}' file
Count negative numbers in columns m to n, but ignore -9:
$ awk '(FNR>1){for(i=m;i<=n;++i) {c[i]+=($i<0);d[i]+=($i==-9)}}
END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]-d[i]}' file
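Note that m and n (and $n in the single-column snippets) are placeholders, not working awk as-is, since unset awk variables evaluate to zero; the bounds can be passed in with -v. For example, scanning columns 3 through 5 of the OP's file while ignoring -9 (using the corrected snippet above):
$ awk -v m=3 -v n=5 '(FNR>1){for(i=m;i<=n;++i) {c[i]+=($i<0);d[i]+=($i==-9)}}
END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]-d[i]}' filename.txt
3 pos: 1 neg:2
4 pos: 0 neg:4
5 pos: 1 neg:3
For a single column, the same idea is -v n=3.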

How to calculate the mean of a row from a CSV file from the nth column?

This may look like a duplicate, but I could not solve the issue I'm having.
I'm trying to find the average of each row (from the 5th column on) of a CSV/TSV file; the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
The below code also threw a syntax error:
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Please help me resolve the issue so that I can calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
    # missing the last ). This prints the 1st column
    #printf("%s\t", $1
    printf("%s\t", $1 )
    # missing the last ), and it averages 3 columns only
    #printf("%.2f\n", ($5 + $6 + $7)/3
    printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second attempt is not easy to work with: lots of subshells (backticks) and a shell loop. Most of all, I think it only works with integer values and over the full line of values (not fields 5 through 9). Forget it unless you don't want awk in this case.
for fun
awk 'NR==1{
    # header
    print $0 OFS "Avg"
    Count = NF - 4
    next
}
{
    # print each element of the line, summing the fields after col 4
    Avg = 0
    for( i=1; i<=NF; i++ ) {
        if( i >= 5 ) Avg += $i
        printf( "%s ", $i )
    }
    # print average
    printf( "%.2f\n", Avg/Count )
}
' input.tsv
This fixes Count from the header, assuming every line carries the full set of values; if some lines have fewer values, use a per-line divisor of (NF - 4) instead, so that empty trailing fields are not counted.
You could use this awk script:
awk 'NR>1{
    for(i=5;i<=NF;i++)
        sum+=$i
}
{
    print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
    sum=0
}' file | column -t
The first block sums all the values starting from the 5th field.
The second block prints the first four fields followed by either the Avg header (on line 1) or the computed average.
column -t displays the result in aligned columns.
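With the sample input this should produce:
ID  source  random  text  Avg
1   atttt   eeeee   test  0.606
2   afdg    adfgrg  tf    0.404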
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
sum = cnt = 0
for (i=5; i<=NF; i++) {
sum += $i
cnt++
}
avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0;foreach(@F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5;print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" } ' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
>

Extract desired column with values

Please help me with this small script I am making. I am trying to extract some columns with their values from a big tab-separated file (mainFileWithValues.txt) which has this format:
A B C ......... (total 700 columns)
80 2.08 23
14 1.88 30
12 1.81 40
Column names are in columnnam.nam:
cat columnnam.nam
A
B
.
.
.
(and so on, up to 20 names)
I am first finding the column number in the big file using:
sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c
Then using cut I am extracting values
I have made a for loop
#!/bin/bash
for i in `cat columnnam.nam`
do
cut -f`sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c` mainFileWithValues.txt >> test.txt
done
cat test.txt
A
80
14
12
B
2.08
1.88
1.81
My problem is that I want the output test.txt to be arranged in columns like the main file, i.e.:
A B
80 2.08
How can I fix this in this script?
Here is a one-liner:
awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
Explanation:
[Note: the header match is case-insensitive; remove tolower() if you want a strict match.]
awk '
FNR==NR{                      # Here we read the columns.nam file
    h[NR]=$1;                 # h -> array, NR -> array key, $1 -> array value
    next                      # go to the next line
}
{                             # Here we read the second file
    for(i=1; i in h; i++)     # iterate over array h
    {
        if(FNR==1)            # if we are reading the 1st row of the second file, parse the header
        {
            for(j=1; j<=NF; j++)   # iterate over the fields of the 1st row
            {
                # if this is the field we are looking for
                if(tolower(h[i])==tolower($j))
                {
                    # then
                    # d -> array, i -> array key, which is the column order number
                    # j -> array value, which is the column number
                    d[i]=j;
                    break
                }
            }
        }
        # for all records,
        # if the field we searched for was found, print that field;
        # d[i] gives us its column number
        printf("%s%s",i>1 ? OFS:"", i in d ? $(d[i]): "");
    }
    # print newline char
    print ""
}
' columns.nam mainfile
Test Results:
$ cat mainfile
A B C
80 2.08 23
14 1.88 30
12 1.81 40
$ cat columns.nam
A
C
$ awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
A C
80 23
14 30
12 40
You can also make it a script and run it:
akshay@db-3325:/tmp$ cat col_parser.awk
FNR == NR {
    h[NR] = $1;
    next
}
{
    for (i = 1; i in h; i++) {
        if (FNR == 1) {
            for (j = 1; j <= NF; j++) {
                if (tolower(h[i]) == tolower($j)) {
                    d[i] = j;
                    break
                }
            }
        }
        printf("%s%s", i > 1 ? OFS : "", i in d ? $(d[i]) : "");
    }
    print ""
}
akshay@db-3325:/tmp$ awk -v OFS="\t" -f col_parser.awk columns.nam mainfile
A C
80 23
14 30
12 40
Similar Answer
AWK to display a column based on Column name and remove header and last delimiter
Another awk approach:
awk 'NR == FNR {
    hdr[$1]
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++)
        if ($i in hdr)
            h[i]
}
{
    s=""
    for (i in h)
        s = s (s == "" ? "" : OFS) $i
    print s
}' column.nam mainFileWithValues.txt
A B
80 2.08
14 1.88
12 1.81
To get formatted output, pipe the above command to column -t.
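For example, assuming the program above is saved as pick_cols.awk (a hypothetical name):
$ awk -f pick_cols.awk column.nam mainFileWithValues.txt | column -t
A   B
80  2.08
14  1.88
12  1.81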

Min and max coordinates for same values in different column

I have one question. I am thinking about a script for my data and I am totally lost.
INPUT:
1 BR.100.200
2 BR.100.200
3 BR.100.200
4 BR.100.200
1 BAL.11.235
2 BAL.11.235
3 BAL.11.235
1 JOJ.21.354
2 JOJ.21.354
OUTPUT :
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2
What I want: if $2 is the same in several rows, write the minimal and maximal values of $1 for each such $2 value. I would prefer awk, bash, or sed.
Thank you
Filip
Could probably be made better, but this works:
awk '!x[$2]{x[$2]=$1}y[$2]<$1{y[$2]=$1}x[$2]>$1{x[$2]=$1}END{for(i in y)print i,x[i],y[i]}' file
More readable
awk '!min[$2]{min[$2]=$1} max[$2]<$1{max[$2]=$1} min[$2]>$1{min[$2]=$1} END{for(i in max)print i, min[i], max[i]}' file
#!/usr/bin/awk -f
NF == 0 { next }
$2 in min {
    if ($1 < min[$2]) {
        min[$2] = $1
    } else if ($1 > max[$2]) {
        max[$2] = $1
    }
    next
}
{
    min[$2] = max[$2] = $1
    keys[i++] = $2
}
END {
    for (i = 0; i in keys; ++i) {
        key = keys[i]
        if (i) {
            print ""
        }
        printf "%s\t%s\t%s\n", key, min[key], max[key]
    }
}
Run with:
awk -f script.awk your_file.txt
Output:
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2
awk '{if (NR == 1) {temp1=$2;min=$1;max=$1} else {temp2=$2; if (temp1 == temp2) {if ($1>max) max=$1; if ($1<min) min=$1} else {print temp1,min,max; temp1=$2;min=$1;max=$1}}} END{print temp1,min,max}' inputfile
(This assumes the input rows are grouped by $2, as in the sample.)

How can I remove selected lines with an awk script?

I'm piping a program's output through some awk commands, and I'm almost where I need to be. The command thus far is:
myprogram | awk '/chk/ { if ( $12 > $13) printf("%s %d\n", $1, $12 - $13); else printf("%s %d\n", $1, $13 - $12) } ' | awk '!x[$0]++'
The last bit is a poor man's uniq, which isn't available on my target. The command above produces output such as this:
GR_CB20-chk_2, 0
GR_CB20-chk_2, 3
GR_CB200-chk_2, 0
GR_CB200-chk_2, 1
GR_HB20-chk_2, 0
GR_HB20-chk_2, 6
GR_HB20-chk_2, 0
GR_HB200-chk_2, 0
GR_MID20-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 0
GR_MID200-chk_2, 2
What I'd like to have is this:
GR_CB20-chk_2, 3
GR_CB200-chk_2, 1
GR_HB20-chk_2, 6
GR_HB200-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 2
That is, I'd like to print only the line that has the maximum value for a given tag (the first 'field'). The above example is representative of the actual data in that the output will be sorted (as though it had been piped through a sort command).
Based on my answer to a similar need, this script keeps things in order and doesn't accumulate a big array. It prints the line with the highest value from each group.
#!/usr/bin/awk -f
{
    # tag = everything up to the last comma on the line
    s = substr($0, 0, match($0, /,[^,]*$/))
    if (s != prevs) {
        # new group: emit the best line of the previous group
        if ( FNR > 1 ) print prevline
        prevval = $2
        prevline = $0
    }
    else if ( $2 > prevval ) {
        prevval = $2
        prevline = $0
    }
    prevs = s
}
END {
    print prevline
}
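Assuming the script is saved as maxline.awk (a hypothetical name), it slots into the original pipeline in place of the poor man's uniq:
myprogram | awk '/chk/ { if ( $12 > $13) printf("%s %d\n", $1, $12 - $13); else printf("%s %d\n", $1, $13 - $12) } ' | awk -f maxline.awk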
If you don't need the items to be in the same order they were output from myprogram, the following works:
... | awk '{ if (!($1 in x) || $2+0 > x[$1]+0) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k] }'
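If you do want the sorted order shown in the question back, piping the result through sort is enough here, since the tags happen to sort lexicographically into that order:
... | awk '{ if (!($1 in x) || $2+0 > x[$1]+0) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k] }' | sort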
