awk command to sum pairs of lines and filter them out under a particular condition - bash

I have a file with numbers, and I want to sum each pair of lines column by column; then, in the last step, I want to filter out the pairs whose column sums contain 3 or more zeros. I wrote a small example to make it clear:
This is my file (without the comments, of course); it contains 2 pairs of lines (= 4 lines) with 5 columns.
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
0 2 0 3 0 # pair 2.A
0 0 0 0 0 # pair 2.B
And I need to sum up pairs of lines so I get something like this (intermediate step)
2 7 0 13 10 # sum pair 1, it has one 0
0 2 0 3 0 # sum pair 2, it has three 0
Then I want to print the original lines, but only those pairs for which the count of zeros (in the sum of the two lines) is lower than 3, so I should get this printed:
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
Because the sum of the second pair of lines has three zeros, that pair should be excluded.
So from the input file I need to end up with the last output shown.
So far I have been able to sum pairs of lines, count the zeros, and identify the pairs with fewer than 3 zeros, but I don't know how to print the two lines that contributed to the sum; I am only able to print one of the two lines (the last one). This is the awk I am using:
awk '
NR%2 { split($0, a); next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print $0; SUM=0 }' myfile
That's what I get now:
0 1 0 5 1 # pair 1.B
Thanks!

Another variation, which could be useful to avoid unnecessary loop iterations for some inputs: the next inside the loop abandons a pair as soon as three zero sums have been seen.
awk '!(NR%2){ zeros=0; for(i=1;i<=NF;i++) { if(a[i]+$i==0) zeros++; if(zeros>=3) next }
print prev ORS $0 }{ split($0,a); prev=$0 }' file
The output:
2 6 0 8 9
0 1 0 5 1

Well, after digging a little more I found that it is rather simple to print the previous line (I was overcomplicating it).
awk '
NR%2 { split($0, a) ; b=$0; next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print b"\n"$0; SUM=0}' myfile
So I just have to save the first line of each pair in variable b and print both lines when the condition is met.
Hope it can help other people too

$ cat tst.awk
!(NR%2) {
    split(prev,p)
    zeroCnt = 0
    for (i=1; i<=NF; i++) {
        zeroCnt += (($i + p[i]) == 0 ? 1 : 0)
    }
    if (zeroCnt < 3) {
        print prev ORS $0
    }
}
{ prev = $0 }
$ awk -f tst.awk file
2 6 0 8 9
0 1 0 5 1
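If the grouping ever needs to be more general than strict pairs, a buffered variant along the lines of the sketch below could handle groups of N consecutive lines. This is only a sketch, not part of the answers above; the group size N and the zero threshold of 3 are assumptions:
awk -v N=2 '
{
    buf[NR % N] = $0                        # remember the original line
    for (i = 1; i <= NF; i++) sum[i] += $i  # column-wise running sums
}
NR % N == 0 {                               # a full group has been read
    zeros = 0
    for (i = 1; i <= NF; i++) if (sum[i] == 0) zeros++
    if (zeros < 3)
        for (j = 1; j <= N; j++) print buf[j % N]
    split("", sum); split("", buf)          # reset for the next group
}' myfile
With N=2 this prints the same two lines as the pairwise versions above.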

Related

How to count the occurrence of negative and positive values in a column using awk?

I have a file that looks like this:
FID IID data1 data2 data3
1 RQ00001-2 1.670339 -0.792363849 -0.634434791
2 RQ00002-0 -0.238737767 -1.036163943 -0.423512414
3 RQ00004-9 -0.363886913 -0.98661685 -0.259951265
3 RQ00004-9 -9 -0.98661685 0.259951265
I want to count the number of positive numbers in column 3 (data1) versus negative numbers, excluding -9. So for column 3 it would be 1 positive vs 2 negative; I don't count -9 because it stands for missing data. For data2 this would be 4 negative versus 0 positive, and for the last column 3 negative versus 1 positive.
I would prefer to use awk, but since I am new to it I need help. The command below counts all the negative values, but I need it to exclude -9. Is there a more sophisticated way of doing this?
awk '$3 ~ /^-/{cnt++} END{print cnt}' filename.txt
Assumptions:
determine the number of negative and positive values for the 3rd thru Nth columns
One awk idea:
awk '
NR>1 {
    for (i=3;i<=NF;i++) {
        if ($i == -9) continue
        else if ($i < 0) neg[i]++
        else pos[i]++
    }
}
END {
    printf "Neg/Pos"
    for (i=3;i<=NF;i++)
        printf "%s%s/%s", OFS, neg[i]+0, pos[i]+0
    print ""
}
' filename.txt
This generates:
Neg/Pos 2/1 4/0 3/1
NOTE: the OP hasn't provided an example of the expected output; all of the counts are in the arrays, so modifying the output format should be relatively easy once a sample output is provided.
You can use this awk solution:
awk -v c=3 '
NR > 1 && $c != -9 {
    if ($c < 0)
        ++neg
    else
        ++pos
}
END {
    printf "Positive: %d, Negative: %d\n", pos, neg
}' file
Positive: 1, Negative: 2
Running it with c=5:
awk -v c=5 'NR > 1 && $c != -9 {if ($c < 0) ++neg; else ++pos} END {printf "Positive: %d, Negative: %d\n", pos, neg}' file
Positive: 1, Negative: 3
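If you need the counts for several columns with this same single-column script, a small shell loop over the column numbers works too; a sketch, assuming the same file name as above:
for c in 3 4 5; do
    printf 'column %d: ' "$c"
    awk -v c="$c" 'NR > 1 && $c != -9 {if ($c < 0) ++neg; else ++pos}
                   END {printf "Positive: %d, Negative: %d\n", pos, neg}' file
done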
$ awk '
NR == 1 {
    for (i = 3; i <= NF; i++) header[i] = $i
}
NR > 1 {
    for (i = 3; i <= NF; i++) {
        pos[i] += ($i >= 0); neg[i] += (($i != -9) && ($i < 0))
    }
}
END {
    for (i in pos) {
        if (header[i] == "") header[i] = "column " i
        printf("%-10s: %d positive, %d negative\n", header[i], pos[i], neg[i])
    }
}' file
data1 : 1 positive, 2 negative
data2 : 0 positive, 4 negative
data3 : 1 positive, 3 negative
awk '
NR > 1 && $3 != -9 {$3 >= 0 ? ++p : ++n}
END {print "pos: "p+0, "neg: "n+0}' file
Gives:
pos: 1 neg: 2
You can change ++n to --p to get a single number p, equal to the number of positives minus the number of negatives.
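As a quick illustration of that variant on the sample data (the file name is assumed), positives add 1 and negatives subtract 1, so column 3 yields -1:
awk 'NR > 1 && $3 != -9 {$3 >= 0 ? ++p : --p} END {print p+0}' file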
Below are some examples of how you can achieve this. Here n and m stand for column numbers; replace them with literal numbers or pass them in with -v (e.g. -v n=3), otherwise $n will not refer to the intended column.
Note: we assume that -0.0 and 0.0 are positive.
Count negative numbers in column n:
$ awk '(FNR>1){c+=($n<0)}END{print "pos:",(NR-1-c),"neg:"c+0}' file
Count negative numbers in column n, but ignore -9:
$ awk '(FNR>1){c+=($n<0);d+=($n==-9)}END{print "pos:",(NR-1-c),"neg:"c-d}' file
Count negative numbers in columns m to n:
$ awk '(FNR>1){for(i=m;i<=n;++i) c[i]+=($i<0)}
END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]+0}' file
Count negative numbers in columns m to n, but ignore -9:
$ awk '(FNR>1){for(i=m;i<=n;++i){c[i]+=($i<0);d[i]+=($i==-9)}}
END{for(i=m;i<=n;++i) print i,"pos:",(NR-1-c[i]),"neg:"c[i]-d[i]}' file
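As a concrete run, the single-column "ignore -9" form can be applied to column 3 by passing the column number with -v (the file name filename.txt is taken from the question):
$ awk -v n=3 '(FNR>1){c+=($n<0);d+=($n==-9)}END{print "pos:",(NR-1-c),"neg:"c-d}' filename.txt
pos: 1 neg:2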

How to sort a specified column in Linux

This is my two-column sequence. I want to combine the two columns into one column and sort them as shown below, but I don't know how to write the shell script to handle this.
GGCTGCAGCTAACAGGTGA TACTCGGGGAGCTGCGG
CCTCTGGCTCGCAGGTCATGGC CAGCGTCTTGCGCTCCT
GCTGCAGCTACATGGTGTCG CGCTCCGCTTCTCTCTACG
The sorted result is as follows (first column first, second column second, numbered and separated by "\t"):
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
what should I do?
You can do it easily in awk by storing the second column in an array and then outputting the saved values in the END rule, e.g.
awk '
{
    print ++n, $1       # output first column
    a[n] = $2           # save second column in array
}
END {
    j = n + 1           # j is next counter
    for (i=1;i<=n;i++)  # loop 1 - n
        print j++, a[i] # output j and array value
}
' file.txt
Example Use/Output
With your input in file.txt, you can just copy/middle-mouse-paste the above in an xterm with file.txt in the current directory, e.g.
$ awk '
> {
>     print ++n, $1       # output first column
>     a[n] = $2           # save second column in array
> }
> END {
>     j = n + 1           # j is next counter
>     for (i=1;i<=n;i++)  # loop 1 - n
>         print j++, a[i] # output j and array value
> }
> ' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
Or as a 1-liner:
$ awk '{print ++n, $1; a[n]=$2} END {j=n+1; for (i=1;i<=n;i++) print j++, a[i]}' file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
If you would like to create an awk script from the above, you can simply create the script file (say cmbcols.awk) as:
{
    print ++n, $1       # output first column
    a[n] = $2           # save second column in array
}
END {
    j = n + 1           # j is next counter
    for (i=1;i<=n;i++)  # loop 1 - n
        print j++, a[i] # output j and array value
}
Then to run the script on the file file.txt you can do:
$ awk -f cmbcols.awk file.txt
1 GGCTGCAGCTAACAGGTGA
2 CCTCTGGCTCGCAGGTCATGGC
3 GCTGCAGCTACATGGTGTCG
4 TACTCGGGGAGCTGCGG
5 CAGCGTCTTGCGCTCCT
6 CGCTCCGCTTCTCTCTACG
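Since the question asks for the number and the value to be separated by "\t", note that the output above uses awk's default output separator (a space). A minimal tweak, as a sketch, is to set OFS to a tab:
$ awk -v OFS='\t' '{print ++n, $1; a[n]=$2} END {j=n+1; for (i=1;i<=n;i++) print j++, a[i]}' file.txt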

How to sum rows in a tsv file using awk?

My input:
Position A B C D No
1 0 0 0 0 0
2 1 0 1 0 0
3 0 6 0 0 0
4 0 0 0 0 0
5 0 5 0 0 0
I have a TSV file like the above, where I wish to sum each row across the A, B, C and D columns only, not the Position column.
The desired output is a TSV with two columns, Position and Sum, with a header in the first row:
Position Sum
1 0
2 2
3 6
4 0
5 5
So far I have:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5 printf"%d\t%d\n",$sum}' infile.tsv > outfile.tsv
You were very close, try this:
awk 'BEGIN{print"Position\tSum"}{if(NR==1)next; sum=$2+$3+$4+$5; printf "%d\t%d\n",$1,sum; }' infile.tsv > outfile.tsv
But I'd say it's cleaner with newlines and indentation:
awk '
BEGIN {
    print "Position\tSum";
}
{
    if (NR==1) {
        next;
    }
    sum = $2 + $3 + $4 + $5;
    printf "%d\t%d\n", $1, sum;
}' infile.tsv > outfile.tsv
A minimalist script can be:
$ awk '{print $1 "\t" (NR==1?"Sum":$2+$3+$4+$5)}' file
Could you please try the following. You were hard-coding field numbers, which will not work in many cases, so here is a loop approach (it skips the first field and sums fields 2 through NF-1, i.e. the A-D columns):
awk 'FNR==1{print $1,"sum";next} {for(i=2;i<NF;i++){sum+=$i};print $1,sum;sum=""}' Input_file
Add BEGIN{OFS="\t"} at the start of the script (keeping the rest of the code the same) in case you need the output in TAB-separated form.
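Putting those two points together, a sketch of the loop approach with tab-separated output (Input_file as above) would be:
awk 'BEGIN{OFS="\t"} FNR==1{print $1,"Sum";next} {for(i=2;i<NF;i++){sum+=$i}; print $1,sum; sum=""}' Input_file
On the sample input this prints the Position/Sum table from the question, with a tab between the two columns.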

Subset a file by row and column numbers

We want to subset a text file on rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and the row names (column 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus, and still running. I think cutting the columns works fine; selecting the rows is the slow bit.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software like R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: The solution provided below by @JamesBrown was mixing up the column order on my system, as I am using a different version of awk; my version is GNU Awk 3.1.7.
Even though in "If programming languages were countries, which country would each language represent?" they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
    split(cols,c); for (i in c) col[c[i]]    # extract cols to print
    split(rows,r); for (i in r) row[r[i]]    # extract rows to print
}
(NR-1 in row){
    for (i=2;i<=NF;i++)
        if ((i-1) in col) line=(line ? line OFS $i : $i)   # pick columns
    print line; line=""                                    # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line ? line OFS $i : $i); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can, for example, check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk against grep and others.
Here is one for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and column numbers are read from the files:
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] {                     # process rows file
    n=split($0,t,",");
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                     # process cols file
    m=split($0,t,",");
    for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) {   # process data file
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before the ARGV[1] and ARGV[2] blocks, and adding a next at its end.
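As an illustration only, that rearrangement might look like the sketch below (same three input files; this is a rearrangement of the script above, not a separately posted version):
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[3] {                     # data file first: most records land here
    if ((FNR-1) in r)
        for(i in c)
            printf "%s%s", $(i+1), (++j%m?OFS:ORS)
    next                                # skip the two blocks below
}
FILENAME==ARGV[1] {                     # rows file
    n=split($0,t,","); for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                     # cols file
    m=split($0,t,","); for(i=1;i<=m;i++) c[t[i]]
}' subsetRows.txt subsetCols.txt inputFile.txt
Because the two small files are listed first on the command line, r, c and m are already populated by the time the data file is read.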
Not to take anything away from the two excellent answers above, but because this problem involves a large set of data I am posting a combination of the 2 answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    for (i in r)
        row[r[i]]
}
(NR-1) in row {
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk versions or non-GNU awk as well.
To refine @anubhava's solution, we can get rid of searching through the requested row numbers for each input row to see whether we are on a wanted row, by taking advantage of the fact that the input is already sorted:
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    j=1
}
(NR-1) == r[j] {
    j++
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv

# The input file is tab-delimited, hence delimiter='\t'.
with open('foo.txt') as f:
    gwas = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])

Aggregate rows with specified granularity

Input:
11 1
12 2
13 3
21 1
24 2
33 1
50 1
Let's say the 1st column specifies an index. I'd like to reduce the size of my data as follows:
I want to sum the values from the second column with a granularity of 10 according to the indices. An example:
First I consider the index range 0-9. There aren't any indices in that range, so the sum equals 0. Then I move to the next range, 10-19. There are 3 indices (11, 12, 13) in that range; summing their values from the 2nd column gives 1+2+3=6. And so on...
Desirable output:
0 0
10 6
20 3
30 1
40 0
50 1
This is what I came up with:
M=0;
awk 'FNR==NR
{
if ($1 < 10)
{ A[$1]+=$2;next }
else if($1 < $M+10)
{
A[$M]+=$2;
next
}
else
{ $M=$M+10;
A[$M]+=2;
next
}
}END{for(i in A){print i" "A[i]}}' input_file
Sorry but I'm not quite good at AWK.
After some changes:
awk 'FNR==NR {
    M=10;
    if ($1 < 10) {
        A[$1]+=$2; next
    } else if ($1 < M+10) {
        A[M]+=$2;
        next
    } else {
        M=sprintf("%d",$1/10);
        M=M*10;
        A[M]+=$2;
        next
    }
}END{for(i in A){print i" "A[i]}}' input
This is GNU awk:
{
    ind=int($1/10)*10
    if (mxi<ind) mxi=ind
    a[ind]++
}
END {
    for (i=0; i<=mxi; i+=10) {
        s=(a[i]*(a[i]+1))/2
        print i " " s
    }
}
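Note that this script only counts the rows per bucket (a[ind]++) and then applies the n(n+1)/2 formula, which reproduces the desired output only because the sample's second column enumerates 1, 2, 3, ... within each bucket. To sum the second column directly, whatever its values, a sketch would be:
{
    ind = int($1/10)*10
    if (mxi < ind) mxi = ind
    a[ind] += $2                 # accumulate the 2nd column per 10-wide bucket
}
END {
    for (i=0; i<=mxi; i+=10)
        print i " " (a[i]+0)     # empty buckets print 0
}
It runs the same way (e.g. awk -f script.awk input) and produces the desired output for the sample data.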
