Optimize AWK script for large dataset - bash

For the following input data,
Chr C rsid D A1 A2 ID1_AA ID1_AB ID1_BB ID2_AA ID2_AB ID2_BB ID3_AA ID3_AB ID3_BB ID4_AA ID4_AB ID4_BB ID5_AA ID5_AB ID5_BB
10 p rsid1 q A G 0.00 0.85 0.15 0.70 0.10 0.20 0.40 0.50 0.10 0.30 0.30 0.40 0.10 0.20 0.80
10 p rsid2 q C T 0.90 0.10 0.00 0.80 0.10 0.10 0.70 0.10 0.20 0.30 0.40 0.30 0.30 0.20 0.40
10 p rsid3 q A G 0.40 0.50 0.10 0.80 0.20 0.00 0.20 0.30 0.50 0.50 0.30 0.20 0.20 0.30 0.40
I need to generate the following output data.
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
The table shows, for every ID (ID1, ID2, ID3, etc.), the sum of its three columns (_AA, _AB & _BB) after multiplying them by the constant factors 1, 2 and 3 respectively.
Example: for rsid1 --> ID1 -> (ID1_AA*1 + ID1_AB*2 + ID1_BB*3) = (0.00*1 + 0.85*2 + 0.15*3) = 2.15
I wrote the following AWK script to accomplish the task, and it works absolutely fine.
Please note: I'm very much a beginner in AWK.
awk '{
  if(NR <= 1) { # header line
    str = $3;
    for(i=7; i<=NF; i+=3) {
      split($i,s,"_");
      str = str"\t"s[1]
    }
    print str
  } else { # data line
    k = 0;
    for(i=7; i<=NF; i+=3)
      arr[k++] = $i*1 + $(i+1)*2 + $(i+2)*3;
    str = $3;
    for(i=0; i<=(NF-6)/3; i++)
      str = str"\t"arr[i];
    print str
  }
}' input.txt > out.txt
Later I was told the input data can be as big as 60 million rows and 300 thousand columns, which means the output will be about 60M x 100K. If I'm not wrong, AWK reads one line at a time, so at any instant there will be 300K columns of data held in memory. Is that a problem? Given the situation, how can I improve my code?

Both approaches have pros and cons, and both can handle any number of rows/columns since they only store one row at a time in memory. Still, I'd use the approach below rather than the answer posted by Akshay: since you have 300,000 columns per line, his approach tests for the header line almost 100,000 times per line, whereas the approach below performs that test just once per line, so it should be noticeably more efficient:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    printf "%s", $3
    if (NR==1) {
        gsub(/_[^[:space:]]+/,"")
        for (i=7; i<=NF; i+=3) {
            printf "%s%s", OFS, $i
        }
    }
    else {
        for (i=7; i<=NF; i+=3) {
            printf "%s%.2f", OFS, $i + $(i+1)*2 + $(i+2)*3
        }
    }
    print ""
}
$ awk -f tst.awk file
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
I highly recommend you read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn what awk is and how to use it.
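A further variation on the same idea (just a sketch, not part of the answer above, assuming the input file is named input.txt as in the question) is to handle the header completely outside awk, so the per-record code contains no header test at all:
{
  # Header: keep column 3 plus the ID names (strip the _AA/_AB/_BB suffix from every 3rd column).
  head -n 1 input.txt |
    awk 'BEGIN{OFS="\t"} {printf "%s",$3; for(i=7;i<=NF;i+=3){sub(/_.*/,"",$i); printf "%s%s",OFS,$i} print ""}'
  # Data rows: only the weighted sums, no branching on the record number.
  tail -n +2 input.txt |
    awk 'BEGIN{OFS="\t"} {printf "%s",$3; for(i=7;i<=NF;i+=3) printf "%s%.2f",OFS,$i+$(i+1)*2+$(i+2)*3; print ""}'
} > out.txt
Whether the extra head/tail processes actually pay off would need measuring; the single per-line test in the script above is already very cheap.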

awk -v OFS="\t" '
{
    printf("%s", $3);
    for (i=7; i<=NF; i+=3)
    {
        if (FNR==1)
        {
            sub(/_.*/,"",$i)
            f = $i
        } else
        {
            f = sprintf("%5.2f", $i*1 + $(i+1)*2 + $(i+2)*3)
        }
        printf("%s%s", OFS, f)
    }
    print ""
}
' file
Output
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00

Do you think it would help to use a low-level language like C?
C or C++ is not automagically faster than awk, and the code is less readable and more fragile.
I'll show another solution using C++, to compare.
// p.cpp
#include <stdio.h>

// adjust to the number of IDs (groups of _AA/_AB/_BB columns) in the input
#define COLUMNS 5

int main() {
    char column3[256];
    bool header = true;
    // skip the first two fields, read the rsid, skip the next three fields
    while (scanf("%*s\t%*s\t%255s\t%*s\t%*s\t%*s\t", column3) == 1) {
        printf("%s", column3);
        if (header) {
            header = false;
            char name[256];
            for (int i = 0; i < COLUMNS; ++i) {
                // read the ID name up to '_', then skip the rest plus the _AB and _BB names
                scanf("%255[^_]_%*s\t%*s\t%*s\t", name);
                printf("\t%s", name);
            }
        } else {
            float nums[3];
            for (int i = 0; i < COLUMNS; ++i) {
                scanf("%f %f %f", nums, nums + 1, nums + 2);
                float sum = nums[0] + nums[1]*2 + nums[2]*3;
                printf("\t%2.2f", sum);
            }
        }
        printf("\n");
    }
}
Compile and run it like:
g++ p.cpp -o p
cat file | ./p
Benchmark
With 1 million input lines and 300 columns:
Ed Morton's solution: 2m 34s
C++: 1m 19s
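If you want to reproduce a benchmark like this, here is a sketch (the file names, row count and ID count below are arbitrary choices, not the poster's actual test setup) that generates a synthetic input with awk and times both programs:
# Generate a synthetic file: 1 header + 100,000 data rows with 1,000 IDs (3,006 columns).
# The three values per ID are plain random numbers; they won't sum to 1, which doesn't matter for timing.
awk 'BEGIN {
    srand(42)
    ids = 1000; rows = 100000
    printf "Chr C rsid D A1 A2"
    for (i = 1; i <= ids; i++) printf " ID%d_AA ID%d_AB ID%d_BB", i, i, i
    print ""
    for (r = 1; r <= rows; r++) {
        printf "10 p rsid%d q A G", r
        for (i = 1; i <= 3 * ids; i++) printf " %.2f", rand()
        print ""
    }
}' > big_input.txt

time awk -f tst.awk big_input.txt > /dev/null
# rebuild p with COLUMNS set to 1000 before timing it on this input
time ./p < big_input.txt > /dev/null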

Related

Calculating mean from values in columns specified on the first line using awk

I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of the values (on each data line separately) that share the same locus number (i.e., the same number on the first line), e.g. for data1: the mean of the first three values (the three columns with locus '1': 17.07, 7.11, 10.58), then of the next two values (10.21, 19.34), and then of the next three values (14.69, 3.32, 21.07).
I would like to have output like this
data1 mean1 mean2 mean3
data2 mean1 mean2 mean3
I was thinking about using bash and awk...
Thank you for your advice.
You can use GNU datamash version 1.1.0 or newer (I used the latest version, 1.1.1):
#!/bin/bash
lines=$(wc -l < "$1")
datamash -W transpose < "$1" |
datamash -H groupby 1 mean 3-"$lines" |
datamash transpose
Usage: mean_value.sh input.txt | column -t (column -t is only for pretty-printing; it is not required)
Output:
GroupBy(locus) 1 2 3
mean(data1) 11.586666666667 14.775 13.026666666667
mean(data2) 13.586666666667 18.565 10.933333333333
If it were me, I would use R, not awk:
library(data.table)
x = fread('data.txt')
#> x
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1: locus 1.00 1.00 1.00 2.00 2.00 3.00 3.00 3.00
#2: exon 1.00 2.00 3.00 1.00 2.00 1.00 2.00 3.00
#3: data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
#4: data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
# save first column of names for later
cnames = x$V1
# remove first column
x[,V1:=NULL]
# matrix transpose: makes rows into columns
x = t(x)
# convert back from matrix to data.table
x = data.table(x,keep.rownames=F)
# set the column names
colnames(x) = cnames
#> x
# locus exon data1 data2
#1: 1 1 17.07 21.42
#...
# ditch useless column
x[,exon:=NULL]
#> x
# locus data1 data2
#1: 1 17.07 21.42
# apply mean() function to each column, grouped by locus
x[,lapply(.SD,mean),locus]
# locus data1 data2
#1: 1 11.58667 13.58667
#2: 2 14.77500 18.56500
#3: 3 13.02667 10.93333
for convenience, here's the whole thing again without comments:
library(data.table)
x = fread('data.txt')
cnames = x$V1
x[,V1:=NULL]
x = t(x)
x = data.table(x,keep.rownames=F)
colnames(x) = cnames
x[,exon:=NULL]
x[,lapply(.SD,mean),locus]
awk 'NR==1 { for (i=2; i<=NF; i++) multi[i] = $i }     # remember the locus of each column
NR>2 {
    for (i in multi) {                                 # reset the accumulators for this line
        data[multi[i]]  = 0
        count[multi[i]] = 0
    }
    for (i=2; i<=NF; i++) {
        data[multi[i]]  += $i
        count[multi[i]] += 1
    }
    printf "%s ", $1
    for (i in data)                                    # note: "in" does not guarantee locus order
        printf "%s ", data[i]/count[i]
    print ""
}' <file_name>
Replace <file_name> with your data file
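One caveat with the script above: for (i in data) does not guarantee any particular order for the locus groups. A sketch of a variant that keeps the loci in the order they first appear on line 1 (saved under the hypothetical name order_means.awk):
# order_means.awk: per-locus means, loci kept in first-appearance order
NR == 1 {
    for (i = 2; i <= NF; i++) {
        multi[i] = $i                               # locus label of column i
        if (!($i in seen)) { seen[$i] = 1; order[++n] = $i }
    }
    next
}
NR == 2 { next }                                    # skip the exon line
{
    split("", sum); split("", cnt)                  # reset the accumulators for this data line
    for (i = 2; i <= NF; i++) {
        sum[multi[i]] += $i
        cnt[multi[i]]++
    }
    printf "%s", $1
    for (j = 1; j <= n; j++) printf " %s", sum[order[j]] / cnt[order[j]]
    print ""
}
Run it as awk -f order_means.awk data.txt.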

Min-Max Normalization using AWK

I don't know why I am unable to loop through all the records. Currently it only handles the last record and prints the normalization for it.
Normalization formula:
New_Value = (value - min[i]) / (max[i] - min[i])
Program
{
    for (i = 1; i <= NF; i++)
    {
        if (min[i]=="") { min[i]=$i } # initialise min
        if (max[i]=="") { max[i]=$i } # initialise max
        if ($i<min[i])  { min[i]=$i } # new min
        if ($i>max[i])  { max[i]=$i } # new max
    }
}
END {
    for (j = 1; j <= NF; j++)
    {
        normalized_value[j] = ($j - min[j])/(max[j] - min[j]);
        print $j, normalized_value[j];
    }
}
Dataset
4 14 24 34
3 13 23 33
1 11 21 31
2 12 22 32
5 15 25 35
Current Output
5 1
15 1
25 1
35 1
Required Output
0.75 0.75 0.75 0.75
0.50 0.50 0.50 0.50
0.00 0.00 0.00 0.00
0.25 0.25 0.25 0.25
1.00 1.00 1.00 1.00
Your END block only ever sees the last record read, because $j there refers to the fields of that final line; that is why you only get output for the last row. I would process the file twice, once to determine the minima/maxima, once to calculate the normalized values:
awk '
NR==1 {
    for (i=1; i<=NF; i++) {
        min[i] = $i
        max[i] = $i
    }
    next
}
NR==FNR {
    for (i=1; i<=NF; i++) {
        if      ($i < min[i]) { min[i] = $i }
        else if ($i > max[i]) { max[i] = $i }
    }
    next
}
{
    for (i=1; i<=NF; i++) printf "%.2f%s", ($i-min[i])/(max[i]-min[i]), FS
    print ""
}
' file file
# ^^^^ ^^^^ same file twice!
outputs
0.75 0.75 0.75 0.75
0.50 0.50 0.50 0.50
0.00 0.00 0.00 0.00
0.25 0.25 0.25 0.25
1.00 1.00 1.00 1.00
The given answer reads the same file twice; this can be avoided with the following modified script:
# initialise min, max and the value array (used to reprint the data in END)
NR == 1 {
    for (i=1; i<=NF; i++) {
        value[i] = $i
        min[i] = $i
        max[i] = $i
    }
}
# store each row and track the min and max of each column
NR > 1 {
    for (i=1; i<=NF; i++) {
        value[((NR-1)*NF)+i] = $i
        if ($i < min[i]) { min[i] = $i }
        else if ($i > max[i]) { max[i] = $i }
    }
}
END {
    ncols = NF    # number of columns per row
    nrows = NR    # number of rows read
    for (i=0; i<nrows; i++) {
        for (j=1; j<ncols; j++) {
            printf "%.2f%s", (value[(i*ncols)+j]-min[j])/(max[j]-min[j]), OFS
        }
        # after the inner loop j == ncols, so this prints the last column
        printf "%.2f\n", (value[(i*ncols)+j]-min[j])/(max[j]-min[j])
    }
}
Save the above awk script as norm.awk. Note that, unlike the two-pass version, it keeps the whole file in the value array, so it trades memory for the second read. You can run it from the shell (and redirect the output if needed) as:
awk -f norm.awk data.txt > norm_output.txt
or you can run norm.awk from within vim itself as:
:%!awk -f norm.awk
which will replace the existing values with their min-max normalized values.

Merging similar columns from different files into a matrix

I have files in the following format, where the first column is common amongst all the files:
File1.txt
ID Score
ABCD 0.9
BCBS 0.2
NBNC 0.67
TCGS 0.8
File2.txt
ID Score
ABCD 0.3
BCBS 0.9
NBNC 0.73
TCGS 0.12
File3.txt
ID Score
ABCD 0.23
BCBS 0.65
NBNC 0.94
TCGS 0.56
I want to merge the second column (the Score column) of all the files, keyed on the common first column, and use each file name minus its extension as the column header so I can tell which file each score came from. The matrix would look something like this:
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
$ cat tst.awk
BEGIN { OFS="\t" }
FNR>1 { id[FNR] = $1; score[FNR,ARGIND] = $2 }
END {
    printf "%s%s", "ID", OFS
    for (colNr=1; colNr<=ARGIND; colNr++) {
        sub(/\..*/,"",ARGV[colNr])
        printf "%s%s", ARGV[colNr], (colNr<ARGIND?OFS:ORS)
    }
    for (rowNr=2; rowNr<=FNR; rowNr++) {
        printf "%s%s", id[rowNr], OFS
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", score[rowNr,colNr], (colNr<ARGIND?OFS:ORS)
        }
    }
}
$ awk -f tst.awk File1.txt File2.txt File3.txt
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
Pick some string that can't occur in your input as the OFS; I used a tab.
If you don't have GNU awk, add FNR==1{ ARGIND++ } at the start of the script.
Another alternative
$ awk 'NR==1{$0=$1"\t"FILENAME}1' File1 > all
for f in File{2..6}; do
    paste all <(p $f) > temp && cp temp all
done
where the function p, defined beforehand, is:
p() { awk 'NR==1{print FILENAME;next} {print $2}' $1; }
I copied your data to 6 identical files, File1..File6, and the script produced the output below. Most of the work is setting up the column names.
ID File1 File2 File3 File4 File5 File6
ABCD 0.9 0.9 0.9 0.9 0.9 0.9
BCBS 0.2 0.2 0.2 0.2 0.2 0.2
NBNC 0.67 0.67 0.67 0.67 0.67 0.67
TCGS 0.8 0.8 0.8 0.8 0.8 0.8

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove from the input file all lines in which the BP column lies within an interval given in the interval file and the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
    chr=$(echo $interval | awk '{print $1}')
    START=$(echo $interval | awk '{print $2}')
    STOP=$(echo $interval | awk '{print $3}')
    awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
    mv tmp <input file>
done <
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so matching lines clearly are present in the input file.
Thank you very much for your help.
You can try perl instead of awk. The reason is that in perl you can build a hash of arrays to hold the interval file's data, which makes it easy to look up while processing your input, like:
perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{$h{$F[0]}}, [@F[1..2]];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{$h{$F[1]}}) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input
$. skips the header of the interval file. @F checks the number of columns, and the push builds the hash of arrays.
Your test data is not ideal because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get this result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
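For the record, the underlying issue in the question's loop is exactly what the linked duplicate covers: shell variables such as $chr, $START and $STOP are not expanded inside a single-quoted awk program, so the awk conditions never see their values and every line is printed. A sketch of two pure-awk fixes (the interval file name below is an assumption, since it is cut off in the question):
# 1) Keep the loop, but pass the shell variables to awk with -v
while read -r chr start stop; do
    awk -v chr="$chr" -v start="$start" -v stop="$stop" \
        'NR==1 || $2 != chr || $3 < start || $3 > stop' input_file > tmp && mv tmp input_file
done < interval_file

# 2) Or do it in one awk pass over both files, without rewriting input_file per interval
awk 'NR == FNR { c[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next }
     FNR == 1  { print; next }                       # keep the header of the input file
     {
         for (i = 1; i <= n; i++)
             if ($2 == c[i] && $3 >= lo[i] && $3 <= hi[i]) next
         print
     }' interval_file input_file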

Awk - extracting information from an xyz-format matrix

I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but my knowledge of awk is really lacking and I'd really appreciate any help anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Of course, you could pipe the output to sort -n to get the results in sorted order.
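For example (assuming the same data file as above):
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data | sort -n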
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
Output:
1 0.34
2 0.92
3 0.86
