I'm trying to print out each value of a shell loop variable next to a computed column:
for i in 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40; do
awk -v a="$i" '{printf "%10.2f %10.2f\n", a, ($8*627.509)}' e1.txt > e2.txt
done
But when I open the output file, all I see is:
1.40 -12111939.85
1.40 -12112479.17
1.40 -12112817.98
1.40 -12112997.55
1.40 -12113047.39
1.40 -12112998.93
1.40 -12112873.57
1.40 -12112695.74
1.40 -12112504.02
1.40 -12112346.74
1.40 -12112316.49
1.40 -12112204.51
1.40 -12112149.56
Ignore the second column; it is computed from values read from another file, e1.txt.
As shown, only the last loop value (1.40) appears. I want the loop values 0.80 through 1.40 to be printed on the corresponding lines.
The immediate problem is that the redirection '>' truncates e2.txt on every iteration, so only the last pass (a=1.40) survives; '>>' would append instead. For efficiency, though, I would avoid processing the same file 13 times.
The BEGIN block looks awkward because awk has no array-literal syntax:
awk '
BEGIN {
a = "0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40"
n = split(a, as)
}
{
for (i=1; i <= n; i++)
printf "%10.2f %10.2f\n", as[i], ($8 * 627.509)
}
' e1.txt > e2.txt
If you want all the 0.80 first and all the 1.40 last, you can:
awk '...' e1.txt | sort -g > e2.txt
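If instead you want the Nth line of e1.txt paired with the Nth factor, which is what your 13-line sample output suggests, here is a minimal variant, assuming e1.txt has exactly one line per factor:
awk '
BEGIN {
n = split("0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40", as)
}
FNR <= n {
printf "%10.2f %10.2f\n", as[FNR], ($8 * 627.509)
}
' e1.txt > e2.txt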
I have a dataset in a single column that I would like to split into a number of new columns each time a certain string (in this case 'male_position') is found.
$ cat test.file
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
30.33
40.37
40.37
male_position
0.00
1.05
2.2
4.0
4.0
8.2
25.2
30.1
male_position
1.0
5.0
I would like the script to produce a new tab-separated column each time 'male_position' is encountered, printing each line/data point below it into that column until the next occurrence of 'male_position':
script.awk test.file > output
0.00 0.00 1.0
0.00 1.05 5.0
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
Any ideas?
UPDATE 1
I have tried to adapt code based on this post (Linux split a column into two different columns in a same CSV file):
cat script.awk
BEGIN {
line = 0; # initialize at zero
}
/male_position/ { # every time we hit the delimiter
line = 0; # reset line to zero
}
!/male_position/ { # otherwise
a[line] = a[line]" "$0; # append the input line to the output line
line++; # increase the counter by one
}
END {
for (i in a)
print a[i] # print the output
}
Results....
$ awk -f script.awk test.file
1.05 2.2
1.05 4.0
1.05 4.0
1.05 8.2
3.1 25.2
5.11 30.1
12.74
30.33
40.37
40.37
0.00 0.00 1.0
0.00 1.05 5.0
UPDATE 2
I can recreate the expected output with the test.file case: running script.awk (see above) on Linux with that test file seemed to work. However, that simple example only has a decreasing number of data points between occurrences of the delimiter (male_position). When a later group has more data points than an earlier one, the output fails...
cat test.file2
male_position
0.00
0.00
1.05
1.05
1.05
1.05
3.1
5.11
12.74
male_position
0
5
10
male_position
0
1
2
3
5
awk -f script.awk test.file2
0.00 0 0
0.00 5 1
1.05 10 2
1.05 3
1.05 5
1.05
3.1
5.11
12.74
there is no 'padding' of the lines after the last observation for a given column, so a column with more values than the preceding one has its extra values fall into the previous column (the 3 and the 5 end up in column 2 when they should be in column 3).
Here's a csplit+paste solution
$ csplit --suppress-matched -zs test.file2 /male_position/ {*}
$ ls
test.file2 xx00 xx01 xx02
$ paste xx*
0.00 0 0
0.00 5 1
1.05 10 2
1.05 3
1.05 5
1.05
3.1
5.11
12.74
From man csplit
csplit - split a file into sections determined by context lines
-z, --elide-empty-files
remove empty output files
-s, --quiet, --silent
do not print counts of output file sizes
--suppress-matched
suppress the lines matching PATTERN
/male_position/ is the regex used to split the input file
{*} specifies to create as many splits as possible
use -f and -n options to change the default output file names
paste xx* to paste the files column wise, TAB is default separator
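If your csplit lacks --suppress-matched (it is a GNU extension; BSD/macOS csplit doesn't have it, for example), here's a sketch of the same split done with awk instead, before pasting:
# write each group to its own xxNN file, skipping the male_position lines
awk '/male_position/ { if (out != "") close(out); out = sprintf("xx%02d", n++); next }
     { print > out }' test.file2
paste xx*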
The following awk may help you with the same. It stores every value in a 2-D array indexed by (row, group) and tracks the longest group while reading, so shorter groups come out padded:
awk '/male_position/{count++;val=1;next} {array[val,count]=$0;max=val>max?val:max;val++} END{for(i=1;i<=max;i++){for(j=1;j<=count;j++){printf("%s%s",array[i,j],j==count?ORS:OFS)}}}' OFS="\t" Input_file
Adding a non-one-liner form of the solution too:
awk '
/male_position/{
count++;
val=1;
next}
{
array[val,count]=$0;
max=val>max?val:max;
val++
}
END{
for(i=1;i<=max;i++){
for(j=1;j<=count;j++){ printf("%s%s",array[i,j],j==count?ORS:OFS) }}
}
' OFS="\t" Input_file
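Since the longest group is tracked in the data block (so the final group is counted too), running this on test.file2 should produce the padded layout the question asks for (trailing tabs not shown):
0.00    0       0
0.00    5       1
1.05    10      2
1.05            3
1.05            5
1.05
3.1
5.11
12.74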
For the following input data,
Chr C rsid D A1 A2 ID1_AA ID1_AB ID1_BB ID2_AA ID2_AB ID2_BB ID3_AA ID3_AB ID3_BB ID4_AA ID4_AB ID4_BB ID5_AA ID5_AB ID5_BB
10 p rsid1 q A G 0.00 0.85 0.15 0.70 0.10 0.20 0.40 0.50 0.10 0.30 0.30 0.40 0.10 0.20 0.80
10 p rsid2 q C T 0.90 0.10 0.00 0.80 0.10 0.10 0.70 0.10 0.20 0.30 0.40 0.30 0.30 0.20 0.40
10 p rsid3 q A G 0.40 0.50 0.10 0.80 0.20 0.00 0.20 0.30 0.50 0.50 0.30 0.20 0.20 0.30 0.40
I need to generate the following output data.
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
The table shows, for every ID (ID1, ID2, ID3, etc.), the sum of its 3 columns (_AA, _AB & _BB), each multiplied by a constant factor (1, 2, 3).
Example: for rsID1 --> ID1 -> (ID1_AA*1 + ID1_AB*2 + ID1_BB*3) = (0.00*1 + 0.85*2 + 0.15*3) = 2.15
I wrote the following AWK script to accomplish the task, and it works fine.
Please note: I'm a complete beginner in AWK.
awk '{
if(NR <= 1) { # header line
str = $3;
for(i=7; i<=NF; i+=3) {
split($i,s,"_");
str = str"\t"s[1]
}
print str
} else { # data line
k = 0;
for(i=7; i<=NF; i+=3)
arr[k++] = $i*1 + $(i+1)*2 + $(i+2)*3;
str=$3;
for(i=0; i<(NF-6)/3; i++)
str = str"\t"arr[i];
print str
}
}' input.txt > out.txt
Later I was told the input data can be as big as 60 million rows and 300 thousand columns, which means the output will be about 60M x 100K. If I'm not wrong, AWK reads one line at a time, so at any instant there will be 300K columns of data held in memory. Is that a problem? Given the situation, how can I improve my code?
Both approaches have pros and cons, and both can handle any number of rows/columns since they only store one row at a time in memory. I'd still use the approach below rather than the answer posted by Akshay: with 300,000 columns per line, his script tests FNR==1 almost 100,000 times per line, whereas the script below performs that test once per line, so it should be noticeably more efficient:
$ cat tst.awk
BEGIN { OFS="\t" }
{
printf "%s", $3
if (NR==1) {
gsub(/_[^[:space:]]+/,"")
for (i=7; i<=NF; i+=3) {
printf "%s%s", OFS, $i
}
}
else {
for (i=7; i<=NF; i+=3) {
printf "%s%.2f", OFS, $i + $(i+1)*2 + $(i+2)*3
}
}
print ""
}
$ awk -f tst.awk file
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
I highly recommend you read the book Effective Awk Programming, 4th Edition, by Arnold Robbins to learn what awk is and how to use it.
awk -v OFS="\t" '
{
printf("%s",$3);
for(i=7;i<=NF; i+=3)
{
if(FNR==1)
{
sub(/_.*/,"",$i)
f = $i
}else
{
f = sprintf("%5.2f",$i*1 + $(i+1)*2 + $(i+2)*3)
}
printf("%s%s",OFS,f)
}
print ""
}
' file
Output
rsid ID1 ID2 ID3 ID4 ID5
rsid1 2.15 1.50 1.70 2.10 2.90
rsid2 1.10 1.30 1.50 2.00 1.90
rsid3 1.70 1.20 2.30 1.70 2.00
Do you think it would help to use a low-level language like C?
C or C++ is not automagically faster than awk, and the code is less readable and more fragile.
I'll show another solution using C++, for comparison:
// p.cpp
#include <stdio.h>

// adjust to the number of IDs in the input
#define COLUMNS 5

int main() {
    char column3[256];
    bool header = true;
    // consume the first 6 fields of each row, keeping only the 3rd (the rsid)
    while (scanf("%*s\t%*s\t%255s\t%*s\t%*s\t%*s\t", column3) == 1) {
        printf("%s", column3);
        if (header) {
            header = false;
            char name[256];
            for (int i = 0; i < COLUMNS; ++i) {
                // keep the part of the first per-ID header field before "_",
                // then skip the other two header fields of that ID
                scanf("%255[^_]_%*s\t%*s\t%*s\t", name);
                printf("\t%s", name);
            }
        } else {
            float nums[3];
            for (int i = 0; i < COLUMNS; ++i) {
                // weighted sum of the _AA, _AB and _BB values
                scanf("%f %f %f", nums, nums + 1, nums + 2);
                float sum = nums[0] + nums[1] * 2 + nums[2] * 3;
                printf("\t%2.2f", sum);
            }
        }
        printf("\n");
    }
}
Compile and run it like:
g++ p.cpp -o p
cat file | ./p
Benchmark
with 1 million input lines and 300 columns:
Ed Morton solution: 2m 34s
c++: 1m 19s
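For reference, here is a sketch of how a comparable test file could be generated (this generator is my assumption, not the harness actually used; the file name, row count, and ID count are illustrative):
# header plus 1,000,000 data rows with 100 IDs -> 300 value columns
awk 'BEGIN {
    srand(42)
    printf "Chr C rsid D A1 A2"
    for (i = 1; i <= 100; i++) printf " ID%d_AA ID%d_AB ID%d_BB", i, i, i
    print ""
    for (r = 1; r <= 1000000; r++) {
        printf "10 p rsid%d q A G", r
        for (c = 1; c <= 300; c++) printf " %.2f", rand()
        print ""
    }
}' > file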
I have files with the following format, where the first column is common among all the files:
File1.txt
ID Score
ABCD 0.9
BCBS 0.2
NBNC 0.67
TCGS 0.8
File2.txt
ID Score
ABCD 0.3
BCBS 0.9
NBNC 0.73
TCGS 0.12
File3.txt
ID Score
ABCD 0.23
BCBS 0.65
NBNC 0.94
TCGS 0.56
I want to merge the second (Score) column of all the files, keyed on the common first column, and use each file's name minus its extension as the column header so I can tell where each score came from. The resulting matrix would look something like:
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
$ cat tst.awk
BEGIN { OFS="\t" }
FNR>1 { id[FNR] = $1; score[FNR,ARGIND] = $2 }
END {
printf "%s%s", "ID", OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
sub(/\..*/,"",ARGV[colNr])
printf "%s%s", ARGV[colNr], (colNr<ARGIND?OFS:ORS)
}
for (rowNr=2; rowNr<=FNR; rowNr++) {
printf "%s%s", id[rowNr], OFS
for (colNr=1; colNr<=ARGIND; colNr++) {
printf "%s%s", score[rowNr,colNr], (colNr<ARGIND?OFS:ORS)
}
}
}
$ awk -f tst.awk File1.txt File2.txt File3.txt
ID File1 File2 File3
ABCD 0.9 0.3 0.23
BCBS 0.2 0.9 0.65
NBNC 0.67 0.73 0.94
TCGS 0.8 0.12 0.56
Pick some string that can't occur in your input as the OFS; I used a tab.
If you don't have GNU awk add FNR==1{ ARGIND++ } at the start of the script.
Another alternative
$ awk 'NR==1{$0=$1"\t"FILENAME}1' File1 > all;
for f in File{2..6};
do
paste all <(p $f) > temp && cp temp all;
done
where the function p is defined as:
p() { awk 'NR==1{print FILENAME;next} {print $2}' $1; }
I copied your data into six identical files, File1 through File6, and the script produced the output below. Most of the work is setting up the column names:
ID File1 File2 File3 File4 File5 File6
ABCD 0.9 0.9 0.9 0.9 0.9 0.9
BCBS 0.2 0.2 0.2 0.2 0.2 0.2
NBNC 0.67 0.67 0.67 0.67 0.67 0.67
TCGS 0.8 0.8 0.8 0.8 0.8 0.8
This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done < <interval file>
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
You can try perl instead of awk. The reason is that in perl you can create a hash of arrays to store the interval file's data, which makes it easy to look up when processing your input:
perl -lane '
@F == 3 && do {
push @{$h{$F[0]}}, [@F[1..2]];
next;
};
@F == 7 && do {
$ok = 1;
if (exists $h{$F[1]}) {
for (@{$h{$F[1]}}) {
if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
$ok = 0;
last;
}
}
}
printf qq|%s\n|, $_ if $ok;
};
' interval input
@F holds the autosplit fields; lines with 3 fields come from the interval file and are pushed into the hash of arrays keyed by chromosome, while lines with 7 fields come from the input file. The input header passes through untouched because its CHR column value ('CHR') never exists as a key in %h.
Your test data is not well suited for testing because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get as result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
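For completeness, here is a sketch of the same filter as a single awk pass (using the same interval and input file names as the perl run above). It also sidesteps the bug in your loop: shell variables like $chr, $START and $STOP are never expanded inside a single-quoted awk program (awk reads $chr as a field reference), which is why no lines were removed. Reading the interval file as data avoids the quoting problem entirely:
awk '
NR==FNR { chr[++n]=$1; lo[n]=$2; hi[n]=$3; next }  # first file: store the intervals
FNR==1 { print; next }                             # keep the input header
{
    for (i=1; i<=n; i++)
        if ($2==chr[i] && $3>lo[i] && $3<hi[i]) next  # BP strictly inside an interval: drop the line
    print
}' interval input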
I have an x y z matrix of the format:
1 1 0.02
1 2 0.10
1 4 0.22
2 1 0.70
2 2 0.22
3 2 0.44
3 3 0.42
...and so on. I'm interested in summing all of the z values (column 3) for a particular x value (column 1) and printing the output on separate lines (with the x value as a prefix), such that the output for the previous example would appear as:
1 0.34
2 0.92
3 0.86
I have a strong feeling that awk is the right tool for the job, but my knowledge of awk is really lacking and I'd really appreciate any help that anyone can offer.
Thanks in advance.
I agree that awk is a good tool for this job — this is pretty much exactly the sort of task it was designed for.
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data
For the given data, I got:
2 0.92
3 0.86
1 0.34
Clearly, you could pipe the output to sort -n and get the results in sorted order.
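For example:
awk '{ sum[$1] += $3 } END { for (i in sum) print i, sum[i] }' data | sort -n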
To get that in sorted order with awk, you have to go outside the realm of POSIX awk and use the GNU awk extension function asorti:
gawk '{ sum[$1] += $3 }
END { n = asorti(sum, map); for (i = 1; i <= n; i++) print map[i], sum[map[i]] }' data
Output:
1 0.34
2 0.92
3 0.86