Merge many tab separated files based on first column - bash

I have many TSV files in a directory, each with only three columns. I want to merge all of them based on the first two columns (all columns have headers that I need to maintain): where a key is present in a file, that file's third-column value should be appended, and where the key is missing from a file, NA should be appended instead (see the example). Files may have different numbers of lines and are not ordered by the first column, although that can easily be fixed with sort.
I have tried join, but that works nicely for only two files. Can join be expanded to all files in a directory? Here is an example with just three files:
S01.tsv
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
S02.tsv
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
S03.tsv
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
Outfile should be:
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3
OK, I've tried awk, adapting the help from here, but was not successful:
BEGIN { OFS="\t" }              # tab separated columns
FNR==1 { f++ }                  # counter of files
{
    a[0][$1]=$1                 # reset the key for every record
    for(i=2;i<=NF;i++)          # for each non-key element
        a[f][$1]=a[f][$1] $i ( i==NF?"":OFS )   # combine them to array element
}
END {                           # in the end
    for(i in a[0])              # go thru every key
        for(j=0;j<=f;j++)       # and all related array elements
            printf "%s%s", a[j][i], (j==f?ORS:OFS)  # output them, nonexistent will output empty
}

I would harness GNU AWK for this task in the following way. Let S01.tsv content be
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
and S02.tsv content be
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
and S03.tsv content be
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
then
awk 'BEGIN { OFS="\t" }
NR==1 { title = $1 OFS $2 }
{ arr[$1 OFS $2][FILENAME] = $3 }
END {
    print title, arr[title]["S01.tsv"], arr[title]["S02.tsv"], arr[title]["S03.tsv"]
    delete arr[title]
    for (i in arr) {
        print i, ("S01.tsv" in arr[i] ? arr[i]["S01.tsv"] : "NA"), ("S02.tsv" in arr[i] ? arr[i]["S02.tsv"] : "NA"), ("S03.tsv" in arr[i] ? arr[i]["S03.tsv"] : "NA")
    }
}' S01.tsv S02.tsv S03.tsv
gives output
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
EU236327 1 NA 0.14 0.5
AM930424 1 0.3 0.25 NA
EU434346 2 NA 0.2 0.3
AY664038 2 0.5 NA NA
Explanation: I store the data in the 2D array arr, using the values of the 1st and 2nd columns concatenated with the output field separator as the first dimension and the filename as the second dimension. The array values are the values of the 3rd column. After the data is collected, I start by printing the title (header) row, which I then delete from the array; then I iterate over the first dimension of the array and, for each element, print the key followed by the value from each file, or NA if there was no value. Observe that I use the in check rather than testing the truthiness of the value itself, as the latter would turn 0 values into NAs. Disclaimer: this solution assumes you accept any order of output rows beyond the header; if that does not hold, do not use this solution.
(tested in GNU Awk 5.0.1)
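If the three file names should not be hard-coded, the same idea can be generalized over however many files are passed on the command line. A sketch of that generalization (my own extension of the answer above, also requiring GNU AWK for arrays of arrays; row order after the header is again unspecified):

# merge.awk - run as: gawk -f merge.awk *.tsv
BEGIN { OFS = "\t" }
FNR == 1 {                            # header line of each file
    files[++nf] = FILENAME            # remember file order
    hdr[FILENAME] = $3                # per-file column name, e.g. S01
    title = $1 OFS $2                 # "Accesion Val", identical in every file
    next
}
{ arr[$1 OFS $2][FILENAME] = $3 }     # key = first two columns, value per file
END {
    printf "%s", title
    for (f = 1; f <= nf; f++) printf "%s%s", OFS, hdr[files[f]]
    print ""
    for (key in arr) {
        printf "%s", key
        for (f = 1; f <= nf; f++)
            printf "%s%s", OFS, (files[f] in arr[key] ? arr[key][files[f]] : "NA")
        print ""
    }
}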

Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
FNR == 1 {
    if ( NR == 1 ) {
        numCols = split($0,hdrs)
    }
    else {
        hdrs[++numCols] = $3
    }
    next
}
{
    accsValsCols2ss[$1][$2][numCols] = $3
}
END {
    for ( colNr=1; colNr<=numCols; colNr++ ) {
        printf "%s%s", hdrs[colNr], (colNr<numCols ? OFS : ORS)
    }
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for ( acc in accsValsCols2ss ) {
        PROCINFO["sorted_in"] = "#ind_num_asc"
        for ( val in accsValsCols2ss[acc] ) {
            printf "%s%s%s", acc, OFS, val
            for ( colNr=3; colNr<=numCols; colNr++ ) {
                s = ( colNr in accsValsCols2ss[acc][val] ? accsValsCols2ss[acc][val][colNr] : "NA" )
                printf "%s%s", OFS, s
            }
            print ""
        }
    }
}
$ awk -f tst.awk S01.tsv S02.tsv S03.tsv
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3

Related

find lowest value and index of floating point array awk, sed, sort

I have the following array:
echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
I have written code to sort the values and also get the index numbers:
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1
0.2 11
0.2 19
0.3 1
0.3 10
0.3 2
0.3 9
0.4 12
0.4 13
0.4 14
0.4 15
0.4 18
0.4 3
0.4 4
0.4 5
0.4 6
0.4 7
0.4 8
0.5 16
0.5 17
I am having a difficult time extracting only the rows which have the lowest value in the first column (i.e., the lowest values in the array, overall). For example, the desired final product for the above example would be:
0.2 11
0.2 19
It should be able to handle instances of one or multiple lowest-value indices. The solution does not need to use awk, sort, or sed at all - anything could work (this is just as far as I have gotten with achieving the final task).
Print lines until the number in the first column changes:
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1 |
awk 'length(last) == 0 || last == $1 { last=$1; print; }'
Notes:
It's best to always quote variable expansions: echo "$array".
If you don't quote $array, you could just use printf "%s\n" $array.
You could use nl to number the lines (but the column order would be different); see the sketch below.
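For instance, a minimal sketch of that nl variant (my own illustration, not part of the original answer; the index ends up in the first column):

printf "%s\n" $array | nl | sort -g -k2,2 | awk 'NR==1 { min=$2 } $2==min'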
Using the asort() function in GNU awk:
awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]} '
Demo:
$echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
$echo $array | awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]} '
0.2 11
0.2 19
$
Explanation:
{split($0,a); -- initialize array a from the input record
for (i in a) a[i]=a[i]" "i; -- append the current field index to each value
n=asort(a); -- call the array sort function and store the number of elements in variable n
for (i = 1; i <= 2; i++) -- loop over the first 2 elements of the array
print a[i]}
Documentation on asort()
P.S. Storing the number of elements in n was not actually required.
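Note that the loop above hard-codes printing 2 elements, which only happens to match this demo (two 0.2 values). A sketch that handles any number of minimum values, without sorting at all (my own variant, not from the original answer):

echo "$array" | awk '{
    min = $1 + 0                  # assume the first field is the minimum
    for (i = 2; i <= NF; i++)     # scan the rest numerically
        if ($i + 0 < min) min = $i + 0
    for (i = 1; i <= NF; i++)     # print every field equal to the minimum, with its index
        if ($i + 0 == min) print $i, i
}'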

Passing for loop using non-integers to awk

I am trying to write code which will achieve:
Where $7 is less than or equal to $i (0 to 1 in increments of 0.05), print the line and pass it to a line count. The way I tried to do this was:
for i in $(seq 0 0.05 1); do awk '{if ($7 <= $i) print $0}' file.txt | wc -l ; done
This just ends up returning the line count of the full file (~40 million lines) for each instance of $i. When, for example, using $7 <= 0.00, it should return ~67K.
I feel like there may be a way to do this within awk, but I have not seen any suggestions which allow for non-integers.
Thanks in advance.
Inside single quotes the shell never expands $i, so awk sees a literal $i; and since awk's own variable i is uninitialized there, $i evaluates to $0, the whole line. Pass $i to awk as a variable with -v, like so:
for i in $(seq 0 0.05 1); do awk -v i=$i '{if ($7 <= i) print $0}' file.txt | wc -l ; done
Some made up data:
$ cat file.txt
1 2 3 4 5 6 7 a b c d e f
1 2 3 4 5 6 0.6 a b c
1 2 3 4 5 6 0.57 a b c d e f g h i j
1 2 3 4 5 6 1 a b c d e f g
1 2 3 4 5 6 0.21 a b
1 2 3 4 5 6 0.02 x y z
1 2 3 4 5 6 0.00 x y z l j k
One possible 100% awk solution:
awk '
BEGIN { line_count=0 }
{
    printf "================= %s\n", $0
    for (i=0; i<=20; i++) {
        if ($7 <= i/20) {
            printf "matching seq : %1.2f\n", i/20
            line_count++
            seq_count[i]++
            next
        }
    }
}
END {
    printf "=================\n\n"
    for (i=0; i<=20; i++) {
        if (seq_count[i] > 0) {
            printf "seq = %1.2f : %8s (count)\n", i/20, seq_count[i]
        }
    }
    printf "\nseq = all : %8s (count)\n", line_count
}
' file.txt
# the output:
================= 1 2 3 4 5 6 7 a b c d e f
================= 1 2 3 4 5 6 0.6 a b c
matching seq : 0.60
================= 1 2 3 4 5 6 0.57 a b c d e f g h i j
matching seq : 0.60
================= 1 2 3 4 5 6 1 a b c d e f g
matching seq : 1.00
================= 1 2 3 4 5 6 0.21 a b
matching seq : 0.25
================= 1 2 3 4 5 6 0.02 x y z
matching seq : 0.05
================= 1 2 3 4 5 6 0.00 x y z l j k
matching seq : 0.00
=================
seq = 0.00 : 1 (count)
seq = 0.05 : 1 (count)
seq = 0.25 : 1 (count)
seq = 0.60 : 2 (count)
seq = 1.00 : 1 (count)
seq = all : 6 (count)
BEGIN { line_count=0 } : initialize a total line counter
the printf "=======..." statement is merely for debug purposes; it prints every line from file.txt as it's processed
for (i=0; i<=20; i++) : depending on the implementation, some versions of awk may have rounding/accuracy problems with non-integer numbers in sequences (eg, incrementing by 0.05), so we use whole integers for our sequence and divide by 20 (for this particular case) to provide our 0.05 increments during follow-on testing; see the short demo after these notes
$7 <= i/20 : if field #7 is less than or equal to (i/20) ...
printf "matching seq ... : print the sequence value we just matched on (i/20)
line_count++ : add '1' to our total line counter
seq_count[i]++ : add '1' to our sequence counter array
next : break out of our sequence loop (since we found our matching sequence value (i/20)) and process the next line in the file
END ... : print out our line counts
for (i=0; ...) / if / printf : loop through our array of sequences, printing the line count for each sequence (i/20)
printf "\nseq = all... : print out our total line count
NOTE: Some of the awk code can be further reduced but I'll leave this as is since it's a little easier to understand if you're new to awk.
One (obvious?) benefit of a 100% awk solution is that the sequence/looping construct is internal to awk, allowing us to limit ourselves to a single pass through the input file (file.txt); when the sequence/looping construct is outside of awk, we have to process the input file once per pass through the sequence/loop (for this exercise that would mean processing the input file 21 times!).
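As a quick demonstration of the rounding caveat mentioned in the notes above (my own illustration; the exact counts are implementation-dependent):

# on a typical IEEE-double awk this prints 20, not 21, because
# twenty additions of 0.05 drift slightly above 1
awk 'BEGIN { for (x = 0; x <= 1; x += 0.05) n++; print n }'
# the integer loop is exact and prints 21
awk 'BEGIN { for (i = 0; i <= 20; i++) n++; print n }'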
Using a bit of guesswork as to what you actually want to accomplish, I came up with this:
awk '{ for (i=20; 20*$7<=i && i>0; i--) bucket[i]++ }
END { for (i=1; i<=20; i++) print bucket[i] " lines where $7 <= " i/20 }'
With the mock data from mark's second answer I get this output:
2 lines where $7 <= 0.05
2 lines where $7 <= 0.1
2 lines where $7 <= 0.15
2 lines where $7 <= 0.2
3 lines where $7 <= 0.25
3 lines where $7 <= 0.3
3 lines where $7 <= 0.35
3 lines where $7 <= 0.4
3 lines where $7 <= 0.45
3 lines where $7 <= 0.5
3 lines where $7 <= 0.55
5 lines where $7 <= 0.6
5 lines where $7 <= 0.65
5 lines where $7 <= 0.7
5 lines where $7 <= 0.75
5 lines where $7 <= 0.8
5 lines where $7 <= 0.85
5 lines where $7 <= 0.9
5 lines where $7 <= 0.95
6 lines where $7 <= 1

remove lines based on value of two columns

I have a huge file (my_file.txt) with ~ 8,000,000 lines that looks like this:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs374183434 0 NA -2.22383195384362
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
I want to find the duplicates based on the first three columns and then remove the line with the lower value in the 7th column. The first part I can accomplish with:
awk -F"\t" '!seen[$2, $3]++' my_file.txt
But I don't know how to do the part about removing the duplicate with the lower value; the desired output would be this one:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
Speed is an issue, so awk, sed, or another bash command would be preferred.
Thanks
$ awk '(i=$1 FS $2 FS $3) && !(i in seventh) || seventh[i] < $7 {seventh[i]=$7; all[i]=$0} END {for(i in all) print all[i]}' my_file.txt
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13110 13110 rs540538026 0 NA -1.33177622457982
Thanks to @fedorqui for the advanced indexing. :D
Explained:
(i=$1 FS $2 FS $3) && !(i in seventh) || $7 > seventh[i] {
                      # set index to first 3 fields
                      # AND index not yet stored in the array
                      # OR the seventh field is greater than the previously
                      # stored seventh field for the same index:
    seventh[i]=$7     # new biggest value
    all[i]=$0         # store that record
}
END {
    for(i in all)     # for all stored records with the biggest seventh value
        print all[i]  # print them
}
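Since speed matters at ~8,000,000 lines, a sort-based pipeline may also be worth benchmarking; a sketch, assuming tab-separated input (my own alternative, not from the answer above): sort so that within each 3-column key the highest 7th-column value comes first, then keep only the first line per key. Note the output is ordered by key rather than by original position.

sort -t$'\t' -k1,1 -k2,2n -k3,3n -k7,7gr my_file.txt |
awk -F'\t' '!seen[$1,$2,$3]++'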

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and in which the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
    chr=$(echo $interval | awk '{print $1}')
    START=$(echo $interval | awk '{print $2}')
    STOP=$(echo $interval | awk '{print $3}')
    awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
    mv tmp <input file>
done <
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
You can try perl instead of awk. The reason is that in perl you can create a hash of arrays to save the data from the interval file and look it up more easily when processing your input, like:
perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{ $h{$F[0]} }, [ @F[1..2] ];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{ $h{$F[1]} }) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input
$. == 1 && next skips the header of the interval file. @F checks the number of columns, and the push builds the hash of arrays.
Your test data is not well suited because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
Running it then gives this result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
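For completeness, the same two-pass idea can be sketched in plain awk, which is closer to what the question attempted (my own untested variant; it assumes whitespace-separated columns, treats the interval bounds as inclusive, and always passes the input header through):

# run as: awk -f filter.awk interval_file input_file
FNR == NR {                       # first file: store intervals per chromosome
    n[$1]++
    start[$1, n[$1]] = $2
    stop[$1, n[$1]]  = $3
    next
}
FNR == 1 { print; next }          # second file: keep the header
{
    for (i = 1; i <= n[$2]; i++)  # $2 = CHR, $3 = BP
        if ($3 >= start[$2, i] && $3 <= stop[$2, i])
            next                  # BP falls inside an interval: drop the line
    print
}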

Matching numbers in two different files using awk

I have two files (f1 and f2), both with three columns but of different lengths. I would like to create a new file of four columns in the following way:
f1:
1 2 0.2
1 3 0.5
1 4 0.2
2 2 0.5
2 3 0.9
f2:
1 4 0.3
1 5 0.2
2 3 0.6
If the numbers in the first two columns are present in both files, then print those two numbers followed by the third number from each file (e.g. 1 4 is in both, so f3 should contain 1 4 0.2 0.3); otherwise, if the first two numbers are missing from f2, just print a zero in the fourth column.
The complete result for this example should be:
f3
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
The script that I wrote is the following:
awk '{str1=$1; str2=$2; str3=$3;
      getline < "f2";
      if ($1==str1 && $2==str2)
          print str1, str2, str3, $3 > "f3";
      else
          print str1, str2, str3, 0 > "f3";
     }' f1
but it only checks whether the same two numbers are in the corresponding row of f2 (it does not go through the whole file), giving as results:
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0
2 2 0.5 0
2 3 0.9 0
This awk should work: the first pass (FNR==NR is true only while reading f2) stores f2's third column in array a keyed by the first two fields; the second pass prints each line of f1 followed by the stored value, or 0 when there is no match.
awk 'FNR==NR{a[$1,$2]=$3;next} {print $0, (a[$1,$2])? a[$1,$2]:0}' f2 f1
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
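One caveat, echoing the in-check note from the first question above: (a[$1,$2]) tests the value's truthiness, so a third column of 0 or an empty string in f2 would be treated as a missing key. A sketch of the more robust in form (my own tweak):

awk 'FNR==NR { a[$1,$2]=$3; next } { print $0, (($1,$2) in a) ? a[$1,$2] : 0 }' f2 f1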
