There are multiple threads (here and here) explaining how to merge two files using awk, for example.
My problem is a bit more complicated since my files are huge: file1.tsv is 288 GB with 109 columns and file2.tsv is 16 GB with 4 columns. I would like to join these files on the first two columns:
file1.tsv (tab-separated) with 109 columns (here showing first 4 and last column):
CHROM POS REF ALT ... FILTER
chr1 10031 T C ... AC0;AS_VQSR
chr1 10037 T C ... AS_VQSR
chr1 10040 T A ... PASS
chr1 10043 T C ... AS_VQSR
chr1 10055 T C ... AS_VQSR
chr1 10057 A C ... AC0
file2.tsv (tab-separated) with 4 columns:
CHROM POS CHROM_hg19 POS_hg19
chr1 10031 chr1 10034
chr1 10037 chr1 10042
chr1 10043 chr1 10084
chr1 10055 chr1 10253
chr1 10057 chr1 10434
I wish to append the last two columns of file2.tsv to file1.tsv by matching on CHROM and POS, while keeping all non-matching rows from file1.tsv:
file3.txt
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR chr1 10084
chr1 10055 T C ... AS_VQSR chr1 10253
chr1 10057 A C ... AC0 chr1 10434
But as you may have figured, these files are big. I tried the following:
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1.tsv file2.tsv
As soon as I hit Enter, I saw my memory usage rocket, with no results being produced. I am unsure whether this would eventually produce the correct result, or how much memory it would use. Is there a better way to join my files using awk or any other Bash tools?
Thank you in advance.
With join, sed and bash (Process Substitution):
join -t $'\t' -a 1 <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt
This solution assumes that the first two columns are sorted together in ascending order in both files.
See: man join
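If you also want the "- -" placeholders from your desired output for rows without a match, GNU join can fill in missing fields. A sketch, assuming GNU coreutils (note that -e only takes effect together with -o):
join -t $'\t' -a 1 -e '-' -o auto <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt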
If all else fails you could brute-force it: read file1 line by line, keep the current line of file2 on hand, and advance file2 only when its key matches the current file1 line. The advantage of that approach is that very little is stored in memory, so it should work no matter how large your files are.
The script below does exactly that. It assumes both files are sorted identically on the first two columns and that every CHROM/POS key in file2.tsv also appears in file1.tsv (true of your sample, where file2 maps a subset of file1's positions), so file2 never needs to be rewound:
$ cat tst.awk
BEGIN {
    FS = OFS = "\t"
    f1name = ARGV[1]
    f2name = ARGV[2]
    ARGV[1] = ARGV[2] = ""

    # prime the first line of file2
    if ( (f2stat = (getline line2 < f2name)) > 0 ) {
        split(line2, f2)
        f2key = f2[1] FS f2[2]
    }

    while ( (getline line1 < f1name) > 0 ) {
        split(line1, f1)
        f1key = f1[1] FS f1[2]
        if ( (f2stat > 0) && (f1key == f2key) ) {
            print line1, f2[3], f2[4]
            # advance file2 only after its current line has been consumed
            if ( (f2stat = (getline line2 < f2name)) > 0 ) {
                split(line2, f2)
                f2key = f2[1] FS f2[2]
            }
        }
        else {
            # no mapping for this CHROM/POS: emit placeholders
            print line1, "-", "-"
        }
    }
}
$ awk -f tst.awk file1.tsv file2.tsv
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR chr1 10084
chr1 10055 T C ... AS_VQSR chr1 10253
chr1 10057 A C ... AC0 chr1 10434
I have a text file and want to output the rows whose first 4 columns appear exactly three times in the file:
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears exactly three times, I want to add two columns listing the other two samples it appears in, so the output from the above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R, but the file is too large to read in, so I am looking for a solution that works in Linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
Edit: Thanks for all these informative answers. I selected the one closest to how I am used to working, but the other answers look great too and I will learn from them.
Using GNU datamash, tr and awk, assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
First, use datamash to sort the input file, group it on the first four fields, and collapse the values of the 5th field into a comma-separated list.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1,sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr to convert the commas to tabs, and finally use awk to print only the rows with seven fields, i.e. the groups with exactly three samples (4 key fields + 3 sample fields).
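If you'd rather make the "exactly three" condition explicit instead of inferring it from the field count, datamash can apply several operations per group. A sketch, assuming a GNU datamash version that supports multiple operations (count column, filter on it, then drop it):
$ datamash -s -g1,2,3,4 count 5 collapse 5 < file | awk -F'\t' '$5==3' | cut -f1-4,6 | tr ',' '\t'
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3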
Using awk:
awk '
BEGIN{ FS=OFS="\t" }
{
idx=$1 FS $2 FS $3 FS $4
cnt[idx]++
data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
}
END{
for (i in cnt)
if (cnt[i]==3) print i, data[i]
}
' file
Maintain two arrays keyed on the first four fields.
cnt increments a counter whenever a record with the same key is encountered, and data appends the 5th field using a tab as separator.
In the END block, loop over the cnt array and print the key together with the value of the data array whenever the count is exactly three.
For fun, a solution using sqlite, wrapped in a shell script that takes the data file as its only argument:
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
Then:
$ ./demo.sh input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = (cnt++ ? samples OFS : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev, samples }
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort uses paging and temporary files to handle input that's too large to fit in memory, so it will successfully handle larger input than most other tools can, while the awk script stores almost nothing in memory.
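If the sort itself becomes the bottleneck on a huge file, GNU sort has a few useful knobs. A sketch, assuming GNU coreutils (the buffer size, temp directory and thread count are illustrative):
sort -S 2G -T /big/tmp --parallel=4 -k1,4 file | awk -f tst.awk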
I'm trying to replace the values of multiple columns in a file using awk. The reason for using awk is that the file is very large and can't be loaded into memory; I've already tried with pandas (Python).
I have a large database as a text file. Here is an example of the data in the file (tab-delimited):
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 7 1 1 2 5 7
chr1 10 T A 7 1 1 3 0 1
chr1 10 T G 7 2 1 1 8 2
chr1 11 None None 2 0 0 0 5 4
chr1 11 G T 2 1 0 0 2 3
For rows where the first two columns (CHROM, POS) are the same, I have to sum the values of the columns whose header contains '_00'.
So, the expected output, is:
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 21 4 3 6 13 10
chr1 10 T A 21 4 3 6 13 10
chr1 10 T G 21 4 3 6 13 10
chr1 11 None None 4 1 0 0 7 7
chr1 11 G T 4 1 0 0 7 7
I don't know how to do this since I'm very new to programming; so far I have the following awk code:
awk -F'\t' 'FNR==1{next};
{keys[$1"\t"$2]
for (i=5;i<=10;i++)
{sum[$1"\t"$2, i] += $i}
}END {for (key in keys) { printf "%s", key
for (i=5;i<=10;i++) {printf "%s%s", "\t", sum[key,i]} printf "\n"}}' OFS='\t' out.txt
With this code, using the first text file as 'out.txt', I get:
chr1 10 21 4 3 6 13 10
chr1 11 4 1 0 0 7 7
Now I'm trying to replace the six values in every row with chr1 10 with those from the first output row, and the values in every row with chr1 11 with those from the second output row.
I have managed to change the value in one column with this code:
awk -F"\t" 'NR==FNR{h[$1"\t"$2]=$3;next}
{
printf $1"\t"$2"\t"$3"\t"$4"\t"h[$1"\t"$2]"\t";
for (i=6;i<=NF;i++)
{printf "%s",$i "\t"};
printf "\n"
}' OFS="\t" file1 file2
but I need to do the same for all of the columns.
How can I do it using similar code?
Note: I have more columns that don't have '_00' in the header name.
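For reference, the whole job can also be done in a single two-pass awk program that keeps only the per-(CHROM, POS) sums in memory, not the file itself. A minimal sketch, assuming the data is in data.tsv (an illustrative name) and the '_00' columns are always columns 5-10 as in the sample:
awk -F'\t' -v OFS='\t' '
NR==FNR {                        # first pass: accumulate the per CHROM,POS sums
    if (FNR > 1)
        for (i = 5; i <= 10; i++)
            sum[$1 FS $2, i] += $i
    next
}
FNR == 1 { print; next }         # second pass: pass the header through
{
    for (i = 5; i <= 10; i++)    # overwrite each _00 column with its group sum
        $i = sum[$1 FS $2, i]
    print
}' data.tsv data.tsv
This reads the file twice but stores only six sums per unique CHROM/POS pair.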
Here you go with a memory-efficient perl one-liner which should solve your problem. You may need to set the correct input field separator, e.g. -F'\t', and add a regex to skip comment lines. Note that, like any streaming approach, it assumes rows sharing a CHROM:POS pair are adjacent in the input.
perl -lane 'if(!$prev || $prev eq "$F[0]:$F[1]"){push @r,[@F[4..$#F]]; push @snp,join"\t",@F[0..3]}else{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp; @snp=(join"\t",@F[0..3]); @o=(); @r=([@F[4..$#F]])} $prev="$F[0]:$F[1]"; END{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp;}' < \
<(echo -e "chr1 10 A T 1 2 3\nchr1 10 A G 1 2 3\nchr1 11 A T 4 5 6\nchr2 12 G C 7 8 9")
formatted version with comments for you :)
if(!$prev || $prev eq "$F[0]:$F[1]"){   # CHROM:POS same as on the previous line
    push @r,[@F[4..$#F]];               # store values in an array of array references
    push @snp,join"\t",@F[0..3]         # store CHROM,POS,REF,ALT
}else{                                  # CHROM:POS is new
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1   # sum up values in the array references
    };
    print join"\t",($_,@o) for @snp;    # join CHROM,POS,REF,ALT with the summed values
    @snp=(join"\t",@F[0..3]);           # re-initialize
    @o=();
    @r=([@F[4..$#F]])
}
$prev="$F[0]:$F[1]";                    # remember the current CHROM:POS
END{                                    # print the final group
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1
    };
    print join"\t",($_,@o) for @snp;
}
I have a file with the following fields (and an example value to the right):
hg18.ensGene.bin 0
hg18.ensGene.name ENST00000371026
hg18.ensGene.chrom chr1
hg18.ensGene.strand -
hg18.ensGene.txStart 67051161
hg18.ensGene.txEnd 67163158
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2 ENSG00000152763
hg18.ensGene.exonFrames 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,
This is a shortened version of the file:
0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a
I need to sum the differences between the exon ends and starts, for example:
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference:
1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum):
3842
And I would like the output to have the following fields:
hg18.ensGene.name
hg18.ensGene.name2
hg18.ensGene.exonLenSum
such as this:
ENST00000371026 ENSG00000152763 3842
I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, say for an RPKM (Reads Per Kilobase of exon Model per million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
    name  = $2                    # ensGene.name (transcript ID)
    name2 = $9                    # ensGene.name2 (gene ID)
    s = $7                        # comma-separated exon starts
    e = $8                        # comma-separated exon ends
    sc = split(s, sa, ",")
    ec = split(e, ea, ",")
    if (sc != ec) {               # sanity check: counts must agree
        print "starts != ends ", name, name2, sc, ec
    }
    diffsum = 0
    for (i = 1; i <= sc; ++i) {   # sum the per-exon lengths
        diffsum += ea[i] - sa[i]
    }
    print name, name2, diffsum
}
Using the UCSC MySQL anonymous server:
mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0; for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);} printf("%s\t%s\t%d\n",$1,$2,size); }'
result:
ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)