Extracting certain locus from multiple samples from text file - bash

After profiling STR loci in a population, I have 122 output files, each containing about 800,000 unique loci.
Here are two examples of my files:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4
I've been trying to extract the variants at each of the 800,000 STR loci across all samples. For example, for chromosome 1 at position 230769616 I expect output like this:
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4
I tried this command:
awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf
It worked, but it takes a long time to create such a large number of files, one per locus.
I am struggling to find a more efficient solution.

You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).
Assuming you want to get a separate output file for every $2+$3 pair, you should be using the following with any awk:
tail -n +2 -q *.vcf | sort -k2,3 |
awk '
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur }
{ print > out }
'
If you want to have the header line present in every output file then tweak that to:
{ head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
awk '
NR==1 { hdr=$0; next }
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur; print hdr > out }
{ print > out }
'

My VCF file looks like this:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4
This is NOT a VCF file.
For a file like this, sort on CHROM and POS, compress with bgzip, index with tabix, and then query with tabix: http://www.htslib.org/doc/tabix.html

You can try processing everything in memory before printing it.
FNR > 1 {
    i = $2 "_" $3
    b[i, ++a[i]] = $0
}
END {
    for (i in a) {
        n = i ".locus.tsv"
        for (j = 1; j <= a[i]; ++j)
            print b[i, j] > n
        close(n)
    }
}
This may work depending on the size of your files and the amount of memory your machine has. Using another language that allows having a dynamic array as value can also be more efficient.
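As a rough illustration of that last point, here is a minimal Python sketch in which each locus maps to a growing list of lines (a dict of lists). The names group_by_locus and write_locus_files are made up for this example, and it assumes whitespace-separated columns with one header line per file, as in the question. Like the awk version, it keeps every data line in memory.

```python
from collections import defaultdict
from pathlib import Path


def group_by_locus(paths):
    """Return {(chrom, pos): [line, ...]} for every data line in the given files."""
    groups = defaultdict(list)
    for path in paths:
        with open(path) as handle:
            next(handle, None)  # skip the header line of each file (assumed present)
            for line in handle:
                fields = line.split()
                if len(fields) < 3:
                    continue  # ignore blank or truncated lines
                groups[(fields[1], fields[2])].append(line.rstrip("\n"))
    return groups


def write_locus_files(groups, outdir="."):
    """Write one <chrom>_<pos>.locus.tsv per locus; each file is opened only once."""
    for (chrom, pos), lines in groups.items():
        out = Path(outdir) / f"{chrom}_{pos}.locus.tsv"
        out.write_text("\n".join(lines) + "\n")
```

Calling write_locus_files(group_by_locus(list_of_vcf_paths)) would produce the same per-locus files as the awk script, provided the data fits in memory.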

Related

How to join two huge files based on the first two columns in awk/Bash programs?

Several existing threads explain how to merge two files using awk.
My problem is a bit more complicated because my files are huge: file1.tsv is 288 GB with 109 columns, and file2.tsv is 16 GB with 4 columns. I would like to join these files based on the first two columns:
file1.tsv (tab-separated) with 109 columns (here showing first 4 and last column):
CHROM POS REF ALT ... FILTER
chr1 10031 T C ... AC0;AS_VQSR
chr1 10037 T C ... AS_VQSR
chr1 10040 T A ... PASS
chr1 10043 T C ... AS_VQSR
chr1 10055 T C ... AS_VQSR
chr1 10057 A C ... AC0
file2.tsv (tab-separated) with 4 columns:
CHROM POS CHROM_hg19 POS_hg19
chr1 10031 chr1 10034
chr1 10037 chr1 10042
chr1 10043 chr1 10084
chr1 10055 chr1 10253
chr1 10057 chr1 10434
I wish to add the last two columns from file2.tsv to file1.tsv by matching on CHROM and POS, while keeping all non-matching rows from file1.tsv:
file3.txt
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR chr1 10084
chr1 10055 T C ... AS_VQSR chr1 10253
chr1 10057 A C ... AC0 chr1 10434
The difficulty is the size of these files. I tried the following:
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1.txt file2.txt
As soon as I hit Enter, I saw memory usage rocket and no results being produced. I am unsure whether this will eventually produce the correct results or how much memory it will need. Is there a better way to join my files using awk or other command-line tools available from Bash?
Thank you in advance.
With join, sed and bash (Process Substitution):
join -t $'\t' -a 1 <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt
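This solution assumes that, once the first two columns are fused into a single key such as chr1:10031, both files are sorted on that key in the order join expects (the order sort produces in the same locale); positions sorted numerically are not necessarily in that order. Rows of file1.tsv without a match are printed unchanged, without filler columns.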
See: man join
If all else fails you could brute-force it and read a line from file1 then read lines from file2 until you hit a match or higher number, then read the next line from file1, etc. The advantage to that approach is that very little is being stored in memory so it should work no matter how large your files are.
This isn't quite right but I don't have any more time to think about it so consider it a start and if anyone wants to finish it off and post the finished product as an answer, be my guest:
$ cat tst.awk
BEGIN {
    f1name = ARGV[1]
    f2name = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    while ( !done ) {
        if ( (f1stat = (getline line1 < f1name)) > 0 ) {
            split(line1,f1)
            f1key = f1[1] FS f1[2]
        }
        matched = 0
        while ( !eof && !matched ) {
            if ( (f2stat = (getline line2 < f2name)) > 0 ) {
                split(line2,f2)
                f2key = f2[1] FS f2[2]
                matched = (f1key == f2key)
            }
            else {
                eof = 1
            }
        }
        print line1, (matched ? f2[3] OFS f2[4] : "-" OFS "-")
        if ( (f1stat <= 0) && (f2stat <= 0) ) {
            done = 1
        }
    }
}
.
$ awk -f tst.awk file1.tsv file2.tsv
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR - -
chr1 10055 T C ... AS_VQSR - -
chr1 10057 A C ... AC0 - -
chr1 10057 A C ... AC0 - -
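For anyone who wants to finish the idea, one way the single-pass approach could be completed is sketched below in Python rather than awk. It is not the script above, and it rests on a strong assumption that is not stated in the question: every CHROM/POS key in file2.tsv also occurs in file1.tsv, and the keys appear in the same relative order in both files. Under that assumption no ordering comparison is needed, and only one line of each file is held in memory. The function name left_join is hypothetical.

```python
def left_join(lines1, lines2, sep="\t", filler="-"):
    """Yield each line of lines1 extended with columns 3-4 of its match in lines2.

    ASSUMPTION: the keys of lines2 are a subset of the keys of lines1 and
    appear in the same order, so lines2 only ever needs to move forward.
    """
    it2 = iter(lines2)

    def advance():
        # Read the next line of the second input; return (key, extra columns).
        line2 = next(it2, None)
        if line2 is None:
            return None, None
        row2 = line2.rstrip("\n").split(sep)
        return row2[:2], row2[2:4]

    key2, extra2 = advance()
    for line1 in lines1:
        row1 = line1.rstrip("\n").split(sep)
        if key2 is not None and row1[:2] == key2:
            # Match: append the two extra columns and move on in the second input.
            yield sep.join(row1 + extra2)
            key2, extra2 = advance()
        else:
            # No match for this row: keep it and pad with filler columns.
            yield sep.join(row1 + [filler, filler])
```

Passing two open file objects and writing each yielded line to file3.txt streams the join. The header lines are joined like any other row because both start with CHROM and POS. If file2.tsv contains a key that file1.tsv lacks, every later row would be padded with fillers, so the assumption is worth checking first.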

Using specific columns, output rows that are present 3 times in a text file

I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears three times I want to add two columns for the other two samples that it appears in so the output from above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R but the file is too large to read in so I am looking for a solution that would work in linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
edit: Thanks for all these informative answers. I selected the one that was most familiar to how I am used to working but the other answers look great too and I will learn from them.
Using GNU datamash, tr and awk assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
First, use datamash to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1,sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr to convert the commas to tabs and finally use awk to print the rows with seven fields.
Using awk:
awk '
BEGIN{ FS=OFS="\t" }
{
    idx=$1 FS $2 FS $3 FS $4
    cnt[idx]++
    data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
}
END{
    for (i in cnt)
        if (cnt[i]==3) print i, data[i]
}
' file
Maintain two arrays using the first four fields as index.
The first increments a counter whenever a record with the same index is encountered and the second appends the 5th field using a tab as separator.
In the END block, loop over the cnt array and print the index and the value of the data array if the count is three.
For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
Then:
$ ./demo.sh input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = (cnt++ ? samples OFS : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev, samples }
.
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort uses paging etc. to handle input that's too large to fit in memory so it will successfully handle larger input than other tools can handle and the awk script is storing almost nothing in memory.

How to replace the value of multiple columns in a file based on two columns in another file with bash?

I'm trying to replace the values of multiple columns in a file using awk. I want to use awk because the file is very large and cannot be loaded into memory; I have already tried with pandas (Python).
I have a large database as a text file. Here is an example of its contents (tab-delimited):
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 7 1 1 2 5 7
chr1 10 T A 7 1 1 3 0 1
chr1 10 T G 7 2 1 1 8 2
chr1 11 None None 2 0 0 0 5 4
chr1 11 G T 2 1 0 0 2 3
For rows that share the same first two columns (CHROM, POS), I have to sum the values of the columns whose header contains '_00'.
So the expected output is:
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 21 4 3 6 13 10
chr1 10 T A 21 4 3 6 13 10
chr1 10 T G 21 4 3 6 13 10
chr1 11 None None 4 1 0 0 7 7
chr1 11 G T 4 1 0 0 7 7
I don't know how to do this in one step because I'm very new to programming, so I first compute the sums with this awk code:
awk -F'\t' 'FNR==1{next}
{
    keys[$1"\t"$2]
    for (i=5;i<=10;i++)
        sum[$1"\t"$2, i] += $i
}
END {
    for (key in keys) {
        printf "%s", key
        for (i=5;i<=10;i++) printf "%s%s", "\t", sum[key,i]
        printf "\n"
    }
}' OFS='\t' out.txt
Running this code on the first text file (named out.txt here), I get:
chr1 10 21 4 3 6 13 10
chr1 11 4 1 0 0 7 7
Now I'm trying to replace the six values in every chr1 10 row with the first row of sums, and the six values in every chr1 11 row with the second row of sums.
I have managed to replace the value in one column with this code:
awk -F"\t" 'NR==FNR{h[$1"\t"$2]=$3;next}
{
printf $1"\t"$2"\t"$3"\t"$4"\t"h[$1"\t"$2]"\t";
for (i=6;i<=NF;i++)
{printf "%s",$i "\t"};
printf "\n"
}' OFS="\t" file1 file2
but I need to do the same for all the columns.
How can I do it using similar code?
Note: I have more columns that don't have '_00' in the header name.
Here is a memory-efficient Perl one-liner that should solve your problem. You may need to add the correct input field separator (e.g. -F'\t') and a regex to skip comment lines.
perl -lane 'if(!$prev || $prev eq "$F[0]:$F[1]"){push @r,[@F[4..$#F]]; push @snp,join"\t",@F[0..3]}else{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp; @snp=(join"\t",@F[0..3]); @o=(); @r=([@F[4..$#F]])} $prev="$F[0]:$F[1]"; END{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp;}' < \
<(echo -e "chr1 10 A T 1 2 3\nchr1 10 A G 1 2 3\nchr1 11 A T 4 5 6\nchr2 12 G C 7 8 9")
formatted version with comments for you :)
if(!$prev || $prev eq "$F[0]:$F[1]"){          # CHROM:POS compare to previous line
    push @r,[@F[4..$#F]];                      # store values in array of array reference
    push @snp,join"\t",@F[0..3]                # store CHROM,POS,REF,ALT
}else{
    for $r (@r){                               # CHROM:POS is new
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1   # sum up values in array references
    };
    print join"\t",($_,@o) for @snp;           # join CHROM,POS,REF,ALT with summed values
    @snp=(join"\t",@F[0..3]);                  # re-initialize
    @o=();
    @r=([@F[4..$#F]])
}
$prev="$F[0]:$F[1]";                           # store CHROM:POS info
END{                                           # print final lines
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1
    };
    print join"\t",($_,@o) for @snp;
}
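The same streaming idea (hold only the rows of the current CHROM/POS group, then emit them with the sums) can be sketched in Python. This is an illustrative alternative, not the answer's code. It assumes that rows sharing a CHROM/POS pair are adjacent in the file, that the first line is a header, and that the '_00' columns hold integers; the function name sum_groups is hypothetical. Columns without '_00' in their header are passed through unchanged.

```python
from itertools import groupby


def sum_groups(lines, sep="\t", tag="_00"):
    """Yield the header, then every row with its tagged columns replaced by
    the sum over all adjacent rows that share the same CHROM and POS."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input: nothing to yield
    header = header.rstrip("\n")
    yield header
    # Indexes of the columns whose header name contains the tag.
    cols = [i for i, name in enumerate(header.split(sep)) if tag in name]
    rows = (line.rstrip("\n").split(sep) for line in lines)
    for _, group in groupby(rows, key=lambda row: (row[0], row[1])):
        group = list(group)  # only one CHROM/POS group is in memory at a time
        totals = {i: sum(int(row[i]) for row in group) for i in cols}
        for row in group:
            yield sep.join(
                str(totals[i]) if i in totals else value
                for i, value in enumerate(row)
            )
```

Because groupby only merges adjacent rows, an unsorted file would need to be sorted on the first two columns beforehand.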

Sorting on a column alphanumerically

I have the following file and I want to sort it alphanumerically on the 6th column so that, within each ID (the part before the ':'), E1 is followed by I1, then E2, and so on. When I run sort -V -k6 file, all the ID:I entries end up at the end instead of where they belong. When I run sort -k6, the Es and Is of each ID are kept together, but rows from a different ID are interspersed (I have highlighted them here). How can I sort so that no two IDs are mixed and the column is in the intended order?
chr1 259017 259121 104 - ENSG00000228463:E2
chr1 259122 267095 7973 - ENSG00000228463:I1
chr1 267096 267253 157 - ENSG00000228463:E1
chr1 317720 317781 61 + ENSG00000237094:E1
chr1 317782 320161 2379 + ENSG00000237094:I1
chr1 320162 320653 491 + ENSG00000237094:E2
chr1 320654 320880 226 + ENSG00000237094:I2
chr1 320881 320938 57 + ENSG00000237094:E3
chr1 320939 321031 92 + ENSG00000237094:I3
chr1 321032 321290 258 + ENSG00000237094:E4
chr1 321291 322037 746 + ENSG00000237094:I4
chr1 322038 322228 190 + ENSG00000237094:E5
chr1 322229 322671 442 + ENSG00000237094:I5
chr1 322672 323073 401 + ENSG00000237094:E6
chr1 323074 323860 786 + ENSG00000237094:I6
chr1 323861 324060 199 + ENSG00000237094:E7
chr1 324061 324287 226 + ENSG00000237094:I7
chr1 324288 324345 57 + ENSG00000237094:E8
chr1 324346 324438 92 + ENSG00000237094:I8
chr1 324439 326514 2075 + ENSG00000237094:E9
**chr1 326096 326569 473 + ENSG00000250575:E1**
chr1 326515 327551 1036 + ENSG00000237094:I9
**chr1 326570 327347 777 + ENSG00000250575:I1**
**chr1 327348 328112 764 + ENSG00000250575:E2**
chr1 327552 328453 901 + ENSG00000237094:E10
chr1 328454 329783 1329 + ENSG00000237094:I10
**chr1 329431 329620 189 - ENSG00000233653:E2**
**chr1 329621 329949 328 - ENSG00000233653:I1**
chr1 329784 329976 192 + ENSG00000237094:E11
Original answer:
sed 's/:[EI]/&_ /' foo.txt | #separate the number at the end with a space
sort -k6 | sort -n -k7 | #sort by code, then by [EI] number
sed 's/_ //' #remove the underscore space
I like to do things like this by 'protecting' strings with a placeholder to isolate what I'm interested in, then replacing them later.
Closer:
sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'
But this naively assumes that sort works in a very specific way that it doesn't... so sometimes E2 will come before E1...
I'm not sure it can be done with sort alone, awk might be the way to go...
So I came back to this question and wrote some python code that actually accomplishes the task:
#!/usr/bin/env python3
import sys
import re
from collections import defaultdict

# loop through args
for thisarg in sys.argv[1:]:
    # initialize a default dict
    bysign = defaultdict(list)
    # read the file
    try:
        with open(thisarg) as thisfile:
            for line in thisfile:
                # split each line on whitespace and colon
                dat = re.split(r'[\s:]+', line.strip())
                # append line to dictionary indexed by ENSG code
                bysign[dat[-2]].append(line.strip())
    except IOError:
        print("no such file {}".format(thisarg))
        continue
    # loop through the ENSG codes in sorted order
    for key in sorted(bysign):
        # initialize another, smaller dictionary
        bytuple = dict()
        # loop through all the lines that have the same ENSG code
        for line in bysign[key]:
            # extract the E/I code
            ei = line.split(':')[-1]
            # convert the E/I code to an (int, char) tuple; putting the
            # number first gives E1, I1, E2, I2, ... as the question asks
            letter = ei[0]
            number = int(ei[1:])
            # use that tuple to index the smaller dict
            bytuple[(number, letter)] = line
        # print the lines in sorted key order
        for k in sorted(bytuple):
            print(bytuple[k])
I hope you already figured it out by now. Curious if anyone cares enough to improve my python.
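As one possible tightening, the whole ordering can be expressed as a single sort key. The sketch below assumes every line ends with an ID of the form ENSG...:E<n> or ENSG...:I<n>, as in the question; the names sort_key and sort_features are made up for this example.

```python
import re

# Trailing "<ID>:<E or I><number>" at the end of a line.
_FEATURE = re.compile(r"(\S+):([EI])(\d+)\s*$")


def sort_key(line):
    """Return (ID, number, letter) so that E1 < I1 < E2 < I2 ... within an ID."""
    match = _FEATURE.search(line)
    if match is None:
        raise ValueError("no ID:E/I code at end of line: {!r}".format(line))
    gene, letter, number = match.groups()
    return (gene, int(number), letter)


def sort_features(lines):
    """Return the lines grouped by ID, each group ordered E1, I1, E2, I2, ..."""
    return sorted(lines, key=sort_key)
```

The number is compared as an integer, so E10 sorts after E9, and because "E" sorts before "I", each exon precedes the intron with the same number.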

Using awk create two arrays from two column values, find difference and sum differences, and output data

I have a file with the following fields (and an example value to the right):
hg18.ensGene.bin 0
hg18.ensGene.name ENST00000371026
hg18.ensGene.chrom chr1
hg18.ensGene.strand -
hg18.ensGene.txStart 67051161
hg18.ensGene.txEnd 67163158
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
hg18.ensGene.name2 ENSG00000152763
hg18.ensGene.exonFrames 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0,
This is a shortened version of the file:
0 ENST00000371026 chr1 - 67051161 67163158 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,2,0,0,1,2,0,0,1,1,1,2,1,2,0,2,0, uc009waw.1,uc009wax.1,uc001dcx.1,
0 ENST00000371023 chr1 - 67075869 67163055 67075869,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163055, ENSG00000152763 0,1,1,1,2,1,2,0,2,0, uc001dcy.1
0 ENST00000395250 chr1 - 67075991 67163158 67075991,67076022,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932, 67076018,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158, ENSG00000152763 0,0,1,1,1,2,0,-1,-1,-1,-1, n/a
I need to sum the difference of the exon starts and ends for example:
hg18.ensGene.exonStarts 67051161,67060631,67065090,67066082,67071855,67072261,67073896,67075980,67078739,67085754,67100417,67109640,67113051,67129424,67131499,67143471,67162932,
hg18.ensGene.exonEnds 67052451,67060788,67065317,67066181,67071977,67072419,67074048,67076067,67078942,67085949,67100573,67109780,67113208,67129537,67131684,67143646,67163158,
difference:
1290,157,227,99,122,158,152,87,203,195,156,140,157,113,185,175,226
sum (hg18.ensGene.exonLenSum):
3842
And I would like the output to have the following fields:
hg18.ensGene.name
hg18.ensGene.name2
hg18.ensGene.exonLenSum
such as this:
ENST00000371026 ENSG00000152763 3842
I would like to do this with one awk script for all lines in the input file. How can I do this? This is useful for calculating exon lengths, for example for an RPKM (Reads Per Kilobase of exon model per Million mapped reads) calculation.
so ross$ awk -f gene.awk gene.dat
ENST00000371026 ENSG00000152763 3842
ENST00000371023 ENSG00000152763 1645
ENST00000395250 ENSG00000152763 1622
so ross$ cat gene.awk
/./ {
    name = $2
    name2 = $9
    s = $7
    e = $8
    sc = split(s, sa, ",")
    ec = split(e, ea, ",")
    if (sc != ec) {
        print "starts != ends ", name, name2, sc, ec
    }
    diffsum = 0
    for(i = 1; i <= sc; ++i) {
        diffsum += ea[i] - sa[i]
    }
    print name, name2, diffsum
}
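As a cross-check of the arithmetic, the same per-transcript calculation can be sketched in Python. The function name exon_length_sum is hypothetical; it takes the comma-separated exonStarts and exonEnds strings exactly as they appear in the file, trailing comma included.

```python
def exon_length_sum(starts, ends):
    """Sum (end - start) over all exons given comma-separated coordinate lists."""
    # Drop the empty item produced by the trailing comma.
    start_list = [int(x) for x in starts.split(",") if x]
    end_list = [int(x) for x in ends.split(",") if x]
    if len(start_list) != len(end_list):
        raise ValueError("number of exon starts and ends differ")
    return sum(end - start for start, end in zip(start_list, end_list))
```

For the first transcript in the question (ENST00000371026) this gives 3842, matching the worked sum and the awk output above.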
Using the UCSC anonymous MySQL server:
mysql -N -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 -e 'select name,name2,exonStarts,exonEnds from ensGene' |\
awk -F ' ' '{n=split($3,a1,"[,]"); split($4,a2,"[,]"); size=0; for(i=1;i<=n;++i) {size+=int(a2[i]-a1[i]);} printf("%s\t%s\t%d\n",$1,$2,size); }'
result:
ENST00000404059 ENSG00000219789 632
ENST00000326632 ENSG00000146556 1583
ENST00000408384 ENSG00000221311 138
ENST00000409575 ENSG00000222003 1187
ENST00000409981 ENSG00000222027 1187
ENST00000359752 ENSG00000197490 126
ENST00000379479 ENSG00000205292 873
ENST00000326183 ENSG00000177693 918
ENST00000407826 ENSG00000219467 2820
ENST00000405199 ENSG00000220902 1231
(...)

Resources