Using specific columns, output rows that are present 3 times in a text file - sorting

I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears three times I want to add two columns for the other two samples that it appears in so the output from above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R but the file is too large to read in so I am looking for a solution that would work in linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
edit: Thanks for all these informative answers. I selected the one that was most familiar to how I am used to working but the other answers look great too and I will learn from them.

Using GNU datamash, tr and awk assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr3 4 T A sample1 sample2 sample3
First, use datamash to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 G C sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr to convert the commas to tabs and finally use awk to print the rows with seven fields.
Using awk:
awk '
BEGIN{ FS=OFS="\t" }
{
idx=$1 FS $2 FS $3 FS $4
cnt[idx]++
data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
}
END{
for (i in cnt)
if (cnt[i]==3) print i, data[i]
}
' file
Maintain two arrays using the first four fields as index.
The first increments a counter whenever a record with the same index is encountered and the second appends the 5th field using a tab as separator.
In the end block, loop over the cnt array and print the index and the value of the data array if the count is three.

For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
Then:
$ ./demo.sh input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3

This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
prt()
cnt = samples = ""
prev = curr
}
{ samples = (cnt++ ? samples " " : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev samples }
.
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort uses paging etc. to handle input that's too large to fit in memory so it will successfully handle larger input than other tools can handle and the awk script is storing almost nothing in memory.

Related

Extracting certain locus from multiple samples from text file

After profiling STR locus in a population, the output gave me 122 files each of which contains about unique 800,000 locus.
There are 2 examples of my files:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4
I've been trying to extract variants for each of 800,000 STR locus. I expect the output should be like this for chromosome 1 at position of 230769616:
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4
I tried this command:
awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf
It worked but it take lots of time to create large number of files for each locus.
I am struggling to find an optimal solution to solve this.
You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).
Assuming you want to get a separate output file for every $2+$3 pair, you should be using the following with any awk:
tail -n +2 -q *.vcf | sort -k2,3 |
awk '
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur }
{ print > out }
'
If you want to have the header line present in every output file then tweak that to:
{ head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
awk '
NR==1 { hdr=$0; next }
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur; print hdr > out }
{ print > out }
'
My VCF file look like this:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4
this is NOT a vcf file
for such file, sort on chrom,pos, compress with bgzip and index with tabix and query with tabix. http://www.htslib.org/doc/tabix.html
You can try processing everything in memory before printing them.
FNR > 1 {
i = $2 "_" $3
b[i, ++a[i]] = $0
}
END {
for (i in a) {
n = i ".locus.tsv"
for (j = 1; j <= a[i]; ++j)
print b[i, j] > n
close(n)
}
}
This may work depending on the size of your files and the amount of memory your machine has. Using another language that allows having a dynamic array as value can also be more efficient.

Getting sum of values in a particular column with some conditions

I have a tab delimited file like this:
chr1 104517 105076 abc 148
chr1 127781 128051 def 89
chr1 186884 186981 xyz 97
chr1 127781 128051 def 55
chr1 890934 891105 abc 50
chr1 104517 105076 abc 24
chr1 890934 891105 xyz 19
First, for every values in column 4 I wanted sum of the values in column 5. Like
abc 222
def 144
xyz 116
I did it with this code:
awk -F'\t' '{ SUM[$4] += $5 } END { for (j in SUM) print j, SUM[j] }' filename
Now I want to do this separately for every unique combination of first three columns. For example, in case of above input file, I want this output:
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Can someone please tell me the way to do this in bash script?
Thank you
I'd turn to perl instead of awk for its better support for complex data structures:
$ perl -M5.020 -lane '
our $data;
$data->{$F[0]}{$F[1]}{$F[2]}{$F[3]} += $F[4];
END {
for my $c1 (sort keys %$data) {
for my $c2 (sort { $a <=> $b } keys %{$data->{$c1}}) {
for my $c3 (sort { $a <=> $b } keys %{$data->{$c1}{$c2}}) {
my $rest = $data->{$c1}{$c2}{$c3};
print join("\t", $c1, $c2, $c3, %$rest{sort keys %$rest});
}
}
}
}' input.tsv
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Basically, builds a 4-dimensional hash table using the first four columns of each line as keys, with the sum of the fifth column as the final value. Then walks the levels of the table in sorted order and prints the result.

How to join two huge files based on the first two columns in awk/Bash programs?

There are multiple threads explaining here and here on how to perform merging between two files using awk for example.
My problem is a bit more complicated since my files are very huge. file1.tsv is 288gb and 109 columns and file2.tsv is 16gb with 4 columns. I would like to join these files based on the first two columns:
file1.tsv (tab-separated) with 109 columns (here showing first 4 and last column):
CHROM POS REF ALT ... FILTER
chr1 10031 T C ... AC0;AS_VQSR
chr1 10037 T C ... AS_VQSR
chr1 10040 T A ... PASS
chr1 10043 T C ... AS_VQSR
chr1 10055 T C ... AS_VQSR
chr1 10057 A C ... AC0
file2.tsv (tab-separated) with 4 columns:
CHROM POS CHROM_hg19 POS_hg19
chr1 10031 chr1 10034
chr1 10037 chr1 10042
chr1 10043 chr1 10084
chr1 10055 chr1 10253
chr1 10057 chr1 10434
I wish to add the two last columns from file2.tsv to file1.tsv by matching on CHROM and POS while keeping all non-matching rows from file1.txt:
file3.txt
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR chr1 10084
chr1 10055 T C ... AS_VQSR chr1 10253
chr1 10057 A C ... AC0 chr1 10434
But as you have figured, these files are big. I tried the following:
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1.txt file2.txt
And as soon as I hit enter, I saw my memory rocketing and no results being produced. I am unsure if this will produce the correct results at the end or how much memory it will use. Is there a better way to join my files in any methods using awk or any Bash programs?
Thank you in advance.
With join, sed and bash (Process Substitution):
join -t $'\t' -a 1 <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt
This solution assumes that the first two columns are sorted together in ascending order in both files.
See: man join
If all else fails you could brute-force it and read a line from file1 then read lines from file2 until you hit a match or higher number, then read the next line from file1, etc. The advantage to that approach is that very little is being stored in memory so it should work no matter how large your files are.
This isn't quite right but I don't have any more time to think about it so consider it a start and if anyone wants to finish it off and post the finished product as an answer, be my guest:
$ cat tst.awk
BEGIN {
f1name = ARGV[1]
f2name = ARGV[2]
ARGV[1] = ARGV[2] = ""
while ( !done ) {
if ( (f1stat = (getline line1 < f1name)) > 0 ) {
split(line1,f1)
f1key = f1[1] FS f1[2]
}
matched = 0
while ( !eof && !matched ) {
if ( (f2stat = (getline line2 < f2name)) > 0 ) {
split(line2,f2)
f2key = f2[1] FS f2[2]
matched = (f1key == f2key)
}
else {
eof = 1
}
}
print line1, (matched ? f2[3] OFS f2[4] : "-" OFS "-")
if ( (f1stat <= 0) && (f2stat <= 0) ) {
done = 1
}
}
}
.
$ awk -f tst.awk file1.tsv file2.tsv
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR - -
chr1 10055 T C ... AS_VQSR - -
chr1 10057 A C ... AC0 - -
chr1 10057 A C ... AC0 - -

How to replace the value of multiple columns in a file based on two columns in another file with bash?

I'm trying to replace the value of multiple columns in a file using awk. The reason to use awk is that the file is very large and cant do it loading it in memory. I've tried to do with pandas (python).
I have a large database as a textfile. I put here a example of the info in the file (tab-delimited):
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 7 1 1 2 5 7
chr1 10 T A 7 1 1 3 0 1
chr1 10 T G 7 2 1 1 8 2
chr1 11 None None 2 0 0 0 5 4
chr1 11 G T 2 1 0 0 2 3
If the first two columns (CHROM,POS) are the same in the rows, I have to sum the values of the columns that contain '_00' in the header.
So, the expected output, is:
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 21 4 3 6 13 10
chr1 10 T A 21 4 3 6 13 10
chr1 10 T G 21 4 3 6 13 10
chr1 11 None None 4 1 0 0 7 7
chr1 11 G T 4 1 0 0 7 7
I dont know how to do this, because I'm very new in programing, so, I have to do the following with this awk code.
awk -F'\t' 'FNR==1{next};
{keys[$1"\t"$2]
for (i=5;i<=10;i++)
{sum[$1"\t"$2, i] += $i}
}END {for (key in keys) { printf "%s", key
for (i=5;i<=10;i++) {printf "%s%s", "\t", sum[key,i]} printf "\n"}} OFS='\t' out.txt
With this code, and using as 'out.txt' the first textfile, I get:
chr1 10 21 4 3 6 13 10
chr1 11 4 1 0 0 7 7
Now, I'm trying to replace, in the rows with chr1 10, the 6 values in the first row, and in the rows with chr1 11, the 6 values in the second row.
I have accomplished to change the value in one column with the this code:
awk -F"\t" 'NR==FNR{h[$1"\t"$2]=$3;next}
{
printf $1"\t"$2"\t"$3"\t"$4"\t"h[$1"\t"$2]"\t";
for (i=6;i<=NF;i++)
{printf "%s",$i "\t"};
printf "\n"
}' OFS="\t" file1 file2
but need to do the same for all the columns.
How can I do it using a similar code?
Note: I have more columns that doesn't have '_00' in the header name
here you go with a memory efficient perl on-liner which should solve your problem. You may need to add the correct input filed separator e.g. -F'\t' and a regex to skip comment lines.
perl -lane 'if(!$prev || $prev eq "$F[0]:$F[1]"){push #r,[#F[4..$#F]]; push #snp,join"\t",#F[0..3]}else{for $r (#r){$o[$_]+=$$r[$_] for 0..scalar(#$r)-1}; print join"\t",($_,#o) for #snp; #snp=(join"\t",#F[0..3]); #o=(); #r=([#F[4..$#F]])} $prev="$F[0]:$F[1]"; END{for $r (#r){$o[$_]+=$$r[$_] for 0..scalar(#$r)-1}; print join"\t",($_,#o) for #snp;}' < \
<(echo -e "chr1 10 A T 1 2 3\nchr1 10 A G 1 2 3\nchr1 11 A T 4 5 6\nchr2 12 G C 7 8 9")
formatted version with comments for you :)
if(!$prev || $prev eq "$F[0]:$F[1]"){ # CHROM:POS compare to previous line
push #r,[#F[4..$#F]]; # store values in array of array reference
push #snp,join"\t",#F[0..3] # store CHROM,POS,REF,ALT
}else{
for $r (#r){ # CHROM:POS is new
$o[$_]+=$$r[$_] for 0..scalar(#$r)-1 # sum up values in array references
};
print join"\t",($_,#o) for #snp; # join CHROM,POS,REF,ALT with summed values
#snp=(join"\t",#F[0..3]); # re-initialize
#o=();
#r=([#F[4..$#F]])
}
$prev="$F[0]:$F[1]"; # store CHROM:POS info
END{ # print final lines
for $r (#r){
$o[$_]+=$$r[$_] for 0..scalar(#$r)-1
};
print join"\t",($_,#o) for #snp;
}

Compare three files using two columns and print unique entries in each file using awk/sed [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I have three files with following format:
$ cat a.bed
chr1 6 6 aa
chr1 8 8 bb
chr2 22 22 aa
chr3 24 24 bb
$ cat b.bed
chr1 12 12 cc
chr1 6 6 dd
chr5 14 14 cc
$ cat c.bed
chr1 8 8 ss
chr4 11 11 dd
chr1 6 6 aa
I want to compare these files using first two columns and print information for each row whether it is present in one file or multiple files, like:
chr1 6 6 aa 3 a.bed,b.bed,c.bed
chr1 8 8 bb 2 a.bed,c.bed
chr2 22 22 aa 1 a.bed
chr3 24 24 bb 1 a.bed
chr1 12 12 cc 1 b.bed
chr5 14 14 cc 1 b.bed
chr4 11 11 dd 1 c.bed
where 5th column gives number of of files it is present in and 6th column gives name of the files.
awk to the rescue!
$ awk '{a[$1,$2]=(($1,$2) in a?a[$1,$2]",":$0 OFS)FILENAME}
END{for(k in a) print a[k]}' {a,b,c}.bed
results won't be in the same order though.
Explanation
x=c?a:b is the ternary operator, sets x to a or b based on value of c (similar to if-then-else). Here we assign the value of map for key ($1,$2) either by appending FILENAME (if already exists) or setting to the current line (again by appending FILENAME). In the END block, just iterates over this map, and prints the values.
Try these four lines of gawk (doesn't appear to work in awk):
gawk '{print $0, FILENAME}' a.bed > abc.bed
gawk '{print $0, FILENAME}' b.bed >> abc.bed
gawk '{print $0, FILENAME}' c.bed >> abc.bed
gawk '{f = $5;k=$1 " " $2 " " $3 " " $4;if(k in a){a[k] = a[k] "," f}else{a[k] = f};c[k]++};END{for(k in a){print k, c[k], a[k]}}' abc.bed
Single char variables for brevity:
f - file name,
k - key, i.e. the data,
a - an array of keys,
c - an array of key counts.
Er, if I am reading it right, your input and output data samples don't match, e.g. there are only 2 'chr1 6 6 aa' not 3.

Resources