Compare three files using two columns and print unique entries in each file using awk/sed [closed] - bash

I have three files with the following format:
$ cat a.bed
chr1 6 6 aa
chr1 8 8 bb
chr2 22 22 aa
chr3 24 24 bb
$ cat b.bed
chr1 12 12 cc
chr1 6 6 dd
chr5 14 14 cc
$ cat c.bed
chr1 8 8 ss
chr4 11 11 dd
chr1 6 6 aa
I want to compare these files on the first two columns and, for each row, report whether it is present in one file or in several, like:
chr1 6 6 aa 3 a.bed,b.bed,c.bed
chr1 8 8 bb 2 a.bed,c.bed
chr2 22 22 aa 1 a.bed
chr3 24 24 bb 1 a.bed
chr1 12 12 cc 1 b.bed
chr5 14 14 cc 1 b.bed
chr4 11 11 dd 1 c.bed
where the 5th column gives the number of files the row is present in and the 6th column gives the names of those files.

awk to the rescue!
$ awk '{a[$1,$2]=(($1,$2) in a?a[$1,$2]",":$0 OFS)FILENAME}
END{for(k in a) print a[k]}' {a,b,c}.bed
The results won't be in the same order, though, because for (k in a) visits the keys in an unspecified order.
x=c?a:b is the ternary operator: it sets x to a or b depending on the value of c (similar to if-then-else). Here the map entry for key ($1,$2) is set either to its existing value plus a comma and FILENAME (if the key already exists) or to the current line plus OFS and FILENAME (if the key is new). The END block just iterates over the map and prints the values.

Try these four lines of gawk (doesn't appear to work in awk):
gawk '{print $0, FILENAME}' a.bed > abc.bed
gawk '{print $0, FILENAME}' b.bed >> abc.bed
gawk '{print $0, FILENAME}' c.bed >> abc.bed
gawk '{f = $5;k=$1 " " $2 " " $3 " " $4;if(k in a){a[k] = a[k] "," f}else{a[k] = f};c[k]++};END{for(k in a){print k, c[k], a[k]}}' abc.bed
Single char variables for brevity:
f - file name,
k - key, i.e. the data,
a - an array mapping each key to its comma-separated file names,
c - an array of key counts.
Er, if I am reading it right, your input and output data samples don't match, e.g. there are only 2 'chr1 6 6 aa' not 3.


Extracting certain locus from multiple samples from text file

After profiling STR loci in a population, I have 122 output files, each containing about 800,000 unique loci.
Here are two examples of my files:
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4
I've been trying to extract the variants for each of the 800,000 STR loci. For chromosome 1 at position 230769616, I expect output like this:
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4
I tried this command:
awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf
It worked, but it takes a lot of time to create such a large number of files, one per locus.
I am struggling to find a faster solution.
You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).
Assuming you want to get a separate output file for every $2+$3 pair, you should be using the following with any awk:
tail -n +2 -q *.vcf | sort -k2,3 |
awk '
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur }
{ print > out }
'
If you want to have the header line present in every output file then tweak that to:
{ head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
awk '
NR==1 { hdr=$0; next }
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur; print hdr > out }
{ print > out }
'
My VCF file looks like this:
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4
This is NOT a VCF file.
For a file like this, sort on chrom,pos, compress with bgzip, index with tabix, and then query with tabix.
You can try processing everything in memory before printing:
awk '
FNR > 1 {
    i = $2 "_" $3
    b[i, ++a[i]] = $0
}
END {
    for (i in a) {
        n = i ".locus.tsv"
        for (j = 1; j <= a[i]; ++j)
            print b[i, j] > n
        close(n)
    }
}' *.vcf
This may work depending on the size of your files and the amount of memory your machine has. Using another language that allows having a dynamic array as value can also be more efficient.

Using specific columns, output rows that are present 3 times in a text file

I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears three times I want to add two columns for the other two samples that it appears in so the output from above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R but the file is too large to read in so I am looking for a solution that would work in linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
Using GNU datamash, tr and awk assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
First, use datamash to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1,sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr to convert the commas to tabs and finally use awk to print the rows with seven fields.
Using awk:
awk '
BEGIN{ FS=OFS="\t" }
{
  idx=$1 FS $2 FS $3 FS $4
  cnt[idx]++
  data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
}
END{
  for (i in cnt)
    if (cnt[i]==3) print i, data[i]
}
' file
Maintain two arrays using the first four fields as index.
The first increments a counter whenever a record with the same index is encountered and the second appends the 5th field using a tab as separator.
In the END block, loop over the cnt array and print the index and the value of the data array if the count is three.
For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
$ ./ input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = (cnt++ ? samples OFS : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev, samples }
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort uses paging etc. to handle input that's too large to fit in memory so it will successfully handle larger input than other tools can handle and the awk script is storing almost nothing in memory.

How to replace the value of multiple columns in a file based on two columns in another file with bash?

I'm trying to replace the values of multiple columns in a file using awk. The reason for using awk is that the file is very large and can't be loaded into memory; I've already tried with pandas (Python).
I have a large database as a text file. Here is an example of the data in the file (tab-delimited):
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 7 1 1 2 5 7
chr1 10 T A 7 1 1 3 0 1
chr1 10 T G 7 2 1 1 8 2
chr1 11 None None 2 0 0 0 5 4
chr1 11 G T 2 1 0 0 2 3
If the first two columns (CHROM, POS) are the same across rows, I have to sum the values of the columns whose header contains '_00' and write the sums into each of those rows.
So the expected output is:
CHROM POS REF ALT GT_00 d_GT_00 c_GT_00 de_GT_00 can_GT_00 epi_GT_00
chr1 10 T A 21 4 3 6 13 10
chr1 10 T A 21 4 3 6 13 10
chr1 10 T G 21 4 3 6 13 10
chr1 11 None None 4 1 0 0 7 7
chr1 11 G T 4 1 0 0 7 7
I don't know how to do this in one step because I'm very new to programming, so I first computed the sums with this awk code:
awk -F'\t' 'FNR==1{next}
{
keys[$1"\t"$2]
for (i=5;i<=10;i++)
{sum[$1"\t"$2, i] += $i}
}END {for (key in keys) { printf "%s", key
for (i=5;i<=10;i++) {printf "%s%s", "\t", sum[key,i]} printf "\n"}}' OFS='\t' out.txt
With this code, using the first text file as 'out.txt', I get:
chr1 10 21 4 3 6 13 10
chr1 11 4 1 0 0 7 7
Now I'm trying to replace the 6 values in the rows with chr1 10 by the values in the first row of that result, and the 6 values in the rows with chr1 11 by the values in the second row.
I have managed to change the value in one column with this code:
awk -F"\t" 'NR==FNR{h[$1"\t"$2]=$3;next}
{
printf "%s\t%s\t%s\t%s\t%s\t", $1, $2, $3, $4, h[$1"\t"$2];
for (i=6;i<=NF;i++)
{printf "%s",$i "\t"};
printf "\n"
}' OFS="\t" file1 file2
but I need to do the same for all the columns.
How can I do it using similar code?
Note: I have more columns that don't have '_00' in the header name.
Here you go with a memory-efficient perl one-liner which should solve your problem. You may need to add the correct input field separator, e.g. -F'\t', and a regex to skip comment lines.
perl -lane 'if(!$prev || $prev eq "$F[0]:$F[1]"){push @r,[@F[4..$#F]]; push @snp,join"\t",@F[0..3]}else{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp; @snp=(join"\t",@F[0..3]); @o=(); @r=([@F[4..$#F]])} $prev="$F[0]:$F[1]"; END{for $r (@r){$o[$_]+=$$r[$_] for 0..scalar(@$r)-1}; print join"\t",($_,@o) for @snp;}' < \
<(echo -e "chr1 10 A T 1 2 3\nchr1 10 A G 1 2 3\nchr1 11 A T 4 5 6\nchr2 12 G C 7 8 9")
formatted version with comments for you :)
if(!$prev || $prev eq "$F[0]:$F[1]"){ # CHROM:POS compare to previous line
    push @r,[@F[4..$#F]]; # store values in array of array reference
    push @snp,join"\t",@F[0..3] # store CHROM,POS,REF,ALT
}else{ # CHROM:POS is new
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1 # sum up values in array references
    };
    print join"\t",($_,@o) for @snp; # join CHROM,POS,REF,ALT with summed values
    @snp=(join"\t",@F[0..3]); # re-initialize
    @o=();
    @r=([@F[4..$#F]])
}
$prev="$F[0]:$F[1]"; # store CHROM:POS info
END{ # print final lines
    for $r (@r){
        $o[$_]+=$$r[$_] for 0..scalar(@$r)-1
    };
    print join"\t",($_,@o) for @snp;
}
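For comparison with the awk attempts in the question, here is a minimal two-pass awk sketch of the same task. It is not the asker's code. It assumes tab-separated input, that the columns to sum are exactly those whose header contains _00, and that reading the file twice is acceptable. The file name db.tsv and the array names cols and sum are hypothetical. Only one running total per (CHROM, POS, column) is kept in memory, not the rows themselves:

```shell
#!/bin/sh
# Recreate the question's sample table (tab-delimited) in a scratch directory.
dir=$(mktemp -d)
{
    printf 'CHROM\tPOS\tREF\tALT\tGT_00\td_GT_00\tc_GT_00\tde_GT_00\tcan_GT_00\tepi_GT_00\n'
    printf 'chr1\t10\tT\tA\t7\t1\t1\t2\t5\t7\n'
    printf 'chr1\t10\tT\tA\t7\t1\t1\t3\t0\t1\n'
    printf 'chr1\t10\tT\tG\t7\t2\t1\t1\t8\t2\n'
    printf 'chr1\t11\tNone\tNone\t2\t0\t0\t0\t5\t4\n'
    printf 'chr1\t11\tG\tT\t2\t1\t0\t0\t2\t3\n'
} > "$dir/db.tsv"

# The same file is named twice: pass 1 sums, pass 2 rewrites.
result=$(awk -F'\t' -v OFS='\t' '
NR == 1 {                              # header, first pass: find the _00 columns
    for (i = 1; i <= NF; i++)
        if ($i ~ /_00/) cols[i]
}
NR == FNR {                            # first pass: accumulate sums per CHROM,POS
    if (FNR > 1)
        for (i in cols) sum[$1, $2, i] += $i
    next
}
FNR > 1 {                              # second pass: overwrite values with totals
    for (i in cols) $i = sum[$1, $2, i]
}
{ print }                              # the header line passes through unchanged
' "$dir/db.tsv" "$dir/db.tsv")

printf '%s\n' "$result"
rm -rf "$dir"
```

Columns without _00 in the header are left alone, which covers the note about extra columns. Unlike the perl version, this does not require rows with the same CHROM and POS to be adjacent.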

How to combine column from multiple text files? [duplicate]

I want to extract and combine a certain column from a bunch of text files into a single file as shown.
A 123 1
B 234 2
C 345 3
D 456 4
A 123 5
B 234 6
C 345 7
D 456 8
A 123 9
B 234 10
C 345 11
D 456 12
A 123 55
B 234 66
C 345 77
D 456 88
How can I loop through my files of interest and paste these columns together so that the final result is like below without having to type out 1000 unique file names?
1 5 9 ... 55
2 6 10 ... 66
3 7 11 ... 77
4 8 12 ... 88
Try this:
paste File[0-9]*_example.txt | awk '{for(i=3;i<=NF;i+=3)printf("%s ",$i);printf("\n")}'
A 123 1
B 234 2
C 345 3
D 456 4
A 123 5
B 234 6
C 345 7
D 456 8
Run command as:
$ paste File[0-9]*_example.txt | awk '{for(i=3;i<=NF;i+=3)printf("%s ",$i);printf("\n")}'
1 5
2 6
3 7
4 8
I tested the code below with the first 3 files:
cat File*_example.txt | awk '{a[$1$2]= a[$1$2] $3 " "} END{for(x in a){print a[x]}}' | sort
1 5 9
2 6 10
3 7 11
4 8 12
1) Use an awk array: a[$1$2]= a[$1$2] $3 " " is indexed by column 1 and column 2, and the array value accumulates all the column 3 values.
2) END{for(x in a){print a[x]}} traverses array a and prints all values.
3) Use sort to sort the output.
When cat-ing, you need to ensure the file order is preserved; one way is to specify the files explicitly:
cat File{1..100}_example.txt | awk '{print $NF}' | pr -4ts' '
This extracts the last column with awk and aligns it using pr.

Updating n-th column in csv using awk [duplicate]

Input file
Expected Output
I tried the following command
awk -F "," '{$8=$8"MNINS"}1' 1.csv > 2.csv
1 2 3 4 5 6 7 8MNINS 9 10
11 22 33 44 55 66 77 88MNINS 99 100
111 222 333 444 555 666 777 888MNINS 999 1000
It removed all the commas, so my CSV file is turning into a space-separated file.
Please help.
You need to specify a comma as the output field separator (OFS) value.
awk -F "," -v OFS="," '{$8=$8"MNINS"}1' 1.csv > 2.csv
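The reason this works: assigning to a field makes awk rebuild $0 by joining the fields with OFS, which defaults to a single space. A small check on one made-up row (the input line is illustrative, not the asker's file):

```shell
#!/bin/sh
line='1,2,3,4,5,6,7,8,9,10'

# Without OFS, the rebuilt record is joined with spaces.
without_ofs=$(printf '%s\n' "$line" | awk -F, '{$8=$8"MNINS"}1')

# With OFS set to a comma, the rebuilt record keeps its commas.
with_ofs=$(printf '%s\n' "$line" | awk -F, -v OFS=, '{$8=$8"MNINS"}1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```

The first line printed uses spaces and the second keeps the commas. Note that splitting on a bare comma does not handle quoted CSV fields that contain commas.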
