Sort a file to put 10, 11, 12... before 1, 2, 3... and X,Y - bash

I have a list of chromosome data with the columns (chromosome, start, and end) like this:
chr1 6252071 6253740
chr1 6965107 6966070
chr1 6966038 6967016
chr1 7066595 7068694
chr1 7100956 7102296
chr1 7153422 7154635
chr1 7155112 7156181
....
chr2
....
chr10
....
chrX
....
chrY
....
etc.
I am trying to use bash to sort the chromosome sections to this order:
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
in the first column, and then in numerical order by start position in the second column, but no variation of sort seems to do the job. Any ideas? Thanks.

Split your file into two streams with separate filtering, sort each stream by chromosome and then numerically by start position, then recombine them:
cat <(grep '^chr1[[:digit:]][[:space:]]' <inputfile | sort -k1,1 -k2,2n) \
    <(grep -v '^chr1[[:digit:]][[:space:]]' <inputfile | sort -k1,1 -k2,2n) \
    >outputfile

perl -E '
    open $f, "<", shift;
    say join "",
        map {$_->[0]}
        sort {length($b->[1]) <=> length($a->[1]) or $a->[1] cmp $b->[1]}
        map {[$_, (split)[0]]}
        <$f>
' file
It first opens the file.
Then it uses a Schwartzian Transform; read the rest from the bottom up:
read the lines: <$f>
transform the lines into a list of pairs: the original line, and the first word:
map {[$_, (split)[0]]}
sort, first by length (longest to shortest), then lexically (A to Z)
transform the list of pairs back into a list of lines (the first element of each pair):
map {$_->[0]}
join: the lines still have their newlines, so join on the empty string.
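The same decorate-sort-undecorate idea can be done with ordinary shell tools. A sketch (assuming whitespace-separated columns; inputfile and outputfile are placeholders): prefix each line with the length of its chromosome name, sort on that decoration plus the name and the start position, then strip the decoration off again.
# decorate with the chromosome-name length, sort longest names first,
# then by name, then numerically by start position, then undecorate
awk '{ print length($1), $0 }' inputfile | sort -k1,1nr -k2,2 -k3,3n | cut -d' ' -f2- > outputfile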

Related

Getting sum of values in a particular column with some conditions

I have a tab delimited file like this:
chr1 104517 105076 abc 148
chr1 127781 128051 def 89
chr1 186884 186981 xyz 97
chr1 127781 128051 def 55
chr1 890934 891105 abc 50
chr1 104517 105076 abc 24
chr1 890934 891105 xyz 19
First, for every value in column 4 I wanted the sum of the values in column 5, like:
abc 222
def 144
xyz 116
I did it with this code:
awk -F'\t' '{ SUM[$4] += $5 } END { for (j in SUM) print j, SUM[j] }' filename
Now I want to do this separately for every unique combination of the first three columns. For example, for the above input file, I want this output:
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Can someone please tell me the way to do this in a bash script?
Thank you
I'd turn to perl instead of awk for its better support for complex data structures:
$ perl -M5.020 -lane '
    our $data;
    $data->{$F[0]}{$F[1]}{$F[2]}{$F[3]} += $F[4];
    END {
        for my $c1 (sort keys %$data) {
            for my $c2 (sort { $a <=> $b } keys %{$data->{$c1}}) {
                for my $c3 (sort { $a <=> $b } keys %{$data->{$c1}{$c2}}) {
                    my $rest = $data->{$c1}{$c2}{$c3};
                    print join("\t", $c1, $c2, $c3, %$rest{sort keys %$rest});
                }
            }
        }
    }' input.tsv
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Basically, it builds a 4-dimensional hash table using the first four columns of each line as keys, with the sum of the fifth column as the final value, then walks the levels of the table in sorted order and prints the result.
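Since the question asks for something bash-friendly, the same grouping can also be done with awk alone. A sketch, assuming tab-separated input in the same filename placeholder as above; the END loop visits groups in arbitrary order, so the result is piped through sort:
awk -F'\t' -v OFS='\t' '
{
    grp = $1 OFS $2 OFS $3                 # group key: first three columns
    key = grp OFS $4                       # sub-key: group plus the label in column 4
    if (!(key in sum))                     # first time this label appears in this group,
        labels[grp] = (grp in labels ? labels[grp] OFS : "") $4   # remember its order
    sum[key] += $5
}
END {
    for (grp in labels) {
        n = split(labels[grp], lab, OFS)
        out = grp
        for (i = 1; i <= n; i++)
            out = out OFS lab[i] OFS sum[grp OFS lab[i]]
        print out
    }
}' filename | sort -k1,1 -k2,2n -k3,3n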

How to join two huge files based on the first two columns in awk/Bash programs?

There are multiple threads, here and here, explaining how to perform a merge between two files using awk, for example.
My problem is a bit more complicated since my files are huge: file1.tsv is 288 GB with 109 columns and file2.tsv is 16 GB with 4 columns. I would like to join these files based on the first two columns:
file1.tsv (tab-separated) with 109 columns (here showing first 4 and last column):
CHROM POS REF ALT ... FILTER
chr1 10031 T C ... AC0;AS_VQSR
chr1 10037 T C ... AS_VQSR
chr1 10040 T A ... PASS
chr1 10043 T C ... AS_VQSR
chr1 10055 T C ... AS_VQSR
chr1 10057 A C ... AC0
file2.tsv (tab-separated) with 4 columns:
CHROM POS CHROM_hg19 POS_hg19
chr1 10031 chr1 10034
chr1 10037 chr1 10042
chr1 10043 chr1 10084
chr1 10055 chr1 10253
chr1 10057 chr1 10434
I wish to add the last two columns from file2.tsv to file1.tsv by matching on CHROM and POS, while keeping all non-matching rows from file1.tsv:
file3.txt
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR chr1 10084
chr1 10055 T C ... AS_VQSR chr1 10253
chr1 10057 A C ... AC0 chr1 10434
But as you have figured, these files are big. I tried the following:
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1.txt file2.txt
As soon as I hit enter, I saw my memory usage rocketing and no results being produced. I am unsure whether this will produce the correct results in the end, or how much memory it will use. Is there a better way to join my files using awk or any other Bash tools?
Thank you in advance.
With join, sed and bash (Process Substitution):
join -t $'\t' -a 1 <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt
This solution assumes that the first two columns are sorted together in ascending order in both files.
See: man join
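If the files are not already sorted that way, the sort can be folded into the same pipeline. This is only a sketch: the /big/tmp directory is hypothetical and should point at a disk with enough room for sort's temporary files, and note that the header lines get sorted along with the data:
join -t $'\t' -a 1 \
  <(sed 's/\t/:/' file1.tsv | sort -t $'\t' -k1,1 -T /big/tmp) \
  <(sed 's/\t/:/' file2.tsv | sort -t $'\t' -k1,1 -T /big/tmp) |
  sed 's/:/\t/' > file3.txt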
If all else fails, you could brute-force it: read a line from file1, then read lines from file2 until you hit a match or a higher number, then read the next line from file1, and so on. The advantage of that approach is that very little is stored in memory, so it should work no matter how large your files are.
This isn't quite right, but I don't have any more time to think about it, so consider it a start; if anyone wants to finish it off and post the finished product as an answer, be my guest:
$ cat tst.awk
BEGIN {
    f1name = ARGV[1]
    f2name = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    while ( !done ) {
        if ( (f1stat = (getline line1 < f1name)) > 0 ) {
            split(line1,f1)
            f1key = f1[1] FS f1[2]
        }
        matched = 0
        while ( !eof && !matched ) {
            if ( (f2stat = (getline line2 < f2name)) > 0 ) {
                split(line2,f2)
                f2key = f2[1] FS f2[2]
                matched = (f1key == f2key)
            }
            else {
                eof = 1
            }
        }
        print line1, (matched ? f2[3] OFS f2[4] : "-" OFS "-")
        if ( (f1stat <= 0) && (f2stat <= 0) ) {
            done = 1
        }
    }
}
$ awk -f tst.awk file1.tsv file2.tsv
CHROM POS REF ALT ... FILTER CHROM_hg19 POS_hg19
chr1 10031 T C ... AC0;AS_VQSR chr1 10034
chr1 10037 T C ... AS_VQSR chr1 10042
chr1 10040 T A ... PASS - -
chr1 10043 T C ... AS_VQSR - -
chr1 10055 T C ... AS_VQSR - -
chr1 10057 A C ... AC0 - -
chr1 10057 A C ... AC0 - -
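Taking up that invitation, one way to finish it off might be the sketch below (not tested at the 288 GB scale): buffer a single file2 line, only advance file2 while its key is still behind the current file1 key, and compare the POS fields numerically. It assumes both files are sorted with sort -t$'\t' -k1,1 -k2,2n and that the header line stays first in both:
$ cat tst2.awk
# sorted merge join of file1.tsv (left) and file2.tsv on the first two columns
function behind(c2, p2, c1, p1) {
    # is the buffered file2 key strictly before the current file1 key?
    # chromosome compared as a string, position numerically (matches sort -k1,1 -k2,2n)
    return (c2 < c1) || (c2 == c1 && p2 + 0 < p1 + 0)
}
BEGIN {
    FS = OFS = "\t"
    f1name = ARGV[1]; f2name = ARGV[2]
    ARGV[1] = ARGV[2] = ""
    f2stat = (getline line2 < f2name)              # prime the first file2 line
    if ( f2stat > 0 ) split(line2, f2)
    while ( (getline line1 < f1name) > 0 ) {
        split(line1, f1)
        # skip file2 lines whose key sorts before the current file1 key
        while ( f2stat > 0 && behind(f2[1], f2[2], f1[1], f1[2]) ) {
            f2stat = (getline line2 < f2name)
            if ( f2stat > 0 ) split(line2, f2)
        }
        if ( f2stat > 0 && f2[1] == f1[1] && f2[2] == f1[2] )
            print line1, f2[3], f2[4]              # match: append the two hg19 columns
        else
            print line1, "-", "-"                  # no match: pad with dashes
    }
}
$ awk -f tst2.awk file1.tsv file2.tsv > file3.txt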

Using specific columns, output rows that are present 3 times in a text file

I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears three times, I want to add two columns for the other two samples it appears in, so the output from the above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R, but the file is too large to read in, so I am looking for a solution that works in Linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
edit: Thanks for all these informative answers. I selected the one closest to how I am used to working, but the other answers look great too and I will learn from them.
Using GNU datamash, tr and awk assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
First, use datamash to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1,sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr to convert the commas to tabs, and finally use awk to print only the rows with seven fields.
Using awk:
awk '
    BEGIN{ FS=OFS="\t" }
    {
        idx=$1 FS $2 FS $3 FS $4
        cnt[idx]++
        data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
    }
    END{
        for (i in cnt)
            if (cnt[i]==3) print i, data[i]
    }
' file
Maintain two arrays using the first four fields as the index.
The first counts how many times each index has been seen, and the second appends the 5th field using a tab as the separator.
In the END block, loop over the cnt array and print the index and the value of the data array whenever the count is three.
For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
Then:
$ ./demo.sh input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = samples OFS $5; cnt++ }
END { prt() }
function prt() { if ( cnt == 3 ) print prev samples }
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort uses paging etc. to handle input that's too large to fit in memory, so it will successfully handle larger input than many other tools can, and the awk script stores almost nothing in memory.

Sorting on a column alphanumerically

I have the following file and I want to sort it alphanumerically based on the 6th column, such that, for a given ID (the part before the ':'), E1 is followed by I1, then E2, and so on. When I do sort -V -k6 file it puts all the ID:I entries at the end and not where they should be. However, when I do sort -k6 it does put the Es and Is of each ID together, but with some IDs belonging to different series interspersed (I have highlighted them here). How can I get the sorting such that no two IDs are mixed and the column is in the order it should be?
chr1 259017 259121 104 - ENSG00000228463:E2
chr1 259122 267095 7973 - ENSG00000228463:I1
chr1 267096 267253 157 - ENSG00000228463:E1
chr1 317720 317781 61 + ENSG00000237094:E1
chr1 317782 320161 2379 + ENSG00000237094:I1
chr1 320162 320653 491 + ENSG00000237094:E2
chr1 320654 320880 226 + ENSG00000237094:I2
chr1 320881 320938 57 + ENSG00000237094:E3
chr1 320939 321031 92 + ENSG00000237094:I3
chr1 321032 321290 258 + ENSG00000237094:E4
chr1 321291 322037 746 + ENSG00000237094:I4
chr1 322038 322228 190 + ENSG00000237094:E5
chr1 322229 322671 442 + ENSG00000237094:I5
chr1 322672 323073 401 + ENSG00000237094:E6
chr1 323074 323860 786 + ENSG00000237094:I6
chr1 323861 324060 199 + ENSG00000237094:E7
chr1 324061 324287 226 + ENSG00000237094:I7
chr1 324288 324345 57 + ENSG00000237094:E8
chr1 324346 324438 92 + ENSG00000237094:I8
chr1 324439 326514 2075 + ENSG00000237094:E9
**chr1 326096 326569 473 + ENSG00000250575:E1**
chr1 326515 327551 1036 + ENSG00000237094:I9
**chr1 326570 327347 777 + ENSG00000250575:I1**
**chr1 327348 328112 764 + ENSG00000250575:E2**
chr1 327552 328453 901 + ENSG00000237094:E10
chr1 328454 329783 1329 + ENSG00000237094:I10
**chr1 329431 329620 189 - ENSG00000233653:E2**
**chr1 329621 329949 328 - ENSG00000233653:I1**
chr1 329784 329976 192 + ENSG00000237094:E11
Original answer:
sed 's/:[EI]/&_ /' foo.txt | #separate the number at the end with a space
sort -k6 | sort -n -k7 | #sort by code, then by [EI] number
sed 's/_ //' #remove the underscore space
I like to do things like this by 'protecting' strings with a placeholder to isolate what I'm interested in, then replacing them later.
Closer:
sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'
But this naively assumes that sort works in a very specific way that it doesn't... so sometimes E2 will come before E1...
I'm not sure it can be done with sort alone; awk might be the way to go...
So I came back to this question and wrote some python code that actually accomplishes the task:
#!/usr/bin/env python
import sys
import re
from collections import defaultdict
#loop through args
for thisarg in sys.argv[1:]:
    #initialize a default dict
    bysign = defaultdict(list)
    #read the file
    try:
        thisfile = open(thisarg,'r')
        for line in thisfile:
            #split each line by space and colon
            dat = re.split('[ :]*',line.strip())
            #append line to dictionary indexed by ENSG code
            bysign[dat[-2]].append(line.strip())
        thisfile.close()
    except IOError:
        print "no such file {:}".format(thisarg)
    #extract the keys from the dictionary
    mykeys = bysign.keys()
    #sort the keys
    mykeys.sort()
    for key in mykeys:
        #initialize another, smaller dictionary
        bytuple = dict()
        #loop through all the lines that have the same ENSG code
        group = bysign[key]
        for line in group:
            #extract the E/I code
            ei = line.split(':')[-1]
            #convert the E/I code to a (char,int) tuple
            letter = ei[0]
            number = int(ei[1:])
            #use that tuple to index the smaller dict
            bytuple[(letter,number)] = line
        #extract the keys from the sub-dictionary
        eikeys = bytuple.keys()
        #sort the keys
        eikeys.sort()
        #print the results
        for k in eikeys:
            print bytuple[k]
I hope you have already figured it out by now. Curious if anyone cares enough to improve my Python.
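For what it's worth, the decorate-sort-undecorate idea also works with sort and awk alone. A sketch (assuming single-space or tab-separated columns and the foo.txt name used above): pull the gene ID, the E/I number and the E/I letter out in front of each line, sort on those three keys, then cut the decoration off again:
# decorate: gene ID, E/I number, E/I letter; sort; undecorate
awk '{ split($6, a, ":"); print a[1], substr(a[2], 2) + 0, substr(a[2], 1, 1), $0 }' foo.txt |
sort -k1,1 -k2,2n -k3,3 |
cut -d' ' -f4- > sorted.txt
This keeps each gene's lines together and interleaves E and I by their number (E1, I1, E2, I2, ...).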

Use bash commands to sort a list according to a certain column

I have a list of data with four columns like below:
chr1 9778939 10199603 DEL
chr1 143804138 143808614 DEL
chr1 8541961 8757598 DEL
chr1 141480516 141909199 INV
chr1 3902285 4665319 INV
chr1 10212548 10467934 DEL
chr1 225767517 226730696 INV
chr1 10807309 11011343 DEL
chr1 23663773 23957334 DEL
chr1 4468523 4665322 DEL
chr1 24458662 24704306 DEL
....
....
chr2
....
....
chr10
....
....
chr22
....
....
chrX
....
....
chrY
....
....
I hope to:
first sort according to chr1, chr2, chr3... up to chr22, chrX, chrY. If I simply use sort -n, it sorts as chr10, chr1, chr11, and so on. I want to sort according to the numeric value of the first column.
Then, under each chromosome (chr1, chr2, ...), how can I sort according to the last column, that is, "DEL" or "INV"?
Then sort according to the second column, again by numeric value. Say 104000 should go after 10500 because 104000 > 10500, not based on comparing the third digits (4 and 5).
Thanks. Hope I've made it clear.
Assuming the columns in the file afile are separated by a single space character:
$ cat afile | sed 's/chr/chr /' | sort -k2,2n -k5,5 -k3,3n | sed 's/chr /chr/'
Convert X and Y to 23 and 24 to sort numerically, and then back after the sort.
cat file | sed 's/chr/chr /' | sed 's/ X/ 23/' | sed 's/ Y/ 24/' | sort -k 2,2n -k 5,5 -k 3,3n | sed 's/chr 23/chrX/' | sed 's/chr 24/chrY/' | sed 's/chr /chr/'
It's a long string of seds, but they run quickly.
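With GNU sort, version sort can also produce this chromosome order directly, without rewriting the names. A sketch (-V is a GNU extension, so check where chrX, chrY and any other names land on your own data):
sort -k1,1V -k4,4 -k2,2n file
-k1,1V orders chr1, chr2, ... chr10, ... chr22 ahead of chrX and chrY, -k4,4 groups DEL before INV within each chromosome, and -k2,2n sorts the start position numerically.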
