Assigning divisions and subdivisions (A.1 and A.1.1) to data in a file - bash

I have an example input of the following data in a file.
start end
chr1 100 300
chr2 200 400
The "start" and "end" indicate the length of the region. So, for "chr1" the region length is 200. For "chr2" the length is 200.
I assigned each "chr" region with a "name" using awk'{print$0 "\tA." NR} to produce :
start end name
chr1 100 300 A.1
chr2 200 400 A.2
What I want to do next is to break chr1 into 2 parts by splitting region length into 100 each, and named each part with A.1.1 and A.1.2 (to indicate that they used to be 1 part, but is split into 2). And the same with "chr2." So that they look like this:
start end name
chr1 100 200 A.1.1
chr1 201 300 A.1.2
chr2 200 300 A.2.1
chr2 301 400 A.2.2
So, my question is for the very last part. Would it be possible to use awk or something that can work with awk (since I already use awk for the first part) to solve this? if so, how would you do that?
Thanks for the help guys.

Using the following input:
chr1 100 300
chr2 200 400
I have kept the script simple so that you can follow what exactly is being done. You can bypass the intermediate step you are doing as the following will get that done.
awk -v OFS="\t" '
{
offset = 0;
range = int(($3-$2)/100);
start = $2;
end = $3;
for (iter=1; iter<=range; iter++) {
print $1, start+offset, (iter==range?end:start+100), "A."NR"."iter;
offset = 1;
start+=100
}
}' file
chr1 100 200 A.1.1
chr1 201 300 A.1.2
chr2 200 300 A.2.1
chr2 301 400 A.2.2
We create three variables, iter, start, and end that gets initialized to 0 for every line. We calculate the range from start and end. We loop for the range to print column1, start range, start+100 along with character A followed by line number and the iteration number.
We initialize the offset to 1 so that next range doesn't start from the end of first.
There is a ternary test (iter==range?end:start+100) which basically checks if we are towards the end of range. If we are we use the end number. This is to handle cases where your lines would be chr1 100 150 etc.

$ awk '$1!=prev{++cnt} {print $0 "\tA." cnt "." ++seen[$1]; prev=$1}' file
chr1 100 200 A.1.1
chr1 201 300 A.1.2
chr2 200 300 A.2.1
chr2 301 400 A.2.2

Related

Extracting certain locus from multiple samples from text file

After profiling STR locus in a population, the output gave me 122 files each of which contains about unique 800,000 locus.
There are 2 examples of my files:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4
I've been trying to extract variants for each of 800,000 STR locus. I expect the output should be like this for chromosome 1 at position of 230769616:
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4
I tried this command:
awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf
It worked but it take lots of time to create large number of files for each locus.
I am struggling to find an optimal solution to solve this.
You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).
Assuming you want to get a separate output file for every $2+$3 pair, you should be using the following with any awk:
tail -n +2 -q *.vcf | sort -k2,3 |
awk '
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur }
{ print > out }
'
If you want to have the header line present in every output file then tweak that to:
{ head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
awk '
NR==1 { hdr=$0; next }
{ cur = $2 "_" $3 ".locus.tsv" }
cur != out { close(out); out=cur; print hdr > out }
{ print > out }
'
My VCF file look like this:
SAMPLE CHROM POS Allele_1 Allele_2 LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4
this is NOT a vcf file
for such file, sort on chrom,pos, compress with bgzip and index with tabix and query with tabix. http://www.htslib.org/doc/tabix.html
You can try processing everything in memory before printing them.
FNR > 1 {
i = $2 "_" $3
b[i, ++a[i]] = $0
}
END {
for (i in a) {
n = i ".locus.tsv"
for (j = 1; j <= a[i]; ++j)
print b[i, j] > n
close(n)
}
}
This may work depending on the size of your files and the amount of memory your machine has. Using another language that allows having a dynamic array as value can also be more efficient.

Getting sum of values in a particular column with some conditions

I have a tab delimited file like this:
chr1 104517 105076 abc 148
chr1 127781 128051 def 89
chr1 186884 186981 xyz 97
chr1 127781 128051 def 55
chr1 890934 891105 abc 50
chr1 104517 105076 abc 24
chr1 890934 891105 xyz 19
First, for every values in column 4 I wanted sum of the values in column 5. Like
abc 222
def 144
xyz 116
I did it with this code:
awk -F'\t' '{ SUM[$4] += $5 } END { for (j in SUM) print j, SUM[j] }' filename
Now I want to do this separately for every unique combination of first three columns. For example, in case of above input file, I want this output:
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Can someone please tell me the way to do this in bash script?
Thank you
I'd turn to perl instead of awk for its better support for complex data structures:
$ perl -M5.020 -lane '
our $data;
$data->{$F[0]}{$F[1]}{$F[2]}{$F[3]} += $F[4];
END {
for my $c1 (sort keys %$data) {
for my $c2 (sort { $a <=> $b } keys %{$data->{$c1}}) {
for my $c3 (sort { $a <=> $b } keys %{$data->{$c1}{$c2}}) {
my $rest = $data->{$c1}{$c2}{$c3};
print join("\t", $c1, $c2, $c3, %$rest{sort keys %$rest});
}
}
}
}' input.tsv
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Basically, builds a 4-dimensional hash table using the first four columns of each line as keys, with the sum of the fifth column as the final value. Then walks the levels of the table in sorted order and prints the result.

process second column if first column matches

I just want the second column to be multiplied by exp(3) if the first column matches the parameter I define.
cat inputfile.i
100 2
200 3
300 1
100 5
200 2
300 3
I want the output to be:
100 2
200 60.25
300 1
100 5
200 40.17
300 3
I tried this code:
awk ' $1 == "200" {print $2*exp(3)}' inputfile
but nothing actually shows
you are not printing the unmatched lines, you don't need to quote numbers
$ awk '$1==200{$2*=exp(3)}1' file
100 2
200 60.2566
300 1
100 5
200 40.1711
300 3
Is there a difference between inputfile.i and inputfile?
Anyway, here is my solution for you:
awk '$1 == 200 {printf "%s %.2f\n",$1,$2*exp(3)};$1 != 200 {print $0}' inputfile.i
100 2
200 60.26
300 1
100 5
200 40.17
300 3

Sorting on a column alphanumerically

I have following file and I want to sort it alphanumerically based on 6 th column such that an E1 is followed by I1 and then E2 and so on of a specific ID before the ' : ', when I do sort -V -k6 file it puts all the ID:Is at the end and not where they should be.However when I do sort -k6 it does put the Es and Is of the IDs together but with some IDs belonging to different series interspersed (I have highlighted them here), how can I get the sorting such that no two IDs are mixed and the column is in the order it should be:
chr1 259017 259121 104 - ENSG00000228463:E2
chr1 259122 267095 7973 - ENSG00000228463:I1
chr1 267096 267253 157 - ENSG00000228463:E1
chr1 317720 317781 61 + ENSG00000237094:E1
chr1 317782 320161 2379 + ENSG00000237094:I1
chr1 320162 320653 491 + ENSG00000237094:E2
chr1 320654 320880 226 + ENSG00000237094:I2
chr1 320881 320938 57 + ENSG00000237094:E3
chr1 320939 321031 92 + ENSG00000237094:I3
chr1 321032 321290 258 + ENSG00000237094:E4
chr1 321291 322037 746 + ENSG00000237094:I4
chr1 322038 322228 190 + ENSG00000237094:E5
chr1 322229 322671 442 + ENSG00000237094:I5
chr1 322672 323073 401 + ENSG00000237094:E6
chr1 323074 323860 786 + ENSG00000237094:I6
chr1 323861 324060 199 + ENSG00000237094:E7
chr1 324061 324287 226 + ENSG00000237094:I7
chr1 324288 324345 57 + ENSG00000237094:E8
chr1 324346 324438 92 + ENSG00000237094:I8
chr1 324439 326514 2075 + ENSG00000237094:E9
**chr1 326096 326569 473 + ENSG00000250575:E1**
chr1 326515 327551 1036 + ENSG00000237094:I9
**chr1 326570 327347 777 + ENSG00000250575:I1**
**chr1 327348 328112 764 + ENSG00000250575:E2**
chr1 327552 328453 901 + ENSG00000237094:E10
chr1 328454 329783 1329 + ENSG00000237094:I10
**chr1 329431 329620 189 - ENSG00000233653:E2**
**chr1 329621 329949 328 - ENSG00000233653:I1**
chr1 329784 329976 192 + ENSG00000237094:E11
Original answer:
sed 's/:[EI]/&_ /' foo.txt | #separate the number at the end with a space
sort -k6 | sort -n -k7 | #sort by code, then by [EI] number
sed 's/_ //' #remove the underscore space
I like to do things like this by 'protecting' strings with a placeholder to isolate what I'm interested in, then replacing them later.
Closer:
sed 's/:[EI]/_ &_ /' foo.txt | sort -n -k8 | sort -k6,6 | sed 's/_ //g'
But this naively assumes that sort works in a very specific way that it doesn't... so sometimes E2 will come before E1...
I'm not sure it can be done with sort alone, awk might be the way to go...
So I came back to this question and wrote some python code that actually accomplishes the task:
#!/usr/bin/env python
import sys
import re
from collections import defaultdict
#loop through args
for thisarg in sys.argv[1:]:
#initialize a defualt dict
bysign = defaultdict(list)
#read the file
try:
thisfile = open(thisarg,'r')
for line in thisfile:
#split each line by space and colon
dat = re.split('[ :]*',line.strip())
#append line to dictionary indexed by ENSG code
bysign[dat[-2]].append(line.strip())
thisfile.close()
except IOError:
print "no such file {:}".format(thisarg)
#extract the keys from the dictionary
mykeys = bysign.keys()
#sort the keys
mykeys.sort()
for key in mykeys:
#initialize another, smaller dictionary
bytuple = dict()
#loop through all the lines that have the same ENSG code
group = bysign[key]
for line in group:
#extract the E/I code
ei=line.split(':')[-1]
#convert the E/I code to a (char,int) tuple
letter = ei[0]
number = int(ei[1:])
#use that tuple to index the smaller dict
bytuple[(letter,number)] = line
#extract the keys from the sub-dictionary
eikeys = bytuple.keys()
#sort the keys
eikeys.sort()
#print the results
for k in eikeys:
print bytuple[k]
I hope you already figured it out by now. Curious if anyone cares enough to improve my python.

Search for a value in a file and remove subsequent lines

I'm developing a shell script but I am stuck with the below part.
I have the file sample.txt:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
6 100 200
7 100 200
I want to search the S.No column in sample.txt. For example if I'm searching the value 5 I need the rows up to 5 only I don't want the rows after the value of in S.NO is larger than 5.
the output must look like, output.txt
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Print the first line and any other line where the first field is less than or equal to 5:
$ awk 'NR==1||$1<=5' file
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Using perl:
perl -ane 'print if $F[$1]<=5' file
And the sed solution
n=5
sed "/^$n[[:space:]]/q" filename
The sed q command exits after printing the current line
The suggested awk relies on that column 1 is numeric sorted. A generic awk that fulfills the question title would be:
gawk -v p=5 '$1==p {print; exit} {print}'
However, in this situation, sed is better IMO. Use -i to modify the input file.
sed '6q' sample.txt > output.txt

Resources