Replace numeric genotype code with DNA letter - bash

how can i replace the numeric genotype code with a DNA letter?
i have a modified vcf file that looks like that:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
i want to replace the 0/0 with the ref letter, 1/1 with the alt letter and delete all the string after it.
it should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C G G xxx
n...
been trying to do it with sed but it didn't work
don't know how to approach it

Would you please try:
awk '{
if (NR > 1) {
for (i=4; i<=5; i++) {
split($i, a, ":")
$i = a[1]
if ($i == "0/0") $i = $2
if ($i == "1/1") $i = $3
}
}
print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
Hope this helps.

Related

grep variable pattern and output match and sequence position

Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I want to first grep for pattern matching 6 or more xGx repeats, where x is any character. This, I can easily do,
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Here's a basic awk solution:
Each sequence must span a single line
The resulting positions are relatives to the start of the line
The algorithm first searches the parts of the line that match [^G]G[^G]{6,}, then searches for the occurrences of SGA and TGA in those parts. The implementation is a little tedious, as there's no offset option for the match() and index() functions of awk.
awk '
BEGIN {
regexp = "([^G]G[^G]){6,}"
search["SGA"]
search["TGA"]
}
/^>/ {
print
next
}
{
i0 = 1
s0 = $0
while ( match( s0, regexp ) ) {
head = substr(s0,RSTART,RLENGTH)
tail = substr(s0,RSTART+RLENGTH)
i0 += RSTART - 1
for (s in search) {
s1 = head
i1 = i0
while ( i = index(s1, s) ) {
s1 = substr(s1, i+1)
i1 += i
search[s] = search[s] " " i1-1
}
}
s0 = tail
i0 += RLENGTH
}
for (s in search) {
print s ":" search[s]
search[s] = ""
}
}
'
Example with simplified sequences
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
TODO
Parameterize the regex and the search strings: it's not difficult per se but the current code will run into an infinite loop when a search string is empty or when the regex allows 0-length matches; you'll need to prevent that from happening.
Allow multi-line sequences
Allow overlapping matches for the regex. Basically, it means to look for the next match at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or an other.
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
/^>/ { label = $0 ORS; next }
{
while (match(substr($0, pos + 1), /[ST]GA/)) {
pos += RSTART
if (len = RLENGTH) {
wbeg = pos - 18 + len # 18 is the length of .G..G..G..G..G..G.
wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
wbeg = (wbeg < 1 ? 1 : wbeg) # substr must start from at least 1
window = substr($0, wbeg, wlen)
if (window ~ /.G..G..G..G..G..G./) {
str = substr($0, pos, len)
print label str ":", pos + int(len / 2)
label = ""
}
pos += len - 1
}
if (pos >= length($0)) {
break
}
}
pos = 0
}
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
With your shown samples please try following awk code. Written and tested in GNU awk should work in any POSIX awk. In this code we could pass how many strings/variables into the function and can get their ALL present index values in the line. Pass all the values needs to be searched into awk variable named keyWords and it will look for all those into the lines.
awk -v keyWords="SGA,TGA" '
BEGIN{
num=split(keyWords,arr1,",")
for(i=1;i<=num;i++){
checkValues[arr1[i]]
}
}
!/>/{
start=diff=prev=""
while(match($0,/(.G.){6,}/)){
lineMatch=substr($0,RSTART,RLENGTH)
start+=(RSTART>1?RSTART-1:RSTART)
diff=(start-prev)
for(key in checkValues){
if(ind=index(lineMatch,key)){
print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
}
prev=start
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
Output with shown samples will be as follows:
>Q07092
SGA: 384

Creating a table with scrambled and unordered data

I'm trying to create a nice little table of values and I'm doing it using bash, but not all the values are in order. Not only that the values also happen to be in their own file. My first few thoughts are to use cat and grep to grab the values, but from there I'm not sure what is appropriate. I feel like awk would do wonders in this situation, but I do not know awk very well.
file1 might look like this
V 0.001
A 98.6
N Measurement1
T 14:15:01
S 20.2
F 212.86
G 28.19
file2 might look like this
V 0.008
A 103.4
N Measurement2
T 16:20:31
S 21.2
F 215.86
G 28.19
The final file would look like this
N Measurement1 Measurement2
T 14:15:01 16:20:31
V 0.001 0.008
G 28.19 28.19
A 98.6 103.4
S 20.2 21.2
F 212.86 215.86
self commented, code is provided for understanding awk
awk '
# cretate new reference (per file)
FNR==1{Ref++}
# each line
{ # add label to memory
N[$1]
# add value in 2 dimension array
V[Ref ":" $1] = $2
# remember maximum length of this serie
if( length( $2 ) > M[Ref] ) M[Ref] = length( $2 )
}
# after last file
END{
# print header (name of the serie)
printf( "N ")
for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":N" ] )
printf( "\n")
# print each data for this label (format suite the size to be aligned)
# don t print a second time the name of the serie
for ( n in N ){
if( n != "N" ){
printf( "%s ", n)
for( i=1;i<=Ref;i++) printf( "%" M[i] "s ", V[ i ":" n ] )
printf( "\n")
}
}
}
' file*

How to use "column" to center a chart?

I was wondering what the best way to sort a chart using the column command to center each column instead of the default left aligned column was. I have been using the column -t filename command.
Current Output:
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Desired Output: Something like this
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
data[NR] = $0
for (i = 1; i <= NF; ++i)
max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
for (i = 1; i <= NR; ++i) {
# Split record into array "arr"
split(data[i], arr)
# Loop over array
for (j = 1; j <= NF; ++j) {
# Calculate amount of padding required
pad = max[j] - length(arr[j])
# Print field with appropriate padding, see below
printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
j == NF ? "" : " "
}
# Newline at end of record
print ""
}
}
Called like this:
$ awk -f centre.awk infile
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : " ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
.
#!/usr/bin/env python3
import sys
def column():
# Read file and split each line into fields (by whitespace)
with open(sys.argv[1]) as f:
lines = [line.split() for line in f]
# Check that each line has the same number of fields
num_fields = len(lines[0])
for n, line in enumerate(lines):
if len(line) != num_fields:
print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
sys.exit(1)
# Calculate the maximum length of each field
max_column_widths = [0] * num_fields
for line in lines:
line_widths = (len(field) for field in line)
max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
# Now print them centered using the max_column_widths
spacing = 4
format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
for line in lines:
print(format_spec.format(*line))
if __name__ == '__main__':
column()

replacing specific value (from another file) using awk

I have a following file.
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 by x at a time and save in a file3. next time 1 to y and save in file4.
Then files look like
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
once I finished x, y, z then 2 by l, m and n.
I start with this but it inserts but does not replace.
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
Here's a gnu awk script ( because it uses multidimensional arrays, array ordering ) that will do what you want:
#!/usr/bin/awk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }
{
fout[FNR][1] = $0
ff = $NF
if(ff in f2) {
for( r in f2[ff]) {
$NF = f2[ff][r]
fout[FNR][fcnt++] = $0
}
}
}
END {
for(f=fcnt-1;f>=3;f--) {
for( row in fout ) {
if( fout[row][f] != "" ) out = fout[row][f]
else out = fout[row][1]
print out > "file" f
}
}
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original f1 line in fout as (line number,1) - where 1 is a special, available array position ( because fcnt starts at 3 ).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 ( notice the file order ). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.

how to write bash script in ubuntu to normalize the index of text comparison

I had a input which is a result from text comparison. It is in a very simple format. It has 3 columns, position, original texts and new texts.
But some of the records looks like this
4 ATCG ATCGC
10 1234 123
How to write the short script to normalize it to
7 G GC
12 34 3
probably, the whole original texts and the whole new text is like below respectively
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "trying to move the position in the first column to the position that changes gonna occur", or "we would remove the common prefix ATG, add its length 3 to the first field; similarly on line 2 the prefix we remove is length 2"
This script
awk '
BEGIN {OFS = "\t"}
function common_prefix_length(str1, str2, max_len, idx) {
idx = 1
if (length(str1) < length(str2))
max_len = length(str1)
else
max_len = length(str2)
while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
idx++
return idx - 1
}
{
len = common_prefix_length($2, $3)
print $1 + len, substr($2, len + 1), substr($3, len + 1)
}
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3

Resources