Make a file look not messed up - bash

I have a file that looks messed up:
contig_1 bin.0013 Rhizobium flavum (taxid 1335061)
contig_2 Alphaproteobacteria (taxid 28211)
contig_3 bin.009
contig_4 bin.008 unclassified (taxid 0)
contig_5 bin.001 Fluviicoccus keumensis (taxid 1435465)
contig_12 bin.003
I want it to look right, with tab-delimited columns and zeros where a field is empty:
contig_1 bin.0013 Rhizobium flavum (taxid 1335061)
contig_2 0 Alphaproteobacteria (taxid 28211)
contig_3 bin.009 0
contig_4 bin.008 unclassified (taxid 0)
contig_5 bin.001 Fluviicoccus keumensis (taxid 1435465)
contig_12 bin.003 0
If I use something like sed 's/ /,/g' filename, commas are inserted everywhere except between columns 1-2 and 2-3.

If awk is your option, would you please try the following:
awk -v OFS="\t" '
NR==FNR {
    # in the 1st pass, detect the starting positions of the 2nd field and the 3rd
    sub(" +$", "")      # it avoids misdetection due to extra trailing blanks
    if (match($0, "[^[:blank:]]+[[:blank:]]+")) {
        # RLENGTH holds the ending position of the 1st blank
        if (col2 == 0 || RLENGTH < col2) col2 = RLENGTH + 1
        if (match($0, "[^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+")) {
            # RLENGTH holds the ending position of the 2nd blank
            if (col3 == 0 || RLENGTH < col3) col3 = RLENGTH + 1
        }
    }
    next
}
{
    # in the 2nd pass, extract the substrings in the fixed position and reformat them
    # by removing extra spaces and putting "0" if the field is empty
    c1 = substr($0, 1, col2 - 1); sub(" +$", "", c1); if (c1 == "") c1 = "0"
    c2 = substr($0, col2, col3 - col2); sub(" +$", "", c2); if (c2 == "") c2 = "0"
    c3 = substr($0, col3); gsub(" +", " ", c3); if (c3 == "") c3 = "0"
    # print c1, c2, c3      # use this for the tab-separated output
    printf("%-12s%-12s%-s\n", c1, c2, c3)
}' file file
Output:
contig_1    bin.0013    Rhizobium flavum (taxid 1335061)
contig_2    0           Alphaproteobacteria (taxid 28211)
contig_3    bin.009     0
contig_4    bin.008     unclassified (taxid 0)
contig_5    bin.001     Fluviicoccus keumensis (taxid 1435465)
contig_12   bin.003     0
The process consists of two passes over the same file, which is why the file name is given twice on the command line. In the 1st pass, it detects the starting positions of the fields.
In the 2nd pass, it cuts out individual fields by using the positions calculated in the 1st pass.
I have picked printf to visually align the output. You can switch to tab separated values
depending on the preference.
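For readers who prefer Python, the same two-pass idea can be sketched roughly as below. This is not the answer's code, only an illustration: the function names are made up, the input is assumed to be space-padded fixed-width text, and at least one line is assumed to contain all three fields.

```python
import re

# Hypothetical helpers, named for illustration only.
FIRST = re.compile(r"\S+\s+")            # 1st field plus the blanks after it
SECOND = re.compile(r"\S+\s+\S+\s+")     # 1st and 2nd fields plus blanks


def detect_starts(lines):
    """Pass 1: smallest 0-based start offsets of the 2nd and 3rd columns."""
    col2 = col3 = None
    for raw in lines:
        line = raw.rstrip()              # trailing blanks would skew detection
        m = FIRST.match(line)
        if m:
            col2 = m.end() if col2 is None else min(col2, m.end())
        m = SECOND.match(line)
        if m:
            col3 = m.end() if col3 is None else min(col3, m.end())
    return col2, col3


def to_tsv(lines):
    """Pass 2: slice every line at the detected offsets, fill gaps with 0."""
    col2, col3 = detect_starts(lines)    # assumes both offsets were found
    rows = []
    for raw in lines:
        line = raw.rstrip()
        c1 = line[:col2].strip() or "0"
        c2 = line[col2:col3].strip() or "0"
        c3 = " ".join(line[col3:].split()) or "0"
        rows.append("\t".join((c1, c2, c3)))
    return rows


sample = [
    "contig_1".ljust(12) + "bin.0013".ljust(12) + "Rhizobium flavum (taxid 1335061)",
    "contig_2".ljust(24) + "Alphaproteobacteria (taxid 28211)",
    "contig_3".ljust(12) + "bin.009",
]
print("\n".join(to_tsv(sample)))
```

Like the awk version, this only works if the columns really start at the same character offset on every line.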

Related

grep variable pattern and output match and sequence position

Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I want to first grep for pattern matching 6 or more xGx repeats, where x is any character. This, I can easily do,
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Here's a basic awk solution:
Each sequence must span a single line
The resulting positions are relative to the start of the line
The algorithm first searches for the parts of the line that match ([^G]G[^G]){6,}, then searches for the occurrences of SGA and TGA within those parts. The implementation is a little tedious, as awk's match() and index() functions have no offset option.
awk '
    BEGIN {
        regexp = "([^G]G[^G]){6,}"
        search["SGA"]
        search["TGA"]
    }
    /^>/ {
        print
        next
    }
    {
        i0 = 1
        s0 = $0
        while ( match( s0, regexp ) ) {
            head = substr(s0,RSTART,RLENGTH)
            tail = substr(s0,RSTART+RLENGTH)
            i0 += RSTART - 1
            for (s in search) {
                s1 = head
                i1 = i0
                while ( i = index(s1, s) ) {
                    s1 = substr(s1, i+1)
                    i1 += i
                    search[s] = search[s] " " i1-1
                }
            }
            s0 = tail
            i0 += RLENGTH
        }
        for (s in search) {
            print s ":" search[s]
            search[s] = ""
        }
    }
'
Example with simplified sequences
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
TODO
Parameterize the regex and the search strings. It's not difficult per se, but the current code runs into an infinite loop when a search string is empty or when the regex allows zero-length matches, so you'll need to guard against both.
Allow multi-line sequences.
Allow overlapping matches for the regex. Basically, that means looking for the next match at RSTART+1 of the previous iteration; this generates a lot of duplicate results that you need to discard one way or another.
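As a cross-check, the same "find the repeat regions first, then look for the motifs inside them" approach can be sketched in Python, where the regex engine reports match offsets directly. The function name and the dictionary return shape are my own choices, not part of the answer. As in the awk version, the reported position is that of the motif's first character (add 1 to get the G).

```python
import re

# Six or more consecutive xGx triplets, where x is anything but G.
REPEAT = re.compile(r"(?:[^G]G[^G]){6,}")


def motif_positions(seq, motifs=("SGA", "TGA")):
    """Return 1-based start positions of each motif inside repeat regions."""
    found = {motif: [] for motif in motifs}
    for run in REPEAT.finditer(seq):
        region = run.group()
        for motif in motifs:
            start = region.find(motif)
            while start != -1:
                # run.start() is 0-based, so +1 gives a 1-based position
                found[motif].append(run.start() + start + 1)
                start = region.find(motif, start + 1)
    return found


test1 = "SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G."
print(motif_positions(test1))
```

Like the awk code, this handles one sequence per string and does not consider overlapping regex matches.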
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
    /^>/ { label = $0 ORS; next }
    {
        while (match(substr($0, pos + 1), /[ST]GA/)) {
            pos += RSTART
            if (len = RLENGTH) {
                wbeg = pos - 18 + len    # 18 is the length of .G..G..G..G..G..G.
                wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
                wbeg = (wbeg < 1 ? 1 : wbeg)    # substr must start from at least 1
                window = substr($0, wbeg, wlen)
                if (window ~ /.G..G..G..G..G..G./) {
                    str = substr($0, pos, len)
                    print label str ":", pos + int(len / 2)
                    label = ""
                }
                pos += len - 1
            }
            if (pos >= length($0)) {
                break
            }
        }
        pos = 0
    }
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any POSIX awk. You can pass any number of search strings, and the code reports where each of them occurs in the line. Put all the strings to search for, separated by commas, in the awk variable keyWords.
awk -v keyWords="SGA,TGA" '
BEGIN{
  num=split(keyWords,arr1,",")
  for(i=1;i<=num;i++){
    checkValues[arr1[i]]
  }
}
!/>/{
  start=diff=prev=""
  while(match($0,/(.G.){6,}/)){
    lineMatch=substr($0,RSTART,RLENGTH)
    start+=(RSTART>1?RSTART-1:RSTART)
    diff=(start-prev)
    for(key in checkValues){
      if(ind=index(lineMatch,key)){
        print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
      }
      prev=start
    }
    $0=substr($0,RSTART+RLENGTH)
  }
}
' Input_file
Output with shown samples will be as follows:
>Q07092
SGA: 384

Replace numeric genotype code with DNA letter

How can I replace the numeric genotype code with a DNA letter?
I have a modified VCF file that looks like this:
POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...
I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete everything after the genotype in each field.
It should look like this:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
I have been trying to do it with sed, but it didn't work, and I don't know how to approach it.
Would you please try:
awk '{
    if (NR > 1) {
        for (i=4; i<=5; i++) {
            split($i, a, ":")
            $i = a[1]
            if ($i == "0/0") $i = $2
            if ($i == "1/1") $i = $3
        }
    }
    print
}' file.txt
Output:
POS REF ALT A2.bam C10.bam
448 T C T T
2402 C T T xxx
n...
The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
First it chops off the substring after ":".
If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
In case of "1/1", use the 3rd column (ALT).
Hope this helps.
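If it helps to see the same steps outside awk, here is a rough Python equivalent. The function name is hypothetical, and I am assuming whitespace-separated columns with the samples starting at the 4th column (any number of them), and that the header line is printed separately.

```python
def decode_genotypes(line):
    """Replace 0/0 with REF and 1/1 with ALT in every sample column."""
    fields = line.split()
    ref, alt = fields[1], fields[2]
    mapping = {"0/0": ref, "1/1": alt}
    decoded = []
    for sample in fields[3:]:
        call = sample.split(":", 1)[0]           # chop off everything after ":"
        decoded.append(mapping.get(call, call))  # unknown codes stay as they are
    return " ".join(fields[:3] + decoded)


print(decode_genotypes("448 T C 0/0:0,255,255 0/0:0,255,255"))
print(decode_genotypes("2402 C T 1/1:209,23,0 xxx:255,0,255"))
```

Anything that is neither 0/0 nor 1/1 (such as xxx, or a heterozygous call) is passed through unchanged, which matches what the awk loop does.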

How to use "column" to center a chart?

What is the best way to format a chart with the column command so that each column is centered instead of left-aligned (the default)? I have been using the column -t filename command.
Current Output:
Label1        label2
Anotherlabel  label2442
label152      label42242
label78765    label373737737
Desired Output: Something like this
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
Basically, I want it to be centered instead of left aligned.
Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
    data[NR] = $0
    for (i = 1; i <= NF; ++i)
        max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
    for (i = 1; i <= NR; ++i) {
        # Split record into array "arr"
        split(data[i], arr)
        # Loop over array
        for (j = 1; j <= NF; ++j) {
            # Calculate amount of padding required
            pad = max[j] - length(arr[j])
            # Print field with appropriate padding, see below
            printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
                   pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
                   j == NF ? "" : "  "
        }
        # Newline at end of record
        print ""
    }
}
Called like this:
$ awk -f centre.awk infile
   Label1         label2
Anotherlabel    label2442
  label152      label42242
 label78765   label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : "  ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.
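The padding arithmetic may be easier to follow in isolation. Here is a small Python sketch of just that step; the helper name centre is made up, and it mirrors the rule above where an odd amount of padding puts the extra space on the right.

```python
def centre(text, width):
    """Pad text to width, putting the extra space (if any) on the right."""
    pad = max(width - len(text), 0)
    left = pad // 2          # same as int(pad/2) in the awk script
    right = pad - left       # pad/2, or int(pad/2) + 1 when pad is odd
    return " " * left + text + " " * right


# Column widths 12 and 14 are the longest fields in the sample data.
print(repr(centre("Label1", 12)))
print(repr(centre("label2442", 14)))
```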
Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
#!/usr/bin/env python3
import sys

def column():
    # Read file and split each line into fields (by whitespace)
    with open(sys.argv[1]) as f:
        lines = [line.split() for line in f]
    # Check that each line has the same number of fields
    num_fields = len(lines[0])
    for n, line in enumerate(lines):
        if len(line) != num_fields:
            print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
            sys.exit(1)
    # Calculate the maximum length of each field
    max_column_widths = [0] * num_fields
    for line in lines:
        line_widths = (len(field) for field in line)
        max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
    # Now print them centered using the max_column_widths
    spacing = 4
    format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
    for line in lines:
        print(format_spec.format(*line))

if __name__ == '__main__':
    column()

label timestamps by intervals in bash

I am using awk to process a file in which each line has three fields separated by spaces: 1. starting point; 2. ending point; 3. label
I want to create new labels within fixed-size frames, which requires an if, and that is where I am a little stuck.
I am looking for something like this:
num_intervals == (tail -1 | ending point)/250000
count == 1
interval == 2500000
current_interval_start == 0
current_interval_end == current_interval_start + interval
for interval in num_intervals
if starting_point >= current_interval_start and if ending_point <= current_interval_end then
print count + label
count == count + 1
current_interval_start == current_interval_end
current_interval_end == current_interval_start + interval
*observation if two labels are in the same interval range, take the first one, but I could post process this.
My data looks like this:
0 2300000 null
2300000 4300000 h
4300000 8000000 aa
8000000 11500000 t
11500000 28400001 null
What I would like as output would be this:
0 2500000 null
2500000 5000000 h
5000000 7500000 aa
7500000 10000000 aa
10000000 12500000 t
12500000 15000000 null
15000000 17500000 null
17500000 20000000 null
20000000 22500000 null
22500000 25000000 null
25000000 27500000 null
27500000 30000000 null
You can do it with awk alone:
awk -v s=2500000 '{
    f=int($1/s);
    l=int($2/s);
    if((l-f) > 0){
        for(i=f+1;i<=l;i++){
            a[i]=$3
        }
    }
}
END {
    e=int($2/s);
    for (i=0;i<=e;i++){
        if (i in a ){
            print i*s,(i+1)*s,a[i]
        }
        else{
            print i*s,(i+1)*s,"null"
        }
    }
}'
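To make the logic easier to follow: each input row is mapped to the fixed-size buckets whose lower boundary it crosses, and any bucket that never receives a label is printed as null. A rough Python rendering of the same idea follows; the function name and the tuple layout are my own, not part of the answer.

```python
def label_intervals(rows, step=2500000):
    """rows: (start, end, label) tuples, ordered by start."""
    labels = {}
    last_end = 0
    for start, end, label in rows:
        first, last = start // step, end // step
        # label every bucket whose lower boundary lies inside (start, end]
        for i in range(first + 1, last + 1):
            labels[i] = label
        last_end = end
    return [(i * step, (i + 1) * step, labels.get(i, "null"))
            for i in range(last_end // step + 1)]


data = [
    (0, 2300000, "null"),
    (2300000, 4300000, "h"),
    (4300000, 8000000, "aa"),
    (8000000, 11500000, "t"),
    (11500000, 28400001, "null"),
]
for row in label_intervals(data):
    print(*row)
```

Note that when two rows reach the same bucket, the later one overwrites the earlier one, which is the opposite of the "take the first one" preference mentioned in the question.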

Find specific keyword on column 1 and append new line on column 2 shell script

I have a text file that looks like the following:
empty 2
23 8
19 1
empty
11
I am trying to insert a blank entry in column 2 wherever column 1 contains the keyword "empty", shifting the remaining column 2 values down. Does anyone know how to do this? The following is the expected output:
empty
23 2
19 8
empty
11 1
Here is a script for GNU awk:
{ col1[ FNR ] = $1
  col2[ FNR ] = sprintf("%s %s",$2, $3)
}
END {
    k2 = 0;
    for( k1 = 1; k1 <= FNR; k1++) {
        if( col1[ k1 ] != "empty" ){
            k2++
            print col1[ k1], col2[ k2]
        }
        else print col1[ k1]
    }
}
It stores the values of column1 and (column 2 + column 3) in two different arrays. During the output ( in the END) it consumes a value from the second array only if the first column is not "empty".
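The same "two arrays, consume the second one only on non-empty rows" idea, sketched in Python for comparison. The function name is hypothetical, and I am assuming the input has no blank lines.

```python
def shift_second_column(lines):
    """Shift column 2 down past every row whose column 1 is 'empty'."""
    parts = [line.split() for line in lines]
    col1 = [p[0] for p in parts]
    col2 = [" ".join(p[1:]) for p in parts]   # "" when the row has no 2nd column
    queue = iter(col2)                        # values are handed out in order
    out = []
    for key in col1:
        if key == "empty":
            out.append(key)                   # consumes nothing from the queue
        else:
            out.append(f"{key} {next(queue)}".rstrip())
    return out


print("\n".join(shift_second_column(["empty 2", "23 8", "19 1", "empty", "11"])))
```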
awk to the rescue!
$ awk 'p{t=$2;$2=p;p=t} $1=="empty"{if($2!=""){p=$2;$2=""}}1' file
empty
23 2
19 8
empty
11 1

Resources