GenBank file manipulation with bash

I have this GenBank file and I need your help in manipulating it. I am picking a random part of the file:
CDS complement(1750..1956)
/gene="MAMA_L4"
/note="similar to MIMI_L9"
/codon_start=1
/product="hypothetical protein"
/protein_id="AEQ60146.1"
/translation="MHFLDDDNDESNNCFDDKEKARDKIIIDMLNLIIGKKKTSYKCL
DYILSEQEYKFAILSIVENSIFLF"
misc_feature complement(2020..2235)
/note="MAMA_L5; similar to replication origin binding
protein (fragment)"
gene complement(2461..2718)
/gene="MAMA_L6"
CDS complement(2461..2718)
/gene="MAMA_L6"
/codon_start=1
/product="T5orf172 domain-containing protein"
/protein_id="AEQ60147.1"
/translation="MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
IITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID"
My target is to extract the info of /translation= and /product= like the following:
T5orf172 domain-containing protein
MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDVIITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID
*The highlighted issue: the /translation= value is wrapped across two lines in the file.
I am trying to write a bash script so I was thinking to apply something like:
grep -w /product= genebank.file | cut -d= -f2 | sed 's/"//g' > File1
grep -w /translation= genebank.file | cut -d= -f2 | sed 's/"//g' > File2
paste File1 File2
The problem is that for the translation entries grep only gets the first line, so the paste output is truncated, like
T5orf172 domain-containing protein MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
Can anybody help me get past this issue? Thank you in advance!

With GNU sed:
sed -En '/^\s*\/(product|translation)="/{
s///
:a
/"$/! { N; s/\n\s*//; ba; }
s/"$//p
}' file |
sed 'N; s/\n/\t/'
Note: this assumes the closing " of each value is immediately followed by a newline in the input file.
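For comparison, a commented awk sketch of the same extraction (untested on a full GenBank file; it assumes /product= always precedes /translation= within a record, that the /product= value fits on one line, and that the values contain no embedded quotes). It prints the product and the joined translation on separate lines, as in the target output above:
awk '
  /\/product="/ {
    prod = $0
    sub(/.*\/product="/, "", prod)        # drop everything up to the opening quote
    sub(/"$/, "", prod)                   # drop the closing quote
  }
  /\/translation="/ { tr = ""; grab = 1 }
  grab {
    line = $0
    sub(/.*\/translation="/, "", line)    # first line: strip the qualifier key
    gsub(/[[:space:]]/, "", line)         # continuation lines: strip indentation
    tr = tr line
    if (tr ~ /"$/) {                      # closing quote ends the value
      sub(/"$/, "", tr)
      print prod
      print tr
      grab = 0
    }
  }
' genebank.file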

I haven't fully tested this, but if you add -A1 to your grep command you'll get one line after the match.
grep -w /product= genebank.file | cut -d= -f2 | sed 's/"//g' > File1
grep -A1 -w /translation= genebank.file |cut -d= -f2| sed 's/^ *//g' > File2
paste File1 File2
You would need to delete that extra newline but that should get you close.
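If every translation spans exactly two lines, as in the excerpt, the pairs of lines that grep -A1 emits can be glued back together with paste (a sketch, not tested on a full file; the grep -v drops the -- group separators that -A1 inserts between matches):
grep -A1 -w /translation= genebank.file | grep -v '^--$' |
cut -d= -f2 | sed 's/"//g; s/^ *//' |
paste -d '\0' - - > File2       # '\0' means "no delimiter": join each pair of lines
Translations spanning three or more lines would still come out broken, though, which is why the sed loop in the answer above is the more robust route.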

Related

Combine multiple text files (row-wise) into columns

I have multiple text files that I want to merge columnwise.
For example:
File 1
0.698501 -0.0747351 0.122993 -2.13516
File 2
-5.27203 -3.5916 -0.871368 1.53945
I want the output file to be like:
0.698501, -5.27203
-0.0747351, -3.5916
0.122993, -0.871368
-2.13516, 1.53945
Is there a one-line bash command that can accomplish this?
I'd appreciate any help.
---Lyndz
With awk:
awk '{if(NR==1) {n=split($0,a1," ")} else {split($0,a2," ")}} END{for(i=1;i<=n;i++) print a1[i] ", " a2[i]}' file1 file2
(a counted loop keeps the fields in their original order; for (i in array) visits them in an unspecified order)
Output:
0.698501, -5.27203
-0.0747351, -3.5916
0.122993, -0.871368
-2.13516, 1.53945
paste <(cat file1 | sed -E 's/ +/&,\n/g') <(cat file2 | sed -E 's/ +/&\n/g') | column -s $',' -t | sed -E 's/\s+/, /g' | sed -E 's/, $//g'
It got a bit complicated, but I guess it can be done in a simpler way too.
P.S.: Please look up the man pages of each command to see what they do.
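For the single-line files shown in the question, the same result can be had more simply by unfolding each file into one value per line with tr and gluing the two streams with paste (a sketch assuming the fields are separated by plain spaces):
paste -d, <(tr -s ' ' '\n' < file1) <(tr -s ' ' '\n' < file2) | sed 's/,/, /'
The sed at the end only inserts the space after the comma that the desired output shows.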

Linux get data from each line of file

I have a file with many (~2k) lines similar to:
117 VALID|AUTHEN tcp:10.92.163.5:64127 uniqueID=nwCelerra
....
991 VALID|AUTHEN tcp:10.19.16.21:58332 uniqueID=smUNIX
I want only the IP address (10.19.16.21 shown above) and the value of the uniqueID (smUNIX shown above)
I am able to get close with:
cat t.txt | cut -f2- -d':'
10.22.36.69:46474 uniqueID=smwUNIX
...
I am on Linux using bash.
Using awk:
awk '{split($3,a,":"); split($4,b,"="); print a[2] " " b[2]}' t.txt
By default it splits on whitespace; with some extra code you can split the subfields.
Update:
Even easier, overriding the default delimiter:
awk -F '[:=]' '{print $2 " " $4}' t.txt
Using grep and sed:
grep -oP "^\d+ [A-Z]+\|[A-Z]+ \w+:\K(.*)" t.txt | sed "s/ uniqueID=/ /g"
outputs:
10.92.163.5:64127 nwCelerra
10.19.16.21:58332 smUNIX
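The grep/sed variant keeps the port number, while the question asks for the IP address alone; one more substitution in the sed drops the :port part (same sample data, so the output follows directly from it):
grep -oP "^\d+ [A-Z]+\|[A-Z]+ \w+:\K(.*)" t.txt | sed -E 's/:[0-9]+ uniqueID=/ /'
outputs:
10.92.163.5 nwCelerra
10.19.16.21 smUNIX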

How to remove symbols and add file name to fasta headers

I have several fasta files with the following headers:
M01498:408:000000000-BLBYD:1:1101:11790:1823 1:N:0:1
I want to remove all symbols (colon, dash, and space), and add "barcodelabel=FILENAME;"
I can do it for one file using:
cat A1.fasta | sed s/-//g | sed s/://g | sed 's/ //g' | sed 's/^>/>barcodelabel=A1;/g' > A1.renamed.fasta
How can I do this but for all of my files at once? I tried the code below but it didn't work:
for i in {A..H}{1..6}; do cat ${i}.fasta |sed s/-//g | sed s/://g| sed s/\ //g | sed 's/^>/>barcodelabel=${i};/g' >${i}.named.fasta; done
Any help would be appreciated!
Considering that you want to substitute -, : or space with nothing and to add a string at the end of the first line, the following may help:
awk 'FNR==1{gsub(/:|-| +/,""); print $0, "barcodelabel=" FILENAME ";"; next} 1' Input_file
In case you want to save the output into the same Input_file, append the following to the above command: > temp_file && mv temp_file Input_file
I figured it out. First, I reduced the number of sed calls to simplify the code. The mistake was in the final sed: I had single quotation marks, and they should have been double quotes so that ${i} is expanded. The final code is:
for i in {A..H}{1..6}; do cat ${i}.fasta |
sed 's/[-: ]//g' |
sed "s/^>/>barcodelabel=${i};/g" > ${i}.final4.fasta; done

Bash: concatenate lines in CSV file (1+2, 3+4 etc.)

I have a CSV file with increasing integers in the first column and some text after them.
1,text1a,text1b
2,text2a,text2b
3,text3a,text3b
4,text4a,text4b
...
I would like to join lines 1+2, 3+4, etc., and write the outcome to a new CSV file.
The desired output would be
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
...
A second option without the numbers would be great as well. The actual input would be
1,text,text,,,text#text.com,2,text.text,text
2,text,text,,,text#text.com,3,text.text,text
3,text,text,,,text#text.com,2,text.text,text
4,text,text,,,text#text.com,3,text.text,text
Desired outcome
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
$ pr -2ats, file
gives you (-2 prints two columns, -a fills them across, -t omits headers, -s, separates the columns with a comma):
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
UPDATE
for the second part
$ cut -d, -f2- file | pr -2ats,
will give you
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
text,text,,,text#text.com,2,text.text,text,text,text,,,text#text.com,3,text.text,text
awk solution:
awk '{ printf "%s%s",$0,(!(NR%2)? ORS:",") }' input.csv > output.csv
The output.csv content:
1,text1a,text1b,2,text2a,text2b
3,text3a,text3b,4,text4a,text4b
----------
Additional approach (to skip numbers):
awk -F',' '{ printf "%s%s",$2 FS $3,(!(NR%2)? ORS:FS) }' input.csv > output.csv
The output.csv content:
text1a,text1b,text2a,text2b
text3a,text3b,text4a,text4b
3rd approach (for your extended input):
awk -F',' '{ sub(/^[0-9]+,/,"",$0); printf "%s%s",$0,(!(NR%2)? ORS:FS) }' input.csv > output.csv
With bash, cut, GNU sed (for the 2~2 step addresses) and paste:
paste -d, <(cut -d, -f 2- file | sed '2~2d') <(cut -d, -f 2- file | sed '1~2d')
Output:
text1a,text1b,text2a,text2b
text3a,text3b,text4a,text4b
I hoped to get started with something as simple as
printf '%s,%s\n' $(<inputfile)
This turns out wrong when you have spaces inside your text fields.
The improvement is rather a mess:
source <(echo "printf '%s,%s\n' $(sed 's/.*/"&"/' inputfile|tr '\n' ' ')")
Skipping the first field can be done in the same sed command:
source <(echo "printf '%s,%s\n' $(sed -r 's/([^,]*),(.*)/"\2"/' inputfile|tr '\n' ' ')")
EDIT:
This solution will fail when the input contains special characters, so you should use a simple solution such as
cut -d, -f2- file | paste -d, - -

Find multiple missing lines in CSV using diff

One part of my problem was solved with this answer: Threadlink, but an important part of my problem was unsolved!
After using
diff a.csv b.csv | grep -E -A1 '^[0-9]+d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
several times, I found something was still missing.
Sometimes multiple consecutive lines are deleted.
If only one line was deleted, the diff output contains something like this:
3663d3661
For multiple lines it is:
3724,3725d3718
So I changed the diff call to:
diff a.csv b.csv | grep -E -A1 '^[0-9]+\,*[0-9]*d[0-9]+$' | grep -v '^--$' | sed -n '0~2 p' | sed -re 's,^< (.*)$,\1,g'
This works only for the first of multiple deleted lines.
My question is:
How can I get all deleted lines (maybe 5 consecutive lines) in such a case?
What do I have to change in the diff call?
diff a.csv b.csv | sed -n '/^[0-9]\+\(,[0-9]\+\)\?d[0-9]\+/,/^[0-9]\+[^d]*$/{/^[0-9]\+/d;s/^< //;p}'
will do that.
/^[0-9]\+\(,[0-9]\+\)\?d[0-9]\+/,/^[0-9]\+[^d]*$/
selects each range from a deletion marker (single-line like 3663d3661 or multi-line like 3724,3725d3718) up to the next non-deletion marker
/^[0-9]\+/d
deletes the marker lines themselves
s/^< //
strips the "< " at the beginning of each deleted line,
and p prints the line.
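For comparison, the same filtering can be written in awk, which makes the "only lines following a d marker" logic explicit (a sketch under the same assumptions about diff's normal output format):
diff a.csv b.csv | awk '
  /^[0-9]/ { del = /d/ }                 # marker line: remember whether it is a deletion
  del && /^</ { print substr($0, 3) }    # print deleted lines without the "< " prefix
'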
