How to extract FASTA sequence using sequence ID (shell script) - shell

I have the following sequences which is in a fasta format with sequence header and its nucleotides.
How can I compare two files(Kcompare.pep and clade1i.txt) and extract the sequences with the same sequence header?
Can anyone help me?
Kcompare.pep
>ztr:MYCGRDRAFT_45998
MAAPLHAEGPIRTPYTGVELLNTPYLNKGTAFPADERRVLGLTALLPTSVHTLDQQLQRA
WHQYQSRDNDLARNTFLTSLKEQNEVLYYRLVLDHLSEVFSIIYTPTEGEAIQRYSSLFR
>kal:KALB_5042
MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
RFRAVFLTHTDPERAFQVQRAVAKAGGPLVITDQDTTAISLTASTLTTLARRGRSPSDSR
clade1i.txt
cpo:COPRO5265_0583
ble:BleG1_3845
kal:KALB_5042
expected output
>kal:KALB_5042
MTAEVAVVSDGSAIPGASPPATLPLLQDYAELVREHAGLSAVPLAVDSARLAAELCALPK
RFRAVFLTHTDPERAFQVQRAVAKAGGPLVITDQDTTAISLTASTLTTLARRGRSPSDSR
I tried to run this but no error or result appeared.
for i in K*
do
echo $i
awk -F ' ' '{print $1}' $i/$i.pep > Kcompare.pep
mv Kcompare.pep $i
awk -F '_' '{print $2":"$3"_"$4}' $i/firstClade.txt > $i/clade1i.txt
awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' $i/Kcompare.pep | awk -F"\t" 'BEGIN{while((getline k <"$i/clade1i.txt")>0)i[k]=1}{gsub("^>","",$0);if(i[$1]){print ">"$1"\n"$2}}' > $i/firsti.pep
done

Using awk:
awk 'NR==FNR{a[">"$0];next}/^>/{f=0;}($0 in a)||f{print;f=1}' clade1i.txt Kcompare.pep
Read the clade1i.txt file and store in an array as keys.
Read the Kcompare.pep. For every line beginning with '>', set a flag, and keep printing the lines till the next line beginning with '>' is encountered.

Use this:
while read l; do
sed -n '/^>'"$l"'/,/^>|$/p' Kcompare.pep
done <clade1i.txt
The while loop loops trough the clade1i.txt file line by line.
sed -n suppresses auto print.
/regex/,/regex/ matches all from the first regex to the second.
p prints matched lines.

Related

Writing the output of a command to specific columns of a csv file, unix

I wanted to write the output of command to specific columns (3rd and 5th) of the csv file.
#!/bin/bash
echo -e "Value,1\nCount,1" >> file.csv
echo "Header1,Header2,Path,Header4,Value,Header6" >> file.csv
sed 'y/ /,/' input.csv >> file.csv
input.csv in the above snippet will look something like this
1234567890 /training/folder
0325435287 /training/newfolder
Current output of file.csv
Value,1
Count,1
Header1,Header2,Path,Header4,Value,Header6
1234567890,/training/folder
0325435287,/training/newfolder
Expected Output of file.csv
Value,1
Count,1
Header1,Header2,Path,Header4,Value,Header6
,,/training/folder,,1234567890,
,,/training/newfolder,,0325435287,
All the operations can be done in a single awk:
awk -v OFS=, -v pre="Value,1\nCount,1" -v hdr="Header1,Header2,Path,Header4,Value,Header6" '
BEGIN {print pre; print hdr}
{print "", "", $1, "", $2, ""}
' input.csv
Value,1
Count,1
Header1,Header2,Path,Header4,Value,Header6
,,i1234567890,,/training/folder,
,,0325435287,,/training/newfolder,
With sed you could try following code. Which is using sed's capability of back reference.
sed -E 's/(^[^ ]*) +(.*$)/,,\2,,\1,/' Input_file
Explanation: Using -E option of sed to enable ERE(extended regular expressions) first. Then in main program using s option to perform substitution operation. In 1st part of substitution creating 2 back references(capability to catch values by using regex and keep them in temp buffer memory to be used later on while substituting it with in 2nd part of substitution). In 2nd part of substitution substituting whole line with 2 commas followed by 2nd capturing group\2 followed by 2 commas followed by 1st capturing group \1 following by ,.
You can use awk instead of sed
cat input.csv | awk '{print ",," $1 "," $2 ","}' >> file.csv
awk can process a stdin input by line to line. It implements a print function and each word is processed as a argument (in your case, $1 and $2). In the above example, I added ,, and , as an inline argument.
You can trivially add empty columns as part of your sed script.
sed 'y/ /,/;s/,/,,/;s/^/,,/;s/$/,/' input.csv >> file.csv
This replaces the first comma with two, then adds two up front and one at the end.
Your expected output does not look like valid CSV, though. This is also brittle in that it will fail for any file names which contain a space or a comma.

How can we use '~|~' delimiter to split the records using scripting command?

Please suggest how can I split the columns separated with ~|~ delimiter.(file: abc.dat)
a~|~1~|~x
b~|~1~|~y
c~|~2~|~z
I am trying below awk command but getting output 0 count.
awk -F'~|~' '$2 == 1' ${file} | wc -l
With your shown samples, please try following. We need not to use wc command along with awk, it could be done within awk itself.
awk -F'~\\|~' '$2 == 1{count++} END{print count}' "$file"
Explanation: Setting field separator as ~|~(escaped | here). Then checking if 2nd field is 1, increment variable count with 1 then. In END block of this program print its value.
For saving values into shell variable use like:
var=$(awk -F'~\\|~' '$2 == 1{count++} END{print count}' "$file")
You can also use ~[|]~ as FS value, as the pipe char used inside a bracket expression always matches itself, a pipe char:
counter=$(awk 'BEGIN{FS="~[|]~"} $2==1{cnt++} END{print cnt}' file)
See the online awk demo:
s='a~|~1~|~x
b~|~1~|~y
c~|~2~|~z'
counter=$(awk 'BEGIN{FS="~[|]~"} $2==1{cnt++} END{print cnt}' <<< "$s")
echo $counter
# => 2

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, on their second column, contain a word from a comma separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, neither will "abcdef". It seems so simple yet after trying for manymany hours, I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here:
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234

Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I've a "map" file which contains the records to be searched and replaced,one per line,separated by a space, and a "input" file where I need the changes to be done. The examples files and the script I wrote are beneath.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
itsa(old_0)single(old_2)string(old_1)with(old_5)ocurrences(old_4)ofthe(old_3)records
Script
#!/bin/bash
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replaces are made correctly, but I can't find a way to save the output of each iteration and work over it in the next. So, my output looks like this:
itsa(new_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(new_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(new_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(new_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(new_4)ofthe(old_3)records
Yet, the desired output would look like this:
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
May anyone bring some light in this darkly waters??? Thanks in advance!
Improving the existing script
Improvements:
Use "$()" instead of ``. It supports whitespace and is easier to read.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text contains the complete input file and is modified in each iteration. The output of this script is the file after all replacements were done.
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
...
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
...
It is possible in GNU Awk as follows,
awk 'FNR==NR{hash[$2]=$1; next} \
{for (i=1; i<=NF; i++)\
{for(key in hash) \
{if (match ($i,key)) {$i=sprintf("(%s)",hash[key];break;)}}}print}' \
map-file FS='[()]' OFS= input-file
produces an output as,
itsa(new_0)single(new_2)string(new_1)withold_5ocurrences(new_4)ofthe(new_3)records
Another in Gnu awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
n=split($0,b,"[()]")
for(i=1;i<=n;i++)
printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
First you read in the map to a hash. When processing the file, split all records by ( and ). Every other could be in the map (i%2==0). While printfing test with ternary operator if matches are found from a and when there is a match, output it parenthesized.

Appending text from first line to last of a file through loop

I want to append a text to the end of the each line of a existing file. This text has to be appended from the first line of the file to last line.
The thing is, I am passing the file contents as an input to the loop and output of that has to be appended onto the same file. I could not figure out a logic for it.
FileName: sample
cat sample
Alex, Johston
Samuel, John
Vebron, Justus
Above are the content of the file. Now, in this, I want to append first column of values ie) Alex, Samuel and Verbron to the end of the file with comma.
My intented output:
Alex, Johston,Alex
Samuel, John,Samuel
Vebron, Justus,Vebron
My script I wrote to take the first column values:
while
read LINE
do
fcol=$(echo $LINE|awk -F, '{ print $1 }')
done < sample
Running through the above loop, variable fcol will store the values - Alex, Samuel, Vebron. I need to append these values into end of each line
Can some one guide me on this and so that I can alter the above code to have the intented output as explained above.
Thanks!
awk -F, '{print $0", "$1}' sample
The loop is not required as the awk takes each line from the input file and process it based on the command provided.
Here the command is print $0", "$1 which appends the $0 and $1 with a "," between them
you can use awk to do this :
cat sample |awk -F "," '{print $0 ", "$1}'
You can't do this without an intermediate file of some sort (without storing the files contents in memory before processing them). That said things like sed -i will hide that detail from you.
awk -F, -v OFS=, '{print $0,$1}' sample > sample.new
sed -i.orig -e 's/^\(.\)\(.*\)/&\1/' sample
Initialize line_no = 0 before looping through the file.
Increment line_no by 1 before processing each line:
Following sed command will append the $fcol at the end of the $line_no
sed -i "$line_no s/$/ ,$fcol/" sample
Final Script:
line_no=0
while read LINE
do
line_no=$((line_no+1))
fcol=$(echo $LINE|awk -F, '{ print $1 }')
sed -i "$line_no s/$/ ,$fcol/" sample
done < sample
My solution is
sed -r 's/([^,]*).*/\0, \1/'

Resources