Add sequence lengths to headers in a fasta file

I have a multi-FASTA file and would like to add each sequence's length to its header while keeping the sequences.
>Seq1
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
KGTVGSILDRKDETKTQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQK
>Seq2
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTSQSKIAWISETLCIGCGI
CIKKCPFGALSIVNLPSNLEKETTHRYCANAFKLHRLPIPRPGEVLGLVGTNGIGKSTAL
KGTVGSILDRKDETKTQTVVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQKADIFMF
DEPSSYLDVKQRLKAAITIRSLINPDRYIIV
My desired output
>Seq1_174
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
KGTVGSILDRKDETKTQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQK
>Seq2_211
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTSQSKIAWISETLCIGCGI
CIKKCPFGALSIVNLPSNLEKETTHRYCANAFKLHRLPIPRPGEVLGLVGTNGIGKSTAL
KGTVGSILDRKDETKTQTVVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQKADIFMF
DEPSSYLDVKQRLKAAITIRSLINPDRYIIV
I tried to use this command
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta | paste - - | sed 's/\t/_/' | >seq_len.fasta
but it only outputs the lengths, not the sequences.
Can you help me fix that without using Biopython or SeqKit?

When the line doesn't begin with >, accumulate the sequence data in a variable and add its length to a total variable. When the line begins with >, print the sequence that you were accumulating, and save the current line as the name of the next sequence. Finally, at the end of the file print the last sequence.
awk '/^>/ { if (name) {printf("%s_%d\n%s", name, len, seq)} name=$0; seq=""; len = 0; next}
NF > 0 {seq = seq $0 "\n"; len += length()}
END { if (name) {printf("%s_%d\n%s", name, len, seq)} }' file.fasta > seq_len.fasta
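As an alternative to accumulating lines by hand, awk can be asked to split the input into one record per sequence. The following is only a minimal sketch, not the method above: it assumes that ">" appears only at the start of header lines and that the file uses Unix line endings. The helper name add_fasta_lengths is hypothetical.

```shell
# Hypothetical helper: reads FASTA on stdin (or from files given as arguments)
# and appends "_<length>" to every header line.
add_fasta_lengths() {
    awk 'BEGIN { RS = ">"; FS = "\n" }   # one record per sequence, one field per line
         $1 != "" {                      # skip the empty record before the first ">"
             len = 0
             for (i = 2; i <= NF; i++) len += length($i)
             printf ">%s_%d\n", $1, len  # header plus total sequence length
             for (i = 2; i <= NF; i++) if ($i != "") print $i
         }' "$@"
}

# Usage (file names as in the question):
# add_fasta_lengths file.fasta > seq_len.fasta
```

Because each record already holds a whole sequence, no END block is needed to flush the last one.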

Related

Average of first ten numbers of text file using bash

I have a file with two columns. The first column contains dates and the second contains a corresponding number. The two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?

Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;
i = NR % n;
ma += (x - q[i]) / n;
q[i] = x;
if(NR>=n)print ma;
}' file.txt
2
2
4
Or use the script below, which is useful for plotting because it keeps the reference axis (in your case the date) at the center of each averaging window.
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4

$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
This uses a small math trick: for each record, $2 is stored in a[NR%3], so the three array elements are overwritten cyclically. The sum of a[0], a[1] and a[2] is therefore always the sum of the last 3 numbers.
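The cyclic-buffer idea can be generalized to any window size by keeping a running sum instead of adding the elements up each time. This is a minimal sketch under the same input assumptions (comma-separated, value in column 2); the function name moving_average and its window-size argument are hypothetical.

```shell
# Hypothetical helper: $1 is the window size, input is read from stdin.
moving_average() {
    awk -F, -v n="$1" '
        # Replace the oldest value in the circular buffer and adjust the running sum.
        { sum += $2 - buf[NR % n]; buf[NR % n] = $2 }
        # Start printing once a full window has been read.
        NR >= n { print sum / n }
    '
}

printf 'date1,1\ndate2,1\ndate3,4\ndate4,1\ndate5,7\n' | moving_average 3
# prints 2, 2 and 4 on separate lines
```

Only n values are kept in memory, regardless of how long the file is.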

Updated based on helpful feedback from Ed Morton.
Here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
// { ... }: The empty regular expression // matches every line, so this block runs for each record.
Numbers[NR]=$2: Use the record number (NR) as the key and store the value from column 2.
if ( NR >= 3 ): Continue only once 3 or more values have been read from the file.
printf("%i\n", ...): Do the maths and print the result as an integer, so any fractional part is dropped.
BEGIN block: Change the field separator to a comma ",".

Remove the last-occurred lines of patterns

I want to exclude/delete the last line matching the pattern {n}{n}{n}.log for each possible 3-digit number. Each line ends with a pattern like "123.log".
Sample input file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aa112.log
aaa116.log
a113.log
aaaaa116.log
aaa113.log
aa114.log
Output file:
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log
How could this be performed by bash scripting?

It is fairly simple to remove the last matching line in awk without retaining order.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
t = $(NF - 1);
if (t in a)
print a[t];
a[t] = $0;
}'
Keeping the output in its original order is more complicated and requires more memory.
awk -F'[^0-9]+' '/[0-9]+\.log$/ {
t = $(NF - 1);
a[++i] = $0;
b[$0] = t;
c[t] = i;
}
END {
for (n = 1; n <= i; n++)
if (n != c[b[a[n]]])
print a[n];
}'
To pass through non-matching lines in the first example a next statement can be added to the action, and a pattern of 1 can be appended. For the second example assignment into array a can be moved to its own action.
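The pass-through variant of the ordered example might look like the following. This is a sketch of one way to apply that suggestion, not the answerer's exact code: the lookup array is keyed by line index rather than by line content, and the function name drop_last_per_suffix is hypothetical.

```shell
# Hypothetical helper: drops the last line for each numeric ".log" suffix,
# keeps the original order, and passes non-matching lines through unchanged.
drop_last_per_suffix() {
    awk -F'[^0-9]+' '
        { a[++i] = $0 }                 # remember every line, matching or not
        /[0-9]+\.log$/ {
            t = $(NF - 1)               # the digits just before ".log"
            b[i] = t                    # suffix of line i
            c[t] = i                    # index of the last line seen with suffix t
        }
        END {
            for (n = 1; n <= i; n++)
                if (!(n in b) || n != c[b[n]])
                    print a[n]
        }
    ' "$@"
}

# Usage: drop_last_per_suffix input.txt > output.txt
```

As with the original ordered example, the whole input is held in memory until the END block runs.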

Probably awk would be the easiest tool for this. For example, this one-liner (the three-argument form of match() requires GNU awk)
tac file | awk 'match($0, /[0-9]{3}\.log/,a) && a[0] in b; {b[a[0]]}' | tac
produces the requested output for the sample input. The awk step does not need to hold the entire file in memory; it keeps only one array entry per distinct suffix.
Change the regular expression to suit your specific needs.

$ awk '{k=substr($0,length()-6)} NR==FNR{n[k]=NR;next} FNR!=n[k]' file file
aaaa116.log
a112.log
aaa112.log
a113.log
aaaaa112.log
aaa113.log
aaa116.log
a113.log

count the max number of _ and add additional ; if missing

I have a file with several fields like below
deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am
Aim is to have that
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
I want to find the maximum number of underscores on any one line, change every underscore to a semicolon, and append extra semicolons to each line that has fewer than that maximum.
I thought about using awk for that, but with the command line below I will only change everything after the first field. My aim is also to add the additional semicolons.
awk 'BEGIN{FS=OFS=";"} {for (i=1;i<=NF;i++) gsub(/_/,";", $i) } 1' file
Note: As awk processes input line by line, I'm not sure this can be done, but I'm asking just in case. If it cannot be done, please let me know and I'll try to find another way.
Thanks.

Here's a two-pass solution. Note you need to put the data file twice on the command line when running awk:
$ cat mu.awk
BEGIN { FS="_"; OFS=";" }
NR == FNR { if (max < NF) max = NF; next }
{ $1=$1; i = max; j = NF; while (i-- > j) $0 = $0 OFS }1
$ awk -f mu.awk mu.txt mu.txt
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
The BEGIN block sets the input and output field separators.
The NR == FNR block makes the first pass through the file, setting the max number of fields.
The last block makes the second pass through the file. First it reconstitutes the line to use the output field separator and then adds an extra ; for however many fields the line is short of the max.
EDIT
This version answers the updated question to only affect fields after field 7:
$ cat mu2.awk
BEGIN { OFS=FS=";" }
# First pass, find the max number of "_"
NR == FNR { gsub("[^_]",""); if (max < length()) max = length(); next }
# Second pass:
{
# count number of "_" less than the max
line = $0
gsub("[^_]","", line)
n = max - length(line)
# replace "_" with ";" after field 7
for (i=8; i<=NF; ++i) gsub("_", ";", $i);
# add an extra ";" for each "_" less than max
while (n-- > 0) $0 = $0 ";"
}1
$ awk -f mu2.awk mu2.txt mu2.txt
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am

This should do:
awk -F_ '{for (i=1;i<=NF;i++) a[NR FS i]=$i;c=NF>c?NF:c} END {for (j=1;j<=NR;j++) {for (i=1;i<c;i++) printf "%s;",a[j FS i];print a[j FS c]}}' file
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
How it works:
awk -F_ ' # Set field separator to "_"
{for (i=1;i<=NF;i++) # Loop through the fields one by one
a[NR FS i]=$i # Store the field in array "a", using both the row (NR) and the column position (i) as the key
c=NF>c?NF:c} # Find the largest number of fields and store it in "c"
END { # When the whole file has been read
for (j=1;j<=NR;j++) { # Loop through all rows
for (i=1;i<c;i++) # Loop through all columns except the last
printf "%s;",a[j FS i] # Print each field followed by ";"
print a[j FS c] # Print the last field of the row
}
}
' file # read the file

How to grep number of unique occurrences

I understand that grep -c string can be used to count the occurrences of a given string. What I would like to do is count the number of unique occurrences when only part of the string is known or remains constant.
For Example, if I had a file (in this case a log) with several lines containing a constant string and a repeating variable like so:
string=value1
string=value1
string=value1
string=value2
string=value3
string=value2
Then I would like to be able to identify the number of each unique set with an output similar to the following (ideally with a single grep/awk command):
value1 = 3 occurrences
value2 = 2 occurrences
value3 = 1 occurrences
Does anyone have a solution using grep or awk that might work? Thanks in advance!

This worked perfectly... Thanks to everyone for your comments!
grep -oP "wwn=[^,]*" path/to/file | sort | uniq -c

In general, if you want to grep and also keep track of results, it is best to use awk since it performs such things in a clear manner with a very simple syntax.
So for your given file I would use:
$ awk -F= '/string=/ {count[$2]++} END {for (i in count) print i, count[i]}' file
value1 3
value2 2
value3 1
What is this doing?
-F=
sets the field separator to =, so that the text before it becomes $1 and the text after it becomes $2.
/string=/ {count[$2]++}
when a line contains "string=", increment the counter for its second field. The array count[] keeps track of how many times each value has appeared so far.
END {for (i in count) print i, count[i]}
at the end, loop through the results and print them. The order of a for (i in count) loop is unspecified, so pipe the output through sort if the order matters.
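To get the exact "value = N occurrences" layout from the question, the print can be swapped for printf. This is a small sketch built on the command above; the wrapper name count_values is hypothetical, and the trailing sort is only there to make the order predictable.

```shell
# Hypothetical helper: counts each distinct value after "string=".
# Reads the files given as arguments, or stdin if none are given.
count_values() {
    awk -F= '
        /string=/ { count[$2]++ }
        END {
            for (v in count)
                printf "%s = %d occurrences\n", v, count[v]
        }
    ' "$@" | sort
}

# Usage: count_values path/to/file
```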

Here's an awk script:
#!/usr/bin/awk -f
BEGIN {
file = ARGV[1]
while ((getline line < file) > 0) {
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
if (line ~ p) {
a[p] += !a[p, line]++
}
}
}
for (i = 2; i < ARGC; ++i) {
p = ARGV[i]
printf("%s = %d occurrences\n", p, a[p])
}
exit
}
Example:
awk -f script.awk somefile ab sh
Output:
ab = 7 occurrences
sh = 2 occurrences

Add leading zeroes to awk variable

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files with the name $pdb.pdb and splits them in files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.

Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including zeros).

In the filename assignment, replace file with sprintf("%02d", file).
Or even replace the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file);.
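Applied to the command from the question, the sprintf suggestion might look like this. It is a sketch only: the wrapper name split_models is hypothetical, and the close() call is an addition not present in the question, made on the assumption that files with many models could otherwise run into the open-file limit.

```shell
# Hypothetical helper: $1 is the base name, e.g. "1abc" for 1abc.pdb.
# Writes 1abc_01.pdb, 1abc_02.pdb, ... into the current directory.
split_models() {
    awk -v pdb="$1" '
        BEGIN { file = 1; filename = sprintf("%s_%02d.pdb", pdb, file) }
        /ENDMDL/ {
            getline                      # skip the ENDMDL line itself
            close(filename)              # added: release the finished output file
            file++
            filename = sprintf("%s_%02d.pdb", pdb, file)
        }
        { print $0 > filename }
    ' < "$1.pdb"
}

# Usage inside the loop from the question:
# split_models "$pdb"
```

With more than 99 models the numbers simply grow to three digits; use %03d if a fixed width is needed in that case.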

This does it without resorting to printf, which can be expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.
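The same printf-free padding can be wrapped in a reusable awk function. This is a sketch of a loop-based variant, not the commands above: it prepends zeros until the requested width is reached, so unlike the substr versions it never truncates a value that is already longer than the width. The names zero_pad and pad are hypothetical.

```shell
# Hypothetical helper: $1 is the value to pad, $2 is the total width.
zero_pad() {
    awk -v value="$1" -v width="$2" '
        function pad(s, n) {
            while (length(s) < n) s = "0" s   # prepend zeros until wide enough
            return s
        }
        BEGIN { print pad(value, width) }
    '
}

zero_pad 722 8   # prints 00000722
```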

Here is a function that left or right-pads values with zeroes depending on the parameters: zeropad(value, count, direction)
function zeropad(s,c,d) {
if(d!="r")
d="l" # l is the default and fallback value
return sprintf("%" (d=="l"? "0" c:"") "d" (d=="r"?"%0" c-length(s) "d":""), s,"")
}
{ # test main
print zeropad($1,$2,$3)
}
Some test input:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The result:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
