Sequence length of FASTA file - bash

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to get the total number of sequences and the total length. Any suggestion is welcome, in bash or awk, please. I know that it is easy to do in Perl/BioPerl, and I already have a script that does it that way.

An awk / gawk solution can be composed of three stages:
Every time a header is found, these actions should be performed:
Print the previous seqlen if it exists.
Print the tag.
Initialize seqlen.
For the sequence lines, we just need to accumulate totals.
Finally, at the END stage, we print the remaining seqlen.
Commented code:
awk '/^>/ { # header pattern detected
if (seqlen){
# print previous seqlen if exists
print seqlen
}
# print the tag
print
# initialize sequence
seqlen = 0
# skip further processing
next
}
# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa
A oneliner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ { if (seqlen) {
print seqlen
}
print
seqtotal+=seqlen
seqlen=0
seq+=1
next
}
{
seqlen += length($0)
}
END{print seqlen
print seq" sequences, total length " seqtotal+seqlen
}' file.fa
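The totals version above can be adapted to match the desired output exactly, including the leading "# " on the summary line. The following is a minimal sketch, not the answer's original code: fasta_lengths is a hypothetical helper name, and the sample file is the one from the question written to a temporary file. It flushes on a sequence counter rather than on a non-zero seqlen, so an empty sequence would be reported as 0 instead of being skipped.

```shell
# fasta_lengths is a hypothetical helper name, not part of any tool.
fasta_lengths() {
  awk '
    /^>/ {                       # header line
      if (n) print seqlen        # flush the previous record, if any
      print                      # print the header itself
      n++                        # count sequences
      seqlen = 0
      next
    }
    {                            # sequence line
      seqlen += length($0)
      total  += length($0)
    }
    END {
      if (n) print seqlen        # flush the last record
      printf "# %d sequences, total length %d.\n", n, total
    }
  ' "$1"
}

# Sample input from the question, written to a temporary file.
fa=$(mktemp)
printf '%s\n' '>header1' \
  'CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT' \
  'TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC' \
  '>header2' 'GGT' '>header3' 'TTATGAT' > "$fa"

fasta_lengths "$fa"
```

Keeping a running total in the sequence rule avoids having to add the last seqlen separately in END.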

A quick way with any awk would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
if (seqlen) {
# Only print info for this sequence if no target was given
# or its id matches the target.
if (! target || id == target) {
print id, seqlen;
}
}
# Get sequence id:
# 1. Split header on whitespace (fields[1] is now ">id")
split($0, fields);
# 2. Get portion of first field after the starting ">"
id = substr(fields[1], 2);
seqlen = 0;
next;
}
{
seqlen = seqlen + length($0);
}
END {
if (! target || id == target) {
print id, seqlen;
}
}

"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa

Related

Editing text in Bash

I am trying to edit text in Bash. I got to a point where I am no longer able to continue and I need help.
The text I need to edit:
Symbol Name Sector Market Cap, $K Last Links
AAPL
Apple Inc
Computers and Technology
2,006,722,560
118.03
AMGN
Amgen Inc
Medical
132,594,808
227.76
AXP
American Express Company
Finance
91,986,280
114.24
BA
Boeing Company
Aerospace
114,768,960
203.30
The text I need:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
I already tried:
sed 's/$/,/' BIPSukol.txt > BIPSukol1.txt | awk 'NR==1{print}' BIPSukol1.txt | awk '(NR-1)%5{printf "%s ", $0;next;}1' BIPSukol1.txt | sed 's/.$//'
But it doesn't quite do the job.
(BIPSukol1.txt is the name of the file I am editing)
The biggest problem you have is you do not have consistent delimiters between your fields. Some have commas, some don't and some are just a combination of 3-fields that happen to run together.
The tool you want is awk. It will allow you to treat the first line differently and then condition the output that follows with convenient counters you keep within the script. In awk you write rules (what comes between the outer {...}) and then awk applies your rules in the order they are written. This allows you to "fix up" your haphazard format and arrive at the desired output.
The first rule, FNR==1, is applied to the 1st line. It loops over the fields, finds the problematic "Market Cap, $K" field, treats it as a single field and skips beyond it to output the remaining headings. It stores n = NF - 3 (the header splits into 8 whitespace-separated fields and you have 5 lines of data for each Symbol), resets count to 0, and skips to the next record.
When count==n the next rule is triggered, which just outputs the records stored in the a[] array, zeros count and deletes the a[] array for refilling.
The next rule is applied to every record (line) of input from the 2nd on. It simply removes any surrounding whitespace from the record by forcing awk to recalculate the fields with $1 = $1 and then stores the record in the array, incrementing count.
The last rule, END, is a special rule that runs after all records are processed (it lets you sum final tallies or output final lines of data). Here it is used to output the records that remain in a[] when the end of the file is reached.
Putting it all together in another cut at awk:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
for (i=1;i<=n;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
delete a
count = 0
}
{
$1 = $1
a[++count] = $0
}
END {
for (i=1;i<=count;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
}
' file
Example Use/Output
Note: you can simply select-copy the script above and then middle-mouse-paste it into an xterm with the directory set so it contains file (you will need to rename file to whatever your input filename is)
$ awk '
> FNR==1 {
> for (i=1;i<=NF;i++)
> if ($i == "Market") {
> printf ",Market Cap $K"
> i = i + 2
> }
> else
> printf (i>1?",%s":"%s"), $i
> print ""
> n = NF-3
> count = 0
> next
> }
> count==n {
> for (i=1;i<=n;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> delete a
> count = 0
> }
> {
> $1 = $1
> a[++count] = $0
> }
> END {
> for (i=1;i<=count;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> }
> ' file
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
(note: it is unclear why you want the "Links" heading included since there is no information for that field -- but that is how your desired output is specified)
More Efficient No Array
You always have afterthoughts that creep in after you post an answer, no different than remembering a better way to answer a question as you are walking out of an exam, or thinking about the one additional question you wished you would have asked after you excuse a witness or rest your case at trial. (there was some song that captured it -- a little bit ironic :)
The following does essentially the same thing, but without using arrays. Instead of buffering the information in an array and outputting it all at once, it simply outputs each piece right after formatting it. It was one of those afterthoughts:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
print ""
count = 0
}
{
$1 = $1
printf (++count>1?",%s":"%s"), $0
}
END { print "" }
' file
(same output)
With your shown samples, could you please try the following (written and tested in GNU awk). Judging by the OP's attempts, I am assuming that after the header of Input_file you want to join every 5 lines into a single line.
awk '
BEGIN{
OFS=","
}
FNR==1{
NF--
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
OR, if your awk doesn't support NF--, then try the following.
awk '
BEGIN{
OFS=","
}
FNR==1{
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +Links( +)?$/,"",lastPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
NOTE: Your header/first line needs special handling because we can't simply substitute , for every space, so this solution takes care of it as per the shown samples.
With GNU awk. If your first line is always the same.
echo 'Symbol,Name,Sector,Market Cap $K,Last,Links'
awk 'NR>1 && NF=5' RS='\n ' ORS='\n' FS='\n' OFS=',' file
Output:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Extract common lines from multiple text files and display original line numbers

What I want?
Extract the common lines from n large files.
Append the original line numbers from each file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should be a different file
1 3 2 apple
1, 3 and 2 in the output are the original line numbers of File1.txt, File2.txt and File3.txt where the common line apple exists
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:3:apple
File3.txt:2:apple
Associate each unique line with a space separated list of line numbers indicating where it is seen in each file in an array, and print these next to each other at the end if the line is found in all three files.
awk '{
n[$0] = n[$0] FNR OFS
c[$0]++
}
END {
for (r in c)
if (c[r] == 3)
print n[r] r
}' file1 file2 file3
If the number of files is unknown, refer to Ravinder's answer, or just replace the hardcoded 3 in the END block with ARGC-1 as shown there.
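The ARGC-1 variant mentioned above could look like the following minimal sketch. The wrapper name common_lines and the three small sample files are hypothetical and only serve the illustration. It assumes that every operand is a file name (no var=value assignments on the command line) and that no line repeats within a single file, since a repeat would inflate the count.

```shell
# common_lines is a hypothetical wrapper name used only for this sketch.
common_lines() {
  awk '
    {
      n[$0] = n[$0] FNR OFS      # append the line number within the current file
      c[$0]++                    # count occurrences of this line
    }
    END {
      for (r in c)
        if (c[r] == ARGC - 1)    # ARGC counts "awk" itself plus each file operand
          print n[r] r
    }
  ' "$@"
}

# Illustrative sample files (not the ones from the question).
dir=$(mktemp -d)
printf '%s\n' x y z > "$dir/a.txt"
printf '%s\n' q y   > "$dir/b.txt"
printf '%s\n' y w   > "$dir/c.txt"

common_lines "$dir/a.txt" "$dir/b.txt" "$dir/c.txt"
```

Because the threshold is derived from ARGC, the same function works unchanged for two files or for ten.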
GNU awk specific approach that works with any number of files:
#!/usr/bin/gawk -f
BEGINFILE {
nfiles++
}
{
lines[$0][nfiles] = FNR
}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (line in lines) {
if (length(lines[line]) == nfiles) {
for (file = 1; file <= nfiles; file++)
printf "%d\t", lines[line][file]
print line
}
}
}
Example:
$ ./showlines file[123].txt
1 3 2 apple
Could you please try the following, written and tested with GNU awk. It makes use of the ARGC value, which gives the number of command-line arguments including the awk command name itself, so ARGC-1 is the number of files here.
awk '
{
a[$0]=(a[$0]?a[$0] OFS:"")FNR
count[$0]++
}
END{
for(i in count){
if(count[i]==(ARGC-1)){
print i,a[i]
}
}
}
' file1.txt file2.txt file3.txt
A perl solution
perl -ne '
$h{$_} .= "$.\t"; # append current line number and tab character to value in a hash with key current line
$. = 0 if eof; # reset line number when end of file is reached
END{
while ( ($k,$v) = each %h ) { # loop over hash entries
if ( $v =~ y/\t// == 3 ) { # if value contains 3 tabs
print $v.$k # print value concatenated with key
}
}
}' file1.txt file2.txt file3.txt

Use awk to create index of words from file

I'm learning UNIX for school and I'm supposed to create a command line that takes a text file and generates a dictionary index showing the words (excluding articles and prepositions) and the lines where they appear in the file.
I found a similar problem as mine in: https://unix.stackexchange.com/questions/169159/how-do-i-use-awk-to-create-an-index-of-words-in-file?newreg=a75eebee28fb4a3eadeef5a53c74b9a8 The problem is that when I run the solution
$ awk '
{
gsub(/[^[:alpha:] ]/,"");
for(i=1;i<=NF;i++) {
a[$i] = a[$i] ? a[$i]", "FNR : FNR;
}
}
END {
for (i in a) {
print i": "a[i];
}
}' file | sort
The output contains special characters (which I don't want) like:
-Quiero: 21
Sancho,: 2, 4, 8
How can I remove all the special characters and exclude articles and prepositions?
$ echo This is this test. | # some test text
awk '
BEGIN{
x["a"];x["an"];x["the"];x["on"] # the stop words
OFS=", " # list separator to a
}
{
for(i=1;i<=NF;i++) { # list words in a line
w=tolower($i) # lowercase first so "The" also matches the stop word "the"
gsub(/^[^a-z]+|[^a-z]+$/,"",w) # remove leading and trailing non-alphabets
if(w!="" && !(w in x)) # if word is not empty and not a stop word
a[w]=a[w] (a[w]==""?"":OFS) NR # add record number to list
}
}
END { # after file is processed
for(i in a) # in no particular order
print i ": " a[i] # ... print elements in a
}'
this: 1, 1
test: 1
is: 1

Bash group by on the basis of n number of columns

This is related to my previous question that I [asked] (bash command for group by count)
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and adding up the values of the 2nd column and the 3rd column, similar to a GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but none of the commands below work.
Command used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing up to 2 columns only. What if there are n columns and we want to perform operations such as addition on one column and subtraction on another? How can this be modified further?
Example
ABC|1|2|4|......... upto n columns
ABC|4|5|6|......... upto n columns
DEF|1|4|6|......... upto n columns
Let's say a sum is needed for the first column, maybe an average for the second column, some other operation for the third column, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" } # set separators
{
a[$1]+=$2 # sum second field to a hash
b[$1]+=$3 # ... b hash
}
END { # in the end
for(i in a) # loop all
print i,a[i],b[i] # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
for(i=2;i<=NF;i++) # loop all data fields
a[$1][i]+=$i # sum them up to related cells
a[$1][1]=i # set field count to first cell
}
END {
for(i in a) {
for((j=2)&&b="";j<a[i][1];j++) # buffer output
b=b (b==""?"":OFS)a[i][j]
print i,b # output
}
}' file
BCD|10|7
ABC|9|12
The latter was only tested with 2 data fields (busy at a meeting :).
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
awk -F\| '{ array[$1]="";for (i=2;i<=NF;i++) { arr[$1,i]+=$i } } END { for (i in array) { printf "%s",i;for (p=2;p<=NF;p++) { printf "|%s",arr[i,p] } print "" } }' filename
We use two arrays (array and arr). array is a one-dimensional array tracking all the first pieces, and arr is a multidimensional array keyed on the first piece and then the field index, so for example after the first input line arr["ABC",2]=1 and arr["ABC",3]=2. At the end we loop through array and then over each field in the data set, pulling the data out of the multidimensional array arr.
This will work in any awk and will retain the input keys order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
for (i=2;i<=NF;i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key, OFS
for (i=2;i<=NF;i++) {
printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7
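The question also asks about applying a different operation to each column, which the answers above do not show. A minimal sketch in portable awk follows, assuming purely for illustration that field 2 should be summed and field 3 averaged. group_ops is a hypothetical name, and the sample data is the input from the question.

```shell
# group_ops is a hypothetical name; "sum of field 2, mean of field 3"
# is only an illustration of mixing operations per column.
group_ops() {
  awk '
    BEGIN { FS = OFS = "|" }
    !seen[$1]++ { keys[++numKeys] = $1 }   # remember first-seen key order
    {
      cnt[$1]++                            # rows per key, needed for the mean
      sum2[$1] += $2
      sum3[$1] += $3
    }
    END {
      for (k = 1; k <= numKeys; k++) {
        key = keys[k]
        print key, sum2[key], sum3[key] / cnt[key]
      }
    }
  ' "$1"
}

data=$(mktemp)
printf '%s\n' 'ABC|1|2' 'ABC|3|4' 'BCD|7|2' 'ABC|5|6' 'BCD|3|5' > "$data"

group_ops "$data"
```

Other per-column operations (minimum, maximum, difference) would follow the same pattern: keep one accumulator array per column and combine them in END.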

improve bash loop with awk split

The awk below, improved by #hek2mgl, runs; however, it takes ~15 hours to complete. It basically matches input files of 21 - 259 records against a file of 11,137,660 records. That is a lot, but hopefully it can be made faster. Maybe splitting $5 on the hyphen (AGRN-6|gc=75 to AGRN - 6|gc=75) could speed up the process. Not sure if the below is a start or not. Essentially, it uses the input files, of which there are 4, to search and match in a large 11,000,000 record file. Thank you :).
input
AGRN
CCDC39
CCDC40
CFTR
file that is searched in
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 2 2
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 3 2
output ($4 $5 average of $7)
chr1:955543 AGRN-6|gc=75 1.3
awk
BEGIN{FS="[\t| -]+"}
# Read search terms from file1 into 's'
FNR==NR {
s[$0=1]
next
}
{
# Check if $5 matches one of the search terms
for(i in s) {
if($5 ~ i) {
# check for match
if s[$5] exists
s[$5] {
# Store first two fields for later usage
a[$5]=$1
b[$5]=$2
# Add $9 to total of $9 per $5
t[$5]+=$8
# Increment count of occurences of $5
c[$5]++
next
}
}
}
END {
# Calculate average and print output for all search terms
# that has been found
for( i in t ) {
avg = t[i] / c[i]
printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
}
}
Simplify:
awk '
NR == FNR {input[$0]; next}
{
split($5, a, "-")
if (a[1] in input) {
key = $4 OFS $5
n[key]++
sum[key] += $7
}
}
END {
for (key in n)
printf "%s %.1f\n", key, sum[key]/n[key]
}
' input file
Your code is broken because of the over-use of arrays, but mainly this:
FNR==NR {
s[$0=1]
# ^^^^^
next
}
Array s will only have a single key, the number "1" because for each line you assign the value "1" to $0. You should write
s[$0] = 1
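The difference between the two forms can be checked with a small experiment. This is only an illustrative sketch: the temporary file holds the four search terms from the question, and the variable names buggy and fixed are made up for the demonstration.

```shell
terms=$(mktemp)
printf '%s\n' AGRN CCDC39 CCDC40 CFTR > "$terms"

# Buggy form: assigns 1 to $0 first, then uses the result as the subscript.
buggy=$(awk '{ s[$0=1] } END { n = 0; for (k in s) n++; print n }' "$terms")

# Intended form: uses the line as the subscript and stores 1 as the value.
fixed=$(awk '{ s[$0] = 1 } END { n = 0; for (k in s) n++; print n }' "$terms")

echo "keys with s[\$0=1]: $buggy"
echo "keys with s[\$0] = 1: $fixed"
```

With the buggy form every line collapses onto the single key 1, so none of the gene names is ever stored.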
I'd be interested to hear how fast the following is. I'm not sure it will be much slower, since it doesn't require awk to do anything clumsy, but it still needs one pass over the search file per input selection. If you want to optimize it, I think you need to use associative arrays and hash the input selections into their own array. That way you can have it done in one pass over the file. There is still the same number of potential comparisons per line unless you can skip searching after the first match, in which case it may be slightly quicker.
Input file: select.txt
Search file: search_file.txt
while IFS= read -r a; do
awk -v pat="$a" '$5 ~ pat { var += $7; field4 = $4; field5 = $5; cnt++ } END { if (cnt) print field4, field5, var/cnt }' search_file.txt
done < select.txt

Resources