How to count a matching pattern in one line? - shell

I have a FASTA file containing sequences:
>lcl|QCYY01003067.1_cds_ROT65593.1_2
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>lcl|QCYY01003067.1_cds_ROT65593.1_3
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
>lcl|QCYY01003067.1_cds_ROT65593.1_4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCC
I want to count the number of 'N' characters and the number of patterns (runs of consecutive Ns) in each sequence line. Header lines (such as >lcl|QCYY01003067.1_cds_ROT65593.1_2) do not need to be included.
For example:
line 2=0,0
line 4=15,2
line 6=10,1
How can I improve this code?
grep -n '[{N}]' <filename> | cut -d : -f 1 | uniq -c

Another awk:
$ awk 'NR%2==0{printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"")}' file
Output:
line 2=0,0
line 4=15,2
line 6=10,1
Explained:
$ awk '
NR%2==0 { # process even records
printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"") # count with gsub
}' file
gsub(/N/,"N") counts the Ns in the record (gsub returns the number of replacements it made). gsub(/N+/,"") counts the runs of consecutive Ns. Note that "" removes those Ns from the record, so if you need to process the data further afterwards, use gsub(/N+/,"&") instead.
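To see the difference between the two replacement strings, here is a minimal sketch using one record from the question. The temporary file and the variable names n, runs and orig are only for illustration; the two gsub calls sit in separate statements so the result does not depend on the order in which printf evaluates its arguments.

```shell
# Sample record copied from the question; the temp file is only for the demo.
tmp=$(mktemp)
printf '%s\n' '>lcl|QCYY01003067.1_cds_ROT65593.1_3' \
              'ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT' > "$tmp"

out=$(awk '
!/^>/ {
    orig = $0                      # remember the record before counting
    n    = gsub(/N/, "N")          # number of single Ns
    runs = gsub(/N+/, "&")         # number of runs; "&" puts each match back
    printf "line %d=%d,%d %s\n", NR, n, runs, ($0 == orig ? "unchanged" : "changed")
}' "$tmp")
echo "$out"
rm -f "$tmp"
```

Because "&" stands for the matched text, the record is still intact after counting and can be processed further.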
Updated:
The version I wrote for your already-deleted next question.
I added an extra line to your data which demonstrates the question I asked in the comments (is ...N\nNN.. one (NNN) or two (N,NN) patterns of your definition):
...
>seq4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
This one is for GNU awk (for using RT):
$ gawk '
BEGIN {
RS=">seq[^\n]+"
}
NR>1 {
# gsub(/\n/,"") # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN
printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
}
{
rt=RT
}' file
Output (pay special attention to the seq4):
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,3
or if you uncomment the gsub(/\n/,"") to remove the newline separating strings, the output is:
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,2
One-liner (with the gsub(/\n/,"") uncommented; it also needs GNU awk because of RT):
$ gawk 'BEGIN{RS=">seq[^\n]+"}NR>1{gsub(/\n/,"");printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")}{rt=RT}' file
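If GNU awk is not available, a similar result can probably be had with POSIX awk by collecting the lines of each record before counting. This is only a sketch: it joins the lines of a record, so a run of Ns split across a newline counts as one pattern (the second output variant above), and the names hdr, seq and report are made up for the example.

```shell
tmp=$(mktemp)
printf '%s\n' '>seq1' 'ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA' \
              '>seq4' 'ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN' \
                      'NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC' > "$tmp"

out=$(awk '
function report(   n, runs) {           # n and runs are local variables
    n    = gsub(/N/, "N", seq)          # count single Ns in the joined sequence
    runs = gsub(/N+/, "&", seq)         # count runs of Ns
    printf "%s=%d,%d\n", hdr, n, runs
}
/^>/ { if (hdr != "") report(); hdr = $0; seq = ""; next }
     { seq = seq $0 }                   # append sequence lines without the newline
END  { if (hdr != "") report() }
' "$tmp")
echo "$out"
rm -f "$tmp"
```

The header line is matched with /^>/ rather than a ">seq" prefix, so the same sketch should also cope with headers like the ones in the original question.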

Could you please try following.
awk '
!/^>/{
while(match($0,/N+/)){
count++
total+=length(substr($0,RSTART,RLENGTH))
$0=substr($0,RSTART+RLENGTH)
}
printf("%s %d=%d,%d\n","line",FNR,total,count)
count=total=""
}
' Input_file
Output will be as follows.
line 2=0,0
line 4=15,2
line 6=10,1
Explanation: a commented version of the code above.
awk ' ##Start of the awk program.
!/^>/{ ##Run this block only for lines that do NOT start with >.
while(match($0,/N+/)){ ##Loop for as long as a run of consecutive Ns is found in the current line.
count++ ##Count one more run.
total+=length(substr($0,RSTART,RLENGTH)) ##Add the length of the matched run (the same value as RLENGTH) to total.
$0=substr($0,RSTART+RLENGTH) ##Keep only the rest of the line after the match, so that the next iteration looks for further runs.
} ##End of the while loop.
printf("%s %d=%d,%d\n","line",FNR,total,count) ##Print the line number (FNR), the total number of Ns and the number of runs.
count=total="" ##Reset count and total so that the values are not carried over to the next line.
}
' Input_file ##Name of the input file.

Related

How to trim every nth line?

I would like to cut off the first 9 characters of every 4th line. I could use cut -c 10-, but I don't know how to select only every 4th line without losing the remaining lines.
Input:
#V300059044L3C001R0010004402
AAGTAGATATCATGGAGCCG
+
FFFGFGGFGFGFFGFFGFFGGGGGFFFGG
#V300059044L3C001R0010009240
AAAGGGAGGGAGAATAATGG
+
GFFGFEGFGFGEFDFGGEFFGGEDEGEGF
Output:
#V300059044L3C001R0010004402
AAGTAGATATCATGGAGCCG
+
FGFFGFFGFFGGGGGFFFGG
#V300059044L3C001R0010009240
AAAGGGAGGGAGAATAATGG
+
FGEFDFGGEFFGGEDEGEGF
Could you please try following, written and tested with shown samples in GNU awk.
awk 'FNR%4==0{print substr($0,10);next} 1' Input_file
OR, as per @tripleee's suggestion (in the comments), try:
awk '!(FNR%4) { $0 = substr($0, 10) }1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Start of the awk program.
FNR%4==0{ ##True when the line number is divisible by 4 (every 4th line).
print substr($0,10) ##Print the line from its 10th character onwards.
next ##Skip the remaining statements for this line.
}
1 ##1 is an always-true condition, so every other line is printed unchanged.
' Input_file ##Name of the input file.
GNU sed can select every 4th line with the address 4~4 (first~step), e.g.:
sed -E '4~4s/.{9}//' Input_file
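The same idea generalises to any line interval and prefix length. The sketch below passes both as awk variables; the names n and k and the four-line sample are made up for illustration.

```shell
tmp=$(mktemp)
printf '%s\n' '@read1' 'ACGT' '+' '0123456789ABC' > "$tmp"

# n = act on every n-th line, k = number of leading characters to drop
out=$(awk -v n=4 -v k=9 'FNR % n == 0 { $0 = substr($0, k + 1) } 1' "$tmp")
echo "$out"
rm -f "$tmp"
```

With n=4 and k=9 this behaves like the awk commands above; other values only need a change on the command line.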

Efficient coding to count capital characters in file

I want to count all the capital characters A-Z from a file.
I take the file as an argument, then I search the whole file for each letter and sum the results. My code works fine, but is there a more efficient way to do it, without using a loop?
sum=0
for var in {A..Z}
do
foo="$(grep -o $var "$1"| wc -l)"
sum=$((sum+foo))
done
I tried to do it like this, but it gives me wrong results because it counts every character, including spaces and newlines.
cat "$1" | wc -m
You can do it with a single grep command similar to what you're already doing:
grep -o "[A-Z]" "$1" | wc -l
We can avoid using multiple programs to count the capital letters in a file: this can be done with a single awk, which saves us some cycles and should be faster too.
Could you please try following.
awk '
{
count+=gsub(/[A-Z]/,"&")
}
END{
print "Total number of capital letters in file are: " count
}
' Input_file
In case you want to run it as a script that takes the input file as an argument, change Input_file to "$1".
Explanation: the same code with comments added (for explanation only).
awk ' ##Start of the awk program.
{
count+=gsub(/[A-Z]/,"&") ##gsub is a built-in awk function that substitutes every match and returns the number of substitutions.
##Each capital letter is replaced with itself ("&"), so the line stays unchanged
##and the return value, added to count, is the number of capital letters in the line.
}
END{ ##The END block runs once, after the whole Input_file has been read.
print "Total number of capital letters in file are: " count ##Print the total held in the count variable.
}
' Input_file ##Name of the input file.
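Another single-pipeline option is tr, which can delete everything except the capital letters so that wc only has to count what is left. This is a small sketch with a made-up two-line sample; LC_ALL=C is set on the assumption that only the ASCII letters A-Z should be counted.

```shell
tmp=$(mktemp)
printf '%s\n' 'Hello World' 'ABC def' > "$tmp"

# -c complements the set, -d deletes: everything that is not A-Z is removed
count=$(LC_ALL=C tr -cd 'A-Z' < "$tmp" | wc -c)
count=$((count))                 # strip the padding some wc implementations add
echo "$count"
rm -f "$tmp"
```

In a script, "$tmp" would simply be replaced by "$1".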

replace names in fasta

I want to change the sequence names in a fasta file according to a text file containing the new names. I found several approaches, and seqkit made a good impression, but I can't get it running (I followed its example "Replace key with value by key-value file").
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the ref.txt tab delimited text file like
BC1 1234
BC2 1235
BC3 1236
Using seqkit in Git Bash, the following runs through the file but doesn't change the names:
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I'm used to R and new to bash and can't find the bug, but I guess I need to adjust for the tab and the _?
In the example at https://bioinf.shenwei.me/seqkit/usage/#replace (part 7, "Replace key with value by key-value file"), the sequence name is tab-delimited and only the second part is replaced.
Any advice on how to adjust the code?
The desired outcome should look like this, with BC1 replaced by the number from the text file (1234):
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
Could you please try the following.
awk '
FNR==NR{
a[$1]=$2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##FNR==NR is condition which will be TRUE when 1st Input_file named ref.txt will be read.
a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2 of current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && /^>/{ ##Checking condition if $2 of current line is present in array a and starts with > then do following.
print ">"a[$2] ##Printing > and value of array a whose index is $2.
next ##next will skip all further statements from here.
}
1 ##Mentioning 1 will print the lines(those which are NOT starting with > in Input_file seq.fa)
' ref.txt FS="[> ]" seq.fa ##Mentioning Input_file names here and setting FS= either space or > for Input_file seq.fa here.
EDIT: As per the OP's comment, the output should also carry an occurrence number (e.g. >1234_1), so here is the adjusted code.
awk '
FNR==NR{
a[$1]=$2
b[$1]=++c[$2]
next
}
($2 in a) && /^>/{
print ">"a[$2]"_"b[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
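The first statement fills the array a with the contents of the tab-delimited file ref.txt.
The second statement handles the header lines of the second file, seq.fa: with > as the field delimiter these are the lines with 2 fields, and the old name in $2 is replaced by its new name from a before the line is printed.
The last statement prints all remaining lines of that same file unchanged.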
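Note that this last version prints an empty name for a header that has no entry in ref.txt. A hedged variant that leaves such headers untouched could look like the sketch below; it assumes the whole header after > is the key (no description text), and the array name map and the extra record BC9 are invented for the example.

```shell
ref=$(mktemp); fa=$(mktemp)
printf 'BC1\t1234\nBC2\t1235\n' > "$ref"
printf '%s\n' '>BC1' 'ATGC' '>BC9' 'GGCC' > "$fa"

out=$(awk -F'\t' '
NR == FNR { map[$1] = $2; next }          # first file: old name -> new name
/^>/      { id = substr($0, 2)            # header without the leading ">"
            if (id in map) $0 = ">" map[id] }
1                                         # print every line of the second file
' "$ref" "$fa")
echo "$out"
rm -f "$ref" "$fa"
```

Here >BC1 becomes >1234, while >BC9, which is missing from the mapping, is printed as it was.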

sed squeeze multiple occurrence of word

I have text file with lines like below:
this is the code ;rfc1234;rfc1234
this is the code ;rfc1234;rfc1234;rfc1234;rfc1234
How can I squeeze the repeating words in the file into a single word, like below:
this is the code ;rfc1234
this is the code ;rfc1234
I tried the 'tr' command, but it is limited to squeezing single characters.
with sed for arbitrary repeated strings prefixed with ;
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
or, if you want to delete everything after the first token without checking whether they match the preceding one or not
$ sed -E 's/(\S);.*/\1/' file
Explanation
(;[^;]+) is to capture a string starting with semicolon
(\1)+ followed by the same captured string one or more times
/\1/g replace the whole chain with one instance, and repeat
The following awk may help here. It looks at all items in the last column of your Input_file and keeps only the unique values in it.
awk '{num=split($NF,array,";");for(i=1;i<=num;i++){if(!array1[array[i]]++){val=val?val ";" array[i]:array[i]}};NF--;print $0";"val;val="";delete array;delete array1}' Input_file
Adding a non-one liner form of solution too now.
awk '
{
num=split($NF,array,";");
for(i=1;i<=num;i++){
if(!array1[array[i]]++){
val=val?val ";" array[i]:array[i]}
};
NF--;
print $0";"val;
val="";
delete array;
delete array1
}' Input_file
Explanation:
awk '
{
num=split($NF,array,";"); ##Split the last field on ; into array; num is the number of elements.
for(i=1;i<=num;i++){ ##Loop over the elements, from i=1 to num.
if(!array1[array[i]]++){ ##True only the first time the value array[i] is seen.
val=val?val ";" array[i]:array[i]} ##Append array[i] to val, separated by ; if val is not empty.
};
NF--; ##Decrease NF (the number of fields) to remove the last field from the line.
print $0";"val; ##Print the line (without its last field), then ; and the value of val.
val=""; ##Reset val.
delete array; ##Delete array.
delete array1 ##Delete array1.
}' Input_file ##Name of the input file.
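A more compact take on the same idea is to split the whole line on ; and keep each token only the first time it is seen. This is a sketch under the assumption that ; appears only as the separator of the repeated words; res and seen are illustrative names. Unlike a squeeze, it also drops duplicates that are not adjacent.

```shell
tmp=$(mktemp)
printf '%s\n' 'this is the code ;rfc1234;rfc1234' \
              'this is the code ;rfc1234;rfc1234;rfc1234;rfc1234' > "$tmp"

out=$(awk -F';' '{
    res = $1                               # text before the first ";"
    split("", seen)                        # portable way to empty an array
    for (i = 2; i <= NF; i++)
        if (!seen[$i]++) res = res ";" $i  # append a token only on first sight
    print res
}' "$tmp")
echo "$out"
rm -f "$tmp"
```

Since the text before the first ; is copied as is, the space in front of ;rfc1234 is preserved.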
I started playing around with s/(.+)\1/\1/g. It seemed to work with perl (even found the is_is_) but didn't quite take me there:
$ perl -pe 's/(.+)\1+/\1/g' file
this the code ;rfc1234
this the code ;rfc1234;rfc1234
sed 's/\(;[^;]*\).*/\1/' file
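You can use the command below to achieve this (GNU sed, which allows combining a number with the g flag). The ; is part of the pattern so that no stray semicolons are left behind:
echo "this is the code ;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
echo "this is the code ;rfc1234;rfc1234;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
or
sed 's/;rfc1234//2g' yourfile.txt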
This might work for you (GNU sed):
sed -r ':a;s/(\S+)\1+/\1/g;ta' file
The regex is repeated until only the first pattern remains.

How to Compare two files line by line and output the whole line if different

I have two sorted files:
1) a control file (ctrl.txt), which is generated by an external process
2) a line count file (count.txt), which I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files while ignoring wrinkles in column 1 (the file names), such as "_" (for Thunder_bird) or upper case (for MUSTANG), so that my output shows only the line below, the only file whose counts really don't match.
Hurricane|3000
My idea is to compare only the second column of both files and output the whole line if the values differ.
I have seen other examples in AWK, but I could not get anything to work.
Could you please try the following awk and let me know if this helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
Adding a non-one liner form of solution too now.
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' ctrl.txt count.txt
Explanation: Adding explanation too here for above code.
awk -F"|" ' ##Set the field separator to | (pipe) for all lines of the input files.
FNR==NR{ ##FNR==NR is TRUE only while the first input file (ctrl.txt) is being read.
gsub(/_/,""); ##Globally remove _ from the current line.
a[tolower($1)]=$2; ##Store $2 in array a, indexed by the lower-cased first field.
next} ##next skips the remaining statements, so they run only for the second file (count.txt).
{ gsub(/_/,"") } ##From here on the second file is being read; remove every _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##If the lower-cased $1 is in array a and $2 differs from the stored value, the condition is TRUE; with no action given, awk prints the current line of count.txt by default.
' ctrl.txt count.txt ##The input file names passed to awk.
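The command above prints the mismatching line from count.txt. If the line from the control file is wanted instead, as in the expected output of the question, the file order can be swapped. This is a minimal sketch using the sample data; the helper function norm and the array cnt are names chosen for the example.

```shell
ctrl=$(mktemp); count=$(mktemp)
printf '%s\n' 'Thunderbird|1000' 'Mustang|2000' 'Hurricane|3000' > "$ctrl"
printf '%s\n' 'Thunder_bird|1000' 'MUSTANG|2000' 'Hurricane|3001' > "$count"

out=$(awk -F'|' '
function norm(s) { gsub(/_/, "", s); return tolower(s) }  # drop "_" and fold case
NR == FNR { cnt[norm($1)] = $2; next }                     # first file: count.txt
{ k = norm($1); if ((k in cnt) && cnt[k] != $2) print }    # second file: ctrl.txt
' "$count" "$ctrl")
echo "$out"
rm -f "$ctrl" "$count"
```

Because norm works on a copy of the name, the printed line keeps its original spelling.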

Resources