enter code hereI have a fasta file containing sequences
>lcl|QCYY01003067.1_cds_ROT65593.1_2
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>lcl|QCYY01003067.1_cds_ROT65593.1_3
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
>lcl|QCYY01003067.1_cds_ROT65593.1_4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCC
I wanted to count the number of 'N' and also the number of patterns occurring in each line. No need to include header (>lcl|QCYY01003067.1_cds_ROT65593.1_2 )
eg:-
line 2=0,0
line 4=15,2
line 6=10,1
How to improve this code:
grep -n '[{N}]' <filename> | cut -d : -f 1 | uniq -c
Another awk:
$ awk 'NR%2==0{printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"")}' file
Output:
line 2=0,0
line 4=15,2
line 6=10,1
Explained:
$ awk '
NR%2==0 { # process even records
printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"") # count with gsub
}' file
gsub(/N/,"N") counts the amount of Ns in the record (returns the amount of replacements). gsub(/N+/,"") counts the number of consecutive strings of Ns. Notice, that "" removes those Ns from the record so if you need to later further process the data, use gsub(/N+/,"&") instead.
Updated:
The version I wrote for your already-deleted next question.
I added an extra line to your data which demonstrates the question I asked in the comments (is ...N\nNN.. one (NNN) or two (N,NN) patterns of your definition):
...
>seq4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
This one is for GNU awk (for using RT):
$ gawk '
BEGIN {
RS=">seq[^\n]+"
}
NR>1 {
# gsub(/\n/,"") # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN
printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
}
{
rt=RT
}' file
Output (pay special attention to the seq4):
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,3
or if you uncomment the gsub(/\n/,"") to remove the newline separating strings, the output is:
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,2
One-liner (with the one gsub uncommented):
$ awk 'BEGIN{RS=">seq[^\n]+"}NR>1{gsub(/\n/,"");printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")}{rt=RT}' file
Could you please try following.
awk '
!/^>/{
while(match($0,/N+/)){
count++
total+=length(substr($0,RSTART,RLENGTH))
$0=substr($0,RSTART+RLENGTH)
}
printf("%s %d=%d,%d\n","line",FNR,total,count)
count=total=""
}
' Input_file
Output will be as follows.
line 2=0,0
line 4=15,2
line 6=10,1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!/^>/{ ##Checking condition if a line is NOT starting from > then do following.
while(match($0,/N+/)){ ##Running a while loop which will run till a match found for N characters continuous occurrence.
count++ ##Doing increment to variable count with 1 each time cursor comes here.
total+=length(substr($0,RSTART,RLENGTH)) ##Creating total variable which is keep adding its own value along with length of matched regex, where regex is looking for continuous occurrence of N character in current line.
$0=substr($0,RSTART+RLENGTH) ##Resetting value of current line to have only REST of line which starts from very next character of matched regex. So that we can skip previous matched regex and look for others in rest of the line.
} ##Closing BLOCK for above mentioned while loop here.
printf("%s %d=%d,%d\n","line",FNR,total,count) ##Printing values line,FNR,total,count variables here.
count=total="" ##Nullifying variables count and total here, so that previous values should NOT be added to current values of it.
}
' Input_file ##Mentioning Input_file name here.
I have a file like below format:
$ cat file_in.csv
1308123;28/01/2019;28/01/2019;22/01/2019
1308456;20/11/2018;27/11/2018;09/11/2018;15/11/2018;10/11/2018;02/12/2018
1308789;06/12/2018;04/12/2018
1308012;unknown
How can i transpose as below, starting from second column:
1308123;28/01/2019
1308123;28/01/2019
1308123;22/01/2019
1308456;20/11/2018
1308456;27/11/2018
1308456;09/11/2018
1308456;15/11/2018
1308456;10/11/2018
1308456;02/12/2018
1308789;06/12/2018
1308789;04/12/2018
1308012;unknown
I'm testing my script, but obtain a wrong result
echo "123;23/05/2018;24/05/2018" | awk -F";" 'NR==3{a=$1";";next}{a=a$1";"}END{print a}'
Thanks in advance
1st Solution: Eaisest solution will be, loop through all fields(off course have set field separator as ;) and then print $1 along with all fields in new line. Also note that loop is running from i=2 to till value of NF leaving first field since we need to print in new line from column 2nd onwards.
awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
2nd Solution: Using 1 time substitution(sub) and global substitutions(gsub) functionality of awk. Here I am changing very first occurence of ; with ###(assumed that your Input_file will NOT have this characters together, in case it is there choose any unique character(s) which are NOT in one's Input_file on place of ###), then globally subsituting ;(all occurences) with ORS val(a variable which has value of $1) and ; so make values in new column. Now finally remove ### from first field. Why we have done this approch if we DO NOT substitute very first occurence of ; with any other character then it will place a NEW LINE before substituion which we DO NOT want to have. (Also as per Ed sir's comment this solution was tested in 1 Input_file and may have issues while reading multiple Input_files)
awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1' Input_file
Another awk
awk -F";" '{ OFS="\n" $1 ";"; $1=$1;$1=""; printf("%s",$0) } ' file
I'm trying to extract all the records that matches the text "IN" in the 10th field from this file.
i tried but it's not giving me the accurate results. Any help provided here would be highly appreciated.
awk '$10 == "IN" {print $0}'
input_file: my input file
A1|A2|A3|A4|A5|A6|A7|A8|A9|PK|A11|A13|A14|A15|A16|A17|A18
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AW|BW|CQ|AA|AR|AF|RR|AKL|ASD|US|PP|BN|TY|OL|Q3|M8|I7|V6
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
output_file: my output should be
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
all the records that matched "IN" in the 10th field should be filtered.
Since you haven't mentioned the field separator in awk code so by default it makes space as field separator and your Input_file is | pipe delimited so let awk know you should set it up in code.
Could you please try following.
awk -F'|' '$10=="IN"' Input_file
Explanation: Adding explanation for above code too.
awk -F'|' ' ##Setting field separator as |(pipe) for all lines of Input_file.
$10=="IN" ##Checking condition if 10th field is equal to IN here if yes then print the current line.
' Input_file ##Mentioning Input_file name here.
I have two sorted files in question
1)one is a control file(ctrl.txt) which is external process generated
2)and other is line count file(count.txt) that I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files ignoring wrinkles in column1(filenames) such as "_" (for Thunder_bird) or "upper case" (for MUSTANG) so that my output only shows below file as the only real different file for which counts dont match.
Hurricane|3000
I have this idea to only compare second column from both the files and output whole line if they are different
I have seen other examples in AWK but I could not get anything to work.
Could you please try following awk and let me know if this helps you.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' cntrl.txt count.txt
Adding a non-one liner form of solution too now.
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' cntrl.txt count.txt
Explanation: Adding explanation too here for above code.
awk -F"|" ' ##Setting field seprator as |(pipe) here for all lines in Input_file(s).
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file(cntrl.txt) in this case is being read. Following instructions will be executed once this condition is TRUE.
gsub(/_/,""); ##Using gsub utility of awk to globally subtitute _ with NULL in current line.
a[tolower($1)]=$2; ##Creating an array named a whose index is first field in LOWER CASE to avoid confusions and value is $2 of current line.
next} ##next is awk out of the box keyword which will skip all further instructions now.(to make sure they are read when 2nd Input-file named count.txt is being read).
{ gsub(/_/,"") } ##Statements from here will be executed when 2nd Input_file is being read, using gsub to remove _ all occurrences from line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##Checking condition here if lower form of $1 is present in array a and value of current line $2 is NOT equal to array a value. If this condition is TRUE then print the current line, since I have NOT given any action so by default printing of current line will happen from count.txt file.
' cntrl.txt count.txt ##Mentioning the Input_file names here which we have to pass to awk.
How can I remove partial duplicates in bash using either awk, grep or sort?
Input:
"3","6"
"3","7"
"4","9"
"5","6"
"26","48"
"543","7"
Expected Output:
"3","6"
"3","7"
"4","9"
"26","48"
Could you please try following and let me know if this helps you.
awk -F'[",]' '!a[$5]++' Input_file
Output will be as follows.
"3","6"
"3","7"
"4","9"
"26","48"
EDIT: Adding explanation too here.
awk -F'[",]' ' ##Setting field separator as " or , for every line of Input_file.
!a[$5]++ ##creating an array named a whose index is $5(fifth field) and checking condition if 5th field is NOT present in array a, so when any 5th field comes in array a then increasing its count so next time it will not take any duplicates in it. Since awk works on condition and then action, since here no action is mentioned so by default print of current line will happen.
' Input_file ##Mentioning the Input_file here too.