Efficient coding to count capital characters in file - bash

I want to count all the capital characters A-Z from a file.
I take the file as an argument, then search the whole file for each letter and sum the results. My code works fine, but is there a more efficient way to do it, without using a loop?
sum=0
for var in {A..Z}
do
foo="$(grep -o $var "$1"| wc -l)"
sum=$((sum+foo))
done
I tried to do it like this, but it gives the wrong result because it counts every character, including spaces and line endings.
cat "$1" | wc -m

You can do it with a single grep command similar to what you're already doing:
grep -o "[A-Z]" "$1" | wc -l
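A minimal sketch of this approach next to a tr variant that is not part of the answer above. tr -cd deletes every character outside the listed set, so wc -c can count what remains. The sample file and variable names are made up for illustration, and LC_ALL=C is an assumption that only the 26 ASCII capitals should be counted.

```shell
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf 'Hello World\nABC def\n' > "$sample"

# grep -o prints each match on its own line; wc -l counts those lines.
# LC_ALL=C (an assumption) makes [A-Z] mean exactly the ASCII capitals.
grep_count=$(LC_ALL=C grep -o '[A-Z]' "$sample" | wc -l)

# tr -cd deletes every character that is NOT in the set; wc -c counts the rest.
tr_count=$(LC_ALL=C tr -cd 'A-Z' < "$sample" | wc -c)

echo "grep: $grep_count, tr: $tr_count"
rm -f "$sample"
```

Both counts should agree. The tr variant never builds one line per match, which may matter on very large files.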

There is no need to chain several programs to count the capital letters in a file. A single awk command can do it, which avoids spawning extra processes and should be faster too.
Please try the following.
awk '
{
count+=gsub(/[A-Z]/,"&")
}
END{
print "Total number of capital letters in file are: " count
}
' Input_file
To run it as a script that takes the file as an argument, change Input_file to "$1".
Explanation: the same code again with comments, for reading only, not for running.
awk ' ##Start of the awk program.
{
count+=gsub(/[A-Z]/,"&") ##gsub replaces every capital letter with itself ("&") and returns the number of replacements, which is added to count.
}
END{ ##The END block runs once all of Input_file has been read.
print "Total number of capital letters in file are: " count ##Print the total held in the count variable.
}
' Input_file ##The input file name.
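As a rough sketch of the script form mentioned above, the awk program can be wrapped in a shell function that takes the file as "$1". The function name count_caps and the sample file are hypothetical. Printing count + 0 is an addition, so that a file without capitals prints 0 instead of an empty string.

```shell
# count_caps is a hypothetical helper name, not part of the answer above.
count_caps() {
    # gsub returns the number of replacements; "&" puts each match back unchanged.
    # count + 0 forces a numeric 0 when no capital letter was found.
    awk '{ count += gsub(/[A-Z]/, "&") } END { print count + 0 }' "$1"
}

# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf 'Hello World\nABC def\n' > "$sample"
caps=$(count_caps "$sample")
echo "$caps"

: > "$sample"                      # truncate the file to try the empty case
caps_empty=$(count_caps "$sample")
echo "$caps_empty"
rm -f "$sample"
```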

Related

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, on their second column, contain a word from a comma separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, and neither will "abcdef". It seems so simple, yet after trying for many, many hours I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here:
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234
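To check the array-lookup idea against the sample from the question's EDIT, something like the following can be used. The temporary file and the array name wanted are illustrative only. Unlike grep -w, which accepts a matching word in any column, this test looks at column 2 only.

```shell
filter="PG;TSLA"

# Hypothetical input file holding the sample rows from the question.
trades=$(mktemp)
cat > "$trades" <<'EOF'
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
EOF

matches=$(awk -F';' -v w="$filter" '
BEGIN {
    # Split the filter on ";" and store each word as an array index,
    # so that membership can be tested with the "in" operator.
    n = split(w, a, ";")
    for (i = 1; i <= n; i++) wanted[a[i]] = 1
}
$2 in wanted    # print the line when column 2 is exactly one of the words
' "$trades")

echo "$matches"
rm -f "$trades"
```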

Bash: Keep all lines with duplicate values in column X

I have a file with a few thousand lines and 20+ columns. I now want to keep only the lines that have the same e-mail address in column 3 as in other lines.
file: (First Name; Last Name; E-Mail; ...)
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Jennifer;Lopez;jennifer#lopez.com
Andre;Agassi;tom#boyden.com
Paul;Walker;paul#walker.com
I want to keep ALL lines that have a matching e-mail address. In this case the expected output would be
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Andre;Agassi;tom#boyden.com
If I use
awk -F';' 'seen[$3]++' file
I will lose the first instance of the e-mail address, in this case line 1 and 2 and will keep ONLY the duplicates.
Is there a way to keep all lines?
This awk one-liner will help you:
awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file
It reads the file twice: the first pass counts how often each address occurs, and the second pass prints every line whose address occurred more than once.
With the given input example, it prints:
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Andre;Agassi;tom#boyden.com
If the output order doesn't matter, here's a one-pass approach:
$ awk -F';' '$3 in first{print first[$3] $0; first[$3]=""; next} {first[$3]=$0 ORS}' file
Mike;Tyson;mike#tyson.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Tom;Boyden;tom#boyden.com
Andre;Agassi;tom#boyden.com
Please try the following, which reads Input_file only once in a single awk.
awk '
BEGIN{
FS=";"
}
{
mail[$3]++
mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0
}
END{
for(i in mailVal){
if(mail[i]>1){ print mailVal[i] }
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=";" ##Setting field separator as ; here.
}
{
mail[$3]++ ##Creating array mail indexed by the 3rd field and incrementing its value by 1 for each line.
mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0 ##Creating mailVal indexed by the 3rd field; its value is the current line, concatenated to any earlier lines for the same address with a newline.
}
END{ ##Starting END block of this program from here.
for(i in mailVal){ ##Traversing through mailVal here.
if(mail[i]>1){ print mailVal[i] } ##Checking condition if value is greater than 1 then printing its value here.
}
}
' Input_file ##Mentioning Input_file name here.
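The for (i in mailVal) loop above visits the groups in an unspecified order. If a single read that also keeps the original line order is wanted, one possible variation (a sketch, not taken from any of the answers) is to remember every line by its record number and replay them in END. The array names line, key and count are illustrative, and the whole file is held in memory.

```shell
# Hypothetical input file holding the sample rows from the question.
people=$(mktemp)
cat > "$people" <<'EOF'
Mike;Tyson;mike#tyson.com
Tom;Boyden;tom#boyden.com
Tom;Cruise;mike#tyson.com
Mike;Myers;mike#tyson.com
Jennifer;Lopez;jennifer#lopez.com
Andre;Agassi;tom#boyden.com
Paul;Walker;paul#walker.com
EOF

kept=$(awk -F';' '
{
    line[NR] = $0      # remember each line by its record number
    key[NR]  = $3      # and the address it belongs to
    count[$3]++        # how many times each address occurs
}
END {
    # Replay the lines in input order, keeping only repeated addresses.
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1) print line[i]
}' "$people")

echo "$kept"
rm -f "$people"
```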
I think @ceving just needs to go a little further.
ASSUMING the chosen column is NOT the first or last -
cut -f$col -d\; file | # slice out the right column
tr '[:upper:]' '[:lower:]' | # standardize case
sort | uniq -d | # sort and output only the dups
sed 's/^/;/; s/$/;/;' > dups # save the lowercased keys
grep -iFf dups file > subset.csv # pull matching records
This breaks if the chosen column is the first or last, but should otherwise preserve case and order from the original version.
If it might be the first or last, then pad the stream to that last grep and clean it afterwards -
sed 's/^/;/; s/$/;/;' file | # pad with leading/trailing delims
grep -iFf dups | # grab relevant records
sed 's/^;//; s/;$//;' > subset.csv # strip the padding
Find the duplicate e-mail addresses:
sed -s 's/^.*;/;/;s/$/$/' < file.csv | sort | uniq -d > dups.txt
Report the duplicate csv rows:
grep -f dups.txt file.csv
Update:
As "Ed Morton" pointed out the above commands will fail, when the e-mail addresses contain characters, which have a special meaning in a regular expression. This makes it necessary to escape the e-mail addresses.
One way to do so is to use Perl-compatible regular expressions. In a PCRE, the escape sequences \Q and \E mark the beginning and the end of a string that should not be treated as a regular expression. GNU grep supports PCREs with the option -P, but this cannot be combined with the option -f, which makes it necessary to use something like xargs. However, xargs interprets backslashes and ruins the regular expression; to prevent that, the option -0 is needed.
Lesson learned: it is quite difficult to get this right without programming it in AWK.
sed -s 's/^.*;/;\\Q/;s/$/\\E$/' < file.csv | sort | uniq -d | tr '\n' '\0' > dups.txt
xargs -0 -i < dups.txt grep -P '{}' file.csv

How to count a matching pattern in one line?

I have a fasta file containing sequences
>lcl|QCYY01003067.1_cds_ROT65593.1_2
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>lcl|QCYY01003067.1_cds_ROT65593.1_3
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
>lcl|QCYY01003067.1_cds_ROT65593.1_4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCC
I wanted to count the number of 'N' and also the number of patterns occurring in each line. No need to include header (>lcl|QCYY01003067.1_cds_ROT65593.1_2 )
eg:-
line 2=0,0
line 4=15,2
line 6=10,1
How to improve this code:
grep -n '[{N}]' <filename> | cut -d : -f 1 | uniq -c
Another awk:
$ awk 'NR%2==0{printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"")}' file
Output:
line 2=0,0
line 4=15,2
line 6=10,1
Explained:
$ awk '
NR%2==0 { # process even records
printf "line %d=%d,%d\n",NR,gsub(/N/,"N"),gsub(/N+/,"") # count with gsub
}' file
gsub(/N/,"N") counts the number of Ns in the record (gsub returns the number of replacements it made). gsub(/N+/,"") counts the number of runs of consecutive Ns. Notice that "" removes those Ns from the record, so if you need to process the data further afterwards, use gsub(/N+/,"&") instead.
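A small sketch of the non-destructive form described above, run on the sample from the question. Both gsub calls use "&", so the record is left intact, and header lines are skipped with !/^>/ rather than by line parity (a variation, in case sequences span an odd number of lines). The temporary file and variable names are illustrative only.

```shell
# Hypothetical input file holding the sample sequences from the question.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>lcl|QCYY01003067.1_cds_ROT65593.1_2
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>lcl|QCYY01003067.1_cds_ROT65593.1_3
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
>lcl|QCYY01003067.1_cds_ROT65593.1_4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCC
EOF

n_counts=$(awk '
!/^>/ {
    total = gsub(/N/, "&")     # every single N, replaced by itself
    runs  = gsub(/N+/, "&")    # every run of consecutive Ns, replaced by itself
    printf "line %d=%d,%d\n", NR, total, runs
}' "$fasta")

echo "$n_counts"
rm -f "$fasta"
```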
Updated:
The version I wrote for your already-deleted next question.
I added an extra line to your data which demonstrates the question I asked in the comments (is ...N\nNN.. one (NNN) or two (N,NN) patterns of your definition):
...
>seq4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
This one is for GNU awk (for using RT):
$ gawk '
BEGIN {
RS=">seq[^\n]+"
}
NR>1 {
# gsub(/\n/,"") # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN
printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
}
{
rt=RT
}' file
Output (pay special attention to the seq4):
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,3
or if you uncomment the gsub(/\n/,"") to remove the newline separating strings, the output is:
>seq1=0,0
>seq2=15,2
>seq3=15,2
>seq4=18,2
One-liner (with the one gsub uncommented):
$ gawk 'BEGIN{RS=">seq[^\n]+"}NR>1{gsub(/\n/,"");printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")}{rt=RT}' file
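Since RT is specific to GNU awk, here is a rough sketch of the same per-sequence count that avoids it by collecting the sequence lines of each record by hand. It treats a run of Ns that continues across a line break as one pattern, which corresponds to the variant with gsub(/\n/,"") enabled. The function name report, the variable names, and the two-record sample file are assumptions made for illustration.

```shell
# Hypothetical two-record sample; the seq4 lines are the ones shown above.
seqs=$(mktemp)
cat > "$seqs" <<'EOF'
>seq1
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
>seq4
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCNNN
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
EOF

per_seq=$(awk '
# Print the counts for the record collected so far (if there is one).
function report(   total, runs) {
    if (header == "") return
    total = gsub(/N/, "&", seq)     # number of single Ns
    runs  = gsub(/N+/, "&", seq)    # number of runs of Ns
    printf "%s=%d,%d\n", header, total, runs
}
/^>/ { report(); header = $0; seq = ""; next }   # a new header closes the previous record
     { seq = seq $0 }                            # join sequence lines without the newline
END  { report() }                                # do not forget the last record
' "$seqs")

echo "$per_seq"
rm -f "$seqs"
```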
Please try the following.
awk '
!/^>/{
while(match($0,/N+/)){
count++
total+=length(substr($0,RSTART,RLENGTH))
$0=substr($0,RSTART+RLENGTH)
}
printf("%s %d=%d,%d\n","line",FNR,total,count)
count=total=""
}
' Input_file
Output will be as follows.
line 2=0,0
line 4=15,2
line 6=10,1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!/^>/{ ##Checking condition if a line is NOT starting from > then do following.
while(match($0,/N+/)){ ##Running a while loop which will run till a match found for N characters continuous occurrence.
count++ ##Doing increment to variable count with 1 each time cursor comes here.
total+=length(substr($0,RSTART,RLENGTH)) ##Creating total variable which is keep adding its own value along with length of matched regex, where regex is looking for continuous occurrence of N character in current line.
$0=substr($0,RSTART+RLENGTH) ##Resetting value of current line to have only REST of line which starts from very next character of matched regex. So that we can skip previous matched regex and look for others in rest of the line.
} ##Closing BLOCK for above mentioned while loop here.
printf("%s %d=%d,%d\n","line",FNR,total,count) ##Printing values line,FNR,total,count variables here.
count=total="" ##Nullifying variables count and total here, so that previous values should NOT be added to current values of it.
}
' Input_file ##Mentioning Input_file name here.

using awk and gensub to remove the part in a string ending with "character+number+S"

My goal is to remove the end "1S" as well as the letter immediately before it, in this case "M". How do I achieve that? My non-working code :
echo "14M3856N61M1S" | gawk '{gensub(/([^(1S)]*)[a-zA-Z](1S$)/, "\\1", "g") ; print $0}'
>14M3856N61M1S
The desired results should be
>14M3856N61
Some additional information: 1. I do not think substr will work here, since my actual target strings come in various lengths. 2. I would prefer not to define a special delimiter, because this will be used together with "if" as part of an awk conditional operation, while the delimiter is already defined globally.
Thank you in advance!
Why not use a simple substitution to match the 1S at the last and match any character before it?
echo "14M3856N61M1S" | awk '{sub(/[[:alnum:]]{1}1S$/,"")}1'
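14M3856N61
Here [[:alnum:]] is the POSIX character class that matches alphanumeric characters (letters and digits), and {1} means exactly one occurrence. The {1} is redundant, since a bracket expression already matches a single character, and some older awk implementations do not support interval expressions, so it can be dropped. If you are sure that only letters can occur before the 1S, replace [[:alnum:]] with [[:alpha:]].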
To answer OP's question to put the match result on a separate variable, use match() as sub() does not return the substituted string but only the count of number of substitutions made.
echo "14M3856N61M1S" | awk 'match($0,/[[:alnum:]]{1}1S$/){str=substr($0,1,RSTART-1); print str}'
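Another way to get the result into a separate awk variable, sketched here as an alternative to match(), is to copy $0 first and run sub() on the copy, since sub() accepts a target as its third argument. The helper name trim_suffix is hypothetical, and [a-zA-Z] is used on the assumption that only a letter precedes the 1S.

```shell
# trim_suffix is a hypothetical helper name used only for this demonstration.
trim_suffix() {
    # Copy the line into str, then strip "<letter>1S" from the end of the copy;
    # $0 itself is left untouched.
    printf '%s\n' "$1" | awk '{ str = $0; sub(/[a-zA-Z]1S$/, "", str); print str }'
}

trimmed=$(trim_suffix "14M3856N61M1S")     # ends in M1S, so the suffix is removed
untouched=$(trim_suffix "14M3856N61M2S")   # does not end in 1S, so nothing changes

echo "$trimmed"
echo "$untouched"
```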
EDIT: As per OP's comment I am adding solutions where OP could get the result into a bash variable too as follows.
var=$(echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){print substr($0,1,RSTART-1)}' )
echo "$var"
14M3856N61
Please try the following too.
echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){$0=substr($0,1,RSTART-1)} 1'
14M3856N61
Explanation of above command:
echo "14M3856N61M1S" | ##Print the sample string and pipe it to awk as standard input.
awk ' ##Start of the awk program.
match($0,/[a-zA-Z]1S$/){ ##match() looks for a letter (lower or upper case) followed by 1S at the end of the line; on success it sets RSTART and RLENGTH.
$0=substr($0,1,RSTART-1) ##If there was a match, keep only the part of the line before it, i.e. characters 1 to RSTART-1.
} ##End of the match block.
1' ##The pattern 1 is always true, so every line is printed, edited or not.
echo "14M3856N61M1S" | awk -F '.1S$' '{print $1}'
Output:
14M3856N61

sed squeeze multiple occurrence of word

I have text file with lines like below:
this is the code ;rfc1234;rfc1234
this is the code ;rfc1234;rfc1234;rfc1234;rfc1234
How can I squeeze the repeating words in the file into a single word, like below:
this is the code ;rfc1234
this is the code ;rfc1234
I tried 'tr' command but it's limited to squeezing characters only
With sed, for arbitrary repeated strings prefixed with ;
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
or, if you want to delete everything after the first token without checking whether they match the preceding one or not
$ sed -E 's/(\S);.*/\1/' file
Explanation
(;[^;]+) is to capture a string starting with semicolon
(\1)+ followed by the same captured string one or more times
/\1/g replace the whole chain with one instance, and repeat
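The first command can be tried on the two sample lines from the question like this. It is only a sketch and assumes a sed that accepts back-references with -E, as GNU sed does. The variable name squeezed is illustrative.

```shell
# Feed the two sample lines from the question through the substitution.
squeezed=$(printf '%s\n' \
    'this is the code ;rfc1234;rfc1234' \
    'this is the code ;rfc1234;rfc1234;rfc1234;rfc1234' |
    sed -E 's/(;[^;]+)(\1)+/\1/g')   # group 1 = ";token", (\1)+ = its repeats

echo "$squeezed"
```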
The following awk may help here. It looks at all items in the last column of your Input_file and keeps only the unique values in it.
awk '{num=split($NF,array,";");for(i=1;i<=num;i++){if(!array1[array[i]]++){val=val?val ";" array[i]:array[i]}};NF--;print $0";"val;val="";delete array;delete array1}' Input_file
Adding a non-one liner form of solution too now.
awk '
{
num=split($NF,array,";");
for(i=1;i<=num;i++){
if(!array1[array[i]]++){
val=val?val ";" array[i]:array[i]}
};
NF--;
print $0";"val;
val="";
delete array;
delete array1
}' Input_file
Explanation:
awk '
{
num=split($NF,array,";"); ##Creating a variable named num whose value is length of array named array, which is created on last field of line with ; as a delimiter.
for(i=1;i<=num;i++){ ##Starting a for loop from i=1 to till value of num each time increment i as 1.
if(!array1[array[i]]++){ ##Checking whether the value array[i] has not been seen before in array1; if so, do the following.
val=val?val ";" array[i]:array[i]} ##Appending array[i] to val, separated by ; when val already has a value.
};
NF--; ##Reducing the value of NF(number of fields) in current line to remove the last field from it.
print $0";"val; ##Printing the current line(without last field) ; and then value of val here.
val=""; ##Nullifying variable val here.
delete array; ##Deleting array named array here.
delete array1 ##Deleting array named array1 here.
}' Input_file ##Mentioning Input_file name here.
I started playing around with s/(.+)\1/\1/g. It seemed to work with perl (even found the is_is_) but didn't quite take me there:
$ perl -pe 's/(.+)\1+/\1/g' file
this the code ;rfc1234
this the code ;rfc1234;rfc1234
sed 's/\(;[^;]*\).*/\1/' file
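You can use the commands below to achieve this. The 2g flag, which replaces the second and all later matches, is a GNU sed extension, and the leading ; is part of the pattern so that no stray semicolons are left behind:
echo "this is the code ;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
echo "this is the code ;rfc1234;rfc1234;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
or
sed 's/;rfc1234//2g' yourfile.txt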
This might work for you (GNU sed):
sed -r ':a;s/(\S+)\1+/\1/g;ta' file
The regex is repeated until only the first pattern remains.
