Grep-ing for a line beginning with a Maven version number - bash

I'm trying to grep a file for a line that begins with a version number of the form:
X.Y.Z
where X, Y and Z are numbers between 0 and infinity.
As an example, say the line of interest begins with 20.2.3
The following will return a result if the first character of the line is a digit:
grep ^[0-9]
The result is:
20.2.3
where only the leading 2 is what grep has 'matched on'.
However this will also match lines beginning 4000-43 which I do not want.
So in my regex naivety I tried the following grep:
grep ^[0-9]+\.[0-9]+\.[0-9]
thinking this would match any line beginning with any number followed by two other numbers separated by decimal-points. But it does not.
If I try:
grep ^[0-9]+
it doesn't match anything at all.
How do I modify my regex to match the number format I'm looking for?

In BRE (basic regular expressions) mode, which is grep's default, the + (one or more) quantifier must be escaped with \:
grep '^[0-9]\+\.[0-9]\+\.[0-9]' <<<"20.2.3"
20.2.3
Alternatively, use the -E option to enable ERE (extended regular expressions) mode, where + needs no escaping:
grep -E '^[0-9]+\.[0-9]+\.[0-9]' <<<"20.2.3"
20.2.3
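A minimal sketch comparing the two modes on a few made-up lines, assuming GNU grep (the \+ escape in BRE is a GNU extension). The sample text and variable names are only for illustration, and the pattern here also ends in + so the third number may have several digits.

```shell
# Hypothetical sample input: one version line and two lines that must not match.
sample='20.2.3 first
4000-43 second
1.2 third'

# BRE (default mode): the quantifier has to be written as \+
bre_result=$(printf '%s\n' "$sample" | grep '^[0-9]\+\.[0-9]\+\.[0-9]\+')

# ERE (-E): a bare + is already a quantifier
ere_result=$(printf '%s\n' "$sample" | grep -E '^[0-9]+\.[0-9]+\.[0-9]+')

printf 'BRE: %s\nERE: %s\n' "$bre_result" "$ere_result"
```

Both commands should select only the 20.2.3 line; the 4000-43 line is rejected because a literal dot is required after the first number.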

Your pattern is nearly right; just add \ before each +.
grep "^[0-9]\+\.[0-9]\+\.[0-9]" <filename>

Since you want lines starting with digits and periods but not starting with a period (-o weeds out anything following the version number):
$ echo 20.2.3 foo |
grep -o '^[0-9][0-9.]\+'
20.2.3

Related

Grep pattern matching at most n times using Perl flag

I am trying to get a specific pattern from a text file. The pattern should start with chr followed by a digit appearing at most 2 times, or a letter X or Y appearing exactly 1 time, and then an underscore appearing also one time. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this code on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt, which only gives me back the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2 but could be any digit/s).
The thing is that if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed from {,2} to {1,2}) then I get back what I expected. I can't figure out why the first option is not working. I thought in regexp you could specify that a pattern was matched at most n times with the syntax {,N}.
Thanks in advance!
Note that you chose the PCRE regex engine with your grep due to the -P option.
The \d{,2} does not match zero to two digits, it matches a digit and then a {,2} string. See the regex demo.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that POSIX regex flavor is not that strict, it allows omitting the minimum threshold value from the limiting quantifier (then it is assumed to be 0):
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
Note
I'd advise to always specify the 0 min value since the behavior varies from engine to engine. In TRE regex flavor used in R as the default base R regex engine, omitting the zero leads to a bug.
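As a rough illustration of writing both bounds out, here is a sketch run against the sample lines from the question. It uses ERE with {1,2} rather than {0,2}, on the assumption that at least one digit is wanted after chr; the variable names are made up.

```shell
# Sample lines taken from the question.
input='chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593'

# {1,2}: one or two digits, with minimum and maximum both stated explicitly.
matches=$(printf '%s\n' "$input" | grep -E '^chr([0-9]{1,2}|[XY])_')
printf '%s\n' "$matches"
```

Only the four lines that have an underscore directly after the chromosome name are kept.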
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem {,n} is not supported in Perl-compatible regex.
Using grep ERE instead
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593
The minimum is missing from {,2}.
Try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead.
This also seems to work:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt

How to count frequency of a word without counting compound words in bash?

I am using this to count the frequency in a text file using bash.
grep -ow -i "and" $1 | wc -l
It counts all the and in the file, including those that are part of compound words, like jerry-and-jeorge. These I wish to ignore and count all other independent and.
With GNU grep, you can use the following command to count occurrences of and that are not enclosed in hyphens:
grep -ioP '\b(?<!-)and\b(?!-)' "$1" | wc -l
Details:
The -P option enables the PCRE regex syntax
\b(?<!-)and\b(?!-) matches
\b - a word boundary
(?<!-) - a negative lookbehind that fails the match if there is a hyphen immediately to the left of the current location
and - a fixed string
\b - a word boundary
(?!-) - a negative lookahead that fails the match if there is a hyphen immediately to the right of the current location.
See the online demo:
#!/bin/bash
s='jerry-and-jeorge, and, aNd, And.'
grep -ioP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l
# => 3 (not 4)

With a shell script, how to find number lines in text

I'm trying to find lines which match the pattern x.y or x.y.z, where x, y and z are numbers.
For example, given the lines:
1.0/
2.2.5rc1/
2.3.0/
2.3.1/
abc-1.0.0/
the result should be:
1.0
2.3.0
2.3.1
How can I do this?
Things to know:
Call grep in extended mode using -E.
Start the pattern with a ^ to signify you want the search to start at the first character.
To search for a digit, use \d
To search for a dot, use \.
To search for thing1 OR thing2, use thing1|thing2.
Note: As Jonathan Leffler pointed out below, \d is a notation that might not work across all versions of grep. Use [0-9] or [[:digit:]] to be compliant with POSIX-standard implementations of grep.
Knowing that, we put it together like so:
grep -E "^([0-9]+\.[0-9]+|[0-9]+\.[0-9]+\.[0-9]+)/" yourfile
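A sketch of the same idea run against the sample lines from the question. It uses [0-9] in place of \d, as the note above suggests, escapes the dots, and strips the trailing slash with sed so the output has the form the question asks for. The variable names are made up for illustration.

```shell
# Sample lines from the question.
lines='1.0/
2.2.5rc1/
2.3.0/
2.3.1/
abc-1.0.0/'

# Match x.y/ or x.y.z/ anchored at the start of the line,
# then remove the trailing slash from each selected line.
versions=$(printf '%s\n' "$lines" | grep -E '^[0-9]+(\.[0-9]+){1,2}/' | sed 's|/$||')
printf '%s\n' "$versions"
```

Requiring the slash right after the last number is what keeps 2.2.5rc1/ out of the result.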
You can do
grep -C 2 yourSearch yourFile
To send the output to a file, do
grep -C 2 yourSearch yourFile > result.txt
Hope it helps!

grep excluding first char

How do I find a line where a pattern is in the middle of the line? In the following example, I want to get only the 8th line and exclude the 1st and 5th lines when grepping for "#".
I know I would use grep "^#" to match only at the first character, but how do I exclude it?
#DD65WKN1:203:H7T67ADXX:2:2216:19936:100494 1:N:0:
GTCGTTCTTCAGGTTCTC
+
FFFFFIIIIFFFIFFFFF
#DD65WKN1:203:H7T67ADXX:2:2216:6629:100501 1:N:0:
TAAAGTAGCAAAAATG
+
FFFFFFFFIFBFIFFF#DD65WKN1:203:H7T67ADXX:2:2216:6629:100501 1:N:0:
TAAAGTAGCAAAAATG
+
FFFFFFFFIFBFIFFF
Thanks
You can match any character beforehand, so that # won't be matched if just in the first position:
$ grep '.#' file
FFFFFFFFIFBFIFFF#DD65WKN1:203:H7T67ADXX:2:2216:6629:100501 1:N:0:
Note that . matches any character. To be completely sure (first solution would match a line starting with ##), you can negate # by using:
grep '[^#]#' file
Or also indicate that you want to find any line starting with a no-# set of characters (at least one, as indicated by +).
grep '^[^#]\+#' file
Use grep with Perl-regex option which supports negative lookbehind.
$ grep -P '(?<!^)#' file
FFFFFFFFIFBFIFFF#DD65WKN1:203:H7T67ADXX:2:2216:6629:100501 1:N:0:
The above grep command prints every line containing a # symbol that is not at the beginning of the line.
The best thing about unix filters is combining them
grep --invert-match '^#' file | grep '#'
or more traditionally
sed '/^#/d' file | grep '#'
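To see how the simpler patterns differ, here is a small sketch on made-up lines (shortened from the question's data, plus one hypothetical line starting with ##). The variable names are only for illustration.

```shell
# Hypothetical sample: a header line, a sequence line, a line with # in the
# middle, and a line that starts with two # characters.
sample='#DD65WKN1 header
GTCGTTCTTCAGG
FFFFFFFFIFBFIFFF#DD65WKN1 trailing
##double'

# '.#' accepts any character before #, so the ## line also matches.
dot_result=$(printf '%s\n' "$sample" | grep '.#')

# '[^#]#' requires a non-# character before #, so the ## line is rejected.
neg_result=$(printf '%s\n' "$sample" | grep '[^#]#')

printf 'dot pattern:\n%s\nnegated class:\n%s\n' "$dot_result" "$neg_result"
```

Which behaviour is right depends on whether a line such as ##double should count as having # "in the middle".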

Using BASH, how to increment a number that uniquely only occurs once in most lines of an HTML file?

The target is always going to be between two characters, 'E' and '/' and there will never be but one occurrence of this combination, e.g. 'E01/' in most lines in the HTML file and will always be between '01' and '90'.
So, I need to programmatically read the file and replace each occurrence of 'Enn/' where 'nn' in 'Enn/' will be between '01' and '90' and must maintain the '0' for numbers '01' to '09' in 'Enn/' while incrementing the existing number by 1 throughout the HTML file.
Is this doable and if so how best to go about it?
Edit: Target lines will be in one or the other formats:
<DT>ProgramName
<DT>Program Name
You can use sed inside BASH as a fantastic one-liner, either:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+(10#\2>=90?0:1)))/ge' FILENAME
or if you are guaranteed the number is lower than 100:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+1))/ge' FILENAME
Basically, you'll be doing an in-place search and replace. The first command will not increment past 90 (since you didn't specify the exact nature of the overflow condition). So E89/ -> E90/, E90/ -> E90/, and if by chance you have E91/, it will remain E91/. Add this line inside a loop for multiple files.
A small explanation of the above command:
-r enables extended regular expressions (ERE), so groups and {2} need no backslashes
-i states to write back to the same file (be careful with overwriting!)
s/search/replace/ge this is the regex command you'll be using
s/ starts the substitute (search and replace) command
(.*E) first grouping: all characters up to and including the E (case sensitive)
([0-9]{2}) second grouping of numbers 0 through 9, repeated twice (fixed width)
(\/.*) third grouping: the escaped trailing slash and everything after that
/ (slash separator) denotes end of search pattern and beginning of replacement pattern
printf "format" var this is the expression used for each replacement
\1 place first grouping found here
%02u the replace format for the var
\3 place third grouping found here
$((expression)) BASH arithmetic expression to use in printf format
10#\2 force second grouping as a base 10 number
+(10#\2>=90?0:1) add 0 or 1 to the second grouping based on if it is >= 90 (as used in first command)
+1 add 1 to the second grouping (see second command)
/ge flags: g for global replacement, e (GNU sed only) to execute the resulting pattern space as a shell command and use its output
GNU sed and awk are very powerful tools to do this sort of thing.
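For comparison, a minimal sketch of the same increment done with shell parameter expansion only, with no sed involved. Everything here is an assumption for illustration: increment_tag is a hypothetical helper, the sample strings are made up, and a single leading zero is stripped instead of using bash's 10# prefix so that the sketch also runs under plain sh. It does not cap the value at 90.

```shell
# increment_tag LINE: print LINE with the first Enn/ incremented by one.
# Hypothetical helper; lines without an Enn/ token are printed unchanged.
increment_tag() {
    line=$1
    case $line in
        *E[0-9][0-9]/*)
            prefix=${line%%E[0-9][0-9]/*}   # text before the first Enn/
            rest=${line#"$prefix"E}         # "nn/..." that follows the E
            num=${rest%%/*}                 # the two digits
            suffix=${rest#"$num"}           # the slash and everything after it
            # Strip one leading zero so 08 and 09 are not read as invalid octal.
            next=$(( ${num#0} + 1 ))
            printf '%sE%02d%s\n' "$prefix" "$next" "$suffix"
            ;;
        *)
            printf '%s\n' "$line"
            ;;
    esac
}

increment_tag 'show/E08/index.html'
increment_tag 'show/E89/index.html'
increment_tag 'no tag here'
```

To process a whole file, the function could be called from a while IFS= read -r line loop, writing to a new file rather than editing in place.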
You can use the following perl one-liner to increment the numbers while maintaining the ones with leading 0s.
perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
$ cat file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
$ perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
You can add the -i option to make changes in-place. I would recommend creating backup before doing so.
Not as elegant as one line sed!
Break the commands used into multiple commands and you can debug your bash or grep or sed.
# find the number
# use -o to grep to just return pattern
# use head -n1 for safety to just get 1 number
n=$(grep -o "E[0-9][0-9]\/" file.html |grep -o "[0-9][0-9]"|head -n1)
# 08 and 09 would be read as invalid octal, so force base 10
n1=$((10#$n))
echo Debug n1=$n1 n=$n
n2=$n1
# bash arithmetic done inside (( ))
# as ever with bash bracketing whitespace is needed
(( n2++ ))
echo debug n2=$n2
# use sed with -i -e for inline edit to replace number
sed -i -e "s/E$n\//E$(printf '%02d' $n2)\//" file.html
grep "E[0-9][0-9]" file.html
awk might be better. Maybe could do it in one awk command also.
The sed one-liner in other answer is awesome :-)
This works in bash (the 10# prefix and (( )) are not available in plain POSIX sh).
http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Resources