Grep pattern matching at most n times using Perl flag - bash

I am trying to get a specific pattern from a text file. The pattern should start with chr followed by a digit appearing at most 2 times, or a letter X or Y appearing exactly 1 time, and then an underscore appearing also one time. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this code on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt, which only gives me back the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2 but could be any digit/s).
The thing is that if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed from {,2} to {1,2}) then I get back what I expected. I can't figure out why the first option is not working. I thought in regexp you could specify that a pattern was matched at most n times with the syntax {,N}.
Thanks in advance!

Note that the -P option selects the PCRE regex engine for your grep.
In that engine, \d{,2} does not match zero to two digits; it matches a digit followed by the literal string {,2}. See the regex demo.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that the POSIX ERE flavor, as implemented in GNU grep, is not that strict: it allows omitting the minimum threshold value from the limiting quantifier (it is then assumed to be 0):
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
See the online demo.
Note
I'd advise to always specify the 0 min value since the behavior varies from engine to engine. In TRE regex flavor used in R as the default base R regex engine, omitting the zero leads to a bug.
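To make the advice about explicit bounds concrete, here is a minimal sketch. It uses grep -E so that it does not depend on PCRE support, the sample lines (including the extra chr_ line) are made up for illustration and only modelled on the question's input, and the function names are hypothetical.

```shell
# Hypothetical sample data, modelled on the question's input plus a "chr_" line.
sample() {
  printf '%s\n' 'chr5_ 16560869' 'chrX 46042911' 'chr12_ 131428407' 'chr_ 1' 'chrX_ 96055593'
}

# Explicit lower bound of 1: one or two digits, or X/Y, then an underscore.
match_one_or_two() {
  sample | grep -E '^chr([0-9]{1,2}|[XY])_'
}

# Explicit lower bound of 0: this also accepts "chr_" with no digits at all.
match_zero_to_two() {
  sample | grep -E '^chr([0-9]{0,2}|[XY])_'
}

match_one_or_two
echo '---'
match_zero_to_two
```

With {0,2} the chr_ line is printed as well, which is probably not wanted for this data, so {1,2} looks like the better fit. Some newer PCRE2 releases are reported to accept {,n} as a quantifier, so the -P behaviour may differ by version; writing both bounds avoids relying on either reading.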

-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem {,n} is not supported in Perl-compatible regexes.
Using grep ERE instead
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593

The lower bound is missing from the {,2} quantifier.
Give this a try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead.
This other one seems to work too:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt

Related

Grep-ing for a line beginning with a Maven version number

I'm trying to grep a file for a line that begins with a version number of the form:
X.Y.Z
where X, Y and Z are numbers between 0 and infinity.
As an example, say the line of interest begins with 20.2.3
The following will return a result if the first character of the line is a digit:
grep ^[0-9]
The result is:
20.2.3
where the bold indicates what grep has 'matched on'.
However this will also match lines beginning 4000-43 which I do not want.
So in my regex naivety I tried the following grep:
grep ^[0-9]+\.[0-9]+\.[0-9]
thinking this would match any line beginning with any number followed by two other numbers separated by decimal-points. But it does not.
If I try:
grep ^[0-9]+
it doesn't match anything at all.
How do I modify my regex to match the number format I'm looking for?
The + (one or more) quantifier must be escaped with \ in BRE (basic regular expressions) mode, which is grep's default (\+ is a GNU extension to BRE):
grep '^[0-9]\+\.[0-9]\+\.[0-9]' <<<"20.2.3"
20.2.3
Otherwise, to make it work - use -E option to enable ERE (extended regular expressions) mode:
grep -E '^[0-9]+\.[0-9]+\.[0-9]' <<<"20.2.3"
20.2.3
Your attempt is pretty close; just add \ before each +.
grep "^[0-9]\+\.[0-9]\+\.[0-9]" <filename>
Since you want lines starting with digits and periods, but not starting with a period, use -o to print only the version number and drop anything that follows it:
$ echo 20.2.3 foo |
grep -o '^[0-9][0-9.]\+'
20.2.3
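As a small check of the ERE form against the cases mentioned in the question, the following sketch feeds a few made-up lines through the pattern. The function name and the sample lines are assumptions for illustration; the third component uses [0-9]+ so that multi-digit patch numbers are covered too.

```shell
# Keep only lines that begin with an X.Y.Z version number (ERE syntax).
version_lines() {
  grep -E '^[0-9]+\.[0-9]+\.[0-9]+'
}

# Hypothetical input: only the first line starts with X.Y.Z.
printf '%s\n' '20.2.3 release' '4000-43 other' '1.2 short' '.5.6.7 dot' | version_lines
```

The 4000-43 line is rejected because the pattern insists on a literal dot after the first run of digits.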

grep find letters between two spaces

I have to find words.
In my assignment a word is defined as letters between two spaces (" bla "). I have to find a decimalIntegerConstant like this but it has to be a word.
I use
grep -E -o " (0|[1-9]+[0-9]*)([Ll]?) "
but it doesn't work on, for example:
bla 0l labl 2 3 abla0La 0L sfdgpočítačsd
Output is
0l
2
0L
but 3 is missing.
Matches don't overlap. Your regex has already matched " 2 ", including the blank after 2. That blank is consumed, so it is not available as the leading blank of a match for 3.
POSIX grep cannot do what you want in one step, but you can do something like this in two stages (simplified from your regex, doesn't support [lL]):
grep -o ' [0-9 ]* ' | grep -E -o '[0-9]+'
That is, match a sequence of space-separated numbers with leading and trailing spaces, and from that, match individual numbers regardless of spaces. De-simplify the definition of number to suit your needs.
Perl-compatible regular expressions have a way to match stuff without consuming it, for example, as mentioned in the comments:
grep -oP " (0|[1-9]+[0-9]*)[Ll]?(?= )"
(?= ) is a lookahead assertion, which means grep will look ahead in the input stream and make sure the match is followed by a space. The space will not be considered a part of the match and will not be consumed. When no space is found, the match fails.
PCRE support is not guaranteed in all implementations of grep.
Edit: -o is not specified by POSIX either.
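If neither -P nor -o can be relied on, a different technique avoids the overlap problem entirely: split each line into one token per line and match whole tokens with grep -x. The sketch below is only an illustration of that idea, not the approach from the answers above; the function name is hypothetical, and it assumes words are separated by plain spaces.

```shell
# Print every "word" (a token with a space on both sides) that is a decimal
# integer constant, optionally suffixed with l or L.
integer_words() {
  sed 's/^[^ ]*//; s/[^ ]*$//' |   # drop tokens touching the start or end of the line
    tr -s ' ' '\n' |               # one token per line
    grep -E -x '(0|[1-9][0-9]*)[lL]?'
}

printf '%s\n' 'bla 0l labl 2 3 abla0La 0L sfdgpočítačsd' | integer_words
```

Because every token is tested on its own line, the shared space between 2 and 3 no longer matters.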

how to use grep to search for 2 or more parentheses and words capitalization

I'm trying to search for a couple of strings in a path that includes about 80 txt files.
I'm trying to search for !!, ??, ;, capitalization, and parentheses.
I'm also trying to search for if there are more than 4 words capitalized, but I just didn't know how to do that
Here is what I did:
grep -lr '!!\|??\|;\|(.*(' path
Can someone help me with it?
Here is a sample input:
file1.txt:
ryan went over there !!
file2.txt:
am I going there??
file3.txt:
how about I GO TO THE PARK TODAY and not TOMORROW
file4.txt:
This is (not) (valid)
file5.txt:
to go; or not to go
the output should be something like this:
path/file1.txt
path/file2.txt
path/file3.txt
path/file5.txt
Try this regex:
grep -Er '\?\?|!!|\(.+\).+\(.+\)|([A-Z]+\b.){4,}|;' /path/to/files/*.txt
Output:
./1.txt:ryan went over there !!
./2.txt:am I going there??
./3.txt:how about I GO TO THE PARK TODAY and not TOMORROW
./4.txt:This is (not) (valid)
./5.txt:to go; or not to go
grep -Elr will output:
./1.txt
./2.txt
./3.txt
./4.txt
./5.txt
The regex searches for:
??
!!
() used at least twice on a line
Four or more capitalized words on a line
;
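For the "more than 4 capitalized words" part, counting can be easier to reason about than a single regex. The sketch below is one possible way to do it, under two assumptions that are not in the question: a capitalized word means a token made only of uppercase letters, and it must be at least two letters long so that "I" is not counted. The function names are hypothetical.

```shell
# Count all-caps words of two or more letters on standard input.
count_caps_words() {
  tr -cs '[:alpha:]' '\n' | grep -c -x -E '[[:upper:]]{2,}'
}

# Succeed when the given file contains more than 4 such words.
has_many_caps() {
  [ "$(count_caps_words < "$1")" -gt 4 ]
}

printf '%s\n' 'how about I GO TO THE PARK TODAY and not TOMORROW' | count_caps_words
```

A loop such as for f in path/*.txt; do has_many_caps "$f" && echo "$f"; done would then list the matching files. Note that this counts across the whole file rather than per line.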
grep -lr '!!\|??\|;\|(.*(' path
is what you want. (.*( will match a line containing (at least) two open parentheses with arbitrary text in between.
For readability, you might try
grep -lr -e '!!' -e '??' -e ';' -e '(.*(' path
Your notation is off. In modern grep, you need to backslash the braces, just like you backslash the vertical bar for alternation. More conveniently, you might want to switch to grep -E for backslashless syntax; but then you will need \( to match a literal opening parenthesis.
But either way, inside the braces, there can only be a maximum of two numbers: the lower and the upper bound for the number of repetitions.
However, in this case, because there is no limiting context, \({2} will match the first two of an arbitrarily large number of opening parentheses. In other words, \({2,4} will not fail to match if there are more than four parens (though the actual match will end after four, as you will be able to see e.g. with grep -o). If you need to limit to no longer than four, you will need to supply some sort of trailing context, such as ($|[^(]).
To find a line containing more than one but fewer than five nonadjacent opening parens, try something like
^[^(]*(\([^(]*){2,4}$

How to display all lines in file where last field is one digit

Could someone please help me out with this question?
Display all of the lines in the file where the last field is one digit long.
Search for ',' before the field, then use a character class to make sure
it's one digit, and anchor it to the end of the line to make sure it's the
last field.
I have tried :
grep ",[0-9]{1}$" inventory
grep ",[.]{1)$" inventory
grep ",[/d]$" inventory
This will work:
grep ',[0-9]$' inventory
This will work also:
grep ',[[:digit:]]$' inventory
Try:
grep ',[0-9]$' inventory
You don't need the {1} quantifier, since that's the normal meaning of any unquantified regular expression.
grep uses basic regular expressions, and doesn't support \d to represent digits, so you have to use [0-9].
Grep uses basic regular expressions (BREs), so you need to escape the {}.
Since you want exactly one, you don't need {1} at all.
[.] removes the special meaning of ., only matching a literal period.
[/d] has two problems: (1) the digit shorthand is \d, not /d; (2) as written it matches / or d, and in a POSIX bracket expression even [\d] would match \ or d.
You want ,[0-9]$.
It should be sufficient to use:
grep ',[0-9]$' inventory
You could make the first work with the -E option, but the repetition count seems like overkill:
grep -E ',[0-9]{1}$' inventory
The second is unrescuable because of the mismatch between { and ), and the [.] only matches a dot at the end of the line after the comma.
The last would work if your grep supports -P (Perl) and you fix the regex:
grep -P ',\d$' inventory
You don't need the explicit character class brackets because \d is already the 'digits' character class. OTOH, it works OK if you use ,[\d]$. The backslash is crucial.
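To see the recommended pattern at work, here is a minimal sketch run against made-up data. The real inventory file is not shown in the question, so the comma-separated sample lines and the function name are assumptions.

```shell
# Keep lines whose last comma-separated field is exactly one digit.
last_field_one_digit() {
  grep ',[0-9]$'
}

# Hypothetical inventory lines: only the first and third should be printed.
printf '%s\n' 'widget,12,7' 'gadget,3,42' 'bolt,5' 'nut,100,' | last_field_one_digit
```

The comma in the pattern is what stops 42 from matching: the character before the final digit has to be the field separator.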

replacing a string using SED doesn't work

Consider this line:
--dump-config=h264_sss-l2-2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16-500.stat configs/example/cmp.py --l2cache -b h264_sss
and this string "l2-2-ghb-16". To change that string with SED, I ran this command:
sed 's/l2-.*-.*-.*-/l2-2-ghb-8-m-/g'
But the whole line then changed to
--dump-config=h264_sss-l2-2-ghb-8-m-b h264_sss
What is the problem?
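The .* portion matches the longest possible stretch of characters it can while still letting the pattern match. So the first .* doesn't match just 2 as you hope; in effect it runs all the way to the end of configs/example/cmp.py (plus the following space), leaving only the last three dashes on the line (the two in --l2cache and the one in -b) for the rest of the pattern. To make it work, replace the dots with [^-] (any non-dash character). So,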
sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
That regex is greedy inasmuch as .* will match the maximum number of characters.
That means it will attempt to stretch the match from what you think is the first pattern all the way to what you think is the second.
While you may think there are two matched patterns, the fact that this stretching is happening means that there's only one, and it's longer than you think.
A quick fix is to ensure that it doesn't match beyond the next - character with something like:
sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
as per the following transcript:
pax> echo '--dump-config=h264_sss-l2-2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16-500.stat configs/example/cmp.py --l2cache -b h264_sss' | sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
--dump-config=h264_sss-l2-2-ghb-8-m-500.ini --stats-file=h264_sss-l2-2-ghb-8-m-500.stat configs/example/cmp.py --l2cache -b h264_sss
(command and output are slightly modified, lined up so you can easily see the transformations).
This works because, while .* says the largest field of any characters, [^-]* says the largest field of any characters except -.
sed looks for the longest possible match. So -.*- will match a string as large as possible.
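One way to see the greediness directly is to print what each pattern actually matches with grep -o, using the line from the question. This is only an illustrative sketch; the function names are hypothetical, and -o is a common grep extension rather than a POSIX option.

```shell
line='--dump-config=h264_sss-l2-2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16-500.stat configs/example/cmp.py --l2cache -b h264_sss'

# Show the text matched by the greedy pattern from the question.
greedy_matches() {
  printf '%s\n' "$line" | grep -o 'l2-.*-.*-.*-'
}

# Show the text matched when each field is limited to non-dash characters.
bounded_matches() {
  printf '%s\n' "$line" | grep -o 'l2-[^-]*-[^-]*-[^-]*-'
}

greedy_matches
echo '---'
bounded_matches
```

The greedy pattern yields a single long match running from the first l2- to the dash in -b, while the bounded pattern yields two short matches, one per file name.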
