Grep with a regex character range that includes the NULL character

When I include the NULL character (\x00) in a regex character range in BSD grep, the result is unexpected: no characters match. Why is this happening?
Here is an example:
$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']
Here I expect all characters up until the last one to match, however the result is no output (no matches).
Alternatively, when I start the character range from \x01, it works as expected:
$ echo 'ABCabc<>/ă' | grep -o [$'\x01'-$'\x7f']
A
B
C
a
b
c
<
>
/
Also, here are my grep and BASH versions:
$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ echo $BASH_VERSION
3.2.57(1)-release

On BSD grep, you may be able to use this:
LC_ALL=C grep -o '[[:print:][:cntrl:]]' <<< 'ABCabc<>/ă'
A
B
C
a
b
c
<
>
/
Or you can install GNU grep with Homebrew (which, by default, installs it under the name ggrep) and run:
grep -oP '[[:ascii:]]' <<< 'ABCabc<>/ă'

Noting that $'...' is a shell quoting construct, this,
$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']
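would try to pass a literal NUL character as part of the command line argument to grep. That's impossible on any Unix-like system, because command line arguments are passed to the process as NUL-terminated strings. Bash goes further and cuts each $'...' string off at the NUL, so $'\x00' expands to nothing and grep in effect sees the arguments -o and [-\x7f], a bracket expression that matches only - and DEL. (A shell that keeps the NUL until the exec call would instead hand grep the truncated argument [.)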
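Exactly what survives depends on the shell. A quick check, assuming bash (which truncates each $'...' string at a NUL), is to dump the bytes of the word that would be handed to grep. The variable name pattern is only for illustration, and the brackets are quoted here so the word is not treated as a glob.

```shell
# Same word as in the question, with the brackets quoted.
pattern='['$'\x00''-'$'\x7f'']'
# Show the bytes that would reach grep as its pattern argument.
printf '%s' "$pattern" | od -An -tx1
```

In bash this should list the four bytes 5b 2d 7f 5d, i.e. [, -, DEL, ]: the NUL never makes it into the argument.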
You would need to create some pattern that matches the NUL byte without including it literally. But I don't think grep supports the \000 or \x00 escapes itself. Perl does, though, so this prints the input line with the NUL:
$ printf 'foo\nbar\0\n' |perl -ne 'print if /\000/'
bar
As an aside, at least GNU grep doesn't seem to like that kind of a range expression, so if you were to use that, you'd have to do something different. In the C locale, '[[:cntrl:][:print:]]' might perhaps work to match the characters from \x01 to \x7f, but I didn't check comprehensively.
The manual for grep has some descriptions of the classes.
Note also that [$'\x00'-$'\x7f'] has an unquoted pair of [ and ] and so is a shell glob. This isn't related to the NUL byte, but if you had files that match the glob (any one-letter names, if the glob works on your system -- it doesn't on my Linux), or had failglob or nullglob set, it would probably give results you didn't want. Instead, quote the brackets too: $'[\x00-\x7f]'.
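Putting those two notes together, here is a hedged sketch for listing the non-NUL ASCII characters of the input. The range is built with printf so the whole pattern reaches grep as one quoted argument, and LC_ALL=C makes the range byte-based. The function name ascii_chars is made up for this example, and -a is added only so that grep does not classify the input as binary.

```shell
ascii_chars() {
  # Build the bracket expression [\x01-\x7f] without any unquoted [ or ].
  local range
  range=$(printf '[\001-\177]')
  # C locale: ranges compare byte values; -a: always treat input as text.
  LC_ALL=C grep -ao "$range"
}

# \304\203 is the UTF-8 encoding of the trailing non-ASCII character.
printf 'ABCabc<>/\304\203\n' | ascii_chars
```

With GNU grep this should print the nine ASCII characters one per line and skip the two bytes of the last character; I have not checked it against BSD grep.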

Related

sed: remove all characters except for last n characters

I am trying to remove every character in a text string except for the last 11 characters. The string is Sample Text_that-would$normally~be,here--pe_-l4_mBY and what I want to end up with is just -pe_-l4_mBY.
Here's what I've tried:
$ cat food
Sample Text_that-would$normally~be,here--pe_-l4_mBY
$ cat food | sed 's/^.*(.{3})$/\1/'
sed: 1: "s/^.*(.{3})$/\1/": \1 not defined in the RE
Please note that the text string isn't really stored in a file, I just used cat food as an example.
OS is macOS High Sierra 10.13.6 and bash version is 3.2.57(1)-release

You can use this sed with a capture group:
sed -E 's/.*(.{11})$/\1/' file
-pe_-l4_mBY

Basic regular expressions (used by default by sed) require both the parentheses in the capture group and the braces in the brace expression to be escaped. ( and { are otherwise treated as literal characters to be matched.
$ cat food | sed 's/^.*\(.\{3\}\)$/\1/'
mBY
By contrast, explicitly requesting sed to use extended regular expressions with the -E option reverses the meaning, with \( and \{ being the literal characters.
$ cat food | sed -E 's/^.*(.{3})$/\1/'
mBY

Try this also:
grep -o -E '.{11}$' food
grep, like sed, accepts an arbitrary number of file name arguments, so there is no need for a separate cat. (See also useless use of cat.)

You can use tail or parameter expansion (tail counts bytes, so ask for 12: the 11 characters plus the newline that echo adds):
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'
echo "$string" | tail -c 12
echo "${string#"${string%???????????}"}"
-pe_-l4_mBY
-pe_-l4_mBY
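Bash also has substring expansion with a negative offset, which avoids counting question marks. A minimal sketch, assuming bash; the helper name last_n is hypothetical and not part of the answer above:

```shell
last_n() {
  local s=$1 n=$2
  if [ "${#s}" -le "$n" ]; then
    # Shorter than n characters: return the string unchanged.
    printf '%s\n' "$s"
  else
    # The space before the minus sign keeps this from parsing as ${s:-default}.
    printf '%s\n' "${s: -$n}"
  fi
}

last_n 'Sample Text_that-would$normally~be,here--pe_-l4_mBY' 11
```

For the string from the question this prints -pe_-l4_mBY, the 11 characters asked for.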

also with rev/cut/rev
$ echo abcdefghijklmnopqrstuvwxyz | rev | cut -c1-11 | rev
pqrstuvwxyz
man rev => rev - reverse lines characterwise

grep up to and including equal sign for CLI parameter

My goal is to match a command line argument prefix that looks like:
--abc=
Both of the patterns below (and many others), allow:
--abc==
Somehow, I can't find a grep way to ensure there is just one equal sign.
grep -i '^--[a-z]\{2,\}=\{1,1\}'
grep -i '^--[a-z]\{2,\}='
grep 2.20
CentOS Linux 7.3.1611

ERE:
^--[[:alpha:]]{2,}=[^=]+$
^--[[:alpha:]]{2,}= matches --, then two or more alphabetic characters in your locale, then a literal =
[^=]+$ matches one or more characters that are not = at the end
BRE:
^--[[:alpha:]]\{2,\}=[^=]\+$
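Example (the first three commands use [^=]*, which also accepts an empty value; the last uses [^=]+, which rejects it):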
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--foobar=spam'
--foobar=spam
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--foobar=23'
--foobar=23
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--123ad='
$ grep -E '^--[[:alpha:]]{2,}=[^=]+$' <<<'--spamegg='
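To use the ERE above as a check inside a script, one option is to wrap it in a small function and rely on grep's exit status. This is only a sketch: the name is_long_opt is invented here, and it uses the [^=]* variant so that an empty value such as --abc= is accepted.

```shell
is_long_opt() {
  # -q: no output, only the exit status says whether the argument matched.
  printf '%s\n' "$1" | grep -Eq '^--[[:alpha:]]{2,}=[^=]*$'
}

for arg in --abc= --abc=1 --abc== --a=1; do
  if is_long_opt "$arg"; then
    echo "accept $arg"
  else
    echo "reject $arg"
  fi
done
```

This should accept --abc= and --abc=1, and reject --abc== (second equal sign) and --a=1 (only one letter).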

Match a unix line ending with grep

How can I match a unix line ending with grep? I already have a working script that uses unix2dos and cmp, but it's a bit slow, and a single grep command would fit in a lot better with the rest of my bash code.
I tried using a negative lookbehind on '\r'.
$ printf "foo\r\n" | grep -PUa '(?<!'$'\r'')$'
foo
Why doesn't that work? For the record, the regex pattern seems to evaluate just well this way:
$ printf '(?<!'$'\r'')$' | od -a
0000000 ( ? < ! cr ) $
0000007
Update:
$ grep --version
grep (GNU grep) 2.24
on MINGW64 on windows 7.

Your solution with grep -PUa '(?<!'$'\r'')$' worked with a more recent version of grep (2.25). However the support for Perl-compatible regular expression (-P) is stated to be highly experimental even in that newer version of grep, so it's not surprising that it didn't work in the previous version.
Use the following basic regular expression: \([^\r]\|^\)$, i.e. the following grep command when running from bash:
grep -Ua '\([^'$'\r'']\|^\)$'
An example demonstrating that it correctly handles both empty and non-empty lines:
$ printf "foo\nbar\r\n\nx\n\r\ny\nbaz\n" | grep -Ua '\([^'$'\r'']\|^\)$'
foo
x
y
baz
$
EDIT
The solution above treats the last line not including an end-of-line symbol as if it ended with a unix line ending. E.g.
$ printf "foo\nbar" | grep -Ua '\([^'$'\r'']\|^\)$'
foo
bar
That can be fixed by appending an artificial CRLF to the input: if the input ends with a newline, the extra (empty) line will be dropped by grep; otherwise it will make grep drop the last line:
$ { printf "foo\nbar"; printf "\r\n"; } | grep -Ua '\([^'$'\r'']\|^\)$'
foo
$
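If the goal is only to find out how many lines have unix endings, the same expression can be used with -c. The sketch below writes it as an ERE and builds the CR with printf; the function name count_lf_lines is an assumption for this example. -U is left out because it only matters on Windows ports such as the MINGW build mentioned in the question, where it would presumably still be needed.

```shell
count_lf_lines() {
  local cr
  cr=$(printf '\r')
  # A non-CR character, or nothing at all, right before the end of the line.
  grep -Eac "([^$cr]|^)\$"
}

printf 'foo\nbar\r\n\nbaz\n' | count_lf_lines
```

For this input the count is 3: foo, the empty line, and baz; bar ends in CRLF and is not counted.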

grep not finding ".*" string values

I have a file temp.txt as below.
a.*,super
I want to grep .* to check whether the value is present in the file or not.
Command used:
grep -i ".*" temp.txt
returns nothing

This is because grep considers the pattern as a regular expression.
To make grep interpret it as a literal, use -F.
grep -F ".*" temp.txt
Also, note that -i is not needed, because there is no case distinction to take into account (for example, grep -i "ab" returns AB, aB, Ab and ab).
As man grep says:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines,
any of which is to be matched. (-F is specified by POSIX.)
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i
is specified by POSIX.)

Using awk
awk '/\.\*/' file
or fgrep
fgrep ".*" file

Both . and * have special meaning in regular expressions. Escape them to match literally.
$ cat temp.txt
a.*,super
$ grep "\.\*" temp.txt
a.*,super
$ echo $?
0
$ grep "there-is-no-such-string" temp.txt
$ echo $?
1
-i is not needed because there are no letters in the regular expression.
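The difference between the three forms is easiest to see by counting matching lines in a file that has one line with .* and one without. This is a sketch using a temporary file; the variable names are made up for the example.

```shell
tmp=$(mktemp)
printf 'a.*,super\nplain line\n' > "$tmp"

as_regex=$(grep -c '.*' "$tmp")      # regex: "any characters", matches every line
as_fixed=$(grep -cF '.*' "$tmp")     # fixed string: only lines containing .*
as_escaped=$(grep -c '\.\*' "$tmp")  # escaped regex: same lines as -F

echo "regex=$as_regex fixed=$as_fixed escaped=$as_escaped"
rm -f "$tmp"
```

This prints regex=2 fixed=1 escaped=1: as a regular expression .* matches every line, while -F and the escaped form match only the line that really contains the two characters.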

Is there an easy way to pass a "raw" string to grep?

grep can't be fed "raw" strings when used from the command-line, since some characters need to be escaped to not be treated as literals. For example:
$ grep '(hello|bye)' # WON'T MATCH 'hello'
$ grep '\(hello\|bye\)' # GOOD, BUT QUICKLY BECOMES UNREADABLE
I was using printf to auto-escape strings:
$ printf '%q' '(some|group)\n'
\(some\|group\)\\n
This produces a bash-escaped version of the string, and using backticks, this can easily be passed to a grep call:
$ grep `printf '%q' '(a|b|c)'`
However, it's clearly not meant for this: some characters in the output are not escaped, and some are unnecessarily so. For example:
$ printf '%q' '(^#)'
\(\^#\)
The ^ character should not be escaped when passed to grep.
Is there a cli tool that takes a raw string and returns a bash-escaped version of the string that can be directly used as pattern with grep? How can I achieve this in pure bash, if not?

If you want to search for an exact string,
grep -F '(some|group)\n' ...
-F tells grep to treat the pattern as is, with no interpretation as a regex.
(This is often available as fgrep as well.)

If you are attempting to get grep to use Extended Regular Expression syntax, the way to do that is to use grep -E (aka egrep). You should also know about grep -F (aka fgrep) and, in newer versions of GNU grep, grep -P.
Background: The original grep had a fairly small set of regex operators; it was Ken Thompson's original regular expression implementation. A new version with an extended repertoire was developed later, and for compatibility reasons, got a different name. With GNU grep, there is only one binary, which understands the traditional, basic RE syntax if invoked as grep, and ERE if invoked as egrep. Some constructs from egrep are available in grep by using a backslash escape to introduce special meaning.
Subsequently, the Perl programming language has extended the formalism even further; this regex dialect seems to be what most newcomers erroneously expect grep, too, to support. With grep -P, it does; but this is not yet widely supported on all platforms.
So, in grep, the following characters have a special meaning: ^$[]*.\
In egrep, the following characters also have a special meaning: ()|+?{}. (The braces for repetition were not in the original egrep.) The grouping parentheses also enable backreferences with \1, \2, etc.
In many versions of grep, you can get the egrep behavior by putting a backslash before the egrep specials. There are also special sequences like \<\>.
In Perl, a huge number of additional escapes like \w \s \d were introduced. In Perl 5, the regex facility was substantially extended, with non-greedy matching *? +? etc, non-grouping parentheses (?:...), lookaheads, lookbehinds, etc.
... Having said that, if you really do want to convert egrep regular expressions to grep regular expressions without invoking any external process, try ${regex/pattern/substitution} for each of the egrep special characters; but recognize that this does not handle character classes, negated character classes, or backslash escapes correctly.
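As a small illustration of how the same pattern text is read in the three modes, the sketch below counts matching lines for '(hello|bye)' as a BRE, as an ERE, and as a fixed string. The sample text and variable names are invented for this example.

```shell
sample=$(printf 'hello\nbye\n(hello|bye)\n')

# BRE: ( | ) are ordinary characters, so only the literal line matches.
bre=$(printf '%s\n' "$sample" | grep -c '(hello|bye)')
# ERE: grouping and alternation, so every line matches.
ere=$(printf '%s\n' "$sample" | grep -cE '(hello|bye)')
# Fixed string: no regex interpretation at all.
fixed=$(printf '%s\n' "$sample" | grep -cF '(hello|bye)')

echo "bre=$bre ere=$ere fixed=$fixed"
```

This prints bre=1 ere=3 fixed=1.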

When I use grep -E with user provided strings I escape them with this
ere_quote() {
sed 's/[][\.|$(){}?+*^]/\\&/g' <<< "$*"
}
example run
ere_quote ' \ $ [ ] ( ) { } | ^ . ? + *'
# output
# \\ \$ \[ \] \( \) \{ \} \| \^ \. \? \+ \*
This way you may safely insert the quoted string in your regular expression.
e.g. if you wanted to find each line starting with the user content, with the user providing funny strings such as .*
userdata=".*"
grep -E -- "^$(ere_quote "$userdata")" <<< ".*hello"
# if you have colors in grep you'll see only ".*" in red

I think that previous answers are not complete because they miss one important thing, namely strings which begin with a dash (-). So while this won't work:
echo "A-B-C" | grep -F "-B-"
This one will:
echo "A-B-C" | grep -F -- "-B-"

quote() {
sed 's/[^\^]/[&]/g;s/[\^]/\\&/g' <<< "$*"
}
Usage: grep [OPTIONS] "$(quote [STRING])"
This function has some substantial benefits:
quote is independent from the regex flavor. You can use quote's output in
grep (-G) (BRE, the default)
grep -E (ERE)
grep -P (PCRE)
sed (-E) "s/$(quote [STRING])/.../" (as long as you don't use \, [, or ] instead of /).
quote even works in corner cases that are not directly quoting related, for instance
Leading - are quoted so that they aren't misinterpreted as options by grep.
Trailing spaces are quoted so that they aren't removed by $(...).
quote only fails if [STRING] contains linebreaks. But in general there is no fix for this since tools like grep and sed may not support linebreaks in their search pattern (even if they are written as \n).
Also, there is the drawback that the quoted output usually is three times longer than the unquoted input.
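To make the behaviour concrete, here is what the function produces for a few inputs. The function body is copied from the answer so that the snippet runs on its own; it assumes bash because of the <<< here-string.

```shell
# Copied from the answer above.
quote() {
  sed 's/[^\^]/[&]/g;s/[\^]/\\&/g' <<< "$*"
}

quote '.*'     # every character is wrapped in its own bracket expression
quote '-B-'    # the leading dash ends up inside brackets
printf 'a.*b\naxb\n' | grep -c "$(quote '.*')"
```

The three commands print [.][*], [-][B][-] and 1 respectively, which also shows the threefold growth in length mentioned above.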

Just want to comment on the example below, which shows that the substring "-B" is interpreted by grep as a command line option, so the command fails.
echo "A-B-C" | grep -F "-B-"
grep has a special option for this case:
-e PATTERNS, --regexp=PATTERNS
Use PATTERNS as the patterns. If this option is used multiple times or is combined with the -f (--file) option,
search for all patterns given. This option can be used to protect a pattern beginning with “-”.
So a fix for the issue is:
echo "A-B-C" | grep -F -e "-B-" -
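Both ways of protecting a pattern that starts with a dash can be compared side by side. A minimal sketch; the variable names are made up for the example.

```shell
line='A-B-C'

# "--" ends option parsing, so the next argument is taken as the pattern.
with_ddash=$(printf '%s\n' "$line" | grep -F -- '-B-')
# "-e" explicitly introduces the pattern, whatever it starts with.
with_e=$(printf '%s\n' "$line" | grep -F -e '-B-')

echo "$with_ddash / $with_e"
```

Both variants find the line, so this prints A-B-C / A-B-C.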
