Why do you have to escape | and + in grep between apostrophes? - bash

I was under the impression that within single quotes, e.g. 'pattern', bash special characters are not interpolated, so one need only escape single quotes themselves.
Why then does echo "123" | grep '[0-9]+' output nothing, whereas echo "123" | grep '[0-9]\+' (plus sign escaped) output 123? (Likewise, echo "123" | grep '3|4' outputs nothing unless you escape the |.)
This is under bash 4.1.2 and grep 2.6.3 on CentOS 6.5.

grep uses Basic Regular Expressions (BREs) by default, like sed and vi. In GNU grep's BRE syntax, characters such as + and | are literal unless you escape them with a backslash, which is tedious.
You probably want Extended Regular Expressions, so use egrep or grep -E (depending on the version in use). Check your man grep.
See also the GNU documentation for a full list of the characters involved.
Most languages these days use a syntax closer to Extended Regular Expressions (EREs), where these characters are special without a backslash, and that is much easier to use. Basic Regular Expressions (BREs) are really a throw-back.
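A minimal sketch of the difference, using the input from the question. The variable names are only illustrative, and the \+ form is a GNU extension to BRE (the interval \{1,\} is the POSIX way to write "one or more" there):

```shell
# Sample input from the question.
input="123"

# BRE (grep's default): an unescaped + is an ordinary character,
# so this looks for a digit followed by a literal "+" and finds nothing.
bre_plus=$(printf '%s\n' "$input" | grep '[0-9]+' || true)

# BRE with the backslashed form (GNU extension) and with a POSIX interval.
bre_plus_escaped=$(printf '%s\n' "$input" | grep '[0-9]\+')
bre_interval=$(printf '%s\n' "$input" | grep '[0-9]\{1,\}')

# ERE (grep -E): + and | are special without a backslash.
ere_plus=$(printf '%s\n' "$input" | grep -E '[0-9]+')
ere_alt=$(printf '%s\n' "$input" | grep -E '3|4')

printf 'BRE unescaped +: [%s]\n' "$bre_plus"
printf 'BRE escaped +:   [%s]\n' "$bre_plus_escaped"
printf 'BRE interval:    [%s]\n' "$bre_interval"
printf 'ERE +:           [%s]\n' "$ere_plus"
printf 'ERE |:           [%s]\n' "$ere_alt"
```

Only the first line should print empty brackets; the others should print 123.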

That comes down to the regular expression syntax grep uses by default. With a different one, such as Perl-compatible regular expressions via -P, it works:
$ echo "123" | grep '[0-9]+'
$ echo "123" | grep -P '[0-9]+'
123
$ echo "123" | grep '3|4'
$ echo "123" | grep -P '3|4'
123

Related

Grep with a regex character range that includes the NULL character

When I include the NULL character (\x00) in a regex character range in BSD grep, the result is unexpected: no characters match. Why is this happening?
Here is an example:
$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']
Here I expect all characters up until the last one to match, however the result is no output (no matches).
Alternatively, when I start the character range from \x01, it works as expected:
$ echo 'ABCabc<>/ă' | grep -o [$'\x01'-$'\x7f']
A
B
C
a
b
c
<
>
/
Also, here are my grep and BASH versions:
$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ echo $BASH_VERSION
3.2.57(1)-release
On BSD grep, you may be able to use this:
LC_ALL=C grep -o '[[:print:][:cntrl:]]' <<< 'ABCabc<>/ă'
A
B
C
a
b
c
<
>
/
Or you can install GNU grep with Homebrew and run:
grep -oP '[[:ascii:]]' <<< 'ABCabc<>/ă'
Noting that $'...' is a shell quoting construct, this,
$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']
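would try to pass a literal NUL character as part of the command line argument to grep. That's impossible to do in any Unix-like system, as the command line arguments are passed to the process as NUL-terminated strings. So grep never receives the NUL. If the shell passed it through, the argument would end at the NUL and grep would see just the arguments -o and [. Bash instead appears to drop the NUL when it expands $'\x00', which would leave a bracket expression containing only a hyphen and \x7f, and that matches nothing in the sample input.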
You would need to create some pattern that matches the NUL byte without including it literally. But I don't think grep supports the \000 or \x00 escapes itself. Perl does, though, so this prints the input line with the NUL:
$ printf 'foo\nbar\0\n' |perl -ne 'print if /\000/'
bar
As an aside, at least GNU grep doesn't seem to like that kind of range expression, so if you were to use that, you'd have to do something different. In the C locale, '[[:cntrl:][:print:]]' might perhaps work to match the characters from \x01 to \x7f, but I didn't check comprehensively.
The manual for grep has some descriptions of the classes.
Note also that [$'\x00'-$'\x7f'] has an unquoted pair of [ and ] and so is a shell glob. This isn't related to the NUL byte, but if you had files that match the glob (any one-letter names, if the glob works on your system -- it doesn't on my Linux), or had failglob or nullglob set, it would probably give results you didn't want. Instead, quote the brackets too: $'[\x00-\x7f]'.
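One way to find lines containing a NUL without putting the NUL into the pattern is to translate it to a placeholder before grep reads the data. This is a different technique from the Perl one-liner above, and it is only a sketch: it assumes the placeholder character (@ here, an arbitrary choice) does not otherwise occur in the input.

```shell
# printf can write a NUL byte into a pipe, even though a NUL cannot
# travel inside a command-line argument.
# tr rewrites each NUL to "@" (assumed absent from the real data),
# so grep only ever has to match an ordinary character.
lines_with_nul=$(printf 'foo\nbar\0\n' | tr '\000' '@' | grep '@')

printf '%s\n' "$lines_with_nul"
```

This should print bar@, the one input line that contained the NUL.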

sed: remove all characters except for last n characters

I am trying to remove every character in a text string except for the last 11 characters. The string is Sample Text_that-would$normally~be,here--pe_-l4_mBY and what I want to end up with is just -pe_-l4_mBY.
Here's what I've tried:
$ cat food
Sample Text_that-would$normally~be,here--pe_-l4_mBY
$ cat food | sed 's/^.*(.{3})$/\1/'
sed: 1: "s/^.*(.{3})$/\1/": \1 not defined in the RE
Please note that the text string isn't really stored in a file, I just used cat food as an example.
OS is macOS High Sierra 10.13.6 and bash version is 3.2.57(1)-release
You can use this sed with a capture group:
sed -E 's/.*(.{11})$/\1/' file
-pe_-l4_mBY
Basic regular expressions (used by default by sed) require both the parentheses in the capture group and the braces in the brace expression to be escaped. ( and { are otherwise treated as literal characters to be matched.
$ cat food | sed 's/^.*\(.\{3\}\)$/\1/'
mBY
By contrast, explicitly requesting sed to use extended regular expressions with the -E option reverses the meaning, with \( and \{ being the literal characters.
$ cat food | sed -E 's/^.*(.{3})$/\1/'
mBY
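For the 11 characters the question actually asks for, the BRE and ERE forms can be compared side by side. This is a sketch with illustrative variable names; it assumes a sed that accepts -E (both GNU sed and the BSD sed on macOS do):

```shell
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'

# BRE: the group and the interval need backslashes to be special.
last_bre=$(printf '%s\n' "$string" | sed 's/^.*\(.\{11\}\)$/\1/')

# ERE: the same expression without the backslashes.
last_ere=$(printf '%s\n' "$string" | sed -E 's/^.*(.{11})$/\1/')

printf '%s\n' "$last_bre" "$last_ere"
```

Both lines of output should be -pe_-l4_mBY.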
Try this also:
grep -o -E '.{11}$' food
grep, like sed, accepts an arbitrary number of file name arguments, so there is no need for a separate cat. (See also useless use of cat.)
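You can use tail or parameter expansion. Note that tail -c counts bytes and echo appends a newline, so 12 bytes gives the last 11 characters plus the newline:
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'
echo "$string" | tail -c 12
echo "${string#"${string%???????????}"}"
-pe_-l4_mBY
-pe_-l4_mBY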
also with rev/cut/rev
$ echo abcdefghijklmnopqrstuvwxyz | rev | cut -c1-11 | rev
pqrstuvwxyz
man rev => rev - reverse lines characterwise

grep up to and including equal sign for CLI parameter

My goal is to match a command line argument prefix that looks like:
--abc=
Both of the patterns below (and many others), allow:
--abc==
Somehow, I can't find a grep way to ensure there is just one equal sign.
grep -i '^--[a-z]\{2,\}=\{1,1\}'
grep -i '^--[a-z]\{2,\}='
grep 2.20
CentOS Linux 7.3.1611
ERE:
^--[[:alpha:]]{2,}=[^=]+$
^--[[:alpha:]]{2,}= matches --, then two or more alphabetic characters in your locale, then a literal =
[^=]+$ matches one or more characters that are not = at the end
BRE:
^--[[:alpha:]]\{2,\}=[^=]\+$
Example:
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--foobar=spam'
--foobar=spam
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--foobar=23'
--foobar=23
$ grep -E '^--[[:alpha:]]{2,}=[^=]*$' <<<'--123ad='
$ grep -E '^--[[:alpha:]]{2,}=[^=]+$' <<<'--spamegg='
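To see which arguments the ERE accepts, it can be wrapped in a small function. This is only a sketch: classify_option is a hypothetical helper name, and it uses the [^=]* variant from the examples above, so an empty value after the = is accepted.

```shell
# Hypothetical helper: prints "accept" when "$1" is --name= followed by
# text that contains no further "=", and "reject" otherwise.
classify_option() {
  if printf '%s\n' "$1" | grep -Eq '^--[[:alpha:]]{2,}=[^=]*$'; then
    echo "accept"
  else
    echo "reject"
  fi
}

for arg in '--abc=' '--abc=value' '--abc==' '--a=value'; do
  printf '%s %s\n' "$(classify_option "$arg")" "$arg"
done
```

The first two arguments should be accepted; --abc== is rejected because of the second equal sign, and --a=value because the name has fewer than two letters.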

Sed regex, extracting part of a string in Mac terminal

I have sample data like "(stuff/thing)" and I'm trying to extract "thing".
I'm doing this in the terminal on OSX and I can't quite seem to get this right.
Here's the last broken attempt
echo '(stuff/thing)' | sed -n 's/\((.*)\)/\1/p'
I would say:
$ echo '(stuff/thing)' | sed -n 's#.*/\([^)]*\))#\1#p'
thing
I start with:
$ echo '(stuff/thing)' | sed -n 's#.*/##p'
thing)
Note I use # as sed delimiter for better readability.
Then I want to get rid of everything from the ) onwards. For this, we capture the part before it with \([^)]*\), match the literal ) after it, and print the captured part back with \1.
So all together this is doing:
.*/         everything up to and including the last /
\([^)]*\)   capture group: every character that is not a )
)           the literal closing parenthesis
\1          print the captured group back
To provide an awk alternative to fedorqui's helpful answer:
awk makes it easy to parse lines into fields based on separators:
$ echo '(stuff/thing)' | awk -F'[()/]' '{print $3}'
thing
-F'[()/]' specifies that any of the characters ( ) / should serve as a field separator when breaking each input line into fields.
$3 refers to the 3rd field (thing is the 3rd field, because the line starts with a field separator, which implies that field 1 ($1) is the empty string before it).
As for why your sed command didn't work:
Since you're not using -E, you must use basic regexes (BREs), where, counter-intuitively, parentheses must be escaped to be special - you have it the other way around.
The main problem, however, is that in order to output only part of the line, you must match ALL of it, and replace it with the part of interest.
With a BRE, that would be:
echo '(stuff/thing)' | sed -n 's/^.*\/\(.*\))$/\1/p'
With an ERE (extended regex), it would be:
echo '(stuff/thing)' | sed -En 's/^.*\/(.*)\)$/\1/p'
Also note that both commands work as-is with GNU sed, so the problem is not Mac-specific (but note that the -E option to activate EREs is an alias there for the better-known -r).
That said, regex dialects do differ across implementations; GNU sed generally supports extensions to the POSIX-mandated BREs and EREs.
I would do it in 2 easy parts - remove everything up to and including the slash and then everything from the closing parenthesis onwards:
echo '(stuff/thing)' | sed -e 's/.*\///' -e 's/).*//'
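The approaches in this thread can be checked against each other on the sample input. A minimal sketch with illustrative variable names, again assuming a sed that accepts -E:

```shell
input='(stuff/thing)'

# BRE: \( \) delimit the group, a bare ) is literal.
via_bre=$(printf '%s\n' "$input" | sed -n 's/^.*\/\(.*\))$/\1/p')

# ERE: ( ) delimit the group, \) is the literal parenthesis.
via_ere=$(printf '%s\n' "$input" | sed -En 's/^.*\/(.*)\)$/\1/p')

# Two separate deletions instead of one capture group.
via_two_steps=$(printf '%s\n' "$input" | sed -e 's/.*\///' -e 's/).*//')

# awk: split on any of ( ) / and take the third field.
via_awk=$(printf '%s\n' "$input" | awk -F'[()/]' '{print $3}')

printf '%s\n' "$via_bre" "$via_ere" "$via_two_steps" "$via_awk"
```

All four lines of output should be thing.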

bash - proper use of grep's infix operator "|"

I'm having a hard time discovering which method is best...
(debian latest, gnome)
This works;
$ ls -1 | grep "JPG|RAF|TIF"
I am trying to replace the "JPG|RAF|TIF" with a variable
e.g.
$ xFILTER="JPG RAF TIF"
I've tried to assign to the xFILTER variable...
xFILTER="JPG\|RAF\|TIF"
xFILTER="\"JPG|RAF|TIF\""
xFILTER="JPG\nRAF\nTIF"
$ ls -1 | grep -E "$xFILTER"
$ ls -1 | grep -e $xFILTER
$ ls -1 | grep -E "$(echo "xFILTER" | tr ' ' '|')"
Could someone please direct me towards a more sensible approach ?
Thank you.
This should work:
xFILTER="JPG|RAF|TIF"
ls -1 | grep -E "$xFILTER"
However, as a word of caution, it is not always safe to parse ls output when your file names can contain spaces or newlines. Look into shopt -s extglob to enable extended globbing and match your files with extended patterns.
You're running into the different regular expressions syntaxes that grep supports.
Quoting the man page,
In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
I'm surprised that you say your first example works.
Use one of the following:
xFILTER="JPG\|RAF\|TIF" ; ls -1 | grep "$xFILTER"
xFILTER="JPG|RAF|TIF" ; ls -1 | grep -E "$xFILTER"
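If you would rather keep the space-separated list from the question, one option is to turn it into an ERE alternation with tr. This is a sketch: the file names are made-up sample data fed in with printf instead of ls, and it assumes the list entries contain no spaces or regex metacharacters of their own.

```shell
# Space-separated list, as in the question.
xFILTER="JPG RAF TIF"

# Replace each space with | to get the ERE alternation JPG|RAF|TIF.
pattern=$(printf '%s\n' "$xFILTER" | tr ' ' '|')

# Made-up file names standing in for the ls output.
matches=$(printf '%s\n' a.JPG b.txt c.RAF d.TIF | grep -E "$pattern")

printf '%s\n' "$pattern"
printf '%s\n' "$matches"
```

The quotes around "$pattern" matter: they keep the shell from word-splitting or globbing the pattern before grep sees it.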
