Proper use of capture groups in SED command - bash

I need to convert a string "1,234" =to=> 1234.
This string is just part of a longer line, and there are thousands of such lines in the file.
I have written a sed command which is not working as I expect it to.
echo \"1,234\" | sed 's/\("\)\([0-9]+\)\(,\)\([0-9]+\)\("\)/\2\4/g'
As far as I understand, in this code,
\1 is "
\2 is the digits before comma
\3 is ,
\4 is the digits after comma
I expect this command to output 1234 which should be \2\4. But it just yields back "1,234". So I think it is not being parsed properly. Some help would be appreciated.

I would suggest you use POSIX Extended Regular Expressions (ERE), where you don't have to escape parentheses and the repetition operator. To enable ERE in sed, you can use the -E switch (or -r in GNU sed). Your expression will then look like this:
$ echo '"1,234"' | sed -E 's/"([0-9]+),([0-9]+)"/\1\2/g'
1234
For completeness, your original BRE expression will work if you escape the + (note that \+ is a GNU extension; the portable BRE spelling is \{1,\}):
echo \"1,234\" | sed 's/\("\)\([0-9]\+\)\(,\)\([0-9]\+\)\("\)/\2\4/g'
1234

Your second and fourth groups contain [0-9]+, which matches any digit followed by a plus sign.
It looks like you meant [0-9]\+, to match one or more digits.
In passing: there's no need to group the parts you'll not be using (\1, \3 and \5). You can simplify to:
echo \"1,234\" | sed 's/"\([0-9]\+\),\([0-9]\+\)"/\1\2/g'
If you're finding all those \ hard to handle, you could use Extended Regular Expression syntax, with the -E flag:
echo \"1,234\" | sed -E 's/"([0-9]+),([0-9]+)"/\1\2/g'
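To see the difference concretely, here is a small sketch. The sample inputs are made up for illustration, and it assumes a sed that accepts -E (GNU and BSD sed both do). \{1,\} is the portable BRE way to say "one or more".

```shell
# In a BRE, an unescaped + is a literal plus sign, so [0-9]+ matches "1+" here.
literal=$(printf '%s\n' '1+2' | sed 's/[0-9]+/X/')

# Portable BRE: \{1,\} means "one or more" without relying on GNU's \+.
bre=$(printf '%s\n' '"1,234"' | sed 's/"\([0-9]\{1,\}\),\([0-9]\{1,\}\)"/\1\2/g')

# ERE: + is a repetition operator and parentheses group without backslashes.
ere=$(printf '%s\n' '"1,234"' | sed -E 's/"([0-9]+),([0-9]+)"/\1\2/g')

printf '%s\n' "$literal" "$bre" "$ere"
```

This should print X2, 1234 and 1234, one per line.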

Related

Convert text from HttpStatus.NOT_FOUND into status().isNotFound() in bash

I want to convert the text in a bash variable i.e. HttpStatus.NOT_FOUND into status().isNotFound() and I had accomplished this by using sed:
result=HttpStatus.NOT_FOUND
result=$(echo $result | cut -d'.' -f2- | sed -r 's/(^|_)([A-Z])/\L\2/g' | sed -E 's/([[:lower:]])|([[:upper:]])/\U\1\L\2/g')
echo "status().is$result()"
Output:
status().isNotFound()
As you can see here I'm using 2 sed commands.
Is there a way to achieve the same result using 1 sed or any other simpler way?
Since it involves a lot of new text insertion in the replacement part, the sed command can be written in detail as below. Just pass the variable content over a pipe without using cut
result=HttpStatus.NOT_FOUND
echo "$result" |
sed -E 's/^.*(Status)\.([[:upper:]])([[:upper:]]+)_([[:upper:]])([[:upper:]]+)$/\L\1().is\u\2\L\3\u\4\L\5()/g'
The idea is to apply the case conversion functions of GNU sed to the captured groups. So we capture:
(Status) in \1, which we lowercase entirely before appending ().is to the result.
The next captured group, \2, is the first uppercase character following the ., which is N, and the rest of the word, OT, is in \3. We keep \2 as it is and lowercase \3.
The same sequence is repeated for the next word, FOUND, in \4 and \5.
\L and \u are case conversion operators available in GNU sed.
If you are looking to modify only the part beyond the . to CamelCase, then you can use sed as
result=HttpStatus.NOT_FOUND
result=$(echo "$result" |
sed -E 's/^.*\.([[:upper:]])([[:upper:]]+)_([[:upper:]])([[:upper:]]+)/\u\1\L\2\u\3\L\4/g')
echo "status().is$result()"
This might work for you (GNU sed):
<<<"$result" sed -r 's/.*(Status)\.(.*)_(.*)/\L\1().is\u\2\u\3()/'
Use pattern matching/grouping/back references. The majority of the RHS is lowercase, so use the \L metacharacter to convert from Status... to lowercase and uppercase just the start of words using \u which converts only the next character to uppercase.
N.B. \L (and likewise \U) converts all following characters to lowercase (uppercase) until \E or the next \U/\L; \l and \u only override this for the next character.
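A minimal sketch of how these operators combine, assuming GNU sed (BSD/macOS sed does not support them as far as I know). The input strings are made up for illustration.

```shell
# \u uppercases only the next character; \L lowercases everything that follows.
cap=$(echo 'hello WORLD' | sed -E 's/([a-z]+) ([A-Z]+)/\u\1 \L\2/')

# \E ends a running \U (or \L), so the second group is left untouched.
stop=$(echo 'foo bar' | sed -E 's/(foo) (bar)/\U\1\E \2/')

# With \L in effect, \u still overrides the case of the next character only.
camel=$(echo 'NOT_FOUND' | sed -E 's/([A-Z]+)_([A-Z]+)/\L\u\1\u\2/')

printf '%s\n' "$cap" "$stop" "$camel"
```

With GNU sed this should print Hello world, FOO bar and NotFound.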
Since you are using GNU sed (-r switch), here's another sed solution,
just a little bit more concise, and locale safe:
$ result=HttpStatus.NOT_FOUND
$ echo "$result" | sed -r 's/^.*([A-Z][a-z]*)\.([a-zA-Z])([a-zA-Z]*)_([a-zA-Z])([a-zA-Z]*)/\L\1().is\u\2\L\3\U\4\L\5()/'
status().isNotFound()
An even more concise way of sed is:
echo "$result" | sed -r 's/^.*([A-Z][a-z]*)\.([a-zA-Z]*)_([a-zA-Z]*)/\L\1().is\u\2\u\3()/'
They both are case insensitive for the second part, for example .nOt_fOuNd also works here.
And a GNU awk solution:
echo "$result" | awk 'function cap(str){return (toupper(substr(str,1,1)) tolower(substr(str,2)))}match($0, /([A-Z][a-z]*)\.([a-zA-Z]*)_([a-zA-Z]*)/, m){print tolower(m[1]) "().is" cap(m[2]) cap(m[3]) "()"}'
You can use the sed option "-e" to chain multiple expressions in a single sed invocation.
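As a sketch of that idea, the two steps can be chained with -e in one sed call, which also handles any number of underscore-separated words. This assumes GNU sed for \u and \L; the variable name `name` and the second sample value are only illustrations.

```shell
to_camel() {
  # First expression: drop everything up to and including the first dot.
  # Second expression: for each word (at the start or after _), uppercase
  # its first letter, lowercase the rest, and drop the underscore.
  printf '%s\n' "$1" |
    sed -E -e 's/^[^.]*\.//' -e 's/(^|_)([A-Za-z])([A-Za-z]*)/\u\2\L\3/g'
}

result=HttpStatus.NOT_FOUND
name=$(to_camel "$result")
echo "status().is${name}()"

# A three-word value, to show the g flag handling every word.
many=$(to_camel 'HttpStatus.TOO_MANY_REQUESTS')
echo "status().is${many}()"
```

With GNU sed this should print status().isNotFound() and status().isTooManyRequests().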

sed: remove all characters except for last n characters

I am trying to remove every character in a text string except for the last 11 characters. The string is Sample Text_that-would$normally~be,here--pe_-l4_mBY and what I want to end up with is just -pe_-l4_mBY.
Here's what I've tried:
$ cat food
Sample Text_that-would$normally~be,here--pe_-l4_mBY
$ cat food | sed 's/^.*(.{3})$/\1/'
sed: 1: "s/^.*(.{3})$/\1/": \1 not defined in the RE
Please note that the text string isn't really stored in a file, I just used cat food as an example.
OS is macOS High Sierra 10.13.6 and bash version is 3.2.57(1)-release
You can use this sed with a capture group:
sed -E 's/.*(.{11})$/\1/' file
-pe_-l4_mBY
Basic regular expressions (used by default by sed) require both the parentheses in the capture group and the braces in the brace expression to be escaped. ( and { are otherwise treated as literal characters to be matched.
$ cat food | sed 's/^.*\(.\{3\}\)$/\1/'
mBY
By contrast, explicitly requesting sed to use extended regular expressions with the -E option reverses the meaning, with \( and \{ being the literal characters.
$ cat food | sed -E 's/^.*(.{3})$/\1/'
mBY
Try this also:
grep -o -E '.{11}$' food
grep, like sed, accepts an arbitrary number of file name arguments, so there is no need for a separate cat. (See also useless use of cat.)
You can use tail or parameter expansion:
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'
echo "$string" | tail -c 12   # 11 characters plus the newline that echo appends
echo "${string#${string%???????????}}"   # eleven ? marks, one per character to keep
-pe_-l4_mBY
-pe_-l4_mBY
also with rev/cut/rev
$ echo abcdefghijklmnopqrstuvwxyz | rev | cut -c1-11 | rev
pqrstuvwxyz
man rev => rev - reverse lines characterwise
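The parameter expansion trick can be wrapped in a small helper so the number of ? marks does not have to be typed by hand. This is only a sketch: `last_n` is a made-up name, it uses POSIX shell features only, and it returns an empty string when the input is shorter than n.

```shell
last_n() {
  s=$1
  n=$2
  pat=
  # Build a glob of n question marks, one per character to keep.
  while [ "$n" -gt 0 ]; do
    pat="$pat?"
    n=$((n - 1))
  done
  # Strip the last n characters to get the prefix, then strip that prefix.
  prefix=${s%$pat}
  printf '%s\n' "${s#"$prefix"}"
}

string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'
tail11=$(last_n "$string" 11)
printf '%s\n' "$tail11"
```

For the sample string this should print -pe_-l4_mBY.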

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure out how to tell sed that dot should match a newline:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
instead I get original:
one
two
three
sed is a line-based tool. I don't think there is an option for that.
You can use h/H(hold), g/G(get).
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, you can match any character, including line break chars, with a plain . (dot). The manual says:
.
Matches any character, including newline.
All you need to use is a -z option:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0 option sets the input record separator to the null character, so a text file without NUL bytes is read as a whole rather than line by line (-0777 is the explicit slurp mode), -i enables in-place file modification, the s flag makes . match any char including line break chars, and .*? matches any 0 or more chars as few as possible due to the non-greedy *?. The -CSD -Mutf8 part makes sure your input is decoded and your output re-encoded correctly.
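A short sketch of the greedy versus non-greedy difference on a single line. The input is made up, and the perl part is skipped if perl is not installed.

```shell
# POSIX regexes are greedy: .* runs to the LAST "two" on the line.
greedy=$(printf 'one a two b two\n' | sed 's/one.*two/X/')
printf '%s\n' "$greedy"

# perl's .*? is non-greedy: it stops at the FIRST "two".
if command -v perl >/dev/null 2>&1; then
  lazy=$(printf 'one a two b two\n' | perl -pe 's/one.*?two/X/')
  printf '%s\n' "$lazy"
fi
```

The sed line should print X, while the perl line should print X b two.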
You can use python this way:
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print(s, end="")'
one
three
This reads all of Python's standard input (sys.stdin.read()), then substitutes "one" for "one.*two" with the dot-matches-all setting enabled (using (?s) at the start of the regular expression) and then prints the modified string (end="" prevents print from adding an extra newline).
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
Sed does match all characters (including the \n) using a dot . but usually it has already stripped the \n off as part of the cycle, so it is no longer present in the pattern space to be matched.
Only certain commands (N,H and G) preserve newlines in the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
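A quick sketch of G and N in action, with made-up input. The N example uses an even number of lines on purpose, because sed implementations differ in what they do when N finds no next line.

```shell
# G appends a newline plus the (empty) hold space, so each line gains a blank line.
spaced=$(printf 'a\nb\n' | sed G)

# N pulls the next line into the pattern space; the embedded newline can then be matched.
joined=$(printf 'one\ntwo\nthree\nfour\n' | sed 'N;s/\n/+/')

printf '%s\n' "$spaced" "$joined"
```

The second command should print one+two and three+four.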
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed operates in a line-by-line manner, so there is no direct way to tell it that dot should match a newline. However, there are some tricks that achieve the same effect. You can use a loop structure (of sorts) to put all the text in the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
$ echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
Note that in this case, (\n|.) matches both newlines and all other characters. See the example below:
$ echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three

OSX sed newlines - why conversion of whitespace to newlines works, but newlines are not converted to spaces

sed on OSX has some quirks. This resource (http://nlfiedler.github.io/2010/12/05/newlines-in-sed-on-mac.html) contains information on how to convert whitespace into a newline:
echo 'foo bar baz quux' | sed -e 's/ /\'$'\n/g'
OR (@ghoti's suggestion which does make it easier to read):
echo 'foo bar baz quux' | sed -e $'s/ /\\\n/g'
However, when I try the reverse - converting newlines to whitespace, it doesn't work:
echo -e "foo\nbar" | sed -e 's/\'$'\n/ /g'
A more straightforward approach of just changing \n doesn't work either:
echo -e "foo\nbar" | sed -e 's/\n/ /g'
There's a related answer here: https://superuser.com/questions/307165/newlines-in-sed-on-mac-os-x, with a detailed answer by Spiff (right at the end of the page), however applying the same logic didn't resolve the problem.
Here's one way that does work on OSX (via http://www.benjiegillam.com/2011/09/using-sed-to-replace-newlines/):
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'
However, I am still curious why reversing the original approach doesn't work.
UPDATE: here's how to make it work with two lines (the solution is to use N to embed the newline characters):
echo -e "foo\nbar\n" | sed -e 'N;s/\n/ /g'
AN ALTERNATIVE SOLUTION (see full answer by @ghoti for detailed explanation):
echo -e "foo\nbar\n" | sed -n '1h;2,$H;${;x;s/\n/ /gp;}'
However, this solution appears to be a tiny bit slower than the one suggested in the question statement (note order of these commands matters, so it might make sense to try testing them in different orders):
time seq 10000 | sed -n '1h;2,$H;${;x;s/\n/ /gp;}' > /dev/null
time seq 10000 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' > /dev/null
Your question appears to be "why doesn't the reverse of the original approach [of converting spaces to newlines] work?".
In sed, the newline is more of a record separator than part of the line. Consider that $, which anchors at the end of the pattern space, comes after the last character of the line; it is not the newline that terminates each line.
Sed commands that utilize newlines, like H and N and even s, do so outside the scope of newline-as-record-separator. The records you're substituting are between the newlines.
In order to substitute a newline, then, you need to get it INSIDE the pattern space, using N, H, etc.
So here's an option.
printf 'foo\nbar\nbaz\n' | sed -n '1h;2,$H;${;x;s/\n/ /gp;}'
The idea is that we'll append all our lines to the hold buffer, then at the end of the file, move the hold buffer back to the pattern space for substitution, and replace the newlines with spaces all at once.
The 1h;2,$H construction avoids a blank at the beginning of your output, caused by the newline that is appended before each line of data with H.
The GNU manual page for sed includes:
REGULAR EXPRESSIONS
POSIX.2 BREs should be supported, but they aren't completely because of performance problems. The \n sequence in a regular expression matches the newline character, and similarly for \a, \t, and other sequences.
The Mac OS X manual page for sed includes:
Sed Regular Expressions
The regular expressions used in sed, by default, are basic regular expressions (BREs, see re_format(7) for more information), but extended (modern) regular expressions can be used instead if the -E flag is given. In addition, sed has the following two additions to regular expressions:
In a context address, any character other than a backslash (\) or newline character may be used to delimit the regular expression. Also, putting a backslash character before the delimiting character causes the character to be treated literally. For example, in the context address \xabc\xdefx, the RE delimiter is an x and the second x stands for itself, so that the regular expression is abcxdef.
The escape sequence \n matches a newline character embedded in the pattern space. You cannot, however, use a literal newline character in an address or in the substitute command.
What these don't say, but what seems to be the case, is that in the s/regex/new/ command, the regex section is a regular expression, but the new section is not. In the replacement material, you have to use \ followed by a newline to embed a newline. In the search material (regex), you can use \n.
Note also that sed works on lines. By default, the newline at the end of the pattern space is pretty much unmatchable except with the regex metacharacter $; you can't simply remove that newline by matching it. You can, however, end up with multiple lines in the pattern space, and then you can match embedded newlines with the \n pattern.
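A minimal sketch of that asymmetry, using made-up input. Both forms are intended to work with BSD/macOS sed as well as GNU sed: \n on the regex side, and a backslash followed by a real newline on the replacement side.

```shell
# Regex side: \n matches the newline that N embedded in the pattern space.
joined=$(printf 'foo\nbar\n' | sed 'N;s/\n/ /')

# Replacement side: a backslash followed by an actual newline inserts a newline.
split=$(printf 'foo bar\n' | sed 's/ /\
/')

printf '%s\n' "$joined" "$split"
```

The first command should print foo bar on one line, and the second should print foo and bar on separate lines.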
A couple of alternatives, that I tend to fall back on when stymied by OSX sed peculiarities, are tr and perl.
echo -e "foo\nbar" | tr '\n' ' '
foo bar
echo -e "foo\nbar" | perl -pe 's/\n/ /'
foo bar

How to pass special characters through sed

I want to pass this command in my script:
sed -n -e "/Next</a></p>/,/Next</a></p>/ p" file.txt
This command (should) extract all text between the two matched patterns, which are both Next</a></p> in my case. However when I run my script I keep getting errors. I've tried:
sed -n -e "/Next\<\/a\>\<\/p\>/,/Next<\/a\>\<\/p>/ p" file.txt with no luck.
I believe the generic pattern for this command is this:
sed -n -e "/pattern1/,/pattern2/ p" file.txt
I can't get it working for Next</a></p> though and I'm guessing it has something to do with the special characters I am encasing. Is there any way to pass Next</a></p> in the sed command? Thanks in advance guys! This community is awesome!
You don't need to use / as a regular expression delimiter. Using a different character will make quoting issues slightly easier. The syntax is
\cregexc
where c can be any character (other than \) that you don't use in the regex. In this case, : might be a good choice:
sed -n -e '\:Next</a></p>:,\:Next</a></p>: p' file.txt
Note that I changed " to ' because inside double quotes, \ will be interpreted by bash as an escape character, whereas inside single quotes \ is just treated as a regular character. Consequently, you could have written the version with escaped slashes like this:
sed -n -e '/Next<\/a><\/p>/,/Next<\/a><\/p>/ p' file.txt
but I think the version with colons is (slightly) easier to read.
You need to escape the forward slashes inside the regular expressions with a \, since the forward slashes serve as delimiters for the regexes
sed -n -e '/Next<\/a><\/p>/,/Next<\/a><\/p>/p' file.txt
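For a self-contained check, here is a sketch with a made-up HTML fragment (the variable names and markup are only for illustration). It shows the \: address delimiter, and the same idea for the s command, where any character after s becomes the delimiter.

```shell
html='<p>intro</p>
<a href="#">Next</a></p>
<p>body</p>
<a href="#">Next</a></p>
<p>footer</p>'

# Address range with : as the delimiter, so the slashes need no escaping.
between=$(printf '%s\n' "$html" | sed -n '\:Next</a></p>:,\:Next</a></p>: p')
printf '%s\n' "$between"

# The s command accepts an alternative delimiter too, here |.
stripped=$(printf '%s\n' 'Next</a></p>' | sed 's|</a></p>||')
printf '%s\n' "$stripped"
```

The range should print the three lines from the first Next line through the second one, and the substitution should print Next.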

Resources