Why does this sed command not match whitespace? - bash

This bash script is supposed to remove leading whitespace from grep results:
#!/bin/bash
grep --color=always "$@" | sed -r -e 's/:[[:space:]]*/:/'
But it doesn't match the whitespace. If I change the substitution text to "-", that shows up in the output, but it still never removes the whitespace. I've tried it without the "*", escaping the "*", with "+", etc, but nothing works. Does anyone know why not?
(I'm using sed version 4.2.1 on Ubuntu 12.04.)
Thanks all, this is my modified script, which shows grep color and also trims leading blanks:
#!/bin/bash
grep --color=always "$@" | sed -r -e 's/[[:space:]]+//'

You need to remove the --color option for this to work. The color codes confuse sed:
grep "$@" | sed -r -e 's/:[[:space:]]*/:/'

The color information output by grep takes the form of special character sequences (see answers to this StackOverflow question). If the colon is colored and the whitespace isn't, or vice versa, one of these sequences sits between them, so sed does not see the colon and the whitespace as adjacent characters.
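A minimal sketch of the effect described above, using a hand-built line rather than real grep output. The escape sequences used here (ESC[36m, ESC[m, ESC[K) are an assumption about what a typical GNU grep emits around the separator, and the cleanup expression only handles sequences that end in m or K.

```shell
# Build a line that mimics colored grep output (assumed sequences, not captured from grep).
esc=$(printf '\033')
colored="file.txt${esc}[36m:${esc}[m${esc}[K   hello"

# Naive attempt: the colon is followed by ESC, not whitespace, so nothing is removed.
naive=$(printf '%s\n' "$colored" | sed -E 's/:[[:space:]]*/:/')

# Strip the color sequences first, then trim the whitespace after the colon.
clean=$(printf '%s\n' "$colored" | sed -E "s/${esc}\[[0-9;]*[mK]//g; s/:[[:space:]]*/:/")

printf '%s\n' "$clean"
```

Note that this variant discards the colors; it is only meant to show why the original pattern never sees the colon and the whitespace side by side.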

In GNU sed, the character class \s matches whitespace characters such as space and tab.
For example:
$ sed -e "s/\s\{3,\}/  /g" inputFile
will substitute every sequence of at least 3 whitespace characters with two spaces.

grep --color=always "$@" | sed 's/^ *//'
Removes leading spaces.


Convert text from HttpStatus.NOT_FOUND into status().isNotFound() in bash

I want to convert the text in a bash variable i.e. HttpStatus.NOT_FOUND into status().isNotFound() and I had accomplished this by using sed:
result=HttpStatus.NOT_FOUND
result=$(echo $result | cut -d'.' -f2- | sed -r 's/(^|_)([A-Z])/\L\2/g' | sed -E 's/([[:lower:]])|([[:upper:]])/\U\1\L\2/g')
echo "status().is$result()"
Output:
status().isNotFound()
As you can see here I'm using 2 sed commands.
Is there a way to achieve the same result using 1 sed or any other simpler way?
Since it involves a lot of new text insertion in the replacement part, the sed command can be written in detail as below. Just pass the variable content over a pipe without using cut
result=HttpStatus.NOT_FOUND
echo "$result" |
sed -E 's/^.*(Status)\.([[:upper:]])([[:upper:]]+)_([[:upper:]])([[:upper:]]+)$/\L\1().is\u\2\L\3\u\4\L\5()/g'
The idea is to apply the case conversion functions of GNU sed to the captured groups. So we capture:
(Status) in \1, which we lowercase entirely and then append ().is to the result.
The next captured group, \2, is the first uppercase character following the ., which is N; the rest of the word, OT, is in \3. We keep the former as is and lowercase the latter.
The same sequence is repeated for the next word, FOUND, in \4 and \5.
\L and \u are case conversion operators available in GNU sed.
If you are looking to modify only the part beyond the . to CamelCase, then you can use sed as
result=HttpStatus.NOT_FOUND
result=$(echo "$result" |
sed -E 's/^.*\.([[:upper:]])([[:upper:]]+)_([[:upper:]])([[:upper:]]+)/\u\1\L\2\u\3\L\4/g')
echo "status().is$result()"
This might work for you (GNU sed):
<<<"$result" sed -r 's/.*(Status)\.(.*)_(.*)/\L\1().is\u\2\u\3()/'
Use pattern matching/grouping/back references. The majority of the RHS is lowercase, so use the \L metacharacter to convert from Status... to lowercase and uppercase just the start of words using \u which converts only the next character to uppercase.
N.B. \L and likewise \U convert all following characters to lowercase/uppercase until \E or the other of \U/\L; \l and \u only interrupt this for the next character.
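As a small illustration of how \u and \L interact, here is a sketch that converts only the NOT_FOUND part. It assumes GNU sed (the case conversion escapes are a GNU extension), and the variable names word and camel are made up for this example.

```shell
# GNU sed only: \u uppercases the next character,
# \L lowercases everything after it until \E or the end of the replacement.
word=NOT_FOUND
camel=$(printf '%s\n' "$word" | sed -E 's/(^|_)([A-Z])([A-Z]*)/\u\2\L\3/g')
printf 'status().is%s()\n' "$camel"
```

Each match covers one underscore-separated word, so the replacement is applied once for NOT and once for _FOUND, and the underscore is dropped because \1 is not used in the replacement.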
Since you are using GNU sed (-r switch), here's another sed solution,
just a little bit more concise, and locale safe:
$ result=HttpStatus.NOT_FOUND
$ echo "$result" | sed -r 's/^.*([A-Z][a-z]*)\.([a-zA-Z])([a-zA-Z]*)_([a-zA-Z])([a-zA-Z]*)/\L\1().is\u\2\L\3\U\4\L\5()/'
status().isNotFound()
An even more concise way of sed is:
echo "$result" | sed -r 's/^.*([A-Z][a-z]*)\.([a-zA-Z]*)_([a-zA-Z]*)/\L\1().is\u\2\u\3()/'
They both are case insensitive for the second part, for example .nOt_fOuNd also works here.
And a GNU awk solution:
echo "$result" | awk 'function cap(str){return (toupper(substr(str,1,1)) tolower(substr(str,2)))}match($0, /([A-Z][a-z]*)\.([a-zA-Z]*)_([a-zA-Z]*)/, m){print tolower(m[1]) "().is" cap(m[2]) cap(m[3]) "()"}'
You can use the sed option "-e" to concatenate multiple expressions.
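For instance, the two substitutions from the question can be passed to one sed process with two -e options. This is only a sketch: it assumes GNU sed, and it uses a shell parameter expansion in place of cut (the names name and camel are made up here).

```shell
result=HttpStatus.NOT_FOUND
name=${result#*.}   # drop everything up to the first dot, replacing cut

# First expression: lowercase the first letter of each word and drop the underscore.
# Second expression: swap the case of every letter.
camel=$(printf '%s\n' "$name" |
  sed -r -e 's/(^|_)([A-Z])/\L\2/g' -e 's/([[:lower:]])|([[:upper:]])/\U\1\L\2/g')

echo "status().is$camel()"
```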

Removing punctuation and tabs with sed

I am using the following to remove punctuation, tabs, and convert uppercase text to lowercase in a text file.
sed 's/[[:punct:]]//g' $HOME/file.txt | sed $'s/\t//g' | tr '[:upper:]' '[:lower:]'
Do I need to use these two separate sed commands to remove punctuation and tabs or can this be done with a single sed command?
Also, could someone explain what the $ is doing in the second sed command? Without it the command doesn't remove tabs. I looked in the man page but I didn't see anything that mentioned this.
The input file looks like this:
Pochemu oni ne v shkole?
Kto tam?
Otkuda eto moloko?
Chei chai ona p’et?
Kogda vy chitaete?
Kogda ty chitaesh’?
A single sed with multiple -e expressions can do all of this. For FreeBSD sed:
sed -e $'s/\t//g' -e 's/[[:punct:]]//g' -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' file
The y command is documented in the man page as:
[2addr]y/string1/string2/
Replace all occurrences of characters in string1 in the pattern
space with the corresponding characters from string2.
With GNU sed, the \L case conversion escape should work fine for the lowercase conversion:
sed -e $'s/\t//g' -e "s/[[:punct:]]\+//g" -e "s/./\L&/g"
$'' is a bash quoting mechanism to enable ANSI C-like escape sequences.
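A short sketch of the difference, assuming bash (or another shell that supports $'...' quoting). The variable names plain and ansi are made up for this illustration.

```shell
#!/usr/bin/env bash
# Requires bash (or another shell with $'...' quoting).
plain='s/\t//g'     # 7 characters: sed receives a backslash followed by t
ansi=$'s/\t//g'     # 6 characters: bash has already turned \t into a real tab
printf '%s %s\n' "${#plain}" "${#ansi}"

# With a real tab in the script, the substitution does not depend on
# whether sed itself understands \t.
printf 'a\tb\n' | sed $'s/\t//g'
```

A sed that does not interpret \t in a regular expression, as the question describes, only removes tabs when it receives the second form.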

OSX sed newlines - why conversion of whitespace to newlines works, but newlines are not converted to spaces

sed on OSX has some quirks. This resource (http://nlfiedler.github.io/2010/12/05/newlines-in-sed-on-mac.html) contains information on how to convert whitespace into a newline:
echo 'foo bar baz quux' | sed -e 's/ /\'$'\n/g'
OR (@ghoti's suggestion which does make it easier to read):
echo 'foo bar baz quux' | sed -e $'s/ /\\\n/g'
However, when I try the reverse - converting newlines to whitespace, it doesn't work:
echo -e "foo\nbar" | sed -e 's/\'$'\n/ /g'
A more straightforward approach of just changing \n doesn't work either:
echo -e "foo\nbar" | sed -e 's/\n/ /g'
There's a related answer here: https://superuser.com/questions/307165/newlines-in-sed-on-mac-os-x, with a detailed answer by Spiff (right at the end of the page), however applying the same logic didn't resolve the problem.
Here's one way that does work on OSX (via http://www.benjiegillam.com/2011/09/using-sed-to-replace-newlines/):
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'
However, I am still curious why reversing the original approach doesn't work.
UPDATE: here's how to make it work with two lines (the solution is to use N to embed the newline characters):
echo -e "foo\nbar\n" | sed -e 'N;s/\n/ /g'
AN ALTERNATIVE SOLUTION (see full answer by @ghoti for detailed explanation):
echo -e "foo\nbar\n" | sed -n '1h;2,$H;${;x;s/\n/ /gp;}'
However, this solution appears to be a tiny bit slower than the one suggested in the question statement (note order of these commands matters, so it might make sense to try testing them in different orders):
time seq 10000 | sed -n '1h;2,$H;${;x;s/\n/ /gp;}' > /dev/null
time seq 10000 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' > /dev/null
Your question appears to be "why doesn't the reverse of the original approach [of converting spaces to newlines] work?".
In sed, the newline is more of a record separator than part of the line. Consider that $, which matches the empty string at the end of the pattern space, comes after the last character of the line; it does not match a newline at the end of every line.
Sed commands that utilize newlines, like H and N and even s, do so outside the scope of newline-as-record-separator. The records you're substituting are between the newlines.
In order to substitute a newline, then, you need to get it INSIDE the pattern space, using N, H, etc.
So here's an option.
printf 'foo\nbar\nbaz\n' | sed -n '1h;2,$H;${;x;s/\n/ /gp;}'
The idea is that we'll append all our lines to the hold buffer, then at the end of the file, move the hold buffer back to the pattern space for substitution, and replace the newlines with spaces all at once.
The 1h;2,$H construction avoids a blank at the beginning of your output, caused by the newline that is appended before each line of data with H.
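The point about getting the newline inside the pattern space can be checked with a small sketch like the following (the variable names unchanged and joined are made up for this example):

```shell
# Line by line, the pattern space never contains a newline, so this changes nothing.
unchanged=$(printf 'foo\nbar\nbaz\n' | sed 's/\n/ /g')

# N appends the next line (with an embedded newline) to the pattern space;
# the loop repeats until the last line, then the substitution sees every newline.
joined=$(printf 'foo\nbar\nbaz\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g')

printf '%s\n' "$unchanged" "$joined"
```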
The GNU manual page for sed includes:
REGULAR EXPRESSIONS
POSIX.2 BREs should be supported, but they aren't completely because of performance problems. The \n sequence in a regular expression matches the newline character, and similarly for \a, \t, and other sequences.
The Mac OS X manual page for sed includes:
Sed Regular Expressions
The regular expressions used in sed, by default, are basic regular expressions (BREs, see re_format(7) for more information), but extended (modern) regular expressions can be used instead if the -E flag is given. In addition, sed has the following two additions to regular expressions:
In a context address, any character other than a backslash (\) or newline character may be used to delimit the regular expression. Also, putting a backslash character before the delimiting character causes the character to be treated literally. For example, in the context address \xabc\xdefx, the RE delimiter is an x and the second x stands for itself, so that the regular expression is abcxdef.
The escape sequence \n matches a newline character embedded in the pattern space. You cannot, however, use a literal newline character in an address or in the substitute command.
What these don't say, but what seems to be the case, is that in the s/regex/new/ command, the regex section is a regular expression, but the new section is not. In the replacement material, you have to use \ followed by a newline to embed a newline. In the search material (regex), you can use \n.
Note also that sed works on lines. By default, the newline at the end of the pattern space is pretty much unmatchable except with the regex metacharacter $; you can't simply remove that newline by matching it. You can, however, end up with multiple lines in the pattern space, and then you can match embedded newlines with the \n pattern.
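A sketch of the replacement-side rule described above: a backslash followed by an actual newline inside the script. This form avoids the $'...' quoting used in the question, so it should not depend on bash; the variable name split is made up here.

```shell
# Replacement side: a backslash followed by a real newline inserts a newline.
split=$(printf 'foo bar baz\n' | sed 's/ /\
/g')
printf '%s\n' "$split"
```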
A couple of alternatives, that I tend to fall back on when stymied by OSX sed peculiarities, are tr and perl.
echo -e "foo\nbar" | tr '\n' ' '
foo bar
echo -e "foo\nbar" | perl -pe 's/\n/ /'
foo bar

understanding SED commands

I need to understand a shell code which uses the following command to fetch directions from a source to destination using GOOGLE MAPS API:
wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false > new.txt
Next we fetch the following line of the output:
"html_instructions" : "Head \u003cb\u003enorthwest\u003c/b\u003e"
grep -n html_instructions new.txt > new1.txt
Can somebody please tell me the meaning of using:
sed -e 's/\\u003cb//g'
etc in the following command:
sed -e 's/\\u003cb//g' -e 's/\\u003e//g' -e 's/\\u003c\/b//g' -e 's/\\u003c//g' -e 's/div.*div//g' -e 's/.*://g' -e 's/"//g' -e 's/ "//g' new1.txt > new2.txt
Which outputs Head northwest only.
Thanks in advance!
sed -e 's/\\u003cb//g' -e 's/\\u003e//g' -e 's/\\u003c\/b//g' -e 's/\\u003c//g' -e 's/div.*div//g' -e 's/.*://g' -e 's/"//g' -e 's/ "//g' new1.txt > new2.txt
The string after each -e is a sed command. The sed command s/\\u003cb//g searches for all occurrences of the literal text \u003cb and replaces it with nothing. In the pattern, \\ matches a single literal backslash; sed does not decode \u003c as a Unicode escape. In the JSON output, \u003c is the escaped form of < and \u003e is the escaped form of >, so \u003cb\u003e is an escaped <b> tag. In other words, the command removes the escaped start of a <b> tag from the line.
Thus, that sed command removes every occurrence of the literal sequences \u003cb, \u003e, \u003c/b, and \u003c from the lines in new1.txt and sends the output to new2.txt.
Additionally, s/div.*div//g removes any text that begins and ends with "div". The command s/.*://g removes everything from the beginning of the line through the last colon in the line. s/"//g removes every occurrence of the double-quote character. s/ "//g removes every occurrence of a space followed by a double quote.
In general, the sed command s/old/new/ searches each line for the first occurrence of old and replaces it with new. With a g appended at the end, as in s/old/new/g, it makes the substitution globally: it looks for every occurrence of old and replaces it with new. Adding a lot of power to these commands, old may be a regular expression. Consider s/.*://g. The dot character has the special meaning of "any character at all". The star character means zero or more of the preceding character. Thus the regular expression .*: means zero or more of any character followed by a colon. Because .* is greedy, the match extends to the last colon on the line.
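To see the individual substitutions at work, here is a simplified sketch on the sample line from the question. It is not the original command: it keeps only the expressions needed for this line, and it uses s/.*: *// (an assumption made here) so that the space after the colon is removed as well.

```shell
# Sample line from the question; single quotes keep the backslashes literal.
line='"html_instructions" : "Head \u003cb\u003enorthwest\u003c/b\u003e"'

# \\ in each pattern matches one literal backslash in the input.
out=$(printf '%s\n' "$line" |
  sed -e 's/\\u003cb//g' \
      -e 's/\\u003e//g' \
      -e 's/\\u003c\/b//g' \
      -e 's/.*: *//' \
      -e 's/"//g')

printf '%s\n' "$out"
```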
You can take all in one go with awk:
awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}'
Head northwest
So whole line should be:
wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false | awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}'
Head northwest
to get it into a variable
d=$(wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false | awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}')
echo $d
Head northwest

How to ensure I have exactly 2 spaces before string and zero spaces after

I get a string that can have from zero to multiple leading and trailing spaces.
I'm trying to get rid of them without lot of hackery but my code looks huge.
How to do this in a clean way?
as easy as:
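$ src="   some text   "
$ dst="  $(echo $src)"
$ echo ":$dst:"
:  some text:
$(echo $src) gets rid of all surrounding spaces, because the unquoted expansion is word-split and echo rejoins the words with single spaces (this also collapses runs of spaces inside the text).
Then you simply add as many spaces as you need before it.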
How are you calling out the string? If it's an echo you can just put
Echo "<2 spaces>". "string";
If it's a normal string, you just put 2 spaces between the first quote and the string.
"<2spaces> string here"
One way using GNU sed:
sed 's/^[ \t]*/  /; s/[ \t]*$//' file.txt
You can apply this to a bash variable like this:
echo "$string" | sed 's/^[ \t]*/  /; s/[ \t]*$//'
And save it like this:
variable=$(echo "$string" | sed 's/^[ \t]*/  /; s/[ \t]*$//')
Explanation:
The first substitution removes all leading whitespace and replaces it with two spaces.
The second substitution simply removes all trailing whitespace from a line.
The simplest is probably to use an external process.
value=$(echo "$value" | sed 's/^ *\(.*[^ ]\) *$/  \1/')
If you need to transform an empty string into two spaces, you'll need to modify the regex; and if you're not on Linux, your sed dialect may differ slightly. For maximum portability, switch to awk or Perl, or do it all in Bash. That gets a bit more complex, but for a start, trailing=${value##*[! ]} contains any trailing spaces, and you can trim them off with ${value%$trailing}, and similarly for leading spaces. See the section on variable substitution in the Bash manual for details.
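The parameter expansion approach mentioned above could look roughly like this. It is a sketch that only treats the space character as whitespace; the variable names trailing and leading follow the naming used in the text.

```shell
value='   some text   '

# Everything after the last non-space character is the trailing run of spaces.
trailing=${value##*[! ]}
value=${value%"$trailing"}

# Everything before the first non-space character is the leading run of spaces.
leading=${value%%[! ]*}
value=${value#"$leading"}

# Prepend exactly two spaces.
value="  $value"
printf ':%s:\n' "$value"
```

No external process is started, and an empty or all-space input ends up as just the two leading spaces.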
You can use a regular expression to match everything between the leading and trailing spaces. The matched text is found in the BASH_REMATCH array (the text matching the first parentheses group is in element 1).
spcs='\ *'
text='.*[^ ]'
[[ $src =~ ^$spcs($text)$spcs$ ]]
dst="  ${BASH_REMATCH[1]}"
