How to make bash regex'es more readable? - bash

Here is an example of nicely indented Python regex (taken from here):
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
0[0-7]+ # Octal form
| [0-9]+ # Decimal form
| x[0-9a-fA-F]+ # Hexadecimal form
)
; # Trailing semicolon
""", re.VERBOSE)
Now, I would like to use the same technique for bash regexes (i.e. sed or grep), but can't find any reference to similar features so far. Is it even possible to indent (and comment) something like this?
echo "$MULTILINE | sed -re 's/(expr1|expr2)|(expr3|expr4)/expr5/g'

You can use bash's line continuation, possibly:
echo "start of a line \
continues the previous line \
yet another continuation
oops. this is a brand new line"
Note the backslashes at the end of the first two lines. they essentially 'escape' the newline/linebreak that would otherwise tell bash you're starting a new line, which also implicitly terminate the statement being defined.

Related

How to make sed avoid replacement after specific symbol

I am writing a script for formatting a Fortran source code.
Simple formatting, like having all keywords in capitals or in small letters, etc.
Here is the main command
sed -i -e "/^\!/! s/$small\s/$cap /gI" $filein
It replaces every keyword $small (followed by a space) by a keyword $caps. And the replacement happens only if the line does not start with the "!".
It does what it should. Question:
How to avoid replacement if "!" is encountered in the middle of a line.
Or more generally, how to replace patterns everywhere, but not after a specific symbol, which can be either in the beginning of the line or somewhere else.
Example:
Program test ! It should not change the next program to caps
! Hi, is anything changing here? like program ?
This line does not have any key words
This line has Program and not exclamation mark.
"program" is a keyword. After running the script the result is:
PROGRAM test ! It should not change the next PROGRAM to caps
! Hi, is anything changed here? like program ?
This line does not have any key words
This line has PROGRAM and not exclamation mark.
I want:
PROGRAM test ! It should not change the next program to caps
! Hi, is anything changed here? like program ?
This line does not have any key words
This line has PROGRAM and not exclamation mark.
So far, I've failed to find a nice solution, which does the trick, hopefully with the sed command.
The typicall way in sed is to:
split the string into two parts - save one part in hold space.
do operations on pattern space
get hold space and shuffle for output.
Would be something along:
sed '/!/!b;/[^!]/{b};h;s/.*!//;x;s/!.*//;s/program/PROGRAM/gI;G;s/\n/!/'
/!/!b; - if the line has no !, then print it and start over.
h;s/.*!//;x;s/!.*// - put part after ! in hold space, part before ! in pattern space
s/program/PROGRAM/gI; - do the substitution on part of the string
G;s/\n/!/ - grab the part from hold space and shuffle output - it's easy here.
Assumptions:
OP needs to convert multiple keywords to uppercase
keywords to be capitalized do not include white space (eg, program name will need to be processed as two separate strings program and name)
input delimiter is white space
keywords with 'attached' non-alphanums will be ignored (eg, Program, will be ignored since , will be picked up as part of string) unless OP specifically includes the non-alphanum as part of the keyword definition (eg, keywords includes Program,)
all keywords to be converted to uppercase (ie, not going to worry about any flags to switch between lowercase, uppercase, camelcase, etc)
Sample input data:
$ cat source.txt
Program test ! It should not change the next program to caps # change first 'Program'
! Hi, is anything changing here? like program or MarK? # change nothing
This line does not have any key words ! except here - pRoGraM Mark # change nothing
This line has Program and not exclamation mARk plus MarKer. # change 'Program' and 'mARk' but not MarKer
Hi, hi, hI # change 'Hi,' and 'hi,' but not 'hI'
List of keywords provided in a separate file (whitespace delimited);
$ cat keywords.dat
program
mark hi, # 2 separate keywords: 'mark' and 'hi,' (comma included)
One awk idea:
awk -v comment="!" ' # define character after which conversions are to be ignored
FNR==NR { for ( i=1; i<=NF; i++) # first file contains keywords; process each field as a separate keywork
keywords[toupper($i)] # convert to uppercase and use as index in associative array keywords[]
next
}
{ for ( i=1; i<=NF; i++ ) # second file, process each field separately
{ if ( $i == comment ) # if field is our comment character then stop processing rest of line else ...
break
if ( toupper($i) in keywords ) # if current field is a keyword then convert to uppercase
$i=toupper($i)
}
print # print the current line
}
' keywords.dat source.txt
This generates:
PROGRAM test ! It should not change the next program to caps
! Hi, is anything changing here? like program or MarK?
This line does not have any key words ! except here - pRoGraM Mark
This line has PROGRAM and not exclamation MARK plus MarKer.
HI, HI, hI
NOTES:
while GNU awk can be told to overwrite the input file (eg, awk -i inplace == sed -i), this will require a different approach for processing the keywords.dat file (to keep from overwriting with nothing)
(quite a bit) of additional logic could be added to support uppercase vs lowercase vs camelcase vs whatever ... ignore or include non-alphanums in comparisons ... using multiple/different 'comment' characters ... standardizing other portions of (Fortran) code (eg, indentation) ... etc
This might work for you (GNU sed):
small='Program ' caps='PROGRAM '
sed -E ':a;s/^([^!]*)('"$small"')/\1\n/;ta;s/\n/'"$caps"'/g' file
Replace any occurrence of the variable $small before the symbol ! with a newline, then replace all newlines by the variable $caps.
N.B. The newline is chosen because it can not normally exist in any line presented by sed as it is the delimiter sed uses to present lines in the pattern space. Secondly, the words matching $small are iteratively replaced by a newline, then all newlines globally replaced by $caps. This allows for the replacement to by a superset of the first. If this were not the order of operations, the iterative process may become an endless loop.
If $small is to represent a case insensitive match, add the i flag to the first substitution.
I've tried suggested options, but all of them did not work as expected for the whole file.
I have ended up with multiple sed commands; I am sure that it is not the best solution, but it works for me and does what I need.
My main problem was to avoid replacement after "!" if it appears somewhere in the middle of the line.
So I switched this problem to the one I could handle.
sed -i -e "/^\!/! s/!/!c7u!!c7u!/" $filein # 1. If a line does NOT start with !, search next "!" and replace it with "!c7u!!c7u!"
sed -i "s/!c7u!/\n/" $filein # 2. Move that comment to a new line
for ((i=0; i<$nwords; i++ )); do # Loop through all keywords
word=${words[$i]} # Take a keyword from the list
small=${word,,} # Write it in small letters
cap=${word^^} # Write it in capitals
sed -i -e "/^\!/! s/$small\b/$cap/gI" $filein # 3. Actual replacement in lines not starting with "!"
done
sed -i -e :a -e '$!N;s/\n!c7u//;ta' -e 'P;D' $filein # 4. Undo step 1-2, moving inline comments back

What is the function of this sed shell script?

I was reading this example about setting up a cluster with pgpool and Watchdog and decided to give it a try as an exercise.
I'm far from being a master of shell scripting, but I could follow the documentation and modify it according to the settings of my virtual machines. But I don't get what is the purpose of the following snippet:
if [ ${PGVERSION} -ge 12 ]; then
sed -i -e \"\\\$ainclude_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'\" \
-e \"/^include_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'/d\" ${DEST_NODE_PGDATA}/postgresql.conf
fi
In my case PGVERSION will be 12 (so the script will execute the code after the condition), RECOVERYCONF is /usr/local/pgsql/data/myrecovery.conf and DEST_NODE_PGDATA is /usr/local/pgsql/data.
I get (please excuse and correct me if I'm wrong) that -e indicates that a script comes next, the $(some commands) part evaluates the expression and returns the result, and that the sed regular expression indicates that the '/'s will be replaced by \/ (forward slash and slash). What is puzzling me are the "\\\$ainclude_if_exists =" and "/^include_if_exists" parts, I don't know what they mean or what are they intended for, nor how they interact. Also, the -e after the first sed regular expression is confusing me.
If you are interested in the context, those commands are near the end of the /var/lib/pgsql/11/data/recovery_1st_stage example script.
Thanks in advance for your time.
Here's a tiny representation of the same code:
sed -i -e '$amyvalue = foo' -e '/^myvalue = foo/d' myfile.txt
The first sed expression is:
$ # On the last line
a # Append the following text
myvalue = foo # (text to be appended)
The second is:
/ # On lines matching regex..
^myvalue = foo # (regex to match)
/ # (end of regex)
d # ..delete the line
So it deletes any myvalue = foo that may already exists, and then adds one such line at the end. The point is just to ensure that you have exactly one line, by A. adding the line if it's missing, B. not duplicate the line if it already exists.
The rest of the expression is merely complicated by the fact that this snippet uses variables and is embedded in a double quoted string that's being passed to a different host via ssh, and therefore requires some additional escaping both of the variables and of the quotes.

Replacing a parameter in bash using sed

Trying to clean up several dozen redundant nagios config files, but sed isn't working for me (yes I'm fairly new to bash), here's the string I want to replace:
use store-service
host_name myhost
service_description HTTP_JVM_SYM_DS
check_command check_http!'-p 8080 -N -u /SymmetricDS/app'
check_interval 1
with this:
use my-template-service
host_name myhost
just the host_name should stay unchanged since it'll be different for each file. Any help will be greatly appreciated. Tried escaping the ' and !, but get this error -bash: !'-p: event not found
Thanks
Disclaimer: This question is somewhat light on info and rings a bit like "write my code for me". In good faith I'm assuming that it's not that, so I am answering in hopes that this can be used to learn more about text processing/regex substitutions in general, and not just to be copy-pasted somewhere and forgotten.
I suggest using perl instead of sed. While sed is often the right tool for the job, in this case I think Perl's better, for the following reasons:
Perl lets you easily do multi-line matches on a regex. This is possible with sed, but difficult (see this question for more info).
With multiple lines and complex delimiters and quote characters, sed starts to display different behavior depending on what platform you're using it on. For example, trying to do this with sed in "sorta multiline" mode gave me different results on OSX versus Linux (really GNU sed vs BSD sed). When using semi-advanced functionality like that, I'd stick with a tool that behaves consistently across platforms, which Perl does in this case.
Perl lets you deal with ASCII values and other special characters without a ton of "toothpick tower" escaping or subshelling. Since it's convenient to use ASCII values to match the single quotes in your pattern (we could use mixed double and single quotes instead, but that makes it harder to copy/paste this command into, say, a subshell or an eval'd part of a script), it's better to use a tool that supports this without extra hassle. It's possible with sed, but tricky; see this article for more info.
In sed/BRE, doing something as simple as a "one or more" match usually requires escaping special characters, aka [[:space:]]\{1,\}, which gets tedious. Since it's convenient to use a lot of repetition/grouping characters in this pattern, I prefer Perl for conciseness in this case, since it improves clarity of the matching code.
Perl lets you write comments in regex statements in one-liner mode via the x modifier. For big, multiline patterns like this one, having the pattern broken up and commented for readability really helps if you ever need to go back and change it. sed has comments too, but using them in single-pasteable-command mode (as opposed to a file of sed script code) can be tricky, and can result in less readable commands.
Anyway, following is the matcher I came up with. It's commented inline as much as I can make it, but the non-commented parts are explained here:
The -0777 switch tells perl to consume input files whole before processing them, rather than operating line-by-line. See perlrun for more info on this and the other flags. Thanks to #glennjackman for pointing this out in the comments on the original question!
The -p switch tells Perl to read STDIN until it sees a delimiter (which is end-of-input as set by -0777), run the program supplied, and print that program's return value before shutting down. Since our "program" is just a string substitution statement, its return value is the substituted string.
The -e switch tells perl to evaluate the next string argument for a program to run, rather than finding a script file or similar.
Input is piped from mytext.txt, which could be a file containing your pattern. You could also pipe input to Perl e.g. via cat mytext.txt | perl ... and it would work exactly the same way.
The regex modifiers work as follows: I use the multiline m modifier to match more than one \n-delimited statement, and the extended x modifier so we can have comments and turn off matching of literal whitespace, for clarity. You could get rid of comments and literal whitespace and splat it all into one line if you wanted, but good luck making any changes after you've forgotten what it does. See perlre for more info on these modifiers.
This command will replace the literal string you supplied, in a file that contains it (it can have more than just that string before/after it; only that block of text will be manipulated). It is less than literal in one minor way: it allows any number (one or more) of space characters between the first and second words in each line. If I remember Nagios configs, the number of spaces doesn't particularly matter anyway.
This command will not change the contents of a file it is supplied. If a file does not match the pattern, its contents will be printed out unchanged by this command. If it contains that pattern, the replaced contents will be printed out. You can write those contents to a new file, or do anything you like with them.
perl -0777pe '
# Use the pipe "|" character as an expression delimiter, since
# the pattern contains slashes.
s|
# 'use', one or more space-equivalent characters, and then 'store-service',
# on one line.
use \s+ store-service \n
# Open a capturing group.
(
# Capture the host name line in its entirety, then close the group.
host_name \s+ \S+
# Close the group and end the line.
) \n
service_description \s+ HTTP_JVM_SYM_DS \n
# Look for check_command, spaces, and check_http!, but keep matching on the
# same line.
check_command \s+ check_http!
# Look for a single quote character by ASCII value, since shell
# escaping these can be ugly/tricky, and makes your code less copy-
# pasteable in/out of scripts/subcommands.
\047
# Look for the arguments to check_http, delimited by explicit \s
# spaces, since we are in "extended" mode in order to be able to write
# these comments and the expression on multiple lines.
-p \s 8080 \s -N \s -u \s /SymmetricDS/app
# Look for another single quote and the end of the line.
\047 \n
check_interval \s+ 1\n
# Replace all of the matched text with the "use my-template-service" line,
# followed by the contents of the first matching group (the host_name line).
# You could capture the "use" statement in another group, or use e.g.
# sprintf() to align fields here instead of a big literal space line, but
# this is the simplest, most obvious way to get the replacement done.
|use my-template-service\n$1|mx
' < mytext.txt
Assuming you can glob the files to select on the log files of interest, I would first filter the files that you want to replace to be limited to five lines.
You can do that with Bash and awk:
for fn in *; do # make that glob apply to your files...
[[ -e "$fn" && -f "$fn" && -s "$fn" ]] || continue
line_cnt=$(awk 'FNR==NR{next}
END {print NR}' "$fn")
(( line_cnt == 5 )) || continue
# at this point you only have files with 5 lines of text...
done
Once you have done that, you can add another awk to the loop to make the replacements:
for fn in *; do
[[ -e "$fn" && -f "$fn" && -s "$fn" ]] || continue
line_cnt=$(awk -v l=5 'FNR==NR{next}
END {print NR}' "$fn")
(( line_cnt == 5 )) || continue
awk 'BEGIN{tgt["use"]="my-template-service"
tgt["host_name"]=""}
$1 in tgt { if (tgt[$1]=="") s=$2
else s=tgt[$1]
printf "%-33s%s\n", $1, s
}
' "$fn"
done
This is the GNU sed solution, check it. Backup your files before testing.
#!/bin/bash
# You should escape all special characters in this string (like $, ^, /, {, }, etc),
# which you need interpreted literally, not as regex - by the backslash.
# Your original string was contained only slashes from this list, but
# I decide don't escape them by backslashes, but change sed's s/pattern/replace/
# command to the s|patter|replace|. You can pick any more fittable character.
needle="use\s{1,}store-service\n\
host_name\s{1,}myhost\n\
service_description\s{1,}HTTP_JVM_SYM_DS\n\
check_command\s{1,}check_http!'-p 8080 -N -u /SymmetricDS/app'\n\
check_interval\s{1,}1"
replacement="use my-template-service\n\
host_name myhost"
# This echo command displays the generated substitute command,
# which will be used by sed
# uncomment it for viewing
# echo "s/$needle/$replacement/"
# for changing the file in place add the -i option.
sed -r "
/use\s{1,}store-service/ {
N;N;N;N;
s|$needle|$replacement|
}" input.txt
Input
one
two
use store-service
host_name myhost
service_description HTTP_JVM_SYM_DS
check_command check_http!'-p 8080 -N -u /SymmetricDS/app'
check_interval 1
three
four
Output
one
two
use my-template-service
host_name myhost
three
four

How to split strings over multiple lines in Bash?

How can i split my long string constant over multiple lines?
I realize that you can do this:
echo "continuation \
lines"
>continuation lines
However, if you have indented code, it doesn't work out so well:
echo "continuation \
lines"
>continuation lines
This is what you may want
$ echo "continuation"\
> "lines"
continuation lines
If this creates two arguments to echo and you only want one, then let's look at string concatenation. In bash, placing two strings next to each other concatenate:
$ echo "continuation""lines"
continuationlines
So a continuation line without an indent is one way to break up a string:
$ echo "continuation"\
> "lines"
continuationlines
But when an indent is used:
$ echo "continuation"\
> "lines"
continuation lines
You get two arguments because this is no longer a concatenation.
If you would like a single string which crosses lines, while indenting but not getting all those spaces, one approach you can try is to ditch the continuation line and use variables:
$ a="continuation"
$ b="lines"
$ echo $a$b
continuationlines
This will allow you to have cleanly indented code at the expense of additional variables. If you make the variables local it should not be too bad.
Here documents with the <<-HERE terminator work well for indented multi-line text strings. It will remove any leading tabs from the here document. (Line terminators will still remain, though.)
cat <<-____HERE
continuation
lines
____HERE
See also http://ss64.com/bash/syntax-here.html
If you need to preserve some, but not all, leading whitespace, you might use something like
sed 's/^ //' <<____HERE
This has four leading spaces.
Two of them will be removed by sed.
____HERE
or maybe use tr to get rid of newlines:
tr -d '\012' <<-____
continuation
lines
____
(The second line has a tab and a space up front; the tab will be removed by the dash operator before the heredoc terminator, whereas the space will be preserved.)
For wrapping long complex strings over many lines, I like printf:
printf '%s' \
"This will all be printed on a " \
"single line (because the format string " \
"doesn't specify any newline)"
It also works well in contexts where you want to embed nontrivial pieces of shell script in another language where the host language's syntax won't let you use a here document, such as in a Makefile or Dockerfile.
printf '%s\n' >./myscript \
'#!/bin/sh` \
"echo \"G'day, World\"" \
'date +%F\ %T' && \
chmod a+x ./myscript && \
./myscript
You can use bash arrays
$ str_array=("continuation"
"lines")
then
$ echo "${str_array[*]}"
continuation lines
there is an extra space, because (after bash manual):
If the word is double-quoted, ${name[*]} expands to a single word with
the value of each array member separated by the first character of the
IFS variable
So set IFS='' to get rid of extra space
$ IFS=''
$ echo "${str_array[*]}"
continuationlines
In certain scenarios utilizing Bash's concatenation ability might be appropriate.
Example:
temp='this string is very long '
temp+='so I will separate it onto multiple lines'
echo $temp
this string is very long so I will separate it onto multiple lines
From the PARAMETERS section of the Bash Man page:
name=[value]...
...In the context where an assignment statement is assigning a value to a shell variable or array index, the += operator can be used to append to or add to the variable's previous value. When += is applied to a variable for which the integer attribute has been set, value is evaluated as an arithmetic expression and added to the variable's current value, which is also evaluated. When += is applied to an array variable using compound assignment (see Arrays below), the variable's value is not unset (as it is when using =), and new values are appended to the array beginning at one greater than the array's maximum index (for indexed arrays) or added as additional key-value pairs in an associative array. When applied to a string-valued variable, value is expanded and appended to the variable's value.
You could simply separate it with newlines (without using backslash) as required within the indentation as follows and just strip of new lines.
Example:
echo "continuation
of
lines" | tr '\n' ' '
Or if it is a variable definition newlines gets automatically converted to spaces. So, strip of extra spaces only if applicable.
x="continuation
of multiple
lines"
y="red|blue|
green|yellow"
echo $x # This will do as the converted space actually is meaningful
echo $y | tr -d ' ' # Stripping of space may be preferable in this case
This isn't exactly what the user asked, but another way to create a long string that spans multiple lines is by incrementally building it up, like so:
$ greeting="Hello"
$ greeting="$greeting, World"
$ echo $greeting
Hello, World
Obviously in this case it would have been simpler to build it one go, but this style can be very lightweight and understandable when dealing with longer strings.
Line continuations also can be achieved through clever use of syntax.
In the case of echo:
# echo '-n' flag prevents trailing <CR>
echo -n "This is my one-line statement" ;
echo -n " that I would like to make."
This is my one-line statement that I would like to make.
In the case of vars:
outp="This is my one-line statement" ;
outp+=" that I would like to make." ;
echo -n "${outp}"
This is my one-line statement that I would like to make.
Another approach in the case of vars:
outp="This is my one-line statement" ;
outp="${outp} that I would like to make." ;
echo -n "${outp}"
This is my one-line statement that I would like to make.
Voila!
I came across a situation in which I had to send a long message as part of a command argument and had to adhere to the line length limitation. The commands looks something like this:
somecommand --message="I am a long message" args
The way I solved this is to move the message out as a here document (like #tripleee suggested). But a here document becomes a stdin, so it needs to be read back in, I went with the below approach:
message=$(
tr "\n" " " <<-END
This is a
long message
END
)
somecommand --message="$message" args
This has the advantage that $message can be used exactly as the string constant with no extra whitespace or line breaks.
Note that the actual message lines above are prefixed with a tab character each, which is stripped by here document itself (because of the use of <<-). There are still line breaks at the end, which are then replaced by tr with spaces.
Note also that if you don't remove newlines, they will appear as is when "$message" is expanded. In some cases, you may be able to workaround by removing the double-quotes around $message, but the message will no longer be a single argument.
Depending on what sort of risks you will accept and how well you know and trust the data, you can use simplistic variable interpolation.
$: x="
this
is
variably indented
stuff
"
$: echo "$x" # preserves the newlines and spacing
this
is
variably indented
stuff
$: echo $x # no quotes, stacks it "neatly" with minimal spacing
this is variably indented stuff
Following #tripleee 's printf example (+1):
LONG_STRING=$( printf '%s' \
'This is the string that never ends.' \
' Yes, it goes on and on, my friends.' \
' My brother started typing it not knowing what it was;' \
" and he'll continue typing it forever just because..." \
' (REPEAT)' )
echo $LONG_STRING
This is the string that never ends. Yes, it goes on and on, my friends. My brother started typing it not knowing what it was; and he'll continue typing it forever just because... (REPEAT)
And we have included explicit spaces between the sentences, e.g. "' Yes...". Also, if we can do without the variable:
echo "$( printf '%s' \
'This is the string that never ends.' \
' Yes, it goes on and on, my friends.' \
' My brother started typing it not knowing what it was;' \
" and he'll continue typing it forever just because..." \
' (REPEAT)' )"
This is the string that never ends. Yes, it goes on and on, my friends. My brother started typing it not knowing what it was; and he'll continue typing it forever just because... (REPEAT)
Acknowledgement for the song that never ends
However, if you have indented code, it doesn't work out so well:
echo "continuation \
lines"
>continuation lines
Try with single quotes and concatenating the strings:
echo 'continuation' \
'lines'
>continuation lines
Note: the concatenation includes a whitespace.
This probably doesn't really answer your question but you might find it useful anyway.
The first command creates the script that's displayed by the second command.
The third command makes that script executable.
The fourth command provides a usage example.
john#malkovich:~/tmp/so$ echo $'#!/usr/bin/env python\nimport textwrap, sys\n\ndef bash_dedent(text):\n """Dedent all but the first line in the passed `text`."""\n try:\n first, rest = text.split("\\n", 1)\n return "\\n".join([first, textwrap.dedent(rest)])\n except ValueError:\n return text # single-line string\n\nprint bash_dedent(sys.argv[1])' > bash_dedent
john#malkovich:~/tmp/so$ cat bash_dedent
#!/usr/bin/env python
import textwrap, sys
def bash_dedent(text):
"""Dedent all but the first line in the passed `text`."""
try:
first, rest = text.split("\n", 1)
return "\n".join([first, textwrap.dedent(rest)])
except ValueError:
return text # single-line string
print bash_dedent(sys.argv[1])
john#malkovich:~/tmp/so$ chmod a+x bash_dedent
john#malkovich:~/tmp/so$ echo "$(./bash_dedent "first line
> second line
> third line")"
first line
second line
third line
Note that if you really want to use this script, it makes more sense to move the executable script into ~/bin so that it will be in your path.
Check the python reference for details on how textwrap.dedent works.
If the usage of $'...' or "$(...)" is confusing to you, ask another question (one per construct) if there's not already one up. It might be nice to provide a link to the question you find/ask so that other people will have a linked reference.

How do you escape a user-provided search term that you don't want evaluated for sed?

I'm trying to escape a user-provided search string that can contain any arbitrary character and give it to sed, but can't figure out how to make it safe for sed to use. In sed, we do s/search/replace/, and I want to search for exactly the characters in the search string without sed interpreting them (e.g., the '/' in 'my/path' would not close the sed expression).
I read this related question concerning how to escape the replace term. I would have thought you'd do the same thing to the search, but apparently not because sed complains.
Here's a sample program that creates a file called "my_searches". Then it reads each line of that file and performs a search and replace using sed.
#!/bin/bash
# The contents of this heredoc will be the lines of our file.
read -d '' SAMPLES << 'EOF'
/usr/include
P#$$W0RD$?
"I didn't", said Jane O'Brien.
`ls -l`
~!##$%^&*()_+-=:'}{[]/.,`"\|
EOF
echo "$SAMPLES" > my_searches
# Now for each line in the file, do some search and replace
while read line
do
echo "------===[ BEGIN $line ]===------"
# Escape every character in $line (e.g., ab/c becomes \a\b\/\c). I got
# this solution from the accepted answer in the linked SO question.
ES=$(echo "$line" | awk '{gsub(".", "\\\\&");print}')
# Search for the line we read from the file and replace it with
# the text "replaced"
sed 's/'"$ES"'/replaced/' < my_searches # Does not work
# Search for the text "Jane" and replace it with the line we read.
sed 's/Jane/'"$ES"'/' < my_searches # Works
# Search for the line we read and replace it with itself.
sed 's/'"$ES"'/'"$ES"'/' < my_searches # Does not work
echo "------===[ END ]===------"
echo
done < my_searches
When you run the program, you get sed: xregcomp: Invalid content of \{\} for the last line of the file when it's used as the 'search' term, but not the 'replace' term. I've marked the lines that give this error with # Does not work above.
------===[ BEGIN ~!##$%^&*()_+-=:'}{[]/.,`"| ]===------
sed: xregcomp: Invalid content of \{\}
------===[ END ]===------
If you don't escape the characters in $line (i.e., sed 's/'"$line"'/replaced/' < my_searches), you get this error instead because sed tries to interpret various characters:
------===[ BEGIN ~!##$%^&*()_+-=:'}{[]/.,`"| ]===------
sed: bad format in substitution expression
sed: No previous regexp.
------===[ END ]===------
So how do I escape the search term for sed so that the user can provide any arbitrary text to search for? Or more precisely, what can I replace the ES= line in my code with so that the sed command works for arbitrary text from a file?
I'm using sed because I'm limited to a subset of utilities included in busybox. Although I can use another method (like a C program), it'd be nice to know for sure whether or not there's a solution to this problem.
This is a relatively famous problem—given a string, produce a pattern that matches only that string. It is easier in some languages than others, and sed is one of the annoying ones. My advice would be to avoid sed and to write a custom program in some other language.
You could write a custom C program, using the standard library function strstr. If this is not fast enough, you could use any of the Boyer-Moore string matchers you can find with Google—they will make search extremely fast (sublinear time).
You could write this easily enough in Lua:
local function quote(s) return (s:gsub('%W', '%%%1')) end
local function replace(first, second, s)
return (s:gsub(quote(first), second))
end
for l in io.lines() do io.write(replace(arg[1], arg[2], l), '\n') end
If not fast enough, speed things up by applying quote to arg[1] only once, and inline frunciton replace.
As ghostdog mentioned, awk '{gsub(".", "\\\\&");print}' is incorrect because it escapes out non-special characters. What you really want to do is perhaps something like:
awk 'gsub(/[^[:alpha:]]/, "\\\\&")'
This will escape out non-alpha characters. For some reason I have yet to determine, I still cant replace "I didn't", said Jane O'Brien. even though my code above correctly escapes it to
\"I\ didn\'t\"\,\ said\ Jane\ O\'Brien\.
It's quite odd because this works perfectly fine
$ echo "\"I didn't\", said Jane O'Brien." | sed s/\"I\ didn\'t\"\,\ said\ Jane\ O\'Brien\./replaced/
replaced`
this : echo "$line" | awk '{gsub(".", "\\\\&");print}' escapes every character in $line, which is wrong!. do an echo $ES after that and $ES appears to be \/\u\s\r\/\i\n\c\l\u\d\e. Then when you pass to the next sed, (below)
sed 's/'"$ES"'/replaced/' my_searches
, it will not work because there is no line that has pattern \/\u\s\r\/\i\n\c\l\u\d\e. The correct way is something like:
$ sed 's|\([#$#^&*!~+-={}/]\)|\\\1|g' file
\/usr\/include
P\#\$\$W0RD\$?
"I didn't", said Jane O'Brien.
\`ls -l\`
\~\!\#\#\$%\^\&\*()_\+-\=:'\}\{[]\/.,\`"\|
you put all the characters you want escaped inside [], and choose a suitable delimiter for sed that is not in your character class, eg i chose "|". Then use the "g" (global) flag.
tell us what you are actually trying to do, ie an actual problem you are trying to solve.
This seems to work for FreeBSD sed:
# using FreeBSD & Mac OS X sed
ES="$(printf "%q" "${line}")"
ES="${ES//+/\\+}"
sed -E s$'\777'"${ES}"$'\777'replaced$'\777' < my_searches
sed -E s$'\777'Jane$'\777'"${line}"$'\777' < my_searches
sed -E s$'\777'"${ES}"$'\777'"${line}"$'\777' < my_searches
The -E option of FreeBSD sed is used to turn on extended regular expressions.
The same is available for GNU sed via the -r or --regexp-extended options respectively.
For the differences between basic and extended regular expressions see, for example:
http://www.gnu.org/software/sed/manual/sed.html#Extended-regexps
Maybe you can use FreeBSD-compatible minised instead of GNU sed?
# example using FreeBSD-compatible minised,
# http://www.exactcode.de/site/open_source/minised/
# escape some punctuation characters with printf
help printf
printf "%s\n" '!"#$%&'"'"'()*+,-./:;<=>?#[\]^_`{|}~'
printf "%q\n" '!"#$%&'"'"'()*+,-./:;<=>?#[\]^_`{|}~'
# example line
line='!"#$%&'"'"'()*+,-./:;<=>?#[\]^_`{|}~ ... and Jane ...'
# escapes in regular expression
ES="$(printf "%q" "${line}")" # escape some punctuation characters
ES="${ES//./\\.}" # . -> \.
ES="${ES//\\\\(/(}" # \( -> (
ES="${ES//\\\\)/)}" # \) -> )
# escapes in replacement string
lineEscaped="${line//&/\&}" # & -> \&
minised s$'\777'"${ES}"$'\777'REPLACED$'\777' <<< "${line}"
minised s$'\777'Jane$'\777'"${lineEscaped}"$'\777' <<< "${line}"
minised s$'\777'"${ES}"$'\777'"${lineEscaped}"$'\777' <<< "${line}"
To avoid potential backslash confusion, we could (or rather should) use a backslash variable like so:
backSlash='\\'
ES="${ES//${backSlash}(/(}" # \( -> (
ES="${ES//${backSlash})/)}" # \) -> )
(By the way using variables in such a way seems like a good approach for tackling parameter expansion issues ...)
... or to complete the backslash confusion ...
backSlash='\\'
lineEscaped="${line//${backSlash}/${backSlash}}" # double backslashes
lineEscaped="${lineEscaped//&/\&}" # & -> \&
If you have bash, and you're just doing a pattern replacement, just do it natively in bash. The ${parameter/pattern/string} expansion in Bash will work very well for you, since you can just use a variable in place of the "pattern" and replacement "string" and the variable's contents will be safe from word expansion. And it's that word expansion which makes piping to sed such a hassle. :)
It'll be faster than forking a child process and piping to sed anyway. You already know how to do the whole while read line thing, so creatively applying the capabilities in Bash's existing parameter expansion documentation can help you reproduce pretty much anything you can do with sed. Check out the bash man page to start...

Resources