sed with arbitrary variables that might contain slashes - bash

I'm trying to use sed in the following-ish way:
VAR=`echo $STRING | sed s/$TOKEN/$REPLACEMENT/`
Unfortunately, I've come upon a case where $REPLACEMENT might possibly contain slashes. This causes sed to complain, because the shell expands the command to something like this:
#given $VAR=I like bananas, $TOKEN=bananas, and $REPLACEMENT=apples/oranges
VAR=`echo I like bananas | sed s/bananas/apples/oranges/`
So now sed is given an invalid argument with too many /'s. Is there any good way to handle that?

You can use any separator you like. "s!$TOKEN!$REPLACEMENT!" and "s%$TOKEN%$REPLACEMENT%" are popular alternatives.
Of course, in the general case, if the input could contain any characters whatsoever, you're back to square one. You could switch to a language which doesn't mix code and data so frivolously... including, in fact, the shell itself:
echo "${VAR/$TOKEN/$REPLACEMENT}"
(This is a Bash extension, though. It is available in some other shells, but not in classic Bourne shell.)
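For example, with the sample values from the question, a quick sketch of the parameter-expansion approach looks like this (note that the pattern side is a glob pattern here, not a regex):
STRING='I like bananas'
TOKEN='bananas'
REPLACEMENT='apples/oranges'
VAR="${STRING/$TOKEN/$REPLACEMENT}"
echo "$VAR"   # I like apples/oranges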

Here is the fix
VAR="I like bananas"
TOKEN="bananas"
REPLACEMENT="apples/oranges"
echo $VAR |sed "s#$TOKEN#$REPLACEMENT#"
I like apples/oranges

You can't reliably use sed for this because:
1. you typically can't find a character that is guaranteed not to be in any of the $TOKEN or $REPLACEMENT strings, and
2. sed cannot search for a string - it ALWAYS searches for regular expressions, so any RE metacharacters in $TOKEN will be evaluated as such, and you cannot reliably implement code to escape them (despite what many people have attempted).
So, just use awk:
VAR=$(echo "$STRING" | awk -v t="$TOKEN" -v r="$REPLACEMENT" 'idx=index($0,t) {$0 = substr($0,1,idx-1) r substr($0,idx+length(t))} 1')
That will work for absolutely any character in any of the 3 strings except a newline in $STRING.
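For example, with the sample values from the question, the one-liner behaves like this (a quick sketch):
STRING='I like bananas'
TOKEN='bananas'
REPLACEMENT='apples/oranges'
VAR=$(echo "$STRING" | awk -v t="$TOKEN" -v r="$REPLACEMENT" 'idx=index($0,t) {$0 = substr($0,1,idx-1) r substr($0,idx+length(t))} 1')
echo "$VAR"   # I like apples/oranges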
Without the echo it will handle a newline in $STRING too:
VAR=$(awk -v s="$STRING" -v t="$TOKEN" -v r="$REPLACEMENT" '
BEGIN {
    if (idx = index(s,t))
        s = substr(s,1,idx-1) r substr(s,idx+length(t))
    print s
}')


Replacing contents in a file via bashrc script and writing to directory [duplicate]

Suppose I have 'abbc' string and I want to replace:
ab -> bc
bc -> ab
If I try two replaces the result is not what I want:
echo 'abbc' | sed 's/ab/bc/g;s/bc/ab/g'
abab
So what sed command can I use to replace like below?
echo abbc | sed SED_COMMAND
bcab
EDIT:
Actually the text could have more than 2 patterns and I don't know how many replacements I will need. Since there was an answer saying that sed is a stream editor and its replacements are greedy, I think that I will need to use some scripting language for that.
Maybe something like this:
sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
Replace ~ with a character that you know won't be in the string.
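For the question's input, this should give the desired result:
echo 'abbc' | sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
bcab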
I always use multiple statements with "-e"
$ sed -e 's:AND:\n&:g' -e 's:GROUP BY:\n&:g' -e 's:UNION:\n&:g' -e 's:FROM:\n&:g' file > readable.sql
This will insert a '\n' before every AND, GROUP BY, UNION and FROM, where '&' stands for the matched string and '\n&' means: replace the matched string with a '\n' followed by the match.
sed is a stream editor. It searches and replaces greedily. The only way to do what you asked for is using an intermediate substitution pattern and changing it back in the end.
echo 'abcd' | sed -e 's/ab/xy/;s/cd/ab/;s/xy/cd/'
Here is a variation on ooga's answer that works for multiple search and replace pairs without having to check how values might be reused:
sed -i '
s/\bAB\b/________BC________/g
s/\bBC\b/________CD________/g
s/________//g
' path_to_your_files/*.txt
Here is an example:
before:
some text AB some more text "BC" and more text.
after:
some text BC some more text "CD" and more text.
Note that \b denotes word boundaries, which is what prevents the ________ from interfering with the search (I'm using GNU sed 4.2.2 on Ubuntu). If you are not using a word boundary search, then this technique may not work.
Also note that this gives the same results as removing the s/________//g and appending && sed -i 's/________//g' path_to_your_files/*.txt to the end of the command, but doesn't require specifying the path twice.
A general variation on this would be to use \x0 or _\x0_ in place of ________ if you know that no nulls appear in your files, as jthill suggested.
Here is an excerpt from the SED manual:
-e script
--expression=script
Add the commands in script to the set of commands to be run while processing the input.
Prepend each substitution with -e option and collect them together. The example that works for me follows:
sed < ../.env-turret.dist \
-e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
-e "s/{{ account }}/$CFW_ACCOUNT_ID/g" > ./.env.dist
This example also shows how to use environment variables in your substitutions.
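A small self-contained illustration of the same idea (the values and the input line here are made up just for the demo):
TURRETS_COUNT_INIT=3
CFW_ACCOUNT_ID=12345
echo 'name={{ name }} account={{ account }}' | \
  sed -e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
      -e "s/{{ account }}/$CFW_ACCOUNT_ID/g"
name=turret3 account=12345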
This might work for you (GNU sed):
sed -r '1{x;s/^/:abbc:bcab/;x};G;s/^/\n/;:a;/\n\n/{P;d};s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/;ta;s/\n(.)/\1\n/;ta' file
This uses a lookup table which is prepared and held in the hold space (HS) and then appended to each line. A unique marker (in this case \n) is prepended to the start of the line and used as a method to bump the search along the length of the line. Once the marker reaches the end of the line the process is finished and the result is printed out, with the lookup table and markers being discarded.
N.B. The lookup table is prepped at the very start and a second unique marker (in this case :) is chosen so as not to clash with the substitution strings.
With some comments:
sed -r '
# initialize hold with :abbc:bcab
1 {
x
s/^/:abbc:bcab/
x
}
G # append hold to patt (after a \n)
s/^/\n/ # prepend a \n
:a
/\n\n/ {
P # print patt up to first \n
d # delete patt & start next cycle
}
s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/
ta # goto a if sub occurred
s/\n(.)/\1\n/ # move one char past the first \n
ta # goto a if sub occurred
'
The table works like this:
The table works like this: in :abbc:bcab, each : starts an entry; the entry begins with the pattern to search for (ab, then bc) and the rest, up to the next : or the end, is its replacement (bc, then ab).
Tcl has a builtin for this
$ tclsh
% string map {ab bc bc ab} abbc
bcab
This works by walking the string a character at a time doing string comparisons starting at the current position.
In perl:
perl -E '
sub string_map {
my ($str, %map) = @_;
my $i = 0;
while ($i < length $str) {
KEYS:
for my $key (keys %map) {
if (substr($str, $i, length $key) eq $key) {
substr($str, $i, length $key) = $map{$key};
$i += length($map{$key}) - 1;
last KEYS;
}
}
$i++;
}
return $str;
}
say string_map("abbc", "ab"=>"bc", "bc"=>"ab");
'
bcab
Maybe a simpler approach for a single occurrence of the pattern; you can try it as below:
echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
My output:
~# echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
bcab
For multiple occurrences of pattern:
sed 's/\(ab\)\(bc\)/\2\1/g'
Example
~# cat try.txt
abbc abbc abbc
bcab abbc bcab
abbc abbc bcab
~# sed 's/\(ab\)\(bc\)/\2\1/g' try.txt
bcab bcab bcab
bcab bcab bcab
bcab bcab bcab
Hope this helps !!
echo "C:\Users\San.Tan\My Folder\project1" | sed -e 's/C:\\/mnt\/c\//;s/\\/\//g'
converts
C:\Users\San.Tan\My Folder\project1
to
mnt/c/Users/San.Tan/My Folder/project1
in case someone needs to convert Windows paths to Windows Subsystem for Linux (WSL) paths
If you are replacing the string with a variable, the solution doesn't work unless the sed command is in double quotes instead of single quotes.
sed -e "s/#replacevarServiceName#/$varServiceName/g" -e "s/#replacevarImageTag#/$varImageTag/g" deployment.yaml
Here is an awk based on oogas sed
echo 'abbc' | awk '{gsub(/ab/,"xy");gsub(/bc/,"ab");gsub(/xy/,"bc")}1'
bcab
I believe this should solve your problem. I may be missing a few edge cases, please comment if you notice one.
You need a way to exclude previous substitutions from future patterns, which really means making outputs distinguishable, as well as excluding these outputs from your searches, and finally making outputs indistinguishable again. This is very similar to the quoting/escaping process, so I'll draw from it.
s/\\/\\\\/g escapes all existing backslashes
s/ab/\\b\\c/g substitutes raw ab for escaped bc
s/bc/\\a\\b/g substitutes raw bc for escaped ab
s/\\\(.\)/\1/g substitutes all escaped X for raw X
I have not accounted for backslashes in ab or bc, but intuitively, I would escape the search and replace terms the same way - \ now matches \\, and substituted \\ will appear as \.
Until now I have been using backslashes as the escape character, but it's not necessarily the best choice. Almost any character should work, but be careful with the characters that need escaping in your environment, sed, etc. depending on how you intend to use the results.
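Putting the four steps together, here is a sketch of the whole pipeline for the question's input (backslashes in the data are handled by the first and last steps, as described above):
echo 'abbc' | sed 's/\\/\\\\/g; s/ab/\\b\\c/g; s/bc/\\a\\b/g; s/\\\(.\)/\1/g'
bcab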
Every answer posted thus far seems to agree with the statement by kuriouscoder made in his above post:
The only way to do what you asked for is using an intermediate
substitution pattern and changing it back in the end
If you are going to do this, however, and your usage might involve more than some trivial string (maybe you are filtering data, etc.), the best character to use with sed is a newline. This is because sed is 100% line-based, so a newline is the one-and-only character you are guaranteed never to receive when a new line is fetched (forget about GNU multi-line extensions for this discussion).
To start with, here is a very simple approach to solving your problem using newlines as an intermediate delimiter:
echo "abbc" | sed -E $'s/ab|bc/\\\n&/g; s/\\nab/bc/g; s/\\nbc/ab/g'
With simplicity comes some trade-offs... if you had more than a couple variables, like in your original post, you have to type them all twice. Performance might be able to be improved a little bit, too.
It gets pretty nasty to do much beyond this using sed. Even with some of the more advanced features like branching control and the hold buffer (which is really weak IMO), your options are pretty limited.
Just for fun, I came up with this one alternative, but I don't think I would have any particular reason to recommend it over the one from earlier in this post... You have to essentially make your own "convention" for delimiters if you really want to do anything fancy in sed. This is way-overkill for your original post, but it might spark some ideas for people who come across this post and have more complicated situations.
My convention below was: use multiple newlines to "protect" or "unprotect" the part of the line you're working on. One newline denotes a word boundary. Two newlines denote alternatives for a candidate replacement. I don't replace right away, but rather list the candidate replacement on the next line. Three newlines means that a value is "locked-in", like your original post was trying to do with ab and bc. After that point, further replacements will be undone, because they are protected by the newlines. A little complicated, if I do say so myself... sed isn't really meant for much more than the basics.
# Newlines
NL=$'\\\n'
NOT_NL=$'[\x01-\x09\x0B-\x7F]'
# Delimiters
PRE="${NL}${NL}&${NL}"
POST="${NL}${NL}"
# Un-doer (if a request was made to modify a locked-in value)
tidy="s/(\\n\\n\\n${NOT_NL}*)\\n\\n(${NOT_NL}*)\\n(${NOT_NL}*)\\n\\n/\\1\\2/g; "
# Locker-inner (three newlines means "do not touch")
tidy+="s/(\\n\\n)${NOT_NL}*\\n(${NOT_NL}*\\n\\n)/\\1${NL}\\2/g;"
# Finalizer (remove newlines)
final="s/\\n//g"
# Input/Commands
input="abbc"
cmd1="s/(ab)/${PRE}bc${POST}/g"
cmd2="s/(bc)/${PRE}ab${POST}/g"
# Execute
echo ${input} | sed -E "${cmd1}; ${tidy}; ${cmd2}; ${tidy}; ${final}"

Variable Manipulation not working as expected in macOS bash script

Given:
itemName='boo\boo\1\7\064.txt'
I want to convert the octals to printables while removing unprintables. The catch: I don't want to remove backslashed alphas like the \b. The result should be:
newItemName='boo\boo4.txt'
I can't figure out why part of the sed statement doesn't work correctly:
newItemName="$(printf "%s" "$itemName" | sed -E 's/(\\[0-7]{1,3})/'"$(somevar="&";printf "${somevar:1}";)"'/g' | tr -dc '[:print:]')"
I used somevar="&"; instead of directly accessing & so I could use variable manipulation.
The search statement s/(\\[0-7]{1,3})/ works fine.
In the printf if I use $somevar or ${somevar:0} instead of ${somevar:1} I get the original string as expected (e.g. \064).
What doesn't work is the ${somevar:1}.
These also don't work: ${somevar/\/} or ${somevar//\/}.
What am I misunderstanding about how variable manipulation works?
Is there an easier way to do this? I've searched and searched...
Sam; long time no see! The problem here is the order of evaluation. All of the shell expressions, including the $(somevar="&";printf "${somevar:1}";), are evaluated before sed is even launched. As a result, somevar isn't the string matched by the regex, it's just a literal ampersand. That means ${somevar:1} is just the empty string, and you wind up just running sed -E 's/(\\[0-7]{1,3})//g'.
You need a way to take the matched string and run a calculation on it (after it's been matched), and sed just isn't flexible enough to do this. But perl is. perl has an s operator, similar to sed's, but with the e option the replacement is executed as a perl expression rather than just a literal string. Give this a try:
newItemName="$(printf "%s\n" "$itemName" | perl -pe 's/\\([0-7]{1,3})/chr oct $1/eg' | tr -dc '[:print:]')"
What am I misunderstanding about how variable manipulation works?
I believe you are misunderstanding how sed works.
When the & character is used inside the replacement string, it is replaced by the whole string matched by the pattern. See this sed introduction.
Now about ${var:offset} parameter expansion:
somevar=&
printf "$somevar"
would print &. Then:
printf "${somevar:1}"
would extract the substring starting at offset 1 and running to the end of the string. The first character is at offset, well, 0, so at offset 1 there is no character, because our variable somevar has only one character. So it will print nothing.
printf "${somevar:0}"
would print a substring starting at offset 0 to the end of the string. So the whole string. So ${somevar:0} is equal to $somevar. It will print &.
So:
$(somevar="&";printf "${somevar:1}";)
expands to nothing, because ${somevar:1} expands to nothing. So your sed command looks like this:
sed -E 's/(\\[0-7]{1,3})//g'
The sed command substitutes a \ character followed by one to three digits 0-7 with nothing, as many times as it occurs. It does what you want.
Now if it would be ${somevar:0} then:
$(somevar="&";printf "${somevar:0}";)
expands to &, so your sed command would look like this:
sed -E 's/(\\[0-7]{1,3})/&/g'
so it would substitute a \\[0-7]{1,3} match for itself, i.e. it does nothing.
You could lose the -E option and the (...) backreference, and just use POSIX-compatible sed:
sed 's/\\[0-7]\{1,3\}//g'
Is there an easier way to do this?
Your method looks fine. You could use a here string instead of printf and you could strengthen the sed to match octal numbers better, depending on needs:
newItemName="$(
<<<"$itemName" sed 's/\\\([0-3][0-7]\{0,2\}\|[0-7]\{1,2\}\)//g' |
tr -dc '[:print:]'
)"

Capturing Groups From a Grep RegEx

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
echo $name
done
So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on if grep found that the filename matched the matter provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that to a variable.
I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it – I would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell? If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?
If you're using Bash, you don't even have to use grep:
files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files # unquoted in order to allow the glob to expand
do
if [[ $f =~ $regex ]]
then
name="${BASH_REMATCH[1]}"
echo "${name}.jpg" # concatenate strings
name="${name}.jpg" # same thing stored in a variable
else
echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
fi
done
It's better to put the regex in a variable. Some patterns won't work if included literally.
This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
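For instance (following the advice above to keep the regex in a variable):
regex='[0-9]+_([a-z]+)_([0-9a-z]*)'
[[ "123_abc_d4e5" =~ $regex ]]
echo "${BASH_REMATCH[0]}"   # 123_abc_d4e5 (full match)
echo "${BASH_REMATCH[1]}"   # abc (first capture group)
echo "${BASH_REMATCH[2]}"   # d4e5 (second capture group)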
You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:
123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz
To eliminate the second and fourth examples, make your regex like this:
^[0-9]+_([a-z]+)_[0-9a-z]*
which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:
^[0-9]+_([a-z]+)_[0-9a-z]*$
then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
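A quick way to see the effect of the anchors (a sketch using the examples above):
regex='^[0-9]+_([a-z]+)_[0-9a-z]*$'
[[ "123_abc_d4e5" =~ $regex ]] && echo "${BASH_REMATCH[1]}"   # abc
[[ "xyz123_abc_d4e5" =~ $regex ]] || echo "no match"          # no match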
If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):
name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg
The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.
In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
This isn't really possible with pure grep, at least not generally.
But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:
echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep would remove any lines that didn't match your overall pattern; the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
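For example, filling in the {pattern} placeholder from the aside above with the question's pattern (an illustrative filename, not from the question):
echo "123_abc_d4e5.jpg" | grep -Ei '[0-9]+_[a-z]+_[0-9a-z]*' | cut -d _ -f 2
abc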
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.
The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
Not possible in just grep I believe
for sed:
name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`
I'll take a stab at the bonus though:
echo "$name.jpg"
This is a solution that uses gawk. It's something I find I need to use often so I created a function for it
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
to use just do
$ echo 'hello world' | regex1 'hello\s(.*)'
world
str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
week="${BASH_REMATCH[1]}"
day="${BASH_REMATCH[2]}"
hour="${BASH_REMATCH[3]}"
echo $week --- $day ---- $hour
fi
output:
1 --- 2 ---- 1
A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name will have the value abc.
See Apple developer docs, search forward for 'Parameter Expansion'.
I prefer a one-line python or perl command; both are often included in major linux distributions.
echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' | python -c $'
import re
import sys
for i in sys.stdin:
    g=re.match(r\'.*href="(.*)"\',i);
    if g is not None:
        print g.group(1)
'
and to handle files:
ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
    i=i.strip()
    f=open(i,"r")
    for j in f:
        g=re.match(r\'.*href="(.*)"\',j);
        if g is not None:
            print g.group(1)
    f.close()
'
The following example shows how to extract the three-character sequence from a filename using a regex capture group:
for f in 123_abc_123.jpg 123_xyz_432.jpg
do
echo "f: " $f
name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
echo "name: " $name
done
Outputs:
f: 123_abc_123.jpg
name: abc
f: 123_xyz_432.jpg
name: xyz
So the if-regex conditional in perl filters out all non-matching lines; for those lines that do match, it applies the capture group(s), which you can access with $1, $2, ... respectively.
if you have bash, you can use extended globbing
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done

How to use sed to test and then edit one line of input?

I want to test whether a phone number is valid, and then translate it to a different format using a script. This far I can test the number like this:
sed -n -e '/(0..)-...\s..../p' -e '/(0..)-...-..../p'
However, I don't just want to test the number and output it, I would like to remove the brackets, dashes and spaces and output that.
Is there any way to do that using sed? Or should I be using something else, like AWK?
I'm not sure why you're using a 0 in that position. You're saying "a zero followed by any two characters" in the area code position. Is that really what you mean?
Anyway, you want to use the sed substitution operator with the p command in conjunction with the -n switch. Here's one way to do it:
sed -n 's/(\([0-9][0-9][0-9]\))\s\?\([0-9][0-9][0-9]\)[- ]\([0-9][0-9][0-9][0-9]\)/\1\2\3/p'
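For a number in the format this answer assumes, e.g. "(012) 345-6789" (an illustrative value; GNU sed, since the command relies on \s and \?), this prints just the digits:
echo '(012) 345-6789' | sed -n 's/(\([0-9][0-9][0-9]\))\s\?\([0-9][0-9][0-9]\)[- ]\([0-9][0-9][0-9][0-9]\)/\1\2\3/p'
0123456789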
You can also use something as simple as egrep to validate lines and tr to remove the characters you don't want to see:
egrep "\([0-9]+\)[0-9.-]+" <file> |tr -d '()\-'
Note that it will only work if you don't want to keep any of those characters.
This is a more succinct version of Jonathan Feinberg's answer. It uses extended regular expressions to avoid having to do all the escaping that the curly braces would require (in addition to moving the escaping of parentheses from the special ones to the literal ones).
sed -r 's/\(([[:digit:]]{3})\)\s?([[:digit:]]{3})[ -]([[:digit:]]{4})/\1\2\3/'
This suggestion depends on what your number format looks like. For example, I assume a phone number like this:
echo "(703) 234 5678" | awk '
{
for(i=1;i<=NF;i++){
gsub(/\(|\)/,"",$i) # remove ( and )
if ($i+0>=0){ # coerce the field to a number and check it is >= 0
print $i
}
# if ( ... ){
#     some other checks could go here
# }
}
}
'
Do it systematically, and you don't have to waste time crafting complex regexes.

In a bash script, how do I sanitize user input?

I'm looking for the best way to take a simple input:
echo -n "Enter a string here: "
read -e STRING
and clean it up by removing non-alphanumeric characters, lower(case), and replacing spaces with underscores.
Does order matter? Is tr the best / only way to go about this?
As dj_segfault points out, the shell can do most of this for you. Looks like you'll have to fall back on something external for lower-casing the string, though. For this you have many options, like the perl one-liners in other answers, etc., but I think tr is probably the simplest.
# first, strip underscores
CLEAN=${STRING//_/}
# next, replace spaces with underscores
CLEAN=${CLEAN// /_}
# now, clean out anything that's not alphanumeric or an underscore
CLEAN=${CLEAN//[^a-zA-Z0-9_]/}
# finally, lowercase with TR
CLEAN=`echo -n $CLEAN | tr A-Z a-z`
The order here is somewhat important. We want to get rid of underscores, plus replace spaces with underscores, so we have to be sure to strip underscores first. By waiting to pass things to tr until the end, we know we have only alphanumeric and underscores, and we can be sure we have no spaces, so we don't have to worry about special characters being interpreted by the shell.
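For example (an illustrative input, not from the question):
STRING='Some_User Input! 42'
CLEAN=${STRING//_/}
CLEAN=${CLEAN// /_}
CLEAN=${CLEAN//[^a-zA-Z0-9_]/}
CLEAN=`echo -n $CLEAN | tr A-Z a-z`
echo "$CLEAN"   # someuser_input_42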
Bash can do this all on its own, thank you very much. If you look at the section of the man page on Parameter Expansion, you'll see that bash has built-in substitutions, substring, trim, rtrim, etc.
To eliminate all non-alphanumeric characters, do
CLEANSTRING=${STRING//[^a-zA-Z0-9]/}
That's Occam's razor. No need to launch another process.
For Bash >= 4.0:
CLEAN="${STRING//_/}" && \
CLEAN="${CLEAN// /_}" && \
CLEAN="${CLEAN//[^a-zA-Z0-9]/}" && \
CLEAN="${CLEAN,,}"
This is especially useful for creating container names programmatically using docker/podman. However, in this case you'll also want to remove the underscores:
# Sanitize $STRING for a container name
CLEAN="${STRING//[^a-zA-Z0-9]/}" && \
CLEAN="${CLEAN,,}"
After a bit of looking around it seems tr is indeed the simplest way:
export CLEANSTRING="`echo -n "${STRING}" | tr -cd '[:alnum:] [:space:]' | tr '[:space:]' '-' | tr '[:upper:]' '[:lower:]'`"
Occam's razor, I suppose.
You could run it through perl.
export CLEANSTRING=$(perl -e 'print join( q//, map { s/\s+/_/g; lc } split /[^\s\w]+/, $ENV{STRING} )')
I'm using ksh-style subshell here, I'm not totally sure that it works in bash.
That's the nice thing about shell, is that you can use perl, awk, sed, grep....
