How to sanitize a string in bash?

I have a free form string which I need to sanitize in bash in order to produce safe-and-nice filenames.
Example:
STAGE_NAME="Some usafe name 2/2#"
Expected sanitized result:
"some-usafe-name-2-2"
Logic:
lowercase chars
replace all unsupported or unsafe chars with dash (including spaces)
remove duplicated dashes
remove any dashes from prefix or suffix
Use of external tools like sed is allowed as long as they are portable (no options that are unavailable on BSD/OSX/...).

You can use this pure bash function for the sanitization (it needs extglob for the +(-) pattern and bash 4 or later for ${s,,}):
shopt -s extglob # enable extended patterns such as +(-)
sanitize() {
local s="${1?need a string}" # receive input in first argument
s="${s//[^[:alnum:]]/-}" # replace every non-alnum character with -
s="${s//+(-)/-}" # collapse runs of - into a single -
s="${s/#-}" # remove - from start
s="${s/%-}" # remove - from end
echo "${s,,}" # convert to lowercase
}
Then call it as:
sanitize "///Some usafe name 2/2##"
some-usafe-name-2-2
sanitize "Some usafe name 2/2#"
some-usafe-name-2-2
Just for an academic exercise here is an awk one-liner doing the same:
awk -F '[^[:alnum:]]+' -v OFS=- '{$0=tolower($0); $1=$1; gsub(/^-|-$/, "")} 1'
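Where bash 4 is not available (for example the bash 3.2 that ships with macOS), the same four steps can be sketched with tr and sed only. This is a minimal sketch rather than part of the answer above: sanitize_portable is a made-up name, it assumes single-line input, and under LC_ALL=C it treats every non-ASCII byte as unsafe.

```bash
# Hypothetical helper: same rules as sanitize(), using only tr and sed.
sanitize_portable() {
    printf '%s' "$1" |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |   # lowercase
        LC_ALL=C tr -cs '[:alnum:]' '-' |       # runs of unsafe chars -> one dash
        sed -e 's/^-*//' -e 's/-*$//'           # trim leading/trailing dashes
}

sanitize_portable "///Some usafe name 2/2##"    # prints some-usafe-name-2-2
```

The -s flag of tr squeezes each run of replaced characters into a single dash, so no separate "remove duplicated dashes" step is needed.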

Related

Prepending letter to field value

I have a file 0.txt containing the following value fields contents in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brandc,),
I've searched SO and elsewhere for how to prepend the letter x to each value between the commas, so that my output file becomes (using bash on Unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrandc,),
The only thing I've tried, which doesn't go far enough, is:
awk '{gsub(/,/,",x");print}' 0.txt
The prefix must not be inserted after the last commas at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiter
BEGIN{
FS=OFS=","
}
# The first field is special, the x has to be inserted
# after the opening (
$1="(x"substr($1,2)
# Prepend 'x' to fields 2 through NF-2 (NF-2 is the last value field)
for(i=2;i<=NF-2;i++){
$i="x"$i
}
# 1 is always true. awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma (and avoids other “special” characters in the syntax).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt
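The -r flag above is the GNU sed spelling; -E is accepted by both GNU and BSD sed. A minimal portable sketch of the same substitution, using & (the whole match) instead of a capture group, might look like this. prefix_values is a made-up name, and the sketch assumes values never contain commas or parentheses.

```bash
# Hypothetical helper: read lines on stdin, prepend x to every
# comma-terminated value. [^,()]+ needs at least one value character,
# so the bare ")," at the end of each line is left alone.
prefix_values() {
    sed -E 's/[^,()]+,/x&/g'
}

printf '%s\n' '(bread,milk,),' '(rice,brand B,),' | prefix_values
# prints:
# (xbread,xmilk,),
# (xrice,xbrand B,),
```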

sed replace string with pipe and stars

I have the following string:
|**barak**.version|2001.0132012031539|
in file text.txt.
I would like to replace it with the following:
|**barak**.version|2001.01.2012031541|
So I run:
sed -i "s/\|\*\*$module\*\*.version\|2001.0132012031539/|**$module**.version|$version/" text.txt
but the result is a duplicate instead of replacing:
|**barak**.version|2001.01.2012031541|**barak**.version|2001.0132012031539|
What am I doing wrong?
Here is the value for module and version:
$ echo $module
barak
$ echo $version
2001.01.2012031541
Assumptions:
lines of interest start and end with a pipe (|) and have one more pipe somewhere in the middle of the data
search is based solely on the value of ${module} existing between the 1st/2nd pipes in the data
we don't know what else may be between the 1st/2nd pipes
the version number is the only thing between the 2nd/3rd pipes
we don't know the version number that we'll be replacing
Sample data:
$ module='barak'
$ version='2001.01.2012031541'
$ cat text.txt
**barak**.version|2001.0132012031539| <<<=== leave this one alone
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.0132012031539| <<<=== replace this one
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.0132012031539| <<<=== replace this one
One sed solution with -Extended regex support enabled and making use of a capture group:
$ sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/" text.txt
Where:
\| - an escaped pipe matches a literal pipe (with -E an unescaped | means alternation); inside a bracket expression such as [^|] and in the replacement text a pipe is already literal
^(\|[^|]*${module}[^|]*) - first capture group that starts at the beginning of the line, starts with a pipe, then some number of non-pipe characters, then the search pattern (${module}), then more non-pipe characters (continues up to next pipe character)
.* - matches rest of the line (which we're going to discard)
\1|${version}| - replace line with our first capture group, then a pipe, then the new replacement value (${version}), then the final pipe
The above generates:
**barak**.version|2001.0132012031539|
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.01.2012031541| <<<=== replaced
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.01.2012031541| <<<=== replaced
An awk alternative using GNU awk:
awk -v mod="$module" -v vers="$version" -F \| '{ OFS=FS;split($2,map,".");inmod=substr(map[1],3,length(map[1])-4);if (inmod==mod) { $3=vers } }1' file
Pass two variables, mod and vers, to awk using $module and $version. Set the field delimiter to |. Split the second field into the array map using the split function with . as the delimiter. Then strip the leading and trailing "**" from the first element of the array with substr to expose the module name as inmod. Compare this to the mod variable and, if there is a match, change the 3rd field to the variable vers. Print each line with the shorthand 1.
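A pipe is only special in extended regular expressions (sed -E). In basic regular expressions a bare | is literal, and in GNU sed \| means alternation. Your pattern therefore begins with an empty alternative, which matches at the very start of the line, so the replacement is inserted in front of the old text instead of replacing it.
There's no reason to use extended regex here; stick with basic regex and leave the pipes unescaped: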
sed "
# for lines matching module.version
/|\*\*$module\*\*.version|/ {
# replace the version
s/|2001.0132012031539|/|$version|/
}
" text.txt
or as an unreadable one-liner
sed "/|\*\*$module\*\*.version|/ s/|2001.0132012031539|/|$version|/" text.txt
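If the old version number is not known in advance, the same basic-regex approach can match whatever sits between the last two pipes. The following is only a sketch under assumptions: set_module_version is a made-up name, and the module and version values must not contain characters that are special to sed (/, &, ., * and so on).

```bash
# Hypothetical helper: $1 = module, $2 = new version, $3 = file to read.
# Writes the result to stdout; the file itself is not modified.
set_module_version() {
    # Address: lines starting with |**module**.version|
    # Command: replace the last |...| field with the new version.
    sed "/^|\*\*$1\*\*\.version|/ s/|[^|]*|\$/|$2|/" "$3"
}
```

Called as set_module_version "$module" "$version" text.txt, it rewrites only lines that begin with a pipe followed by **module**.version, and leaves every other line untouched.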

Using sed with a substitution variable that may have curly braces

I'm writing a script for looping over a set of files in a directory searching for a string (stringA) in one file (srcFile), copying the line that follows it (stringToCopy), and pasting it on the line after another search string (stringB) in another file (outputFile). The copy/paste script that I have so far is as follows
stringA="This is string A"
stringB="This is string B"
srcFile=srcFile.txt
outputFile=outputFile.txt
replacement="/$stringA/{getline; print}"
stringToCopy="$(awk "$replacement" $srcFile)"
sed -i "/$stringB/!b;n;c${stringToCopy}" $outputFile
The script works great, except when stringToCopy ends up containing curly braces. Example is
srcFile.txt:
This is string A
text to copy: {0}
outputFile.txt:
This is string B
line to be replaced
Once the script is done, I would expect outputFile.txt to be
This is string B
text to copy: {0}
But sed chokes with
sed: -e expression #1, char 106: unknown command: `m'
I've tried hardcoding the problematic string and trying different variations of escaping the curlies and quoting the string, but haven't found a winning combination and I'm at a loss for how to make it work.
EDIT
I had a derp moment and forgot that my stringA also has curly braces, which happened to cause my awk command to match multiple lines. This put newlines into stringToCopy, which is my real issue, not the curly braces. So the real question is how to make awk treat curly braces as literal characters, so that
srcFile.txt
This is string A: {0}
text to copy: {0}
This is string A:
Other junk
And stringA="This is string A: {0}"
Doesn't set stringToCopy to
text to copy: {0}
Other junk
A bit of a kludge in that we're going to add some extra coding specifically for braces ...
Current situation:
$ awk '/This is string A: {0}/{getline; print}' srcFile.txt
text to copy: {0} # this is the line we want
Other junk # we do not want this line
We can eliminate the second line by escaping the braces in the search pattern, eg:
$ awk '/This is string A: \{0\}/{getline; print}' srcFile.txt
text to copy: {0}
So, how to escape the braces? We can use some explicit parameter expansions to replace the braces with escaped braces in the $stringA variable, keeping in mind that we also need to escape the braces in the parameter expansion phase, too:
$ stringA="This is string A: {0}"
$ stringA="${stringA//\{/\\{}" # replace '{' with '\{'
$ stringA="${stringA//\}/\\}}" # replace '}' with '\}'
$ echo "${stringA}"
This is string A: \{0\}
We can then proceed with the rest of the code as is:
$ replacement="/$stringA/{getline; print}"
$ echo "${replacement}"
/This is string A: \{0\}/{getline; print}
$ stringToCopy="$(awk "$replacement" $srcFile)"
$ echo "${stringToCopy}"
text to copy: {0}
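As for the final sed step, keep the !: /$stringB/!b skips every line that does not match, so n;c only runs right after the matching line. In an interactive shell, run set +H first so that bash does not treat the ! inside double quotes as history expansion:
$ sed -i "/$stringB/!b;n;c${stringToCopy}" $outputFile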
$ cat "${outputFile}"
This is string B
text to copy: {0}
NOTES:
if you preface your coding with set -xv you can see how variables are being interpreted at each step; use set +xv to turn off
obviously you'll probably run into issues if you do in fact have more than 1 matching row in $srcFile
if you find other characters that need to be escaped then you'll need to add additional parameter expansions for said characters
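An alternative that avoids escaping altogether is to stop using the search string as a regex and compare it as plain text instead. This is a minimal sketch, not what the answer above does: next_line_after is a made-up name, it uses awk's index() for a literal substring test, and it assumes the search string contains no backslashes (awk processes escape sequences in -v values).

```bash
# Hypothetical helper: $1 = literal text to look for, $2 = file.
# Prints the line that follows the first line containing $1.
next_line_after() {
    awk -v needle="$1" '
        index($0, needle) {                     # plain substring test, no regex
            if ((getline line) > 0) print line  # print the following line, if any
            exit                                # stop after the first match
        }
    ' "$2"
}
```

With this, stringToCopy="$(next_line_after "$stringA" "$srcFile")" yields a single line even when stringA contains {0}, because braces have no special meaning to index().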

What ##*/ does in bash? [duplicate]

I have a string like this:
/var/cpanel/users/joebloggs:DNS9=domain.example
I need to extract the username (joebloggs) from this string and store it in a variable.
The format of the string will always be the same with exception of joebloggs and domain.example so I am thinking the string can be split twice using cut?
The first split would split by : and we would store the first part in a variable to pass to the second split function.
The second split would split by / and store the last word (joebloggs) into a variable
I know how to do this in PHP using arrays and splits but I am a bit lost in bash.
To extract joebloggs from this string in bash using parameter expansion without any extra processes...
MYVAR="/var/cpanel/users/joebloggs:DNS9=domain.example"
NAME=${MYVAR%:*} # retain the part before the colon
NAME=${NAME##*/} # retain the part after the last slash
echo $NAME
Doesn't depend on joebloggs being at a particular depth in the path.
Summary
An overview of a few parameter expansion modes, for reference...
${MYVAR#pattern} # delete shortest match of pattern from the beginning
${MYVAR##pattern} # delete longest match of pattern from the beginning
${MYVAR%pattern} # delete shortest match of pattern from the end
${MYVAR%%pattern} # delete longest match of pattern from the end
So # means match from the beginning (think of a comment line) and % means from the end. One instance means shortest and two instances means longest.
You can get substrings based on position using numbers:
${MYVAR:3} # Remove the first three chars (leaving 4..end)
${MYVAR::3} # Return the first three characters
${MYVAR:3:5} # The next five characters after removing the first 3 (chars 4-8)
You can also replace particular strings or patterns using:
${MYVAR/search/replace}
The pattern is in the same format as file-name matching, so * (any characters) is common, often followed by a particular symbol like / or .
Examples:
Given a variable like
MYVAR="users/joebloggs/domain.example"
Remove the path leaving file name (all characters up to a slash):
echo ${MYVAR##*/}
domain.example
Remove the file name, leaving the path (delete shortest match after last /):
echo ${MYVAR%/*}
users/joebloggs
Get just the file extension (remove all before last period):
echo ${MYVAR##*.}
example
NOTE: To do two operations, you can't combine them, but have to assign to an intermediate variable. So to get the file name without path or extension:
NAME=${MYVAR##*/} # remove part before last slash
echo ${NAME%.*} # from the new var remove the part after the last period
domain
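As a short worked sketch, the prefix and suffix operators from the summary can be applied to the string from the question to pull out more than one piece. The variable names entry, path, user and domain are chosen here for illustration only; all four operators used are POSIX, so this also runs in plain sh.

```bash
entry="/var/cpanel/users/joebloggs:DNS9=domain.example"

path=${entry%%:*}      # longest match of ":*" removed from the end
user=${path##*/}       # longest match of "*/" removed from the beginning
domain=${entry##*=}    # everything up to and including the last "=" removed

echo "$user $domain"   # prints: joebloggs domain.example
```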
Define a function like this:
getUserName() {
echo $1 | cut -d : -f 1 | xargs basename
}
And pass the string as a parameter:
userName=$(getUserName "/var/cpanel/users/joebloggs:DNS9=domain.example")
echo $userName
What about sed? That will work in a single command:
sed 's#.*/\([^:]*\).*#\1#' <<<$string
The # are being used for regex dividers instead of / since the string has / in it.
.*/ grabs the string up to the last slash.
\( .. \) marks a capture group. This is \([^:]*\).
The [^:] says any character except a colon, and the * means zero or more.
.* means the rest of the line.
\1 means substitute what was found in the first (and only) capture group. This is the name.
Here's the breakdown matching the string with the regular expression:
       /var/cpanel/users/  joebloggs  :DNS9=domain.example   joebloggs
sed 's#.*/                 \([^:]*\)  .*                    #\1       #'
Using a single Awk:
... | awk -F '[/:]' '{print $5}'
That is, using as field separator either / or :, the username is always in field 5.
To store it in a variable:
username=$(... | awk -F '[/:]' '{print $5}')
A more flexible implementation with sed that doesn't require username to be field 5:
... | sed -e 's/:.*//' -e 's?.*/??'
That is, delete everything from : onward, and then delete everything up to and including the last /. The scripts are quoted so the shell cannot expand * or ? as globs. Unlike the awk version, this does not depend on the position of the username in the path.
Using a single sed
echo "/var/cpanel/users/joebloggs:DNS9=domain.example" | sed 's/.*\/\(.*\):.*/\1/'
I like to chain awk calls together, using different delimiters set with the -F argument. First, split the string on /users/ and then on :
txt="/var/cpanel/users/joebloggs:DNS9=domain.com"
echo $txt | awk -F"/users/" '{print$2}' | awk -F: '{print $1}'
$2 gives the text after the delim, $1 the text before it.
I know I'm a little late to the party and there's already good answers, but here's my method of doing something like this.
DIR="/var/cpanel/users/joebloggs:DNS9=domain.example"
echo ${DIR} | rev | cut -d'/' -f 1 | rev | cut -d':' -f1

How to replace a string like "[1.0 - 4.0]" with a numeric value using awk or sed?

I have a CSV file that I am piping through a set of awk/sed commands.
Some lines in the CSV file look like this:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
where the 8th and 9th columns are a string representing a numeric range.
How can I use awk or sed to replace those fields with a numeric value? Either the beginning of the range, or the end of the range?
So this line would end up as
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
or
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,3.0,0.768
I got as far as removing the brackets but past that I'm stuck. I considered splitting on the " - ", but many lines in my file have a regular numeric value, not a range, in those last two columns, and that makes things messy (I don't want to end up with some lines having a different number of columns).
Here is a sed command that will take each range and break it up into two fields. It looks for strings like "[A - B]" and converts them to A,B. It can easily be modified to use just one of the values by changing the \1,\2 portion to \1 or \2. The regular expression assumes that all numbers have at least one digit on either side of a required decimal point, so 1, .5, and 3. would not match. If you need those, the regex can be made more accommodating.
$ cat file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
$ sed -Ee 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\1,\2|g' file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,3.0,0.384,0.768
Since your data is field-based, awk is the logical choice.
Note that while awk generally isn't aware of double-quoted fields, that is not a problem here, because the double-quoted fields do not have embedded , instances.
#!/usr/bin/env bash
useStart1=1 # set to `0` to use the *end* of the *penultimate* field's range instead.
useStart2=1 # set to `0` to use the *end* of the *last* field's range instead.
awk -v useStart1=$useStart1 -v useStart2=$useStart2 '
BEGIN { FS=OFS="," }
{
split($(NF-1), tokens1, /[][" -]+/)
split($NF, tokens2, /[][" -]+/)
$(NF-1) = useStart1 ? tokens1[2] : tokens1[3]
$NF = useStart2 ? tokens2[2] : tokens2[3]
print
}
' <<'EOF'
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
EOF
The code above yields:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
Modifying the values of $useStart1 and $useStart2 yields the appropriate variations.
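The question mentions that many lines carry a plain number rather than a range in the last two columns. A hedged variant that rewrites a field only when it actually looks like "[A - B]" could be sketched as follows. pick_range_value is a made-up name, and the sketch assumes the two bounds are always separated by exactly " - " (space, dash, space).

```bash
# Hypothetical helper: $1 is "start" or "end"; CSV lines are read from stdin.
pick_range_value() {
    awk -v which="$1" '
    BEGIN { FS = OFS = "," }
    {
        for (i = NF - 1; i <= NF; i++) {
            if ($i ~ /^"\[.* - .*\]"$/) {        # only touch "[A - B]" fields
                v = $i
                gsub(/^"\[|\]"$/, "", v)         # strip the "[ and ]" wrappers
                split(v, parts, / - /)           # parts[1] = A, parts[2] = B
                $i = (which == "end") ? parts[2] : parts[1]
            }
        }
        print
    }'
}
```

Because the split is on " - " with its surrounding spaces, a negative bound such as -1.0 stays intact, and lines whose last two columns are already plain numbers pass through unchanged.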

Resources