BASH regex syntax for replacing a sub-string

BASH regex syntax for replacing a sub-string - bash

I'm working in bash and I want to remove a substring from a string, I use grep to detect the string and that works as I want, my if conditions are true, I can test them in other tools and they select exactly the string element I want.
When it comes to removing the element from the string I'm having difficulty.
I want to remove something like ": Series 1", where there could be different numbers including 0 padded, a lower case s or extra spaces.
temp='Testing: This is a test: Series 1'
echo "A. "$temp
if echo "$temp" | grep -q -i ":[ ]*[S|s]eries[ ]*[0-9]*" && [ "$temp" != "" ]; then
title=$temp
echo "B. "$title
temp=${title//:[ ]*[S|s]eries[ ]*[0-9]*/ }
echo "C. "$temp
fi
# I trim temp for spaces here
series_title=${temp// /_}
echo "D. "$series_title
The problem I have is that at points C & D
Give me:
C. Testing
D. Testing_

You can perform regex matching from bash alone without using external tools.
It's not clear what your requirement is. But from your code, I guess following will help.
temp='Testing: This is a test: Series 1'
# Following will do a regex match and extract necessary parts
# i.e. extract everything before `:` if the entire pattern is matched
[[ $temp =~ (.*):\ *[Ss]eries\ *[0-9]* ]] || { echo "regex match failed"; exit; }
# now you can use the extracted groups as follows
echo "${BASH_REMATCH[1]}" # Output = Testing: This is a test
As mentioned in the comments, if you need to extract parts both before and after the removed section,
temp='Testing: This is a test: Series 1 <keep this>'
[[ $temp =~ (.*):\ *[Ss]eries\ *[0-9]*\ *(.*) ]] || { echo "invalid"; exit; }
echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}" # Output = Testing: This is a test <keep this>
Keep in mind that [0-9]* will match zero lengths too. If you need to force that there need to be at least single digit, use [0-9]+ instead. Same goes for <space here>* (i.e. zero or more spaces) and others.

Related

In Bash, is it possible to match a string variable containing wildcards to another string

I am trying to compare strings against a list of other strings read from a file.
However some of the strings in the file contain wildcard characters (both ? and *) which need to be taken into account when matching.
I am probably missing something but I am unable to see how to do it
Eg.
I have strings from file in an array which could be anything alphanumeric (and include commas and full stops) with wildcards : (a?cd, xy, q?hz, j,h-??)
and I have another string I wish to compare with each item in the list in turn. Any of the strings may contain spaces.
so what I want is something like
teststring="abcdx.rubb ish,y"
matchstrings=("a?cd" "*x*y" "q?h*z" "j*,h-??")
for i in "${matchstrings[#]}" ; do
if [[ "$i" == "$teststring" ]]; then # this test here is the problem
<do something>
else
<do something else>
fi
done
This should match on the second "matchstring" but not any others
Any help appreciated

Yes; you just have the two operands to == reversed; the glob goes on the right (and must not be quoted):
if [[ $teststring == $i ]]; then
Example:
$ i=f*
$ [[ foo == $i ]] && echo pattern match
pattern match
If you quote the parameter expansion, the operation is treated as a literal string comparison, not a pattern match.
$ [[ foo == "$i" ]] || echo "foo != f*"
foo != f*
Spaces in the pattern are not a problem:
$ i="foo b*"
$ [[ "foo bar" == $i ]] && echo pattern match
pattern match

You can do this even completely within POSIX, since case alternatives undergo parameter substitution:
#!/bin/sh
teststring="abcdx.rubbish,y"
while IFS= read -r matchstring; do
case $teststring in
($matchstring) echo "$matchstring";;
esac
done << "EOF"
a?cd
*x*y
q?h*z
j*,h-??
EOF
This outputs only *x*y as desired.

Bash shell test if all characters in one string are in another string

I have two strings which I want to compare for equal chars, the strings must contain the exact chars but mychars can have extra chars.
mychars="abcdefg"
testone="abcdefgh" # false h is not in mychars
testtwo="abcddabc" # true all char in testtwo are in mychars
function test() {
if each char in $1 is in $2 # PSEUDO CODE
then
return 1
else
return 0
fi
}
if test $testone $mychars; then
echo "All in the string" ;
else ; echo "Not all in the string" ; fi
# should echo "Not all in the string" because the h is not in the string mychars
if test $testtwo $mychars; then
echo "All in the string" ;
else ; echo "Not all in the string" ; fi
# should echo 'All in the string'
What is the best way to do this? My guess is to loop over all the chars in the first parameter.

You can use tr to replace any char from mychars with a symbol, then you can test if the resulting string is any different from the symbol, p.e.,:
tr -s "[$mychars]" "." <<< "ggaaabbbcdefg"
Outputs:
.
But:
tr -s "[$mychars]" "." <<< "xxxggaaabbbcdefgxxx"
Prints:
xxx.xxx
So, your function could be like the following:
function test() {
local dictionary="$1"
local res=$(tr -s "[$dictionary]" "." <<< "$2")
if [ "$res" == "." ]; then
return 1
else
return 0
fi
}
Update: As suggested by #mklement0, the whole function could be shortened (and the logic fixed) by the following:
function test() {
local dictionary="$1"
[[ '.' == $(tr -s "[$dictionary]" "." <<< "$2") ]]
}

The accepted answer's solution is short, clever, and efficient.
Here's a less efficient alternative, which may be of interest if you want to know which characters are unique to the 1st string, returned as a sorted, distinct list:
charTest() {
local charsUniqueToStr1
# Determine which chars. in $1 aren't in $2.
# This returns a sorted, distinct list of chars., each on its own line.
charsUniqueToStr1=$(comm -23 \
<(sed 's/\(.\)/\1\'$'\n''/g' <<<"$1" | sort -u) \
<(sed 's/\(.\)/\1\'$'\n''/g' <<<"$2" | sort -u))
# The test succeeds if there are no chars. in $1 that aren't also in $2.
[[ -z $charsUniqueToStr1 ]]
}
mychars="abcdefg" # define reference string
charTest "abcdefgh" "$mychars"
echo $? # print exit code: 1 - 'h' is not in reference string
charTest "abcddabc" "$mychars"
echo $? # print exit code: 0 - all chars. are in reference string
Note that I've renamed test() to charTest() to avoid a name collision with the test builtin/utility.
sed 's/\(.\)/\1\'$'\n''/g' splits the input into individual characters by placing each on a separate line.
Note that the command creates an extra empty line at the end, but that doesn't matter in this case; to eliminate it, append ; ${s/\n$//;} to the sed script.
The command is written in a POSIX-compliant manner, which complicates it, due to having to splice in an \-escaped actual newline (via an ANSI C-quoted string, $\n'); if you have GNU sed, you can simplify to sed -r 's/(.)/\1\n/g
sort -u then sorts the resulting list of characters and weeds out duplicates (-u).
comm -23 compares the distinct set of sorted characters in both strings and prints those unique to the 1st string (comm uses a 3-column layout, with the 1st column containing lines unique to the 1st file, the 2nd column containing lines unique to the 2nd column, and the 3rd column printing lines the two input files have in common; -23 suppresses the 2nd and 3rd columns, effectively only printing the lines that are unique to the 1st input).
[[ -z $charsUniqueToStr1 ]] then tests if $charsUniqueToStr1 is empty (-z);
in other words: success (exit code 0) is indicated, if the 1st string contains no chars. that aren't also contained in the 2nd string; otherwise, failure (exit code 1); by virtue of the conditional ([[ .. ]]) being the last statement in the function, its exit code also becomes the function's exit code.

substring extraction in bash

iamnewbie: this code is inefficient but it should extract the substring, the problem is with last echo statement,need some insight.
function regex {
#this function gives the regular expression needed
echo -n \'
for (( i = 1 ; i <= $1 ; i++ ))
do
echo -n .
done
echo -n '\('
for (( i = 1 ; i <= $2 ; i++ ))
do
echo -n .
done
echo -n '\)'
echo -n \'
}
# regex function ends
echo "Enter the string:"
read stg
#variable stg holds the string entered
if [ -z "$stg" ] ; then
echo "Null string"
exit
else
echo "Length of the $stg is:"
z=`expr "$stg" : '.*' `
#variable z holds the length of given string
echo $z
fi
echo "Enter the number of trailing characters to be extracted from $stg:"
read n
m=`expr $z - $n `
#variable m holds an integer value which is equal to total length - length of characters to be extracted
x=$(regex $m $n)
echo ` expr "$stg" : "$x" `
#the echo statement(above) is just printing a newline!! But not the result
What I intend to do with this code is, if I enter "racecar" and give "3" , it should display "car" which are the last three characters. Instead of displaying "car" its just printing a newline. Please correct this code rather than giving a better one.

Although you didn't ask for a better solution, it's worth mentioning:
$ n=3
$ stg=racecar
$ echo "${stg: -n}"
car
Note that the space after the : in ${stg: -n} is required. Without the space, the parameter expansion is a default-value expansion rather than a substring expansion. With the space, it's a substring expansion; -n is interpreted as an arithmetic expression (which means that n is interpreted as $n) and since the result is a negative number, it specifies the number of characters from the end to start the substring. See the Bash manual for details.
Your solution is based on evaluating the equivalent of:
expr "$stg" : '......$...$'
with an appropriate number of dots. It's important to understand what the above bash syntax actually means. It invokes the command expr, passing it three arguments:
arg 1: the contents of the variable stg
arg 2: :
arg 3: ......$...$
Note that there are no quotes visible. That's because the quotes are part of bash syntax, not part of the argument values.
If the value of stg had enough characters, the result of the above expr invocation would be to print out the 7th, 8th and 9th character of the value of stg`. Otherwise, it would print a blank line, and fail.
But that's not what you are doing. You're creating the regular expression:
'......$...$'
which has single quotes in it. Since single-quotes are not special characters in a regex, they match themselves; in other words, that pattern will match a string which starts with a single quote, followed by nine arbitrary characters, followed by another single quote. And if the string does match, it will print the three characters prior to the second single-quote.
Of course, since the regular expression you make has a . for every character in the target string, it won't match the target even if the target started and begun with a single-quote, since there would be too many dots in the regex to match that.
If you don't put single quotes into the regex, then your program will work, but I have to say that few times have I seen such an intensely circuitous implementation of the substring function. If you're not trying to win an obfuscated bash competition (a difficult challenge since most production bash code is obfuscated by nature), I'd suggest you use normal bash features instead of trying to do everything with regexen.
One of those is the syntax to determine the length of a string:
$ stg=racecar
$ echo ${#stg}
7
(although, as shown at the beginning, you don't actually even need that.)

What about:
$ n=3
$ string="racecar"
$ [[ "$string" =~ (.{$n})$ ]]
$ echo ${BASH_REMATCH[1]}
car
This looks for the last n characters at the end of the line. In a script:
#!/bin/bash
read -p "Enter a string: " string
read -p "Enter the number of characters you want from the end: " n
[[ "$string" =~ (.{$n})$ ]]
echo "These are the last $n characters: ${BASH_REMATCH[1]}"
You may want to add some more error handling, but this'll do it.

I'm not sure you need loops for this task. I wrote some example to get two parameters from user and cut the word according to it.
#!/bin/bash
read -p "Enter some word? " -e stg
#variable stg holds the string entered
if [ -z "$stg" ] ; then
echo "Null string"
exit 1
fi
read -p "Enter some number to set word length? " -e cutNumber
# check that cutNumber is a number
if ! [ "$cutNumber" -eq "$cutNumber" ]; then
echo "Not a number!"
exit 1
fi
echo "Cut first n characters:"
echo ${stg:$cutNumber}
echo
echo "Show first n characters:"
echo ${stg:0:$cutNumber}
echo "Alternative get last n characters:"
echo -n "$stg" | tail -c $cutNumber
echo
Example:
Enter some word? TheRaceCar
Enter some number to set word length? 7
Cut first n characters:
Car
Show first n characters:
TheRace
Alternative get last n characters:
RaceCar

multiline regexp matching in bash

I would like to do some multiline matching with bash's =~
#!/bin/bash
str='foo = 1 2 3
bar = what about 42?
boo = more words
'
re='bar = (.*)'
if [[ "$str" =~ $re ]]; then
echo "${BASH_REMATCH[1]}"
else
echo no match
fi
Almost there, but if I use ^ or $, it will not match, and if I don't use them, . eats newlines too.
EDIT:
sorry, values after = could be multi-word values.

I could be wrong, but after a quick read from here, especially Note 2 at the end of the page, bash can sometimes include the newline character when matching with the dot operator. Therefore, a quick solution would be:
#!/bin/bash
str='foo = 1
bar = 2
boo = 3
'
re='bar = ([^\
]*)'
if [[ "$str" =~ $re ]]; then
echo "${BASH_REMATCH[1]}"
else
echo no match
fi
Notice that I now ask it match anything except newlines. Hope this helps =)
Edit: Also, if I understood correctly, the ^ or $ will actually match the start or the end (respectively) of the string, and not the line. It would be better if someone else could confirm this, but it is the case and you do want to match by line, you'll need to write a while loop to read each line individually.

Multiple matches in a string using regex in bash

Been looking for some more advanced regex info on regex with bash and have not found much information on it.
Here's the concept, with a simple string:
myString="DO-BATCH BATCH-DO"
if [[ $myString =~ ([[:alpha:]]*)-([[:alpha:]]*) ]]; then
echo ${BASH_REMATCH[1]} #first perens
echo ${BASH_REMATCH[2]} #second perens
echo ${BASH_REMATCH[0]} #full match
fi
outputs:
BATCH
DO
DO-BATCH
So fine it does the first match (BATCH-DO) but how do I pull a second match (DO-BATCH)? I'm just drawing a blank here and can not find much info on bash regex.

OK so one way I did this is to put it in a for loop:
myString="DO-BATCH BATCH-DO"
for aString in ${myString[#]}; do
if [[ ${aString} =~ ([[:alpha:]]*)-([[:alpha:]]*) ]]; then
echo ${BASH_REMATCH[1]} #first perens
echo ${BASH_REMATCH[2]} #second perens
echo ${BASH_REMATCH[0]} #full match
fi
done
which outputs:
DO
BATCH
DO-BATCH
BATCH
DO
BATCH-DO
Which works but I kind of was hoping to pull it all from one regex if possible.

In your answer, myString is not an array, but you use an array reference to access it. This works in Bash because the 0th element of an array can be referred to by just the variable name and vice versa. What that means is that you could use:
for aString in $myString; do
to get the same result in this case.
In your question, you say the output includes "BATCH-DO". I get "DO-BATCH" so I presume this was a typo.
The only way to get the extra strings without using a for loop is to use a longer regex. By the way, I recommend putting Bash regexes in variable. It makes certain types much easier to use (those the contain whitespace or special characters, for example.
pattern='(([[:alpha:]]*)-([[:alpha:]]*)) +(([[:alpha:]]*)-([[:alpha:]]*))'
[[ $myString =~ $pattern ]]
declare -p BASH_REMATCH #dump the array
Outputs:
declare -ar BASH_REMATCH='([0]="DO-BATCH BATCH-DO" [1]="DO-BATCH" [2]="DO" [3]="BATCH" [4]="BATCH-DO" [5]="BATCH" [6]="DO")'
The extra set of parentheses is needed if you want to capture the individual substrings as well as the hyphenated phrases. If you don't need the individual words, you can eliminate the inner sets of parentheses.
Notice that you don't need to use if if you only need to extract substrings. You only need if to take conditional action based on a match.
Also notice that ${BASH_REMATCH[0]} will be quite different with the longer regex since it contains the whole match.

Per #Dennis Williamson's post I messed around and ended up with the following:
myString="DO-BATCH BATCH-DO"
pattern='(([[:alpha:]]*)-([[:alpha:]]*)) +(([[:alpha:]]*)-([[:alpha:]]*))'
[[ $myString =~ $pattern ]] && { read -a myREMatch <<< ${BASH_REMATCH[#]}; }
echo "\${myString} -> ${myString}"
echo "\${#myREMatch[#]} -> ${#myREMatch[#]}"
for (( i = 0; i < ${#myREMatch[#]}; i++ )); do
echo "\${myREMatch[$i]} -> ${myREMatch[$i]}"
done
This works fine except myString must have the 2 values to be there. So I post this because its is kinda interesting and I had fun messing with it. But to get this more generic and address any amount of paired groups (ie DO-BATCH) I'm going to go with a modified version of my original answer:
myString="DO-BATCH BATCH-DO"
myRE="([[:alpha:]]*)-([[:alpha:]]*)"
read -a myString <<< $myString
for aString in ${myString[#]}; do
echo "\${aString} -> ${aString}"
if [[ ${aString} =~ ${myRE} ]]; then
echo "\${BASH_REMATCH[#]} -> ${BASH_REMATCH[#]}"
echo "\${#BASH_REMATCH[#]} -> ${#BASH_REMATCH[#]}"
for (( i = 0; i < ${#BASH_REMATCH[#]}; i++ )); do
echo "\${BASH_REMATCH[$i]} -> ${BASH_REMATCH[$i]}"
done
fi
done
I would have liked a perlre like multiple match but this works fine.

Although this is a year old question (without accepted answer), could the regex pattern be simplified to:
myRE="([[:alpha:]]*-[[:alpha:]]*)"
by removing the inner parenthesis to find a smaller (more concise) set of the words DO-BATCH and BATCH-DO?
It works for me in you 18:10 time answer. ${BASH_REMATCH[0]} and ${BASH_REMATCH[1]} result in the 2 words being found.

In case you don't actually know how many matches there will be ahead of time, you can use this:
#!/bin/bash
function handle_value {
local one=$1
local two=$2
echo "i found ${one}-${two}"
}
function match_all {
local current=$1
local regex=$2
local handler=$3
while [[ ${current} =~ ${regex} ]]; do
"${handler}" "${BASH_REMATCH[#]:1}"
# trim off the portion already matched
current="${current#${BASH_REMATCH[0]}}"
done
}
match_all \
"DO-BATCH BATCH-DO" \
'([[:alpha:]]*)-([[:alpha:]]*)[[:space:]]*' \
'handle_value'

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio