How to capture the longest match of a repeating pattern using BASH_REMATCH


I am trying to capture the longest match of a repeating pattern
do_run() {
local regex='.*((abc)+).*'
local str='_abcabcabc123_'
echo "regex=${regex}"$'\n'
echo "str=${str}"$'\n'
if [[ "${str}" =~ ${regex} ]]
then
for i in ${!BASH_REMATCH[@]}
do
echo "$i=${BASH_REMATCH[i]}"
done
else
echo "no match"
fi
}
I get the following output:
regex=.*((abc)+).*
str=_abcabcabc123_
0=_abcabcabc123_
1=abc
2=abc
I am trying to get something like:
regex=.*((abc)+).*
str=_abcabcabc123_
0=_abcabcabc123_
x=abcabcabc
(Update: x is just here to indicate that the index of the matching group does not matter, but I need to know which number to use to retrieve the matching group...)
Update:
After reading the comments, the following regex works: ((abc)+)
However, I also need to capture what precedes and what follows ((abc)+).
I had not mentioned it earlier because I thought the same solution would apply.
So the new code would be:
do_run() {
local regex='(.*)((abc)+)(.*)'
local str='_abcabcabc123_'
echo "regex=${regex}"$'\n'
echo "str=${str}"$'\n'
if [[ "${str}" =~ ${regex} ]]
then
for i in ${!BASH_REMATCH[@]}
do
echo "$i=${BASH_REMATCH[i]}"
done
else
echo "no match"
fi
}
I then get the following output:
regex=(.*)((abc)+)(.*)
str=_abcabcabc123_
0=_abcabcabc123_
1=_abcabc
2=abc
3=abc
4=123_
I want to be able to retrieve abcabcabc from a matching group, and also what precedes it and what follows it.

As a workaround, you can do this:
[STEP 101] $ cat foo.sh
v=_abcabcabc123_
if [[ $v =~ (abc)+ ]]; then
middle=${BASH_REMATCH[0]}
[[ $v =~ (.*)"$middle" ]]
before=${BASH_REMATCH[1]}
[[ $v =~ "$middle"(.*) ]]
after=${BASH_REMATCH[1]}
echo "before: $before"
echo "middle: $middle"
echo "after : $after"
fi
[STEP 102] $ bash foo.sh
before: _
middle: abcabcabc
after : 123_
[STEP 103] $
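
A variation on the workaround above, sketched under the assumption that only the first (leftmost) run of abc matters: once the middle part is known, parameter expansion can cut the string without any further regex matches. The variable names are only for illustration.

```shell
#!/usr/bin/env bash
str='_abcabcabc123_'
if [[ $str =~ (abc)+ ]]; then
    middle=${BASH_REMATCH[0]}    # whole match: the longest run at the leftmost position
    before=${str%%"$middle"*}    # drop the first occurrence of the match and everything after it
    after=${str#*"$middle"}      # drop everything up to and including that first occurrence
    printf 'before: %s\nmiddle: %s\nafter : %s\n' "$before" "$middle" "$after"
fi
```

Quoting "$middle" inside the expansions makes it match literally rather than as a glob pattern.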

I also need to capture what precedes and what follows ((abc)+).
For that, you'd typically need a negative lookbehind with a Perl regex, something along the lines of (?<!abc)((abc)+)(.*).
I am bad at Perl regex, but with Perl-enabled grep I was able to do this:
$ grep -oxP '(.*)(?<!abc)((abc)+)\K(.*)' <<<'_abcabcabc123_'
123_
$ grep -oP '((abc)+)' <<<'_abcabcabc123_'
abcabcabc
$ rev <<<'_abcabcabc123_' | grep -oP '(.*)(?<!cba)((cba)+)\K(.*)' | rev
_
Bash has no lookarounds and no perl regex. Consider using python or perl.
But you may use sed by splitting the part on the regex and then reading lines, which may be simpler:
$ readarray -t lines < <(<<<'_abcabcabc123_' sed -E 's/((abc)+)/\n&\n/'); declare -p lines
declare -a lines=([0]="_" [1]="abcabcabc" [2]="123_")
Another idea: you may use bash expansion to replace the abc parts by something unique, then split it on that separator:
$ IFS=' ' read -r before post < <(printf "%s\n" "${str//abc/ }") ; declare -p before post
declare -- before="_"
declare -- post="123_"
# or
$ IFS='#' read -r before post < <(<<<"${str//abc/#}" tr -s '#') ; declare -p before post
declare -- before="_"
declare -- post="123_"

For your given input this regex would work:
re='^([^a]|a[^b]*|ab[^c]*)((abc)+)(.*)'
str='_abcabcabc123_'
[[ $str =~ $re ]] && declare -p BASH_REMATCH
Output:
declare -ar BASH_REMATCH=([0]="_abcabcabc123_" [1]="_" [2]="abcabcabc" [3]="abc" [4]="123_")
So you can use:
"${BASH_REMATCH[1]}" # string before
"${BASH_REMATCH[2]}" # string containing all "abc"s
"${BASH_REMATCH[4]}" # string after

Related

Get first character of each string with BASH_REMATCH

I'm trying to get the first character of each string using regex and BASH_REMATCH in a shell script.
My input text file contains:
config_text = STACK OVER FLOW
The strings STACK OVER FLOW must be uppercase like that.
My output should be something like this :
SOF
My code for now is :
var = config_text
values=$(grep $var test_file.txt | tr -s ' ' '\n' | cut -c 1)
if [[ $values =~ [=(.*)]]; then
echo $values
fi
As you can see, I'm using tr and cut, but I'm looking to replace them with only BASH_REMATCH because these two commands have been reported in many links as not working on macOS.
I tried something like this :
var = config_text
values=$(grep $var test_file.txt)
if [[ $values =~ [=(.*)(\b[a-zA-Z])]]; then
echo $values
fi
VALUES as I explained should be :
S O F
But it seems \b does not work in shell scripts.
Does anyone have an idea how to get my desired output with BASH_REMATCH only?
Thanks in advance for any help.

A generic BASH_REMATCH solution handling any number of words and any separator.
input="STACK OVER FLOW" pattern='([[:upper:]]+)([^[:upper:]]*)' result="" # use "local" only inside a function
while [[ $input =~ $pattern ]]; do
result+="${BASH_REMATCH[1]::1}${BASH_REMATCH[2]}"
input="${input:${#BASH_REMATCH[0]}}"
done
echo "$result"
# Output: "S O F"

Bash's regexes are kind of cumbersome if you don't know how many words there are in the input string. How's this instead?
config_text="STACK OVER FLOW"
sed 's/\([^[:space:]]\)[^[:space:]]*/\1/g' <<<"$config_text"

First, put a valid shebang at the top and paste your script at https://shellcheck.net for validation and recommendations.
Assuming the line starts with config and ends with FLOW, e.g.:
config_text = STACK OVER FLOW
Now the script.
#!/usr/bin/env bash
values="config_text = STACK OVER FLOW"
regexp="config_text = ([[:upper:]]{1})[^ ]+ ([[:upper:]]{1})[^ ]+ ([[:upper:]]{1}).+$"
while IFS= read -r line; do
[[ "$line" = "$values" && "$values" =~ $regexp ]] &&
printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
done < test_file.txt
If there is only one line, or the target string is on the first line of test_file.txt, the while loop is not needed.
#!/usr/bin/env bash
values="config_text = STACK OVER FLOW"
regexp="config_text = ([[:upper:]]{1})[^ ]+ ([[:upper:]]{1})[^ ]+ ([[:upper:]]{1}).+$"
IFS= read -r line < test_file.txt
[[ "$line" = "$values" && "$values" =~ $regexp ]] &&
printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
Make sure you are running Bash 4+, since macOS defaults to Bash 3.
See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?

Another option rather than bash regex would be to utilize bash parameter expansion substring ${parameter:offset:length} to extract the desired characters:
$ read -ra arr <text.file ; printf "%s%s%s\n" "${arr[2]:0:1}" "${arr[3]:0:1}" "${arr[4]:0:1}"
SOF
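
The same idea can be extended to any number of words. A minimal sketch, assuming the line has the form key = WORDS and using illustrative variable names:

```shell
#!/usr/bin/env bash
line='config_text = STACK OVER FLOW'
value=${line#*=}               # keep what follows the first "="
read -ra words <<< "$value"    # split on whitespace into an array
initials=''
for w in "${words[@]}"; do
    initials+=${w:0:1}         # first character of each word
done
echo "$initials"
```

This uses only builtins, so it does not depend on tr, cut, or sed.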

Looping over shell script arguments and passing quoted arguments to function

I have a script below that sources a directory of bash scripts and then parses the flags of the command to run a specific function from the sourced files.
Given this function within the scripts dir:
function reggiEcho () {
echo $1
}
Here are some examples of current output
$ reggi --echo hello
hello
$ reggi --echo hello world
hello
$ reggi --echo "hello world"
hello
$ reggi --echo "hello" --echo "world"
hello
world
As you can see, quoted parameters are not honored as they should be: "hello world" should echo properly.
This is the script, the issue is within the while loop.
How do I parse these flags, and maintain passing in quoted parameters into the function?
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
STR="$(find $DIR/scripts -type f -name '*.sh' -print)"
ARR=( $STR )
TUSAGE="\n"
for f in "${ARR[@]}"; do
if [ -f $f ]
then
. $f --source-only
if [ -z "$USAGE" ]
then
:
else
TUSAGE="$TUSAGE \t$USAGE\n"
fi
USAGE=""
else
echo "$f not found"
fi
done
TUSAGE="$TUSAGE \t--help (shows this help output)\n"
function usage() {
echo "Usage: --function <args> [--function <args>]"
echo $TUSAGE
exit 1
}
HELP=false
cmd=()
while [ $# -gt 0 ]; do # loop until no args left
if [[ $1 = '--help' ]] || [[ $1 = '-h' ]] || [[ $1 = '--h' ]] || [[ $1 = '-help' ]]; then
HELP=true
fi
if [[ $1 = --* ]] || [[ $1 = -* ]]; then # arg starts with --
if [[ ${#cmd[@]} -gt 0 ]]; then
"${cmd[@]}"
fi
top=`echo $1 | tr -d -` # remove all flags
top=`echo ${top:0:1} | tr '[a-z]' '[A-Z]'`${top:1} # make sure first letter is uppercase
top=reggi$top # prepend reggi
cmd=( "$top" ) # start new array
else
echo $1
cmd+=( "$1" )
fi
shift
done
if [[ "$HELP" = true ]]; then
usage
elif [[ ${#cmd[@]} -gt 0 ]]; then
${cmd[@]}
else
usage
fi

There are many places in this script where you have variable references without double-quotes around them. This means the variables' values will be subject to word splitting and wildcard expansion, which can have various weird effects.
The specific problem you're seeing is due to an unquoted variable reference on the fourth-from-last line, ${cmd[@]}. With cmd=( echo "hello world" ), word splitting makes this equivalent to echo hello world rather than echo "hello world".
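
To see the difference in isolation, here is a small sketch. count_args is a hypothetical helper, not part of the script in the question; it just reports how many arguments it received.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper that prints the number of arguments it was given.
count_args() { echo "$#"; }

cmd=( count_args "hello world" )

unquoted=$(${cmd[@]})      # word splitting applies: runs count_args hello world
quoted=$("${cmd[@]}")      # each array element stays one word: runs count_args "hello world"

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

The unquoted form passes two arguments and the quoted form passes one.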
Fixing that one line will fix your current problem, but there are a number of other unquoted variable references that may cause other problems later. I recommend fixing all of them. Cyrus' recommendation of shellcheck.net is good at pointing them out, and will also note some other issues I won't cover here. One thing it won't mention is that you should avoid all-caps variable names (DIR, TUSAGE, etc) -- there are a bunch of all-caps variables with special meanings, and it's easy to accidentally reuse one of them and wind up with weird effects. Lowercase and mixed-case variables are safer.
I also recommend against using \t and \n in strings, and counting on echo to translate them into tabs and newlines, respectively. Some versions of echo do this automatically, some require the -e option to tell them to do it, some will print "-e" as part of their output... it's a mess. In bash, you can use $'...' to translate those escape sequences directly, e.g:
tusage="$tusage"$' \t--help (shows this help output)\n' # Note mixed quoting modes
echo "$tusage" # Note that double-quoting is *required* for this to work right
You should also fix the file listing so it doesn't depend on being unquoted (see chepner's comment). If you don't need to scan subdirectories of $DIR/scripts, you can do this with a simple wildcard (note lowercase vars and that the var is double-quoted, but the wildcard isn't):
arr=( "$dir/scripts"/*.sh )
If you need to look in subdirectories, it's more complicated. If you have bash v4 you can use a globstar wildcard, like this:
shopt -s globstar
arr=( "$dir/scripts"/**/*.sh )
If your script might have to run under bash v3, see BashFAQ #20: "How can I find and safely handle file names containing newlines, spaces or both?", or just use this:
while IFS= read -r -d '' f <&3; do
if [ -f "$f" ]
# ... etc
done 3< <(find "$dir/scripts" -type f -name '*.sh' -print0)
(That's my favorite it-just-works idiom for iterating over find's matches. Although it does require bash, not some generic POSIX shell.)

Bash variable substitution and strings

Let's say I have two variables:
a="AAA"
b="BBB"
I read a string from a file. This string is the following:
str='$a $b'
How to create a new string from the first one that substitutes the variables?
newstr="AAA BBB"

Bash variable indirection without eval:
Well, as eval is evil, we may try to do this without it, by using indirection in variable names.
a="AAA"
b="BBB"
str='$a $b'
newstr=()
for cnt in $str ;do
[ "${cnt:0:1}" == '$' ] && cnt=${cnt:1} && cnt=${!cnt}
newstr+=($cnt)
done
newstr="${newstr[*]}"
echo $newstr
AAA BBB
Another try:
var1="Hello"
var2="2015"
str='$var1 world! Happy new year $var2'
newstr=()
for cnt in $str ;do
[ "${cnt:0:1}" == '$' ] && cnt=${cnt:1} && cnt=${!cnt}
newstr+=($cnt)
done
newstr="${newstr[*]}"
echo $newstr
Hello world! Happy new year 2015
Addendum: As correctly pointed out in @EtanReisner's comment, if your string contains * or other glob-expandable strings, you may have to use set -f to prevent bad things:
cd /bin
var1="Hello"
var2="star"
var3="*"
str='$var1 this string contain a $var2 as $var3 *'
newstr=()
for cnt in $str ;do
[ "${cnt:0:1}" == '$' ] && cnt=${cnt:1} && cnt=${!cnt};
newstr+=("$cnt");
done;
newstr="${newstr[*]}"
echo "$newstr"
Hello this string contain a star as * bash bunzip2 busybox....zmore znew
echo ${#newstr}
1239
Note: I've added " at newstr+=("$cnt"); to prevent glob expansion, but set -f seems to be required...
newstr=()
set -f
for cnt in $str ;do
[ "${cnt:0:1}" == '$' ] && cnt=${cnt:1} && cnt=${!cnt}
newstr+=("$cnt")
done
set +f
newstr="${newstr[*]}"
echo "$newstr"
Hello this string contain a star as * *
Note 2: This is far from a perfect solution. For example, if the string contains punctuation, this won't work again... Example:
str='$var1, this string contain a $var2 as $var3: *'
with the same variables as in the previous run, this renders:
' this string contain a star as *' because ${!var1,} and ${!var3:} don't exist.
... and if $str contains special characters:
As @godblessfq asked:
If str contains a line break, how do I do the substitution and preserve the newline in the output?
So this is not robust, as every indirected variable must be first, last, or separated by spaces from all special characters!
str=$'$var1 world!\n... 2nd line...'
var1=Hello
newstr=()
set -f
IFS=' ' read -d$'\377' -ra array <<<"$str"
for cnt in "${array[@]}";do
[ "${cnt:0:1}" == '$' ] && cnt=${cnt:1} && cnt=${!cnt}
newstr+=("$cnt")
done
set +f
newstr="${newstr[*]}"
echo "$newstr"
Hello world!
... 2nd line...
As the <<< here-string adds a trailing newline, the last echo command could be written:
echo "${newstr%$'\n'}"
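
One way around the punctuation and newline limitations is to scan the string with a regex instead of splitting it into words. This is only a sketch: expand_vars is a hypothetical helper name, and it assumes every literal $ in the input is followed by a valid variable name (a bare $ stops the scan and the rest is copied unchanged).

```shell
#!/usr/bin/env bash
# expand_vars (hypothetical helper): replace each $name in the argument with
# the value of the variable called name, leaving all other text untouched.
expand_vars() {
    local rest=$1 out='' name
    local re='^([^$]*)\$([A-Za-z_][A-Za-z_0-9]*)(.*)$'
    while [[ $rest =~ $re ]]; do
        name=${BASH_REMATCH[2]}
        out+=${BASH_REMATCH[1]}${!name}   # text before the variable, then its value
        rest=${BASH_REMATCH[3]}           # keep scanning the remainder
    done
    printf '%s\n' "$out$rest"
}

var1="Hello" var2="star" var3="*"
result=$(expand_vars '$var1, this string contain a $var2 as $var3: *')
echo "$result"
```

Because nothing is ever expanded unquoted, no set -f is needed, and line breaks in the input pass through as-is.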

The easiest solution is to use eval:
eval echo "$str"
To assign it to a variable, use command substitution:
replaced=$(eval echo "$str")

Disclaimer: I only discovered perl an hour ago. But this seems to work robustly, whatever special characters you throw at it:
newstr=$(a2="$a" b2="$b" perl -pe 's/\$a\b/$ENV{a2}/g; s/\$b\b/$ENV{b2}/g' <(echo -e "$str"))
Test:
a='A*A\nA'
b='B*B\nB'
str='$a $aa * \n $b $bb'
newstr=$(a2="$a" b2="$b" perl -pe 's/\$a\b/$ENV{a2}/g; s/\$b\b/$ENV{b2}/g' <(echo -e "$str"))
echo -e "$newstr"
Output:
A*A
A $aa *
B*B
B $bb

I'd use an awk solution with awk variables. This allows passing text containing special characters and substituting any placeholder with it.
A workaround to recognize $ is to use [\x24]:
awk -v a="$a" -v b="$b" '{gsub("[\x24]a",a);gsub("[\x24]b",b); print}' <<< "$str"
Here:
-v defines an awk variable, e.g. a="$a"
\x24 is the ASCII code for $, so [\x24]a matches $a
gsub(x,y) replaces every match of x with y

While loop does not execute

I currently have this code:
listing=$(find "$PWD")
fullnames=""
while read listing;
do
if [ -f "$listing" ]
then
path=`echo "$listing" | awk -F/ '{print $(NF)}'`
fullnames="$fullnames $path"
echo $fullnames
fi
done
For some reason, this script isn't working, and I think it has something to do with the way that I'm writing the while loop / declaring listing. Basically, the code is supposed to pull out the actual names of the files, i.e. blah.txt, from the find $PWD.

read listing does not read a value from the string listing; it sets the value of listing with a line read from standard input. Try this:
# Ignoring the possibility of file names that contain newlines
fullnames=()
while IFS= read -r; do
[[ -f $REPLY ]] || continue
path=${REPLY##*/}
fullnames+=( "$path" )
echo "${fullnames[@]}"
done < <( find "$PWD" )
With bash 4 or later, you can simplify this with
shopt -s globstar
for f in **/*; do
[[ -f $f ]] || continue
paths+=( "$f" )
done
fullnames=( "${paths[@]##*/}" )

parse and expand interval

In my script I need to expand an interval, e.g.:
input: 1,5-7
to get something like the following:
output: 1,5,6,7
I've found other solutions here, but they involve python and I can't use it in my script.

Solution with Just Bash 4 Builtins
You can use Bash range (brace) expansions. Assuming you've already parsed your input, you can perform a series of successive operations to transform your range into a comma-separated series. For example:
value1=1
value2='5-7'
value2=${value2/-/..}
value2=`eval echo {$value2}`
echo "input: $value1,${value2// /,}"
All the usual caveats about the dangers of eval apply, and you'd definitely be better off solving this problem in Perl, Ruby, Python, or AWK. If you can't or won't, then you should at least consider including some pipeline tools like tr or sed in your conversions to avoid the need for eval.

Try something like this:
#!/bin/bash
for f in ${1//,/ }; do
if [[ $f =~ - ]]; then
a+=( $(seq ${f%-*} 1 ${f#*-}) )
else
a+=( $f )
fi
done
a=${a[*]}
a=${a// /,}
echo $a
Edit: As @Maxim_united mentioned in the comments, appending might be preferable to re-creating the array over and over again.

This should work with multiple ranges too.
#! /bin/bash
input="1,5-7,13-18,22"
result_str=""
for num in $(tr ',' ' ' <<< "$input"); do
if [[ "$num" == *-* ]]; then
res=$(seq -s ',' $(sed -n 's#\([0-9]\+\)-\([0-9]\+\).*#\1 \2#p' <<< "$num"))
else
res="$num"
fi
result_str="$result_str,$res"
done
echo ${result_str:1}
Will produce the following output:
1,5,6,7,13,14,15,16,17,18,22

expand_commas()
{
local arg
local st en i
set -- ${1//,/ }
for arg
do
case $arg in
[0-9]*-[0-9]*)
st=${arg%-*}
en=${arg#*-}
for ((i = st; i <= en; i++))
do
echo $i
done
;;
*)
echo $arg
;;
esac
done
}
Usage:
result=$(expand_commas arg)
eg:
result=$(expand_commas 1,5-7,9-12,3)
echo $result
You'll have to turn the separated words back into commas, of course.
It's a bit fragile with bad inputs but it's entirely in bash.
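
A sketch of the same approach that also joins the result with commas and rejects malformed items. expand_ranges is a hypothetical name, and the input is assumed to contain only non-negative integers and x-y ranges.

```shell
#!/usr/bin/env bash
# expand_ranges (hypothetical name): print e.g. 1,5,6,7 for the input 1,5-7.
expand_ranges() {
    local IFS=, part i
    local -a parts out=()
    read -r -a parts <<< "$1"              # split the input on commas
    for part in "${parts[@]}"; do
        if [[ $part =~ ^([0-9]+)-([0-9]+)$ ]]; then
            # 10# forces base 10 so that leading zeros are not read as octal
            for (( i = 10#${BASH_REMATCH[1]}; i <= 10#${BASH_REMATCH[2]}; i++ )); do
                out+=( "$i" )
            done
        elif [[ $part =~ ^[0-9]+$ ]]; then
            out+=( "$part" )
        else
            printf 'invalid item: %s\n' "$part" >&2
            return 1
        fi
    done
    echo "${out[*]}"                       # "${out[*]}" joins on the first character of IFS
}

expand_ranges 1,5-7,9-12,3
```

Making IFS local to the function means the comma splitting and joining do not leak into the rest of the script.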

Here's my stab at it:
input=1,5-7,10,17-20
IFS=, read -a chunks <<< "$input"
output=()
for chunk in "${chunks[@]}"
do
IFS=- read -a args <<< "$chunk"
if (( ${#args[@]} == 1 )) # single number
then
output+=(${args[*]})
else # range
output+=($(seq "${args[@]}"))
fi
done
joined=$(sed -e 's/ /,/g' <<< "${output[*]}")
echo $joined
Basically split on commas, then interpret each piece. Then join back together with commas at the end.

A generic bash solution using the sequence expression `{x..y}'
#!/bin/bash
function doIt() {
local inp="${@//,/ }" # replace every comma with a space
declare -a args=( $(echo ${inp//-/..}) ) # turn every x-y into x..y
local item
local sep
for item in "${args[@]}"
do
case ${item} in
*..*) eval "for i in {${item}} ; do echo -n \${sep}\${i}; sep=, ; done";;
*) echo -n ${sep}${item};;
esac
sep=,
done
}
doIt "1,5-7"
This should work with any input following the sample in the question, including multiple occurrences of x-y.
It uses only bash builtins.

Using ideas from both @Ansgar Wiechers and @CodeGnome:
input="1,5-7,13-18,22"
for s in ${input//,/ }
do
if [[ $s =~ - ]]
then
a+=( $(eval echo {${s//-/..}}) )
else
a+=( $s )
fi
done
oldIFS=$IFS; IFS=$','; echo "${a[*]}"; IFS=$oldIFS
Works in Bash 3

Considering all the other answers, I came up with this solution, which does not use any sub-shells (but one call to eval for brace expansion) or separate processes:
# range list is assumed to be in $1 (e.g. 1-3,5,9-13)
# convert $1 to an array of ranges ("1-3" "5" "9-13")
IFS=,
local range=($1)
unset IFS
list=() # initialize result list
local r
for r in "${range[@]}"; do
if [[ $r == *-* ]]; then
# if the range is of the form "x-y",
# * convert to a brace expression "{x..y}",
# * using eval, this gets expanded to "x" "x+1" … "y" and
# * append this to the list array
eval list+=( {${r/-/..}} )
else
# otherwise, it is a simple number and can be appended to the array
list+=($r)
fi
done
# test output
echo "${list[@]}"
