Using Interval expressions with bash extended globbing - bash

I know for a fact that bash supports extended globs with regular-expression-like operators such as @(foo|bar), *(foo) and ?(foo). This syntax is quite unique, i.e. different from that of EREs: extended globs use a prefix notation (where the operator appears before its operands), rather than postfix like EREs.
I'm wondering whether it also supports interval expressions of the form {n,m}, i.e. if there is one number in the braces, the preceding regexp is repeated n times, and if there are two numbers separated by a comma, the preceding regexp is repeated n to m times. I couldn't find any documentation suggesting that extended globs support this.
Actual Question
I came across a requirement in one of the questions today: remove only a pair of trailing zeroes from a string. I am trying to solve this with the extended glob support in bash.
Given some sample strings like
foobar0000
foobar00
foobar000
should produce
foobar00
foobar
foobar0
I tried using extended glob with parameter expansion to do
x='foobar000'
respectively. I tried using the interval expression as below which seemed obvious to me that it wouldn't work
echo ${x%%+([0]{2})}
i.e. similar using sed in ERE as sed -E 's/[0]{2}$//' or in BRE as sed 's/[0]\{2\}$//'
So my question is: is this possible using any of the extended glob operators? I'm looking for answers specific to the extended glob support in bash, and I would accept 'No' if it is not possible.

Somehow I managed to find a way to do this within the confinements of bash.
Are interval glob-expressions implemented in bash?
No! In contrast to other shells such as ksh and zsh, bash did not implement interval expressions for globbing.
Can we mimic interval expressions in bash?
Yes! However, it is not really practical and can sometimes benefit from printf. The idea is to build a glob expression that mimics the {n,m} interval using the ksh-globs @(pattern) and ?(pattern).
In the explanation below, we assume that the pattern is stored in variable p
Match n occurrences of the given pattern ({n}):
The idea is to repeat the pattern n times. For large n you can use printf
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)}
foobar010
or
$ var="foobar01010"
$ p=$(printf "@(0|1)%.0s" {1..4})
$ echo ${var%%$p}
foobar0
Match at least m occurrences of the given pattern ({m,}):
It is the same as before, but with an additional *(pattern)
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)*(0|1)}
foobar
or
$ var="foobar01010"
$ p="(0|1)"
$ q=$(printf "@$p%.0s" {1..4})
$ echo ${var%%$q*$p}
foobar
Match from n to m occurrences of the given pattern ({n,m}):
The interval expression {n,m} implies that we have n certain appearances and m-n possible appearances. These can be constructed using the ksh-globs @(pat) n times and ?(pat) m-n times. For n=2 and m=3, this leads to:
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)?(0|1)}
foobar01
or
$ p="(0|1)"
$ q=$(printf "@$p%.0s" {1..2})$(printf "?$p%.0s" {3..3})   # {1..n} and {n+1..m} with n=2, m=3
$ echo ${var%%$q}
foobar01
$ var="foobar00200"
$ echo ${var%%$q}
foobar002
$ var="foobar00020"
$ echo ${var%%$q}
foobar00020
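Brace expansion happens before variable expansion, so {1..n} cannot take n from a variable. A minimal sketch of the same construction with loops follows; the function name interval_glob is made up for illustration, and the pattern is kept in a variable so that it is only interpreted as an extended glob when the parameter expansion runs:

```shell
#!/usr/bin/env bash
shopt -s extglob

# interval_glob PAT N M (hypothetical helper): print an extended glob that
# matches between N and M consecutive occurrences of PAT.
interval_glob() {
  local pat=$1 n=$2 m=$3 out="" i
  for (( i = 0; i < n; i++ )); do out+="@($pat)"; done   # N mandatory copies
  for (( i = n; i < m; i++ )); do out+="?($pat)"; done   # M-N optional copies
  printf '%s\n' "$out"
}

var="foobar01010"
q=$(interval_glob '0|1' 2 3)
echo "${var%%$q}"    # removes the longest suffix made of two or three 0/1 characters
```

Passing the same value for N and M gives the exact-count form {n}.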
Another way to construct the interval expression {n,m} is using the ksh-glob anything but pattern written as !(pat) which allows us to say: give me all, except...
man bash:
!(pattern-list): Matches anything except one of the given patterns
This way we can write
$ echo ${var%%!(!(*$p)|@$p@$p@$p+$p|?$p)}
or
$ p="(0|1)"
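$ pn=$(printf "?$p%.0s" {1..1})   # n-1 copies of ?(pat): fewer than n=2 occurrences
$ pm=$(printf "@$p%.0s" {1..3})   # m copies of @(pat): with +(pat) appended, more than m=3
$ echo ${var%%!(!(*$p)|$pm+$p|$pn)}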
note: you need to do a double exclusion here due to the or (|) in the pattern list.
What about other shells?
KSH93
The interval expression {n,m} has been implemented in ksh93:
man ksh:
{n}(pattern-list) Matches n occurrences of the given patterns.
{m,n}(pattern-list) Matches from m to n occurrences of the given patterns. If m is omitted, 0 will be used. If n is omitted at least m occurrences will be matched.
$ echo ${var%%{2,3}(0|1)}
ZSH
Also zsh has a form of interval expression. It is a globbing flag which is part of the EXTENDED_GLOB option:
man zshall:
(#cN,M) The flag (#cN,M) can be used anywhere that the # or ## operators can be used except in the expressions (*/)# and (*/)## in filename generation, where / has special meaning; it cannot be combined with other globbing flags and a bad pattern error occurs if it is misplaced. It is equivalent to the form {N,M} in regular expressions. The previous character or group is required to match between N and M times, inclusive. The form (#cN) requires exactly N matches; (#c,M) is equivalent to specifying N as 0; (#cN,) specifies that there is no maximum limit on the number of matches.
$ echo ${var%%(0|1)(#c2,3)}

No
The "extended pattern matching features" are enabled with the extglob shell option (hence the name extended glob). They are part of an operation called pattern matching. Pattern matching is used in filename expansion, in [[...]] conditional constructs with the = or != operators, and in parameter expansion.
As you can see in pattern matching, extended glob or not, pattern matching does not support expressions like [set]{count}. We can for example match one or more occurrences with +(..) and so on, but specifying the number of occurrences of a pattern is not possible.
But this is bash and bash is powerful. We can specify the number of occurrences of a pattern simply by repeating the pattern. We cannot specify the ending or the beginning (I mean like using ^ and $ in regex), but we can use ${parameter%%word} parameter expansions to remove the trailing portion of the parameter. So this will work:
var='foobar000'
echo ${var%%[0][0]}
and, with some simple hacking, we can do this:
var='foobar000'
echo ${var%%$(yes '[0]' | head -n 2 | tr -d '\n')}
and this will remove two trailing zeros from the string.
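Applied to the three sample strings from the question, the repeated-pattern idea can be checked with a small sketch like the one below. The function name strip_zero_pair is hypothetical, and a literal 00 is used in place of [0][0], which matches the same text:

```shell
# Hypothetical helper: remove exactly one pair of trailing zeros, if present.
strip_zero_pair() {
  printf '%s\n' "${1%00}"
}

for s in foobar0000 foobar00 foobar000; do
  printf '%s -> %s\n' "$s" "$(strip_zero_pair "$s")"
done
# foobar0000 -> foobar00
# foobar00 -> foobar
# foobar000 -> foobar0
```

Since the pattern has a fixed length, % and %% give the same result here.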

Related

How to keep/remove numbers in a variable in shell?

I have a variable such as:
disk=/dev/sda1
I want to extract:
only the non numeric part (i.e. /dev/sda)
only the numeric part (i.e. 1)
I'm gonna use it in a script where I need the disk and the partition number.
How can I do that in shell (bash and zsh mostly)?
I was thinking about using Shell parameters expansions, but couldn't find working patterns in the documentation.
Basically, I tried:
echo ${disk##[:alpha:]}
and
echo ${disk##[:digit:]}
But none worked. Both returned /dev/sda1
With bash and zsh and Parameter Expansion:
disk="/dev/sda12"
echo "${disk//[0-9]/} ${disk//[^0-9]/}"
Output:
/dev/sda 12
The expansions kind-of work the other way round. With [:digit:] you will match only a single digit. You need to match everything up until, or from a digit, so you need to use *.
The following looks ok:
$ echo ${disk%%[0-9]*} ${disk##*[^0-9]}
/dev/sda 1
To use [:digit:] you need double brackets, because the character class is [:class:] and it has to sit inside [ ] itself. That's why I prefer 0-9: less typing*. The following is the same as above:
echo ${disk%%[[:digit:]]*} ${disk##*[^[:digit:]]}
* - Theoretically they may be not equal, as [0-9] can be affected by the current locale, so it may be not equal to [0123456789], but to something different.
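Note that ${disk%%[0-9]*} cuts at the first digit, so a name with digits in the middle such as /dev/dsk/c0t2d0s0 would be reduced to /dev/dsk/c. One possible way around this with plain patterns is to extract the trailing number first and then strip it from the end. This is only a sketch; the function name split_trailing_number is made up, and the variables are deliberately global so it also runs in a POSIX sh:

```shell
# Hypothetical helper: print "<base> <number>" for a name ending in digits.
split_trailing_number() {
  number=${1##*[!0-9]}      # drop everything up to the last non-digit
  base=${1%"$number"}       # remove that number from the end again
  printf '%s %s\n' "$base" "$number"
}

split_trailing_number /dev/sda12          # /dev/sda 12
split_trailing_number /dev/dsk/c0t2d0s0   # /dev/dsk/c0t2d0s 0
```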
You have to be careful when using patterns in parameter substitution. These patterns are not regular expressions but pathname expansion patterns, or glob patterns.
The idea is to remove the last number, so you want to make use of Remove matching suffix pattern (${parameter%%word}). Here we remove the longest instance of the matched pattern described by word. Representing single-digit numbers is easily done with the pattern [0-9]; multi-digit numbers, however, are harder. For this you need to use extended glob expressions:
*(pattern-list): Matches zero or more occurrences of the given patterns
So if you want to remove the last number, you use:
$ shopt -s extglob
$ disk="/dev/sda1"
$ echo "${disk#${disk%%*([0-9])}}" "${disk%%*([0-9])}"
1 /dev/sda
$ disk="/dev/dsk/c0t2d0s0"
$ echo "${disk#${disk%%*([0-9])}}" "${disk%%*([0-9])}"
0 /dev/dsk/c0t2d0s
We have to use ${disk#${disk%%*([0-9])}} to remove the prefix. It essentially finds the trailing number, removes it, and then removes the remainder as a prefix from the original string.
You can also make use of pattern substitution (${parameter/pattern/string}) with the anchors # and % to anchor the pattern to the beginning or end of the parameter (see man bash for more information). This is completely equivalent to the previous solution:
$ shopt -s extglob
$ disk="/dev/sda1"
$ echo "${disk/${disk/%*([0-9])}/}" "${disk/%*([0-9])}"
1 /dev/sda
$ disk="/dev/dsk/c0t2d0s0"
$ echo "${disk/${disk/%*([0-9])}/}" "${disk/%*([0-9])}"
0 /dev/dsk/c0t2d0s

Remove specified string pattern(s) from a string in bash

I found a good answer that explains how to remove a specified pattern from a string variable. In this case, to remove 'foo' we use the following:
string="fooSTUFF"
string="${string#foo}"
However, I would like to add the "OR" functionality that would be able to remove 'foo' OR 'boo' in the cases when my string starts with any of them, and leave the string as is, if it does not start with 'foo' or 'boo'. So, the modified script should look something like that:
string="fooSTUFF"
string="${string#(foo OR boo)}"
How could this be properly implemented?
If you have set the extglob (extended glob) shell option with
shopt -s extglob
Then you can write:
string="${string#@(foo|boo)}"
The extended patterns are documented in the bash manual; they take the form:
?(pattern-list): Matches zero or one occurrence of the given patterns.
*(pattern-list): Matches zero or more occurrences of the given patterns.
+(pattern-list): Matches one or more occurrences of the given patterns.
@(pattern-list): Matches one of the given patterns.
!(pattern-list): Matches anything except one of the given patterns.
In all cases, pattern-list is a list of patterns separated by |
You need an extended glob pattern for that (enabled with shopt -s extglob):
$ str1=fooSTUFF
$ str2=booSTUFF
$ str3=barSTUFF
$ echo "${str1#@(foo|boo)}"
STUFF
$ echo "${str2#@(foo|boo)}"
STUFF
$ echo "${str3#@(foo|boo)}"
barSTUFF
The @(pat1|pat2) matches one of the patterns separated by |.
@(pat1|pat2) is the general solution for your question (multiple patterns); in some simple cases, you can get away without extended globs:
echo "${str#[fb]oo}"
would work for your specific example, too.
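When extglob is not available, for example in a POSIX sh script, a case statement is one possible way to express the same either/or prefix removal. This is a sketch only, and the function name strip_foo_boo is made up for illustration:

```shell
# Hypothetical helper: strip a leading foo or boo without extglob.
strip_foo_boo() {
  case $1 in
    foo*) printf '%s\n' "${1#foo}" ;;
    boo*) printf '%s\n' "${1#boo}" ;;
    *)    printf '%s\n' "$1" ;;      # no known prefix: leave unchanged
  esac
}

strip_foo_boo fooSTUFF   # STUFF
strip_foo_boo barSTUFF   # barSTUFF
```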
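You can use sed (tr -d "foo|boo" would not do this, since it deletes every f, o, b and | character anywhere in the string):
string=$(printf '%s\n' "$string" | sed -E 's/^(foo|boo)//')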

bash file globbing anomaly

The bash manual (I'm using version 4.3.42 on OSX) states that the vertical bar '|' character is used as a separator for multiple file patterns in file globbing. Thus, the following should work on my system:
projectFiles=./config/**/*|./support/**/*
However, the second pattern gives a "Permission denied" on the last file that is in that directory structure so the pattern is never resolved into projectFiles. I've tried variations on this, including wrapping the patterns in parentheses,
projectFiles=(./config/**/*)|(./support/**/*)
which is laid out in the manual, but that doesn't work either.
Any suggestions on what I'm doing wrong?
You're probably referring to this part in man bash:
If the extglob shell option is enabled using the shopt builtin, several
extended pattern matching operators are recognized. In the following
description, a pattern-list is a list of one or more patterns separated
by a |. Composite patterns may be formed using one or more of the
following sub-patterns:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
The | separator works in pattern-lists as explained, but only when extglob is enabled:
shopt -s extglob
Try this:
projectFiles=*(./config/**/*|./support/**/*)
As @BroSlow pointed out in a comment:
Note that you can do this without extglob, ./{config,support}/**/*, which would just expand to the path with config and the path with support space delimited and then do pattern matching. Or ./@(config|support)/**/* with extglob. Either of which seems cleaner.
@chepner's comment is also worth mentioning:
Also, globbing isn't performed at all during a simple assignment; try foo=*, then compare echo "$foo" with echo $foo. Globbing does occur during array assignment; see foo=(*); echo "${foo[@]}"
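The difference between the two kinds of assignment can be seen in a small experiment. This is a sketch that assumes bash 4 or later (for globstar); the directory tree and file names are invented for the demonstration:

```shell
#!/usr/bin/env bash
shopt -s globstar nullglob    # globstar needs bash 4 or later

# Throwaway directory tree, made up for this demonstration.
tmp=$(mktemp -d)
mkdir -p "$tmp/config/a" "$tmp/support/b"
touch "$tmp/config/a/x.conf" "$tmp/support/b/y.txt"
cd "$tmp" || exit 1

plain=./config/**/*                  # simple assignment: the pattern stays literal
files=(./{config,support}/**/*)      # array assignment: the patterns are expanded

printf 'plain: %s\n' "$plain"
printf 'file:  %s\n' "${files[@]}"

cd - >/dev/null && rm -rf "$tmp"
```

The array ends up holding one entry per matched directory or file, while the plain variable still contains the unexpanded pattern text.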

Assign last character of string to variable in bash script [duplicate]

I found out that with ${string:0:3} one can access the first 3 characters of a string. Is there an equivalently easy method to access the last three characters?
Last three characters of string:
${string: -3}
or
${string:(-3)}
(mind the space between : and -3 in the first form).
Please refer to the Shell Parameter Expansion in the reference manual:
${parameter:offset}
${parameter:offset:length}
Expands to up to length characters of parameter starting at the character
specified by offset. If length is omitted, expands to the substring of parameter
starting at the character specified by offset. length and offset are arithmetic
expressions (see Shell Arithmetic). This is referred to as Substring Expansion.
If offset evaluates to a number less than zero, the value is used as an offset
from the end of the value of parameter. If length evaluates to a number less than
zero, and parameter is not ‘@’ and not an indexed or associative array, it is
interpreted as an offset from the end of the value of parameter rather than a
number of characters, and the expansion is the characters between the two
offsets. If parameter is ‘@’, the result is length positional parameters
beginning at offset. If parameter is an indexed array name subscripted by ‘@’ or
‘*’, the result is the length members of the array beginning with
${parameter[offset]}. A negative offset is taken relative to one greater than the
maximum index of the specified array. Substring expansion applied to an
associative array produces undefined results.
Note that a negative offset must be separated from the colon by at least one
space to avoid being confused with the ‘:-’ expansion. Substring indexing is
zero-based unless the positional parameters are used, in which case the indexing
starts at 1 by default. If offset is 0, and the positional parameters are used,
$0 is prefixed to the list.
Since this answer gets a few regular views, let me add a possibility to address John Rix's comment; as he mentions, if your string has length less than 3, ${string: -3} expands to the empty string. If, in this case, you want the expansion of string, you may use:
${string:${#string}<3?0:-3}
This uses the ?: ternary if operator, that may be used in Shell Arithmetic; since as documented, the offset is an arithmetic expression, this is valid.
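The ternary form packs the length check into the offset. The same logic written out as a small function might look like the sketch below; last_chars is a hypothetical name, and an explicit if replaces the ?: operator:

```shell
#!/usr/bin/env bash
# last_chars STRING N (hypothetical helper): print the last N characters,
# or the whole string when it is not longer than N.
last_chars() {
  local s=$1 n=$2
  if (( ${#s} <= n )); then
    printf '%s\n' "$s"
  else
    printf '%s\n' "${s:${#s}-n}"   # offset is an arithmetic expression: length minus n
  fi
}

last_chars 1234567890 3   # 890
last_chars ab 3           # ab
```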
Update for a POSIX-compliant solution
The previous part gives the best option when using Bash. If you want to target POSIX shells, here's an option (that doesn't use pipes or external tools like cut):
# New variable with 3 last characters removed
prefix=${string%???}
# The new string is obtained by removing the prefix from string
newstring=${string#"$prefix"}
One of the main things to observe here is the use of quoting for prefix inside the parameter expansion. This is mentioned in the POSIX ref (at the end of the section):
The following four varieties of parameter expansion provide for substring processing. In each case, pattern matching notation (see Pattern Matching Notation), rather than regular expression notation, shall be used to evaluate the patterns. If parameter is '#', '*', or '@', the result of the expansion is unspecified. If parameter is unset and set -u is in effect, the expansion shall fail. Enclosing the full parameter expansion string in double-quotes shall not cause the following four varieties of pattern characters to be quoted, whereas quoting characters within the braces shall have this effect. In each variety, if word is omitted, the empty pattern shall be used.
This is important if your string contains special characters. E.g. (in dash),
$ string="hello*ext"
$ prefix=${string%???}
$ # Without quotes (WRONG)
$ echo "${string#$prefix}"
*ext
$ # With quotes (CORRECT)
$ echo "${string#"$prefix"}"
ext
Of course, this is usable only when the number of characters is known in advance, as you have to hardcode the number of ? in the parameter expansion; but when that's the case, it's a good portable solution.
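If the count is only known at run time, one possible workaround is to build the run of ? characters in a loop before using it. The following is a sketch under that assumption; last_n_posix is a made-up name, and the variables are global because POSIX sh has no local:

```shell
# last_n_posix STRING N (hypothetical helper): print the last N characters
# using only POSIX parameter expansion. Prints an empty line when STRING
# is shorter than N, because the ?-pattern then matches nothing.
last_n_posix() {
  s=$1 n=$2 q=
  while [ "$n" -gt 0 ]; do
    q="$q?"                 # one ? per character to keep
    n=$((n - 1))
  done
  prefix=${s%$q}            # unquoted: the ? characters act as a pattern
  printf '%s\n' "${s#"$prefix"}"   # quoted: the prefix is taken literally
}

last_n_posix 'hello*ext' 3   # ext
```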
You can use tail:
$ foo="1234567890"
$ echo -n $foo | tail -c 3
890
A somewhat roundabout way to get the last three characters would be to say:
echo $foo | rev | cut -c1-3 | rev
Another workaround is to use grep -o with a little regex magic to get three chars followed by the end of line:
$ foo=1234567890
$ echo $foo | grep -o ...$
890
To make it optionally get the 1 to 3 last chars, in case of strings with less than 3 chars, you can use egrep with this regex:
$ echo a | egrep -o '.{1,3}$'
a
$ echo ab | egrep -o '.{1,3}$'
ab
$ echo abc | egrep -o '.{1,3}$'
abc
$ echo abcd | egrep -o '.{1,3}$'
bcd
You can also use different ranges, such as 5,10 to get the last five to ten chars.
1. Generalized Substring
To generalise the question and the answer of gniourf_gniourf (as this is what I was searching for), if you want to cut a range of characters from, say, 7th from the end to 3rd from the end, you can use this syntax:
${string: -7:4}
Where 4 is the length of course (7-3).
2. Alternative using cut
In addition, while the solution of gniourf_gniourf is obviously the best and neatest, I just wanted to add an alternative solution using cut:
echo $string | cut -c $((${#string}-2))-
Here, ${#string} is the length of the string, and the trailing "-" means cut to the end.
3. Alternative using awk
This solution instead uses the substr function of awk, which has the syntax substr(string, start, length) and runs to the end of the string if length is omitted. Starting at length($1)-2 thus picks up the last three characters.
echo $string | awk '{print substr($1,length($1)-2) }'

Match a range of file names with variable end, in a Bash script

Let's say I have a number of files named file1, file2, file3, and so on. I'm trying to find a way to match the first N files, in a Bash script, where N is a variable. Here are the options I've considered so far:
Brace expansion, i.e. file{1..3}, doesn't allow variable end. In other words, file{1..$N} doesn't work.
A range expression can be used to match numeric characters. It allows a variable end, i.e. file[1-$N], but this works only for N up to 9.
$(seq 1 $N) can be used to create a sequence of numbers, but it doesn't help since the problem is to match a sequence of numbers in a file name. Were the files named simply 1, 2, 3, and so on, this would work.
Here is another solution. I'm not advocating it, but then again there can be legitimate uses for eval ;) ...also I think not being able to use a variable in a range is an annoying/less intuitive shortcoming.
N=5
eval echo {1..$N}
So you could do
eval ls file{1..$N}
I found a solution using extended globs. They need to be enabled with the shopt -s extglob command. @(...) can be used to match any of a set of patterns separated by the | character, e.g. file@(1|2|3). Now I just need to generate the number sequence with | as the separator character instead of a newline:
shopt -s extglob
range=$(seq 1 $N)
ls file@(${range//$'\n'/|})
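The same alternation can also be built without seq. The sketch below uses a hypothetical helper named first_n_pattern and checks names with [[ ... == ... ]] instead of listing files, so it runs without any files on disk:

```shell
#!/usr/bin/env bash
shopt -s extglob

# first_n_pattern N (hypothetical helper): print file@(1|2|...|N).
first_n_pattern() {
  local n=$1 i alt=""
  for (( i = 1; i <= n; i++ )); do
    alt+="${alt:+|}$i"     # join the numbers with |
  done
  printf 'file@(%s)\n' "$alt"
}

pat=$(first_n_pattern 12)
for name in file3 file12 file13; do
  if [[ $name == $pat ]]; then
    echo "$name matches"
  else
    echo "$name does not match"
  fi
done
```

Unlike the range expression file[1-$N], this keeps working for N greater than 9, since each number is a separate alternative.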
Could you simply do,
for file01.txt, file02.txt, file345.txt, file678.txt...
cat file*.txt > file_all.txt
or am I missing the point?
