Remove specified string pattern(s) from a string in bash - bash

I found a good answer that explains how to remove a specified pattern from a string variable. In this case, to remove 'foo' we use the following:
string="fooSTUFF"
string="${string#foo}"
However, I would like to add the "OR" functionality that would be able to remove 'foo' OR 'boo' in the cases when my string starts with any of them, and leave the string as is, if it does not start with 'foo' or 'boo'. So, the modified script should look something like that:
string="fooSTUFF"
string="${string#(foo OR boo)}"
How could this be properly implemented?

If you have set the extglob (extended glob) shell option with
shopt -s extglob
Then you can write:
string="${string#@(foo|boo)}"
The extended patterns are documented in the bash manual; they take the form:
?(pattern-list): Matches zero or one occurrence of the given patterns.
*(pattern-list): Matches zero or more occurrences of the given patterns.
+(pattern-list): Matches one or more occurrences of the given patterns.
@(pattern-list): Matches one of the given patterns.
!(pattern-list): Matches anything except one of the given patterns.
In all cases, pattern-list is a list of patterns separated by |

You need an extended glob pattern for that (enabled with shopt -s extglob):
$ str1=fooSTUFF
$ str2=booSTUFF
$ str3=barSTUFF
$ echo "${str1#@(foo|boo)}"
STUFF
$ echo "${str2#@(foo|boo)}"
STUFF
$ echo "${str3#@(foo|boo)}"
barSTUFF
The @(pat1|pat2) matches one of the patterns separated by |.
@(pat1|pat2) is the general solution for your question (multiple patterns); in some simple cases, you can get away without extended globs:
echo "${str#[fb]oo}"
would work for your specific example, too.
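If you need this in more than one place, the extglob approach can be wrapped in a small function. This is only a sketch: strip_prefix is a made-up helper name, and the pattern list is kept in a variable so that the extended pattern is only interpreted when the expansion runs (with extglob already enabled).

```bash
shopt -s extglob   # must be enabled before the expansion is evaluated

# strip_prefix STRING PATTERN_LIST   (hypothetical helper, not a bash builtin)
# Removes one leading occurrence of any alternative in PATTERN_LIST, e.g. 'foo|boo'.
strip_prefix() {
  local s=$1
  local pat="@($2)"          # build the extended glob @(alt1|alt2|...)
  printf '%s\n' "${s#$pat}"  # unquoted $pat is treated as a pattern
}

strip_prefix fooSTUFF 'foo|boo'   # STUFF
strip_prefix booSTUFF 'foo|boo'   # STUFF
strip_prefix barSTUFF 'foo|boo'   # barSTUFF (no match, left unchanged)
```

Because the helper uses # rather than ##, it strips the shortest match, which for @(...) is exactly one alternative.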

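You can use sed for this. (Note that tr -d "foo|boo" does not work here: tr deletes every individual f, o, b and | character anywhere in the string, not the strings foo or boo.)
string=$(printf '%s\n' "$string" | sed -E 's/^(foo|boo)//')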

Related

How to keep/remove numbers in a variable in shell?

I have a variable such as:
disk=/dev/sda1
I want to extract:
only the non numeric part (i.e. /dev/sda)
only the numeric part (i.e. 1)
I'm gonna use it in a script where I need the disk and the partition number.
How can I do that in shell (bash and zsh mostly)?
I was thinking about using Shell parameters expansions, but couldn't find working patterns in the documentation.
Basically, I tried:
echo ${disk##[:alpha:]}
and
echo ${disk##[:digit:]}
But neither worked. Both returned /dev/sda1
With bash and zsh and Parameter Expansion:
disk="/dev/sda12"
echo "${disk//[0-9]/} ${disk//[^0-9]/}"
Output:
/dev/sda 12
The expansions kind-of work the other way round. With [:digit:] you will match only a single digit. You need to match everything up until, or from a digit, so you need to use *.
The following looks ok:
$ echo ${disk%%[0-9]*} ${disk##*[^0-9]}
/dev/sda 1
To use [:digit:] you need double brackets, because the character class is [:class:] and it has to be inside [ ] itself. That's why I prefer 0-9: less typing*. The following is the same as above:
echo ${disk%%[[:digit:]]*} ${disk##*[^[:digit:]]}
* - Theoretically they may not be equal, as [0-9] can be affected by the current locale, so it may match something other than [0123456789].
You have to be careful when using patterns in parameter substitution. These patterns are not regular expressions but pathname expansion patterns, or glob patterns.
The idea is to remove the last number, so you want to make use of Remove matching suffix pattern (${parameter%%word}). Here we remove the longest instance of the matched pattern described by word. Representing a single digit is easily done with the pattern [0-9]; however, multi-digit numbers are harder. For this you need to use extended glob expressions:
*(pattern-list): Matches zero or more occurrences of the given patterns
So if you want to remove the last number, you use:
$ shopt -s extglob
$ disk="/dev/sda1"
$ echo "${disk#${disk%%*([0-9])}}" "${disk%%*([0-9])}"
1 /dev/sda
$ disk="/dev/dsk/c0t2d0s0"
$ echo "${disk#${disk%%*([0-9])}}" "${disk%%*([0-9])}"
0 /dev/dsk/c0t2d0s
We have to use ${disk#${disk%%*([0-9])}} to remove the prefix. It essentially finds the trailing number, removes it, and then strips the remainder from the front of the original string, leaving only the number.
You can also make use of pattern substitution (${parameter/pattern/string}) with the anchors # and % to anchor the pattern to the beginning or end of the parameter (see man bash for more information). This is completely equivalent to the previous solution:
$ shopt -s extglob
$ disk="/dev/sda1"
$ echo "${disk/${disk/%*([0-9])}/}" "${disk/%*([0-9])}"
1 /dev/sda
$ disk="/dev/dsk/c0t2d0s0"
$ echo "${disk/${disk/%*([0-9])}/}" "${disk/%*([0-9])}"
0 /dev/dsk/c0t2d0s
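For a script that needs both pieces, the two expansions can be combined without extglob. The following is a minimal sketch, assuming the partition number is simply the run of digits at the very end of the name; split_disk, dev and part are illustrative names, not anything standard.

```bash
# split_disk NAME   (hypothetical helper)
# Sets the global variables dev and part from a name such as /dev/sda12.
split_disk() {
  local disk=$1
  part=${disk##*[!0-9]}   # drop everything up to the last non-digit
  dev=${disk%"$part"}     # drop that trailing number from the original
}

split_disk /dev/sda12
echo "$dev $part"         # /dev/sda 12

split_disk /dev/dsk/c0t2d0s0
echo "$dev $part"         # /dev/dsk/c0t2d0s 0
```

If the name has no trailing digits, part ends up empty and dev is the unchanged name.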

bash script on specific URL string manipulation

I need to manipulate a string (a URL) whose length I don't know.
the string is something like
https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring
I basically need a regular expression which returns this:
https://x.xx.xxx.xxx/keyword/restofstring
where the x is the current IP, which can vary every time, and I don't know the number of dontcares.
I actually have no idea how to do it, been 2 hours on the problem but didn't find a solution.
thanks!
You can use sed as follows:
sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2='
s stands for substitute and has the form s=search pattern=replacement pattern=.
The search pattern is a regex in which we grouped (...) the parts you want to extract.
The replacement pattern accesses these groups with \1 and \2.
You can feed a file or stdin to sed and it will process the input line by line.
If you have a string variable and use bash, zsh, or something similar you also can feed that variable directly into stdin using <<<.
Example usage for bash:
input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'
output="$(sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2=' <<< "$input")"
echo "$output" # prints https://x.xx.xxx.xxx/keyword/restofstring
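If you would rather not start a sed process, roughly the same regex can be tried with bash's own =~ operator. This is a sketch under the same assumptions as the sed command (an https URL containing /keyword/); the variable name re is arbitrary, and the regex is kept in a variable so that the shell does not try to interpret its special characters.

```bash
input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'
re='^(https://[^/]*).*(/keyword/.*)$'   # same two groups as in the sed command

if [[ $input =~ $re ]]; then
  # BASH_REMATCH[1] and [2] hold the two parenthesised groups
  output="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
else
  output=$input                          # no match: leave the URL unchanged
fi
echo "$output"   # prints https://x.xx.xxx.xxx/keyword/restofstring
```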
echo "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring" | sed "s/dontcare[0-9]\+\///g"
sed is used to manipulate text. dontcare[0-9]\+\///g is an escaped form of the regular expression dontcare[0-9]+/, which matches the word "dontcare" followed by 1 or more digits, followed by the / character.
sed's pattern works like this: s/find/replace/g, where g is a flag that makes sed replace every match of the pattern rather than only the first one on each line.
Note that this assumes there are no dontcareNs in the rest of the string. If that's the case, Socowi's answer works better.
You could also use read with a / value for $IFS to parse out the trash.
$: IFS=/ read proto trash url trash trash trash keyword rest <<< "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring"
$: echo "$proto//$url/$keyword/$rest"
https://x.xx.xxx.xxx/keyword/restofstring
This is more generalized when the dontcare... values aren't known and predictable strings.
This one is pure bash, though I like Socowi's answer better.
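When the number of dontcare components is not known in advance, the same read idea can be tried with an array instead of a fixed list of variable names. This is only a sketch: it assumes the marker component is literally keyword and that the host is always the third /-separated field.

```bash
url='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'
keyword='keyword'                    # assumed marker component

# Split on "/": parts[0]=https:  parts[1]=''  parts[2]=host  parts[3..]=path
IFS=/ read -r -a parts <<< "$url"

result=$url                          # fallback if the keyword never appears
for i in "${!parts[@]}"; do
  if (( i > 2 )) && [[ ${parts[i]} == "$keyword" ]]; then
    # Re-join everything from the keyword onward with "/"
    tail=$(IFS=/; printf '%s' "${parts[*]:i}")
    result="${parts[0]}//${parts[2]}/$tail"
    break
  fi
done
echo "$result"   # https://x.xx.xxx.xxx/keyword/restofstring
```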
Here's a sed variation which picks out the host part and the last two components from the path.
url='http://example.com:1234/ick/poo/bar/quux/fnord'
newurl=$(echo "$url" | sed 's%\(https*://[^/?]*[^?/]\)[^ <>'"'"'"]*\(/[^/ <>'"'"'"]*/[^/ <>'"'"'"]*\)%\1\2%')
The general form is sed 's%pattern%replacement%' where the pattern matches through the end of the host name part (captured into one set of backslashed parentheses), then skips up to the penultimate slash, then captures the remainder of the URL from that slash onward (the last two path components); and the replacement simply recalls the two captured groups without the skipped part between them. For the example above this gives http://example.com:1234/quux/fnord.

Using Interval expressions with bash extended globbing

I know for a fact that bash supports extended globs with regular-expression-like support for @(foo|bar), *(foo) and ?(foo). This syntax is quite unique, i.e. different from that of EREs: extended globs use a prefix notation (where the operator appears before its operands), rather than postfix like EREs.
I'm wondering whether it supports the interval expressions feature of type {n,m}, i.e. if there is one number in the braces, the preceding regexp is repeated n times, or if there are two numbers separated by a comma, the preceding regexp is repeated n to m times. I couldn't find any documentation that suggests this is supported in extended globs.
Actual Question
I came across a requirement in one of the questions today, to remove only a pair of trailing zeroes in a string. Trying to solve this with the extended glob support in bash
Given some sample strings like
foobar0000
foobar00
foobar000
should produce, respectively,
foobar00
foobar
foobar0
I tried using extended glob with parameter expansion on
x='foobar000'
with an interval expression as below, although it seemed obvious to me that it wouldn't work:
echo ${x%%+([0]{2})}
i.e. similar to using sed with an ERE as sed -E 's/[0]{2}$//' or with a BRE as sed 's/[0]\{2\}$//'
So my question is: is this possible using any of the extended glob operators? I'm looking for answers specific to the extended glob support in bash, and would accept 'No' if it is not possible.
Somehow I managed to find a way to do this within the confinements of bash.
Are interval glob-expressions implemented in bash?
No! In contrast to other shells such as ksh and zsh, bash did not implement interval expressions for globbing.
Can we mimic interval expressions in bash?
Yes! However, it is not really practical and can sometimes benefit from using printf. The idea is to build the glob expression that mimics the {n,m} interval using the KSH-globs @(pattern) and ?(pattern).
In the explanation below, we assume that the pattern is stored in variable p
Match n occurrences of the given pattern ({n}):
The idea is to repeat the pattern n times. For large n you can use printf
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)}
foobar010
or
$ var="foobar01010"
$ p=$(printf "@(0|1)%.0s" {1..4})
$ echo ${var%%$p}
foobar0
Match at least m occurrences of the given pattern ({m,}):
It is the same as before, but with an additional *(pattern)
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)*(0|1)}
foobar
or
$ var="foobar01010"
$ p="(0|1)"
$ q=$(printf "@$p%.0s" {1..4})
$ echo ${var%%$q*$p}
foobar
Match from n to m occurrences of the given pattern ({n,m}):
The interval expression {n,m} implies we have for sure n appearances and m-n possible appearances. These can be constructed using the ksh-globs @(pat) n times and ?(pat) m-n times. For n=2 and m=3, this leads to:
$ var="foobar01010"
$ echo ${var%%@(0|1)@(0|1)?(0|1)}
foobar01
or
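$ p="(0|1)"
$ # n=2, m=3; brace expansion needs literal numbers: {1..n} and {n+1..m}
$ q=$(printf "@$p%.0s" {1..2})$(printf "?$p%.0s" {3..3})
$ echo ${var%%$q}
foobar01
$ var="foobar00200"; echo ${var%%$q}
foobar002
$ var="foobar00020"; echo ${var%%$q}
foobar00020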
Another way to construct the interval expression {n,m} is using the ksh-glob anything but pattern written as !(pat) which allows us to say: give me all, except...
man bash:
!(pattern-list): Matches anything except one of the given patterns
This way we can write
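$ echo ${var%%!(!(*$p)|@$p@$p@$p+$p|?$p)}
or
$ p="(0|1)"
$ pn=$(printf "@$p%.0s" {1..3})   # m repetitions of @(0|1), here m=3; followed by +(0|1) this excludes more than m
$ pm=$(printf "?$p%.0s" {1..1})   # n-1 repetitions of ?(0|1), here n=2; this excludes fewer than n
$ echo ${var%%!(!(*$p)|$pn+$p|$pm)}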
note: you need to do a double exclusion here due to the or (|) in the pattern list.
What about other shells?
KSH93
The interval expression {n,m} has been implemented in ksh93:
man ksh:
{n}(pattern-list) Matches n occurrences of the given patterns.
{m,n}(pattern-list) Matches from m to n occurrences of the given patterns. If m is omitted, 0 will be used. If n is omitted at least m occurrences will be matched.
$ echo ${var%%{2,3}(0|1)}
ZSH
Also zsh has a form of interval expression. It is a globbing flag which is part of the EXTENDED_GLOB option:
man zshall:
(#cN,M) The flag (#cN,M) can be used anywhere that the # or ## operators can be used except in the expressions (*/)# and (*/)## in filename generation, where / has special meaning; it cannot be combined with other globbing flags and a bad pattern error occurs if it is misplaced. It is equivalent to the form {N,M} in regular expressions. The previous character or group is required to match between N and M times, inclusive. The form (#cN) requires exactly N matches; (#c,M) is equivalent to specifying N as 0; (#cN,) specifies that there is no maximum limit on the number of matches.
$ echo ${var%%(0|1)(#c2,3)}
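As an aside, and outside of globbing proper: bash's [[ ... =~ ... ]] operator uses POSIX extended regular expressions, which do understand {n,m}. So if a non-glob solution within bash is acceptable, something like the following sketch may be enough; the names re and trimmed are arbitrary.

```bash
x='foobar000'
re='^(.*)0{2}$'        # anything, followed by exactly one pair of trailing zeros

if [[ $x =~ $re ]]; then
  trimmed=${BASH_REMATCH[1]}   # the part before the final "00"
else
  trimmed=$x                   # fewer than two trailing zeros: unchanged
fi
echo "$trimmed"        # foobar0
```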
No
The "extended pattern matching features" are enabled using extglob (thus we call that extended glob). They are used in an operation called pattern matching. Pattern matching is used in filename expansion, in [[...]] conditional constructs when using the = or != operators, and in parameter expansion.
As you can see in pattern matching, extended glob or not, pattern matching does not support expressions like [set]{count}. We can for example match one or more occurrences with +(..) and so on, but specifying the number of occurrences of a pattern is not possible.
But this is bash and bash is powerful. We can specify the number of occurrences of a pattern simply by repeating the pattern. We cannot specify the ending or the beginning (I mean like using ^ and $ in regex), but we can use ${parameter%%word} parameter expansions to remove the trailing portion of the parameter. So this will work:
var='foobar000'
echo ${var%%[0][0]}
and, with some simple hacking, we can do this:
var='foobar000'
echo ${var%%$(yes '[0]' | head -n 2 | tr -d '\n')}
and this will remove two trailing zeros from the string.
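The repetition trick can be wrapped in a small helper that builds the repeated pattern with a loop instead of yes/head or brace expansion (brace expansion cannot take variables as bounds). This is only a sketch: interval_glob is a made-up name, and it relies on extglob being enabled when the resulting pattern is used.

```bash
shopt -s extglob

# interval_glob ATOM MIN MAX   (hypothetical helper)
# Prints an extended glob matching ATOM repeated MIN to MAX times:
# MIN copies of @(ATOM) followed by MAX-MIN copies of ?(ATOM).
interval_glob() {
  local atom=$1 min=$2 max=$3 out='' i
  for ((i = 0; i < min; i++)); do out+="@($atom)"; done
  for ((i = min; i < max; i++)); do out+="?($atom)"; done
  printf '%s' "$out"
}

x='foobar000'
pair=$(interval_glob 0 2 2)      # @(0)@(0)
echo "${x%$pair}"                # foobar0

var='foobar01010'
pat=$(interval_glob '0|1' 2 3)   # @(0|1)@(0|1)?(0|1)
echo "${var%%$pat}"              # foobar01
```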

bash file globbing anomaly

The bash manual (I'm using version 4.3.42 on OSX) states that the vertical bar '|' character is used as a separator for multiple file patterns in file globbing. Thus, the following should work on my system:
projectFiles=./config/**/*|./support/**/*
However, the second pattern gives a "Permission denied" on the last file that is in that directory structure so the pattern is never resolved into projectFiles. I've tried variations on this, including wrapping the patterns in parentheses,
projectFiles=(./config/**/*)|(./support/**/*)
which is laid out in the manual, but that doesn't work either.
Any suggestions on what I'm doing wrong?
You're probably referring to this part in man bash:
If the extglob shell option is enabled using the shopt builtin, several
extended pattern matching operators are recognized. In the following
description, a pattern-list is a list of one or more patterns separated
by a |. Composite patterns may be formed using one or more of the
following sub-patterns:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
The | separator works in pattern-lists as explained, but only when extglob is enabled:
shopt -s extglob
Try this:
projectFiles=*(./config/**/*|./support/**/*)
As @BroSlow pointed out in a comment:
Note that you can do this without extglob, ./{config,support}/**/*, which would just expand to the path with config and the path with support space delimited and then do pattern matching. Or ./@(config|support)/**/* with extglob. Either of which seems cleaner.
@chepner's comment is also worth mentioning:
Also, globbing isn't performed at all during a simple assignment; try foo=*, then compare echo "$foo" with echo $foo. Globbing does occur during array assignment; see foo=(*); echo "${foo[@]}"
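Putting the two comments together, collecting the files into an array might look like the sketch below. It assumes bash 4 or later (for globstar) and builds a throwaway directory tree with made-up file names purely for demonstration.

```bash
shopt -s globstar nullglob   # ** recurses; unmatched patterns expand to nothing

# Throwaway demo tree (all names here are invented for the example)
demo=$(mktemp -d)
mkdir -p "$demo/config/a" "$demo/support/b" "$demo/other"
touch "$demo/config/a/one.conf" "$demo/support/b/two.txt" "$demo/other/skip.txt"
cd "$demo" || exit 1

# Globbing happens in array assignment, so each match becomes one element
projectFiles=( ./config/**/* ./support/**/* )
printf '%s\n' "${projectFiles[@]}"
```

With this layout the array holds ./config/a, ./config/a/one.conf, ./support/b and ./support/b/two.txt; nothing under ./other is included.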

Glob pattern to match non-hidden filenames that don't start with a particular string

I'm pretty inexperienced with globbing in general. How would one go about writing a glob pattern that matches filenames that do not start with, say, "ab" but still have a length of at least 2, i.e. "start with some 2-letter string other than "ab""? This is a homework question, only basic bash globs are allowed, and the answer must work with "echo <glob here>".
Note: the question verbatim is
(Non-hidden) files in the current directory whose names contain at least two characters, but do not start with ab.
printed on paper. I'm pretty sure I didn't misunderstand anything. The requirements are
For each of the following file search criteria, provide a globbing pattern that matches
the criterion. Your answer in each case should be a text file with the following format:
echo <pattern>
My current attempt is echo {a[!b]*,[!a.]?*} but somehow it gets no points with the automatic grader which actually runs your file against a test case automatically without human intervention.
For a single letter, this would do:
$ echo [!a]?*
However, for 2 letters (and assuming files can also start with numbers or punctuation or all kinds of other things), I can only think of this without resorting to shopt:
$ GLOBIGNORE=ab*
$ echo *
Well, now, technically, this would work:
$ echo [!a]?* [a][!b]*
BUT this would leave a nasty literal [a][!b]* in our results if there are no files starting with an a followed by at least one more character, which would not only be undesirable, but even considered a bug in any application, so on those grounds I would not consider it a valid answer. To omit that [a][!b]*, we have to resort to nullglob (and if extglob isn't allowed, nullglob probably isn't either):
$ shopt -s nullglob
$ echo [!a]?* [a][!b]*
Fwiw, extglob would be:
$ shopt -s extglob
$ echo !(ab*)
That previous answer would match files with fewer than 2 characters, so, as @perreal says:
$ echo @([^a]?|?[^b])*
Starting with "a" or "b", OR with "ab"? For the latter:
ab*
Needless to say, you have to specify a path that resolves (relative or absolute):
/path/to/ab*
To your updated question:
{b,c,d,e,f...}{a,c,d,e,f...}*
Should work, note that ... is not actually valid, but I won't write the whole alphabet here. :P
shopt -s extglob # turn on extended globbing
echo @([^a]?|?[^b])*
My original echo {a[!b]*,[!a.]?*} is correct and works very well. The teacher actually set up the test cases wrong, so everybody got marked incorrectly and got a re-mark just now.
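For anyone who wants to check that pattern themselves, here is a quick sketch that tries it against a handful of made-up file names in a temporary directory. nullglob is enabled only so that an unmatched alternative disappears instead of being kept literally.

```bash
demo=$(mktemp -d)
cd "$demo" || exit 1
touch ab abc ac a b xy zzz .hidden   # invented test files

shopt -s nullglob
# Brace expansion turns this into the two globs  a[!b]*  and  [!a.]?*
matches=( {a[!b]*,[!a.]?*} )
echo "${matches[@]}"                 # ac xy zzz
```

ab and abc are rejected because they start with ab, a and b because they are only one character long, and .hidden because it is a hidden file.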
