Bash - Extract numbers from String - bash

I have a string which looks like this:
"abcderwer 123123 10,200 asdfasdf iopjjop"
Now I want to extract numbers following the scheme xx,xxx, where x is a digit from 0 to 9, e.g. 10,200. The number has to have five digits and has to contain a ",".
How can I do that?
Thank you

You can use grep:
$ echo "abcderwer 123123 10,200 asdfasdf iopjjop" | grep -Eo '[0-9]{2},[0-9]{3}'
10,200

In pure Bash:
string="abcderwer 123123 10,200 asdfasdf iopjjop"
pattern='([[:digit:]]{2},[[:digit:]]{3})'
[[ $string =~ $pattern ]] && echo "${BASH_REMATCH[1]}"
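The =~ operator reports only the first match. As a minimal sketch (the function name extract_numbers is hypothetical, and rejecting candidates that touch other digits is an assumption about what is wanted), a loop can collect every match by letting the regex capture the rest of the string:

```bash
#!/usr/bin/env bash

# extract_numbers: hypothetical helper that prints every xx,xxx token in $1.
extract_numbers() {
  local rest=$1
  # Group 2 is the number; group 3 is the not-yet-scanned remainder.
  # The surrounding groups reject candidates that touch other digits.
  local pattern='(^|[^0-9])([0-9]{2},[0-9]{3})($|[^0-9].*$)'
  while [[ $rest =~ $pattern ]]; do
    printf '%s\n' "${BASH_REMATCH[2]}"
    rest=${BASH_REMATCH[3]}   # continue scanning after this match
  done
}

extract_numbers "abcderwer 123123 10,200 asdfasdf 55,123 iopjjop"
```

Each iteration shortens the string, so the loop terminates once no further match is found.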

Simple pattern matching (glob patterns) is built into the shell. Assuming you have the strings in "$@" (that is, they are command-line arguments to your script, or you have used set on a string you have obtained otherwise), try this:
for token; do
    case $token in
        [0-9][0-9],[0-9][0-9][0-9] ) echo "$token" ;;
    esac
done
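If the text is in a variable rather than in the positional parameters, the set approach mentioned above can be wrapped in a function. This is only a sketch: the name match_tokens is made up, tokens are assumed to be whitespace-separated, and IFS is assumed to have its default value.

```bash
#!/usr/bin/env bash

# match_tokens: hypothetical helper that prints each token of $1
# that looks exactly like xx,xxx.
match_tokens() {
  local string=$1 token
  set -f            # disable globbing so the split cannot expand wildcards
  set -- $string    # split on whitespace into this function's $1, $2, ...
  set +f            # re-enable globbing (assumes it was enabled before)
  for token; do
    case $token in
      [0-9][0-9],[0-9][0-9][0-9]) printf '%s\n' "$token" ;;
    esac
  done
}

match_tokens "abcderwer 123123 10,200 asdfasdf iopjjop"
```

Because a case pattern must match the whole token, something like 1234,567 is rejected without any extra anchoring.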

Check out pattern matching and regular expressions.
Links:
Bash regular expressions
Patterns and pattern matching
SO question
As mentioned above, one way to use pattern matching is with grep.
Other uses: the shell expands glob patterns on the command line (for example, in the arguments to echo), and find supports regular expressions.

A slightly non-typical solution:
tr -cd '0-9, \n' < input | tr ' ' '\n' | grep '^[0-9][0-9],[0-9][0-9][0-9]$'
(The first tr removes everything except digits, commas, spaces, and newlines. The
second tr replaces spaces with newlines, putting each "number" on a separate
line, and the grep discards everything except the lines that match your criterion.)

The following example, using your input string, solves the problem with sed.
$ echo "abcderwer 123123 10,200 asdfasdf iopjjop" | sed -ne 's/^.*\([0-9]\{2\},[0-9]\{3\}\).*$/\1/p'
10,200

Related

Escape "./" when using sed

I wanted to use grep to exclude words from $lastblock by using a pipeline, but I found that grep works only for files, not for stdout output.
So, here is what I'm using:
lastblock="./2.json"
echo $lastblock | sed '1,/firstmatch/d;/.json/,$d'
I want to exclude ./ and .json, keeping only what is between.
This sed command is correct for this purpose, but how to escape the ./ replacing firstmatch so it can work?
Thanks in advance!
Use bash's Parameter Substitution
lastblock="./2.json"
name="${lastblock##*/}" # strips from the beginning until last / -> 2.json
base="${name%.*}" # strips from the last . to the end -> 2
"but I found that grep works only for files, not for stdout output."
Here it is (if your grep supports the -P flag):
lastblock="./2.json"
echo "$lastblock" | grep -Po '(?<=\./).*(?=\.)'
"but how to escape the ./"
With sed(1), escape it using a backslash (\):
lastblock="./2.json"
echo "$lastblock" | sed 's/^\.\///;s/\..*$//'
Or use a different delimiter, such as a pipe (|):
sed 's|^\./||;s|\..*$||'
With awk:
lastblock="./2.json"
echo "$lastblock" | awk -F'[./]+' '{print $2}'
Starting from Bash version 3, regular expression pattern matching is supported using the =~ operator inside the [[ ... ]] keyword.
lastblock="./2.json"
regex='^\./([[:digit:]]+)\.json'
[[ $lastblock =~ $regex ]] && echo "${BASH_REMATCH[1]}"
Although parameter expansion should suffice for this purpose.
I wanted to use grep to exclude words from $lastblock by using a pipeline, but I found that grep works only for files, not for stdout output.
Nonsense. grep works the same for the same input, regardless of whether it is from a file or from the standard input.
So, here is what I'm using:
lastblock="./2.json"
echo $lastblock | sed '1,/firstmatch/d;/.json/,$d'
I want to exclude ./ and .json, keeping only what is between. This sed
command is correct for this purpose,
That sed command is nowhere near correct for the stated purpose. It has this effect:
delete every line from the very first one up to and including the next subsequent one that matches the regular expression /firstmatch/, AND
delete every line from the first one matching the regular expression /.json/ to the last one of the file (and note that . is a regex metacharacter).
To remove part of a line instead of deleting a whole line, use an s/// command instead of a d command. As for escaping, you can escape a character to sed by preceding it with a backslash (\), which itself must be quoted or escaped to protect it from interpretation by the shell. Additionally, most regex metacharacters lose their special significance when they appear inside a character class, which I find to be a more legible way to include them in a pattern as literals. For example:
lastblock="./2.json"
echo "$lastblock" | sed 's/^[.]\///; s/[.]json$//'
That says to remove the literal characters ./ appearing at the beginning of the (any) line, and, separately, to remove the literal characters .json appearing at the end of the line.
Alternatively, if you want to modify only those lines that both start with ./ and end with .json then you can use a single s command with a capturing group and a backreference:
lastblock="./2.json"
echo "$lastblock" | sed 's/^[.]\/\(.*\)[.]json$/\1/'
That says that on lines that start with ./ and end with .json, capture everything between those two and replace the whole line with the captured part alone.
You can use another character, such as '#', as the delimiter when you want to avoid escaping slashes.
You can capture the part that matches and use it in the replacement.
Use [.] to keep the dot from matching any character.
echo "$lastblock" | sed -r 's#[.]/(.*)[.]json#\1#'
Solution!
Just discovered the tr command today thanks to this legendary, unrelated answer.
When searching all over Google for how to exclude "." and "/", none of the StackOverflow answers helped.
So, to delete characters from the output of a command, just append this pipe:
| tr -d "{characters-you-want-to-delete}"
So, a full working and simple sample:
echo "./2.json" | tr -d "/" | tr -d "." | tr -d "json"
Note that tr -d deletes every occurrence of each listed character, not the literal string "json", so this works here only because the part to keep ("2") contains none of those characters.
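A quick way to see what tr -d really does is to compare it with parameter expansion on a second input. In this sketch the function names and the ./jason2.json input are made up for illustration:

```bash
#!/usr/bin/env bash

# tr -d deletes every character in the set, wherever it occurs.
strip_with_tr() {
  printf '%s\n' "$1" | tr -d '/' | tr -d '.' | tr -d 'json'
}

# Parameter expansion removes only the literal "./" prefix and ".json" suffix.
strip_with_expansion() {
  local name=${1#./}
  printf '%s\n' "${name%.json}"
}

strip_with_tr "./2.json"               # prints: 2
strip_with_expansion "./2.json"        # prints: 2
strip_with_tr "./jason2.json"          # prints: a2
strip_with_expansion "./jason2.json"   # prints: jason2
```

The two approaches agree only when the part to keep contains none of the deleted characters.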

How to overcome greedy match everything when looking for a particular string later?

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.*([0-9]+) guys.*/\1/g'
The above command currently outputs just 5. Essentially I'd like to parse the number of "guys" in a random sentence that could have numbers (or not.. I'd also like to parse just echo "365 guys") preceding the number of guys. My .* is matching the 36 and preventing it from appearing in the \1. How can I write a sed command (or any other regex/perl/awk) to accomplish what I want?
Use the "frugal" quantifier *? in Perl:
perl -pe 's/.*?([0-9]+) guys.*/$1/'
With GNU grep:
$ grep -Po '\b[0-9]+(?= guys\b)' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
-P activates support for PCREs, which enables advanced regex features.
-o specifies that only the matching parts of input lines should be printed.
\b matches only on a word boundary, including at the start of a line;
this prevents matching numbers that aren't stand-alone numbers but part of other words, such as in foo365 guys, and words that start with guys, such as guysanddolls.
(?= guys) is a look-ahead assertion that matches the enclosed subexpression without including it in the matched string returned.
As demonstrated, this may match multiple patterns on a given line, with each number extracted printed on its own output line.
If that is undesired, grep cannot be used, because -o invariably returns all of a line's matches; see the perl command below for a solution.
Inspired by Sobrique's comment on choroba's answer, here is the perl equivalent of the above grep command:
$ perl -lne 'print for m/\b(\d+) guys\b/g' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
Simply omit the g to only match at most 1 number per line.
Since your number is preceded by a blank, you can make it a part of the regex:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.* ([0-9]+) guys.*/\1/g'
# => 365
In Bash:
$ s="A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
$ [[ $s =~ ([0-9]+)\ +guys.*$ ]] && echo ${BASH_REMATCH[1]}
365
Or, with awk:
$ echo "$s" | awk '/guys/{for (i=1;i<=NF;i++) if ($i=="guys" && $(i-1)+0==$(i-1)) print $(i-1)}'
365
With standard sed regexes, you can turn the greedy match to your advantage if you reverse both the string and the pattern:
echo ... | rev | sed -E 's/.*syug ([0-9]+).*/\1/g' | rev
obviously this is a hack, but desperate times...
@Andrew Cassidy: try:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" |
awk '/guys/{print VAL;exit} {VAL=$0}' RS=" "
This might work for you (GNU sed):
sed -r 's/.*\b([0-9]+) guys.*/\1/' file
or perhaps:
sed -r 's/.*\<([0-9]+) guys.*/\1/' file
Make the numeric part of the pattern match a word boundary.
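The \b and \< escapes are not available in every sed. One possible workaround, sketched here under the assumption that the count is either at the start of the line or preceded by a non-digit, is to force the greedy prefix to end on a non-digit. The helper name guys_count is hypothetical.

```bash
#!/usr/bin/env bash

# guys_count: hypothetical helper; prints the number directly before " guys".
# (.*[^0-9]|^) makes the greedy prefix stop at a non-digit (or be empty at
# the start of the line), so it cannot swallow the leading digits.
guys_count() {
  printf '%s\n' "$1" | sed -E 's/(.*[^0-9]|^)([0-9]+) guys.*/\2/'
}

guys_count "A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
guys_count "365 guys"
```

Unlike the word-boundary versions, this also accepts a number glued to a preceding word, such as foo12 guys.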

String substitute in Shell script

I am supposed to strip a substring in my shell script. I am trying as follows:
fileName="Test_VSS_TT.csv.old"
Here I want to remove the string ".csv.old", so I use
test=${fileName%.*}
but I get a "bad substitution" error.
You are looking for test=${fileName%%.*}
See the documentation for parameter expansion in bash and in zsh.
%.* removes the shortest suffix matching .* (here, .old), whereas %%.* removes the longest one (here, .csv.old).
[edit]
If sed is available, you could try something like this: echo "filename.txt.bin" | sed "s/\..*//g", which yields filename.
Here you go,
$ echo $f
Test_VSS_TT.csv.old
$ test=${f%%.*}
$ echo $test
Test_VSS_TT
%% does a longest match, so it matches from the first dot up to the end of the string and then removes the matched characters.
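To see how the shortest-match and longest-match forms differ, here is a small sketch using the file name from the question (the variable names are only illustrative):

```bash
#!/usr/bin/env bash

fileName="Test_VSS_TT.csv.old"

shortest_suffix=${fileName%.*}    # removes ".old"
longest_suffix=${fileName%%.*}    # removes ".csv.old"
shortest_prefix=${fileName#*.}    # removes "Test_VSS_TT."
longest_prefix=${fileName##*.}    # removes "Test_VSS_TT.csv."

printf '%s\n' "$shortest_suffix" "$longest_suffix" "$shortest_prefix" "$longest_prefix"
```

This prints Test_VSS_TT.csv, Test_VSS_TT, csv.old and old, one per line.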
If your intention is to extract file name without extension, then how about this?
$ echo ${fileName}
Test_VSS_TT.csv.old
$ test=$(echo "${fileName}" | cut -d '.' -f1)
$ echo $test
Test_VSS_TT
echo "Test_VSS_TT.csv.old"| awk -F"." '{print $1}'

Remove a fixed prefix/suffix from a string in Bash

I want to remove the prefix/suffix from a string. For example, given:
string="hello-world"
prefix="hell"
suffix="ld"
How do I get the following result?
"o-wor"
$ prefix="hell"
$ suffix="ld"
$ string="hello-world"
$ foo=${string#"$prefix"}
$ foo=${foo%"$suffix"}
$ echo "${foo}"
o-wor
This is documented in the Shell Parameter Expansion section of the manual:
${parameter#word}
${parameter##word}
The word is expanded to produce a pattern and matched according to the rules described below (see Pattern Matching). If the pattern matches the beginning of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the # case) or the longest matching pattern (the ## case) deleted. […]
${parameter%word}
${parameter%%word}
The word is expanded to produce a pattern and matched according to the rules described below (see Pattern Matching). If the pattern matches a trailing portion of the expanded value of parameter, then the result of the expansion is the value of parameter with the shortest matching pattern (the % case) or the longest matching pattern (the %% case) deleted. […]
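Because the word is treated as a pattern, quoting the variable matters whenever the prefix or suffix may contain glob characters. A minimal sketch, using a made-up prefix of '*l' to show the difference:

```bash
#!/usr/bin/env bash

string="hello-world"
prefix='*l'   # hypothetical prefix that contains a glob character

unquoted=${string#$prefix}    # pattern *l: the shortest match "hel" is removed
quoted=${string#"$prefix"}    # literal "*l": no match, nothing is removed

printf '%s\n' "$unquoted" "$quoted"
```

This prints lo-world followed by hello-world, which is why the answer above writes ${string#"$prefix"} with the inner quotes.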
Using sed:
$ echo "$string" | sed -e "s/^$prefix//" -e "s/$suffix$//"
o-wor
Within the sed command, the ^ character matches text beginning with $prefix, and the trailing $ matches text ending with $suffix.
Adrian Frühwirth makes some good points in the comments below, but sed for this purpose can be very useful. The fact that the contents of $prefix and $suffix are interpreted by sed can be either good OR bad- as long as you pay attention, you should be fine. The beauty is, you can do something like this:
$ prefix='.*ll'
$ suffix='l.'
$ echo "$string" | sed -e "s/^$prefix//" -e "s/$suffix$//"
o-wor
which may be what you want, and is both fancier and more powerful than bash variable substitution. If you remember that with great power comes great responsibility (as Spiderman says), you should be fine.
A quick introduction to sed can be found at http://evc-cit.info/cit052/sed_tutorial.html
A note regarding the shell and its use of strings:
For the particular example given, the following would work as well:
$ echo $string | sed -e s/^$prefix// -e s/$suffix$//
...but only because:
echo doesn't care how many strings are in its argument list, and
There are no spaces in $prefix and $suffix
It's generally good practice to quote a string on the command line because even if it contains spaces it will be presented to the command as a single argument. We quote $prefix and $suffix for the same reason: each edit command to sed will be passed as one string. We use double quotes because they allow for variable interpolation; had we used single quotes the sed command would have gotten a literal $prefix and $suffix which is certainly not what we wanted.
Notice, too, my use of single quotes when setting the variables prefix and suffix. We certainly don't want anything in the strings to be interpreted, so we single quote them so no interpolation takes place. Again, it may not be necessary in this example but it's a very good habit to get into.
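When the prefix and suffix should be treated as plain text, for example paths containing slashes or dots, one option is to escape them before handing them to sed. The following is a sketch, not a definitive implementation: escape_bre and strip_literal are hypothetical helper names, and the escaping assumes sed's default basic regular expressions.

```bash
#!/usr/bin/env bash

# escape_bre: hypothetical helper that backslash-escapes the characters
# that are special in a basic regular expression, plus the "/" delimiter.
escape_bre() {
  printf '%s\n' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

# strip_literal STRING PREFIX SUFFIX: remove PREFIX and SUFFIX as literal text.
strip_literal() {
  local p s
  p=$(escape_bre "$2")
  s=$(escape_bre "$3")
  printf '%s\n' "$1" | sed -e "s/^$p//" -e "s/$s\$//"
}

strip_literal "/usr/local/bin/tool.sh" "/usr/local/" ".sh"   # prints: bin/tool
```

The escaping step uses | as its own delimiter so that the slash can appear unescaped inside the bracket expression.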
$ string="hello-world"
$ prefix="hell"
$ suffix="ld"
$ #remove "hell" from "hello-world" if "hell" is found at the beginning.
$ prefix_removed_string=${string/#$prefix}
$ #remove "ld" from "o-world" if "ld" is found at the end.
$ suffix_removed_String=${prefix_removed_string/%$suffix}
$ echo $suffix_removed_String
o-wor
Notes:
#$prefix : the leading # makes sure that the substring "hell" is removed only if it is found at the beginning.
%$suffix : the leading % makes sure that the substring "ld" is removed only if it is found at the end.
Without these, the first occurrence of "hell" or "ld" gets removed wherever it is found, even in the middle.
I use grep for removing prefixes from paths (which aren't handled well by sed):
echo "$input" | grep -oP "^$prefix\K.*"
\K removes from the match all the characters before it.
Do you know the length of your prefix and suffix? In your case:
result=$(echo $string | cut -c5- | rev | cut -c3- | rev)
Or more general:
result=$(echo $string | cut -c$((${#prefix}+1))- | rev | cut -c$((${#suffix}+1))- | rev)
But the solution from Adrian Frühwirth is way cool! I didn't know about that!
Small and universal solution:
expr "$string" : "$prefix\(.*\)$suffix"
Using the =~ operator:
$ string="hello-world"
$ prefix="hell"
$ suffix="ld"
$ [[ "$string" =~ ^$prefix(.*)$suffix$ ]] && echo "${BASH_REMATCH[1]}"
o-wor
NOTE: Not sure if this was possible back in 2013 but it's certainly possible today (10 Oct 2021) so adding another option ...
Since we're dealing with known fixed length strings (prefix and suffix) we can use a bash substring to obtain the desired result with a single operation.
Inputs:
string="hello-world"
prefix="hell"
suffix="ld"
Plan:
bash substring syntax: ${string:<start>:<length>}
skipping over prefix="hell" means our <start> will be 4
<length> will be total length of string (${#string}) minus the lengths of our fixed length strings (4 for hell / 2 for ld)
This gives us:
$ echo "${string:4:(${#string}-4-2)}"
o-wor
NOTE: the parens can be removed and still obtain the same result
If the values of prefix and suffix are unknown, or could vary, we can still use this same operation but replace 4 and 2 with ${#prefix} and ${#suffix}, respectively:
$ echo "${string:${#prefix}:${#string}-${#prefix}-${#suffix}}"
o-wor
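The substring expansion cuts by length only, so it removes characters even when the string does not actually start with the prefix or end with the suffix. A hedged sketch of a guard around it (strip_fixed is a hypothetical name):

```bash
#!/usr/bin/env bash

# strip_fixed STRING PREFIX SUFFIX: hypothetical wrapper around the
# substring expansion that first checks both affixes are really present.
strip_fixed() {
  local string=$1 prefix=$2 suffix=$3
  if [[ $string == "$prefix"*"$suffix" ]]; then
    printf '%s\n' "${string:${#prefix}:${#string}-${#prefix}-${#suffix}}"
  else
    printf '%s\n' "$string"   # leave the string untouched
  fi
}

strip_fixed "hello-world" "hell" "ld"   # prints: o-wor
strip_fixed "goodbye" "hell" "ld"       # prints: goodbye
```

Quoting "$prefix" and "$suffix" inside [[ ... == ... ]] makes them match literally, while the unquoted * in between matches anything.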
Using @Adrian Frühwirth's answer:
function strip {
    local STRING=${1#"$2"}
    echo "${STRING%"$2"}"
}
Use it like this:
HELLO=":hello:"
HELLO=$(strip "$HELLO" ":")
echo $HELLO # hello

How to ensure I have exactly 2 spaces before string and zero spaces after

I get a string that can have from zero to multiple leading and trailing spaces.
I'm trying to get rid of them without lot of hackery but my code looks huge.
How to do this in a clean way?
as easy as:
$ src="   some text   "
$ dst="  $(echo $src)"
$ echo ":$dst:"
:  some text:
$(echo $src) gets rid of all surrounding spaces (and, as a side effect of word splitting, collapses runs of spaces inside the text).
Then you simply add as many spaces as you need before it.
How are you outputting the string? If it's an echo, you can just put
Echo "<2 spaces>". "string";
If it's a normal string, you just put 2 spaces between the first quote and the string.
"<2spaces> string here"
One way using GNU sed:
sed 's/^[ \t]*/  /; s/[ \t]*$//' file.txt
You can apply this to a bash variable like this:
echo "$string" | sed 's/^[ \t]*/  /; s/[ \t]*$//'
And save it like this:
variable=$(echo "$string" | sed 's/^[ \t]*/  /; s/[ \t]*$//')
Explanation:
The first substitution removes all leading whitespace and replaces it with two spaces.
The second substitution simply removes all trailing whitespace from a line.
The simplest is probably to use an external process.
value=$(echo "$value" | sed 's/^ *\(.*[^ ]\) *$/  \1/')
If you need to transform an empty string into two spaces, you'll need to modify the regex; and if you're not on Linux, your sed dialect may differ slightly. For maximum portability, switch to awk or Perl, or do it all in Bash. That gets a bit more complex, but for a start, trailing=${value##*[! ]} contains any trailing spaces, and you can trim them off with ${value%$trailing}, and similarly for leading spaces. See the section on variable substitution in the Bash manual for details.
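The all-in-Bash variant described above can be sketched as follows. The function name normalize_spaces is hypothetical, and only space characters (not tabs) are treated as padding:

```bash
#!/usr/bin/env bash

# normalize_spaces: hypothetical helper that prints $1 with exactly two
# leading spaces and no trailing spaces, using only parameter expansion.
normalize_spaces() {
  local value=$1 leading trailing
  trailing=${value##*[! ]}    # what is left after the last non-space
  value=${value%"$trailing"}  # cut the trailing spaces off
  leading=${value%%[! ]*}     # what comes before the first non-space
  value=${value#"$leading"}   # cut the leading spaces off
  printf '  %s\n' "$value"
}

normalize_spaces "     some  text    "
```

Spaces inside the text are preserved, and an empty or all-space input yields just the two leading spaces.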
You can use a regular expression to match everything between the leading and trailing spaces. The matched text is found in the BASH_REMATCH array (the text matching the first parentheses group is in element 1).
spcs='\ *'
text='.*[^ ]'
[[ $src =~ ^$spcs($text)$spcs$ ]]
dst="  ${BASH_REMATCH[1]}"
