Extracting number from string in bash script [duplicate] - bash

I have multiple s3 buckets in AWS whose names are in the following syntax:
resource-4511-deployment-1srsi6fjy9uuk
web-4533-logbucket-dogx6k0n8967
pcnfile6511
5399-bucket-6dehb5uuiwd
I'd like to extract the 4-digit number from each of these names, preferably without multiple if/else branches, which is the only solution I can think of right now. The output should be:
4511
4533
6511
5399

You can use parameter expansion. Prefix removal gives the text after the four digits and suffix removal gives the text before them; stripping both from the name then leaves only the number:
#!/bin/bash
for name in resource-4511-deployment-1srsi6fjy9uuk \
web-4533-logbucket-dogx6k0n8967 \
pcnfile6511 \
5399-bucket-6dehb5uuiwd
do
after=${name#*[0-9][0-9][0-9][0-9]}
before=${name%%[0-9][0-9][0-9][0-9]*}
num=${name#"$before"}
num=${num%"$after"}
echo "$num"
done
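As a rough sketch, the same expansions can be wrapped in a function so the intermediate steps are easier to follow. The name four_digits is made up for this example and is not part of the answer above; the sketch also shows that a name without four consecutive digits yields an empty result.

```bash
#!/usr/bin/env bash
# four_digits is a hypothetical helper name, not part of the answer above.
four_digits() {
  local name=$1 before after num
  after=${name#*[0-9][0-9][0-9][0-9]}    # remove the shortest prefix ending in 4 digits
  before=${name%%[0-9][0-9][0-9][0-9]*}  # remove the longest suffix starting with 4 digits
  num=${name#"$before"}                  # drop the text before the digits
  num=${num%"$after"}                    # drop the text after the digits
  printf '%s\n' "$num"
}

four_digits web-4533-logbucket-dogx6k0n8967   # prints 4533
four_digits no-digits-here                    # prints an empty line
```

Quoting "$before" and "$after" makes bash treat them as literal text rather than as glob patterns.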

I'd use regex matching here.
I was hoping the pattern would be cleaner, but the data forces this:
re='(^|[^[:digit:]])([[:digit:]]{4})($|[^[:digit:]])'
start of string or a non-digit
followed by 4 digits
followed by end of string or a non-digit
for name in resource-4511-deployment-1srsi6fjy9uuk \
web-4533-logbucket-dogx6k0n8967 \
pcnfile6511 \
5399-bucket-6dehb5uuiwd
do
[[ $name =~ $re ]] && echo ${BASH_REMATCH[2]}
done
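To see what the boundary groups buy you, here is a small sketch comparing this pattern with a bare [0-9]{4} on an input that contains a longer run of digits. The input string and the function name exact_four are invented for the illustration.

```bash
#!/usr/bin/env bash
# Same pattern as above: exactly four digits, not part of a longer run of digits.
re='(^|[^[:digit:]])([[:digit:]]{4})($|[^[:digit:]])'

# exact_four is a hypothetical helper name used only for this sketch.
exact_four() {
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[2]}"
}

exact_four 'abc12345-6789'    # prints 6789; 12345 is skipped because it is a 5-digit run

# A bare four-digit pattern takes the first four digits it finds instead.
[[ 'abc12345-6789' =~ [0-9]{4} ]] && printf '%s\n' "${BASH_REMATCH[0]}"    # prints 1234
```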

Assuming there's only one set of 4-digits in each string, one bash idea using a regex and the BASH_REMATCH[] array:
regex='([0-9]{4})'
for string in resource-4511-deployment-1srsi6fjy9uuk web-4533-logbucket-dogx6k0n8967 pcnfile6511 5399-bucket-6dehb5uuiwd
do
[[ "${string}" =~ $regex ]] && echo "${BASH_REMATCH[1]}"
done
This generates:
4511
4533
6511
5399

printf "pcnfile6511" | grep -o "[0-9][0-9][0-9][0-9]"
The -o option makes grep print only the matching part instead of the whole line. This works, although only for four-digit numbers.
Also...
printf "resource-4511-deployment-1srsi6fjy9uuk" | cut -d'-' -f2
That will work when you have delimiters.
For numbers at the end of a line...
printf "pcnfile6511" | tail -c 4
And for numbers at the beginning...
printf "5399-bucket-6dehb5uuiwd" | head -c 4
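If one command should cover all four name shapes, a possible sketch is to let grep print only the matches and keep the first one. The helper name first_four is an assumption made for this example; note that one of the sample names contains a second four-digit run, which is why head is used.

```bash
#!/usr/bin/env bash
# first_four is a hypothetical helper name used only for this sketch.
first_four() {
  # -o prints each match on its own line; head keeps only the first one
  printf '%s\n' "$1" | grep -oE '[0-9]{4}' | head -n 1
}

first_four pcnfile6511                        # prints 6511
first_four web-4533-logbucket-dogx6k0n8967    # prints 4533 (8967 also matches but is dropped)
```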

Related

Identifying hash encoding

I am creating a function that will accept an input and determine if the value is a certain type of hash encoding (md5, sha1, sha256, and sha512). I have asked a few classmates and logically it makes sense, but clearly something is wrong.
#!/usr/bin/bash
function identify-hash() {
encryptinput=$(echo $1 | grep -E -i '^[a-z0-9=]+${32}')
if [[ -n $encryptinput ]]; then
echo "The $1 is a valid md5sum string"
exit
else
encryptinput=$(echo $1 | grep -E -i '^[a-z0-9=]+${40}')
if [[ -n $encryptinput ]]; then
echo "The $1 is a valid sha1sum string"
exit
else
encryptinput=$(echo $1 | grep -E -i '^[a-z0-9=]+${64}')
if [[ -n $encryptinput ]]; then
echo "The $1 is a valid sha256sum string"
exit
else
encryptinput=$(echo $1 | grep -E -i '^[a-z0-9=]+${128}')
if [[ -n $encryptinput ]]; then
echo "The $1 is a valid sha512sum string"
exit
else
echo "Unable to determine the hash function used to generate the input"
fi
fi
fi
fi
}
identify-hash $1
I know that each hash type has a specific number of characters, but I don't know exactly why it's not working. Removing the {32} from line 4 allows it to answer as an md5sum, but then it assumes everything is an md5sum.
Suggestions?
Fixed your script. You would have spotted most of the issues if you had used ShellCheck:
#!/usr/bin/env bash
identify_hash() {
# local variables
local -- encrypt_input
local -- sumname
# Regex capture the hexadecimal digits
if [[ "$1" =~ ([[:xdigit:]]+) ]]; then
encrypt_input="${BASH_REMATCH[1]}"
else
encrypt_input=''
fi
# Determine name of sum algorithm based on length of encrypt_input
case "${#encrypt_input}" in
32) sumname=md5sum ;;
40) sumname=sha1sum ;;
64) sumname=sha256sum ;;
128) sumname=sha512sum ;;
*) sumname=;;
esac
# If sum algorithm name found (sumname is not empty)
if [ -n "$sumname" ]; then
printf 'The %s is a valid %s string\n' "$encrypt_input" "$sumname"
else
printf 'Unable to determine the hash function used to generate the input\n' >&2
exit 1
fi
}
identify_hash "$1"
Something shorter, using bash:
checkHash() {
local -ar sumnames=([32]=md5sum [40]=sha1sum [64]=sha256sum [128]=sha512sum)
[[ "$1" =~ [[:xdigit:]]{32,129} ]]
echo "${sumnames[${#BASH_REMATCH}]+String $BASH_REMATCH could be }${sumnames[
${#BASH_REMATCH}]:-No hash tool match this string.}"
}
This will extract [:xdigit:] part out of any complete line:
checkHash 'Filename: 13aba32dbe4db7a7117ed40a25c29fa8 --'
String 13aba32dbe4db7a7117ed40a25c29fa8 could be md5sum
checkHash a32dba32dbe4db7a7117ed40a25c29fa8e4db7a7117ed40a25c29fa8
No hash tool match this string.
checkHash a32dba32dbe4db7a7117ed40a25c29fa8e4db7a7117ed40a25c29fa8da921adb
String a32dba32dbe4db7a7117ed40a25c29fa8e4db7a7117ed40a25c29fa8da921adb could be sha256sum
... then ${var+word} expands to word only if $var is set
... and ${var:-word} expands to word if $var is unset or empty
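A minimal sketch of how these two expansions behave, using a throwaway variable v and a made-up helper named show (neither appears in checkHash itself):

```bash
#!/usr/bin/env bash
# show is a hypothetical helper; v is a throwaway variable for the demo.
show() {
  # ${v+SET} expands to SET only when v is set (even if it is empty);
  # ${v:-FALLBACK} expands to FALLBACK when v is unset or empty.
  printf '%s|%s\n' "${v+SET}" "${v:-FALLBACK}"
}

unset v;  show    # prints |FALLBACK
v='';     show    # prints SET|FALLBACK
v=value;  show    # prints SET|value
```

In checkHash the first expansion supplies the "String ... could be" prefix only when the length has an entry in the array, and the second supplies either the tool name or the fallback message.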
Further explaining @Gordon Davisson's comment, plus some basics for anyone who stops by.
NB: This answer is deliberately simplified to apply only to the current question. Here's my preferred guide for more regex.
Basics of regex
^ - start of a line
$ - end of a line
[...] - list of possible characters
has special sauce
a-z = all lowercase (English) letters; 0-9 = all digits; etc.
also accepts character classes - e.g [:xdigit:] for hexadecimal characters
the expression is now [[:xdigit:]] - i.e [:class:] inside [...]
{...} - number of times the preceding expression should be matched
^[a]{1}$ will match a but not aa
^f[o]{2}d$ will match food but not fod, foood, fooo*d
^[a-z]{4}$ will match
ball ✔️ but not buffalo ❌
cove ✔️ but not cover ❌
basically any line ( because of the ^...$) containing a string of exactly 4 (English) alphabetic characters
{1,5} - at least 1 and at most 5
* - shorthand for {0,} meaning 0 or any number of times
+ - shorthand for {1,} meaning at least 1; but no upper limit
? - shorthand for {0,1} meaning optional: 0 or 1 times
So ${32} asks for the end-of-line anchor $ repeated 32 times, which is not what you want; what you need is ^[a-z0-9=]{32}$ instead.
BUT, as Andrej Podzimek also pointed out in the comments, you only need to match hexadecimal characters, [0-9a-f] (with case-insensitive matching), which is what [[:xdigit:]] matches. Either can be used.
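Putting the corrected pattern to work, a minimal sketch could look like the following. The function name hash_kind is made up for this example, and it only checks length and character set, so it reports which tool could have produced the string, not which one actually did.

```bash
#!/usr/bin/env bash
# hash_kind is a hypothetical name; it checks only length and character set.
hash_kind() {
  if   [[ $1 =~ ^[[:xdigit:]]{32}$ ]];  then echo md5sum
  elif [[ $1 =~ ^[[:xdigit:]]{40}$ ]];  then echo sha1sum
  elif [[ $1 =~ ^[[:xdigit:]]{64}$ ]];  then echo sha256sum
  elif [[ $1 =~ ^[[:xdigit:]]{128}$ ]]; then echo sha512sum
  else return 1                         # no known length matched
  fi
}

hash_kind d41d8cd98f00b204e9800998ecf8427e    # prints md5sum
hash_kind not-a-hash || echo "no match"       # prints no match
```

The ^ and $ anchors are what stop a 40-character string from also being accepted as a 32-character one.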
PS
more Basics
. (full stop/period) matches ANY single character, including spaces and special characters, regardless of case
(...) groups part of a pattern
[a-z ]*(chicken).*
will match anything from chicken coop to chicken soup and please pass that chicken cookbook, Alex?
note the space after z; this makes space (ASCII 32) a possible character
[.] means a literal period/full stop, not any character
PPS if it's for homework/assignment/schoolwork, please specify so in your question :)

index of bash string item with ifs separator [duplicate]

Let's say I have this string:
NAMES="Mike&George&Norma"
IFS=$'&'
for NAME in $NAMES
do
echo ${NAME}
done
So I can loop through the NAMES.
But what if I only need George, i.e. the name at index 1?
How can I get NAMES[1]?
If mapfile aka readarray is available/acceptable.
#!/usr/bin/env bash
names="Mike&George&Norma"
mapfile -td '&' name < <(printf '%s' "$names") # printf avoids the trailing newline a here-string would append to the last element
printf '%s\n' "${name[@]}"
This prints all the strings between the & separators, so
printf '%s\n' "${name[0]}"
printf '%s\n' "${name[1]}"
printf '%s\n' "${name[2]}"
print the names one by one.
See the builtin section of the bash manual https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html
$ NAMES="Mike&George&Norma";
$ echo "$NAMES" | cut -d'&' -f2
George
Field counting starts at 1, unlike array indexing, which starts at 0.
Using OP's current code one idea would be to add a counter to the loop processing, eg:
NAMES="Mike&George&Norma"
loop_ctr=-1
match_ctr=1
origIFS="${IFS}" # save current IFS
IFS=$'&'
for NAME in $NAMES
do
((loop_ctr++))
[[ "${loop_ctr}" -ne "${match_ctr}" ]] && # if loop_ctr != match_ctr then skip to next pass through loop
continue
echo ${NAME}
done
IFS="${origIFS}" # reset to original IFS
This generates as output:
George
NOTE: My preference would be to parse the string into an array (via mapfile/readarray) ... and @jetchisel beat me to that idea :-)
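Another way to get an array, sketched here under the assumption that the names contain no newlines, is read -a with IFS set only for that one command, which leaves the global IFS untouched. The helper name nth_field is invented for this example.

```bash
#!/usr/bin/env bash
# nth_field is a hypothetical helper: print field number $2 (0-based) of $1, split on &.
nth_field() {
  local -a parts
  # IFS is changed only for this read, so there is nothing to save and restore
  IFS='&' read -r -a parts <<< "$1"
  printf '%s\n' "${parts[$2]}"
}

NAMES="Mike&George&Norma"
nth_field "$NAMES" 1    # prints George
```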

Modify piped input

Think of strings, such as:
I have two apples
He has 4 apples
They have 10 pizzas
I would like to substitute every number I find in a string with a different value, calculated by an external script. In my case, the Python program digit_to_word.py converts a number written in digits to its alphabetic form, but anything will do, as long as I can understand the process.
Expected output:
I have two apples
He has four apples
They have ten pizzas
Conceptually:
echo "He has four apples" |
while read word;
do
if [[ "$word" == +([0-9+]) ]]; then
NUM='${python digit_to_word.py "$word"}'
$word="$NUM"
fi
done |
other_operation... | etc..
I say conceptually because I did not get even close to making it work. It is hard for me even to find information on the issue, simply because I do not know exactly how to conceptualize it. At this point, I am mostly thinking about process substitution, but I am afraid it is not the best way.
Any hint would be really useful. Thanks in advance for sharing your knowledge with me!
regex='([[:space:]])([0-9]+)([[:space:]])'
echo "He has 4 apples" |
while IFS= read -r line; do
line=" ${line} " # pad with space so first and last words work consistently
while [[ $line =~ $regex ]]; do # loop while at least one replacement is pending
pre_space=${BASH_REMATCH[1]} # whitespace before the word, if any
word=${BASH_REMATCH[2]} # actual word to replace
post_space=${BASH_REMATCH[3]} # whitespace after the word, if any
replace=$(python digit_to_word.py "$word") # new word to use
in=${pre_space}${word}${post_space} # old word padded with whitespace
out=${pre_space}${replace}${post_space} # new word padded with whitespace
line=${line//$in/$out} # replace old w/ new, keeping whitespace
done
line=${line#' '}; line=${line%' '} # remove the padding we added earlier
printf '%s\n' "$line" # write the output line
done
This is careful to work even in some tricky corner cases:
4 score and 14 years ago only replaces the 4 in 4 score with four, and doesn't also modify the 4 in 14.
Input that mixes tabs and spaces generates output with the same kinds of whitespace; use printf '1\t2 3\n' as your input, and you'll get a tab between one and two, but a space between two and three.
See this running at https://ideone.com/SOsuAD
I'd suggest this is a better job for perl.
To recreate the scenario:
$ cat digit_to_word.sh
case $1 in
4) echo four;;
8) echo eight;;
10) echo ten;;
*) echo "$1";;
esac
$ bash digit_to_word.sh 10
ten
Then this
perl -pe 's/(\d+)/ chomp($word = qx{bash digit_to_word.sh $1}); $word /ge' <<END
I have two apples
He has 4 apples
They have 10 pizzas but only 8 cookies
END
outputs
I have two apples
He has four apples
They have ten pizzas but only eight cookies
However, you've already got some python, why don't you implement the replacement part in python too?
Revision
This approach decomposes each line into two arrays: one for the words and one for the whitespace. Each line is then reconstructed by interleaving the array elements, with digits translated to words by the Python script. Thanks to @Charles Duffy for pointing out some common Bash pitfalls with my original answer.
while IFS= read -r line; do
# Decompose the line into an array of words delimited by whitespace
IFS=" " read -ra word_array <<< $(echo "$line" | sed 's/[[:space:]]/ /g')
# Invert the decomposition, creating an array of whitespace delimited by words
IFS="w" read -ra wspace_array <<< $(echo "$line" | sed 's/\S/w/g' | tr -s 'w')
# Interleave the array elements in the output, translating digits to text
for ((i=0; i<${#wspace_array[@]}; i++))
do
printf "%s" "${wspace_array[$i]}"
if [[ "${word_array[$i]}" =~ ^[0-9]+$ ]]; then
printf "%s" "$(digit_to_word.py ${word_array[$i]})"
else
printf "%s" "${word_array[$i]}"
fi
done
printf "\n"
done < sample.txt
You could use sed for this. Here's an example:
$ echo "He has 4 apples" | sed 's/4/four/'
He has four apples
Looking at the example data though, sed might not be a good fit. If you see "1", you want to replace with "one", but your example replaced "10" with "ten". Do you need to support multi-digit numbers, such as replacing "230" with "two hundred and thirty"?
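To illustrate why matching whole tokens matters here, this is a rough bash-only sketch in which a small lookup table stands in for digit_to_word.py. The table contents and the function name replace_numbers are assumptions for the example, associative arrays need Bash 4 or later, and runs of whitespace are collapsed to single spaces.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for digit_to_word.py: a tiny lookup table.
declare -A words=([1]=one [4]=four [10]=ten)

# replace_numbers is a made-up helper name for this sketch.
replace_numbers() {
  local -a tokens out=()
  local token
  read -r -a tokens <<< "$1"            # split the line on whitespace
  for token in "${tokens[@]}"; do
    # Replace only tokens that are entirely digits and present in the table
    if [[ $token =~ ^[0-9]+$ && -n ${words[$token]+x} ]]; then
      out+=("${words[$token]}")
    else
      out+=("$token")
    fi
  done
  printf '%s\n' "${out[*]}"             # join the tokens with single spaces
}

replace_numbers "They have 10 pizzas"         # prints They have ten pizzas
replace_numbers "4 score and 14 years ago"    # prints four score and 14 years ago
```

Because each token is compared as a whole, 10 is looked up as ten rather than as a 1 followed by a 0, and 14 is left alone since it is not in the table.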

How to pull a string apart by its contents

I have a string in a common pattern that I want to manipulate. I want to be able to turn string 5B299 into 5B300 (increment the last number by one).
I want to avoid blindly splicing the string by index, as the first number and letter can change in size. Essentially I want to be able to get the entire value of everything after the first character, increment it by one, and re-append it.
The only things I've found online so far show me how to cut by a delimiter, but I don't have a constant delimiter.
You could use the regex features of the bash shell: its =~ operator supports Extended Regular Expression (ERE) matching. All you need to do is define a regex and work on the captured groups to get the resulting string.
str=5B299
re='^(.*[A-Z])([0-9]+)$'
Now use the =~ operator to do the regex match. If the match succeeds, =~ populates the array BASH_REMATCH: index 0 holds the entire match and the captured groups follow. The first part (5B in the example) is stored at index 1 and the number at index 2. We increment the value at index 2 with the $((..)) operator.
if [[ $str =~ $re ]]; then
result="${BASH_REMATCH[1]}$(( BASH_REMATCH[2] + 1 ))"
printf '%s\n' "$result"
fi
A POSIX version of the regex, free of the locale dependency, uses character classes instead of range expressions:
posix_re='^(.*[[:alpha:]])([[:digit:]]+)$'
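As a sketch, the character-class regex can be wrapped in a function. The name increment_suffix is hypothetical, and the 10# prefix is an addition not present in the answer: it forces base 10 so that a suffix with leading zeros is not rejected as invalid octal.

```bash
#!/usr/bin/env bash
# increment_suffix is a hypothetical helper name for this sketch.
increment_suffix() {
  local re='^(.*[[:alpha:]])([[:digit:]]+)$'
  [[ $1 =~ $re ]] || return 1           # no "letters then digits" tail: give up
  # 10# forces base 10, so a suffix such as 009 is not treated as octal;
  # note that zero padding is not preserved (A009 becomes A10).
  printf '%s%d\n' "${BASH_REMATCH[1]}" "$(( 10#${BASH_REMATCH[2]} + 1 ))"
}

increment_suffix 5B299       # prints 5B300
increment_suffix 127ABC385   # prints 127ABC386
```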
You can do what you are attempting fairly easily with the bash parameter-expansion for string indexes along with the POSIX arithmetic operator. For instance you could do:
#!/bin/bash
[ -z "$1" ] && { ## validate at least 1 argument provided
printf "error: please provide a number.\n" >&2
exit 1
}
[[ $1 =~ [^0-9][^0-9]* ]] && { ## validate all digits in argument
printf "error: input contains non-digit characters.\n" >&2
exit 1
}
suffix=${1:1} ## take all characters past the 1st as suffix
suffix=$((suffix + 1)) ## increment suffix by 1
result=${1:0:1}$suffix ## append suffix to the original 1st character
echo "$result" ## output
exit 0
Which will leave the 1st character alone while incrementing the remaining characters by 1 and then joining again with the original 1st digit, while validating that the input consisted only of digits, e.g.
Example Use/Output
$ bash prefixsuffix.sh
error: please provide a number.
$ bash prefixsuffix.sh 38a900
error: input contains non-digit characters.
$ bash prefixsuffix.sh 38900
38901
$ bash prefixsuffix.sh 39999
310000
Look things over and let me know if that is what you intended.
You can use sed in conjunction with awk:
increment() {
echo $1 | sed -r 's/([0-9]+[a-zA-Z]+)([0-9]+)/\1 \2/' | awk '{printf "%s%d", $1, ++$2}'
}
echo $(increment "5B299")
echo $(increment "127ABC385")
echo $(increment "7cf999")
Output:
5B300
127ABC386
7cf1000

case insensitive string comparison in bash

The following line removes the leading text before the variable $PRECEDING
temp2=${content#$PRECEDING}
But now I want $PRECEDING to be matched case-insensitively. This works with sed's I flag, but I can't figure out the whole command.
No need to call out to sed or use shopt. The easiest and quickest way to do this (as long as you have Bash 4):
if [ "${var1,,}" = "${var2,,}" ]; then
echo "matched"
fi
All you're doing there is converting both strings to lowercase and comparing the results.
Here's a way to do it with sed:
temp2=$(sed -e "s/^.*$PRECEDING//I" <<< "$content")
Explanation:
^.*$PRECEDING: ^ means start of string, . means any character, and .* means any character zero or more times. Together this means "match everything from the start of the string up to and including the string stored in $PRECEDING".
The I part means case-insensitive, the g part (if you use it) means "match all occurrences" instead of just the 1st.
The <<< notation is for herestrings, so you save an echo.
The only bash way I can think of is to check if there's a match (case-insensitively) and if yes, exclude the appropriate number of characters from the beginning of $content:
content=foo_bar_baz
PRECEDING=FOO
shopt -s nocasematch
[[ $content == ${PRECEDING}* ]] && temp2=${content:${#PRECEDING}}
echo $temp2
Outputs: _bar_baz
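A possible variant that avoids changing shell options, assuming Bash 4 or later for the ${var,,} expansion, is to compare lowercased copies and then cut the prefix off the original string. The function name strip_prefix_ci is made up for this sketch.

```bash
#!/usr/bin/env bash
# strip_prefix_ci is a hypothetical helper; ${var,,} needs Bash 4 or later.
strip_prefix_ci() {
  local content=$1 prefix=$2
  # Compare lowercased copies; the quoted prefix is literal, * matches the rest
  if [[ ${content,,} == "${prefix,,}"* ]]; then
    printf '%s\n' "${content:${#prefix}}"   # cut the prefix off the original string
  else
    printf '%s\n' "$content"                # no match: leave the string unchanged
  fi
}

strip_prefix_ci foo_bar_baz FOO    # prints _bar_baz
```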
Your examples involve context switching.
A better way (Bash 4):
VAR1="HELLoWORLD"
VAR2="hellOwOrld"
if [[ "${VAR1^^}" = "${VAR2^^}" ]]; then
echo MATCH
fi
link: Converting string from uppercase to lowercase in Bash
If you don't have Bash 4, I find the easiest way is to first convert your string to lowercase using tr
VAR1=HelloWorld
VAR2=helloworld
VAR1_LOWER=$(echo "$VAR1" | tr '[:upper:]' '[:lower:]')
VAR2_LOWER=$(echo "$VAR2" | tr '[:upper:]' '[:lower:]')
if [ "$VAR1_LOWER" = "$VAR2_LOWER" ]; then
echo "Match"
else
echo "Invalid"
fi
This also makes it really easy to assign your output to a variable by changing your echo statements to OUTPUT="Match" and OUTPUT="Invalid".
