Split a string to print first two characters delimited by "-" In Bash - bash

I am listing the AWS region names.
us-east-1
ap-southeast-1
I want to split the string to print specific first characters delimited by - i.e. 'two characters'-'one character'-'one character'. So us-east-1 should be printed as use1 and ap-southeast-1 should be printed as aps1
I have tried this and it's giving me expected results. I was thinking if there is a shorter way to achieve this.
region=us-east-1
regionlen=$(echo -n $region | wc -m)
echo $region | sed 's/-//g' | cut -c 1-3,$(expr $regionlen - 2)-$(expr $regionlen - 1)

How about using sed:
echo "$region" | sed -E 's/^(.[^-]?)[^-]*-(.)[^-]*-(.).*$/\1\2\3/'
Explanation: the s/pattern/replacement/ command picks out the relevant parts of the region name, replacing the entire name with just the relevant bits. The pattern is:
^ - the beginning of the string
(.[^-]?) - the first character, and another (if it's not a dash)
[^-]* - any more things up to a dash
- - a dash (the first one)
(.) - The first character of the second word
[^-]*- - the rest of the second word, then the dash
(.) - The first character of the third word
.*$ - Anything remaining through the end
The bits in parentheses get captured, so \1\2\3 pulls them out and replaces the whole thing with just those.
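To check the pattern against both sample regions, the command can be wrapped in a small function. This is only a sketch; the function name abbreviate_region is made up for illustration and is not part of the answer above.

```bash
# abbreviate_region is a hypothetical helper name.
abbreviate_region() {
  # Keep up to two characters of the first word, then the first
  # character of the second word and of the third word.
  printf '%s\n' "$1" | sed -E 's/^(.[^-]?)[^-]*-(.)[^-]*-(.).*$/\1\2\3/'
}

abbreviate_region us-east-1       # prints: use1
abbreviate_region ap-southeast-1  # prints: aps1
```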

IFS influencing field splitting step of parameter expansion:
$ str=us-east-2
$ IFS=- eval 'set -- $str'
$ echo $#
3
$ echo $1
us
$ echo $2
east
$ echo $3
2
No external utilities; just processing in the language.
This is how smartly written build configuration scripts parse version numbers like 1.13.4 and architecture strings like i386-gnu-linux.
The eval can be avoided, if we save and restore IFS.
$ save_ifs=$IFS; IFS=-; set -- $str; IFS=$save_ifs
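Applied to the original question, the same splitting could be packaged roughly like this. It is a sketch under a few assumptions: region_code is a made-up name, the body runs in a subshell so that neither IFS nor the globbing setting leaks into the caller, and the ${2:0:1} substring syntax requires bash.

```bash
# region_code is a hypothetical name. The function body is wrapped in
# parentheses, so it runs in a subshell and changes nothing in the caller.
region_code() (
  IFS=-      # split on dashes only
  set -f     # no filename globbing on the unquoted expansion below
  set -- $1  # $1, $2, $3 become the dash-separated fields
  printf '%s%s%s\n' "$1" "${2:0:1}" "${3:0:1}"
)

region_code us-east-2       # prints: use2
region_code ap-southeast-1  # prints: aps1
```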

Using bash, and assuming that you need to distinguish between things like southwest and southeast:
s=ap-southwest-1
a=${s:0:2}
b=${s#*-}
b=${b%-*}
c=${s##*-}
bb=
case "$b" in
south*) bb+=s ;;&
north*) bb+=n ;;&
*east*) bb+=e ;;
*west*) bb+=w ;;
esac
echo "$a$bb$c"

How about:
region="us-east-1"
echo "$region" | (IFS=- read -r a b c; echo "$a${b:0:1}${c:0:1}")
use1

A simple sed -
$: printf "us-east-1\nap-southeast-1\n" |
sed -E 's/-(.)[^-]*/\1/g'
To keep noncardinal specifications like southeast distinct from south at the cost of adding an optional additional character -
$: printf "us-east-1\nap-southeast-1\n" |
sed -E '
s/north/n/;
s/south/s/;
s/east/e/;
s/west/w/;
s/-//g;'
If you could have south-southwest, add g to those directional reductions.
If you MUST have exactly 4 characters of output, I recommend mapping the eight or 16 compass directions to specific characters, so that north is N, northeast is maybe O and northwest M, and so on.
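A rough sketch of that fixed-width mapping is below. The letters for north, northeast and northwest follow the suggestion above (lowercased); every other letter, and the function name region_code4, are invented for illustration. It also assumes a two-letter area prefix and a single-digit suffix.

```bash
# region_code4 is a hypothetical name; the letter table is only an example.
region_code4() {
  local area dir num
  IFS=- read -r area dir num <<< "$1"   # IFS change applies to read only
  case $dir in
    north)     dir=n ;;
    northeast) dir=o ;;
    northwest) dir=m ;;
    south)     dir=s ;;
    southeast) dir=t ;;            # invented letter
    southwest) dir=r ;;            # invented letter
    east)      dir=e ;;
    west)      dir=w ;;
    *)         dir=${dir:0:1} ;;   # fallback, e.g. central -> c
  esac
  printf '%s%s%s\n' "$area" "$dir" "${num:0:1}"
}

region_code4 us-east-1       # prints: use1
region_code4 ap-southeast-1  # prints: apt1
region_code4 ap-northeast-2  # prints: apo2
```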

Related

In bash how can I get the last part of a string after the last hyphen [duplicate]

I have this variable:
A="Some variable has value abc.123"
I need to extract this value i.e abc.123. Is this possible in bash?
Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.
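A quick way to see that NF tracks the field count of each record is to try inputs of different lengths. The second command is an extra illustration, not from the answer above: $(NF-1) picks the field before the last one.

```bash
A="Some variable has value abc.123"

echo "$A" | awk '{print $NF}'       # last field: abc.123
echo "$A" | awk '{print $(NF-1)}'   # field before the last: value
echo "one two" | awk '{print $NF}'  # only two fields, still the last: two
```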
Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)
Some examples using parameter expansion
A="Some variable has value abc.123"
echo "${A##* }"
abc.123
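Remove the shortest trailing match of " *" (everything from the last space on)
echo "${A% *}"
Some variable has value
Remove the shortest trailing match of ".*" (everything from the last dot on)
echo "${A%.*}"
Some variable has value abc
Remove the longest trailing match of " *" (everything from the first space on)
echo "${A%% *}"
Some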
Read more: Shell-Parameter-Expansion
The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove shortest trailing * (strip the last word)
${A%% *} - remove longest trailing * (strip the last words)
${A#* } - remove shortest leading * (strip the first word)
${A##* } - remove longest leading * (strip the first words)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
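The filename cases above can be tried directly. The variable names path and file are just examples; the last two lines go slightly beyond the list above to show the longest-match and leading-match variants on the same name.

```bash
path=/usr/bin/git
file=archive.tar.gz

echo "${path##*/}"   # git
echo "${path%/*}"    # /usr/bin
echo "${file%.*}"    # archive.tar
echo "${file%%.*}"   # archive
echo "${file##*.}"   # gz
```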
How do you know where the value begins? If it's always the 5th and 6th words, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.
As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123
More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces"; ie: "not the space char"
the * means: "match 0 or more instances of the preceding match pattern (which is [^ ])", and the $ means "match the end of the line." So, this matches the last word after the last space through to the end of the line; ie: abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements being separated by the default IFS (Internal Field Separator) char, which is space:
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
This pattern is also useful because it makes the opposite just as easy: obtaining all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info. on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually [ \t]). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'

I want to extract the strings from file name

one_two_three_four_five.rtf
I need five in variable A
I need four in variable B
And the remainder in variable C
The name should be read from the end.
Note that only the last two underscores act as separators. There may be many more underscores before them, but everything in front should go into C as a single value.
Is it possible?
For example using parameter expansion
#!/bin/ksh
string="one_two_three_four_five.rtf"
base=${string%.rtf}
a=${base##*_}; base=${base%_$a}
b=${base##*_}; base=${base%_$b}
c=$base
echo "$a - $b - $c"
s="one_two_three_four_five.rtf"
source <(sed -r 's/(.*)_([^_]*)_([^_]*)[.].*/C="\1"; B="\2";A="\3"/' <<< "${s}")
# Result:
echo "A=$A, B=$B, C=$C"
A=five, B=four, C=one_two_three
Explanation:
sed -r Enables extended regular expressions, so the grouping parentheses need no backslash escapes
(.*)_ Matches largest string until underscore with the condition that there are underscores left for matching the remaining string
([^_]*) String without underscore
[.] A dot without special meaning
"\1" First remembered string
<<< "${s}" Input for sed is like echo "${s}" | sed ...
<(..) Simulates a file, so sourcing these will execute the commands.
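If you would rather not source generated text, the same regular expression idea can be expressed with bash's own =~ operator. This is a different technique from the sed answer above, sketched here on the assumption that bash is available; the variable name re is made up.

```bash
s="one_two_three_four_five.rtf"

# Greedy (.*) takes everything up to the second-to-last underscore;
# the trailing \.[^.]*$ drops the extension.
re='^(.*)_([^_]*)_([^_]*)\.[^.]*$'

if [[ $s =~ $re ]]; then
  C=${BASH_REMATCH[1]}
  B=${BASH_REMATCH[2]}
  A=${BASH_REMATCH[3]}
  echo "A=$A, B=$B, C=$C"   # prints: A=five, B=four, C=one_two_three
else
  echo "no match" >&2
fi
```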

How to increment a string variable within a for loop

I want a loop that can find the letter that ends words most frequently in multiple languages and output the data in columns.
So far I have
count="./wordlist/french/fr.txt ./wordlist/spanish/es.txt ./wordlist/german/de.$
lang="French Spanish German Portuguese Italian"
(
echo -e "Language Letter Count"
for i in $count
do
(for j in {a..z}
do
echo -e "LANG" $j $(grep -c $j\> $i)
done
) | sort -k3 -rn | head -1
done
) | column -t
I want it to output as shown:
Language Letter Count
French e 196195
Spanish a 357193
German e 251892
Portuguese a 217178
Italian a 216125
Instead I get:
Language Letter Count
LANG z 0
LANG z 0
LANG z 0
LANG z 0
LANG z 0
The words files have the format:
Word Freq(#) where the word and its frequency are delimited by a space.
This means I have 2 problems;
First, the grep command is not handling the argument $j\> to find a character at the end of a word. I have tried using grep -E $j\> and grep '$j\>' and neither worked.
The second problem is that I don't know how to output the name of the language (in the variable lang). Nesting another for loop did not work when I tried it like this (or with i and k in the opposite order):
(
for i in $count
do
for k in $lang
do
for j in {a..z}
do
echo -e $k $j $(grep -c $j\> $i)
done
) | sort -k3 -rn | head -1
done
done
) | column -t
Since this outputs multiples of the name of the language "$k" in places where it does not belong.
I know that I can just copy and paste the loop for each language, but I would like to extend this to every language.
Thanks in advance!
grep word boundaries
To make special delimiters (e.g. \> for word-end) work with egrep when being called from the shell, you should put them into "quotes".
count=$(egrep -c "${char}\>" "${file}")
Btw, you really should use double quote ("), because single quotes will prevent variable-expansion. (e.g. in j="foo"; k='$j\>', the first character of k's value will be $ rather than f)
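The difference between the two quoting styles is easy to see on a tiny in-memory word list. This sketch assumes a grep that supports the \> word-end extension (GNU grep does); the sample words are invented.

```bash
j=e
words=$'the 5\nword 3\nhome 2'

# Double quotes: $j expands, so the pattern is e\> (an "e" ending a word).
printf '%s\n' "$words" | grep -c "${j}\>"        # prints: 2

# Single quotes: the pattern is the literal text $j\>, which matches
# nothing here; grep prints 0 and exits non-zero, hence the || true.
printf '%s\n' "$words" | grep -c '$j\>' || true  # prints: 0
```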
Language name display
Getting the right language string is a bit more tricky; here's a few suggestions:
Derive the displayed language from the path of the wordlist:
lang=${file%/*}
lang=${lang##*/}
With bash (though not with dash and some other shells) you might even do lang=${lang^} to capitalize the string.
Lookup the proper language name in a dictionary. Bash-4 has dictionaries built in, but you can also use filebased dicts:
$ cat languages.txt
./wordlist/french/fr.txt Français
./wordlist/english/en.txt English
./wordlist/german/de.txt Deutsch
$ file=./wordlist/french/fr.txt
$ lang=$(egrep "^${file}\>" languages.txt | awk '{print $2}')
You can also iterate over file,lang pairs, e.g.
languages="french/fr,French spanish/es,Español german/de,Deutsch"
for l in $languages; do
file=./wordlist/${l%,*}.txt
lang=${l#*,}
# ...
done
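For the built-in dictionary route mentioned above, a minimal sketch with a Bash 4 associative array might look like this. The array name lang_name is made up; the paths and display names are taken from the question.

```bash
# Requires Bash 4 or newer (associative arrays).
declare -A lang_name=(
  [./wordlist/french/fr.txt]="French"
  [./wordlist/spanish/es.txt]="Spanish"
  [./wordlist/german/de.txt]="German"
)

for file in ./wordlist/french/fr.txt ./wordlist/german/de.txt; do
  # Look the display name up by file path.
  echo "$file -> ${lang_name[$file]}"
done
```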
Taking word frequencies into account
The third problem I see (though I might misunderstand the problem), is that you are not taking the word frequency into account. e.g. a word A that is used 1000 times more often than the word B will only get counted once (just like B).
You can use awk to sum up the word frequencies of matching words:
count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
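The filtering and the summing can also be done in a single awk call. Note that this is a slightly different technique from the pipeline above: instead of \>, it anchors the letter at the end of the first field. The function name sum_freq and the sample data are invented for the example.

```bash
# sum_freq LETTER FILE: add up the frequencies (field 2) of all words
# (field 1) that end in LETTER. Prints 0 when nothing matches.
sum_freq() {
  awk -v c="$1" '$1 ~ (c "$") { s += $2 } END { print s + 0 }' "$2"
}

list=$(mktemp)
printf 'the 5\nword 3\nhome 2\n' > "$list"

sum_freq e "$list"   # prints: 7
sum_freq d "$list"   # prints: 3
sum_freq z "$list"   # prints: 0

rm -f "$list"
```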
All Together Now
So a full solution to the problem could look like:
languages="french/fr,French spanish/es,Español german/de,Deutsch"
(
echo -e "Language Letter Count"
for l in ${languages}; do
file=./wordlist/${l%,*}.txt
lang=${l#*,}
for char in {a..z}; do
#count=$(egrep -c "${char}\>" "${file}")
count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
echo ${lang} ${char} ${count}
done | sort -k3 -rn | head -1
done
) | column -t

Extract version number from file in shell script

I'm trying to write a bash script that increments the version number which is given in
{major}.{minor}.{revision}
For example.
1.2.13
Is there a good way to easily extract those 3 numbers using something like sed or awk such that I could increment the {revision} number and output the full version number string.
$ v=1.2.13
$ echo "${v%.*}.$((${v##*.}+1))"
1.2.14
$ v=11.1.2.3.0
$ echo "${v%.*}.$((${v##*.}+1))"
11.1.2.3.1
Here is how it works:
The string is split in two parts.
the first one contains everything but the last dot and next characters: ${v%.*}
the second one contains everything but all characters up to the last dot: ${v##*.}
The first part is printed as is, followed by a plain dot and the last part incremented using shell arithmetic expansion: $((x+1))
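The same two expansions fit into a small function. One thing to watch: bash arithmetic reads a leading zero as octal, so a component such as 08 would be rejected; the 10# prefix below forces base 10. The name bump_last is made up, and the sketch assumes the version contains at least one dot.

```bash
# bump_last is a hypothetical helper name.
bump_last() {
  local v=$1
  local head=${v%.*}    # everything before the last dot
  local last=${v##*.}   # everything after the last dot
  # 10# forces base 10, so 08 is read as 8 instead of invalid octal.
  printf '%s.%s\n' "$head" "$((10#$last + 1))"
}

bump_last 1.2.13      # prints: 1.2.14
bump_last 11.1.2.3.0  # prints: 11.1.2.3.1
bump_last 1.2.08      # prints: 1.2.9
```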
Pure Bash using an array:
version='1.2.33'
a=( ${version//./ } ) # replace points, split into array
((a[2]++)) # increment revision (or other part)
version="${a[0]}.${a[1]}.${a[2]}" # compose new version
I prefer "cut" command for this kind of things
major=`echo $version | cut -d. -f1`
minor=`echo $version | cut -d. -f2`
revision=`echo $version | cut -d. -f3`
revision=`expr $revision + 1`
echo "$major.$minor.$revision"
I know this is not the shortest way, but for me it's simplest to understand and to read...
Yet another shell way (showing there's always more than one way to bugger around with this stuff...):
$ echo 1.2.3 | ( IFS=".$IFS" ; read a b c && echo $a.$b.$((c + 1)) )
1.2.4
So, we can do:
$ x=1.2.3
$ y=`echo $x | ( IFS=".$IFS" ; read a b c && echo $a.$b.$((c + 1)) )`
$ echo $y
1.2.4
Awk makes it quite simple:
echo "1.2.14" | awk -F \. {'print $1,$2, $3'}
This prints 1 2 14. The -F flag specifies the field separator.
If you wish to save one of the values:
firstVariable=$(echo "1.2.14" | awk -F \. {'print $1'})
I use the shell's own word splitting; something like
oIFS="$IFS"
IFS=.
set -- $version
IFS="$oIFS"
although you need to be careful with version numbers in general due to alphabetic or date suffixes and other annoyingly inconsistent bits. After this, the positional parameters will be set to the components of $version:
$1 = 1
$2 = 2
$3 = 13
($IFS is a set of single characters, not a string, so this won't work with a multicharacter field separator, although you can use IFS=.- to split on either . or -.)
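To illustrate the point about IFS being a set of characters, here is a sketch that splits on both . and - in one go. It uses read with a here-string instead of set --, so the IFS change applies to that one command only; the sample version string and variable names are made up.

```bash
version=1.2.13-rc1

# Each character in IFS is a separator, so both "." and "-" split fields.
IFS=.- read -r major minor revision suffix <<< "$version"

echo "major=$major minor=$minor revision=$revision suffix=$suffix"
# prints: major=1 minor=2 revision=13 suffix=rc1
```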
Inspired by the answer of jlliagre I made my own version which supports version numbers just having a major version given. jlliagre's version will make 1 -> 1.2 instead of 2.
This one is appropriate to both styles of version numbers:
function increment_version() {
local VERSION="$1"
local INCREMENTED_VERSION=
if [[ "$VERSION" =~ .*\..* ]]; then
INCREMENTED_VERSION="${VERSION%.*}.$((${VERSION##*.}+1))"
else
INCREMENTED_VERSION="$((${VERSION##*.}+1))"
fi
echo "$INCREMENTED_VERSION"
}
This will produce the following outputs:
increment_version 1 -> 2
increment_version 1.2 -> 1.3
increment_version 1.2.9 -> 1.2.10
increment_version 1.2.9.101 -> 1.2.9.102
Small variation on fgm's solution using the builtin read command to split the string into an array. Note that the scope of the IFS variable is limited to the read command (so no need to store & restore the current IFS variable).
version='1.2.33'
IFS='.' read -r -a a <<<"$version"
((a[2]++))
printf '%s\n' "${a[@]}" | nl
version="${a[0]}.${a[1]}.${a[2]}"
echo "$version"
See: How do I split a string on a delimiter in Bash?
I'm surprised no one suggested grep yet.
Here's how to get the full version (not limited to the length of x.y.z...) from a file name:
filename="openshift-install-linux-4.12.0-ec.3.tar.gz"
find -name "$filename" | grep -Eo '([0-9]+)(\.?[0-9]+)*' | head -1
# 4.12.0

how to chop last n bytes of a string in bash string choping?

For example, given qa_sharutils-2009-04-22-15-20-39, I want to chop the last 20 bytes and get 'qa_sharutils'.
I know how to do it in sed, but why does $A=${A/.\{20\}$/} not work?
Thanks!
If your string is stored in a variable called $str, then this will give you the substring without the last 20 characters in bash
${str:0:${#str} - 20}
basically, string slicing can be done using
${[variableName]:[startIndex]:[length]}
and the length of a string is
${#[variableName]}
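Putting the slice and the length together, a small helper could look like the sketch below. The name chop_last is made up, and the guard for strings shorter than n is an addition of mine, since a negative length would otherwise be interpreted differently or rejected depending on the bash version.

```bash
# chop_last STRING N: print STRING without its last N characters.
chop_last() {
  local s=$1 n=$2
  if (( n >= ${#s} )); then
    printf '\n'                      # nothing left to keep
  else
    printf '%s\n' "${s:0:${#s}-n}"   # the length is an arithmetic expression
  fi
}

chop_last qa_sharutils-2009-04-22-15-20-39 20   # prints: qa_sharutils
chop_last abc 5                                 # prints an empty line
```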
EDIT:
solution using sed that works on files:
sed 's/.\{20\}$//' < inputFile
similar to substr('abcdefg', 2-1, 3) in php:
echo 'abcdefg'|tail -c +2|head -c 3
using awk:
echo $str | awk '{print substr($0,1,length($0)-20)}'
or using strings manipulation - echo ${string:position:length}:
echo ${str:0:$((${#str}-20))}
In the ${parameter/pattern/string} syntax in bash, pattern is a filename-wildcard (glob) pattern, not a regular expression. In glob syntax a dot . is just a literal dot, and the backslash-escaped braces and the $ are literal characters too, so nothing is removed unless the value happens to contain the literal text ".{20}$". (An assignment is also written A=..., without a leading $.)
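Since the pattern is a glob, "any 20 characters" has to be spelled as twenty ? wildcards rather than as a repetition count. One way to build that, sketched with made-up variable names n and pattern:

```bash
A="qa_sharutils-2009-04-22-15-20-39"
n=20

pattern=$(printf '%*s' "$n" '')   # a string of n spaces
pattern=${pattern// /?}           # turn each space into the glob wildcard ?

echo "${A%$pattern}"              # prints: qa_sharutils
```

The % operator removes the shortest suffix matching the pattern, which here is exactly the last 20 characters.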
There are several ways to accomplish the basic task.
$ str="qa_sharutils-2009-04-22-15-20-39"
To strip the last 20 characters, use substring expansion (offsets are zero-based):
$ echo ${str::${#str}-20}
qa_sharutils
The "%" and "%%" operators strip from the right-hand side of the string. For instance, to keep the base name and drop everything from the first "-" onward:
$ echo ${str%%-*}
qa_sharutils
This works only if the last 20 bytes are always a date.
$ str="qa_sharutils-2009-04-22-15-20-39"
$ IFS="-"
$ set -- $str
$ echo $1
qa_sharutils
$ unset IFS
Or, when the first dash and everything after it are not needed:
$ echo ${str%%-*}
qa_sharutils
