Cut string of numbers at letter in bash - bash

I have a string such as plantford1775.274.284b63.11.
I have been using identity=$( echo "$identity" | cut -d'.' -f3) to cut at each dot, and then choose the third section. I am left with 284b63.
The format of this part is always a letter, sandwiched by varying amounts of numbers. I would like to take the first few numbers before the letter. An example code line would be this:
identity=$( echo "$identity" | cut -d'anyletter' -f1)
What do I replace anyletter with to cut at whatever letter is listed there, so that I end with a string of 284?

This can be done with a single awk command, written and tested against the sample shown.
echo "$identity" | awk -F'.' '{sub(/[^0-9].*/,"",$3);print $3}'
Explanation: the output of echo is piped to awk as standard input. The field separator is set to . so the string is split at each dot. In the 3rd field, sub() replaces everything from the first non-digit character onward with an empty string, and the field is then printed.

Try:
echo plantford1775.274.284b63.11 | cut -d. -f3 | sed 's/[a-z].*//'

Or a slight variation on the REGEX, with [[...]] in bash:
v="plantford1775.274.284b63.11"
[[ $v =~ ^[^.]+\.[^.]+\.([^.]+).*$ ]] && echo "${BASH_REMATCH[1]}"
Output
284b63
Or if you are only interested in the digits before the letter:
[[ $v =~ ^[^.]+\.[^.]+\.([[:digit:]]+)[^.]+.*$ ]] && echo "${BASH_REMATCH[1]}"
Output
284

With bash, using the =~ operator:
[[ $identity =~ ^[^.]*\.[^.]*\.([0-9]+) ]] && identity=${BASH_REMATCH[1]}
or, in POSIX shell:
identity=${identity#*.*.}
identity=${identity%%[^0-9]*}
or, using sed:
identity=$(sed 's/[^.]*\.[^.]*\.\([0-9]*\).*/\1/' <<< "$identity")
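As a minimal sketch of the POSIX parameter-expansion approach above, the two steps can be wrapped in a function. The name third_field_digits is hypothetical, and the code assumes the input always has at least three dot-separated fields.

```shell
# Hypothetical helper: print the leading digits of the third dot-separated field.
third_field_digits() {
    s=$1
    s=${s#*.*.}        # drop the shortest prefix ending at the second dot -> 284b63.11
    s=${s%%[!0-9]*}    # drop everything from the first non-digit onward  -> 284
    printf '%s\n' "$s"
}

third_field_digits "plantford1775.274.284b63.11"   # prints 284
```

Using [!0-9] rather than [^0-9] keeps the pattern within what POSIX specifies for shell patterns.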

Maybe you can use a bash regex and get the result from $BASH_REMATCH.
[[ "$identity" =~ ([0-9]+)[a-z][0-9]+ ]] && identity="${BASH_REMATCH[1]}"

Say we have
identity=284b63
then you can do a
lead=${identity%[a-z]*}
to set lead to 284. Feel free to adapt the pattern to upper case letters and/or other separators.

If the format of this part is always a letter sandwiched between varying numbers of digits, and you want to match that format, you might also use GNU awk, setting the field separator to . and using a pattern with a capture group on the 3rd field. The three-argument form of match used here is a GNU awk extension.
The pattern captures one or more digits from the start of the field, then matches one or more characters in [a-z] followed by a digit.
echo "$identity" | awk -F'.' 'match($3, /^([0-9]+)[a-z]+[0-9]/, ary) {print ary[1]}'
Output
284
Or using sed with a pattern matching the first 2 dots and the capture group after the 2nd dot:
identity=$(sed 's/^[^.]\+\.[^.]\+\.\([0-9]\+\)[a-z]\+[0-9].*/\1/' <<< "$identity")

Related

How to add a hyphen after every fifth character of a word in bash

Given "ABCDEFGHIJKLMNOPQRSTUVWXY"
How does one achieve this outcome? "ABCDE-FGHIJ-KLMNO-PQRST-UVWXY"
With sed you can do this by first adding a - after every 5 characters, then removing the trailing - at the end of the line:
$ sed -E 's/.{5}/&-/g; s/-$//' <<<"ABCDEFGHIJKLMNOPQRSTUVWXY"
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
In extended (-E) mode:
.{5} matches any 5 characters
&- replaces with the whole match (the 5 characters) plus -
Then the second substitution command matches - at the end of the line ($) and replaces with nothing.
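To make the two substitutions easier to follow, here is a small sketch that runs them separately and keeps the intermediate result. The variable names step1 and step2 are only for illustration.

```shell
input="ABCDEFGHIJKLMNOPQRSTUVWXY"

# Step 1: append a hyphen to every complete group of 5 characters.
step1=$(printf '%s\n' "$input" | sed -E 's/.{5}/&-/g')

# Step 2: remove the hyphen left at the end of the line, if there is one.
step2=$(printf '%s\n' "$step1" | sed 's/-$//')

printf '%s\n' "$step1"   # ABCDE-FGHIJ-KLMNO-PQRST-UVWXY-
printf '%s\n' "$step2"   # ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
```

The second step only matters when the input length is an exact multiple of 5; otherwise the last group is incomplete and no trailing hyphen is added.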
With GNU awk, one option would be to use FPAT to define the way the line is interpreted as a series of fields, then add - between each field:
$ awk -v FPAT='.{5}' -v OFS='-' '{ $1 = $1 } 1' <<<"ABCDEFGHIJKLMNOPQRSTUVWXY"
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
The field pattern FPAT is defined as any 5 characters and the Output Field Separator OFS is defined as -. $1 = $1 "touches" every line, causing it to be reformatted (without this part, nothing would happen). 1 is the shortest true condition causing each line to be printed.
It's not too difficult to do this in bash either:
#!/bin/bash
input="ABCDEFGHIJKLMNOPQRSTUVWXY"
parts=()
# build an array from slices of length 5
for (( i = 0; i < ${#input}; i += 5 )) do
parts+=( "${input:i:5}" )
done
# join the array on IFS (use a subshell to avoid modifying IFS for rest of script)
( IFS=-; echo "${parts[*]}" )
Could you please try following.
echo "ABCDEFGHIJKLMOPQRSTUVWXY" | sed 's/...../&-/g;s/-$//'
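A simple solution for letters only:
sed -E 's/[A-Z]{4}./&-/g' file.txt
With the 24-character string ABCDEFGHIJKLMOPQRSTUVWXY in file.txt, the output will be:
ABCDE-FGHIJ-KLMOP-QRSTU-VWXY
Note that an input whose length is an exact multiple of 5 ends up with a trailing hyphen.
To match lower-case letters as well, widen the bracket expression:
sed -E 's/[A-Za-z]{4}./&-/g' file.txt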
Try this
#!/bin/bash
s="ABCDEFGHIJKLMNOPQRSTUVWXY"
a=($(echo ${s} | grep -o .))
o=""
i=0
while [[ ${i} -lt ${#a[@]} ]]; do
o="${o}${a[${i}]}"
(( i++ ))
[[ $(( i % 5 )) -eq 0 ]] && [[ ${i} -ne ${#a[@]} ]] && o="${o}-"
done
echo ${o}
exit 0
another solution with fold/paste
$ echo {A..Y} | tr -d ' ' | # this is to generate the string
fold -w5 | paste -sd-
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
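The fold/paste pipeline works in two stages: fold breaks the string into lines of 5 characters, and paste -s joins those lines back together with - as the delimiter. A sketch as a reusable function (the name hyphenate is hypothetical):

```shell
# Hypothetical helper: split $1 into 5-character lines, then join the lines with "-".
hyphenate() {
    printf '%s\n' "$1" | fold -w 5 | paste -s -d '-' -
}

hyphenate "ABCDEFGHIJKLMNOPQRSTUVWXY"   # ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
hyphenate "ABCDEFG"                     # ABCDE-FG
```

Because the delimiter is only placed between lines, no trailing hyphen has to be cleaned up afterwards.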
This might work for you (GNU sed):
sed 's/.\{5\}\B/&-/g' file
Insert a hyphen every five characters as long as the fifth character is inside a word.
Yet another choice
perl -pe 's/(.{5})(?=.)/$1-/g' file
Match 5 characters that are followed by another character (to avoid the trailing hyphen problem)

bash check if a string has a character more than once

The title actually almost explains it all. I would like to check if a string contains a letter (not a specific letter, really any letter) more than once.
for example:
user:
test.sh this list
script:
if [ "$1" has some letter more then once ]
then
do something
fi
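Use a POSIX character class. Note that this checks whether the string contains at least two letters, not whether the same letter occurs twice: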
if [[ $1 =~ [[:alpha:]].*[[:alpha:]] ]]; then
echo "more than one letter"
fi
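This regex (in bash) tells you whether a lower-case letter is repeated, and which one it is. It relies on a back-reference (\1), which POSIX does not define for extended regular expressions but which many implementations, including glibc, accept:
#!/bin/bash
regex="([a-z]).*\1"
if [[ $1 =~ $regex ]]; then
echo "more than one letter \"${BASH_REMATCH[1]}\""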
fi
Call as:
$ script.sh "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz"
more than one letter "z"
Of course, the range of letters could be extended to cover both lower and upper case:
[a-zA-Z]
But that range is limited to plain ASCII letters only if LC_COLLATE is set to "C". In a UTF-8 locale, accented characters may also fall within the a-z range, as this shows:
$ ./sc.sh abcdefghijklémnopéqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz
more than one letter "é"
Setting LC_COLLATE=C restricts the range to what ASCII considers a letter:
$ LC_COLLATE=C ./sc.sh abcdefghijklémnopéqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz
more than one letter "z"
The range could also be replaced by a character class:
[[:word:]] [[:alpha:]] [[:lower:]] [[:upper:]]
Please note that what those classes match also depends on the locale in use.
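Where a shell without [[ =~ ]] has to be supported, the same check can be sketched with grep and a basic regular expression, for which POSIX does specify back-references. The function name has_repeated_letter is hypothetical.

```shell
# Hypothetical helper: succeed if some letter occurs at least twice in $1.
has_repeated_letter() {
    # \(...\) captures one letter; \1 requires that same letter again later.
    printf '%s\n' "$1" | grep -q '\([[:alpha:]]\).*\1'
}

if has_repeated_letter "this list"; then
    echo "a letter occurs more than once"
fi
```

As with the bash version, what counts as a letter for [[:alpha:]] depends on the locale in use.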
If you want to go by using just basic commands, you can use something like this ...
#!/bin/bash
PATH=/bin/:/usr/bin/:$PATH
if [ `echo $* | tr -d ' ' | sed 's/\(.\)/\1\n/g' | sort | uniq -c | tr -s ' ' | sort -n | grep -v '^ 1 ' | wc -l` -ge 1 ]
then
echo "Input contains duplicate characters"
fi
In case it is unclear, it is easy to try out each step on the command line: run echo test input | tr -d ' ' and look at the output, then add the sed part, and so on.
The first tr -d ' ' ensures that spaces in your input are not counted as duplicates. For example, if the input is "abcd efgh ijkl", the only repeated character is the space. With tr -d ' ' in place, the script does not report duplicate characters for that input; without it, it does.
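The stages can also be laid out one per line. The sketch below is a variation on the pipeline rather than the original command: it uses fold -w 1 in place of the sed newline substitution, and uniq -d to list only the repeated characters. It assumes single-byte (ASCII) input, and the function name duplicated_chars is made up for illustration.

```shell
# Hypothetical helper: print every character (other than space) that
# occurs more than once in the arguments.
duplicated_chars() {
    printf '%s\n' "$*" |
        tr -d ' ' |      # ignore spaces
        fold -w 1 |      # one character per line
        sort |           # bring equal characters together
        uniq -d |        # keep one copy of each repeated character
        tr -d '\n'       # join the survivors on one line
}

dups=$(duplicated_chars this list)
if [ -n "$dups" ]; then
    echo "Input contains duplicate characters: $dups"
fi
```

Truncating the pipeline after any stage shows what that stage contributes, which is the same experiment the answer recommends.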
Cheers.
-- Parag

stripping the first part of piped output

I have a bash script that outputs the following:
SUM = 137892134.0000000
I need to strip off the first part of the string, leaving only the number, formatted as an integer if possible. I'm assuming I need to use sed but I seem to have zero capacity to learn it.
I need to be able to write a conditional statement that can operate if the value is less than 100. I don't know if I can do this in a bash script, but that will be the second part of my challenge.
The basic form of a substitution with sed is:
s/replace this/with this/
Here "replace this" is a regular expression and "with this" is the replacement text. In your case, you want to completely get rid of the literal string "SUM = " at the beginning and the decimal part at the end. So:
#!/bin/bash
sum=$(your_script.sh | sed 's/^SUM = //' | sed 's/\..*//')
if ! grep -Eq '^[0-9]+$' <<< "$sum"; then
echo "your_script.sh printed unexpected output!"
exit 1
fi
if [ "$sum" -lt 100 ]; then
echo "$sum is less than 100"
else
echo "$sum is not less than 100"
fi
The first line is what turns "SUM = 137892134.0000000" into "137892134". The first sed replaces "SUM = " at the beginning of the string (^) with nothing (i.e., deletes it). The second sed finds the first period character (\.) and replaces it and everything after it (.*) with nothing. The resulting string is then saved to the variable $sum using $(...).
The if-statement that uses grep -E checks that the value saved in $sum is actually an integer, and bails out if it's not.
The second if-statement compares the value of $sum, which we now know is an integer, with 100.
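The strip-then-validate steps described above could be packaged as follows. This is a sketch only: extract_sum is a hypothetical name, the two sed expressions are combined into one sed call, and a case pattern stands in for the grep check.

```shell
# Hypothetical helper: print the integer part of a "SUM = <number>" line,
# or fail if what remains is not a plain non-negative integer.
extract_sum() {
    value=$(printf '%s\n' "$1" | sed -e 's/^SUM = //' -e 's/\..*//')
    case $value in
        '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    printf '%s\n' "$value"
}

if sum=$(extract_sum "SUM = 137892134.0000000") && [ "$sum" -lt 100 ]; then
    echo "$sum is less than 100"
else
    echo "$sum is not less than 100"
fi
```

In a real script the literal string would be replaced by the output of the command that prints the SUM line.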
It's not clear to me how you want to handle "123.789" (whether you would print 124 or 123 when printing as an integer). The %d conversion used below truncates, giving 123. Consider:
if [ "$( echo SUM = 137892134.0000000 | awk '{printf "%d", $3}' )" -lt 100 ]; then
echo the value is less than 100!!
fi
You can also do:
if echo SUM = 137892134.0000000 | awk '$3 >= 100 { exit 1}'; then
echo the value is less than 100!!
fi
or
if ! echo SUM = 137892134.0000000 | awk '{exit $3 < 100}'; then
echo the value is less than 100!!
fi
Note that the logic is a little convoluted as awk returning 1 evaluates to failure, so the comparison operator is the inverse of what might be expected.
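One way to keep the comparison reading naturally is to negate it inside awk, so that the exit status is 0 (success) exactly when the value is below the limit. A sketch, where sum_below and its limit parameter are hypothetical:

```shell
# Hypothetical helper: succeed if the third field of $1 is numerically below $2.
sum_below() {
    # exit 0 when $3 < limit, exit 1 otherwise
    printf '%s\n' "$1" | awk -v limit="$2" '{ exit !($3 < limit) }'
}

if sum_below "SUM = 42.5" 100; then
    echo "the value is less than 100"
fi
```

Since awk compares the field numerically, the decimal part does not need to be stripped first.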
Here is one way to use sed to do this:
echo 'SUM = 137892134.0000000' | sed 's/[^0-9.]//g' | sed 's/\..*//g'
This is what the output should look like: 137892134.
Some explanation on the commands:
sed 's/[^0-9.]//g' tells sed to remove any characters that are not numbers (0-9) or periods (.)
sed 's/\..*//g' tells sed to remove the decimal point (\.) and everything after it (.*)
Also, instead of using echo, you can use the output from your original script for that first part... and then it can be piped into sed to eventually get the final "int" that you want.
Note: this does not take into account any rounding issues as brought up by William Pursell.
I suggest:
bashScript | sed 's/.* \(.*\)\.0*/\1/'
In English: "Take a bunch of stuff followed by a space, followed by something, followed by a dot and maybe some zeroes, and replace all of that with the something."

Extract numbers from strings

I have a file containing on each line a string of the form
string1.string2:\string3{string4}{number}
and what I want to extract is the number. I've searched and tried for a while to get this done using sed or bash, but failed. Any help would be much appreciated.
Edit 1: The strings may contain numbers.
$ echo 'string1.string2:\string3{string4}{number}' |\
cut -d'{' -f3 | cut -d'}' -f 1
number
Using sed:
sed 's/[^}]*}{\([0-9]*\)}/\1/' input_file
Description:
[^}]*} : match anything that is not } and the following }
{\([0-9]*\)}: capture the following digits within {...}
/\1/ : substitute all with the captured number
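Applied to a sample line, the substitution described above behaves as follows. The sample text, with 42 standing in for the number and digits inside the other strings, is made up for illustration.

```shell
# Sample line in the format from the question; the value 42 is invented.
line='str1.str2:\str3{str4}{42}'

# Everything up to the first "}" is consumed, then "{digits}" is replaced by the digits.
number=$(printf '%s\n' "$line" | sed 's/[^}]*}{\([0-9]*\)}/\1/')

printf '%s\n' "$number"   # 42
```

The digits inside str1 to str4 do not interfere, because the capture group only applies to the braces that follow the first closing brace.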
Use grep:
grep -o '{[0-9]\+}' input_file | tr -d '{}'
In bash:
sRE='[[:alnum:]]+'
nRE='[[:digit:]]+'
[[ $str =~ $sRE\.$sRE:\\$sRE\{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}
You can drop the first part of the regular expression, if your text file is sufficiently uniform:
[[ $str =~ \\$sRE\{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}
or even
[[ $str =~ \{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}

bash script to extract ALL matches of a regex pattern

I found this but it assumes the words are space separated.
result="abcdefADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLOabcdefg"
for word in $result
do
if echo $word | grep -qi '(ADDNAME\d\d.*HELLO)'
then
match="$match $word"
fi
done
POST EDITED
Re-naming for clarity:
data="abcdefADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLOabcdefg"
for word in $data
do
if echo $word | grep -qi '(ADDNAME\d\d.*HELLO)'
then
match="$match $word"
fi
done
echo $match
Original left so comments asking about result continue to make sense.
Use grep -o
-o, --only-matching show only the part of a line matching PATTERN
Edit: answer to edited question:
for string in $(echo "$result" | grep -Po "ADDNAME[0-9]{2}.*?HELLO"); do
match="${match:+$match }$string"
done
Original answer:
If you're using Bash version 3.2 or higher, you can use its regex matching.
string="string to search 99 with 88 some 42 numbers"
pattern="[0-9]{2}"
for word in $string; do
[[ $word =~ $pattern ]]
if [[ ${BASH_REMATCH[0]} ]]; then
match="${match:+$match }${BASH_REMATCH[0]}"
fi
done
The result will be "99 88 42".
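Where grep -P is not available, grep -o with an extended regular expression can still return every match, as long as the text between the two markers can be described without a non-greedy quantifier. The sketch below assumes that text is made up of lower-case letters only, and assumes a grep that supports -o (GNU and BSD grep do).

```shell
data="abcdefADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLOabcdefg"

# -o prints each match on its own line; [[:lower:]]* cannot run past HELLO.
matches=$(printf '%s\n' "$data" | grep -oE 'ADDNAME[0-9]{2}[[:lower:]]*HELLO')

printf '%s\n' "$matches"
```

For the sample data this prints ADDNAME25abcdefgHELLO twice, one match per line.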
Not very elegant - and there are problems because of greedy matching - but this more or less works:
data="abcdefADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLOabcdefg"
for word in $data \
"ADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLOabcdefg" \
"ADDNAME25abcdefgHELLOabcdefgADDNAME25abcdefgHELLO"
do
echo $word
done |
sed -e '/ADDNAME[0-9][0-9][a-z]*HELLO/{
s/\(ADDNAME[0-9][0-9][a-z]*HELLO\)/ \1 /g
}' |
while read line
do
set -- $line
for arg in "$@"
do echo $arg
done
done |
grep "ADDNAME[0-9][0-9][a-z]*HELLO"
The first loop echoes three lines of data - you'd probably replace that with cat or I/O redirection. The sed script uses a modified regex to put spaces around the patterns. The last loop breaks up the 'space separated words' into one 'word' per line. The final grep selects the lines you want.
The regex is modified with [a-z]* in place of the original .* because the pattern matching is greedy. If the data between ADDNAME and HELLO is unconstrained, then you need to think about using non-greedy regexes, which are available in Perl, Python and other modern scripting languages:
#!/bin/perl -w
while (<>)
{
while (/(ADDNAME\d\d.*?HELLO)/g)
{
print "$1\n";
}
}
This is a good demonstration of using the right tool for the job.

Resources