Trying to retrieve first 5 characters (only number & alphabet) from string in bash - bash

I have a string like this:
1-a-bc-dxyz
I want to get 1-a-bc-d (everything up to and including the fifth alphanumeric character; other characters are kept but not counted).
Thanks

With gawk:
awk '{ for ( i=1;i<=length($0);i++) { if ( match(substr($0,i,1),/[[:alnum:]]/)) { cnt++;if ( cnt==5) { print substr($0,1,i) } } } }' <<< "1-a-bc-dxyz"
Read the characters one by one; whenever a character matches the alphanumeric pattern (using the match function), increment the variable cnt. When cnt reaches 5, print the string seen so far (using the substr function).
Output:
1-a-bc-d

a='1-a-bc-dxyz'
count=0
for ((i=0;i<${#a};i++)); do
# [[:alnum:]] replaces the original [0-9]|[a-Z]; a-Z is not a valid range in most locales
if [[ "${a:$i:1}" =~ [[:alnum:]] ]] && [[ $((++count)) -eq 5 ]]; then
echo "${a:0:$((i+1))}"
break # stop the loop without exiting the calling shell
fi
done
You can shorten this further:
a='1-a-bc-dxyz'
count=0
for ((i=0;i<${#a};i++)); do [[ "${a:$i:1}" =~ [[:alnum:]] ]] && [[ $((++count)) -eq 5 ]] && echo "${a:0:$((i+1))}"; done

Using GNU awk:
$ echo 1-a-bc-dxyz | \
awk -F '' '{b=i="";while(gsub(/[0-9a-z]/,"&",b)<5)b=b $(++i);print b}'
1-a-bc-d
Explained:
awk -F '' '{ # separate each char to its own field
b=i="" # if you have more than one record to process
while(gsub(/[0-9a-z]/,"&",b)<5) # using gsub for counting (adjust regex if needed)
b=b $(++i) # gather buffer
print b # print buffer
}'

GNU sed supports an option to replace the k-th occurrence and all after that.
echo "1-a-bc-dxyz" | sed 's/[^a-zA-Z0-9]*[a-zA-Z0-9]//g6'
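The pattern that the sed command deletes from the sixth occurrence onward (a run of non-alphanumerics followed by one alphanumeric) can also be counted with bash's own regex operator. This is only a minimal sketch, not taken from the answers above; the function name first_alnum_prefix and its optional count argument are made up for illustration.

```bash
#!/usr/bin/env bash
# first_alnum_prefix STRING [N]   (hypothetical helper)
# Print the prefix of STRING that ends at its N-th alphanumeric
# character (default 5). Prints nothing and returns 1 if there are fewer.
first_alnum_prefix() {
    local s=$1 n=${2:-5}
    # N repetitions of "any non-alphanumerics, then one alphanumeric"
    local re="^([^[:alnum:]]*[[:alnum:]]){$n}"
    [[ $s =~ $re ]] && printf '%s\n' "${BASH_REMATCH[0]}"
}

first_alnum_prefix '1-a-bc-dxyz'    # prints 1-a-bc-d
```

Keeping the regex in a variable avoids quoting problems inside [[ ... =~ ... ]], and no external process is started.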

Using Combination of sed & AWK
echo 1-a-bc-dxyz | sed 's/[-*%$##]//g' | awk -F '' {'print $1$2$3$4$5'}
You can use for loop for printing character as well.

echo '1-a-bc-dxyz' | grep -Eo '^[[:print:]](-*[[:print:]]){4}'
That is pretty simple.
Neither sed nor awk.

how to change words with the same words but with number at the back bash

I have a file for example with the name file.csv and content
adult,REZ
man,BRB
women,SYO
animal,HIJ
and a command line whose remaining arguments are neither directories nor files:
file.csv BRB1 REZ3 SYO2
For each capitalized word on that line (e.g. BRB1), I want to find the matching row in the file and take the nth letter of that row's first word, where n is the number at the end of the capitalized word.
The output should then be
umo
I know that I can get over the line with
for i in "${@:2}"
do
words+=$(echo "$i ")
done
and then the output is
REZ3 BRB1 SYO2
Using awk:
Pass the string of values as an awk variable and then split them into an array a. For each record in file.csv, iterate this array and if the second field of current record matches the first three characters of the current array value, then strip the target character from the first field of the current record and append it to a variable. Print the value of the aggregated variable.
awk -v arr="BRB1 REZ3 SYO2" -F, 'BEGIN{split(arr,a," ")} {for (v in a) { if ($2 == substr(a[v],1,3)) {n=substr(a[v],length(a[v]),1); w=w""substr($1,n,1) }}} END{print w}' file.csv
umo
You can also put this into a script:
#!/bin/bash
words="${2}"
src_file="${1}"
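awk -v arr="$words" -F, 'BEGIN { split(arr, a, " ") }
{
for (v in a) {
# awk strings are 1-based, so the three-letter key starts at position 1
if ($2 == substr(a[v], 1, 3)) {
n = substr(a[v], length(a[v]), 1) # trailing digit = letter position
w = w substr($1, n, 1) # append that letter of the first field
}
}
}
END { print w }' "$src_file"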
Script execution:
./script file.csv "BRB1 REZ3 SYO2"
umo
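For comparison, the same lookup can be sketched in bash alone with an associative array (bash 4 or later). This is a rough illustration rather than a drop-in replacement: the function name pick_letters is invented here, and it assumes every argument is capital letters followed by a positive number without leading zeros.

```bash
#!/usr/bin/env bash
# pick_letters FILE CODE...   (hypothetical helper)
# CODE looks like BRB1: a key from the second CSV column plus a 1-based
# letter position in the first column.
pick_letters() {
    local file=$1; shift
    local -A pos=()
    local arg key word code n out=
    for arg in "$@"; do
        # split "BRB1" into key "BRB" and position "1"; skip anything else
        [[ $arg =~ ^([A-Z]+)([0-9]+)$ ]] || continue
        key=${BASH_REMATCH[1]}
        pos[$key]=${BASH_REMATCH[2]}
    done
    # "|| [[ -n $word ]]" also handles a last line without a trailing newline
    while IFS=, read -r word code || [[ -n $word ]]; do
        [[ -n $code && -n ${pos[$code]+set} ]] || continue
        n=${pos[$code]}
        out+=${word:n-1:1}      # substring offsets are 0-based
    done < "$file"
    printf '%s\n' "$out"
}
```

Called as pick_letters file.csv BRB1 REZ3 SYO2 on the sample file, it prints umo. As with the awk version, the letters come out in file order, not in argument order.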
This is a way using sed.
Create a pattern string from command arguments and convert lines with sed.
#!/bin/bash
file="$1"
pat='s/^/ /;Te;'
for i in "${@:2}"; do
pat+=$(echo $i | sed 's#^\([^0-9]*\)\([0-9]*\)$#s/.\\{\2\\}\\(.\\).*,\1$/\\1/;#')
done
pat+='Te;H;:e;${x;s/\n//g;p}'
eval "sed -n '$pat' $file"
Try this code:
#!/bin/bash
declare -A idx_dic
filename="$1"
pattern_string=""
for i in "${@:2}";
do
pattern_words=$(echo "$i" | grep -oE '[A-Z]+')
index=$(echo "$i" | grep -oE '[0-9]+')
pattern_string+=$(echo "$pattern_words|")
idx_dic["$pattern_words"]="$index"
done
pattern_string=${pattern_string%|*}
while IFS= read -r line
do
line_pattern=$(echo $line | grep -oE $pattern_string)
[[ -n $line_pattern ]] && line_index="${idx_dic[$line_pattern]}" && echo $line | awk -v i="$line_index" '{split($0, chars, ""); printf("%s", chars[i]);}'
done < $filename
First, extract the capitalized word and its index from each argument.
Then build the whole pattern string by joining the words with |.
Finally, iterate over every line of the file, match it against the pattern string, and print the letter at the corresponding index.
Run script.sh like this:
bash script.sh file.csv BRB1 REZ3 SYO2

Getting word index by a delimiter in a variable

Given line, delimiter, and word I want to get the index place of that word in the line based on the delimiter. As simple/short as possible. So for:
line="this-is-a-line_with-some.txt"
delimiter="-"
word="some"
echo <code goes here>
# should come out as 4
Of course I can split it with an array, and print the first occurrence of the word with a for loop, as follows:
line="this-is-a-line_with-some.txt"
delimiter="-"
word="some"
index=0
IFS="$delimiter" read -ra ary <<<"$line"
for i in "${ary[@]}"; do
if [[ $i == ${word}* ]]; then echo $index ; break ; fi
index=$((index+1))
done
But I'm sure there is a simpler solution.
Replace delimiter with newline and get line numbers with grep.
<<<"$line" tr "$delimiter" '\n' | grep -n "$word" | cut -d: -f1
Minus 1:
<<<"$line" tr "$delimiter" '\n' | grep -n "$word" | cut -d: -f1 | awk '{print $1 - 1}'
# shorter
<<<"$line" tr "$delimiter" '\n' | grep -n "$word" | awk -F: '{print $1-1}'
Or just use awk on its own:
<<<"$line" awk -v RS="$delimiter" -v word="$word" '$0 ~ word{print NR-1}'
Understanding from OP's code and/or comments:
looking for the first occurrence of a ${delimiter}-delimited field that starts with ${word}
location index is 0-based
if ${word} is not found we generate no output
OP's code can be further reduced by using the array's 0-based index (ie, eliminate the need for the index variable):
IFS="$delimiter" read -ra ary <<<"$line"
for i in "${!ary[@]}"
do
[[ "${ary[i]}" == ${word}* ]] && echo "${i}" && break
done
# line="this-is-a-line_with-some.txt"
4
# line="a-some_def-xy-some.pdf"
1
NOTE: if ${word} is not found this will generate no output
A variation on this parameter substitution solution from superuser:
newline="${line%%${word}*}" # truncate string from 1st occurrence of ${word}
if [[ "${newline}" != "${line}" ]] # if strings are different then we found ${word}
then
IFS="${delimiter}" words_before=( ${newline} ) # break remaining string by "${delimiter}" and
# store in array words_before[]
echo "${#words_before[@]}" # number of array entries == index of 1st occurrence of ${word}
fi
fi
# line="this-is-a-line_with-some.txt"
4
# line="a-some_def-xy-some.pdf"
1
NOTE: if ${word} is not found this will generate no output
One awk idea:
awk -F"${delimiter}" -v ptn="${word}" '{for (i=1;i<=NF;i++) if (index($i,ptn) == 1) {print i-1; exit}}' <<< "${line}"
# line="this-is-a-line_with-some.txt"
4
# line="a-some_def-xy-some.pdf"
1
Or using an inline replacement for ptn/${word}:
awk -F"${delimiter}" '{for (i=1;i<=NF;i++) if ($i ~ /^'"${word}"'/) {print i-1; exit}}' <<< "${line}"
# line="this-is-a-line_with-some.txt"
4
# line="a-some_def-xy-some.pdf"
1
NOTE: if ${word} is not found these awk scripts will generate no output
To get ideas for the truly shortest piece of code OP could try posting @ codegolf, though the really short answers will likely require locating/installing new software (libs and/or binaries)
A solution without loop or external tool :
line="$delimiter$line"; lin2="${line%$delimiter$word*}"
if test "$lin2" != "$line"; then
IFS="$delimiter" read -ra ary <<<"${lin2#$delimiter}"
echo ${#ary[@]}
fi
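The loop-free idea above can be wrapped in a function so that line and IFS are not modified in the calling shell. This is a sketch under a few assumptions: the name word_index is invented, the delimiter is a single character, and the line contains no empty fields (consecutive delimiters). It also uses %% where the snippet above uses %, so that it reports the first matching field, not the last.

```bash
#!/usr/bin/env bash
# word_index LINE DELIMITER WORD   (hypothetical helper)
# Print the 0-based index of the first DELIMITER-separated field that
# starts with WORD; print nothing and return 1 if there is none.
word_index() {
    local line=$1 delimiter=$2 word=$3
    local padded=$delimiter$line
    # cut everything from the first "<delimiter><word>" onward
    local head=${padded%%"$delimiter$word"*}
    if [[ $head == "$padded" ]]; then
        return 1                      # nothing was cut: WORD not found
    fi
    head=${head#"$delimiter"}         # drop the delimiter added in front
    local -a fields=()
    if [[ -n $head ]]; then
        IFS=$delimiter read -ra fields <<< "$head"
    fi
    echo "${#fields[@]}"              # fields before the match == its index
}

word_index 'this-is-a-line_with-some.txt' - some    # prints 4
word_index 'a-some_def-xy-some.pdf' - some          # prints 1
```

Prefixing the line with the delimiter is what restricts the match to the start of a field, so line_with-some.txt does not match on the "some" hidden inside another word.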

How to add a hyphen after every fifth character of a word in bash

Given "ABCDEFGHIJKLMNOPQRSTUVWXY"
How does one achieve this outcome? "ABCDE-FGHIJ-KLMNO-PQRST-UVWXY"
With sed you can do this by first adding a - after every 5 characters, then removing the trailing - at the end of the line:
$ sed -E 's/.{5}/&-/g; s/-$//' <<<"ABCDEFGHIJKLMNOPQRSTUVWXY"
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
In extended (-E) mode:
.{5} matches any 5 characters
&- replaces with the whole match (the 5 characters) plus -
Then the second substitution command matches - at the end of the line ($) and replaces with nothing.
With GNU awk, one option would be to use FPAT to define the way the line is interpreted as a series of fields, then add - between each field:
$ awk -v FPAT='.{5}' -v OFS='-' '{ $1 = $1 } 1' <<<"ABCDEFGHIJKLMNOPQRSTUVWXY"
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
The field pattern FPAT is defined as any 5 characters and the Output Field Separator OFS is defined as -. $1 = $1 "touches" every line, causing it to be reformatted (without this part, nothing would happen). 1 is the shortest true condition causing each line to be printed.
It's not too difficult to do this in bash either:
#!/bin/bash
input="ABCDEFGHIJKLMNOPQRSTUVWXY"
parts=()
# build an array from slices of length 5
for (( i = 0; i < ${#input}; i += 5 )) do
parts+=( "${input:i:5}" )
done
# join the array on IFS (use a subshell to avoid modifying IFS for rest of script)
( IFS=-; echo "${parts[*]}" )
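The same slicing loop can be sketched as a reusable function with the group size and separator as parameters. The name group_chars and its defaults are assumptions for illustration; it builds the result by string concatenation, so IFS is never touched.

```bash
#!/usr/bin/env bash
# group_chars STRING [SIZE] [SEPARATOR]   (hypothetical helper)
# Insert SEPARATOR (default "-") between every SIZE characters (default 5).
group_chars() {
    local input=$1 size=${2:-5} sep=${3:--}
    local out= i
    (( size > 0 )) || return 1        # a zero step would never terminate
    for (( i = 0; i < ${#input}; i += size )); do
        # ${out:+$sep} adds the separator only when out is already non-empty
        out+=${out:+$sep}${input:i:size}
    done
    printf '%s\n' "$out"
}

group_chars ABCDEFGHIJKLMNOPQRSTUVWXY    # prints ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
group_chars ABCDEFG 3 .                  # prints ABC.DEF.G
```

Because the separator is added before each slice except the first, there is no trailing separator to strip afterwards.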
Could you please try following.
echo "ABCDEFGHIJKLMOPQRSTUVWXY" | sed 's/...../&-/g;s/-$//'
A simple solution for only letters will be
sed -E 's/[A-Z]{4}./&-/g' file.txt
The output will be:
ABCDE-FGHIJ-KLMOP-QRSTU-VWXY
if you want them to include more than capital letters just do a:
sed -E 's/[A-Za-z]{4}./&-/g' file.txt
Try this
#!/bin/bash
s="ABCDEFGHIJKLMNOPQRSTUVWXY"
a=($(echo ${s} | grep -o .))
o=""
i=0
while [[ ${i} -lt ${#a[@]} ]]; do
o="${o}${a[${i}]}"
(( i++ ))
[[ $(( i % 5 )) -eq 0 ]] && [[ ${i} -ne ${#a[@]} ]] && o="${o}-"
done
echo ${o}
exit 0
another solution with fold/paste
$ echo {A..Y} | tr -d ' ' | # this is to generate the string
fold -w5 | paste -sd-
ABCDE-FGHIJ-KLMNO-PQRST-UVWXY
This might work for you (GNU sed):
sed 's/.\{5\}\B/&-/g' file
Insert a hyphen every five characters as long as the fifth character is inside a word.
Yet another choice
perl -pe 's/(.{5})(?=.)/$1-/g' file
Match 5 characters that are followed by another character (to avoid the trailing hyphen problem)

Extracting a substring from a variable using bash script

I have a bash variable with value something like this:
10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
There are no spaces within value. This value can be very long or very short. Here pairs such as 65:3.0 exist. I know the value of a number from the first part of pair, say 65. I want to extract the number 3.0 or pair 65:3.0. I am not aware of the position (offset) of 65.
I will be grateful for a bash-script that can do such extraction. Thanks.
awk is probably the most straightforward approach:
awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
3.0
Or to get the pair:
$ awk -F: -v RS=',' '$1==65' <<< "$var"
65:3.0
Here's a pure Bash solution:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
while read -r -d, i; do
[[ $i = 65:* ]] || continue
echo "$i"
done <<< "$var,"
You may use break after echo "$i" if there's only one 65:... in var, or if you only want the first one.
To get the value 3.0: echo "${i#*:}".
Other (pure Bash) approach, without parsing the string explicitly. I'm assuming you're only looking for the first 65 in the string, and that it is present in the string:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
value=${var#*,65:}
value=${value%%,*}
echo "$value"
This will be very slow for long strings!
Same as above, but will output all the values corresponding to 65 (or none if there are none):
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
tmpvar=,$var
while [[ $tmpvar = *,65:* ]]; do
tmpvar=${tmpvar#*,65:}
echo "${tmpvar%%,*}"
done
Same thing, this will be slow for long strings!
The fastest I can obtain in pure Bash is my original answer (and it's fine with 10000 fields):
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
IFS=, read -ra ary <<< "$var"
for i in "${ary[@]}"; do
[[ $i = 65:* ]] || continue
echo "$i"
done
In fact, no, the fastest I can obtain in pure Bash is with this regex:
var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
[[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"
Test of this vs awk,
where the 65:3.0 is at the end:
printf -v var '%s:3.0,' {100..11000}
var+=65:42.0
time awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
shows 0m0.020s (rough average) whereas:
time { [[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"; }
shows 0m0.008s (rough average too).
where the 65:3.0 is not at the end:
printf -v var '%s:3.0,' {1..10000}
time awk -F: -v RS=',' '$1==65{print $2}' <<< "$var"
shows 0m0.020s (rough average) and with early exit:
time awk -F: -v RS=',' '$1==65{print $2;exit}' <<< "$var"
shows 0m0.010s (rough average) whereas:
time { [[ ,$var, =~ ,65:([^,]+), ]] && echo "${BASH_REMATCH[1]}"; }
shows 0m0.002s (rough average).
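The regex lookup timed above can be wrapped in a function that takes the key as an argument. This is a sketch, not part of the original answer: the name lookup_value is made up, and the key is assumed to be a plain number (no regex metacharacters).

```bash
#!/usr/bin/env bash
# lookup_value LIST KEY   (hypothetical helper)
# LIST is a comma-separated string of key:value pairs. Print the value of
# the first pair whose key equals KEY; return 1 if there is none.
lookup_value() {
    local list=$1 key=$2
    # the surrounding commas make the first and last pair match like any other
    local re=",${key}:([^,]+),"
    [[ ,$list, =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

var=10:3.0,16:4.0,32:4.0,39:2.0,65:3.0,95:4.0,110:4.0,111:4.0,2312:1.0
lookup_value "$var" 65      # prints 3.0
lookup_value "$var" 2312    # prints 1.0
```

The commas on both sides of the key also prevent a key such as 65 from matching inside a longer one such as 165.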
With grep:
grep -o '\b65\b[^,]*' <<<"$var"
65:3.0
Or
grep -oP '\b65\b:\K[^,]*' <<<"$var"
3.0
\K drops everything matched so far from the reported match, so only the text after 65: is printed. It is a Perl-compatible regex feature, enabled by grep's -P option.
Here is a GNU awk solution:
awk -vRS="(^|,)65:" -F, 'NR>1{print $1}' <<< "$var"
3.0
try
echo $var | tr , '\n' | awk '/65/'
where
tr , '\n' turn comma to new line
awk '/65/' pick the line with 65
or
echo $var | tr , '\n' | awk -F: '$1 == 65 {print $2}'
where
-F: use : as separator
$1 == 65 pick line with 65 as first field
{ print $2} print second field
Using sed
sed -e 's/^.*,\(65:[0-9.]*\),.*$/\1/' <<<",$var,"
output:
65:3.0
There are two different ways to protect against 65:3.0 being the first-in-line or last-in-line. Above, commas are added around the variable so that the pair is always surrounded by commas. Below, the GNU extension \? is used to specify zero-or-one occurrence.
sed -e 's/^.*,\?\(65:[0-9.]*\),\?.*$/\1/' <<<$var
Both handle 65:3.0 regardless of where it appears in the string.
Try egrep like below:
echo $myvar | egrep -o '\b65:[0-9]+\.[0-9]+'

How to check if word is in alphabetical order

I'd like to find a bash-only way (no sed, awk, perl, ...) to check whether a word is in alphabetical order, in other words whether every letter comes after the one before it.
example:
bdjkz is true,
ahjmno is true,
sdgla is false.
I'm already struggling just comparing ascii values for characters, so if anyone could point me in the right direction for that it would help a lot!
Thanks
Pure bash solution (no external tool used), using Parameter Expansion to address characters inside strings:
function compare () {
word=$1
for (( pos=0; pos<${#word}-1; pos++ )) ; do
[[ ${word:pos:1} < ${word:pos+1:1} ]] || return 1
done
return 0
}
Tested with
for word in bdjkz ahjmno sdgla ; do
if compare $word ; then
echo $word ordered
else
echo $word not ordered
fi
done
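Since the question mentions comparing ASCII values, here is a sketch of that route. printf prints the numeric code of a character when the argument starts with a single quote. The function name is_sorted_ascii is made up, and unlike compare above it uses >= so that repeated letters (as in "aab") still count as ordered; the codes are plain ASCII only for ASCII input.

```bash
#!/usr/bin/env bash
# is_sorted_ascii WORD   (hypothetical helper)
# Succeed if the character codes of WORD never decrease.
is_sorted_ascii() {
    local word=$1 prev=-1 code i
    for (( i = 0; i < ${#word}; i++ )); do
        # a leading ' makes printf output the numeric code of the next character
        printf -v code '%d' "'${word:i:1}"
        (( code >= prev )) || return 1
        prev=$code
    done
    return 0
}

for word in bdjkz ahjmno sdgla; do
    if is_sorted_ascii "$word"; then
        echo "$word ordered"
    else
        echo "$word not ordered"
    fi
done
# prints:
# bdjkz ordered
# ahjmno ordered
# sdgla not ordered
```

Comparing numeric codes avoids the locale-dependent collation that < inside [[ ... ]] uses for strings.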
If you can utilize other command line tools (but not awk, sed, perl), you can try:
[[ "YOURSTRING" = "$(echo "YOURSTRING" | grep -o '.' | sort -n |tr -d '\n')" ]] && \
echo "Alphabetic order"
1. [[ ... ]] tests the expression
2. "YOURSTRING" = is the string comparison
3. "$( ... )" captures the output of the inner pipeline as a string
4. echo "YOURSTRING" | grep -o '.' prints every character of "YOURSTRING" on its own line (-o '.': print only the matches for any single character - NOTE: you might need a newer version of grep for this option)
5. ... sort -n | sorts the output from 4.
6. ... tr -d '\n' rejoins the characters from 5. (by deleting the newline characters)
You can use:
p='bdjkz'
q=$(fold -w1 <<< "$p"|sort|tr -d "\n")
[[ "$p" == "$q" ]] && echo "in alphabetical order" || echo "not in alphabetical order"
s=($(echo "existingString" | grep -o .)) # put each character of input string in an array.
k=($(printf '%s\n' "${s[@]}" | sort)) # sorts the input string
if [[ "${s[*]}" == "${k[*]}" ]]; then # comparing the input string array with sorted array
echo "alphabetical"
else
echo "not alphabetical"
fi
