bash find keyword in an associative array - bash

I have incoming messages from a chat server that need to be compared against a list of keywords. I was using regular arrays, but would like to switch to associative arrays to try to increase the speed of the processing.
The list of words would be in an array called aWords and the values would be a 'type' indicator, i.e. aWords[damn]="1", with 1 being swear word in a legend to inform the user.
The issue is that I need to compare every index value with the input $line, looking for substrings. I'm trying to avoid a loop through each index value if at all possible.
From http://tldp.org/LDP/abs/html/string-manipulation.html, I'm thinking of the Substring Removal section.
${string#substring}
Deletes shortest match of $substring from front of $string.
Comparing the 'removed' string with $line may help, but will it also match words in the middle of other words, e.g. matching the keyword his inside this?
Sorry for the long-winded post, but I tried to cover all of what I'm attempting to accomplish as best I could.

# create a colon-separated string of the array keys
# you can do this once, after the array is created.
keys=$(IFS=:; echo "${!aWords[*]}")
if [[ ":$keys:" == *:"$word":* ]]; then
# $word is a key in the array
case ${aWords[$word]} in
1) echo "Tsk tsk: $word is a swear word" ;;
# ...
esac
fi
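To avoid looping over every key, one option is to loop over the words of the incoming line instead and look each one up directly in the array. The following is a minimal sketch, not a definitive implementation: the function name match_keywords and the sample keywords are assumptions made for illustration, and it requires bash 4 or later. Because only whole words are looked up, the keyword his does not match inside this.

```bash
#!/usr/bin/env bash
# Hypothetical sample keywords; the value is the 'type' indicator.
declare -A aWords=( [damn]=1 [his]=2 )

# Print "word type" for every whole word of the line that is a key in aWords.
match_keywords() {
    local line=${1,,} word          # lowercase the input (bash 4+)
    local -a words
    # Replace everything that is not a letter or digit with a space, then split.
    read -r -a words <<< "${line//[^[:alnum:]]/ }"
    for word in "${words[@]}"; do
        if [[ -n ${aWords[$word]+set} ]]; then
            printf '%s %s\n' "$word" "${aWords[$word]}"
        fi
    done
}

match_keywords "Damn, this is a test"   # prints: damn 1
```

The cost per line depends on the number of words in the line rather than on the number of keywords.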

This is the first time I have heard of associative arrays in bash. It inspired me to try to add something as well, with the chance, of course, that I completely miss the point.
Here is a code snippet. I hope I understood how it works:
declare -A SWEAR # create associative array of swear words (only once)
while read -r LINE
do
    [ -n "$LINE" ] && SWEAR["$LINE"]=X
done < "/path/to/swearword/file"
while read -r -a WORDS # read a sentence from stdin and split it into words
do
    OUTGOING="" # reset output "buffer"
    for WORD in "${WORDS[@]}" # evaluate every word in the sentence
    do
        [ -n "${SWEAR[$WORD]}" ] && WORD="XXXX"
        OUTGOING="$OUTGOING $WORD"
    done
    echo "${OUTGOING# }" # output to stdout, without the leading space
done

Related

Process contents in array based on type in shellscript

I have an array that has three types of data in it, integer, integer/integer, and the string value.
I have shown a sample below.
myarr = (2301/2320,Team Lifeline, 2311, 7650/7670, 232)
I have the following algorithm that I want to come up with.
For index in myarr
if index contains data as number1/number2; then
create an array, "mynumbers" to hold all the numbers starting from number1 to number2
else if index is a string
add it in "mystrarr"
else
add it in "myintarr"
done
For the first case, if I have an entry in myarr such as 2301/2320,
then mynumbers, as shown in the pseudocode, will have the entries {2301, 2302, ... , 2320}. I do not understand how to parse an entry in myarr and identify that it contains a /.
For the second case, I am also not sure how to identify that an entry in myarr is a string. mystrarr should have {Team Lifeline}.
For the final case, the myintarr should have {2311, 232}.
Any help would be appreciated. I am very new to shell script.
Stack Overflow is not a coding service.... but I was bored so here you go...
#!/bin/bash
myarr=(2301/2320 'Team Lifeline' 2311 7650/7670 232)
for element in "${myarr[@]}"; do
if [[ $element =~ ^[0-9]+/[0-9]+$ ]]; then
range="{${element%/*}..${element##*/}}"
mynumbers=( $(eval "echo $range") )
elif [ $element -eq $element ] 2>> /dev/null; then
intarr+=( $element )
else
strarr+=( "$element" )
fi
done
echo "mynumbers = ${mynumbers[*]}"
echo "intarr = ${intarr[*]}"
echo "strarr = ${strarr[*]}"
There is a lot to unpack here for the inexperienced, so ask questions about anything I didn't cover. Things to note:
In all assignments, there are no spaces around =.
Array assignments are of the format ( element1 element2 ... )
Appending to arrays uses the +=(...) format
Looping through array elements uses for element in "${myarr[@]}"
Note that the array generated by 7650/7670 will overwrite the array generated by 2301/2320. I assume you have some kind of plan for this array, so I didn't do anything to stop it from being overwritten.
More details
This line is validating the format for 111/222:
if [[ $element =~ ^[0-9]+/[0-9]+$ ]]; then
[[ x =~ x ]] performs a regex comparison and this regex essentially just means:
^ - beginning of the string
[0-9]+ - at least one digit
/ - a literal / character
$ - end of string
These lines are expanding your beginning and ending numbers:
range="{${element%/*}..${element##*/}}"
mynumbers=( $(eval "echo $range") )
This is maybe more complicated than it needs to be as most people try to avoid eval in general for security reasons. I'm leveraging bash's brace expansion. If you run echo {5..9}, it will output 5 6 7 8 9. This does not trigger with variables, so I cheated and used eval.
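For readers who would rather avoid eval, a C-style for loop can generate the same range. This is only a sketch under stated assumptions: the sample entry is made up, and the numbers are assumed to have no leading zeros (bash arithmetic would read those as octal).

```bash
#!/usr/bin/env bash
element=2301/2305            # hypothetical sample entry
start=${element%/*}          # text before the slash
end=${element##*/}           # text after the slash

mynumbers=()
for (( n = start; n <= end; n++ )); do
    mynumbers+=( "$n" )
done

echo "${mynumbers[*]}"       # prints: 2301 2302 2303 2304 2305
```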
This line is checking if we are dealing with an integer:
[ $element -eq $element ] 2>> /dev/null
This works by running an integer -eq (equals) comparison on the variable against itself. This will actually fail and throw an error message on anything but an integer. This is not the way it was designed to be used which is why we discard all the error messages (2>> /dev/null).
This is a nice succinct script, but is using some unconventional practices. A longer more verbose version may be better for a beginner.
You can use regular expressions to match elements that are nothing but digits, or digits/digits, and assume everything else is a string:
#!/bin/bash
myarr=(2301/2320 "Team Lifeline" 2311 7650/7670 232)
declare -a mynumbers mystrarr myintarr
for elem in "${myarr[@]}"; do
if [[ $elem =~ ^([0-9]+)/([0-9]+)$ ]]; then
mynumbers+=($(seq ${BASH_REMATCH[1]} ${BASH_REMATCH[2]}))
elif [[ $elem =~ ^[0-9]+$ ]]; then
myintarr+=($elem)
else
mystrarr+=("$elem")
fi
done
echo mynumbers is "${mynumbers[@]}"
echo myintarr is "${myintarr[@]}"
echo mystrarr is "${mystrarr[*]}"
Jason explained a lot in his (very similar; there's only so many obvious ways to do this) answer, so to expand on where ours are different:
We both use regular expressions to match the integer/integer case, but he then goes on to extract the two numbers using parameter expansion with pattern removal options, while mine captures the two integers in the regular expression, and uses the BASH_REMATCH array to access their values as well as the seq command to generate the numbers between the two.
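To see what BASH_REMATCH holds after a successful match, here is a small sketch with a made-up sample value. Element 0 is the whole match and elements 1 and 2 are the two capture groups.

```bash
#!/usr/bin/env bash
elem=7650/7653               # hypothetical sample entry
re='^([0-9]+)/([0-9]+)$'     # keeping the regex in a variable avoids quoting surprises

if [[ $elem =~ $re ]]; then
    first=${BASH_REMATCH[1]}
    last=${BASH_REMATCH[2]}
    echo "whole match: ${BASH_REMATCH[0]}, first: $first, last: $last"
fi
# prints: whole match: 7650/7653, first: 7650, last: 7653
```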

Iterate a user string in bash to add vowels to string

So I have a word list containing over 30,000 words. My goal is to make a script that takes in a word without vowels in it (example: mbnt), somehow adds vowels, and compares the result to the word list to find at least the word "ambient", though it will also find other words that would read as "mbnt" if you were to take out all of their vowels.
So far this is my bash script
f=/wordList
anyVowel=[aAeEiIoOuU]
nonVowel=[^aAeEiIoOuU]
input=$1
for (( i=0; i<${#input}; i++ ));
do
grep "${input:$i:1}$nonVowel" $f | head -10
done
However, this just returns a normal list of words containing some of the characters the user inputs. Any thoughts on what I might be doing wrong?
awk to the rescue!
$ awk -v w=whr '{a=tolower($0);
gsub(/[^a-z]/,"",a);
gsub(/[aeiou]/,"",a)}
a==w' words
where
This looks for the vowel-dropped word "whr" in the file words (make up a custom dictionary). Each line is converted to lowercase, non-alphabetic characters are filtered out, vowels are removed, and the result is compared with the given word.
Note that this is very inefficient if you're looking for many words, but perhaps can be a template for your solution.
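If many words need to be looked up, one possible approach is to strip the vowels from every dictionary word once and index the results in an associative array, so that each query becomes a single lookup. This is a rough sketch in bash 4+, not a tuned implementation: the names by_skeleton and index_words and the four sample words are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Index a word list by its vowel-free "skeleton" so each query is one lookup.
declare -A by_skeleton=()

index_words() {
    local word key
    while read -r word; do
        key=${word,,}                 # lowercase
        key=${key//[^[:alpha:]]/}     # drop non-letters
        key=${key//[aeiou]/}          # drop vowels
        [[ -n $key ]] || continue     # an empty key is not a valid subscript
        by_skeleton[$key]+="$word "
    done
}

# Sample input; a real run would redirect the word list file instead.
index_words < <(printf '%s\n' ambient where whir cat)

echo "${by_skeleton[whr]}"     # where whir
echo "${by_skeleton[mbnt]}"    # ambient
```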
Try
wordsfile=wordList
consonants=$1
# Create a regular expression that matches the input consonants with
# any number of vowels before, after, or between them
regex='^[[:space:]]*[aeiou]*'
for (( i=0; i<${#consonants}; i++ )) ; do
regex+="${consonants:i:1}[aeiou]*"
done
regex+='[[:space:]]*$'
grep -i -- "$regex" "$wordsfile"
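To make the generated pattern visible, the same loop can be wrapped in a function and tried on a few words. This is a sketch only: the function name build_regex and the sample words are assumptions, and the whitespace handling from the snippet above is left out for brevity.

```bash
#!/usr/bin/env bash
# Build a regex that allows any number of vowels around each consonant.
build_regex() {
    local consonants=$1 regex='^[aeiou]*' i
    for (( i = 0; i < ${#consonants}; i++ )); do
        regex+="${consonants:i:1}[aeiou]*"
    done
    regex+='$'
    printf '%s\n' "$regex"
}

regex=$(build_regex mbnt)
echo "$regex"    # ^[aeiou]*m[aeiou]*b[aeiou]*n[aeiou]*t[aeiou]*$

# Only "ambient" reduces to "mbnt" among these sample words.
printf '%s\n' ambient mobility combat | grep -i -- "$regex"
```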

shell script to add leading zeros in middle of file name

I have files with names like "words_transfer1_morewords.txt". I would like to ensure that the number after "transfer" is five digits, as in "words_transfer00001_morewords.txt". How would I do this with a ksh script? Thanks.
This will work in any Bourne-type/POSIX shell as long as your words and morewords don't contain digits:
file=words_transfer1_morewords.txt
prefix=${file%%[0-9]*} # words_transfer
suffix=${file##*[0-9]} # _morewords.txt
num=${file#$prefix} # 1_morewords.txt
num=${num%$suffix} # 1
file=$(printf "%s%05d%s" "$prefix" "$num" "$suffix")
echo "$file"
Use ksh's regular expression matching operation to break the filename down into separate parts, then put them back together again after formatting the number.
pre="[^[:digit:]]+" # What to match before the number
num="[[:digit:]]+" # The number to match
post=".*" # What to match after the number
[[ $file =~ ($pre)($num)($post) ]]
new_file=$(printf "%s%05d%s\n" "${.sh.match[@]:1:3}")
Following a successful match with =~, the special array parameter .sh.match contains the full match in element 0, and each capture group in order starting with element 1.
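The question asks about ksh, but for readers on bash the analogous array is BASH_REMATCH. The following is a sketch of the same idea in bash, using the sample filename from the question. The 10# prefix is an addition not present in the answers above: it forces base 10 so that a number such as 08 is not read as octal.

```bash
#!/usr/bin/env bash
file=words_transfer1_morewords.txt
re='^([^[:digit:]]+)([[:digit:]]+)(.*)$'   # prefix, number, suffix

if [[ $file =~ $re ]]; then
    pre=${BASH_REMATCH[1]}
    num=${BASH_REMATCH[2]}
    post=${BASH_REMATCH[3]}
    # 10# forces decimal interpretation of the captured digits.
    printf -v new_file '%s%05d%s' "$pre" "$((10#$num))" "$post"
fi

echo "$new_file"    # words_transfer00001_morewords.txt
```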

How to remove an element from a bash array without flattening the array

I would like to make a function that takes a bash array like this one:
a=("element zero" "element one" "element two")
and removes one element, like "element one", leaving the array like this:
a=("element zero" "element two")
such that echo ${a[1]} will print out element two and not element one.
I've seen several attempts at this, but haven't found one that did it cleanly or without breaking elements that have spaces into multiple elements. Or just setting the element to be blank (i.e. not shifting the indexes of subsequent array elements).
# initial state
a=( "first element" "second element" "third element" )
# to remove
unset a[0]
# to reindex, such that a[0] is the old a[1], rather than having the array
# start at a[1] with no a[0] entry at all
a=( "${a[@]}" )
# to print the array with its indexes, to check its state at any stage
declare -p a
...now, for a function, if you have bash 4.3, you can use namerefs to do this without any eval at all:
remove() {
local -n _arr=$1 # underscore-prefixed name to reduce collision likelihood
local idx=$2
unset "_arr[$idx]" # remove the undesired item
_arr=( "${_arr[@]}" ) # renumber the indexes
}
For older versions of bash, it's a bit stickier:
remove() {
local cmd
unset "$1[$2]"
printf -v cmd '%q=( "${%q[@]}" )' "$1" "$1" && eval "$cmd"
}
The use of printf with %q format strings is a bit of paranoia -- it makes it harder for maliciously chosen values (in this case, variable names) to perform actions of their choice, as opposed to simply failing with no effect.
All that said -- it's better if you don't renumber your arrays. If you leave off the renumbering step, such that after deleting entry a[1] you simply have a sparse array with no content at that index (which is different from an empty string at that index -- bash "arrays" are actually stored as linked lists or hash tables [in the associative case], not as arrays at all, so sparse arrays are memory-efficient), the delete operation is much faster.
This doesn't break your ability to iterate over your arrays if you ask the array for its keys rather than supplying them externally, as in:
for key in "${!a[@]}"; do
value="${a[$key]}"
echo "Entry $key has value $value"
done
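The difference between a sparse array and a renumbered one can be checked directly. This is a small sketch using the array from the question:

```bash
#!/usr/bin/env bash
a=( "element zero" "element one" "element two" )

unset 'a[1]'                          # quoted so the shell cannot glob a[1]
echo "count: ${#a[@]}"                # count: 2
echo "keys: ${!a[*]}"                 # keys: 0 2   (sparse: no index 1)

a=( "${a[@]}" )                       # renumber the indexes
echo "keys after reindex: ${!a[*]}"   # keys after reindex: 0 1
echo "a[1]: ${a[1]}"                  # a[1]: element two
```

The double quotes around ${a[@]} are what keep elements containing spaces intact during the renumbering.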
remove() {
eval "$1=( \"\${$1[@]:0:$2}\" \"\${$1[@]:$(($2+1))}\" )"
}
Can be called like this:
a=("element zero" "element one" "element two")
remove a 1
echo ${a[@]} #element zero element two
echo ${a[1]} #element two
This will leave blank array elements.
a=("element zero" "element one" "" "element three")
remove a 1
echo ${a[@]} #element zero element three
echo ${a[1]} #
echo ${a[2]} #element three
This will flatten unset elements in sparse arrays.

stopping 'sed' after match found on a line; don't let sed keep checking all lines to EOF

I have a text file in which the first block of text on each line is separated by a tab from a second block of text, like so:
VERBS, AUXILIARY. "Be," subjunctive and quasi-subjunctive Be, Beest, &c., was used in A.-S. (beon) generally in a future sense.
In case it is hard to tell, tab is long space between "quasi-subjunctive" and "Be".
So, off the top of my head, I am thinking of a 'for' loop in which a var is set using 'sed' to read the first block of text of a line, up to and including the tab (or not, it doesn't really matter), and then the var is used to find subsequent matches, adding a "(x)" right before the tab to make the line unique. The 'x' would of course be a running counter, numbering the first instance '1' and each subsequent match one number higher.
One problem I see is stopping 'sed' after each match so the counter can be incremented. Is there a way to do this, since it is sed's normal behaviour (as far as I know) to continue through without stopping until all lines are processed?
You can set the IFS to TAB character and read the line into variables. Something like:
$ while IFS=$'\t' read block1 block2;do
echo "block1 is $block1"
echo "block2 is $block2"
done < file
block1 is VERBS, AUXILIARY. "Be," subjunctive and quasi-subjunctive
block2 is Be, Beest, &c., was used in A.-S. (beon) generally in a future sense.
Ok so I got the job done with this little (or perhaps big if too much overkill?) script I whipped up:
#!/bin/bash
sedLnCnt=1
while [[ "$sedLnCnt" -lt 521 ]] ; do
lN=$(sed -n "${sedLnCnt} p" sGNoSecNums.html|sed -r 's/^([^\t]*\t).*$/\1/') #; echo "\$lN: $lN"
lnNum=($(grep -n "$lN" sGNoSecNums.html|sed -r 's/^([0-9]+):.*$/\1/')) #; echo "num of matches: ${#lnNum[@]}"
if [[ "${#lnNum[@]}" -gt 1 ]] ; then #'if'
lCnt="${#lnNum[@]}"
((eleN = $lCnt-1)) #; echo "\$eleN: ${eleN}" # var $eleN needs to be 1 less than total line count of zero-based array
while [[ "$lCnt" -gt 0 ]] ; do
sed -ri "${lnNum[$eleN]}s/^([^\t]*)\t/\1 \(${lCnt}\)\t/" sGNoSecNums.html
((lCnt--))
((eleN--))
done
fi
((sedLnCnt++))
done
Grep was the perfect way to find line numbers of matches, jamming them into an array and then editing each line appending the unique identifier.
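As an alternative to calling sed and grep once per line, the numbering could be done in two passes over the file using associative arrays as counters. This is a sketch rather than a drop-in replacement: the function name number_duplicates is an assumption, every line is assumed to contain exactly one tab-separated pair of blocks, and the result is written to stdout instead of editing the file in place. It requires bash 4 or later.

```bash
#!/usr/bin/env bash
# Append " (n)" to the first block of every line whose first block is repeated.
number_duplicates() {
    local file=$1 key rest
    local -A total=() seen=()

    # Pass 1: count how often each first block occurs.
    while IFS=$'\t' read -r key rest; do
        [[ -n $key ]] && total[$key]=$(( ${total[$key]:-0} + 1 ))
    done < "$file"

    # Pass 2: print each line, numbering only the repeated first blocks.
    while IFS=$'\t' read -r key rest; do
        if [[ -n $key ]] && (( ${total[$key]} > 1 )); then
            seen[$key]=$(( ${seen[$key]:-0} + 1 ))
            printf '%s (%d)\t%s\n' "$key" "${seen[$key]}" "$rest"
        else
            printf '%s\t%s\n' "$key" "$rest"
        fi
    done < "$file"
}
```

It could be called as number_duplicates sGNoSecNums.html > numbered.html, where numbered.html is a hypothetical output name.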

Resources