Iterate a user string in bash to add vowels to string - bash

So I have a word list containing over 30,000 words. My goal is to make a script that takes in a word with its vowels removed (example: mbnt), adds vowels back in, and compares the result against the word list to find at least the word "ambient", though it will also find any other words that would read as "mbnt" if you took out all of their vowels.
So far this is my bash script
f=/wordList
anyVowel=[aAeEiIoOuU]
nonVowel=[^aAeEiIoOuU]
input=$1
for (( i=0; i<${#input}; i++ ));
do
grep "${input:$i:1}$nonVowel" $f | head -10
done
However, this just returns an ordinary list of words that contain some of the characters the user typed. Any thoughts on what I might be doing wrong?

awk to the rescue!
$ awk -v w=whr '{a=tolower($0);
gsub(/[^a-z]/,"",a);
gsub(/[aeiou]/,"",a)}
a==w' words
where
This looks for the vowel-dropped word "whr" in the file words (make up a custom dictionary). Each line is converted to lowercase, non-alphabetic characters are filtered out and vowels are removed; the line is printed if what remains equals the given word.
Note that this is inefficient if you're looking up many words, because every lookup rescans the whole file, but perhaps it can be a template for your solution.
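If many words need to be looked up, one option is to compute each dictionary word's vowel-less form once and index the list by it. The following is a minimal bash sketch of that idea, not part of the answer above. The names skeleton, build_index and index are made up for illustration, and it assumes bash 4 or later (associative arrays, ${var,,}) and an ASCII word list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lowercase a word, drop non-letters, then drop vowels.
skeleton() {
    local w=${1,,}
    w=${w//[^a-z]/}
    printf '%s' "${w//[aeiou]/}"
}

declare -A index    # skeleton -> space-separated list of matching words

# Hypothetical helper: read words from stdin and file each under its skeleton.
build_index() {
    local word key
    while IFS= read -r word; do
        key=$(skeleton "$word")
        [[ -n $key ]] || continue      # words made only of vowels have no key
        index[$key]+="$word "
    done
}

# Demo with an inline list; for a real run use: build_index < wordList
build_index <<< $'ambient\nwhere\nwhore\na'
echo "${index[mbnt]}"
echo "${index[whr]}"
```

Calling skeleton through $(...) forks a subshell per word, so for a very large list the three expansions could be inlined into the loop instead.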

Try
wordsfile=wordList
consonants=$1
# Create a regular expression that matches the input consonants with
# any number of vowels before, after, or between them
regex='^[[:space:]]*[aeiou]*'
for (( i=0; i<${#consonants}; i++ )) ; do
regex+="${consonants:i:1}[aeiou]*"
done
regex+='[[:space:]]*$'
grep -i -- "$regex" "$wordsfile"
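To see what pattern the loop above produces, the construction can be wrapped in a function and tried against a few words with bash's own =~ operator. This is only a sketch: vowel_regex is a made-up name, the sample words are arbitrary, the whitespace handling is left out, and it assumes the input contains only letters (regex metacharacters in the input are not escaped).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a regex that allows any number of vowels
# before, between, and after the given consonants.
vowel_regex() {
    local consonants=$1 regex='^[aeiou]*' i
    for (( i = 0; i < ${#consonants}; i++ )); do
        regex+="${consonants:i:1}[aeiou]*"
    done
    printf '%s$' "$regex"
}

re=$(vowel_regex mbnt)
echo "$re"
# Lowercase each candidate before matching, mirroring grep -i.
for word in ambient Mobnet cabinet; do
    if [[ ${word,,} =~ $re ]]; then
        echo "match: $word"
    fi
done
```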

Related

Shell Bash Replace or remove part of a number or string

Good day.
Everyday i receive a list of numbers like the example below:
11986542586
34988745236
2274563215
4532146587
11987455478
3652147859
As you can see, some of them have a 9 as the third digit (11 digits total) and some don't (10 digits total). That's because the ones with the extra 9 are in the new Brazilian mobile number format and the ones without it are in the old format.
The thing is that I have to use the numbers in both formats as a parameter for another script, and I usually have to do this by hand.
I am trying to create a script that reads the length of a mobile number, adds or removes the "9" depending on whether the digit-count condition is met, and saves the result in a separate file.
So far I am only able to check the length; I don't know how to add or remove the "9" as the third digit.
#!/bin/bash
Numbers_file="/FILES/dir/dir2/Numbers_File.txt"
while IFS= read -r Numbers
do
LEN=${#Numbers}
if [ "$LEN" -eq 11 ]; then
echo "length = $LEN"
elif [ "$LEN" -eq 10 ]; then
echo "length = $LEN"
else
echo "error"
fi
done < "$Numbers_file"
You can delete the third character of any string with sed as follows:
sed 's/.//3'
Example:
echo "11986542586" | sed 's/.//3'
1186542586
To add a 9 as the third character, append it to the second character:
echo "2274563215" | sed 's/./&9/2'
22974563215
If you are absolutely sure about the occurrence happening only at the third position, you can use an awk statement as below,
awk 'substr($0,3,1)=="9"{$0=substr($0,1,2)substr($0,4,length($0))}1' file
1186542586
3488745236
2274563215
4532146587
1187455478
3652147859
This uses the POSIX-compliant substr() function: only lines that have a 9 in the 3rd position are processed, and the record is rebuilt from the characters before and after that digit.
substr(s, m[, n ])
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s
There are lots of text manipulation tools that will do this, but the lightest weight is probably cut, because selecting characters by position is all it does.
cut -c4 would give you just the 4th character; GNU cut also has an invert option, so adding --complement gives you everything but character 4 (use -c3 to drop the third digit of the numbers above).
echo 1234567890 | cut -c4 --complement
123567890
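Putting the length check from the question together with the add/remove step, a pure bash sketch could look like the following. The function name toggle_ninth_digit is hypothetical, and it assumes every valid line is either a 10-digit old-format number or an 11-digit number whose third digit is 9.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number in the other format.
toggle_ninth_digit() {
    local n=$1
    if [[ $n =~ ^[0-9]{2}9[0-9]{8}$ ]]; then
        printf '%s\n' "${n:0:2}${n:3}"      # 11 digits, 9 in third place: remove it
    elif [[ $n =~ ^[0-9]{10}$ ]]; then
        printf '%s\n' "${n:0:2}9${n:2}"     # 10 digits: insert a 9 after the first two
    else
        printf 'error: %s\n' "$n" >&2
        return 1
    fi
}

# Demo on two of the sample numbers; redirect the loop's output to save it to a file.
while IFS= read -r number; do
    toggle_ninth_digit "$number"
done <<'EOF'
11986542586
2274563215
EOF
```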

Extract substrings with certain length randomly from a file with Bash

I have multiple text files, and from each file I need to extract random contiguous substrings with a certain length.
For example, I need to extract five random substrings that consist of 3 contiguous characters each, or 4 random substrings that consist of 20 characters each.
In practice, let's assume this is the content of one of the files
Welcome to stackoverflow the best technical resource ever
So if I want five random substrings that consist of 3 characters each, I expect output that looks like this, for example:
elc
sta
tec
res
rce
Your help would be much appreciated.
awk to the rescue!
awk -v n=5 -v s=3 'BEGIN {srand()}
{len=length($0);
for(i=1;i<=n;i++)
# k is a random integer start position in 1..len-s+1
{k=int(rand()*(len-s+1))+1; printf "%s\t", substr($0,k,s)}
print ""}' file
there may be spaces in the extracted substrings
Create a function to pick a random substring:
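random_string() {
local line=$1
local length=$2
# a line shorter than the requested length cannot yield a substring
(( ${#line} >= length )) || return 1
# pick a random start position that still leaves room for a substring of the given length
local start=$((RANDOM % (${#line} - length + 1)))
# use Bash substring expansion to extract the substring
printf '%s' "${line:start:length}"
}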
Use the function in a loop:
#!/bin/bash
while IFS= read -r line; do
random1=$(random_string "$line" 3)
random2=$(random_string "$line" 20)
printf 'random1=[%s], random2=[%s]\n' "$random1" "$random2"
done < file
Sample output with the content Welcome to stackoverflow the best technical resource ever in file:
random1=[hni], random2=[low the best technic]
random1=[sta], random2=[e best technical res]
random1=[ove], random2=[ackoverflow the best]
random1=[rfl], random2=[echnical resource ev]
random1=[ech], random2=[est technical resour]
random1=[cal], random2=[ome to stackoverflow]
random1=[tec], random2=[o stackoverflow the ]
random1=[l r], random2=[come to stackoverflo]
random1=[erf], random2=[ stackoverflow the b]
random1=[me ], random2=[ the best technical ]
random1=[est], random2=[ckoverflow the best ]
random1=[tac], random2=[tackoverflow the bes]
random1=[e t], random2=[o stackoverflow the ]
random1=[al ], random2=[come to stackoverflo]
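The question asks for a given number of substrings per line, so the count can be made a parameter as well. Below is a minimal sketch along the lines of the function above; random_substrings is a made-up name, and it assumes lines shorter than 32768 characters, since $RANDOM only yields values from 0 to 32767.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print COUNT random substrings of LEN characters from LINE.
random_substrings() {
    local line=$1 count=$2 len=$3 i start
    (( ${#line} >= len )) || return 1       # line too short for one substring
    for (( i = 0; i < count; i++ )); do
        # valid start offsets are 0 .. (line length - len), inclusive
        start=$(( RANDOM % (${#line} - len + 1) ))
        printf '%s\n' "${line:start:len}"
    done
}

# Five substrings of 3 characters from the sample sentence.
random_substrings "Welcome to stackoverflow the best technical resource ever" 5 3
```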

Replace and increment letters and numbers with awk or sed

I have a string that contains
fastcgi_cache_path /var/run/nginx-cache15 levels=1:2 keys_zone=MYSITEP:100m inactive=60m;
One of the goals of this script is to increment the two-digit number after nginx-cache, based on the value found in the previous file. To do that I used this code:
# Replace cache_path
PREV=$(ls -t /etc/nginx/sites-available | head -n1) #find the previous cache_path number
CACHE=$(grep fastcgi_cache_path $PREV | awk '{print $2}' |cut -d/ -f4) #take the string to change
SUB=$(echo $CACHE |sed "s/nginx-cache[0-9]*[0-9]/&#/g;:a {s/0#/1/g;s/1#/2/g;s/2#/3/g;s/3#/4/g;s/4#/5/g;s/5#/6/g;s/6#/7/g;s/7#/8/g;s/8#/9/g;s/9#/#0/g;t a};s/#/1/g") #increment number
sed -i "s/nginx-cache[0-9]*/$SUB/g" $SITENAME #replace number
Maybe not so elegant, but it works.
The other goal is to increment the last letter of all occurrences of MYSITEx: MYSITEP, in this case, should become MYSITEQ, and so on, and once MYSITEZ is reached another letter should be added, like MYSITEAA, MYSITEAB, etc.
I thought something like:
sed -i "s/MYSITEP[A-Z]*/MYSITEGG/g" $SITENAME
but it can't work, because MYSITEGG is a static value.
How can I calculate the last letter, increment it to the next one and, once the letter Z is reached, add another letter?
Thank you!
Perl's autoincrement works on letters as well as digits, in exactly the manner you describe.
We may as well tidy your nginx-cache increment while we're at it.
I assume SITENAME holds the name of the file to be modified?
It would look like this. I have to assign the capture $1 to an ordinary variable $n to increment it, as $1 is read-only:
perl -i -pe 's/nginx-cache\K(\d+)/ ++($n = $1) /e; s/MYSITE\K(\w+)/ ++($n = $1) /e;' $SITENAME
If you wish, this can be done in a single substitution, like this
perl -i -pe 's/(?:nginx-cache|MYSITE)\K(\w+)/ ++($n = $1) /ge' $SITENAME
Note: The solution below is needlessly complicated, because as Borodin's helpful answer demonstrates (and @stevesliva's comment on the question hinted at), Perl directly supports incrementing letters alphabetically in the manner described in the question, by applying the ++ operator to a variable containing a letter (sequence); e.g.:
$ perl -E '$letters = "ZZ"; say ++$letters'
AAA
The solution below may still be of interest as an annotated showcase of how Perl's power can be harnessed from the shell, showing techniques such as:
use of s///e to determine the replacement string with an expression.
splitting a string into a character array (split //, "....")
use of the ord and chr functions to get the codepoint of a char., and convert a(n incremented) codepoint back to a char.
string replication (x operator)
array indexing and slices:
getting an array's last element ($chars[-1])
getting all but the last element of an array (@chars[0..$#chars-1])
A perl solution (in effect a re-implementation of what ++ can do directly):
perl -pe 's/\bMYSITE\K([A-Z]+)/
@chars = split qr(), $1; $chars[-1] eq "Z" ?
"A" x (1 + scalar @chars)
:
join "", @chars[0..$#chars-1], chr (1 + ord $chars[-1])
/e' <<'EOF'
...=MYSITEP:...
...=MYSITEZP:...
...=MYSITEZZ:...
EOF
yields:
...=MYSITEQ:... # P -> Q
...=MYSITEZQ:... # ZP -> ZQ
...=MYSITEAAA:... # ZZ -> AAA
You can use perl's -i option to replace the input file with the result
(perl -i -pe '...' "$SITENAME").
As Borodin's answer demonstrates, it's not hard to solve all tasks in the question using perl alone.
The s function's /e option allows use of a Perl expression for determining the replacement string, which enables sophisticated replacements:
$1 references the current MYSITE suffix in the expression.
@chars = split qr(), $1 splits the suffix into a character array.
$chars[-1] eq "Z" tests if the last suffix char. is Z
If so: The suffix is replaced with all As, with an additional A appended
("A" x (1 + scalar @chars)).
Otherwise: The last suffix char. is replaced with the following letter in the alphabet
(join "", @chars[0..$#chars-1], chr (1 + ord $chars[-1]))
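For comparison, the carry rule that Perl's ++ applies to an uppercase letter sequence (Z rolls over to A and carries to the left, and an all-Z string grows by one letter) can also be sketched in pure bash. This is an illustration only: next_suffix is a made-up name, and the function assumes its argument consists solely of the letters A to Z.

```shell
#!/usr/bin/env bash
# Hypothetical helper: increment an uppercase letter suffix with carry.
next_suffix() {
    local s=$1 alphabet=ABCDEFGHIJKLMNOPQRSTUVWXYZ out='' i c rest
    for (( i = ${#s} - 1; i >= 0; i-- )); do
        c=${s:i:1}
        if [[ $c == Z ]]; then
            out="A$out"                     # Z rolls over to A and carries left
        else
            rest=${alphabet#*"$c"}          # the letters that follow c
            printf '%s\n' "${s:0:i}${rest:0:1}$out"
            return
        fi
    done
    printf '%s\n' "A$out"                   # every letter was Z: grow by one
}

# Apply it to the suffix found in a sample line.
line='keys_zone=MYSITEP:100m'
if [[ $line =~ MYSITE([A-Z]+) ]]; then
    old=${BASH_REMATCH[1]}
    new=$(next_suffix "$old")
    echo "${line/MYSITE$old/MYSITE$new}"
fi
```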

Split filename and get the element between first and last occurrence of underscore

I am trying to split many folder names in a for loop and extract the element between the first and last underscore of each name. Names can look like ENCSR000AMA_HepG2_CTCF or ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF.
My problem is that folder names differ from each other in the total number of underscores, so I cannot use something like:
IN=$d
folderIN=(${IN//_/ })
tf_name=${folderIN[-1]%/*} #get last element which is the TF name
cell_line=${folderIN[-2]%/*}; #get second last element which is the cell line
dataset_name=${folderIN[0]%/*}; #get first element which is the dataset name
cell_line can be one or more words separated by underscores, but it's always between the first and last underscore.
Any help?
Do this with a two-step bash parameter expansion; two steps are needed because bash, unlike zsh, does not support nested parameter expansion.
Use "${string%_*}" to strip everything after the last occurrence of '_', and then "${tempString#*_}" to strip everything from the beginning up to the first occurrence of '_':
string="ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
endothelial_cell_of_umbilical_vein
Another example,
string="ENCSR000AMA_HepG2_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
HepG2
You can modify this logic to apply on each of the file-names in your folder.
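Applied in a loop, the same expansions can fill all three variables from the question at once. The sketch below uses a made-up function name, split_name, sets the question's variable names as globals, and assumes every name contains at least two underscores. The two literal names stand in for a real glob such as */.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split NAME into dataset_name, cell_line and tf_name.
split_name() {
    local name=${1%/}               # drop a trailing slash, if any
    dataset_name=${name%%_*}        # everything before the first underscore
    tf_name=${name##*_}             # everything after the last underscore
    local middle=${name#*_}         # strip through the first underscore
    cell_line=${middle%_*}          # then strip from the last underscore on
}

for d in ENCSR000AMA_HepG2_CTCF/ ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF/; do
    split_name "$d"
    printf '%s | %s | %s\n' "$dataset_name" "$cell_line" "$tf_name"
done
```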
Could use regex.
extract_words() {
[[ "$1" =~ ^([^_]+)_(.*)_([^_]+)$ ]] && echo "${BASH_REMATCH[2]}"
}
while read -r from_line
do
extracted=$(extract_words "$from_line")
echo "$from_line" "[$extracted]"
done < list_of_filenames.txt
EDIT: I moved the extraction into a standalone bash function for reuse and easy modification for more complex cases, like:
extract_words() {
perl -lnE 'say $2 if /^([^_]+)_(.*)_([^_]+)$/' <<< "$1"
}

bash find keyword in an associative array

I have incoming messages from a chat server that need to be compared against a list of keywords. I was using regular arrays, but would like to switch to associative arrays to try to increase the speed of the processing.
The list of words would be in an array called aWords and the values would be a 'type' indicator, i.e. aWords[damn]="1", with 1 being swear word in a legend to inform the user.
The issue is that I need to compare every index value with the input $line looking for substrings. I'm trying to avoid a loop thru each index value if at all possible.
From http://tldp.org/LDP/abs/html/string-manipulation.html, I'm thinking of the Substring Removal section.
${string#substring}
Deletes shortest match of $substring from front of $string.
A comparison of the 'removed' string from the $line, may help, but will it match also words in the middle of other words? i.e. matching the keyword his inside of this.
Sorry for the long-winded post, but I tried to cover all of what I'm attempting to accomplish as best I could.
# create a colon-separated string of the array keys
# you can do this once, after the array is created.
keys=$(IFS=:; echo "${!aWords[*]}")
if [[ ":$keys:" == *:"$word":* ]]; then
# $word is a key in the array
case ${aWords[$word]} in
1) echo "Tsk tsk: $word is a swear word" ;;
# ...
esac
fi
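The snippet above tests a single $word. To search a whole $line without looping over the keys, one possibility is to join the keys into a single regex alternation and let =~ find the first keyword that appears as a whole word, so that his is not matched inside this. This is a sketch under assumptions: the sample entries, alternation, keyword_re and find_keyword are made up for illustration, and the keys must not contain regex metacharacters.

```shell
#!/usr/bin/env bash
declare -A aWords=( [damn]=1 [his]=3 )    # sample data, not from the question

# Join the keys into one alternation, once, after the array is built.
alternation=$(IFS='|'; echo "${!aWords[*]}")
# A keyword must be bounded by a non-alphanumeric character or a line edge.
keyword_re="(^|[^[:alnum:]])(${alternation})([^[:alnum:]]|\$)"

# Hypothetical helper: print the first keyword found in a line and its type.
find_keyword() {
    local line=${1,,}                     # compare in lowercase
    if [[ $line =~ $keyword_re ]]; then
        local hit=${BASH_REMATCH[2]}
        printf '%s %s\n' "$hit" "${aWords[$hit]}"
    else
        return 1
    fi
}

find_keyword 'Well, damn!'
find_keyword 'this is fine' || echo "no keyword"
```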
This is the first time I've heard of associative arrays in bash. It inspired me to try to add something too, with the chance, of course, that I completely miss the point.
Here is a code snippet. I hope I understood how it works:
declare -A SWEAR #create associative array of swearwords (only once)
while IFS= read -r LINE
do
[ -n "$LINE" ] && SWEAR["$LINE"]=X
done < "/path/to/swearword/file"
while read -r -a WORDS #read a sentence from stdin, split into words
do
OUTGOING="" #reset output "buffer"
for WORD in "${WORDS[@]}" #evaluate every word in the sentence
do
[ -n "${SWEAR[$WORD]}" ] && WORD="XXXX"
OUTGOING="$OUTGOING $WORD"
done
echo "$OUTGOING" #output to stdout
done
