Finding dictionary words within a source text, using Ruby

Using Ruby, I need to output a list of words, found in a dictionary, that can be formed by eliminating letters from a source text.
E.g., if I input the source text "crazed" I want to get not only words like "craze" and "razed", whose letters are in the same order AND whose letters are adjacent to each other within the source text, but ALSO words like "rad" and "red", because those words exist and can be found by eliminating select letters from "crazed" AND the output words retain letter order. BUT, words like "dare" or "race" should not be in the output list, because the letter order of the letters in "dare" or "race" are not the same as those letters found in "crazed". (If "raed" or "crae" were words in the dictionary, they WOULD be part of the output.)
My thought was to go through the source text in a binary manner
(for "crazed", we'd get:
000001 = "d";
000010 = "e";
000011 = "ed";
000100 = "z";
000101 = "zd";
000111 = "zed";
001000 = "a";
001001 = "ad"; etc.)
and compare each result with words in a dictionary, though I don't know how to code that, nor whether that is most efficient. This is where I would greatly benefit from your help.
Also, the length of the source text would be variable; it wouldn't necessarily be six letters long (like "crazed"). Inputs would potentially be much larger (20-30 characters, possibly more).
I've searched here and found questions about anagrams and about words that can be in any letter order, but not specifically what I'm looking for. Is this even possible in Ruby? Thank you.
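For reference, the binary enumeration described in the question could be coded roughly as follows. This is only a sketch of the idea as stated; the method name masked_subsequences is made up for illustration, and it generates 2**n - 1 strings, so it is only practical for short inputs.

```ruby
# Hypothetical helper: lists every order-preserving selection of letters,
# driven by a bit mask as in the question (000001 => "d" for "crazed").
def masked_subsequences(text)
  chars = text.chars
  n = chars.size
  (1...(1 << n)).map do |mask|
    # Bit (n - 1 - i) of the mask decides whether chars[i] is kept,
    # so the lowest bit corresponds to the last letter.
    chars.each_index.select { |i| mask[n - 1 - i] == 1 }.map { |i| chars[i] }.join
  end
end

p masked_subsequences("crazed").first(5)
# => ["d", "e", "ed", "z", "zd"]
```

Each resulting string would then be looked up in the dictionary. For a 30-character input that is about a billion candidates, which is the efficiency concern raised above.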

First let's read the words of a dictionary into an array, after chomping, downcasing and removing duplicates (if, for example, the dictionary contains both "A" and "a", as does the dictionary on my Mac that I've used below).
DICTIONARY = File.readlines("/usr/share/dict/words").map { |w| w.chomp.downcase }.uniq
#=> ["a", "aa", "aal", "aalii",..., "zyzomys", "zyzzogeton"]
DICTIONARY.size
#=> 234371
The following method generates all combinations of one or more characters of a given word, respecting order, and for each, joins the characters to form a string, checks to see if the string is in the dictionary, and if it is, saves the string to an array.
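To check whether a string matches a word in the dictionary I perform a binary search, using the method Array#bsearch. This relies on the dictionary array being sorted in the order that String#<=> uses. The file is already in alphabetical order, but if downcasing could disturb that order for your dictionary, append .sort to the line that builds DICTIONARY.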
def subwords(word)
  arr = word.chars
  (1..word.size).each.with_object([]) do |n,a|
    arr.combination(n).each do |comb|
      w = comb.join
      a << w if DICTIONARY.bsearch { |dw| w <=> dw }
    end
  end
end
subwords "crazed"
# => ["c", "r", "a", "z", "e", "d",
# "ca", "ce", "ra", "re", "ae", "ad", "ed",
# "cad", "rad", "red", "zed",
# "raze", "craze", "crazed"]
Yes, that particular dictionary contains all those strings (such as "z") that don't appear to be English words.
Another example.
subwords "importance"
#=> ["i", "m", "p", "o", "r", "t", "a", "n", "c", "e",
# "io", "it", "in", "ie", "mo", "mr", "ma", "me", "po", "pa", "or",
# "on", "oe", "ra", "re", "ta", "te", "an", "ae", "ne", "ce",
# "imp", "ima", "ion", "ira", "ire", "ita", "ian", "ice", "mor", "mot",
# "mon", "moe", "man", "mac", "mae", "pot", "poa", "pon", "poe", "pan",
# "pac", "ort", "ora", "orc", "ore", "one", "ran", "tan", "tae", "ace",
# "iota", "ione", "iran", "mort", "mora", "morn", "more", "mote",
# "moan", "mone", "mane", "mace", "port", "pore", "pote", "pone",
# "pane", "pace", "once", "rane", "race", "tane",
# "impot", "moran", "morne", "porta", "ponce", "rance",
# "import", "impone", "impane", "prance",
# "portance",
# "importance"]
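For longer source texts (the question mentions 20-30 characters) the number of combinations grows as 2**n, so it may be cheaper to turn the problem around: walk the dictionary once and keep each word that is a subsequence of the source. The following is a sketch of that alternative, not the method used above; the names subsequence? and subwords_by_scan and the small sample list are made up for illustration.

```ruby
# True if the characters of `word` occur in `source` in the same order
# (not necessarily adjacent).
def subsequence?(word, source)
  pos = 0
  word.each_char do |ch|
    idx = source.index(ch, pos)   # next occurrence at or after pos
    return false if idx.nil?
    pos = idx + 1
  end
  true
end

# Keeps every dictionary word that is a subsequence of `source`.
def subwords_by_scan(source, dictionary)
  dictionary.select { |w| subsequence?(w, source) }
end

sample = %w[craze razed rad red dare race zed]
p subwords_by_scan("crazed", sample)
# => ["craze", "razed", "rad", "red", "zed"]
```

The cost here is roughly proportional to the dictionary size times the source length, regardless of how many letter combinations the source has.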

The solution below is more extensive: it also finds words that can be formed by using the letters in any order. (The question asks for letter order to be preserved, so this returns a superset of what was requested.) The catch with using combination alone to find possible subwords is that the permutations of each combination are missed. E.g., drawing from 'importance', the combination 'mpa' will arise at some point. Since this isn't a dictionary word, it is skipped, which costs us the permutation 'map', a dictionary subword of 'importance'. I agree that my method can be optimized for speed.
#steps
#split string at ''
#find combinations for n=2 all the way to n=word.size
#for each combination
#find the permutations of all the arrangements
#then
#join the array
#check to see if word is in dictionary
#and it's not already collected
#if it is, add to collecting array
require 'set'
Dictionary=File.readlines('dictionary.txt').map(&:chomp).to_set
Dictionary.size #39501
def subwords(word)
  #split string at ''
  arr=word.split('')
  #excluding single letter words
  #you can change 2 to 1 in line below to select for single letter words too
  (2..word.size).each_with_object([]) do |n,a|
    #find combinations for n=2 all the way to n=word.size
    arr.combination(n).each do |comb|
      #for each combination
      #find the permutations of all the arrangements
      comb.permutation(n).each do |perm|
        #join the array
        w=perm.join
        #check to see if word is in dictionary and it's not already collected
        if Dictionary.include?(w) && !a.include?(w)
          #if it is, add to collecting array
          a<<w
        end
      end
    end
  end
end
p subwords('crazed')
#["car", "arc", "rec", "ace", "cad", "are", "era", "ear", "rad", "red", "adz", "zed", "czar", "care", "race", "acre", "card", "dace", "raze", "read", "dare", "dear", "adze", "daze", "craze", "cadre", "cedar", "crazed"]
p subwords('battle')
#["bat", "tab", "alb", "lab", "bet", "tat", "ate", "tea", "eat", "eta", "ale", "lea", "let", "bate", "beat", "beta", "abet", "bale", "able", "belt", "teat", "tale", "teal", "late", "bleat", "table", "latte", "battle", "tablet"]
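On the speed point: one possible optimization, sketched here under my own assumptions rather than taken from the answer above, is to skip permutations entirely and compare letter counts. A word can be built from the source in any order exactly when it needs no letter more often than the source supplies it. The names letters_available? and any_order_subwords are hypothetical, and Enumerable#tally needs Ruby 2.7 or later.

```ruby
# True if `word` needs no letter more often than `source_counts` provides.
def letters_available?(word, source_counts)
  word.each_char.tally.all? { |ch, n| source_counts.fetch(ch, 0) >= n }
end

# Keeps dictionary words (of at least min_length letters) that can be
# assembled from the letters of `source`, in any order.
def any_order_subwords(source, dictionary, min_length: 2)
  counts = source.each_char.tally
  dictionary.select { |w| w.length >= min_length && letters_available?(w, counts) }
end

sample = %w[car arc dare zebra crazed a cee]
p any_order_subwords("crazed", sample)
# => ["car", "arc", "dare", "crazed"]
```

This does one pass over the dictionary instead of generating every permutation of every combination.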

Related

How to split a string of repeated characters with uneven amounts? Ruby

If I have a string such as "aabbbbccdddeffffgg" and I wanted to split the string into this array: ["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"], how would I go about that?
I know of string.split(/.../), with however many periods you put there, but that doesn't account for runs of uneven length. The point of the problem I'm working on is to take two strings and see if there are three characters in a row in one string and two in a row in the other. I tried
`letter_count_1 = {}
str1.each_char do |let|
letter_count_1[let] = str1.count(let)
end`
But that gives the count for the total amount of each character in the string, and some of the inputs are randomized with the same letter in multiple places, like, "aabbbacccdba"
So how do you split the string up by character?

You can use a regex with a back reference and the scan() method:
str = "aabbbbccdddeffffgg"
groups = []
str.scan(/((.)\2*)/) { |x| groups.push(x[0]) }
groups will look like this afterwards:
["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"]

Here is a non-regexp version
str = "aabbbbccdddeffffgg"
p str.chars.chunk(&:itself).map{|x|x.last.join} #=> ["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"]
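Since the underlying goal in the question is to check how many times a character repeats in a row, it may help to keep the run length next to each character. A small sketch along the same lines as the chunk version (the name run_lengths is made up for illustration):

```ruby
# Returns [character, run length] pairs for each run of repeated characters.
def run_lengths(str)
  str.each_char.chunk(&:itself).map { |ch, run| [ch, run.size] }
end

p run_lengths("aabbbacccdba")
# => [["a", 2], ["b", 3], ["a", 1], ["c", 3], ["d", 1], ["b", 1], ["a", 1]]

# Does any character appear at least three times in a row?
p run_lengths("aabbbacccdba").any? { |_, n| n >= 3 }
# => true
```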

Format data in string to array?

I need to convert data from a string to an array. The string looks like this:
{a,b,c{1,2,3},d,e,f{11,22,33},g}
The array that I want to receive should look like this:
[a, b, c1, c2, c3, d, e, f11, f22, f33, g]
I tried to use the split method but it works poorly.
arr = str.split(' ');
keys = arr[0][2..-2]
keys = keys.split(',')
Do you have any ideas how it could be implemented?

Here's what I'd use:
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
array = string.scan(/[a-z](?:{.+?})?/).flat_map{ |s|
  if s['{']
    prefix = s[0]
    values = s.scan(/\d+/)
    ([prefix] * values.size).zip(values).map(&:join)
  else
    s
  end
}
array # => ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
Here's how it works:
string.scan(/[a-z](?:{.+?})?/) # => ["a", "b", "c{1,2,3}", "d", "e", "f{11,22,33}", "g"]
returns the string broken into chunks, looking for a single letter followed by an optional string of { with some text then }.
values = s.scan(/\d+/) # => ["1", "2", "3"], ["11", "22", "33"]
As it's running in flat_map, if { is found, the numbers are scanned out.
([prefix] * values.size).zip(values).map(&:join) # => ["c1", "c2", "c3"], ["f11", "f22", "f33"]
And then an array of the prefix, with the same number of elements as there are values is created and zipped together, resulting in:
[["c", "1"], ["c", "2"], ["c", "3"]], [["f", "11"], ["f", "22"], ["f", "33"]]
The join glues those sub-arrays together. And flat_map flattens any subarrays created so the resulting output is a single array.

You need to use arr = str.split(',') in the first step, because the values are separated by commas, not whitespace.
Also keep in mind you have {} to handle too.

This worked for me with simple regex and gsubing (though Tin Man's solution is better ruby):
def my_string_to_array(input_string)
  groups = input_string.scan(/\w+\{.*?\}/)
  groups.each do |group|
    modified = group.gsub(',', ",#{group.match(/\w+/)[0]}").delete("{}")
    input_string.gsub!(group, modified)
  end
  created_array = input_string.delete("{}").split(',')
end
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
my_string_to_array(string)
=> ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
The way it works is that it first finds the groups that have a letter prefix followed by a braced list of digits (like c{1,2,3}).
For each such group, it modifies it by gsubing ',' with ',<prefix>' and removing the braces.
Next, it replaces these groups with the modified ones in the original string.
And finally it removes the starting and ending braces in the original string, and converts it into an array.

Partition/split a string by character set in Ruby

How can I separate different character sets in my string? For example, if I had these charsets:
[a-z]
[A-Z]
[0-9]
[\s]
{everything else}
And this input:
thisISaTEST***1234pie
Then I want to separate the different character sets, for example, if I used a newline as the separating character:
this
IS
a
TEST
***
1234
pie
I've tried this regex, with a positive lookahead:
'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")
But apparently the +s aren't being greedy, because I'm getting:
t
h
# (snip)...
S
T***
1
# (snip)...
e
I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.
How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.

The difficult part is to match whatever that does not match the rest of the regex. Forget about that, and think of a way that you can mix the non-matching parts together with the matching parts.
"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]

In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.
To split your string into sequences of a single category, you can write
str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)
output
["this", "IS", "a", "TEST", "***", "1234", "pie"]
Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this
p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)

Here are two solutions.
String#scan with a regular expression
str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
Because of ^ at the beginning of [^a-zA-Z\d\s] that character class matches any character other than letters (lower and upper case), digits and whitespace.
Use Enumerable#slice_when (see note 1)
First, a helper method:
def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/ then 2
  when /\s/ then 3
  else 4
  end
end
For example,
type "f" #=> 0
type "P" #=> 1
type "3" #=> 2
type "\n" #=> 3
type "*" #=> 4
Then
str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
1. slice_when made its debut in Ruby v2.2.

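Non-word, non-space chars can be covered with [^\w\s] (note that \w includes the underscore, so an underscore in the input would not be matched by any of the alternatives below), so: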
"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]

Match sequences of consecutive characters in a string

I have the string "111221" and want to match all sets of consecutive equal integers: ["111", "22", "1"].
I know that there is a special regex thingy to do that but I can't remember and I'm terrible at Googling.

Using regex in Ruby 1.8.7+:
p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]
This works because (\d) captures any digit, and then \2* captures zero-or-more of whatever that group (the second opening parenthesis) matched. The outer (…) is needed to capture the entire match as a result in scan. Finally, scan alone returns:
[["111", "1"], ["22", "2"], ["1", "1"]]
…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):
p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]
With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:
p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]
Here's another version that should work even in Ruby 1.8.6:
p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]
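On newer Rubies (2.3 and later, where Enumerable#chunk_while is available) the same grouping can be written without a regex or an explicit accumulator. This is just a sketch of an additional option; the method name runs is made up for illustration.

```ruby
# Groups consecutive equal characters by comparing each pair of neighbours.
def runs(s)
  s.each_char.chunk_while { |a, b| a == b }.map(&:join)
end

p runs("111221")
# => ["111", "22", "1"]
```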

"111221".gsub(/(.)(\1)*/).to_a
#=> ["111", "22", "1"]
This uses the form of String#gsub that does not have a block and therefore returns an enumerator. It appears gsub was bestowed with that option in v2.0.

I found that this works, it first matches each character in one group, and then it matches any of the same character after it. This results in an array of two element arrays, with the first element of each array being the initial match, and then the second element being any additional repeated characters that match the first character. These arrays are joined back together to get an array of repeated characters:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
repeated_chars = input.scan(/(.)(\1*)/)
# => [["W", "W"], ["B", ""], ["W", "WWW"], ["B", "BB"], ["W", "WWWWWW"], ["B", ""], ["3", "333"], ["!", "!!!"]]
repeated_chars.map(&:join)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
As an alternative I found that I could create a new Regexp object to match one or more occurrences of each unique characters in the input string as follows:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
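regexp = Regexp.new("#{input.chars.uniq.map { |c| Regexp.escape(c) }.join("+|")}+")
# Regexp.escape keeps characters such as "." or "*" from being read as regex syntax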
#=> regexp created for this example will look like: /W+|B+|3+|!+/
and then use that Regex object as an argument for scan to split out all the repeated characters, as follows:
input.scan(regexp)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]

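You can try this (C# syntax, not Ruby):
string str = "111221";
string pattern = @"(\d)(\1)+";
Note that (\d)(\1)+ only matches runs of two or more digits, so the trailing single "1" is not matched.
Hope this helps.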

Determining if a prefix exists in a set

Given a set of strings, say:
"Alice"
"Bob"
"C"
"Ca"
"Car"
"Carol"
"Caroling"
"Carousel"
and given a single string, say:
"Carolers"
I would like a function that returns the smallest prefix not already inside the array.
For the above example, the function should return: "Caro". (A subsequent call would return "Carole")
I am very new to Ruby, and although I could probably hack out something ugly (using my C/C++/Objective-C brain), I would like to learn how to properly (elegantly?) code this up.

There's a little known magical module in Ruby called Abbrev.
require 'abbrev'
abbreviations = Abbrev::abbrev([
"Alice",
"Bob",
"C",
"Ca",
"Car",
"Carol",
"Caroling",
"Carousel"
])
carolers = Abbrev::abbrev(%w[Carolers])
(carolers.keys - abbreviations.keys).sort.first # => "Caro"
Above I took the first element but this shows what else would be available.
pp (carolers.keys - abbreviations.keys).sort
# >> ["Caro", "Carole", "Caroler", "Carolers"]
Wrap all the above in a function, compute the resulting missing elements, and then iterate over them yielding them to a block, or use an enumerator to return them one-by-one.
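A rough sketch of that wrapping, returning an enumerator. To keep the example free of library dependencies it generates the prefixes directly instead of going through Abbrev, which is a substitution on my part; the name missing_prefixes is also made up for illustration.

```ruby
require 'set'

# Hypothetical helper: enumerates the prefixes of `word` that are not in
# `known`, shortest first.
def missing_prefixes(known, word)
  seen = known.to_set
  Enumerator.new do |out|
    (1..word.length).each do |len|
      prefix = word[0, len]
      out << prefix unless seen.include?(prefix)
    end
  end
end

names = %w[Alice Bob C Ca Car Carol Caroling Carousel]
prefixes = missing_prefixes(names, "Carolers")
p prefixes.next   # => "Caro"
p prefixes.next   # => "Carole"
p prefixes.to_a   # => ["Caro", "Carole", "Caroler", "Carolers"]
```

Because the result is an Enumerator, the caller can take them one by one with next, or pass a block to each.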
This is what is generated for a single word. For an array it is more complex.
require 'pp'
pp Abbrev::abbrev(['cat'])
# >> {"ca"=>"cat", "c"=>"cat", "cat"=>"cat"}
pp Abbrev::abbrev(['cat', 'car', 'cattle', 'carrier'])
# >> {"cattl"=>"cattle",
# >> "catt"=>"cattle",
# >> "cat"=>"cat",
# >> "carrie"=>"carrier",
# >> "carri"=>"carrier",
# >> "carr"=>"carrier",
# >> "car"=>"car",
# >> "cattle"=>"cattle",
# >> "carrier"=>"carrier"}

Your question still doesn't match what you are expecting as a result. It seems that you need prefixes, not the substrings (as "a" would be the shortest substring not already in the array). For searching the prefix, this should suffice:
array = [
"Alice",
"Bob",
"C",
"Ca",
"Car",
"Carol",
"Caroling",
"Carousel",
]
str = 'Carolers'
(0..str.length).map{|i|
str[0..i]
}.find{|s| !array.member?(s)}

I am not a Ruby expert, but I think you may want to approach this problem by converting your set into a trie. Once you have the trie constructed, your problem can be solved simply by walking down from the root of the trie, following all of the edges for the letters in the word, until you either find a node that is not marked as a word or walk off the trie. In either case, you've found a node that isn't part of any word, and you have the shortest prefix of your word in question that doesn't already exist inside of the set. Moreover, this would let you run any number of prefix checks quickly, since after you've built up the trie the algorithm takes time at most linear in the length of the string.
Hope this helps!
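Here is a minimal sketch of what that trie could look like in Ruby. The class and method names (TrieNode, Trie, shortest_missing_prefix) are my own choices for illustration, not part of any library.

```ruby
# One node per prefix; `word` marks prefixes that are complete set members.
class TrieNode
  attr_reader :children
  attr_accessor :word

  def initialize
    @children = {}
    @word = false
  end
end

class Trie
  def initialize(words = [])
    @root = TrieNode.new
    words.each { |w| insert(w) }
  end

  def insert(word)
    node = @root
    word.each_char { |ch| node = (node.children[ch] ||= TrieNode.new) }
    node.word = true
  end

  # Walks down the trie along `str` and returns the first prefix that is
  # either off the trie or not marked as a word; nil if every prefix is a word.
  def shortest_missing_prefix(str)
    node = @root
    str.each_char.with_index do |ch, i|
      node = node.children[ch]
      return str[0..i] if node.nil? || !node.word
    end
    nil
  end
end

trie = Trie.new(%w[Alice Bob C Ca Car Carol Caroling Carousel])
p trie.shortest_missing_prefix("Carolers")   # => "Caro"
trie.insert("Caro")
p trie.shortest_missing_prefix("Carolers")   # => "Carole"
```

Each lookup touches at most one node per character of the string, which is the linear bound mentioned above.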

I'm not really sure what you're asking for other than an example of some Ruby code to find common prefixes. I'll assume you want to find the smallest string which is a prefix of the most number of strings in the given set. Here's an example implementation:
class PrefixFinder
  def initialize(words)
    @words = Hash[*words.map{|x|[x,x]}.flatten]
  end
  def next_prefix
    max=0; biggest=nil
    @words.keys.sort.each do |word|
      0.upto(word.size-1) do |len|
        substr=word[0..len]; regex=Regexp.new("^" + substr)
        next if @words[substr]
        count = @words.keys.find_all {|x| x=~regex}.size
        max, biggest = [count, substr] if count > max
        #puts "OK: s=#{substr}, biggest=#{biggest.inspect}"
      end
    end
    @words[biggest] = biggest if biggest
    biggest
  end
end
pf = PrefixFinder.new(%w(C Ca Car Carol Caroled Carolers))
pf.next_prefix # => "Caro"
pf.next_prefix # => "Carole"
pf.next_prefix # => "Caroler"
pf.next_prefix # => nil
No comment on the performance (or correctness) of this code but it does show some Ruby idioms (instance variables, iteration, hashing, etc).

=> inn = ["Alice","Bob","C","Ca","Car","Carol","Caroling","Carousel"]
=> y = Array.new
=> str="Carolers"
Split the given string into an array of characters
=> x=str.split('')
# ["C","a","r","o","l","e","r","s"]
Form all the prefixes (as arrays of characters)
=> x.each_index {|i| y << x.take(i+1)}
# y is now [["C"], ["C", "a"], ["C", "a", "r"], ["C", "a", "r", "o"], ["C", "a", "r", "o", "l"], ["C", "a", "r", "o", "l", "e"], ["C", "a", "r", "o", "l", "e", "r"], ["C", "a", "r", "o", "l", "e", "r", "s"]]
Use join to concatenate the characters of each prefix
=> y = y.map {|s| s.join }
# ["C", "Ca", "Car", "Caro", "Carol", "Carole", "Caroler", "Carolers"]
Select the first item from y that's not in the input array
=> y.select {|item| !inn.include? item}.first
You will get "Caro"
Putting together all
def FindFirstMissingItem(srcArray,strtocheck)
  y=Array.new
  x=strtocheck.split('')
  x.each_index {|i| y << x.take(i+1)}
  y=y.map {|s| s.join}
  y.select {|item| !srcArray.include? item}.first
end
And call
=> inn = ["Alice","Bob","C","Ca","Car","Carol","Caroling","Carousel"]
=> str="Carolers"
FindFirstMissingItem inn,str

Very simple version (but not very Rubyish):
str = 'Carolers'
ar = %w(Alice Bob C Ca Car Carol Caroling Carousel)
substr = str[0, n=1]
substr = str[0, n+=1] while ar.include? substr
puts substr
