Extract individual existing words in domain names - ruby

I'm looking for a Ruby gem (preferably) that will cut domain names up into their words.
whatwomenwant.com => 3 words, "what", "women", "want".
If it can ignore things like numbers and gibberish then great.

You'll need a word list such as those produced by Project Gutenberg or available in the source for ispell &c. Then you can use the following code to decompose a domain into words:
WORD_LIST = [
'experts',
'expert',
'exchange',
'sex',
'change',
]
def words_that_phrase_begins_with(phrase)
WORD_LIST.find_all do |word|
phrase.start_with?(word)
end
end
def phrase_to_words(phrase, words = [], word_list = [])
if phrase.empty?
word_list << words
else
words_that_phrase_begins_with(phrase).each do |word|
remainder = phrase[word.size..-1]
phrase_to_words(remainder, words + [word], word_list)
end
end
word_list
end
p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]
If given a phrase that has any unrecognized words, it returns an empty array:
p phrase_to_words('expertsfoo')
# => []
If the word list is long, this will be slow. You can make this algorithm faster by preprocessing the word list into a tree. The preprocessing itself will take time, so whether it's worth it will depend upon how many domains you want to test.
Here's some code to turn the word list into a tree:
def add_word_to_tree(tree, word)
first_letter = word[0..0].to_sym
remainder = word[1..-1]
tree[first_letter] ||= {}
if remainder.empty?
tree[first_letter][:word] = true
else
add_word_to_tree(tree[first_letter], remainder)
end
end
def make_word_tree
root = {}
WORD_LIST.each do |word|
add_word_to_tree(root, word)
end
root
end
def word_tree
@word_tree ||= make_word_tree
end
This produces a tree that looks like this:
{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}
It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or it is the symbol :word with the value being true. Nodes with :word are words.
Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:
def words_that_phrase_begins_with(phrase)
node = word_tree
words = []
phrase.each_char.with_index do |c, i|
node = node[c.to_sym]
break if node.nil?
words << phrase[0..i] if node[:word]
end
words
end
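For reference, the pieces above can be combined into one self-contained script. This is only a minimal sketch, not a gem: the SAMPLE_WORDS list and the names build_trie, prefixes_in_trie and segment are made up for illustration, and it uses one-character strings rather than symbols as hash keys.

```ruby
# Illustrative word list; a real one would come from a dictionary file.
SAMPLE_WORDS = %w[experts expert exchange sex change].freeze

# Build a nested-hash trie. Each node maps a character to a child node;
# the key :word marks the end of a complete word.
def build_trie(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |c| node = (node[c] ||= {}) }
    node[:word] = true
  end
  root
end

# Return every dictionary word that is a prefix of phrase.
def prefixes_in_trie(trie, phrase)
  node = trie
  found = []
  phrase.each_char.with_index do |c, i|
    node = node[c]
    break if node.nil?
    found << phrase[0..i] if node[:word]
  end
  found
end

# Return all ways to split phrase into dictionary words.
def segment(trie, phrase, so_far = [], results = [])
  if phrase.empty?
    results << so_far
  else
    prefixes_in_trie(trie, phrase).each do |word|
      segment(trie, phrase[word.size..-1], so_far + [word], results)
    end
  end
  results
end

TRIE = build_trie(SAMPLE_WORDS)
p segment(TRIE, 'expertsexchange')
```

The trie is built once and passed in explicitly, so the same structure can be reused for every domain you test.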

I don't know of any gems for this, but if I had to solve this problem, I would download an English word dictionary and read up on text-searching algorithms.
When there is more than one way to divide the letters (as in sepp2k's expertsexchange), two hints can help:
Sort your dictionary by, for example, word popularity, so that divisions made of more popular words rank higher.
Go to the main page of the site whose domain you are analyzing and read the content, searching for your candidate words. I don't think that you'll find sex on a page for some experts. But... hm... experts can be so different ,.)
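The first hint could be sketched roughly as follows. The WORD_RANK table, its numbers, and the names popularity_score and best_split are all hypothetical; a real table would come from a word-frequency list, and averaging ranks is just one possible scoring choice.

```ruby
# Hypothetical popularity ranks (1 = most common). These numbers are made up
# for illustration only.
WORD_RANK = {
  'change' => 120, 'exchange' => 250, 'experts' => 300,
  'expert' => 400, 'sex' => 700
}.freeze

# Words missing from the table get a large penalty rank.
UNKNOWN_RANK = 1_000_000

# Lower score means "made of more popular words". The average is used so that
# a split is not favoured or penalised just for having more parts.
def popularity_score(words)
  words.sum { |w| WORD_RANK.fetch(w, UNKNOWN_RANK) } / words.size.to_f
end

# Pick the candidate split with the lowest (best) score.
def best_split(candidates)
  candidates.min_by { |words| popularity_score(words) }
end

candidates = [%w[experts exchange], %w[expert sex change]]
p best_split(candidates)
```

With these made-up ranks the first candidate wins, but a different table could easily flip the result, which is why the second hint (checking the site's content) is still useful.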

Update
I've been working with this challenge and came up with the following code.
Please refactor if I'm doing something wrong :-)
Benchmark:
Runtime: 11 sec.
f- file: 13.000 lines of domain names
w- file: 2000 words (to check against)
Code:
lines = File.readlines('resource/domainlist.txt')
# Chomp each word; otherwise the trailing newline prevents any match below
words = File.readlines('resource/commonwords.txt').map(&:chomp)
results = {}
lines.each do |line|
# Start with words from 2 letters on, so ignoring 1 letter words like 'a'
word_size = 2
# Only get the .com domains
if line =~ /^.*,[a-z]+\.com.*$/i then
# Strip the .com off the domain
line.gsub!(/^.*,([a-z]+)\.com.*$/i, '\\1')
# If the domain name is between 3 and 12 characters
if line.size > 3 and line.size < 15 then
# For the length of the string run ...
line.size.times do |n|
# Set the counter
i = 0
# As long as we're within the length of the string
while i <= line.size - word_size do
# Get the word in proper DRY fashion
word = line[i,word_size]
# Check the word against our list
if words.include?(word)
results[line] = [] unless results[line]
# Add all the found words to the hash
results[line] << word
end
i += 1
end
word_size += 1
end
end
end
end
p results
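One possible speed-up for the lookup step, as a sketch only: keep the words in a Set so that each include? check is a hash lookup rather than a scan of the whole list. The method name words_in_domain and the sample words are made up for illustration, and the domain is assumed to be already stripped of its suffix.

```ruby
require 'set'

# Return every substring of domain (min_size letters or longer) that appears
# in word_set. Set#include? is a hash lookup, so it avoids scanning the
# whole word list for each candidate substring.
def words_in_domain(domain, word_set, min_size = 2)
  found = []
  (min_size..domain.size).each do |size|
    (0..domain.size - size).each do |start|
      candidate = domain[start, size]
      found << candidate if word_set.include?(candidate)
    end
  end
  found
end

common_words = Set.new(%w[what women want men hat ant])
p words_in_domain('whatwomenwant', common_words)
# => ["hat", "men", "ant", "what", "want", "women"]
```

Like the loop above, this reports every dictionary word found anywhere in the name, including overlapping ones such as "hat" inside "what".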

Related

Could someone explain line 2 to line 9 of this Ruby code?

def caesar_cipher(string, shift_factor)
string.length.times do |i|
if string[i].ord >= 97 && (string[i].ord + shift_factor) <= 122 || string[i].ord >= 65 && (string[i].ord + shift_factor) <= 90
string[i] = (string[i].ord + shift_factor).chr
elsif string[i].ord >= 97 && string[i].ord <= 122 || string[i].ord >= 65 && string[i].ord <= 90
string[i] = (string[i].ord + shift_factor - 122 + 96).chr
end
end
string
end
puts "Enter a string:"
string_input = gets.chomp
puts "Enter shift factor:"
shift_factor_input = gets.chomp.to_i
result_string = caesar_cipher(string_input, shift_factor_input)
puts result_string
https://github.com/OlehSliusar/caesar_cipher
A command line Caesar Cipher that takes in a string and the shift factor and then outputs the modified string.
I am unable to understand lines 2 to 9 of the code. I am confused about how the .times method is used in this context. Could someone explain what the author is doing there? As I understand it, .times acts as an iterator that repeats a block the stated number of times.
So, say, 5.times { puts "Dog" } will print "Dog" five times. Hence my understanding of .times is very different from the way the author used it.
This is an extended comment which does not answer the question (so no upvotes please).
That piece of code is ugly and arcane, not at all Ruby-like. Here's another way that makes better use of Ruby's tools and is effectively self-documenting.
Code
def caesar_cipher_encrypt(string, shift_size)
mapping = Hash.new { |h,k| k }.
merge(make_map('a', shift_size)).
merge(make_map('A', shift_size))
string.gsub(/./, mapping)
end
def make_map(first_char, shift_size)
base = first_char.ord
26.times.with_object({}) { |i,h| h[(base+i).chr] = (base+((i+shift_size) % 26)).chr }
end
Example
shift_size = 2
encrypted_str = caesar_cipher_encrypt("Mary said to Bob, 'Get lost!'.", shift_size)
#=> "Octa uckf vq Dqd, 'Igv nquv!'."
Explanation
The first step is to create a hash that maps letters into their shifted counterparts. We begin with
h = Hash.new { |h,k| k }
#=> {}
This creates an empty hash with a default value given by the block. That means that if h does not have a key k, h[k] returns k. Since all keys of 'h' will be letters, this means the value of a digit, space, punctuation mark or any other non-letter will be itself. See Hash::new.
We then have
f = make_map('a',2)
#=> {"a"=>"c", "b"=>"d", "c"=>"e",..., "x"=>"z", "y"=>"a", "z"=>"b"}
g = h.merge(f)
#=> {"a"=>"c", "b"=>"d", "c"=>"e",..., "y"=>"a", "z"=>"b"}
f = make_map('A',2)
#=> {"A"=>"C", "B"=>"D", "C"=>"E",..., "X"=>"Z", "Y"=>"A", "Z"=>"B"}
mapping = g.merge(f)
#=> {"a"=>"c", "b"=>"d", "c"=>"e",..., "y"=>"a", "z"=>"b",
# "A"=>"C", "B"=>"D", "C"=>"E",..., "Y"=>"A", "Z"=>"B"}
mapping['!']
#=> "!"
We may now simply use the form of String#gsub that uses a hash to perform substitutions.
"Mary said to Bob, 'Get lost!'.".gsub(/./, mapping)
#=> "Octa uckf vq Dqd, 'Igv nquv!'."
Decrypting
The receiver of an encrypted message can decrypt it as follows.
def caesar_cipher_decrypt(string, shift_size)
caesar_cipher_encrypt(string, -shift_size)
end
caesar_cipher_decrypt(encrypted_str, shift_size)
#=> "Mary said to Bob, 'Get lost!'."
.times do means "execute this code a certain number of times" as you said.
.times do |i| loops a certain number of times and counts each time in i
string.length gets the number of characters in the string.
string.length.times executes a block of code a number of times equal to the number of characters in the string.
string[i] accesses the i-th character in the string.
Putting it all together:
string.length.times do |i|
do_stuff_with string[i]
end
You have code which iterates through each character in the string and does something to it. In this case, the code shifts each character according to the Caesar cipher.
When you use iterators like string.each_char in Ruby or foreach(item in items) in other languages, you're generally not supposed to modify the collection while you iterate. Using .times and string[i] lets the code modify the string while it iterates, because the loop doesn't keep track of the string; it just knows that it needs to execute some number of times.
As others have pointed out, there are more elegant, more Ruby-like ways to do what this code does, but the writer of the code chose .times because it acts just like a for loop, which is a common programming idiom.
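A small sketch of the difference between the two styles. The method names upcase_vowels! and upcase_vowels are invented for this example and have nothing to do with the cipher itself.

```ruby
# In-place: the block only receives an index, so assigning to string[i]
# changes the original string.
def upcase_vowels!(string)
  string.length.times do |i|
    string[i] = string[i].upcase if 'aeiou'.include?(string[i])
  end
  string
end

# Non-mutating alternative: build a new string from the characters.
def upcase_vowels(string)
  string.each_char.map { |c| 'aeiou'.include?(c) ? c.upcase : c }.join
end

original = +'caesar'           # unary plus gives an unfrozen string
copy = upcase_vowels(original)
puts original                  # still "caesar"
puts copy                      # "cAEsAr"
upcase_vowels!(original)
puts original                  # now "cAEsAr"
```

The index-based loop is what lets the cipher code overwrite characters of the string it was given.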
Perhaps this will explain it:
string = 'foo'
string.length.times {|i| puts string[i]}
It's a way of iterating through each letter in the string. They could probably do the same thing via:
string.chars.collect{|character| p(character)}.join
and have cleaner code as a result (where p(character) would be replaced by the required manipulation of the current character)
For example this code:
'foo'.chars.collect{|c| (c.ord + 1).chr}.join
Iterates through the string and returns a new string with each character replaced with the next one in the alphabet. That is: "gpp"

How can I return a randomly selected word from the dict file?

The dict txt file is located at: /usr/share/dict/words
I need to access the words on that list and randomly puts a word that contains between 4-9 letters for the user.
pick = words.select { |w| w.size > 3 && w.size < 10 }.sample
Assuming that the words file contains one word per line:
puts File.read('/usr/share/dict/words').lines.select {|l| (4..9).cover?(l.strip.size)}.sample.strip
This should do it for you:
file_contents = File.read("/usr/share/dict/words")
words = file_contents.split("\n")
puts words[rand(0..words.size-1)]
pick = ar.sample until pick.to_s.size.between?(4, 9)
Most efficient is to use Random:
rng = Random.new
# Chomp the lines so the trailing newline is not counted in the length
words = File.readlines("/usr/share/dict/words").map(&:chomp)
words.select!{ |e| s = e.size; s >= 4 && s <= 9 }
pick = words[rng.rand(words.size)]
You can also use #shuffle but this generates a new list so is certainly heavy:
words = File.readlines("/usr/share/dict/words").map(&:chomp)
words.select!{ |e| s = e.size; s >= 4 && s <= 9 }
pick = words.shuffle.first
If words are not separated by newlines, use split:
words = File.read("/usr/share/dict/words").split
If you don't need to randomly select a large number of words in a given size range, here is a method built for speed.
Code
def random_word(fname, size_range, window)
f = File.open(fname, 'r')
fsize = f.size
loop do
f.seek(rand(fsize)) # Random start byte, likely within a word
next if f.eof?
f.readline # Go next word
window.times do
break if f.eof
w = f.readline.strip
if size_range.cover? w.size
f.close
return w
end
end
end
end
Examples
words =
'Now
is
the
time
for
all
Rubyists
to
do
some
coding'
FNAME = 'dict'
File.write(FNAME,words)
random_word(FNAME, (4..6), 3) #=> "some"
random_word(FNAME, (4..6), 3) #=> "time"
random_word(FNAME, (4..6), 3) #=> "some"
random_word(FNAME, (2..3), 5) #=> "the"
random_word(FNAME, (2..3), 5) #=> "for"
random_word(FNAME, (2..3), 5) #=> "is"
random_word(FNAME, (6..8), 3) #=> "Rubyists"
random_word(FNAME, (6..8), 3) #=> "coding"
random_word(FNAME, (6..8), 3) #=> "coding"
Explanation
There are two situations to consider.
A very large number of random words in a given length range must be generated at one time. Here it would make sense to construct an array of qualifying words and then just select offsets at random. If the dictionary did not change over time, these arrays could be written to file, one for each size range of interest.
Only a modest number of random words are to be generated at a time, or the desired size range changes frequently. In these cases, reading all the words in the dictionary into an array before selecting one (meeting the length requirement) at random would be highly inefficient. It is this latter situation that I have addressed.
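The first situation is not covered by the method above, but it might be sketched like this. The names bucket_by_length and sample_in_range are hypothetical, and the tiny in-memory word list stands in for a real dictionary file.

```ruby
# Group words by length once, so that later picks only touch the
# buckets for the requested sizes.
def bucket_by_length(words)
  words.group_by(&:size)
end

# Pick one random word whose length falls in size_range, or nil if none.
def sample_in_range(buckets, size_range)
  pool = size_range.flat_map { |n| buckets.fetch(n, []) }
  pool.sample
end

# Tiny list for illustration; with a real dictionary you might read the
# words from /usr/share/dict/words and chomp each line instead.
words = %w[Now is the time for all Rubyists to do some coding]
buckets = bucket_by_length(words)
puts sample_in_range(buckets, 4..6)
```

The grouping cost is paid once, so this only makes sense when many words are drawn from the same dictionary.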
The algorithm is very simple:
Generate a random byte offset within the dictionary file.
Move the file pointer to that byte, which will likely be in the middle of a word.
Read the rest of that line (unless already at the end of the file).
Read up to a specified number of words (or until the end of the file is reached), searching for one within the specified size range.
If a word is found, it is returned; otherwise the file pointer is moved to a new random offset and the process is repeated.
The reason for searching at most a given number of words from a given random offset (before randomly moving the file pointer) is to maintain randomness, but it is probably not very important. If this parameter were effectively +infinity, the search would run from the random location to the end of the file, which would slightly bias the selection toward words closer to the end of the file.
This method does have a flaw: in several dictionaries it is biased against the selection of the word "aardvark".
dictionary = File.readlines(filename)
sample_word = dictionary.sample.chomp
until sample_word.length.between?(5, 12)
sample_word = dictionary.sample.chomp
end
puts sample_word
chomp also removes carriage return characters (that is it will remove \n, \r, and \r\n).

How do I make multiple combinations with a string in ruby?

Input should be a string:
"abcd@gmail.com"
Output should be an Array of strings:
["abcd@gmail.com",
"a.bcd@gmail.com",
"ab.cd@gmail.com",
"abc.d@gmail.com",
"a.b.cd@gmail.com",
"a.bc.d@gmail.com",
"ab.c.d@gmail.com",
"a.b.c.d@gmail.com"]
The idea: "Make every possible combination in the first string part ("abcd") with a dot. Consecutive dots are not allowed. There are no dots allowed in the beginning and in the end of the first string part ("abcd")"
This is what I've come up with so far:
text,s = "abcd".split""
i=0
def first_dot(text)
text.insert 1,"."
end
def set_next_dot(text)
i = text.rindex(".")
text.delete_at i
text.insert(i+1,".")
end
My approach was
write a function, that sets the first dot
write a function that sets the next dot
...(magic)
I do not know how to put the pieces together. Any Idea? Or perhaps a better way?
thanx in advance
edit:
I think I found the solution :)
I will post it in about one hour (it's brilliant -> truth tables, binary numbers, transposition)
...and here the solution
s = "abc"
states = s.length - 1 # one dot position per gap between characters, none after the last
possibilites = 2**states
def set_space_or_dot(value)
value.gsub("0","").gsub("1",".")
end
def fill_with_leading_zeros(val, states)
if val.length < states
"0"*(states-val.length)+val
else
val
end
end
a = Array.new(possibilites,s)
a = a.map{|x| x.split ""}
b = [*0...possibilites].map{|x| x.to_s(2).to_s}
b = b.map{|x| fill_with_leading_zeros x,states}
b = b.map{|x| x.split ""}
c = []
for i in 0 ... a.size
c[i] = (set_space_or_dot (a[i].zip b[i]).join).strip
end
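The same binary-number idea can be written more compactly. This is a sketch under the assumption that only the part before the domain is passed in; the method name dot_variants is made up for illustration.

```ruby
# Return every way of placing dots in the gaps between the characters of
# local_part. With n characters there are n - 1 gaps, so 2**(n - 1) results.
def dot_variants(local_part)
  chars = local_part.chars
  gaps = chars.size - 1
  (0...2**gaps).map do |mask|
    chars.each_with_index.map do |c, i|
      # Bit i of mask decides whether a dot follows character i.
      i < gaps && mask[i] == 1 ? "#{c}." : c
    end.join
  end
end

p dot_variants('abc')
# => ["abc", "a.bc", "ab.c", "a.b.c"]
```

Because a dot can only ever follow a character that is not the last one, leading, trailing and consecutive dots cannot occur. Appending the domain to each variant gives the full addresses.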
Changing pduersteler answer a little bit:
possibilities = []
string = "abcd@example.com"
(string.split('@')[0].size-1).times do |pos|
possibility = string.dup
possibilities << possibility.insert(pos+1, '.')
end
How about this (probably needs a bit more fine-tuning to suit your needs):
s = "abcd"
(0..s.size-1).map do |i|
start, rest = [s[0..i], s[(i+1)..-1]]
(0..rest.size-1).map { |j| rest.dup.insert(j, '.') }.map { |s| "#{start}#{s}"}
end.flatten.compact
#=> ["a.bcd", "ab.cd", "abc.d", "ab.cd", "abc.d", "abc.d"]
An option would be to iterate n times through your string moving the dot, where n is the amount of chars minus 1. This is what you're doing right now, but without defining two methods.
Something like this:
possibilities = []
string = "abcd@example.com"
(string.split('@')[0].size-1).times do |pos|
possibilities << string.dup.insert(pos+1, '.')
end
edit
Now tested. Thanks to the comments: you need to call .dup on the string before the insert. Otherwise, the dot gets inserted into the original string and stays there for each later iteration, causing a mess. Calling .dup on the string copies it and works on the copy instead, leaving the original string untouched.

Loop returns only the last item

I'm new to Ruby, so the answer is probably pretty simple. Not to me though.
I am taking an array of strings (A) and matching it against another array of strings (B) to see if a given string from (A) exists as a substring within a string from (B).
The comparison seems to work; however, I only get back a result from the last (A) string compared.
What might this be?
def checkIfAvailableOnline(film)
puts "Looking for " + film
lowerCaseFilm = film.downcase
#iterate through the linesarray scanning for the film in question
for line in @linesArray
#get the line in lowercase
lowerCaseLine = line.downcase
#look for the film name as a substring within the line
results = lowerCaseLine.scan(lowerCaseFilm)
if results.length > 0
@availableOnlineArray << results
end
end
end
#-----------------------------------------
listFilmsArray.each {|line| checkIfAvailableOnline(line)}
Given a list of film names:
FILM_NAMES = [
'Baked Blue Tomatoes',
'Fried Yellow Tomatoes',
'The thing that ate my homework',
'In a world where',
]
Then to find all film names containing a substring, ignoring case:
def find_films_available_online(partial_film_name)
FILM_NAMES.find_all do |film_name|
film_name.downcase[partial_film_name.downcase]
end
end
p find_films_available_online('tomatoes')
# => ["Baked Blue Tomatoes", "Fried Yellow Tomatoes"]
p find_films_available_online('godzooka')
# => []
To find out if a film name is available online:
def available_online?(partial_film_name)
!find_films_available_online(partial_film_name).empty?
end
p available_online?('potatoes') # => false
p available_online?('A World') # => true
To find out which of a list of partial film names are available online:
def partial_film_names_available_online(partial_film_names)
partial_film_names.find_all do |partial_film_name|
available_online?(partial_film_name)
end
end
p partial_film_names_available_online [
'tomatoes',
'potatoes',
'A World',
]
# => ["tomatoes", "A World"]
A more rubyish way to do this is:
Given an array of films we are looking for:
@films = ["how to train your dragon", "kung fu panda", "avatar"]
Given an array of lines that may contain the films we are looking for:
@lines_array = ["just in kung fu panda", "available soon how to train your dragon"]
Return the film name early if it exists in a line or false if it doesn't after searching all the lines:
def online_available(film)
@lines_array.each do |l|
l.downcase.include?(film) ? (return film) : false
end
false
end
Check for the films in the lines rejecting the ones that returned false, print them and ultimately return an array of the matches we found:
def films_available
available = @films.collect{ |x| p "Looking for: #{x}"; online_available(x) }
.reject{ |x| x == false }
available.each{|x| p "Found: #{x}"}
available
end
It is considered bad style to use camel-case in method names with Ruby but you know what they say about opinions.
.each is an internal iterator and I'm pretty sure the "for" loop will run slower than the enumerable each method that arrays inherit.

Reading strings from one file and adding to another file with suffix to make unique

I am processing documents in ruby.
I have a document I am extracting specific strings from using regexp and then adding them to another file. When added to the destination file they must be made unique, so if that string already exists in the destination file I am adding a simple suffix, e.g. <word>_1. Eventually I want to reference the strings by name, so random number generation or a string derived from the date is no good.
At present I am storing each word added in an array, and every time I add a word I check that the string doesn't already exist in the array. That is fine if there is only one duplicate; however, there might be two or more, so I need to check for the initial string and then loop, incrementing the suffix until the name doesn't exist. (I have simplified my code, so there may be bugs.)
def add_word(word)
if @added_words.include? word
suffix = 1
suffixed_word = word
while @added_words.include? suffixed_word
suffixed_word = word + "_" + suffix.to_s
suffix += 1
end
word = suffixed_word
end
@added_words << word
end
It looks messy, is there a better algorithm or ruby way of doing this?
Make @added_words a Set (don't forget to require 'set'). This makes for faster lookup as sets are implemented with hashes, while still using include? to check for set membership. It's also easy to extract the highest used suffix:
>> s << 'foo'
#=> #<Set: {"foo"}>
>> s << 'foo_1'
#=> #<Set: {"foo", "foo_1"}>
>> word = 'foo'
#=> "foo"
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1.to_i }
#=> "foo_1"
>> s << 'foo_12'
#=> #<Set: {"foo", "foo_1", "foo_12"}>
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1.to_i }
#=> "foo_12"
Now to get the next value you can insert, you could just do the following (imagine you already had 12 foos, so the next should be a foo_13):
>> s << s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1.to_i }.next
#=> #<Set: {"foo", "foo_1", "foo_12", "foo_13"}>
Sorry if the examples are a bit confused, I had anesthesia earlier today. It should be enough to give you an idea of how sets could potentially help you though (most of it would work with array too, but sets have faster lookup).
Change @added_words to a Hash with a default of zero. Then you can do:
@added_words = Hash.new(0)
def add_word( word)
@added_words[word] += 1
end
# put it to work:
list = %w(test foo bar test bar bar)
names = list.map do |w|
"#{w}_#{add_word(w)}"
end
p @added_words
#=> {"test"=>2, "foo"=>1, "bar"=>3}
p names
#=>["test_1", "foo_1", "bar_1", "test_2", "bar_2", "bar_3"]
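If the first occurrence should keep its plain name and only later duplicates get a suffix, as the question describes, the counter idea can be combined with a set of names already handed out. This is a sketch only; the class name UniqueNamer and its method add are invented for illustration.

```ruby
require 'set'

# Hands out unique names: the first "foo" stays "foo", later ones become
# "foo_1", "foo_2", ... skipping any name that is already taken.
class UniqueNamer
  def initialize
    @taken = Set.new            # every name handed out so far
    @next_suffix = Hash.new(1)  # next suffix to try for each base word
  end

  def add(word)
    name = word
    while @taken.include?(name)
      name = "#{word}_#{@next_suffix[word]}"
      @next_suffix[word] += 1
    end
    @taken << name
    name
  end
end

namer = UniqueNamer.new
p %w[test foo test test_1 test].map { |w| namer.add(w) }
# => ["test", "foo", "test_1", "test_1_1", "test_2"]
```

Remembering the next suffix per base word avoids retrying from 1 on every duplicate, and the set check still catches names that collide with a word added literally (such as "test_1" above).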
In that case, I'd probably use a set or hash:
#in your class:
require 'set'
require 'forwardable'
extend Forwardable #I'm just including this to keep your previous api
#elsewhere you're setting up your instance_var, it's probably [] at the moment
def initialize
@added_words = Set.new
end
#then instead of `def add_word(word); @added_words.add(word); end`:
def_delegator :@added_words, :add, :add_word
#or just change whatever loop to use @added_words.add('word') rather than self#add_word('word')
# @added_words.add('word') does nothing if 'word' already exists in the set.
If you've got some attributes that you're grouping via these sections, then a hash might be better:
#elsewhere you're setting up your instance_var, it's probably [] at the moment
def initialize
@added_words = {}
end
def add_word(word, attrs={})
@added_words[word] ||= []
@added_words[word].push(attrs)
end
Doing it the "wrong way", but in slightly nicer code:
def add_word(word)
if @added_words.include? word
suffixed_word = 1.upto(1.0/0.0) do |suffix|
candidate = [word, suffix].join("_")
break candidate unless @added_words.include?(candidate)
end
word = suffixed_word
end
@added_words << word
end

Resources