What is the best way to validate a gets input against a very long word list (a list of all the English words available)?
I am currently playing with readlines to manipulate the text, but before there's any manipulation, I would like to first validate the entry against the list.
The simplest way, but by no means the fastest, is to simply search against the word list each time. If the word list is in an array:
if word_list.index word
#manipulate word
end
If, however, you have the word list in a separate file (with each word on a separate line), you can use File.foreach to search it:
if File.foreach("word.list") {|x| break x if x.chomp == word}
#manipulate word
end
Note that foreach does not strip off the trailing newline character(s), so we get rid of them with String#chomp.
Here's a simple example using a Set, though Mark Johnson is right: a Bloom filter would be more memory-efficient.
require 'set'
WORD_RE = /\w+/
# Read in the default dictionary (from /usr/share/dict/words),
# and put all the words into a set
WORDS = Set.new(File.read('/usr/share/dict/words').scan(WORD_RE))
# read the input line by line
STDIN.each_line do |line|
  # find all the words in the line that aren't contained in our dictionary
  unrecognized = line.scan(WORD_RE).find_all { |term| not WORDS.include? term }
  # if none were found, the line is valid
  if unrecognized.empty?
    puts "line is valid"
  else # otherwise, the line contains some words not in our dictionary
    puts "line is invalid, could not recognize #{unrecognized.inspect}"
  end
end
Are you reading the list from a file?
Can't you hold it all in memory?
A finger tree might help you.
If not, there isn't much more you can do than read a chunk of data from the file and grep through it.
Read the word list into memory, and for each word, make an entry into a hash table:
def init_word_tester
  @words = {}
  File.foreach("word.list") {|word|
    @words[word.chomp] = 1
  }
end
Now you can just check every word against your hash:
def test_word word
  return @words[word]
end
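As a rough sketch of the hash-based approach above, the lookup table can be wrapped in a small object so it is built once and reused for every input. The names WordTester and valid? are hypothetical, a Set is swapped in for the hash, and lowercasing both sides is an assumption about how the comparison should behave.

```ruby
require 'set'

# Hypothetical helper: builds the lookup table once, then answers membership queries.
class WordTester
  # `lines` is any enumerable of strings, e.g. File.foreach("word.list").
  def initialize(lines)
    @words = Set.new
    lines.each { |line| @words << line.chomp.downcase }
  end

  # True when the (stripped, lowercased) word is in the list.
  def valid?(word)
    @words.include?(word.strip.downcase)
  end
end

# An in-memory list stands in for the real word file here.
tester = WordTester.new(["apple\n", "Banana\n", "cherry\n"])
puts tester.valid?("banana")
puts tester.valid?("durian")
```

With a real word list, something like WordTester.new(File.foreach("word.list")) followed by tester.valid?(gets) should behave the same way, since valid? strips the trailing newline itself.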
How do I get the first word from each line? Thanks to help from someone on Stack Overflow, I am working with the code below:
File.open("pastie.rb", "r") do |file|
while (line = file.gets)
next if (line[0,1] == " ")
labwords = line.split.first
print labwords.join(' ')
end
end
It extracts the first word from each line, but it has problems with spaces. I need help adjusting it. I need to use the first method, but I don't know how to use it.
If you want the first word from each line from a file:
first_words = File.read(file_name).lines.map { |l| l.split(/\s+/).first }
It's pretty simple. Let's break it apart:
File.read(file_name)
Reads the entire contents of the file and returns it as a string.
.lines
Splits a string by newline characters (\n) and returns an array of strings. Each string represents a "line."
.map { |l| ... }
Array#map calls the provided block passing in each item and taking the return value of the block to build a new array. Once Array#map finishes it returns the array containing new values. This allows you to transform the values. In the sample block here |l| is the block params portion meaning we're taking one argument and we'll reference it as l.
|l| l.split(/\s+/).first
This is the body of the block; I've included the block params here too for completeness. Here we split the line by /\s+/. This is a regular expression: \s matches any whitespace character (such as \t, \n and space) and the + following it means one or more, so \s+ matches one or more whitespace characters and, of course, it will try to match as many consecutive whitespace characters as possible. Passing this to String#split returns an array of the substrings that occur between the separator given. Our separator is one or more whitespace characters, so we get everything between runs of whitespace. If we had the string "A list of words" we'd get ["A", "list", "of", "words"] after the split call. Finally, we call .first, which returns the first element of an array (in this case, the first word). Note that if a line begins with whitespace, split(/\s+/) returns an empty string as its first element; calling split with no arguments skips leading whitespace instead.
Now, in Ruby, the evaluated value of the last expression in a block is automatically returned so our first word is returned and given that this block is passed to map we should get an array of the first words from a file. To demonstrate, let's take the input (assuming our file contains):
This is line one
And line two here
Don't forget about line three
Line four is very board
Line five is the best
It all ends with line six
Running this through the line above we get:
["This", "And", "Don't", "Line", "Line", "It"]
Which is the first word from each line.
Consider this:
def first_words_from_file(file_name)
  # readlines keeps the trailing newline, so strip before testing for blank lines
  lines = File.readlines(file_name).reject { |line| line.strip.empty? }
  lines.map do |line|
    line.split.first
  end
end
puts first_words_from_file('pastie.rb')
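Here is a hedged variation on the same idea that handles indented and blank lines in one pass. It assumes Ruby 2.7 or later for filter_map, and the method name first_words and the sample lines are made up for illustration.

```ruby
# Returns the first word of every line, skipping lines that have no words.
# `lines` is any enumerable of strings, e.g. File.foreach("pastie.rb").
def first_words(lines)
  # split with no arguments ignores leading whitespace; blank lines give nil,
  # which filter_map drops.
  lines.filter_map { |line| line.split.first }
end

sample = ["This is line one\n", "\n", "  indented line\n", "Last line\n"]
p first_words(sample)  # => ["This", "indented", "Last"]
```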
I am writing a matching algorithm that checks a user-entered word against a huge list of english words to see how many matches it can find. Everything works, except I have two lines of code that are essentially meant to not pick the same letters twice, and they make the whole thing just return a single letter. Here is what I've done:
word_array = []
File.open("wordsEn.txt").each do |line|
word_array << line.chomp
end
puts "Please enter a string of characters with no spaces:"
user_string = gets.chomp.downcase
user_string_array = user_string.split("")
matching_words = []
word_array.each do |word|
one_array = word.split("")
tmp_user_string_array = user_string_array
letter_counter = 0
for i in 0...word.length
if tmp_user_string_array.include? one_array[i]
letter_counter += 1
string_index = tmp_user_string_array.index(one_array[i])
tmp_user_string_array.slice!(string_index)
end
end
if letter_counter == word.length
matching_words << word
end
end
puts matching_words
This part here is what breaks it:
string_index = tmp_user_string_array.index(one_array[i])
tmp_user_string_array.slice!(string_index)
Can anyone see an issue here? It all makes sense to me.
I see what's happening. You're eliminating letters for non-matching words, which prevents matching words from being found.
For example, take this word list:
ant
bear
cat
dog
emu
And this input to your program:
catdog
The first word you look for is ant, which causes the a and t to be sliced out of catdog, leaving cdog. Now the word cat can no longer be found.
The cure is to make sure that your tmp_user_string_array really is a temporary array. Currently it's a reference to the original user_string_array, which means that you're destructively modifying the user input. You should make a copy of it before you start slicing and dicing.
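A minimal sketch of that fix, assuming the check is pulled out into a helper (spellable? is a hypothetical name): the letters are copied with dup before anything is removed, so each word starts from the full input.

```ruby
# True if `word` can be built from `letters`, using each letter at most once.
def spellable?(word, letters)
  pool = letters.dup                 # work on a copy; the caller's array is untouched
  word.each_char.all? do |ch|
    idx = pool.index(ch)
    idx && pool.delete_at(idx)       # consume the letter; nil when it is missing
  end
end

letters = "catdog".chars
words = %w[ant bear cat dog emu]
p words.select { |w| spellable?(w, letters) }  # => ["cat", "dog"]
p letters.join                                 # => "catdog"
```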
Once you've got that working, you might like to think about more efficient approaches that don't require duplicating and slicing arrays. Consider this: what if you were to sort each word of your lexicon as well as the input string before starting to look for a match? This would turn the word cat into act and the input acatdog into aacdgot. Do you see how you could traverse the sorted word and the sorted input in search of a match without the need to do any slicing?
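One possible reading of that hint, as a sketch rather than the answerer's own code: sort both strings, then walk the sorted input once and advance through the sorted word whenever the letters agree. The name sorted_match? is hypothetical.

```ruby
# True if every letter of `word` (with repeats) is available in `input`.
def sorted_match?(word, input)
  w = word.chars.sort
  i = 0                                # position in the sorted word
  input.chars.sort.each do |ch|
    break if i == w.length             # whole word already matched
    i += 1 if ch == w[i]               # letter consumed, move to the next one
  end
  i == w.length
end

p sorted_match?("cat", "catdog")   # => true
p sorted_match?("ant", "catdog")   # => false
p sorted_match?("good", "catdog")  # => false
```

Because both sequences are sorted, no array needs to be copied or sliced for each dictionary word; the input can also be sorted once up front instead of inside the method.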
I have a file like this:
some content
some oterh
*********************
useful1 text
useful3 text
*********************
some other content
How do I get the content between the two lines of stars into an array? For example, after processing the above file, the array should contain:
a = ["useful1 text", "useful3 text"]
A really hacky solution is to split the content on the lines of stars, grab the middle part, and then split that into lines, too:
content.split(/^\*+$/)[1].split(/\n+/).reject(&:empty?)
# => ["useful1 text", "useful3 text"]
f = File.open('test_doc.txt', 'r')
content = []
f.each_line do |line|
content << line.rstrip unless !!(line =~ /^\*(\*)*\*$/)
end
f.close
The regex pattern /^\*(\*)*\*$/ matches lines that consist of two or more asterisks and nothing else. !!(line =~ /^\*(\*)*\*$/) always returns a boolean value. So if the pattern does not match, the line is added to the array.
What about this:
def values_between(array, separator)
  array.slice(array.index(separator) + 1..array.rindex(separator) - 1)
end
filepath = '/tmp/test.txt'
lines = %w(trash trash separator content content separator trash)
separator = "separator\n"
File.write filepath, lines.join("\n")
values_between File.readlines(filepath), separator
#=> ["content\n", "content\n"]
I'd do it like this:
lines = []
File.foreach('./test.txt') do |li|
lines << li if (li[/^\*{5}/] ... li[/^\*{5}/])
end
lines[1..-2].map(&:strip).select{ |l| l > '' }
# => ["useful1 text", "useful3 text"]
/^\*{5}/ means "a line that starts with at least five '*' characters".
... is one of two uses of .. and ... and, in this use, is commonly called a "flip-flop" operator. It isn't used often in Ruby because most people don't seem to understand it. It's sometimes mistaken for the Range delimiters .. and ....
In this use, Ruby watches for the first test, li[/^\*{5}/], to return true. Once it does, .. or ... will return true until the second condition returns true. In this case we're looking for the same delimiter, so the same test, li[/^\*{5}/], will work, and this is where the difference between the two versions, .. and ..., comes into play.
.. evaluates the second condition on the same line that triggered the first, so it would toggle back to false immediately, whereas ... waits until the next line before checking the second condition, which avoids the problem of the first test seeing a delimiter and then the second test seeing the same line and triggering.
That lets the test assign to lines, which, prior to the [1..-2].map(&:strip).select{ |l| l > '' } looks like:
# => ["*********************\n",
# "\n",
# "useful1 text\n",
# "\n",
# "useful3 text\n",
# "\n",
# "*********************\n"]
[1..-2].map(&:strip).select{ |l| l > '' } cleans that up. Slicing with [1..-2] removes the first and last elements (the delimiter lines). strip removes leading and trailing whitespace, which gets rid of the trailing newlines and leaves empty strings and strings containing the desired text. select{ |l| l > '' } keeps the strings that are greater than the empty string, i.e., are not empty.
See "When would a Ruby flip-flop be useful?" and its related questions, and "What is a flip-flop operator?" for more information and some background. (Perl programmers use .. and ... often, for just this purpose.)
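To make the difference between the two forms concrete, here is a small sketch on an in-memory array. The method names two_dot and three_dot and the sample lines are made up for illustration.

```ruby
LINES = ["before", "*****", "useful", "*****", "after"]

# Two-dot form: the end test runs on the same line that switched the flip-flop on.
def two_dot(lines)
  lines.select { |l| true if (l =~ /^\*{5}/)..(l =~ /^\*{5}/) }
end

# Three-dot form: the end test is first checked on the following line.
def three_dot(lines)
  lines.select { |l| true if (l =~ /^\*{5}/)...(l =~ /^\*{5}/) }
end

p two_dot(LINES)    # => ["*****", "*****"]
p three_dot(LINES)  # => ["*****", "useful", "*****"]
```

With the same pattern on both sides, the two-dot form only ever selects the delimiter lines themselves, which is why the three-dot form is the one that captures the block.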
One warning though: If the file has multiple blocks delimited this way, you'll get the contents of them all. The code I wrote doesn't know how to stop until the end-of-file is reached, so you'll have to figure out how to handle that situation if it could occur.
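If multiple delimited blocks can occur, one way to keep them apart is a small state machine instead of the flip-flop. This is only a sketch: blocks_between is a hypothetical name, and dropping an unterminated final block is an assumption about the desired behavior.

```ruby
# Collects the non-empty lines of each delimited block into its own array.
def blocks_between(lines, delimiter = /^\*{5}/)
  blocks = []
  current = nil                      # nil means "currently outside a block"
  lines.each do |line|
    if line =~ delimiter
      if current                     # closing delimiter: store the finished block
        blocks << current
        current = nil
      else                           # opening delimiter: start a new block
        current = []
      end
    elsif current
      text = line.strip
      current << text unless text.empty?
    end
  end
  blocks                             # an unterminated block is discarded
end

sample = ["some content\n", "*****\n", "useful1 text\n", "\n", "useful3 text\n",
          "*****\n", "other\n", "*****\n", "second\n", "*****\n"]
p blocks_between(sample)  # => [["useful1 text", "useful3 text"], ["second"]]
```

Taking blocks_between(File.foreach('./test.txt')).first would then give only the first block.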
I'm trying to complete the first task to our assignment:
Get 5 regular emails and 5 advance-fee fraud emails (aka spam). Convert them all into text files and then turn each into an array of words (split may help here). Then use a bunch of regular expressions to search the array of words looking for keywords to classify which files are spam or not. If you want to get fancy you could give each array a spam-score out of 10.
Open HTML page and read file.
Strip script, links etc from file.
Have body/para on its own.
Open text file (file2) & write to it (UTF-8).
Pass content from HTML document (file 1).
Now put the words from text file (file2) into an array and later split.
Go through array finding any words that are considered spam and print message to screen stating if the email is a spam or not.
Here is my code:
require 'nokogiri'
file = File.open("EMAILS/REG/Membership.htm", "r")
doc = Nokogiri::HTML(file)
#What ever is passed from elements to the newFile is being put into the new array however the euro sign doesn't appear correctly
elements = doc.xpath("/html/body//p").text
#puts elements
newFile = File.open("test1.txt", "w")
newFile.write(elements)
newFile.close()
#I want to open the file again and print the lines to the screen
#
array_of_words = {}
puts "\n\tRetrieving test1.txt...\n\n"
File.open("test1.txt", "r:UTF-8").each_line do |line|
words = line.split(' ')
words.each do |word|
puts "#{word}"
#array_of_words[word] = gets.chomp.split(' ')
end
end
EDITED: Here I've edited the file, however, I'm unable to retrieve the UTF-8 encoding of the euro sign in the array (see the image).
require 'nokogiri'
doc = Nokogiri::HTML(File.open("EMAILS/REG/Membership.htm", "r:UTF-8"))
#What ever is passed from elements to the newFile is being put into the new
#array however the euro sign doesn't appear correctly
elements = doc.xpath("//p").text
#puts elements
File.write("test1.txt", elements)
puts "\n\tRetrieving test1.txt...\n\n"
#I want to open the file again and print the lines to the screen
#
word_array = Array.new
File.read("test1.txt").each_line do |line|
line.split(' ').each do |word|
puts "#{word}"
word_array << word
end
end
Because this is an assignment, I'm not going to try to answer how you're supposed to do this; you're supposed to figure it out on your own.
What I will do is show you how you should have written what you've already done, and point you in a direction:
require 'nokogiri'
doc = Nokogiri::HTML(File.read("EMAILS/REG/Membership.htm"))
# What ever is passed from elements to the newFile is being put into the new
# array however the euro sign doesn't appear correctly
elements = doc.xpath("//p").text
File.write("test1.txt", elements)
print "\n\tRetrieving test1.txt...\n\n"
# I want to open the file again and print the lines to the screen
word_hash = {}
File.open("test1.txt", "r:UTF-8").each_line do |line|
line.split(' ').each do |word|
puts "#{word}"
#word_hash[word] = gets.chomp.split(' ')
end
end
Many of Ruby's IO methods, and File's by inheritance, can take advantage of blocks, which automatically close the stream when the block exits. Use that capability as leaving files open throughout the run-time of an app is not good.
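A short sketch of the block forms, assuming a scratch file in the system temp directory (the file name is made up for illustration):

```ruby
require 'tmpdir'

path = File.join(Dir.tmpdir, "block_form_demo.txt")  # hypothetical scratch file

# Block form of File.open: the file is closed when the block exits.
File.open(path, "w:UTF-8") { |f| f.puts "price: 10 €" }

handle = nil
File.open(path, "r:UTF-8") do |f|
  handle = f                       # kept only to inspect the handle afterwards
  f.each_line { |line| puts line }
end
puts handle.closed?                # the block form closed the file for us

# File.foreach opens, iterates, and closes in a single call.
words = []
File.foreach(path, encoding: "UTF-8") { |line| words.concat(line.split) }
p words

File.delete(path)                  # clean up the scratch file
```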
array_of_words = {} doesn't define an array, it's a hash.
#array_of_words[word] = gets.chomp.split(' ') wouldn't work because of where gets wants to read from. By default it's STDIN, which would be the console, meaning the keyboard. You've already got word at that point so do something with it.
But think, you're basically creating the basis for a Bayesian Filter. You need to be counting the number of occurrences of words, so merely assigning the word to the hash won't get you what you want to know, you need to know how many times a particular word was seen. Stack Overflow has a lot of questions answered about how to count the number of words found in a string, so search for those.
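Counting occurrences can be sketched with a hash whose default value is 0. The method name word_counts, the sample text, and the choice to treat runs of letters and apostrophes as words are all assumptions made for illustration; the classification itself is left out.

```ruby
# Returns a hash mapping each lowercased word to the number of times it occurs.
def word_counts(text)
  counts = Hash.new(0)             # unseen words start at 0
  text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  counts
end

counts = word_counts("Free money! Claim your FREE prize, money back.")
p counts["free"]    # => 2
p counts["money"]   # => 2
p counts["invoice"] # => 0
```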
You're making things harder for yourself. You already have the paragraph text in elements so there's no need to read test1.txt after writing to it. Then use String#split without arguments to split on all whitespace.
I've googled everywhere and can't seem to find an example of what I'm looking for. I'm trying to learn Ruby and I'm writing a simple script. The user is prompted to enter letters which are loaded into an array. The script then goes through a file containing a bunch of words and pulls out the words that contain what is in the array. My problem is that it only pulls words out if the letters appear in the same order as in the array. For example...
characterArray = Array.new;
puts "Enter characters that the password contains";
characters = gets.chomp;
puts "Searching words containing #{characters}...";
characterArray = characters.scan(/./);
searchCharacters=characterArray[0..characterArray.size].join;
File.open("dictionary.txt").each { |line|
if line.include?(searchCharacters)
puts line;
end
}
If I were to use this code and enter "dog",
the script would return
dog
doggie
But I need the output to include words even if the letters are not in the same order. Like...
dog
doggie
rodge
Sorry for the sloppy code. Like I said, I'm still learning. Thanks for your help.
PS. I've also tried this...
File.open("dictionary.txt").each { |line|
if line =~ /[characterArray[0..characterArray.size]]/
puts line;
end
}
but this returns all words that contain ANY of the letters the user entered
First of all, you don't need to initialize characterArray with Array.new. Assigning the result of a method call to a variable creates the variable, so the first line can be dropped.
In your code characters will be, for example, "asd". characterArray will then be ["a", "s", "d"], and searchCharacters will be "asd" again. It seems you don't need this conversion.
characterArray[0..characterArray.size] is just equal to characterArray.
You can use each_char iterator to iterate through characters of string. I suggest this:
puts "Enter characters that the password contains";
characters = gets.chomp;
File.open("dictionary.txt").each { |line|
unless characters.each_char.map { |c| line.include?(c) }.include? false
puts line;
end
}
I've checked it works properly. In my code I make an array:
characters.each_char.map { |c| line.include?(c) }
The values of this array indicate whether each character was found in the line: true if it was, false if it was not. The array has one entry per character in characters. We consider a line good if it contains no false values.
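The same check can be written without building the intermediate array by using all?, which stops at the first missing character. This is a sketch; contains_all_chars? and the small in-memory dictionary are hypothetical stand-ins for the real file.

```ruby
# True if every character in `characters` appears somewhere in `line`.
def contains_all_chars?(line, characters)
  characters.each_char.all? { |c| line.include?(c) }
end

dictionary = %w[dog doggie rodge cat good]
p dictionary.select { |word| contains_all_chars?(word, "dog") }
# => ["dog", "doggie", "rodge", "good"]
```

Like the original, this only tests for presence, so entering "oo" would still match a word with a single o.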