Counting words in Ruby with some exceptions

Say that we want to count the number of words in a document. I know we can do the following:
text.each_line(){ |line| totalWords = totalWords + line.split.size }
Now say that I want to add some exceptions, so that the following are not counted as words:
(1) numbers
(2) standalone letters
(3) email addresses
How can we do that?
Thanks.

You can wrap this up pretty neatly:
total_words = 0
text.each_line do |line|
  total_words += line.split.reject do |word|
    word.match(/\A(\d+|\w|\S*@\S+\.\S+)\z/)
  end.length
end
The last alternative in the regular expression, \S*@\S+\.\S+, is only a rough approximation of an email address.
Remember Ruby strongly encourages the use of variables with names like total_words and not totalWords.
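Putting this together, here is a minimal, self-contained sketch of the same idea. The method name count_words and the constant EXCLUDED are names made up for illustration, and the email pattern is only a rough approximation, not a validator.

```ruby
# Matches a whole token that is a number, a single word character,
# or something that looks roughly like an email address.
EXCLUDED = /\A(?:\d+|\w|\S*@\S+\.\S+)\z/

def count_words(text)
  # Sum, over all lines, the number of tokens that are not excluded.
  text.each_line.sum do |line|
    line.split.count { |word| !word.match?(EXCLUDED) }
  end
end

sample = "Send 3 copies to a friend at bob@example.com\nor call me"
puts count_words(sample) # prints 8
```

Tokens are split on whitespace only, so a token with trailing punctuation (for example "3," or "a.") would still be counted.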

Assuming you can represent all the exceptions in a single regular expression regex_variable, you could do:
text.each_line { |line| totalWords = totalWords + line.split.count { |wrd| wrd !~ regex_variable } }
Your regular expression could look something like:
regex_variable = /\A\d+\z|\A[a-z]\z|\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\z/i
I don't claim to be a regex expert, so you may want to double-check that, particularly the email validation part.

In addition to the other answers, a little gem hunting came up with this:
WordsCounted Gem
Get the following data from any string or readable file:
Word count
Unique word count
Word density
Character count
Average characters per word
A hash map of words and the number of times they occur
A hash map of words and their lengths
The longest word(s) and its length
The most occurring word(s) and its number of occurrences.
Count individual strings for occurrences.
A flexible way to exclude words (or anything) from the count. You can pass a string, a regexp, an array, or a lambda.
Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
Filters special characters but respects hyphens and apostrophes.
Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as ["São", "Paulo"] and not ["S", "", "o", "Paulo"].
Opens and reads files. Pass in a file path or a url instead of a string.

Have you ever started answering a question and found yourself wandering, exploring interesting, but tangential issues, or concepts you didn't fully understand? That's what happened to me here. Perhaps some of the ideas might prove useful in other settings, if not for the problem at hand.
For readability, we might define some helpers in the class String, but to avoid contamination, I'll use Refinements.
Code
module StringHelpers
  refine String do
    def count_words
      remove_punctuation.split.count { |w|
        !(w.is_number? || w.size == 1 || w.is_email_address?) }
    end
    def remove_punctuation
      gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/,' ')
    end
    def is_number?
      self =~ /\A-?\d+(?:\.\d+)?\z/
    end
    def is_email_address?
      include?('@') # for testing only
    end
  end
end
module CountWords
  using StringHelpers
  def self.count_words_in_file(fname)
    IO.foreach(fname).reduce(0) { |t,l| t+l.count_words }
  end
end
Note that I've put using inside a module (a class would also work). Ruby also allows using at the top level of a file, in which case the refinement stays active to the end of that file; it is not allowed inside a method.
Example
Let's first informally check that the helpers are working correctly:
module CheckHelpers
  using StringHelpers
  s = "You can reach my dog, a 10-year-old golden, at fido@dogs.org."
  p s = s.remove_punctuation
    #=> "You can reach my dog a 10 year old golden at fido@dogs.org "
  p words = s.split
    #=> ["You", "can", "reach", "my", "dog", "a", "10",
    #    "year", "old", "golden", "at", "fido@dogs.org"]
  p '123'.is_number?  #=> 0
  p '-123'.is_number? #=> 0
  p '1.23'.is_number? #=> 0
  p '123.'.is_number? #=> nil
  p "fido@dogs.org".is_email_address?    #=> true
  p "fido(at)dogs.org".is_email_address? #=> false
  p s.count_words #=> 9 ('a', '10' and "fido@dogs.org" excluded)
  s = "My cat, who has 4 lives remaining, is at abbie(at)felines.org."
  p s = s.remove_punctuation
  p s.count_words
end
All looks OK. Next, I'll put some text in a file:
FName = "pets"
text =<<_
My cat, who has 4 lives remaining, is at abbie(at)felines.org.
You can reach my dog, a 10-year-old golden, at fido@dogs.org.
_
File.write(FName, text)
#=> 125
and confirm the file contents:
File.read(FName)
#=> "My cat, who has 4 lives remaining, is at abbie(at)felines.org.\n
# You can reach my dog, a 10-year-old golden, at fido@dogs.org.\n"
Now, count the words:
CountWords.count_words_in_file(FName)
#=> 18 (9 in each line)
Note that there is at least one problem with the removal of punctuation. It has to do with the hyphen. Any idea what that might be?

Something like...?
def is_countable(word)
  return false if word.size < 2
  return false if word =~ /^[0-9]+$/
  return false if is_an_email_address(word) # you need a gem for this...
  true
end
wordCount = text.split.inject(0) { |count, word| is_countable(word) ? count + 1 : count }
Or, since I am jumping to the conclusion that you can just split your entire text into an array with split(), you might need:
wordCount = 0
text.each_line do |line|
line.split.each{|word| wordCount += 1 if is_countable(word) }
end
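For the email check, one option that avoids an extra gem is the pattern that ships with Ruby's standard uri library, URI::MailTo::EMAIL_REGEXP. The sketch below assumes that pattern is strict enough for your purposes; the helper names mirror the ones above, and the sample text is made up.

```ruby
require 'uri'

# Assumption: the stdlib pattern is an acceptable definition of "email".
def is_an_email_address(word)
  URI::MailTo::EMAIL_REGEXP.match?(word)
end

def is_countable(word)
  return false if word.size < 2               # standalone letters
  return false if word =~ /\A[0-9]+\z/        # plain numbers
  return false if is_an_email_address(word)   # email addresses
  true
end

text = "Write to me at me@example.com before 10 am"
word_count = text.split.count { |word| is_countable(word) }
puts word_count # prints 6
```

Trailing punctuation is not stripped here, so a token such as "me@example.com." would not match the pattern and would still be counted.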

Related

How to return the whole array instead of a single string

I am trying to return all words which have more than four letters in the below exercise.
def timed_reading(max_length, text)
var_b = text.split(" ")
var_b.map do |i|
if i.length >= max_length
return i
end
end
end
print timed_reading(4,"The Fox asked the stork, 'How is the soup?'")
# >> asked
I seem to get only one word.

If you want to filter a list and select only certain kinds of entries, use the select method:
var_b.select do |i|
i.length >= max_length
end
That's all you need.
The return i in the middle is confusing things, as that breaks out of the loop and returns a single value from the method itself. Remember that in Ruby, unlike languages such as JavaScript, return is often implied and doesn't need to be spelled out explicitly.
Blocks don't normally have return in them for this reason unless they need to interrupt the flow and break out of the method itself.
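Applied to the method in the question, that gives something like the sketch below. Note that splitting on whitespace leaves punctuation attached to the words, so the lengths include it.

```ruby
def timed_reading(max_length, text)
  # select returns a new array holding every element for which
  # the block returns a truthy value.
  text.split.select { |word| word.length >= max_length }
end

p timed_reading(4, "The Fox asked the stork, 'How is the soup?'")
#=> ["asked", "stork,", "'How", "soup?'"]
```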

You don't need to first extract all words from the string and then select those having at least four letters. Instead you can just extract the desired words using String#scan with a regular expression.
str = "The Fox asked the stork, 'How is the soup?'? Très bon?"
str.scan /\p{Alpha}{4,}/
#=> ["asked", "stork", "soup", "Très"]
The regular expression reads, "Match strings containing 4 or more letters". I've used \p{Alpha} (same as \p{L} and [[:alpha:]]) to match unicode letters. (These are documented in Regexp. Search for these expressions there.) You could replace \p{Alpha} with [a-zA-Z], but in that case "Très" would not be matched.
If you wish to also match digits, use \p{Alnum} or [[:alnum:]] instead. While \w also matches letters (English only) and digits, it also matches underscores, which you probably don't want in this situation.
Punctuation can be a problem when words are extracted from the string by splitting on whitespace.
arr = "That is a cow.".split
#=> ["That", "is", "a", "cow."]
arr.select { |word| word.size >= 4 }
#=> ["That", "cow."]
But "cow" has only three letters. If you instead use String#scan to extract words from the string, you obtain the desired result.
arr = "That is a cow?".scan /\p{Alpha}+/
#=> ["That", "is", "a", "cow"]
arr.select { |word| word.size >= 4 }
#=> ["That"]
However, if you use scan you may as well use a regular expression to retrieve only words having at least 4 characters, and skip the extra step.

Stuck in Abbreviation implementation to ruby string

I want to convert all the (alphabetic) words in a string to abbreviations, the way "i18n" abbreviates "internationalization". In other words, I want to change "extraordinary" into "e11y" because there are 11 characters between the first and the last letter in "extraordinary". It works with a single word in the string. But how can I do the same for a multi-word string? And of course, if a word has four characters or fewer, there is no point in abbreviating it.
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/, "#{x[0]}#{(x.length-2)}#{x[-1]}")
end
end
Test.assert_equals( Abbreviator.abbreviate("banana"), "b4a", Abbreviator.abbreviate("banana") )
Test.assert_equals( Abbreviator.abbreviate("double-barrel"), "d4e-b4l", Abbreviator.abbreviate("double-barrel") )
Test.assert_equals( Abbreviator.abbreviate("You, and I, should speak."), "You, and I, s4d s3k.", Abbreviator.abbreviate("You, and I, should speak.") )

Your mistake is that your second parameter is a substitution string operating on x (the original entire string) as a whole.
Instead of using the form of gsub where the second parameter is a substitution string, use the form of gsub that takes a block. Each matched substring is then passed to your block, and you can operate on that substring individually.
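To make the difference between the two forms concrete, here is a small sketch. The method names are hypothetical, and I've assumed that "long enough" means more than four characters, which matches the expected results in the question.

```ruby
# String form: the replacement is built once, from the whole input,
# before gsub even runs, so every match gets the same text.
def abbreviate_with_string(text)
  text.gsub(/[[:alpha:]]+/, "#{text[0]}#{text.length - 2}#{text[-1]}")
end

# Block form: the block runs once per match and receives that match.
def abbreviate_with_block(text)
  text.gsub(/[[:alpha:]]+/) do |word|
    word.length > 4 ? "#{word[0]}#{word.length - 2}#{word[-1]}" : word
  end
end

p abbreviate_with_string("double-barrel") #=> "d11l-d11l"
p abbreviate_with_block("double-barrel")  #=> "d4e-b4l"
```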

def short_form(str)
str.gsub(/[[:alpha:]]{4,}/) { |s| "%s%d%s" % [s[0], s.size-2, s[-1]] }
end
The regex reads, "match four or more alphabetic characters".
short_form "abc" # => "abc"
short_form "a-b-c" #=> "a-b-c"
short_form "cats" #=> "c2s"
short_form "two-ponies-c" #=> "two-p4s-c"
short_form "Humpty-Dumpty, who sat on a wall, fell over"
#=> "H4y-D4y, who sat on a w2l, f2l o2r"

I would recommend something along the lines of this:
class Abbreviator
def self.abbreviate(x)
x.gsub(/\w+/) do |word|
# Skip the word unless it's long enough
next word unless word.length > 4
# Do the same I18n-style conversion you did before
"#{word[0]}#{(word.length-2)}#{word[-1]}"
end
end
end

The accepted answer isn't bad, but it can be made a lot simpler by not matching words that are too short in the first place:
def abbreviate(str)
str.gsub(/([[:alpha:]])([[:alpha:]]{3,})([[:alpha:]])/i) { "#{$1}#{$2.size}#{$3}" }
end
abbreviate("You, and I, should speak.")
# => "You, and I, s4d s3k."
Alternatively, we can use lookbehind and lookahead, which makes the Regexp more complex but the substitution simpler:
def abbreviate(str)
str.gsub(/(?<=[[:alpha:]])[[:alpha:]]{3,}(?=[[:alpha:]])/i, &:size)
end

Count capitalized of each sentence in a paragraph Ruby

I answered my own question. Forgot to initialize count = 0
I have a bunch of sentences in a paragraph.
a = "Hello there. this is the best class. but does not offer anything." as an example.
To figure out if the first letter is capitalized, my thought is to .split the string so that a_sentence = a.split(".")
I know I can call "hello world".capitalize!; if it returns nil, that tells me the string was already capitalized.
EDIT
Now I can use an array method to go through each value and call .capitalize! on it.
And I know I can check whether something.strip.capitalize!.nil? is true.
But I can't seem to output how many were capitalized.
EDIT
a_sentence.each do |sentence|
if (sentence.strip.capitalize!.nil?)
count += 1
puts "#{count} capitalized"
end
end
It outputs:
1 capitalized
Thanks for all your help. I'll stick with the above code I can understand within the framework I only know in Ruby. :)

Try this:
b = []
a.split(".").each do |sentence|
b << sentence.strip.capitalize
end
b = b.join(". ") + "."
# => "Hello there. This is the best class. But does not offer anything."

Your post's title is misleading: from your code, it seems that you want to count the sentences that begin with a capital letter.
Assuming that every sentence ends with a period (a full stop) followed by a space, the following should work for you:
split_str = ". "
regex = /^[A-Z]/
paragraph_text.split(split_str).count do |sentence|
regex.match(sentence)
end
And if you want to simply ensure that each starting letter is capitalized, you could try the following:
paragraph_text.split(split_str).map(&:capitalize).join(split_str) + split_str
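As a quick check of the counting approach, here is a self-contained sketch using the paragraph from the question. The variable names are only for illustration, and the pattern uses \A so that it anchors at the start of each sentence string rather than at the start of any line within it.

```ruby
paragraph_text = "Hello there. this is the best class. but does not offer anything."

sentences = paragraph_text.split(". ")
# Count the sentences whose first character is an ASCII capital letter.
capitalized = sentences.count { |sentence| sentence.match?(/\A[A-Z]/) }

puts capitalized # prints 1
```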

There's no need to split the string into sentences:
str = "It was the best of times. sound familiar? Out, damn spot! oh, my."
str.scan(/(?:^|[.!?]\s)\s*\K[A-Z]/).length
#=> 2
The regex could be written with documentation by adding x after the closing /:
r = /
(?: # start a non-capture group
^|[.!?]\s # match ^ or (|) any of ([]) ., ! or ?, then one whitespace char
) # end non-capture group
\s* # match any number of whitespace chars
\K # forget the preceding match
[A-Z] # match one capital letter
/x
a = str.scan(r)
#=> ["I", "O"]
a.length
#=> 2
Instead of Array#length, you could use its alias, size, or Array#count.

You can count how many were capitalized, like this:
a = "Hello there. this is the best class. but does not offer anything."
a_sentence = a.split(".")
a_sentence.inject(0) { |sum, s| s.strip!; s.capitalize!.nil? ? sum += 1 : sum }
# => 1
a_sentence
# => ["Hello there", "This is the best class", "But does not offer anything"]
And then put it back together, like this:
"#{a_sentence.join('. ')}."
# => "Hello there. This is the best class. But does not offer anything."
EDIT
As @Humza suggested, you could use count:
a_sentence.count { |s| s.strip!; s.capitalize!.nil? }
# => 1

Searching for single words and combination words in Ruby

I want my output to search and count the frequency of the words "candy" and "gram", but also the combinations of "candy gram" and "gram candy," in a given text (whole_file.)
I am currently using the following code to display the occurrences of "candy" and "gram," but when I aggregate the combinations within the %w, only the word and frequencies of "candy" and "gram" display. Should I try a different way? thanks so much.
myArray = whole_file.split
stop_words= %w{ candy gram 'candy gram' 'gram candy' }
nonstop_words = myArray - stop_words
key_words = myArray - nonstop_words
frequency = Hash.new (0)
key_words.each { |word| frequency[word] +=1 }
key_words = frequency.sort_by {|x,y| x }
key_words.each { |word, frequency| puts word + ' ' + frequency.to_s }

It sounds like you're after n-grams. You could break the text into combinations of consecutive words in the first place, and then count the occurrences in the resulting array of word groupings. Here's an example:
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
[["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]].each do |term|
terms = whole_file.split(/\s+/).each_cons(term.length).to_a
puts "#{term.join(" ")} #{terms.count(term)}"
end
EDIT: As was pointed out in the comments, I wasn't paying close enough attention and was splitting the file on each loop, which is obviously not a good idea, especially if it's large. I also hadn't accounted for the fact that the original question may have needed the results sorted by count, although that wasn't explicitly asked.
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
# This is simplistic. You would need to address punctuation and other characters before
# or at this step.
split_file = whole_file.split(/\s+/)
terms_to_count = [["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]]
counts = []
terms_to_count.each do |term|
terms = split_file.each_cons(term.length).to_a
counts << [term.join(" "), terms.count(term)]
end
# Seemed like you may need to do sorting too, so here that is:
sorted = counts.sort { |a, b| b[1] <=> a[1] }
sorted.each do |count|
puts "#{count[0]} #{count[1]}"
end

Strip punctuation and convert to lower-case
The first thing you probably want to do is remove all punctuation from the string holding the contents of the file and then convert what's left to lower case, the latter so you don't have to worry about 'Cat' and 'cat' being counted as different words. Those two operations can be done in either order.
Changing upper-case letters to lower-case is easy:
text = whole_file.downcase
To remove the punctuation it is probably easier to decide what to keep rather than what to discard. If we only want to keep lower-case letters and whitespace (the latter so that the words stay separated), you can do this:
text = whole_file.downcase.gsub(/[^a-z\s]/, '')
That is, substitute an empty string for all characters other than (^) lowercase letters and whitespace. [1]
Determine frequency of individual words
If you want to count the number of times text contains the word 'candy', you can use the method String#scan on the string text and then determine the size of the array that is returned:
text.scan(/\bcandy\b/).size
scan returns an array with every occurrence of the string 'candy'; .size returns the size of that array. Here \b ensures 'candy' has a word "boundary" at each end, which could be whitespace or the beginning or end of a line or the file. That's to prevent 'candycane' from being counted.
A second way is to convert the string text to an array of words, as you have done [2]:
myArray = text.split
If you don't mind, I'd like to call this:
words = text.split
as I find that more expressive. [3]
The most direct way to determine the number of times 'candy' appears is to use the method Enumerable#count, like this:
words.count('candy')
You can also use the array difference method, Array#-, as you noted:
words.size - (words - ['candy']).size
If you wish to know the number of times either 'candy' or 'gram' appears, you could of course do the above for each and sum the two counts. Some other ways are:
words.size - (words - ['candy', 'gram']).size
words.count { |word| word == 'candy' || word == 'gram' }
words.count { |word| ['candy', 'gram'].include?(word) }
Determine the frequency of all words that appear in the text
Your use of a hash with a default value of zero was a good choice:
def frequency_of_all_words(words)
frequency = Hash.new(0)
words.each { |word| frequency[word] +=1 }
frequency
end
I wrote this as a method to emphasize that words.each... does not return frequency. Often you would see this written more compactly using the method Enumerable#each_with_object, which returns the hash ("object"):
def frequency_of_all_words(words)
words.each_with_object(Hash.new(0)) { |word, h| h[word] +=1 }
end
Once you have the hash frequency you can sort it as you did:
frequency.sort_by {|word, freq| freq }
or
frequency.sort_by(&:last)
which you could write:
frequency.sort_by {|_, freq| freq }
since you aren't using the first block variable. If you wanted the most frequent words first:
frequency.sort_by(&:last).reverse
or
frequency.sort_by {|_, freq| -freq }
All of these will give you an array. If you want to convert it back to a hash (with the largest values first, say):
Hash[frequency.sort_by(&:last).reverse]
or in Ruby 2.0+,
frequency.sort_by(&:last).reverse.to_h
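If you are on Ruby 2.7 or later, Enumerable#tally builds the same frequency hash in one call. A short sketch, with a made-up word list:

```ruby
words = %w[candy gram candy cane gram candy]

# tally returns a hash mapping each distinct element to its count.
frequency = words.tally
p frequency

# Most frequent first; sort_by returns an array of pairs,
# and to_h turns it back into a hash in that order.
by_count = frequency.sort_by { |_, freq| -freq }.to_h
p by_count
```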
Count the number of times a substring appears
Now let's count the number of times the string 'candy gram' appears. You might think we could use String#scan on the string holding the entire file, as we did earlier [4]:
text.scan(/\bcandy gram\b/).size
The first problem is that this won't catch 'candy\ngram'; i.e., when the words are separated by a newline character. We could fix that by changing the regex to /\bcandy\sgram\b/. A second problem is that 'candy gram' might have been 'candy. Gram' in the file, in which case you might not want to count it.
A better way is to use the method Enumerable#each_cons on the array words. The easiest way to show you how that works is by example:
words = %w{ check for candy gram here candy gram again }
#=> ["check", "for", "candy", "gram", "here", "candy", "gram", "again"]
enum = words.each_cons(2)
#=> #<Enumerator: ["check", "for", "candy", "gram", "here", "candy",
# "gram", "again"]:each_cons(2)>
enum.to_a
#=> [["check", "for"], ["for", "candy"], ["candy", "gram"],
# ["gram", "here"], ["here", "candy"], ["candy", "gram"],
# ["gram", "again"]]
each_cons(2) returns an enumerator; I've converted it to an array to display its contents.
So we can write
words.each_cons(2).map { |word_pair| word_pair.join(' ') }
#=> ["check for", "for candy", "candy gram", "gram here",
# "here candy", "candy gram", "gram again"]
and lastly:
words.each_cons(2).map { |word_pair|
word_pair.join(' ') }.count { |s| s == 'candy gram' }
#=> 2
[1] If you also wanted to keep dashes, for hyphenated words, change the regex to /[^-a-z\s]/ or /[^a-z\s-]/.
[2] Note from String#split that .split is the same as .split(' ') and, for strings without leading whitespace, the same as .split(/\s+/).
[3] Also, Ruby's naming convention is to use lower-case letters and underscores ("snake case") for variables and methods, such as my_array.
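Pulling the pieces of this answer together, a minimal sketch might look like the following. The method name count_phrase and the sample text are my own. The normalization step keeps only lower-case letters and whitespace, so sentence boundaries disappear and a phrase split across "candy. Gram" would be counted.

```ruby
# Count how often a phrase of one or more words occurs in the text.
def count_phrase(text, phrase)
  words  = text.downcase.gsub(/[^a-z\s]/, '').split
  target = phrase.downcase.split
  # Slide a window of target.size words over the text and compare.
  words.each_cons(target.size).count { |group| group == target }
end

whole_file = "Candy gram here. A gram\nof candy, then candy\ngram again."

['candy', 'gram', 'candy gram', 'gram candy'].each do |phrase|
  puts "#{phrase} #{count_phrase(whole_file, phrase)}"
end
# prints:
# candy 3
# gram 3
# candy gram 2
# gram candy 0
```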

Randomly replace letters in word

I tried to write a function which will be able to randomly change letters in word except first and last one.
def fun(string)
z=0
s=string.size
tab=string
a=(1...s-1).to_a.sample s-1
for i in 1...(s-1)
puts tab[i].replace(string[a[z]])
z=z+1
end
puts tab
end
fun("sample")
My output is:
p
l
a
m
sample
Does anybody know how to make my tab come out correctly?
The letters seem to change inside the for block, because the output was 'plamp', so it's random as I wanted, but when I print the whole word (expecting something like "splampe") it doesn't work. :(

What about:
def fun(string)
first, *middle, last = string.chars
[first, middle.shuffle, last].join
end
fun("sample") #=> "smalpe"

s = 'sample'
[s[0], s[1..-2].chars.shuffle, s[-1]].join
# => "slpmae"

Here is my solution:
def fun(string)
first = string[0]
last = string[-1]
middle = string[1..-2]
puts "#{first}#{middle.split('').shuffle.join}#{last}"
end
fun('sample')

There are some problems with your function. First, when you say tab=string, tab becomes a reference to the same object as string, so when you change characters in tab you change the characters of string too. Also, tab[i] returns a new one-character string, so calling replace on it never modifies tab itself. I think that, for clarity, it is better to keep the indices of the sample (1...n) to reference positions in the original string.
I suggest making tab a new array.
def fun(string)
  return string if string.length <= 2
  z = 1
  s = string.size
  tab = []
  tab[0] = string[0]
  a = (1...s-1).to_a.sample(s-1)
  (1...s-1).each do |i|
    tab[z] = string[a[i - 1]]
    z = z + 1
  end
  tab.push string[s-1]
  tab.join('')
end
fun("sample")
#=> "spalme"

Another way, using String#gsub with a block:
def inner_jumble(str)
str.gsub(/(?<=\w)\w{2,}(?=\w)/) { |s| s.chars.shuffle.join }
end
inner_jumble("pneumonoultramicroscopicsilicovolcanoconiosis") # *
#=> "poovcanaiimsllinoonroinuicclprsciscuoooomtces"
inner_jumble("what ho, fellow coders?")
#=> "waht ho, folelw coedrs?"
(?<=\w) is a ("zero-width") positive look-behind that requires the match to immediately follow a word character.
(?=\w) is a ("zero-width") positive look-ahead that requires the match to be followed immediately by a word character.
You could use \w\w+ in place of \w{2,} for matching two or more consecutive word characters.
Use gsub to jumble every word in the string, or sub to jumble only the first word that is long enough.
*A lung disease caused by inhaling very fine ash and sand dust, supposedly the longest word in some English dictionaries.
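Because the result is random, it can be awkward to test. One option, sketched below, is to pass a seeded Random to shuffle so that runs are repeatable; the seed parameter is my own addition and not part of the answer above.

```ruby
def inner_jumble(str, seed: nil)
  rng = seed ? Random.new(seed) : Random.new
  # Shuffle only the characters strictly between the first and last
  # character of each run of word characters.
  str.gsub(/(?<=\w)\w{2,}(?=\w)/) { |s| s.chars.shuffle(random: rng).join }
end

# The same seed gives the same jumble on every run.
puts inner_jumble("what ho, fellow coders?", seed: 42)
```

Words of fewer than four characters are left alone, since there is no inner run of two or more characters to shuffle.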
