Ruby count number of found keywords in string - ruby

I have an array with keywords and I have a string, which may contain those keywords. I now need to know how many keywords are in the given string:
keywords = [ 'text' ,'keywords' ,'contains' ,'blue', '42']
text = 'This text is not long but it contains 3 keywords'
How can I now find out with a ruby command how many of the strings in my array are in the text (three in this case)? I could of course use a for each loop but I am almost sure that there is a more concise way to achieve this.
Thanks for your help
Update: Preferably the solution should not rely on the spaces. So the spaces could be replaced by arbitrary characters.
Update 2: The command should look for unique occurrences.

Here's one approach:
text.scan(/#{keywords.join('|')}/).length
Note that this is safe only if the keywords array contains only alphanumeric characters.
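If the keywords might contain regex metacharacters, Regexp.union escapes each one before joining them with |, and uniq covers the "unique occurrences" requirement from Update 2. A minimal sketch using the question's own data (the variable names pattern, found and count are only illustrative):

```ruby
keywords = ['text', 'keywords', 'contains', 'blue', '42']
text = 'This text is not long but it contains 3 keywords'

# Regexp.union escapes each keyword, so metacharacters are matched literally.
pattern = Regexp.union(keywords)

# scan returns every match; uniq collapses keywords that occur more than once.
found = text.scan(pattern).uniq
count = found.length

p found  # => ["text", "contains", "keywords"]
p count  # => 3
```

Because this matches substrings, it does not depend on spaces or any other separator between the words.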

Not exactly what you wanted but
irb(main):012:0> text.split(' ')
=> ["This", "text", "is", "not", "long", "but", "it", "contains", "3", "keywords"]
irb(main):013:0> text.split(' ') & keywords
=> ["text", "contains", "keywords"]
will give you an array with matches
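To avoid relying on spaces, the same intersection idea can be applied to the result of scan instead of split. This is only a sketch: the sample text below is a variation of the question's string with arbitrary separators, and it assumes every keyword is a single run of word characters.

```ruby
keywords = ['text', 'keywords', 'contains', 'blue', '42']
text = 'This-text;is,not long but it contains 3 keywords'

# Extract runs of word characters instead of splitting on spaces,
# then intersect with the keyword list. Array#& already removes duplicates.
matches = text.scan(/\w+/) & keywords

p matches       # => ["text", "contains", "keywords"]
p matches.size  # => 3
```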

Related

Making each word of input an array element

For example, say I have input like follows:
" see all of these cool spaces "
Omit the quotes. What I'm looking for is how to turn that into an array of words. Like this:
['see', 'all', 'of', 'these', 'cool', 'spaces']
Thanks
Here's one way: Use split (see String#split):
string.split
By default, split will split the string into an array where the whitespace is, ignoring leading and trailing whitespace. Exactly what you're asking for. This is the same as using the more explicit string.split(" ").
" see all of these cool spaces ".split
#=> ["see", "all", "of", "these", "cool", "spaces"]

String splitting with unknown punctuation in Ruby

I am building an application that downloads sentences and parses them for a word game. I don't know in advance what punctuation the text will contain.
I'd like to be able to split up the sentence/s, examine them for part of speech tag, and if the correct tag is found, replace it with " ", and rejoin them back in order.
text = "some string, with punctuation- for example: things I don't know about, that may or may not have whitespaces and random characters % !!"
How can I split it into an array so that I can pass the parser over each word, and rejoin them in order, bearing in mind that string.split(//) seems to need to know what punctuation I'm looking for?
split is useful when the delimiters are easier to describe than the parts to be extracted. In your case, the parts to be extracted are easier to describe than the delimiters, so scan is the better fit. You should use scan rather than split.
text.scan(/[\w']+/)
# => ["some", "string", "with", "punctuation", "for", "example", "things", "I", "don't", "know", "about", "that", "may", "or", "may", "not", "have", "whitespaces", "and", "random", "characters"]
If you want to replace the matches, there is even more reason to not use split. In that case, you should use gsub.
text.gsub(/[\w']+/) do |word|
  if word.is_of_certain_part_of_speech?
    "___" # Replace it with `"___"`.
  else
    word # Put back the original word.
  end
end
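The method is_of_certain_part_of_speech? above is a placeholder. As a runnable sketch, the version below swaps in a hypothetical noun? helper backed by a hard-coded word list (NOUNS); a real application would call its part-of-speech tagger there instead.

```ruby
# Hypothetical stand-in for a part-of-speech tagger: a fixed list of nouns.
NOUNS = %w[string punctuation things].freeze

def noun?(word)
  NOUNS.include?(word.downcase)
end

text = "some string, with punctuation- for example: things I don't know"

# gsub with a block replaces each match with the block's return value,
# leaving all punctuation and whitespace between the matches untouched.
blanked = text.gsub(/[\w']+/) do |word|
  noun?(word) ? "___" : word
end

puts blanked
# => some ___, with ___- for example: ___ I don't know
```

Since gsub only touches the matched words, there is no need to split the sentence and rejoin it afterwards.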

Strange behavior while splitting string with non-word character regex

Case 1(Trailing space)
> "on behalf of all of us ".split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
but if there is a leading space, it gives the following:
Case 2(Leading space)
> " on behalf of all of us".split(/\W+/)
=> ["", "on", "behalf", "of", "all", "of", "us"]
I was expecting the same result as Case 1 for Case 2 as well.
ADDED
> "#dhh congratulations!!".split(/\W+/)
=> ["", "dhh", "congratulations"]
Would anyone please help me to understand the behavior?
[Update]
Skip regex, just Split on space!
> "#dhh congratulations!!".split
=> ["#dhh", "congratulations"]
\W matches any non-word character, including a space. When the string starts with a delimiter, split finds a match at position 0 and returns the empty string in front of it as the first field. A delimiter at the end also produces an empty field, but trailing empty fields are suppressed by default, which is why Case 1 looks clean.
To get consistent behavior for leading and trailing whitespace, remove it first with the #strip method. Note that strip only removes whitespace, so a leading "#" as in "#dhh" still yields an empty first field.
Case 1(Trailing space)
1.9.3p327 :007 > "on behalf of all of us ".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
Case 2(Leading space)
1.9.3p327 :008 > " on behalf of all of us".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
From the docs:
split(pattern=$;, [limit]) → anArray
[...]
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
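The effect of the limit parameter described in that excerpt can be seen by passing a negative limit. A small sketch, with shortened sample strings chosen only for illustration:

```ruby
leading  = " on behalf of us"
trailing = "on behalf of us "

# A delimiter at the start leaves an empty field in front of it.
p leading.split(/\W+/)       # => ["", "on", "behalf", "of", "us"]

# Without a limit, the empty field after the final delimiter is dropped.
p trailing.split(/\W+/)      # => ["on", "behalf", "of", "us"]

# A negative limit keeps trailing empty fields.
p trailing.split(/\W+/, -1)  # => ["on", "behalf", "of", "us", ""]
```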
Just for documentation, following works for me
" #dhh congratulations!!".gsub(/^\W+/,'').split /\W+/
Another one
" #dhh congratulations!!".scan /\w+/
Both give the expected results. However, there is a caveat for contractions like
> " Don't be shy.".scan /\w+/
=> ["Don", "t", "be", "shy"]
I am actually collecting words that are not articles, conjunctions, prepositions, etc., so I ignore such short forms anyway, which is why I used this solution.
I am preparing a word cloud from tweets. If you know of any proven algorithm, please share.

Remove all non-alphabetical, non-numerical characters from a string?

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.
You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you wanted also punctuation, you can also build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡MarcAndré!"
For all character properties, you can refer to the doc.
string.gsub(/[^[:alnum:]]/, "")
The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.
You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using ruby since you tagged that in your post. You could go through the array, put it through a test using a regexp, and if it passes remove/keep it based on the regexp you use.
A regexp you might use might go something like this:
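[^.!,'"^#-]
That will tell you if a character is not one of the characters inside the brackets. The hyphen goes last so that it is read as a literal character rather than as a range (^-# would be an invalid range). However, I suggest that you look up regular expressions; you might find a better solution once you know their syntax and usage.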
If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.
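The difference between the string anchors and the line anchors ^ and $ shows up as soon as a string contains a newline. A short sketch (the multiline sample string is made up for illustration):

```ruby
multiline = "hello\n42 cats!"

# ^ and $ anchor at line boundaries, so one clean line is enough to match.
p multiline.match?(/^\p{Alnum}+$/)    # => true

# \A and \z anchor at the string boundaries, so the whole string must qualify.
p multiline.match?(/\A\p{Alnum}+\z/)  # => false

p "yöwza".match?(/\A\p{Alnum}+\z/)    # => true
```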

Ruby: Is there a way to split a string only with the first x occurrences?

For example, suppose I have this:
001, "john doe", "male", 37, "programmer", "likes dogs, women, and is lazy"
The problem is that the line is only supposed to have 6 fields. But if I separate it with split I get more, due to the comma being used improperly to separate the fields.
Right now I'm splitting everything, then when I get to the 5-th index onward I concatenate all the strings. But I was wondering if there was a split(",",6) or something along these lines.
Ruby has a CSV module in the standard library. It will do what you really need here (ignore commas inside double quotes).
require 'csv' # CSV::Reader only existed in Ruby 1.8; use CSV.parse on 1.9 and later
CSV.parse("\"cake, pie\", bacon") do |row| p row; end
result:
["cake, pie", " bacon"]
You might want to strip the results if you're dim like me and stick whitespace everywhere.
Yes, you can do the_string.split(",", 6). However, this will still give the "wrong" result if there's a comma inside quotes somewhere in the middle (e.g. 001, "doe, john",...).
However using Shellwords might be more appropriate here as this will also allow other sections than the last to contain commas inside quotes (it will also remove the quotes which may or may not be a problem, depending on what you're trying to do).
Example:
require 'shellwords'
the_string = %(001, "doe, john", "male", 37, "programmer", "likes dogs, women, and is lazy")
Shellwords.shellwords the_string
#=> ["001,", "doe, john,", "male,", "37,", "programmer,", "likes dogs, women, and is lazy"]
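For comparison, the split(",", 6) variant mentioned at the start of this answer can be sketched on the question's original line. The map(&:strip) step is an addition here to remove the space after each comma:

```ruby
line = '001, "john doe", "male", 37, "programmer", "likes dogs, women, and is lazy"'

# With a limit of 6, the sixth field keeps the rest of the line, commas included.
fields = line.split(",", 6).map(&:strip)

p fields.length  # => 6
p fields.last    # => "\"likes dogs, women, and is lazy\""
```

Unlike Shellwords, this keeps the surrounding quotes, and it only works when the field with the extra commas is the last one.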

Resources