For example, say I have input like the following:
" see all of these cool spaces "
Omit the quotes. What I'm looking for is how to turn that into an array of words. Like this:
['see', 'all', 'of', 'these', 'cool', 'spaces']
Thanks
Here's one way: Use split (see String#split):
string.split
By default, split will split the string into an array where the whitespace is, ignoring leading and trailing whitespace. Exactly what you're asking for. This is the same as using the more explicit string.split(" ").
" see all of these cool spaces ".split
#=> ["see", "all", "of", "these", "cool", "spaces"]
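To see how the default differs from an explicit pattern, here is a small sketch. The sample strings are made up for illustration. With no argument, split treats any run of whitespace (spaces, tabs, newlines) as one separator, whereas a regex matching a single space does not:

```ruby
# Hypothetical input mixing spaces, a tab, and a newline.
messy = "  see all\tof  these\ncool spaces  "

# No argument: split on runs of any whitespace and ignore the edges.
words = messy.split
p words
# => ["see", "all", "of", "these", "cool", "spaces"]

# A regex of one space splits at every single space and keeps the
# empty strings that result from leading or doubled spaces.
pieces = " a  b ".split(/ /)
p pieces
# => ["", "a", "", "b"]
```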
For example, the string is "I am very happy today". I want to remove all words containing the letter "a". So the output should be "I very". how can I do that?
Similar to @Sam's answer, only smaller :) It uses the little-known Enumerable#grep_v:
Inverted version of #grep. Returns an array of every element in enum for which not Pattern === element.
"I am very happy today".split.grep_v(/a/).join(' ') # => "I very"
You can try splitting each word and removing the ones that have the letter 'a' and join the words together like this:
"I am very happy today".split.reject{ |word| word.include?("a") }.join(" ")
Here's an example with a regex, which reads as:
word boundary
alphanumeric characters
a
alphanumeric characters
word boundary
You then need to remove the leftover spaces:
"I am very happy today".gsub(/\b\w*a\w*\b/i, '').strip.gsub(/\s+/, ' ')
The answers with split and join are cleaner, though.
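As a sketch of the regex approach, the same pattern can be written with the x flag so that each token carries its own comment. The constant and method names here are hypothetical. Note that the i flag makes this version case-insensitive, while include?("a") is not:

```ruby
# Same pattern as /\b\w*a\w*\b/i, spread out with the x (extended) flag.
WORD_WITH_A = /
  \b    # word boundary
  \w*   # zero or more word characters
  a     # the letter we want to exclude
  \w*   # zero or more word characters
  \b    # word boundary
/xi

def remove_words_with_a(sentence)
  # Delete matching words, then tidy up the spaces they leave behind.
  sentence.gsub(WORD_WITH_A, "").strip.gsub(/\s+/, " ")
end

p remove_words_with_a("I am very happy today")
# => "I very"

# The regex also catches the capital A in "An"; include?("a") does not.
sentence = "An apple is very tasty"
p remove_words_with_a(sentence)
# => "is very"
p sentence.split.reject { |word| word.include?("a") }.join(" ")
# => "An is very"
```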
Based on "How to Delete Strings that Start with Certain Characters in Ruby", I know that the way to remove a string that starts with the character "#" is:
email = email.gsub( /(?:\s|^)#.*/ , "") #removes strings that start with "#"
I want to also remove strings that end in ".". Inspired by "Difference between \A \z and ^ $ in Ruby regular expressions" I came up with:
email = email.gsub( /(?:\s|$).*\./ , "")
Basically I swapped the caret for the dollar sign and reversed the order of the part after the closing parenthesis (making sure to escape the period). However, it is not doing the trick.
An example I'd like to match and remove is:
"a8&23q2aas."
You were so close.
email = email.gsub( /.*\.\s*$/ , "")
The difference is in how the regex tokens relate to the string you are matching against. Here, you are trying to find a period (\.) which is followed only by whitespace (\s) or the end of the line ($). I would read the regex above as "Any characters of any length followed by a period, followed by any amount of whitespace, followed by the end of the line."
As commenters pointed out, though, there's a simpler way: String#end_with?.
I'd use:
words = %w[#a day in the life.]
# => ["#a", "day", "in", "the", "life."]
words.reject { |w| w.start_with?('#') || w.end_with?('.') }
# => ["day", "in", "the"]
Using a regex is overkill for this if you're only concerned with the starting or ending character, and, in fact, regular expressions will slow your code in comparison with using the built-in methods.
I would really like to stick to using gsub....
gsub is the wrong way to remove an element from an array. It could be used to turn the string into an empty string, but that won't remove that element from the array.
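A minimal sketch of that difference, using the same sample words as above. The combined regex is an assumption written for this illustration, not taken from the question:

```ruby
words = %w[#a day in the life.]

# gsub can blank out a matching element, but the element stays in the array.
blanked = words.map { |w| w.gsub(/\A#.*|.*\.\z/, "") }
p blanked
# => ["", "day", "in", "the", ""]

# reject actually drops the matching elements.
kept = words.reject { |w| w.start_with?("#") || w.end_with?(".") }
p kept
# => ["day", "in", "the"]
```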
def replace_suffix(str, suffix)
  str.end_with?(suffix) ? str[0, str.length - suffix.length] : str
end
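A short usage sketch for this helper, with made-up inputs. On Ruby 2.5 and later, the built-in String#delete_suffix should give the same result without a custom method:

```ruby
# The helper from the answer, repeated here so the example runs on its own.
def replace_suffix(str, suffix)
  str.end_with?(suffix) ? str[0, str.length - suffix.length] : str
end

p replace_suffix("life.", ".")  # => "life"
p replace_suffix("day", ".")    # => "day" (no suffix, string returned unchanged)

# Built-in equivalent, available since Ruby 2.5.
p "life.".delete_suffix(".")    # => "life"
```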
I have text like:
content = "Do you like to code? How I love to code! I'm always coding."
I'm trying to split it on either a ? or . or !:
content.split(/[?.!]/)
When I print out the results, the punctuation delimiters are missing.
Do you like to code
How I love to code
I'm always coding
How can I keep the punctuation?
Answer
Use a positive lookbehind (i.e. (?<=...)) to keep the delimiter at the end of each string. The lookbehind is written inside parentheses, but it is not a capture group: it only checks what precedes the split point and consumes nothing.
content.split(/(?<=[?.!])/)
# Returns an array with:
# ["Do you like to code?", " How I love to code!", " I'm always coding."]
That leaves a white space at the start of the second and third strings. Add a match for zero or more white spaces (\s*) after the lookbehind to exclude it:
content.split(/(?<=[?.!])\s*/)
# Returns an array with:
# ["Do you like to code?", "How I love to code!", "I'm always coding."]
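A possible variation, offered as a sketch rather than as part of the answer: requiring at least one whitespace character (\s+ instead of \s*) means the split only happens at the gap between sentences, so runs of punctuation such as "..." or "?!" stay together. The method name split_sentences is hypothetical:

```ruby
def split_sentences(text)
  # Split on whitespace that directly follows ?, . or !; the lookbehind
  # keeps the punctuation attached to the preceding sentence.
  text.split(/(?<=[?.!])\s+/)
end

content = "Do you like to code? How I love to code! I'm always coding."
p split_sentences(content)
# => ["Do you like to code?", "How I love to code!", "I'm always coding."]

p split_sentences("Wait... what?! Yes.")
# => ["Wait...", "what?!", "Yes."]
```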
Additional Notes
While it doesn't make sense with your example, the delimiter can be shifted to the front of the strings starting with the second one. This is done with a positive lookahead regular expression (i.e. ?=). For the sake of anyone looking for that technique, here's how to do that:
content.split(/(?=[?.!])/)
# Returns an array with:
# ["Do you like to code", "? How I love to code", "! I'm always coding", "."]
A better example to illustrate the behavior is:
content = "- the - quick brown - fox jumps"
content.split(/(?=-)/)
# Returns an array with:
# ["- the ", "- quick brown ", "- fox jumps"]
Notice that the square brackets (a character class, not a capture group) weren't necessary since there is only one delimiter. Also, since the first match happens at the first character, it ends up as the first item in the array.
To answer the question's title, adding a capture group to your split regex will preserve the split delimiters:
"Do you like to code? How I love to code! I'm always coding.".split /([?!.])/
=> ["Do you like to code", "?", " How I love to code", "!", " I'm always coding", "."]
From there, it's pretty simple to reconstruct sentences (or do other massaging as the problem calls for it):
s.split(/([?!.])/).each_slice(2).map(&:join).map(&:strip)
=> ["Do you like to code?", "How I love to code!", "I'm always coding."]
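Here is the same chain as a self-contained sketch, since s is not defined in the snippet above. The method name is hypothetical, and the second input is made up to show what happens when the text does not end with a delimiter:

```ruby
def sentences_with_delimiters(text)
  text.split(/([?!.])/)  # the capture group keeps each delimiter as its own element
      .each_slice(2)     # pair every sentence with the delimiter that follows it
      .map(&:join)       # glue each pair back together
      .map(&:strip)      # drop the space left at the start of later sentences
end

content = "Do you like to code? How I love to code! I'm always coding."
p sentences_with_delimiters(content)
# => ["Do you like to code?", "How I love to code!", "I'm always coding."]

# A trailing fragment without punctuation ends up alone in the last slice.
p sentences_with_delimiters("Hi! Bye")
# => ["Hi!", "Bye"]
```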
The regexes given in other answers do fulfill the body of the question more succinctly, though.
I'd use something like:
content.scan(/.+?[?!.]/)
# => ["Do you like to code?", " How I love to code!", " I'm always coding."]
If you want to get rid of the intervening spaces, use:
content.scan(/.+?[?!.]/).map(&:lstrip)
# => ["Do you like to code?", "How I love to code!", "I'm always coding."]
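One thing to watch with scan: text after the last delimiter is not matched, so it is silently dropped. A hedged sketch of one way around that is to also accept the end of the string (\z) as a terminator. The method name and the sample input are assumptions for illustration:

```ruby
def scan_sentences(text)
  # Match lazily up to the next ?, ! or ., or up to the end of the string.
  text.scan(/.+?(?:[?!.]|\z)/).map(&:strip)
end

sample = "Do you like to code? I do"

# The original pattern loses the unterminated tail.
p sample.scan(/.+?[?!.]/)
# => ["Do you like to code?"]

# Accepting \z keeps it.
p scan_sentences(sample)
# => ["Do you like to code?", "I do"]
```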
Use partition. An example from the documentation:
"hello".partition("l") #=> ["he", "l", "lo"]
The most robust way to do this is with a Natural Language Processing library: Rails gem to break a paragraph into series of sentences
You can also split in groups:
content.split(/(\?+)|(\.+)|(!+)/)
After splitting into groups, you can join the sentence and delimiter.
content.split(/(\?+)|(\.+)|(!+)/).each_slice(2) { |slice| puts slice.join }
The following two statements will generate the same result:
arr = %w(abc def ghi jkl)
and
arr = ["abc", "def", "ghi", "jkl"]
In which cases should %w be used?
In the case above, I want an array ["abc", "def", "ghi", "jkl"]. Which is the ideal way: the former (with %w) or the latter?
When to use %w[...] vs. a regular array? I'm sure you can think up reasons simply by looking at the two, and then typing them in, and thinking about what you just did.
Use %w[...] when you have a list of single words you want to turn into an array. I use it when I have parameters I want to loop over, or commands I know I'll want to add to in the future, because %w[...] makes it easy to add new elements to the array. There's less visual noise in the definition of the array.
Use a regular array of strings when you have elements that have embedded white-space that would trick %w. Use it for arrays that have to contain elements that are not strings. Enclosing the elements inside " and ' with intervening commas causes visual-noise, but it also makes it possible to create arrays with any object type.
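A small sketch of where %w gets tricked, with made-up city names. A space can be escaped with a backslash inside %w, but at that point a regular array is usually easier to read:

```ruby
# %w splits on whitespace, so a two-word name becomes two elements.
cities = %w[New York London]
p cities
# => ["New", "York", "London"]

# A backslash keeps the space, at some cost in readability.
escaped = %w[New\ York London]
p escaped
# => ["New York", "London"]

# A regular array handles embedded spaces and non-string elements directly.
mixed = ["New York", :london, 42, nil]
p mixed.map(&:class)
# => [String, Symbol, Integer, NilClass]
```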
So, you pick when to use one or the other when it makes the most sense to you. It's called "programmer's choice".
As you correctly noted, they generate the same result. So, when deciding, choose one that produces simpler code. In this case, it's the %w operator. In the case of your previous question, it's the array literal.
Using %w allows you to avoid using quotes around strings.
Moreover, there are more shortcuts like these:
%W - array of double-quoted strings (like %w, but with interpolation and escape sequences)
%r - regular expression
%q - single-quoted string
%Q - double-quoted string
%x - shell command
More information is available in "What does %w(array) mean?"
This is the way I remember it:
%Q/%q is for strings
%Q is for double-quoted strings (useful for when you have multiple quote characters in a string).
Instead of doing this:
"I said \"Hello World\""
You can do:
%Q{I said "Hello World"}
%q is for single-quoted strings. Remember that single-quoted strings do not support string interpolation or escape sequences such as \n. By "do not support", I mean that a single-quoted string will not process the escape sequence as a special character; in other words, the escape sequence will just be part of the string literal.
Instead of doing this:
'I said \'Hello World\''
You can do:
%q{I said 'Hello World'}
But note that if you have an escape sequence in the string, it will not be processed and is instead treated as a literal backslash followed by the character n:
result = %q{I said Hello World\n}
=> "I said Hello World\\n"
puts result
I said Hello World\n
Notice the literal \n was not treated as a line break, but it is with %Q:
result = %Q{I said Hello World\n}
=> "I said Hello World\n"
puts result
I said Hello World
%W/%w is for array elements
%W is used for double-quoted array elements. This means that it will support string interpolation and escape sequences:
Instead of doing this:
orange = "orange"
result = ["apple", "#{orange}", "grapes"]
=> ["apple", "orange", "grapes"]
you can do this:
result = %W(apple #{orange} grapes\n)
=> ["apple", "orange", "grapes\n"]
puts result
apple
orange
grapes
Notice the escape sequence \n caused a newline break after grapes. That would not happen with %w. %w is used for single-quoted array elements. And of course single quoted strings do not support interpolation and escape sequences.
Instead of doing this:
result = ['a', 'b', 'c']
you can do:
result = %w{a b c}
But look what happens when we try this:
result = %w{a b c\n}
=> ["a", "b", "c\\n"]
puts result
a
b
c\n
Remember not to confuse these constructs with %x (an alternative to the ` backtick, which is used to run Unix commands), %r (an alternative to the // regular expression syntax, useful when you have a lot of / characters in your regular expression and do not want to escape them), and finally %s (which is used for symbols).
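For completeness, a brief sketch of %r and %s with made-up values. %x is left out here because it would run a shell command:

```ruby
# %r avoids escaping the slashes that a /.../ literal would require.
path_pattern = %r{\A/usr/(?:bin|lib)/}
p path_pattern.match?("/usr/bin/ruby")  # => true
p path_pattern.match?("/opt/bin/ruby")  # => false

# %s builds a symbol, including one that contains a space.
sym = %s(hello world)
p sym  # => :"hello world"
```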
I have an array with keywords and I have a string, which may contain those keywords. I now need to know how many keywords are in the given string:
keywords = [ 'text' ,'keywords' ,'contains' ,'blue', '42']
text = 'This text is not long but it contains 3 keywords'
How can I now find out with a ruby command how many of the strings in my array are in the text (three in this case)? I could of course use a for each loop but I am almost sure that there is a more concise way to achieve this.
Thanks for your help
Update: Preferably the solution should not rely on the spaces. So the spaces could be replaced by arbitrary characters.
Update 2: The command should look for unique occurrences.
Here's one approach:
text.scan(/#{keywords.join('|')}/).length
Note that this is safe only if the keywords array contains only alphanumeric characters.
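Two hedged alternatives that address both the escaping concern and the request for unique occurrences. Regexp.union escapes each string, so keywords containing regex metacharacters are matched literally, and uniq collapses repeats. The second version avoids regular expressions altogether:

```ruby
keywords = ['text', 'keywords', 'contains', 'blue', '42']
text = 'This text is not long but it contains 3 keywords'

# Regexp.union builds an alternation with every keyword escaped.
unique_hits = text.scan(Regexp.union(keywords)).uniq.length
p unique_hits
# => 3

# Without a regex: count the keywords that occur at least once.
present = keywords.count { |keyword| text.include?(keyword) }
p present
# => 3
```

The count version checks each keyword independently, so it does not depend on how overlapping keywords are consumed during a single scan.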
Not exactly what you wanted but
irb(main):012:0> text.split(' ')
=> ["This", "text", "is", "not", "long", "but", "it", "contains", "3", "keywords"]
irb(main):013:0> text.split(' ') & keywords
=> ["text", "contains", "keywords"]
will give you an array with matches