Eloquent way to format string in Ruby? - ruby

str = 'foo_bar baz __goo'
Should print as
Foo Bar Baz Goo
Tried to use split /(\s|_)/, but it returns '_' and ' ' and multiple spaces...?
Ruby 1.9.3

try this:
str.split(/[\s_]+/).map(&:classify).join(" ")
if you have access to active support, or
str.split(/[\s_]+/).map(&:capitalize).join(" ")
if you want plain ruby.

Assuming you mean /(\s|_)/ (direction of the slashes matters!), your regular expression is pretty close. The reason you're getting the delimiters (spaces and underscores) in your result is the parentheses: they instruct the splitter to include the delimiters in the returned array.
The reason you are getting extra empty strings is that you are splitting on \s, which matches exactly one space (or tab), or '_', which matches exactly one underscore. If you want to treat any number of spaces or underscores as a single delimiter, you need to add + to your regex - it means "one or more of the previous thing".
But \s|_+ means "a space, or one or more underscores". You want to apply the + to the whole expression, not just the _. That brings us back to the parentheses. In this case, you want to group the two alternatives together without capturing (and returning) them; the syntax for that is (?:...). So this is the result:
str.split(/(?:\s|_)+/)
Now, if you want to normalize case, you want to run capitalize on each string, which you can do with map, like this:
str.split(/(?:\s|_)+/).map { |s| s.capitalize }
or use the shortcut:
str.split(/(?:\s|_)+/).map(&:capitalize)
So far, all these solutions return an array of strings, which you can do a variety of things with. But if you just want to put them back together into a single string, you can use join. For instance, to put them together with a single space between them:
str.split(/(?:\s|_)+/).map(&:capitalize).join ' '

Try splitting the string on:
[\s_]+

Use /[\ _]+/, it looks for one or more occurrence or space or underscore. In that way it is able to eat out multiple underscores, spaces or a combination of both. After than you get an array, so you use map to transform each of them. Later you can get them together using join. See examples -
Get them in a list -
str.split(/[\ _]+/).map {|s| s.capitalize }
=> ["Foo", "Bar", "Baz", "Goo"]
Get them as a whole string -
str.split(/[\ _]+/).map {|s| s.capitalize }.join(" ")
=> "Foo Bar Baz Goo"

Here the pure Ruby version:
str = 'foo_bar baz __goo'
str.split(/[ _]+/).map{|s| s[0].upcase+s[1..-1]}.join(" ")

Related

How to remove strings that end with a particular character in Ruby

Based on "How to Delete Strings that Start with Certain Characters in Ruby", I know that the way to remove a string that starts with the character "#" is:
email = email.gsub( /(?:\s|^)#.*/ , "") #removes strings that start with "#"
I want to also remove strings that end in ".". Inspired by "Difference between \A \z and ^ $ in Ruby regular expressions" I came up with:
email = email.gsub( /(?:\s|$).*\./ , "")
Basically I used gsub to remove the dollar sign for the carrot and reversed the order of the part after the closing parentheses (making sure to escape the period). However, it is not doing the trick.
An example I'd like to match and remove is:
"a8&23q2aas."
You were so close.
email = email.gsub( /.*\.\s*$/ , "")
The difference lies in the fact that you didn't consider the relationship between string of reference and the regex tokens that describe the condition you wish to trigger. Here, you are trying to find a period (\.) which is followed only by whitespace (\s) or the end of the line ($). I would read the regex above as "Any characters of any length followed by a period, followed by any amount of whitespace, followed by the end of the line."
As commenters pointed out, though, there's a simpler way: String#end_with?.
I'd use:
words = %w[#a day in the life.]
# => ["#a", "day", "in", "the", "life."]
words.reject { |w| w.start_with?('#') || w.end_with?('.') }
# => ["day", "in", "the"]
Using a regex is overkill for this if you're only concerned with the starting or ending character, and, in fact, regular expressions will slow your code in comparison with using the built-in methods.
I would really like to stick to using gsub....
gsub is the wrong way to remove an element from an array. It could be used to turn the string into an empty string, but that won't remove that element from the array.
def replace_suffix(str,suffix)
str.end_with?(suffix)? str[0, str.length - suffix.length] : str
end

When to use %w?

The following two statements will generate the same result:
arr = %w(abc def ghi jkl)
and
arr = ["abc", "def", "ghi", "jkl"]
In which cases should %w be used?
In the case above, I want an array ["abc", "def", "ghi", "jkl"]. Which is the ideal way: the former (with %w) or the later?
When to use %w[...] vs. a regular array? I'm sure you can think up reasons simply by looking at the two, and then typing them in, and thinking about what you just did.
Use %w[...] when you have a list of single words you want to turn into an array. I use it when I have parameters I want to loop over, or commands I know I'll want to add to in the future, because %w[...] makes it easy to add new elements to the array. There's less visual noise in the definition of the array.
Use a regular array of strings when you have elements that have embedded white-space that would trick %w. Use it for arrays that have to contain elements that are not strings. Enclosing the elements inside " and ' with intervening commas causes visual-noise, but it also makes it possible to create arrays with any object type.
So, you pick when to use one or the other when it makes the most sense to you. It's called "programmer's choice".
As you correctly noted, they generate the same result. So, when deciding, choose one that produces simpler code. In this case, it's the %w operator. In the case of your previous question, it's the array literal.
Using %w allows you to avoid using quotes around strings.
Moreover, there are more shortcuts like these:
%W - double quotes
%r - regular expression
%q - single-quoted string
%Q - double-quoted string
%x - shell command
More information is available in "What does %w(array) mean?"
This is the way I remember it:
%Q/%q is for strings
%Q is for double-quoted strings (useful for when you have multiple quote characters in a string).
Instead of doing this:
“I said \“Hello World\””
You can do:
%Q{I said “Hello World”}
%q is for single-quoted strings (remember single quoted strings do not support string interpolation or escape sequences e.g. \n. And when I say does not "support", I mean that single quoted strings will need process the escape sequence as a special character, in other words, the escape sequence will just be part of the string literal)
Instead of doing this:
‘I said \’Hello World\’’
You can do:
%q{I said 'Hello World'}
But note that if you have an escape sequence in string, that will not be processed and instead treated as a literal backslash and n character:
result = %q{I said Hello World\n}
=> "I said Hello World\\n"
puts result
I said Hello World\n
Notice the literal \n was not treated as a line break, but it is with %Q:
result = %Q{I said Hello World\n}
=> "I said Hello World\n"
puts result
I said Hello World
%W/%w is for array elements
%W is used for double-quoted array elements. This means that it will support string interpolation and escape sequences:
Instead of doing this:
orange = "orange"
result = ["apple", "#{orange}", "grapes"]
=> ["apple", "orange", "grapes”]
you can do this:
result = %W(apple #{orange} grapes\n)
=> ["apple", "orange", "grapes\n"]
puts result
apple
orange
grapes
Notice the escape sequence \n caused a newline break after grapes. That would not happen with %w. %w is used for single-quoted array elements. And of course single quoted strings do not support interpolation and escape sequences.
Instead of doing this:
result = [‘a’, ‘b’, ‘c’]
you can do:
result = %w{a b c}
But look what happens when we try this:
result = %w{a b c\n}
=> ["a", "b", "c\\n"]
puts result
a
b
c\n
Remember do not confuse these constructs with %x (alternative for ` backtick which is used to run unix commands), %r (alternative for // regular expression syntax useful when you have a lot of / characters in your regular expressions and do not want to escape them) and finally %s (which is sued for symbols).

How to change case of letters in string using RegEx in Ruby

Say I have a string : "hEY "
I want to convert it to "Hey "
string.gsub!(/([a-z])([A-Z]+ )/, '\1'.upcase)
That is the idea I have, but it seems like the upcase method does nothing when I use it within the gsub method. Why is that?
EDIT: I came up with this method:
string.gsub!(/([a-z])([A-Z]+ )/) { |str| str.downcase!.capitalize! }
Is there a way to do this within the regex though? I don't really understand the '\1' '\2' thing. Is that backreferencing? How does that work
#sawa Has the simple answer, and you've edited your question with another mechanism. However, to answer two of your questions:
Is there a way to do this within the regex though?
No, Ruby's regex does not support a case-changing feature as some other regex flavors do. You can "prove" this to yourself by reviewing the official Ruby regex docs for 1.9 and 2.0 and searching for the word "case":
https://github.com/ruby/ruby/blob/ruby_1_9_3/doc/re.rdoc
https://github.com/ruby/ruby/blob/ruby_2_0_0/doc/re.rdoc
I don't really understand the '\1' '\2' thing. Is that backreferencing? How does that work?
Your use of \1 is a kind of backreference. A backreference can be when you use \1 and such in the search pattern. For example, the regular expression /f(.)\1/ will find the letter f, followed by any character, followed by that same character (e.g. "foo" or "f!!").
In this case, within a replacement string passed to a method like String#gsub, the backreference does refer to the previous capture. From the docs:
"If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form \d, where d is a group number, or \k<n>, where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash."
In practice, this means:
"hello world".gsub( /([aeiou])/, '_\1_' ) #=> "h_e_ll_o_ w_o_rld"
"hello world".gsub( /([aeiou])/, "_\1_" ) #=> "h_\u0001_ll_\u0001_ w_\u0001_rld"
"hello world".gsub( /([aeiou])/, "_\\1_" ) #=> "h_e_ll_o_ w_o_rld"
Now, you have to understand when code runs. In your original code…
string.gsub!(/([a-z])([A-Z]+ )/, '\1'.upcase)
…what you are doing is calling upcase on the string '\1' (which has no effect) and then calling the gsub! method, passing in a regex and a string as parameters.
Finally, another way to achieve this same goal is with the block form like so:
# Take your pick of which you prefer:
string.gsub!(/([a-z])([A-Z]+ )/){ $1.upcase << $2.downcase }
string.gsub!(/([a-z])([A-Z]+ )/){ [$1.upcase,$2.downcase].join }
string.gsub!(/([a-z])([A-Z]+ )/){ "#{$1.upcase}#{$2.downcase}" }
In the block form of gsub the captured patterns are set to the global variables $1, $2, etc. and you can use those to construct the replacement string.
I don't know why you are trying to do it in a complicated way, but the usual way is:
"hEY".capitalize # => "Hey"
If you insist in using a regex and upcase, then you would also need downcase:
"hEY".downcase.sub(/\w/){$&.upcase} # => "Hey"
If you really want to just swap the case of every letter in the string, you can avoid the complexity of regex entirely because There's A Method For That™.
"hEY".swapcase # => "Hey"
"HellO thERe".swapcase # => "hELLo THerE"
There's also swapcase! to do it destructively.

Remove all non-alphabetical, non-numerical characters from a string?

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.
You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you wanted also punctuation, you can also build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡MarcAndré!"
For all character properties, you can refer to the doc.
string.gsub(/[^[:alnum:]]/, "")
The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.
You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using ruby since you tagged that in your post. You could go through the array, put it through a test using a regexp, and if it passes remove/keep it based on the regexp you use.
A regexp you might use might go something like this:
[^.!,^-#]
That will tell you if its not one of the characters inside the brackets. However, I suggest that you look up regular expressions, you might find a better solution once you know their syntax and usage.
If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.

how to count the words of a string in ruby

I want to do something like this
def get_count(string)
sentence.split(' ').count
end
I think there's might be a better way, string may have built-in method to do this.
I believe count is a function so you probably want to use length.
def get_count(string)
sentence.split(' ').length
end
Edit: If your string is really long creating an array from it with any splitting will need more memory so here's a faster way:
def get_count(string)
(0..(string.length-1)).inject(1){|m,e| m += string[e].chr == ' ' ? 1 : 0 }
end
If the only word boundary is a single space, just count them.
puts "this sentence has five words".count(' ')+1 # => 5
If there are spaces, line endings, tabs , comma's followed by a space etc. between the words, then scanning for word boundaries is a possibility:
puts "this, is./tfour words".scan(/\b/).size/2
I know this is an old question, but this might help someone stumbling here. Countring words is a complicated problem. What is a "word"? Do numbers and special characters count as words? Etc...
I wrote the words_counted gem for this purpose. It's a highly flexible, customizable string analyser. You can ask it to analyse any string for word count, word occurrences, and exclude words/characters using regexp, strings, and arrays.
counter = WordsCounted::Counter.new("Hello World!", exclude: "World")
counter.word_count #=> 1
counted.words #=> ["Hello"]
Etc...
The documentation and full source are on Github.
using regular expression will also cover multi spaces:
sentence.split(/\S+/).size
String doesn't have anything pre-built to do what you wanted. You can define a method in your class or extend the String class itself for what you want to do:
def word_count( string )
return 0 if string.empty?
string.split.size
end
Regex split on any non-word character:
string.split(/\W+/).size
...although it makes apostrophe use count as two words, so depending on how small the margin of error needs to be, you might want to build your own regex expression.
I recently found that String#count is faster than splitting up the string by over an order of magnitude.
Unfortunately, String#count only accepts a string, not a regular expression. Also, it would count two adjacent spaces as two things, rather than a single thing, and you'd have to handle other white space characters seperately.
p " some word\nother\tword.word|word".strip.split(/\s+/).size #=> 4
I'd rather check for word boundaries directly:
"Lorem Lorem Lorem".scan(/\w+/).size
=> 3
If you need to match rock-and-roll as one word, you could do like:
"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4

Resources