The upcase method capitalizes the entire string, but I need to capitalize only the first letter.
Also, I need to support several popular languages, like German and Russian.
How do I do it?
It depends on which Ruby version you use:
Ruby 2.4 and higher:
It just works, since Ruby 2.4.0 and later support Unicode case mapping:
"мария".capitalize #=> Мария
Ruby 2.3 and lower:
"maria".capitalize #=> "Maria"
"мария".capitalize #=> мария
The problem is that it just doesn't do what you want: it outputs мария instead of Мария.
If you're using Rails there's an easy workaround:
"мария".mb_chars.capitalize.to_s # requires ActiveSupport::Multibyte
Otherwise, you'll have to install the unicode gem and use it like this:
require 'unicode'
Unicode::capitalize("мария") #=> Мария
Ruby 1.8:
Be sure to use the coding magic comment:
#!/usr/bin/env ruby
puts "мария".capitalize
gives invalid multibyte char (US-ASCII), while:
#!/usr/bin/env ruby
#coding: utf-8
puts "мария".capitalize
works without errors, but also see the "Ruby 2.3 and lower" section for real capitalization.
capitalize first letter of first word of string
"kirk douglas".capitalize
#=> "Kirk douglas"
capitalize first letter of each word
In rails:
"kirk douglas".titleize
=> "Kirk Douglas"
OR
"kirk_douglas".titleize
=> "Kirk Douglas"
In ruby:
"kirk douglas".split(/ |\_|\-/).map(&:capitalize).join(" ")
#=> "Kirk Douglas"
OR
require 'active_support/core_ext'
"kirk douglas".titleize
Rails 5+
As of Active Support and Rails 5.0.0.beta4 you can use either of two methods: String#upcase_first or ActiveSupport::Inflector#upcase_first.
"my API is great".upcase_first #=> "My API is great"
"мария".upcase_first #=> "Мария"
"NASA".upcase_first #=> "NASA"
"MHz".upcase_first #=> "MHz"
"sputnik".upcase_first #=> "Sputnik"
Check "Rails 5: New upcase_first Method" for more info.
Well, just so we know how to capitalize only the first letter and leave the rest of them alone, because sometimes that is what is desired:
['NASA', 'MHz', 'sputnik'].collect do |word|
letters = word.split('')
letters.first.upcase!
letters.join
end
=> ["NASA", "MHz", "Sputnik"]
Calling capitalize would result in ["Nasa", "Mhz", "Sputnik"].
Unfortunately, it is impossible for a machine to upcase/downcase/capitalize properly. It needs way too much contextual information for a computer to understand.
That's why, before version 2.4, Ruby's String class only supported case conversion for ASCII characters, because there it's at least somewhat well-defined.
What do I mean by "contextual information"?
For example, to capitalize i properly, you need to know which language the text is in. English, for example, has only two is: capital I without a dot and small i with a dot. But Turkish has four is: capital I without a dot, capital İ with a dot, small ı without a dot, small i with a dot. So, in English 'i'.upcase # => 'I' and in Turkish 'i'.upcase # => 'İ'. In other words: since 'i'.upcase can return two different results, depending on the language, it is obviously impossible to correctly capitalize a word without knowing its language.
But Ruby doesn't know the language, it only knows the encoding. Therefore it is impossible to properly capitalize a string with Ruby's built-in functionality.
It gets worse: even with knowing the language, it is sometimes impossible to do capitalization properly. For example, in German, 'Maße'.upcase # => 'MASSE' (Maße is the plural of Maß meaning measurement). However, 'Masse'.upcase # => 'MASSE' (meaning mass). So, what is 'MASSE'.capitalize? In other words: correctly capitalizing requires a full-blown Artificial Intelligence.
So, instead of sometimes giving the wrong answer, Ruby (before 2.4) chose to sometimes give no answer at all, which is why non-ASCII characters were simply ignored in downcase/upcase/capitalize operations. (Which of course also leads to wrong results, but at least it's easy to check.)
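For what it's worth, Ruby 2.4 and later let the caller supply part of that missing context: the case-mapping methods accept a :turkic option. The following is only a minimal sketch, assuming Ruby 2.4+ (the variable names are illustrative). It also shows that the German ambiguity described above does not survive a round trip through upcase:

```ruby
# Assumes Ruby 2.4 or later (Unicode case mapping with an optional :turkic mode).
english_upper = "i".upcase            # default mapping
turkish_upper = "i".upcase(:turkic)   # Turkic mapping: dotted capital I (U+0130)

# German: "Maße" and "Masse" share one uppercase form, so the
# original spelling cannot be recovered from the uppercase string.
masse_upper = "Ma\u00DFe".upcase      # "\u00DF" is the letter ß
round_trip  = masse_upper.capitalize

p english_upper, turkish_upper, masse_upper, round_trip
```

The language still has to come from somewhere outside the string; Ruby only applies the mapping it is told to use.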
Use capitalize. From the String documentation:
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase.
"hello".capitalize #=> "Hello"
"HELLO".capitalize #=> "Hello"
"123ABC".capitalize #=> "123abc"
My version:
class String
def upcase_first
return self if empty?
dup.tap {|s| s[0] = s[0].upcase }
end
def upcase_first!
replace upcase_first
end
end
['NASA title', 'MHz', 'sputnik'].map &:upcase_first #=> ["NASA title", "MHz", "Sputnik"]
Check also:
https://www.rubydoc.info/gems/activesupport/5.0.0.1/String%3Aupcase_first
https://www.rubydoc.info/gems/activesupport/5.0.0.1/ActiveSupport/Inflector#upcase_first-instance_method
You can use mb_chars. This respects umlauts:
class String
# Only capitalize first letter of a string
def capitalize_first
self[0] = self[0].mb_chars.upcase
self
end
end
Example:
"ümlaute".capitalize_first
#=> "Ümlaute"
Below is another way to capitalize each word in a string. \w doesn't match Cyrillic characters or Latin characters with diacritics but [[:word:]] does. upcase, downcase, capitalize, and swapcase didn't apply to non-ASCII characters until Ruby 2.4.0 which was released in 2016.
"aAa-BBB ä мария _a a_a".gsub(/\w+/,&:capitalize)
=> "Aaa-Bbb ä мария _a A_a"
"aAa-BBB ä мария _a a_a".gsub(/[[:word:]]+/,&:capitalize)
=> "Aaa-Bbb Ä Мария _a A_a"
[[:word:]] matches characters in these categories:
Ll (Letter, Lowercase)
Lu (Letter, Uppercase)
Lt (Letter, Titlecase)
Lo (Letter, Other)
Lm (Letter, Modifier)
Nd (Number, Decimal Digit)
Pc (Punctuation, Connector)
[[:word:]] matches all 10 of the characters in the "Punctuation, Connector" (Pc) category:
005F _ LOW LINE
203F ‿ UNDERTIE
2040 ⁀ CHARACTER TIE
2054 ⁔ INVERTED UNDERTIE
FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
FE4D ﹍ DASHED LOW LINE
FE4E ﹎ CENTRELINE LOW LINE
FE4F ﹏ WAVY LOW LINE
FF3F ＿ FULLWIDTH LOW LINE
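As a rough illustration of the difference between \w and [[:word:]], the sketch below checks a handful of sample characters against both. The sample list is my own choice, and String#match? assumes Ruby 2.4 or later:

```ruby
# Compare the ASCII-only \w with the Unicode-aware [[:word:]].
# Samples: a, ä (U+00E4), я (U+044F), 5, _, undertie (U+203F), hyphen, space.
samples = ["a", "\u00E4", "\u044F", "5", "_", "\u203F", "-", " "]

samples.each do |char|
  ascii_word   = char.match?(/\w/)
  unicode_word = char.match?(/[[:word:]]/)
  puts format("U+%04X  \\w: %-5s  [[:word:]]: %s", char.ord, ascii_word, unicode_word)
end
```

Only the ASCII letter, the digit and the underscore match both classes; the accented letter, the Cyrillic letter and the undertie match [[:word:]] alone.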
This is another way to only convert the first character of a string to uppercase:
"striNG".sub(/./,&:upcase)
=> "StriNG"
Related
Working on a Ruby challenge to convert dash/underscore delimited words into camel casing. The first word within the output should be capitalized only if the original word was capitalized (known as Upper Camel Case).
My solution so far..:
def to_camel_case(str)
str.split('_,-').collect.camelize(:lower).join
end
However, .camelize(:lower) is a Rails method, I believe, and doesn't work in plain Ruby. Is there an alternative that is equally simple? I can't seem to find one. Or do I need to approach the challenge from a completely different angle?
main.rb:4:in `to_camel_case': undefined method `camelize' for #<Enumerator: []:collect> (NoMethodError)
from main.rb:7:in `<main>'
I assume that:
Each "word" is made up of one or more "parts".
Each part is made up of characters other than spaces, hyphens and underscores.
The first character of each part is a letter.
Each successive pair of parts is separated by a hyphen or underscore.
It is desired to return a string obtained by modifying each part and removing the hyphen or underscore that separates each successive pair of parts.
For each part all letters but the first are to be converted to lowercase.
All characters in each part of a word that are not letters are to remain unchanged.
The first letter of the first part is to remain unchanged.
The first letter of each part other than the first is to be capitalized (if not already capitalized).
Words are separated by spaces.
If this describes the problem correctly, the following method could be used.
R = /(?:(?<=^| )|[_-])[A-Za-z][^ _-]*/
def to_camel_case(str)
str.gsub(R) do |s|
c1 = s[0]
case c1
when /[A-Za-z]/
c1 + s[1..-1].downcase
else
s[1].upcase + s[2..-1].downcase
end
end
end
to_camel_case "Little Miss-muffet sat_on_HE$R Tuffett eating-her_cURDS And_whey"
# => "Little MissMuffet satOnHe$r Tuffett eatingHerCurds AndWhey"
The regular expression can be written in free-spacing mode to make it self-documenting.
R = /
(?: # begin non-capture group
(?<=^| ) # use a positive lookbehind to assert that the next character
# is preceded by the beginning of the string or a space
| # or
[_-] # match '_' or '-'
) # end non-capture group
[A-Za-z] # match a letter
[^ _-]* # match 0+ characters other than ' ', '_' and '-'
/x # free-spacing regex definition mode
Most Rails methods can be added into basic Ruby projects without having to pull in the whole Rails source.
The trick is to figure out the minimum number of files to require in order to define the method you need. If we go to APIDock, we can see that camelize is defined in active_support/inflector/methods.rb.
Therefore active_support/inflector seems like a good candidate to try. Let's test it:
irb(main)> require 'active_support/inflector'
=> true
irb(main)> 'foo_bar'.camelize
=> "FooBar"
Seems to work. Note that this assumes you already ran gem install activesupport earlier. If not, then do it first (or add it to your Gemfile).
In pure Ruby, no Rails, given str = 'my-var_name' you could do:
delimiters = Regexp.union(['-', '_'])
str.split(delimiters).then { |first, *rest| [first, rest.map(&:capitalize)].join }
#=> "myVarName"
Where str = 'My-var_name' the result is "MyVarName", since the first element of the splitting result is untouched, while the rest is mapped to be capitalized.
It works only with "dash/underscore delimited words" that contain no spaces; otherwise, split by spaces first and then map each word with the presented method.
This method splits the string by delimiters, as explained in "Split string by multiple delimiters", and chains the result with Object#then.
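The space-separated case mentioned above could be handled roughly as follows. This is only a sketch: camelize_word and camelize_sentence are hypothetical helper names, not methods from Ruby or Rails.

```ruby
# Hypothetical helpers; the names are illustrative, not from any library.
def camelize_word(word)
  # Keep the first part as-is and capitalize every following part.
  first, *rest = word.split(Regexp.union('-', '_'))
  [first, *rest.map(&:capitalize)].join
end

def camelize_sentence(sentence)
  # Handle space-separated input by converting each word separately.
  sentence.split(' ').map { |word| camelize_word(word) }.join(' ')
end

p camelize_sentence('my-var_name My-var_name')
```

Splitting on spaces first keeps the "first part stays untouched" rule local to each word, so a capitalized word stays capitalized and a lowercase one stays lowercase.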
Using Ruby, I am trying to weed out spam messages the manual way, so why exactly does the below test return false when it should return true? The tested string is the original one, so you can literally copy/paste the whole thing into your ruby console to verify this example:
irb(main):053:0> "Веautiful women fоr sеx in yоur town АU: https://links.wtf/qLFs".include? "sex"
=> false
Hint: If you replace the word "sex" inside the entire string by typing it in yourself, the test will return true as expected. So, somehow, the two "sex" strings are not the same, but on what level? How to test that correctly?
EDIT:
I have narrowed it all down to this (copy/paste it to test it!):
irb(main):073:0> "е" == "e"
=> false
JavaScript's charCodeAt method tells me that the two characters are a different Unicode value. Ruby's .ord method tells me the same thing. You could check against those Unicode values more literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. It looks like that is a 0x0435 1077 CYRILLIC SMALL LETTER IE е according to a Unicode lookup table I found online.
Alternatively, here's one approach where you could just ban all Cyrillic characters. I used a full range of excluded characters so you could add exclusions as needed.
#!/usr/bin/env ruby
# Code points of the Unicode Cyrillic block, U+0400..U+04FF.
CYRILLIC_UNICODE_DECIMALS = [*1024..1279].freeze
ARGV.each do |arg|
  # Print every character whose code point falls in the Cyrillic block.
  arg.each_char do |char|
    p char if CYRILLIC_UNICODE_DECIMALS.include?(char.ord)
  end
end
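A less drastic check than rejecting every Cyrillic character is to flag only words that mix scripts, which is what the spam sample does. The sketch below assumes Ruby 2.4 or later for String#match?; mixed_script_words is a hypothetical helper name.

```ruby
# Hypothetical helper: returns the words that mix Latin and Cyrillic letters.
def mixed_script_words(text)
  text.scan(/[[:alpha:]]+/).select do |word|
    word.match?(/\p{Latin}/) && word.match?(/\p{Cyrillic}/)
  end
end

# "\u0435" is CYRILLIC SMALL LETTER IE, which looks like a Latin "e".
p mixed_script_words("s\u0435x in your town")
p mixed_script_words("plain text")
```

Legitimate Russian text passes this check because its words are Cyrillic throughout; only look-alike substitutions inside otherwise Latin words are reported.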
For reference, these are the .ord and .charCodeAt methods I used against your example. I started with JavaScript because it's a simple test in the browser console.
2.6.3 :005 > 'е'.ord
=> 1077
2.6.3 :006 > 'e'.ord
=> 101
'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
I've got this string: WinterIDäSchwiiz, which comes from an API, and I want to search for it in the database. Now it turns out that this string has a different encoding than how it's saved in my database. Yet Ruby says the encoding for both is UTF-8. What is going on?
I've figured out the most terrible way to fix this problem: going down to the byte sequence, replacing the bytes representing the "ä" with a different byte sequence, and then calling force_encoding with UTF-8. It works but hurts my eyes. Does anyone have a better solution than:
"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i}.pack('C*').force_encoding('utf-8')
Your string is UTF-8.
I can tell because your fix is to replace the bytes (97, 204, 136) with the bytes (195, 164).
The first byte you're replacing, 97 (0x61) is the UTF-8 character a. The second two bytes, 204 and 136 (0xCC 0x88), are the bytes for the UTF-8 character U+0308, the combining diaeresis: ̈. The two characters combine to form ä.
The bytes you're expecting are 195 and 164 (0xC3 0xA4) which, together, are U+00E4, or Latin small letter "a" with diaeresis.
Both are UTF-8. One prints ä and the other prints ä. This is an example of Unicode equivalence.
In other words:
str1 = "a\xCC\x88"
puts str1 # => ä
p str1.bytes # => [97, 204, 136]
p str1.encoding # => #<Encoding:UTF-8>
str2 = "\xC3\xA4"
puts str2 # => ä
p str2.bytes # => [195, 164]
p str2.encoding # => #<Encoding:UTF-8>
Fortunately, we have Unicode normalization to help deal with this. This is a big topic, but the very, very insufficient TL;DR is that the Unicode consortium has prescribed standard ways to normalize strings like the above, i.e. how to turn str1 into str2.
Unfortunately, it's impossible to say what the best solution for you is, since you didn't provide any details. Your database might have built-in normalization functionality, but I don't know what database you're using so I can't say. Since you did mention Ruby I can point you to the String#unicode_normalize method, which was introduced in Ruby's standard library in Ruby 2.2:
str1 = "a\xCC\x88"
str2 = "\xC3\xA4"
p str1 == str2 # => false
str1_normalized = str1.unicode_normalize
p str1_normalized == str2
# => true
p str1_normalized.bytes == str2.bytes
# => true
If you don't have Ruby 2.2+, well... upgrade. But if you can't upgrade for some reason you can use ActiveSupport::Multibyte::Unicode.normalize, which is especially convenient if you're using Rails, or the Unicode gem.
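For a lookup like the one in the question, one option is to normalize both sides to the same form before comparing. This is a minimal sketch assuming Ruby 2.2 or later; same_text? is a hypothetical helper, and the strings use escapes so that the two forms stay distinct on the page.

```ruby
# Assumes Ruby 2.2 or later for String#unicode_normalize.
decomposed = "WinterIDa\u0308Schwiiz"  # "a" followed by U+0308 COMBINING DIAERESIS
composed   = "WinterID\u00E4Schwiiz"   # precomposed U+00E4

# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(left, right)
  left.unicode_normalize(:nfc) == right.unicode_normalize(:nfc)
end

p decomposed == composed            # raw comparison
p same_text?(decomposed, composed)  # comparison after normalization
p decomposed.unicode_normalized?    # is it already in NFC?
```

Normalizing at the point where data enters the system (for example before saving to the database) usually avoids having to do this on every comparison.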
One more thing
You don't need to do this, since the above is the correct way to do Unicode normalization in Ruby, but a much easier way to do this:
"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i }.pack('C*').force_encoding('utf-8')
...would have been this:
"WinterIDäSchwiiz".gsub("a\xCC\x88", "\xC3\xA4")
Any time you see something like join(",")...split(",") in Ruby it's almost certainly the wrong solution.
I'm writing a Rack app to split hostnames ending with certain suffixes.
For example, the hostname (and port) hello.world.lvh.me:3000 needs to be split into tokens hello.world, .lvh.me and :3000. Additionally, the prefix (hello.world), suffix (.lvh.me) and port (:3000) are all optional.
So far, I have a (Ruby) regex that looks like /(.*)(\.lvh\.me)(\:\d+)?/.
This successfully breaks the hostname into component parts but it falls down when one or more of the optional components is missing, e.g. for hello.world:3000 or lvh.me:3000 or even plain old hello.world.
I've tried adding ? to each group to make them optional (/(.*)?(\.lvh\.me)?(\:\d+)?/) but this invariably ends up with the first group, (.*), capturing the entire string and stopping there.
My gut feeling is that this is something which might be solved using lookaround but I'll admit this is a totally new realm of regex for me.
You can try with this pattern:
\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z
The lookahead (?=[^:]) checks that the string starts with a character other than : (in other words, that it is not the port alone). This means that at least hello.world or lvh.me is present.
The first group is optional and non-greedy ??, this means that it is matched only when needed.
\A and \z are anchors for the start and the end of the string (whereas ^ and $ anchor at the start and end of a line).
Note that in Ruby the character class \d already matches only ASCII digits (the Unicode-aware forms are [[:digit:]] and \p{Nd}), so [0-9] is equivalent here and simply makes the intent explicit.
Note too that \A(?=[^:])((?>[^l:\n.]+|\.|\Bl|l(?!vh\.me\b))*)((?:\.|\A)lvh\.me)?(:[0-9]+)?\z may be more performant.
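To see how the first pattern behaves on the hostnames from the question, it can be wrapped in a small method. This is only a sketch; split_host is a hypothetical name, and a group that does not take part in the match comes back as nil.

```ruby
# Hypothetical wrapper around the first pattern above.
def split_host(host)
  pattern = /\A(?=[^:])(.+?)??((?:\.|\A)lvh\.me)?(:[0-9]+)?\z/
  match = pattern.match(host)
  # Returns [prefix, suffix, port], or nil when the string does not match.
  match && match.captures
end

p split_host('hello.world.lvh.me:3000')
p split_host('hello.world:3000')
p split_host('lvh.me:3000')
p split_host('hello.world')
p split_host(':3000')
```

Note that the suffix keeps its leading dot only when a prefix precedes it, and that a bare port is rejected by the lookahead.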
Try ^(.*?)?(\.?lvh\.me)?(\:\d+)?$
I added:
a ? to the first group making the * non-greedy
^,$ to anchor it to the start and end.
a ? to the \. before lvh because you want to match lvh.me:3000 not .lvh.me:3000
A Tokenizing Answer
Just for fun, I decided to see if there was a relatively simple way to do what you wanted without a complicated regular expression. The only regular expressions I used were for splitting and validation.
This works for me with your provided corpus, and several variations.
str = 'hello.world.lvh.me:3000'
tokens = str.split /[.:]/
port = tokens.last =~ /\A\d+\z/ ? ?: + tokens.pop : ''
domain = sprintf '.%s.%s', *tokens.pop(2)
prefix = tokens.join ?.
You'll certainly need to check for empty strings in certain cases, but it seems like it might be more straightforward and/or flexible than a pure regex solution. I find it more readable, anyway. If you truly need a single regular expression, though, I'm sure one of the other answers will help you out.
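One way the empty-string checks might look is sketched below. It keeps the same split-and-pop idea but only treats the last two tokens as the domain when they are actually lvh and me. The method name tokenize_host and the hard-coded suffix are assumptions for illustration.

```ruby
# Hypothetical method; assumes the only suffix of interest is lvh.me.
def tokenize_host(str)
  tokens = str.split(/[.:]/)
  # A trailing all-digit token is taken to be the port.
  port   = tokens.last.to_s.match?(/\A\d+\z/) ? ":#{tokens.pop}" : ''
  # Only pop the last two tokens when they really are the known suffix.
  domain = tokens.last(2) == %w[lvh me] ? ".#{tokens.pop(2).join('.')}" : ''
  prefix = tokens.join('.')
  [prefix, domain, port]
end

p tokenize_host('hello.world.lvh.me:3000')
p tokenize_host('hello.world:3000')
p tokenize_host('lvh.me')
p tokenize_host('')
```

Two caveats of this sketch: the domain is always reported with a leading dot, even when there is no prefix, and a hostname whose last label is all digits would be mistaken for a port because the colon is discarded by the split.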
You could try splitting rather than matching,
irb(main):012:0> "hello.world.lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me", "3000"]
irb(main):013:0> "hello.world:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "3000"]
irb(main):014:0> "lvh.me:3000".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["lvh.me", "3000"]
irb(main):015:0> "hello.world".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world"]
irb(main):016:0> "hello.world.lvh.me".split(/\.(?=[^.:]+\.[^:.]+(?::\d+)?$)|:/)
=> ["hello.world", "lvh.me"]
Look, ma, no regex!
def split_up(str)
str.sub(':','.:')
.split('.')
.each_slice(2)
.map { |arr| arr.join('.') }
end
split_up("hello.world.lvh.me:3000") #=> ["hello.world", "lvh.me", ":3000"]
split_up("hello.world:3000") #=> ["hello.world", ":3000"]
split_up("hello.world.lvh.me") #=> ["hello.world", "lvh.me"]
split_up("hello.world") #=> ["hello.world"]
split_up("") #=> []
Steps:
str1 = "hello.world.lvh.me:3000" #=> "hello.world.lvh.me:3000"
str2 = str1.sub(':','.:') #=> "hello.world.lvh.me.:3000"
arr = str2.split('.') #=> ["hello", "world", "lvh", "me", ":3000"]
enum = arr.each_slice(2) #=> #<Enumerator: ["hello", "world", "lvh",
# "me", ":3000"]:each_slice(2)>
enum.to_a #=> [["hello", "world"], ["lvh", "me"],
# [":3000"]]
enum.map { |arr| arr.join('.') } #=> ["hello.world", "lvh.me", ":3000"]
I need to make the first character of every word uppercase, and make the rest lowercase...
manufacturer.MFA_BRAND.first.upcase
is only setting the first letter uppercase, but I need this:
ALFA ROMEO => Alfa Romeo
AUDI => Audi
BMW => Bmw
ONETWO THREE FOUR => Onetwo Three Four
In Rails:
"kirk douglas".titleize => "Kirk Douglas"
#this also works for 'kirk_douglas'
w/o Rails:
"kirk douglas".split(/ |\_/).map(&:capitalize).join(" ")
#OBJECT IT OUT
def titleize(str)
str.split(/ |\_/).map(&:capitalize).join(" ")
end
#OR MONKEY PATCH IT
class String
def titleize
self.split(/ |\_/).map(&:capitalize).join(" ")
end
end
w/o Rails (load rails's ActiveSupport to patch #titleize method to String)
require 'active_support/core_ext'
"kirk douglas".titleize #=> "Kirk Douglas"
(some) string use cases handled by #titleize
"kirk douglas"
"kirk_douglas"
"kirk-douglas"
"kirkDouglas"
"KirkDouglas"
#titleize gotchas
Rails's titleize will convert things like dashes and underscores into spaces and can produce other unexpected results, especially with case-sensitive situations as pointed out by @JamesMcMahon:
"hEy lOok".titleize #=> "H Ey Lo Ok"
because it is meant to handle camel-cased code like:
"kirkDouglas".titleize #=> "Kirk Douglas"
To deal with this edge case you could clean your string with #downcase first before running #titleize. Of course if you do that you will wipe out any camelCased word separations:
"kirkDouglas".downcase.titleize #=> "Kirkdouglas"
try this:
puts 'one TWO three foUR'.split.map(&:capitalize).join(' ')
#=> One Two Three Four
or
puts 'one TWO three foUR'.split.map(&:capitalize)*' '
With Active Support loaded, use "hello world".titleize, which should output "Hello World".
Another option is to use a regex and gsub, which takes a block:
'one TWO three foUR'.gsub(/\w+/, &:capitalize)
"hello world".split.each{|i| i.capitalize!}.join(' ')
Look into the String#capitalize method.
http://www.ruby-doc.org/core-1.9.3/String.html#method-i-capitalize
If you are trying to capitalize the first letter of each word in an array you can simply put this:
array_name.map(&:capitalize)
I used this for a similar problem:
'catherine mc-nulty joséphina'.capitalize.gsub(/(\s+\w)/) { |stuff| stuff.upcase }
This handles the following weird cases I saw trying the previous answers:
non-word characters like -
accented characters common in names like é
capital characters in the middle of the string
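One caveat with the pattern above: \w is ASCII-only in Ruby, so a word that begins with an accented letter after a space is left in lowercase. A possible variant, sketched here under the assumption of Ruby 2.4 or later (for Unicode-aware downcase and upcase), uses [[:alpha:]] instead; capitalize_words is a hypothetical name.

```ruby
# Hypothetical helper: upcase the first letter of each whitespace-separated
# word and downcase everything else, including non-ASCII letters.
def capitalize_words(str)
  str.downcase.gsub(/(?:\A|\s)[[:alpha:]]/) { |match| match.upcase }
end

# "\u00E9" is é; the escapes keep the sample unambiguous.
p capitalize_words("catherine mc-nulty \u00E9lodie")
p capitalize_words("ALFA ROMEO")
```

Like the original, this leaves the letter after a hyphen in lowercase, since only the start of the string and whitespace count as word boundaries.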