Special character uppercase - ruby

I have strings with a bunch of special characters. This works:
myString.upcase.tr('æ-ý','Æ-Ý')
However, it is not really cross-platform: my Ruby implementation on Windows won't accept it, while on my Mac and Linux machines it works like a charm. Any pointers, workarounds, or solutions are really appreciated!

Try the mb_chars method if you are using Rails >= 3. For example,
'æ-ý'.mb_chars.upcase
=> "Æ-Ý"
If you're not using Rails, try the unicode gem.
Unicode::upcase('æ-ý')
Or you can override String class methods as well:
require "unicode"

class String
  def downcase
    Unicode::downcase(self)
  end

  def downcase!
    self.replace downcase
  end

  def upcase
    Unicode::upcase(self)
  end

  def upcase!
    self.replace upcase
  end

  def capitalize
    Unicode::capitalize(self)
  end

  def capitalize!
    self.replace capitalize
  end
end

Unfortunately, it is impossible to correctly upcase/downcase a string without knowing the language and, in some cases, even the contents of the string.
For example, in English the uppercase variant of i is I and the lowercase variant of I is i, but in Turkish the uppercase variant of i is İ and the lowercase variant of I is ı. In German, the uppercase variant of ß is SS, but so is the uppercase variant of ss, so to downcase, you need to understand the text, because e.g. MASSE could be downcased to either masse (mass) or maße (measurements).
Ruby (at least before version 2.4, which added Unicode case mapping) takes the easy way out and only uppercases/downcases within the ASCII alphabet.
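For what it's worth, the language dependence described above can be observed directly on newer Rubies. This is only a minimal sketch, assuming Ruby 2.4 or later, where String#upcase and String#downcase apply Unicode case mapping and accept a :turkic option; the variable names are purely illustrative:

```ruby
# Assumes Ruby 2.4+; older versions only map ASCII letters here.
nordic_upper  = "æøý".upcase           # Unicode-aware mapping of non-ASCII letters
german_upper  = "ß".upcase             # one character becomes two
german_lower  = "MASSE".downcase       # cannot know whether "maße" was meant
turkish_upper = "i".upcase(:turkic)    # dotted capital I (U+0130)
turkish_lower = "I".downcase(:turkic)  # dotless small i (U+0131)

puts [nordic_upper, german_upper, german_lower, turkish_upper, turkish_lower].inspect
```

Note that the Turkish behaviour has to be requested explicitly: Ruby still has no way of guessing the language of a string.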
However, that only explains why your workaround is needed, not why it sometimes works and sometimes doesn't. Provided that you use the same Ruby version and the same Ruby implementation and the same version of the implementation on all platforms, it should work. YARV doesn't use the underlying platform's string manipulation routines much (the same is true for most Ruby implementations, actually, even JRuby doesn't use Java's powerful string libraries but rolls its own for maximum compatibility), and it also doesn't use any third-party libraries (like e.g. ICU) except Onigmo, so it's unlikely that platform differences are to blame. Different versions of Ruby use different versions of the Unicode Character Database, though (e.g. I believe it was updated somewhere between 1.9 and 2.2 at least once), so if you have a version mismatch, that might explain it.
Or, it might be a genuine bug in YARV on Windows. Maybe try JRuby? It tends to be more consistent between platforms, in fact, on Windows, it is more compatible with Ruby than Ruby (i.e. YARV) itself!

Related

Modify string class to use only uppercase

I have a small application that processes some basic data (Names, birthdate, etc.). It will be interfacing with a management system that only accepts uppercase strings. Thinking of ways to go about this, I know I could just use .upcase for all the variables. I figured the most DRY way would be to modify the String class itself and make a conversion, but could not find any documentation as to the method within String that actually takes in the value of said string. The more I think about it, I also do not know what the implications of doing it this way would be (if it's even possible).
I tried monkey patching the String class
class String
  def initialize
    self = self.upcase
  end
end
Or
class String
  def new(str="")
    new_str = str.upcase
  end
end
But I haven't found any info on how a string is actually initialized.
Tl;Dr
How can I convert a lower case string to uppercase on said string's initialization?
Are there any implications I should be aware of if it is possible?
Thank you for your time.
The solution here is not to boil the ocean and make every string in Ruby force everything to uppercase, but to uppercase the things that system needs if and when you provide it to that system.
Changing fundamental Ruby classes in this dramatic a way is bound to cause your entire code-base to implode. Many internals depend on being able to store arbitrary data in strings, and if those strings are arbitrarily uppercased you're in big trouble. It's like redefining what Integer#+ does. You can, but you really, really shouldn't. This would be akin to redefining the electrical charge of a proton. The universe would literally explode.
It's better to write some kind of adapter method that can operate on arbitrary strings or values and make sure they conform to whatever quirks or encoding your other system uses:
def to_archaic(string)
  string.upcase
end
If, for example, they don't allow accented characters or emoji, you'll need to strip those out and/or convert them to something else. Maybe "é" becomes "E" or maybe you just delete it.
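As a rough illustration of such an adapter, here is a sketch that strips accents and deletes anything that is still not ASCII before uppercasing. The method name to_legacy_upcase and the "ASCII only" rule are assumptions made for the example, not something the target system is known to require; it also assumes Ruby 2.2 or later for String#unicode_normalize and UTF-8 input:

```ruby
# Hypothetical adapter: the name and the ASCII-only policy are assumptions.
def to_legacy_upcase(value)
  # Decompose accented letters, e.g. "é" becomes "e" plus a combining accent.
  decomposed = value.to_s.unicode_normalize(:nfd)
  # Drop the combining marks, leaving the base letters.
  unaccented = decomposed.gsub(/\p{Mn}/, "")
  # Delete whatever still cannot be represented in ASCII (emoji and so on).
  ascii_only = unaccented.encode("US-ASCII", undef: :replace, replace: "")
  ascii_only.upcase
end

puts to_legacy_upcase("José née Müller")
```

Keeping this in one place means the rest of the application can keep using ordinary strings, and only the boundary to the other system knows about its quirks.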

undefined method `to_a' for "ens160":String

Ruby version - 2.0 gives this error
# Convert values to a pair of bytes ...
interface = values[0]
values.collect! { |i| i.to_a.pack('H*') }
with the same code, we never faced this sort of issue in ruby 1.8.7
As of Ruby 1.9.0, Strings are no longer Enumerable. You can't simply iterate over a String or convert it to an Array – what would you iterate over? What would the elements of the Array be?
In different contexts, a String can be interpreted as
a sequence of bytes,
a sequence of octets,
a sequence of codepoints,
a sequence of characters,
a sequence of lines,
a sequence of words,
a sequence of sentences,
a sequence of paragraphs,
a sequence of sections,
… and many other things.
You have to tell Ruby what interpretation you want. That's what the various iteration methods in the String class are for:
String#each_byte
String#each_char
String#each_codepoint
String#each_line
There are also corresponding methods which represent the String as an Array:
String#bytes
String#chars
String#codepoints
String#lines
Note that all of those methods already exist in Ruby 1.8.7 as well, and in fact, treating Strings as Enumerables was considered deprecated in Ruby 1.8.7.
It is unclear from your code what exactly you are trying to do, but my best guess is that you are looking for String#chars.
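To make the choice concrete, here is a small sketch of what the explicit conversions return on Ruby 2.0 or later. The last part rests on an assumption about the original intent: since String#to_a in 1.8.7 returned the string's lines, `i.to_a.pack('H*')` on a single-line value would have behaved like `[i].pack('H*')`. The hex value used is made up for the example:

```ruby
name = "ens160"

as_bytes = name.bytes   # integers, one per byte
as_chars = name.chars   # one-character strings
as_lines = name.lines   # the whole string, since it has a single line

# Assumed equivalent of the old `i.to_a.pack('H*')` for a one-line string:
# wrap the string in an Array yourself instead of calling String#to_a.
hex_value = "656e73"                # illustrative hex digits
packed    = [hex_value].pack("H*")  # each pair of hex digits becomes one byte

puts as_bytes.inspect, as_chars.inspect, as_lines.inspect, packed.inspect
```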
To answer your exact question
Why does to_a() not work the same way in ruby 2.0 as it worked in ruby 1.8.7?
It is because in ruby 1.8.7, strings were enumerables: https://ruby-doc.org/core-1.8.7/String.html. Which means they included the Enumerable module, which had method .to_a.
This was already not the case in ruby 1.9.3 and up. That's why.
So either use ruby 1.8.7 everywhere or change that facts retrieval code (or whatever it is) to not use now-nonexistent String#to_a.

In Ruby can data interpolated into a string cause the string to terminate?

In Ruby is there any way that data added to a string with interpolation can terminate the string? For example something like:
"This remains#{\somekindofmagic} and this does not" # => "This remains"
I'm assuming not but I want to be sure that doing something like
something.send("#{untrusted_input}=", more_untrusted_input)
doesn't actually leave some way that the interpolated string could be terminated and used to send eval.
Not possible with input string data AFAIK. Ruby Strings can contain arbitrary binary data, there should be no magic combination of bytes that terminates a String early.
If you are worried about "injection" style attacks on Ruby strings, then this is generally not easy to achieve if input is in the form of external data that has been converted to a string (and your specific concern about having an eval triggered cannot occur). This style of attack relies on code that passes an input string into some other interpreter (e.g. SQL or JavaScript) without properly escaping language constructs.
However, if String parameters are coming in the form of Ruby objects from untrusted Ruby code in the same process, it is possible to add side-effects to them:
class BadString
  def to_s
    puts "Payload"
    "I am innocent"
  end
end

b = BadString.new
c = "Hello #{b}"
Payload
=> "Hello I am innocent"
Edit: Your example
something.send("#{untrusted_input}=", more_untrusted_input)
would still worry me slightly, if untrusted_input really is untrusted, you are relying heavily on the fact that there are no methods ending in = that you would be unhappy to have called. Sometimes new methods can be defined on core classes due to use of a framework or gem, and you may not know about them, or they may appear in later versions of a gem. Personally I would whitelist allowed method names for that reason, or use some other validation scheme on the incoming data, irrespective of how secure you feel against open-ended evals.
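A whitelist along those lines could look roughly like this. It is only a sketch: the constant ALLOWED_SETTERS, the helper assign_untrusted, and the Account struct are all hypothetical names introduced for the example:

```ruby
# Hypothetical allow-list; the attribute names are illustrative assumptions.
ALLOWED_SETTERS = %w[name email].freeze

def assign_untrusted(target, attribute, value)
  # Refuse anything that is not explicitly listed, whatever methods exist.
  unless ALLOWED_SETTERS.include?(attribute.to_s)
    raise ArgumentError, "attribute not permitted: #{attribute.inspect}"
  end
  # public_send also refuses to call private methods, unlike send.
  target.public_send("#{attribute}=", value)
end

Account = Struct.new(:name, :email, :admin)
account = Account.new("old", "old@example.com", false)

assign_untrusted(account, "name", "new")  # allowed

rejected = begin
  assign_untrusted(account, "admin", true)  # not on the list
  false
rescue ArgumentError
  true
end

puts account.inspect, rejected
```

The point of the explicit list is that setters added later by a gem or framework stay unreachable until someone deliberately allows them.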
Strings in ruby are internally handled as an array of bytes on the heap and an integer that holds the length of the string. So while in C a NUL byte (\0) terminates a string, this can not happen in ruby.
More info on ruby string internals here: http://patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters (also includes why ruby strings longer than 23 bytes were slower in ruby 1.9).
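A quick way to convince yourself is to interpolate data containing the usual suspects and check that everything is still there. A minimal sketch, with made-up input:

```ruby
# Untrusted data containing a NUL byte, a double quote and text that
# looks like an interpolation (escaped here so it stays literal).
untrusted = "cut here\0 or here\" or even here\#{1 + 1}"

message = "This remains #{untrusted} and so does this"

puts message.inspect
puts message.bytesize
```

The interpolated value is copied in as plain data: the NUL byte and the quote are just bytes, and the `#{...}` inside the data is not evaluated a second time.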

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, and the results look something like this when written as code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
short answer: you should be able to 'puts a', and see the string printed out. for me, at least, I can print out that string in both 1.8.7 and 1.9.2
long answer:
First thing: it depends on if you're using ruby 1.8.7, or 1.9.2, since the way strings and encodings were handled changed.
in 1.8.7:
strings are just lists of bytes. when you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. if you do a[0], you'll get the first byte. if you want to get each character, things are pretty darn tricky.
in 1.9.2
strings are lists of bytes, with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. if not, you'll have to set it (as per Mike Lewis's answer). if you do a[0], you'll get the first character (the heart). if you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ascii characters, my previous answer is obviously invalid, disregard it. :P
here's what you can do:
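a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
this will scan for the ascii string '\u' followed by four hexadecimal digits, and replace it with the correct unicode character. (matching exactly four hex digits keeps the pattern from swallowing ordinary letters that happen to follow an escape.)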
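Put together with the string from the question, the substitution can be tried like this. It is a sketch that assumes Ruby 1.9 or later and only handles four-digit escapes; characters outside the Basic Multilingual Plane, which arrive as surrogate pairs, would need extra handling:

```ruby
# The crawled text, with literal backslash-u sequences in it.
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"

# Replace each escape with the character for that code point.
# pack("U") produces a UTF-8 encoded string.
decoded = a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }

puts decoded
puts decoded.encoding
```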
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
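For reference, declaring the encoding at the point where the IO object is opened looks roughly like this on Ruby 1.9 or later. The temporary file only stands in for whatever source the crawler reads from, and UTF-8 is an assumption about that source:

```ruby
require "tempfile"

# Stand-in for crawled data: raw bytes written without any encoding info.
file = Tempfile.new("crawl")
file.binmode
file.write("\xE2\x99\xA5 ok".b)
file.close

# Option 1: pass the encoding as a keyword to File.read.
text = File.read(file.path, encoding: "UTF-8")

# Option 2: put the external encoding in the open mode.
via_open = File.open(file.path, "r:UTF-8") { |io| io.read }

file.unlink

puts text.encoding, via_open.encoding
```

Either way, every string that comes out of the IO object is already tagged, so nothing downstream has to remember to call force_encoding.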

What options do exist now to implement UTF8 in Ruby and RoR?

Following the development of Ruby very closely I learned that detailed character encoding is implemented in Ruby 1.9. My question for now is: How may Ruby be used at the moment to talk to a database that stores all data in UTF8?
Background: I am involved in a new project where Ruby/RoR is at least an option. But the project needs to rely on an internationalized character set (it's spread over many countries), preferably UTF8.
So how do you deal with that? Thanks in advance.
Ruby 1.8 works fine with UTF-8 strings for basic operations with the strings. Depending on your application's need, some operations will either not work or not work as expected.
For example:
1) The size of a string is reported in bytes, not characters, since multi-byte support is not there yet. But do you need to know the size of your strings in characters?
2) You cannot reliably split a string at a character boundary. But do you need this?
3) Sorting order will be funky if sorted in Ruby. The suggestion of using the db to sort is a good idea.
etc.
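For comparison, here is how the same points look once the encoding-aware String of Ruby 1.9 or later is available. This is a sketch with made-up sample words, not something that runs on 1.8:

```ruby
word = "est\u00E1s"  # "estás", written with an escape for the accented letter

byte_count = word.bytesize  # the accented letter takes two bytes in UTF-8
char_count = word.length    # counted in characters on 1.9+
first_four = word[0, 4]     # slices on character boundaries

# Sorting is still by code point, so accented words land after "z".
sorted = ["zebra", "\u00E1rbol", "apple"].sort

puts byte_count, char_count, first_four, sorted.inspect
```

Even on newer Rubies the sort order is not what users of most languages expect, which is another reason to leave sorting to the database and its collations.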
Re poster's comment about sorting data after reading from db: As noted, results will probably not match users' expectations. So the solution is to sort on the db. And it will usually be faster, anyhow--databases are designed to sort data.
Summary: My Ruby 1.8.6 RoR app works fine with international Unicode characters processed and stored as UTF-8 on modern browsers. Right to left languages work fine too. Main issues: be sure that your db and all web pages are set to use UTF-8. If you already have some data in your db, then you'll need to go through a conversion process to change it to UTF-8.
Regards,
Larry
"Unicode ahoy! While Rails has always been able to store and display unicode with no beef, it’s been a little more complicated to truncate, reverse, or get the exact length of a UTF-8 string. You needed to fool around with KCODE yourself and while plenty of people made it work, it wasn’t as plug’n’play easy as you could have hoped (or perhaps even expected).
So since Ruby won’t be multibyte-aware until this time next year, Rails 1.2 introduces ActiveSupport::Multibyte for working with Unicode strings. Call the chars method on your string to start working with characters instead of bytes."
Although I haven't tested it, the character-encodings library (currently in alpha) adds methods to the String class to handle UTF-8 and others. Its page on RubyForge is here. It is designed for Ruby 1.8.
It is my experience, however, that, using Ruby 1.8, if you store data in your database as UTF-8, Ruby will not get in the way as long as your character encoding in the HTTP header is UTF-8. It may not be able to operate on the strings, but it won't break anything. Example:
file.txt:
¡Hola! ¿Como estás? Leí el artículo. ¡Fue muy excellente!
Pardon my poor Spanish; it was the best example of Unicode I could come up with.
in irb:
str = File.read("file.txt")
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\n"
str += "Foo is equal to bar."
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
str = " " + str + " "
=> " \302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar. "
str.strip
=> "\302\241Hola! \302\277Como est\303\241s? Le\303\255 el art\303\255culo. \302\241Fue muy excellente!\nFoo is equal to bar."
Basically, it will just treat the UTF-8 as ASCII with odd characters in it. It will not sort lexicographically if the code points are out of order; however, it will sort by code point. Example:
"\302" <=> "\301"
=> 1
How much are you planning on operating on the data in the Rails app, anyway? Most sorting etc. is usually done by your database engine.

Resources