Preventing a string from being converted into an octal number in Ruby

Assume we have the following Ruby code:
require 'yaml'
h={"key"=>[{"step1"=>["0910","1223"]}]}
puts h.to_yaml
"0910" is a string
but after to_yaml conversion, string turns into octal number.
---
key:
- step1:
  - 0910
  - '1223'
The problem is that I cannot change the h variable. I receive it from outside, and I need to solve the problem without modifying it.

You are mistaken that there is an octal number in your YAML output. The YAML spec refers to octal on two occasions, and both clearly indicate that an octal number in a YAML file starts with 0o (which is similar to what Ruby and newer versions of Python use for specifying octal; Python also dropped support for plain leading-zero octals in version 3, while Ruby doesn't seem to have done that yet).
The convention of indicating octal integers with only a leading 0 has proven confusing in many languages and was dropped from the YAML specification six years ago. It might be that your parser still supports it, but it shouldn't.
In any case, the characters 8 and 9 can never occur in an integer represented in octal, so there is no risk of that unquoted scalar being read as an octal number.
The string 1223, on the other hand, could be interpreted as a normal integer, so it must always be represented as a quoted string scalar.
The interesting thing would be to see what happens when you dump the string "0708". If your YAML library is up to date with the spec (version 1.2), it can just dump this as an unquoted scalar: because the leading zero is not followed by o (or x), there can be no confusion with an octal (or hexadecimal) number either. But for compatibility with old parsers (from before 2009) your library might quote it anyway, to be on the safe side.
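For what it's worth, here is a quick check you could run yourself (a minimal sketch; the exact output depends on which Psych/libyaml version ships with your Ruby):

require 'yaml'

puts "0708".to_yaml
# Conservative dumpers quote it for YAML 1.1 compatibility:
#   --- '0708'
# A strictly 1.2-minded dumper could emit it unquoted:
#   --- 0708

puts YAML.load("--- 0708\n").class
# Most Ruby parsers load the unquoted form back as a String anyway,
# since 0708 is neither a valid octal nor a plain decimal integer.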

According to the YAML spec, numbers prefixed with a 0 signal an octal base (as they do in Ruby). However, 0910 is not a valid octal number, so it doesn't get quoted.
When you come to load this data from the YAML file, the data appears exactly as you need.
0> h={"key"=>[{"step1"=>["0910","1223"]}]}
=> {"key"=>[{"step1"=>["0910", "1223"]}]}
0> yaml_h = h.to_yaml
=> "---\nkey:\n- step1:\n - 0910\n - '1223'\n"
0> YAML.load(yaml_h)
=> {"key"=>[{"step1"=>["0910", "1223"]}]}
If you can't use the data in this state perhaps you could expand on the question and give more detail.

I had a similar task.
In secrets.yml I use:
processing_eth_address: "0x5774226def39e67d6afe6f735f9268d63db6031b"
OR
processing_eth_address: <%= "\'#{ENV["PROCESSING_ETH_ADDRESS"]}\'" %>

My Ruby doesn't do this octal conversion, but I had a similar issue with dates. I used to_yaml(canonical: true) to get around it. It's more verbose, but it's correct.
{"date_of_birth" => "1991-02-29"}.to_yaml
=> "---\ndate_of_birth: 1991-02-29\n"
{"date_of_birth" => "1991-02-29"}.to_yaml(canonical: true)
=> "---\n{\n ? \"date_of_birth\"\n : \"1991-02-29\",\n}\n"

Related

Ruby, problems comparing strings with UTF-8 characters

I have these 2 UTF-8 strings:
a = "N\u01b0\u0303"
b = "N\u1eef"
They look pretty different, but they are the same once they are rendered:
irb(main):039:0> puts "#{a} - #{b}"
Nữ - Nữ
The a version is the one I have stored in the DB. The b version is the one coming from the browser in a POST request. I don't know why the browser is sending a different combination of UTF-8 characters, and it doesn't always happen; I can't reproduce the issue in my dev environment, it happens in production and only in a percentage of the total requests.
The problem is that when I try to compare them, the comparison returns false:
irb(main):035:0> a == b
=> false
I've tried different things like forcing encoding:
irb(main):022:0> c.force_encoding("UTF-8") == a.force_encoding("UTF-8")
=> false
Another interesting fact is:
irb(main):005:0> a.chars
=> ["N", "ư", "̃"]
irb(main):006:0> b.chars
=> ["N", "ữ"]
How can I compare this kind of string?
This is an issue with Unicode equivalence.
The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests is a combining character, which when rendered is combined with the previous character to produce the final glyph.
The b version of the string uses the character ữ (U+1EEF, LATIN SMALL LETTER U WITH HORN AND TILDE) which is a single character, and is equivalent to the previous combination, but uses a different byte sequence to represent it.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Current versions of Ruby have this built in (in earlier versions you needed to use a third party library).
So currently you have
a == b
which is false, but if you do
a.unicode_normalize == b.unicode_normalize
you should get true.
If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:
a.mb_chars.normalize == b.mb_chars.normalize
or perhaps something like:
ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)
If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:
UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)
(nfkc refers to the normalisation form, it is the same as the default in the other techniques.)
There are various ways to normalise Unicode strings (e.g. whether you use the decomposed or composed forms), and this example just uses the default. I'll leave researching the differences to you.
You can see these are distinct character sequences; compare the code charts for the first and the second form. In the first case, the string uses the modifier "combining tilde".
Wikipedia has a section on this:
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
and
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.
It seems that Ruby supports this normalization, but only as of Ruby 2.2:
http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html
a = "N\u01b0\u0303".unicode_normalize
b = "N\u1eef".unicode_normalize
a == b # true
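To make the choice of normalization form explicit, here is a minimal irb-style sketch (standard library only, Ruby 2.2+); either the composed (NFC) or decomposed (NFD) form works, as long as both sides use the same one:

a = "N\u01b0\u0303"   # decomposed: U+01B0 followed by combining tilde U+0303
b = "N\u1eef"         # precomposed: U+1EEF

a == b                                                   # => false
a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)   # => true
a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)   # => true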
Alternatively, if you are using Ruby on Rails, there appears to be a built-in method for normalization.

string to integer conversion omitting first character of charset

This is more of a general problem than a Ruby-specific one; I just happen to be doing it in Ruby. I am trying to convert a string into an Integer/Long/Bigint, or whatever you want to call it, using a charset, for example Base62 (0-9a-zA-Z).
The problem is that when I convert a string like "0ab" into an integer and that integer back to a string, I get "ab". This occurs with any string starting with the first character of the alphabet.
Here is an example implementation, that has the same issue.
https://github.com/jtzemp/base62/blob/master/lib/base62.rb
In action:
2.1.3 :001 > require 'base62'
=> true
2.1.3 :002 > Base62.decode "0ab"
=> 2269
2.1.3 :003 > Base62.encode 2269
=> "ab"
I might be missing the obvious.
How can I convert bidirectionally without that exception?
You're correct that this is a more general problem.
One solution is to use "padding", which adds extra information to signal things like missing bits or a conversion that isn't perfectly clean.
In your particular code, for example, you are currently losing the leading character whenever it is the first primitive. This is because the leading character has index zero, and adding zero to your int doesn't change anything.
In your code, the padding could be accomplished in a variety of ways.
For example, by prepending a fixed leading character that is not the first primitive.
Essentially, you need to choose a way to protect the zero value so that it is not lost when stored in the int.
An alternative solution is to change your storage from an int to any other object that doesn't lose leading zeros, such as a string. This is how a typical Base64 encoding class does it: the input is a string, and the storage is also a string.
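To make the failure mode concrete, here is a minimal base62 codec sketch (a hypothetical helper, not the base62 gem, and its alphabet order differs from the gem's), together with one possible sentinel-padding workaround:

ALPHABET = ('0'..'9').to_a + ('a'..'z').to_a + ('A'..'Z').to_a

def decode62(str)
  str.chars.reduce(0) { |n, c| n * 62 + ALPHABET.index(c) }
end

def encode62(n)
  return ALPHABET[0] if n.zero?
  s = ''
  while n > 0
    s = ALPHABET[n % 62] + s
    n /= 62
  end
  s
end

decode62("0ab")    # => 631 with this alphabet; the leading "0" adds no value
encode62(631)      # => "ab" -- the leading "0" cannot be recovered

# Sentinel padding: prepend a digit that is not the zero primitive before
# encoding, and strip it again after decoding, so leading zeros survive.
n = decode62("1" + "0ab")
encode62(n)[1..-1] # => "0ab"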

Split utf8 string regardless of ruby version

str = "é-du-Marché"
I get the first char via
str.split(//).first
How can I get the rest of the string, regardless of my Ruby version?
String does not have a first method, so you need a split in addition. When you do the split in Unicode mode (specifically UTF-8) you have access to the first (and the other) characters.
My solution:
puts RUBY_VERSION
str = "é-du-Marché"
p str.split(//u, 2)
Test with ruby 1.9.2:
1.9.2
["\u00E9", "-du-March\u00E9"]
Test with ruby 1.8.6:
1.8.6
["\303\251", "-du-March\303\251"]
With first and last you get your results:
str.split(//u, 2).first is the first character
str.split(//u, 2).last is the string after the first character.
str[1..-1] should normally return everything after the first character.
The range starts at index 1, which skips the first character; the -1 is the end index, and negative indices make Ruby count from the back, so -1 means up to the last character.
Note: multibyte characters only work like this in Ruby 1.9. If you wish to mimic this behaviour on older versions, you'll have to loop over the bytes yourself and figure out what needs to be removed from the data, because Ruby 1.8 does not support this.
UPDATE:
You could try this as well, but I can't guarantee that it will work for every multibyte char:
str = "é-du-Marché"
substring = str.mb_chars[1..-1]
mb_chars returns a proxy class that directs calls to the appropriate implementation when dealing with UTF-8, UTF-32 or UTF-16 encoded characters (i.e. multibyte characters).
More detailed info can be found here : http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
But I do not know whether this exists in older Rails versions.
UPDATE2:
Ruby 1.8 treats any string as just a bunch of bytes; calling size() on it returns the number of bytes used to store the data. To determine the characters regardless of the encoding, try this:
char_array = str.scan(/./m)
substring = char_array[1..-1].join
This should normally do the trick. Take a look at http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18, which explains how to treat multibyte data in older Ruby versions.
EDIT3:
Playing around with the scan & join operations brings me closer to your problem and solution. I honestly don't have the time to get the full solution working, but if you play with the scan(/./mu) options you read the string as UTF-8, which is supported by all Ruby versions.
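Putting that last idea together, a small sketch of a version-agnostic split (the /u flag forces UTF-8 matching on 1.8 and is harmless on 1.9+, where strings already carry their encoding):

str = "é-du-Marché"

chars = str.scan(/./mu)       # one array element per character, not per byte
first = chars.first           # => "é"
rest  = chars[1..-1].join     # => "-du-Marché"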

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the Unicode code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
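If pulling in ActiveSupport only for this feels heavy, the plain json gem's generator can, as far as I recall its options, be asked to emit ASCII-only output as well (treat the ascii_only option as an assumption to verify on your version):

require 'json'

# ascii_only forces non-ASCII characters to be written as \uXXXX escapes
JSON.generate(['µ'], ascii_only: true)   # => "[\"\\u00b5\"]"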
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {}                               # hash to store mappings in
s.split("").each do |c|                      # loop through each "character"
  begin
    c.encode("ASCII")                        # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError  # but if that fails
    encodings[c] = $!.error_char.dump        # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
 ̆ encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
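A tiny sketch pulling those pieces together (save the file with UTF-8 source encoding, which has been the default since Ruby 2.0):

puts "\u2713"             # ✓   single BMP code point, exactly 4 hex digits
puts "\u{1f60d}"          # 😍  braces allow 1-6 hex digits
puts "\u{1F64B 1F3FE}"    # 🙋🏾  several code points in one escape
puts "\u{20AC A3 A5 24}"  # €£¥$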
You can directly use Unicode characters if you just add # encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
The gem uses the reference at https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

Getting an ASCII character code in Ruby using `?` (question mark) fails

I'm in a situation where I need the ASCII value of a character (for Project Euler question #22, if you want to get specific) and I'm running into an issue.
Being new to Ruby, I googled it and found that ? was the way to go: ?A or whatever. But when I incorporate it into my code, the result of that statement is the string "A", not a character code. Same issue with [0] and slice(0), both of which should theoretically return the ASCII code.
The only thing I can think of is that this is a Ruby version issue. I'm using 1.9.1-p0, having upgraded from 1.8.6 this afternoon. I cheated a little going from a working version of Ruby: since it was in the same directory, I figured I probably already had the files that don't come bundled with the .zip file, so I didn't download them.
So why exactly are all my ASCII codes being turned into actual characters?
Ruby before 1.9 treated characters somewhat inconsistently. ?a and "a"[0] would return an integer representing the character's ASCII value (which was usually not the behavior people were looking for), but in practical use characters would normally be represented by a one-character string. In Ruby 1.9, characters are never mysteriously turned into integers. If you want to get a character's ASCII value, you can use the ord method, like ?a.ord (which returns 97).
How about
"a"[0].ord
for 1.8/1.9 portability.
Ruby Programming/ASCII
In Ruby versions before 1.9, you can use the question-mark syntax:
?a
From 1.9 on, use ord instead:
'a'.ord
For 1.8 and 1.9
?a.class == String ? ?a.ord : ?a
or
"a".class == String ? "a".ord : "a"[0]
Found the solution: "string".ord returns the ASCII code of the string's first character.
It looks like the methods I had found are broken in the 1.9 series of Ruby.
If you read question 22 from Project Euler again, you'll find that you are not looking for the ASCII values of the characters. What the question wants for the character "A", for example, is 1, its position in the alphabet, whereas "A" has an ASCII value of 65.
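For completeness, a small sketch of that positional scoring (assuming uppercase A-Z input, as Problem 22's name list provides):

# Position in the alphabet rather than the raw ASCII value:
def letter_value(char)
  char.ord - 'A'.ord + 1        # "A" => 1, "B" => 2, ... "Z" => 26
end

"COLIN".chars.map { |c| letter_value(c) }.inject(:+)   # => 53, the example from the problem statement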
