Encode string as \uXXXX - ruby

I am trying to port some code from Python to Ruby, and I'm having difficulty with one of the functions, which encodes a UTF-8 string as JSON.
I have stripped the code down to what I believe is my problem.
I would like to make Ruby produce exactly the same output as Python.
The python code:
#!/usr/bin/env python
# encoding: utf-8
import json
import hashlib
text = "ÀÈG"
js = json.dumps( { 'data': text } )
print 'Python:'
print js
print hashlib.sha256(js).hexdigest()
The ruby code:
#!/usr/bin/env ruby
require 'json'
require 'digest'
text = "ÀÈG"
obj = {'data': text}
# js = obj.to_json # not using this, in order to get the space below
js = %Q[{"data": "#{text}"}]
puts 'Ruby:'
puts js
puts Digest::SHA256.hexdigest js
When I run both, this is the output:
$ ./test.rb && ./test.py
Ruby:
{"data": "ÀÈG"}
6cbe518180308038557d28ecbd53af66681afc59aacfbd23198397d22669170e
Python:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3
What do I need to change in the ruby code to make its output identical to the python output (at least the final hash)?
Notes:
I have tried things from this SO question (and others) without success.
The code above produces identical results when using only English characters, so I know the hashing is the same.

Surely someone will come along with a more elegant (or at least a more efficient and robust) solution, but here's one for the time being:
#!/usr/bin/env ruby
require 'json'
require 'digest'

text = 'ÀÈG'
  .encode('UTF-16')                          # convert UTF-8 characters to UTF-16
  .inspect                                   # escape UTF-16 characters and convert back to UTF-8
  .sub(/^"\\u[Ff][Ee][Ff][Ff](.*?)"$/, '\1') # remove outer quotes and BOM
  .gsub(/\\u\w{4}/, &:downcase)              # downcase hex digits in escape sequences
                                             # (downcase, not downcase!, which returns nil
                                             # when nothing changes and would delete the match)
js = { data: text }                          # wrap in containing data structure
  .to_json(:space => ' ')                    # convert to JSON with a space after each colon
  .gsub(/\\\\u(?=\w{4})/, '\\u')             # remove extra backslashes
puts 'Ruby:', js, Digest::SHA256.hexdigest(js)
Output:
$ ./test.rb
Ruby:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3
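For comparison, a shorter route to the same bytes is to let to_json build the string and then escape every non-ASCII character by hand. This is a sketch rather than the answer above's approach, and it assumes all characters are in the BMP (Python would emit surrogate pairs for code points above U+FFFF):

```ruby
require 'json'
require 'digest'

text = 'ÀÈG'
js = { data: text }.to_json(space: ' ')
# Escape each non-ASCII character as \uXXXX with lowercase hex digits,
# mimicking Python's default ensure_ascii=True behaviour (BMP-only).
js = js.gsub(/[^\x00-\x7F]/) { |c| format('\u%04x', c.ord) }
puts js                              # {"data": "\u00c0\u00c8G"}
puts Digest::SHA256.hexdigest(js)
```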


Garbage Base64 decoded string [duplicate]

This question already has answers here:
p vs puts in Ruby
Could somebody explain to me why there are two different outputs?
Code in IRB (the interactive Ruby shell):
irb(main):001:0> require 'base64'
=> true
irb(main):002:0> cookie = "YXNkZmctLTBEAiAvi95NGgcgk1W0pyUKXFEo6IuEvdxhmrfLqNVpskDv5AIgVn8wfIWf0y41cb%2Bx9I0ah%2F4BIIeRJ54nX2qGcxw567Y%3D"
=> "YXNkZmctLTBEAiAvi95NGgcgk1W0pyUKXFEo6IuEvdxhmrfLqNVpskDv5AIgVn8wfIWf0y41cb%2Bx9I0ah%2F4BIIeRJ54nX2qGcxw567Y%3D"
irb(main):003:0> decoded_cookie = Base64.urlsafe_decode64(URI.decode(cookie))
=> "asdfg--0D\x02 /\x8B\xDEM\x1A\a \x93U\xB4\xA7%\n\\Q(\xE8\x8B\x84\xBD\xDCa\x9A\xB7\xCB\xA8\xD5i\xB2#\xEF\xE4\x02 V\x7F0|\x85\x9F\xD3.5q\xBF\xB1\xF4\x8D\x1A\x87\xFE\x01 \x87\x91'\x9E'_j\x86s\x1C9\xEB\xB6"
Code from Linux terminal:
asd#asd:~# ruby script.rb
asdfg--0D /��M� �U��%
\Q(苄��a��˨�i�#�� V0|���.5q������ ��'�'_j�s9�
Script:
require 'base64'
require 'uri'
require 'ecdsa'

cookie = "YXNkZmctLTBEAiAvi95NGgcgk1W0pyUKXFEo6IuEvdxhmrfLqNVpskDv5AIgVn8wfIWf0y41cb%2Bx9I0ah%2F4BIIeRJ54nX2qGcxw567Y%3D"

def decode_cookie(cookie)
  Base64.urlsafe_decode64(URI.decode(cookie))
end

puts decode_cookie(cookie)
How can I get the same output in the terminal?
I need the output:
"asdfg--0D\x02 /\x8B\xDEM\x1A\a \x93U\xB4\xA7%\n\Q(\xE8\x8B\x84\xBD\xDCa\x9A\xB7\xCB\xA8\xD5i\xB2#\xEF\xE4\x02 V\x7F0|\x85\x9F\xD3.5q\xBF\xB1\xF4\x8D\x1A\x87\xFE\x01 \x87\x91'\x9E'_j\x86s\x1C9\xEB\xB6"
in the Linux terminal.
A string like "\x8B" is an escaped representation of a character, not the literal characters \x8B. Ruby falls back to this representation for bytes it cannot display as printable characters, and for characters that would mess with whitespace (for example, "\n" is a newline, not \ followed by n).
The reason you get a different output in IRB is that you don't print the string with puts (as you do in your script). Simply evaluating the expression makes IRB show the string's inspected representation, not its raw content.
You can display the actual content by printing it to an output:
require 'base64'
cookie = "YXNkZmctLTBEAiAvi95NGgcgk1W0pyUKXFEo6IuEvdxhmrfLqNVpskDv5AIgVn8wfIWf0y41cb%2Bx9I0ah%2F4BIIeRJ54nX2qGcxw567Y%3D"
decoded_cookie = Base64.urlsafe_decode64(URI.decode(cookie))
puts decoded_cookie
# asdfg--0D /��M �U��%
# \Q(苄��a��˨�i�#�� V0|���.5q����� ��'�'_j�s9�
#=> nil
You can find more info about the "\xnn" representation here.
If you'd like the script to display the string representation use p instead of puts, or use puts decoded_cookie.inspect.
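A minimal illustration of the difference, using a made-up one-byte example:

```ruby
s = "A\x8B"      # a string containing a raw, non-printable byte
puts s           # writes the raw bytes to stdout
p s              # prints the escaped representation: "A\x8B"
puts s.inspect   # equivalent to p s
```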

Ruby: parse yaml from ANSI to UTF-8

Problem:
I have the yaml file test.yml that can be encoded in UTF-8 or ANSI:
:excel:
  "Test":
    "eins_Ä": :eins
    "zwei_ä": :zwei
When I load the file I need it to be encoded in UTF-8, so I tried to convert all of the strings:
require 'yaml'
file = YAML::load_file('C:/Users/S61256/Desktop/test.yml')

require 'iconv'
CONV = Iconv.new("UTF-8", "ASCII")

class Test
  def convert(hash)
    hash.each { |key, value|
      convert(value) if value.is_a? Hash
      CONV.iconv(value) if value.is_a? String
      CONV.iconv(key) if key.is_a? String
    }
  end
end

t = Test.new
converted = t.convert(file)
p file
p converted
But when I try to run this example script it prints:
in 'iconv': eins_- (Iconv:IllegalSequence)
Questions:
1. Why does the error show up, and how can I solve it?
2. Is there another (more appropriate) way to get the file's content in UTF-8?
Note:
I need this code to be compatible with Ruby 1.8 as well as Ruby 2.2. For Ruby 2.2 I would replace all the Iconv stuff with String#encode, but that's another topic.
(The Iconv error shows up because the file's bytes are not valid in the source encoding you declared: plain "ASCII" has no Ä, so the conversion fails with IllegalSequence.) The easiest way to deal with wrongly encoded files is to read the file in its original encoding, convert it to UTF-8, and then pass it to the receiver (YAML in this case):
▶ YAML.load File.read('/tmp/q.yml', encoding: 'ISO-8859-1').force_encoding 'UTF-8'
#⇒ {:excel=>{"Test"=>{"eins_Ä"=>:eins, "zwei_ä"=>:zwei}}}
For Ruby 1.8 you should probably use Iconv, but the whole process (read as-is, then transcode, then YAML-load) remains the same.
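On Ruby 1.9+ the transcoding can even happen during the read, via an external:internal encoding pair. A self-contained sketch, assuming the "ANSI" file is actually Windows-1252 (string keys are used here so the example also runs under Psych's safe loading):

```ruby
require 'yaml'
require 'tempfile'

data = nil
Tempfile.create(['test', '.yml']) do |f|
  # Simulate the ANSI file: "Ä" is the single byte 0xC4 in Windows-1252.
  f.binmode
  f.write "Test:\n  eins_\u00C4: eins\n".encode('Windows-1252')
  f.flush
  # Read as Windows-1252, transcode to UTF-8 on the fly, then parse.
  text = File.read(f.path, encoding: 'Windows-1252:UTF-8')
  data = YAML.load(text)
end

key = data['Test'].keys.first
puts key           # eins_Ä
puts key.encoding  # UTF-8
```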

Using binary data (strings in utf-8) from external file

I have a problem using strings in UTF-8 format, e.g. "\u0161\u010D\u0159\u017E\u00FD".
When such a string is defined as a variable in my program it works fine. But when I read the same string from an external file I get the wrong output (not what I want/expect). I'm definitely missing some necessary encoding step...
My code:
file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
puts data
data_var = "\u306b\u3064\u3044\u3066"
puts data_var
Output:
\u306b\u3064\u3044\u3066 # what I don't want
について # what I want
I'm trying to read the file in binary form by specifying 'rb', but obviously there is some other problem...
I run my code in NetBeans 7.3.1 with the built-in JRuby 1.7.3 (I also tried Ruby 2.0.0, with no effect).
Since I'm new to the Ruby world, any ideas are welcome...
If your file contains the literal escaped string:
\u306b\u3064\u3044\u3066
Then you will need to unescape it after reading. Ruby does this for you with string literals, which is why the second case worked for you. Taken from the answer to "Is this the best way to unescape unicode escape sequences in Ruby?", you can use this:
file = "c:\\...\\vlmList_unicode.txt" # \u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  contents = io.read.gsub(/\\u([\da-fA-F]{4})/) { |m|
    [$1].pack("H*").unpack("n*").pack("U*")
  }
  contents.split(/\t/)
}
Alternatively, to make it more readable, extract the substitution into a method and add it to the String class:
class String
  def unescape_unicode
    self.gsub(/\\u([\da-fA-F]{4})/) { |m|
      [$1].pack("H*").unpack("n*").pack("U*")
    }
  end
end
Then you can call:
file = "c:\\...\\vlmList_unicode.txt" # \u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  io.read.unescape_unicode.split(/\t/)
}
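As a quick sanity check of the helper on a literal (note that the pack("H*")/unpack("n*")/pack("U*") round-trip assumes BMP code points; astral characters arrive as surrogate pairs and would need extra handling):

```ruby
class String
  def unescape_unicode
    # Replace each \uXXXX with the character for that code point:
    # 4 hex digits -> 2 bytes -> 16-bit integer -> UTF-8 character.
    gsub(/\\u([\da-fA-F]{4})/) { [$1].pack("H*").unpack("n*").pack("U*") }
  end
end

escaped = '\u306b\u3064\u3044\u3066' # literal backslashes, as read from a file
puts escaped.unescape_unicode        # について
```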
Just as an FYI:
data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
can be written more simply as one of these (note that the mode must be passed as a keyword option, not a bare second argument, since File.read's second positional parameter is a length):
data = File.read(file, mode: 'rb').split(/\t/)
data = File.readlines(file, "\t", mode: 'rb')
(Remember that File inherits from IO, which is where these methods are defined, so look in IO for documentation on them.)
readlines takes a "separator" parameter, which in the example above is "\t". Ruby uses it in place of the usual "\n" on *nix or Mac OS, or "\r\n" on Windows, so records are retrieved using the tab delimiter.
This makes me wonder a bit why you'd want to do that, though. I've never seen tabs as record delimiters, only as column/field delimiters in "TSV" (Tab-Separated-Value) files. That leads me to think you should probably be using Ruby's CSV class with "\t" as the column separator, but without samples of the actual file you're reading I can't say for sure.
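A sketch of the CSV route, with made-up field values (normally you'd use CSV.read(file, col_sep: "\t") on the actual file):

```ruby
require 'csv'

# Parse one tab-separated line; CSV handles quoting and embedded
# separators that a bare String#split would get wrong.
line = "alpha\tbeta\tgamma"
row = CSV.parse_line(line, col_sep: "\t")
p row # ["alpha", "beta", "gamma"]
```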

ruby 1.9 character conversion errors while testing regex

I know there are tons of docs and debates out there, but still:
This is my best shot in my Rails attempt at testing scraped data from various websites. The strange fact is that if I manually copy-paste the source of the URL, everything goes right.
What can I do?
# encoding: utf-8
require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://www.website.com/url/test'
sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s

# 1) String manipulation test
p doc.search('h1')[0].text         # "Nove36  "
p doc.search('h1')[0].text.strip!  # nil <- ERROR

# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"
p /#{regex}/i =~ txtdoc # integer expected
I realize that my OS (Ubuntu) plus my text editor are probably doing some good encoding conversion over some broken encoding: that's fine, BUT how can I fix this problem in my app while it runs live?
@cur_encoding = doc.encoding # ISO-8859-15
ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.
However, Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby, looking at the source, the problem would seem to be that it uses the property encoding from the string-or-IO instead of charset, which is what open-uri writes.
You can pass in an override encoding of your own, so I guess try:
sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
The problems you're having are caused by non breaking space characters (Unicode U+00A0) in the page.
In your first problem, the string:
"Nove36 "
actually ends with U+00A0, and String#strip! doesn't consider this character to be whitespace to be removed:
1.9.3-p125 :001 > s = "Foo \u00a0"
=> "Foo  "
1.9.3-p125 :002 > s.strip
=> "Foo  " #unchanged
In your second problem, the space between the price and the euro sign is again a non breaking space, so the regex simply doesn't match as it is looking for a normal space:
# s as before
1.9.3-p125 :003 > s =~ /Foo  / # 2 spaces, no match
=> nil
1.9.3-p125 :004 > s =~ /Foo / # 1 space, match
=> 0
1.9.3-p125 :005 > s =~ /Foo \u00a0/ # space and non-breaking space, match
=> 0
When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you copy only normal space characters, which is why it works that way.
The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:
sio = open(url)
@cur_encoding = sio.charset
txt = sio.read            # read the whole file
txt.gsub! "\u00a0", " "   # global replace
doc = Nokogiri::HTML(txt, nil, @cur_encoding) # use this new string instead...
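Alternatively, if you only need to trim the string, note that String#strip only knows ASCII whitespace; a POSIX character class in a regex also matches U+00A0. A small sketch:

```ruby
s = "Nove36 \u00a0"                    # trailing space followed by a non-breaking space
puts s.strip.length                    # 8: strip removes nothing, since the last
                                       # character (nbsp) is not ASCII whitespace
trimmed = s.gsub(/[[:space:]]+\z/, '') # [[:space:]] matches U+00A0 too
puts trimmed                           # "Nove36"
```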

hash strings get improperly encoded

I have a simple constant hash with string keys defined:
MY_CONSTANT_HASH = {
  'key1' => 'value1'
}
Now, I've noticed that encoding.name on the key returns "US-ASCII", even though Encoding.default_internal is set to UTF-8 beforehand. Why is the key not encoded as UTF-8? I can't call force_encoding later, because the object is frozen by that point, so I get this error:
can't modify frozen String
P.S.: I'm using ruby 1.9.3p0 (2011-10-30 revision 33570).
The default internal and external encodings are aimed at IO operations:
CSV
File data read from disk
File names from Dir
etc...
The easiest thing for you to do is to add a # encoding=utf-8 comment to tell Ruby that the source file is UTF-8 encoded. For example, if you run this:
# encoding=utf-8
H = { 'this' => 'that' }
puts H.keys.first.encoding
as a stand-alone Ruby script you'll get UTF-8, but if you run this:
H = { 'this' => 'that' }
puts H.keys.first.encoding
you'll probably get US-ASCII (under Ruby 1.9; since Ruby 2.0 the default source encoding is UTF-8, so the magic comment is no longer necessary).
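A quick check of that on a modern Ruby, where no magic comment is needed:

```ruby
# String literals inherit the source encoding, which defaults to UTF-8
# as of Ruby 2.0, even for pure-ASCII content.
H = { 'this' => 'that' }
puts H.keys.first.encoding # UTF-8
```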