Ruby - internationalized domain names

I need to support internationalized domain names in an app I am writing. More specifically, I need to ACE encode domain names before I pass them on to an external API.
The best way to do this seems to be by using libidn. However, I have problems installing it on my development machine (Windows 7, ruby 1.8.6), as it complains about not finding the GNU IDN library (which I have installed, and also provided the full path to).
So basically I am considering two things:
Search the web for a prebuilt win32 libidn gem (fruitless so far)
Find another (hopefully pure) Ruby library that can do the same thing (apparently none exists, since I am asking this question here)
So, have any of you got libidn to work under Windows? Or have you used some other library or code snippet that can encode domain names?

Thanks to this snippet, I finally found a solution that did not require libidn. It is built upon punycode4r together with either the unicode gem (a prebuilt binary can be found here) or with ActiveSupport. I will use ActiveSupport since I use Rails anyway, but for reference I include both methods.
With the unicode gem:
require 'unicode'
require 'punycode' # This is not a gem, but a standalone file.

def idn_encode(domain)
  parts = domain.split(".").map do |label|
    encoded = Punycode.encode(Unicode::normalize_KC(Unicode::downcase(label)))
    if encoded =~ /-$/ # Pure ASCII
      encoded.chop!    # Remove trailing '-'
    else               # Contains non-ASCII characters
      "xn--" + encoded
    end
  end
  parts.join(".")
end
With ActiveSupport:
require "punycode"
require "active_support"
$KCODE = "UTF-8" #Have to set this to enable mb_chars
def idn_encode(domain)
parts = domain.split(".").map do |label|
encoded = Punycode.encode(label.mb_chars.downcase.normalize(:kc))
if encoded =~ /-$/ #Pure ASCII
encoded.chop! #Remove trailing '-'
else #Contains non-ASCII characters
"xn--" + encoded
end
end
parts.join(".")
end
The ActiveSupport solution was found thanks to this StackOverflow question.
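As a quick sanity check (assuming punycode.rb plus one of the two setups above is loaded), the well-known bücher example maps to its expected ACE form:
idn_encode("Bücher.example")
# => "xn--bcher-kva.example"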

Related

Simple alternative to URI.escape

While using URI.parse on a URL, I was confronted with an error message:
URI::InvalidURIError: URI must be ascii only
I found a StackOverflow question that recommended using URI.escape, which works. Using the URL in that question as an example:
URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> URI::InvalidURIError: URI must be ascii only "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/\u041E\u0443\u044D\u043D-\u041C\u044D\u0442\u044C\u044E\u0441.jpg"
URI.encode('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
However, URI.escape is obsolete, as Rubocop warns:
URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
But while URI.escape gives us a usable result, the alternatives don’t:
CGI.escape('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
=> "http%3A%2F%2Fdxczjjuegupb.cloudfront.net%2Fwp-content%2Fuploads%2F2017%2F10%2F%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"
This is a bother because in my case I’m constructing a URL from data I get via Nokogiri:
my_url = page.at('.someclass').at('img').attr('src')
Since I only need to escape the last part of the resulting URL, but CGI.escape and similar transform the whole string (including necessary characters such as : and /), getting the escaped result becomes a multi-line ordeal: splitting off the path and juggling several variables to achieve what URI.escape did in a single call.
Is there a simple alternative I’m not seeing? It needs to be done without external gems.
I tend to use Addressable for parsing URLs since the standard URI has flaws:
require 'addressable'
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg')
#<Addressable::URI:0x3fc37ecc1c40 URI:http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg>
Addressable::URI.parse('http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg').path
# "/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg"
It isn't part of the Ruby core or the standard library, but it should be, and it always ends up in my Gemfiles.
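If external gems are truly off the table, one standard-library sketch (hedged: it reproduces exactly the legacy semantics the Rubocop warning is about) is to call the parser-level escape that URI.escape used to delegate to:
require 'uri'

url = 'http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/Оуэн-Мэтьюс.jpg'
URI::RFC2396_Parser.new.escape(url)
# => "http://dxczjjuegupb.cloudfront.net/wp-content/uploads/2017/10/%D0%9E%D1%83%D1%8D%D0%BD-%D0%9C%D1%8D%D1%82%D1%8C%D1%8E%D1%81.jpg"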

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code raises ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns UTF-8)
I've found some solutions to encoding problems, but when I'm changing the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding comes down to guessing. The age of the site (HTML 3.2) or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX, and the file is named TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal, either by knowing it or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too; there is the libguess library, and the web browser can do it as well. However, you would need to download the file for that, since the server may tell the browser it's UTF-8 even though it isn't (as in this case). Any decent text editor will also try to detect the file's encoding: ST3, for example, thinks it's Windows-1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. It is apparently obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is a regular expression with String#scan. It works with this site, but it might not with others. For the best reliability you have to use an HTML parser (there are gems for that). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
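If you do reach for an HTML parser, here is a minimal sketch with Nokogiri (an external gem, so an assumption on my part) combining the transcoding fix with proper link extraction:
require 'open-uri'
require 'nokogiri'

html = open('http://www.wmprof.com/media/niti/download').read
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1) # transcode as above
doc = Nokogiri::HTML(html)
doc.css('a[href]').map { |a| a['href'] }.grep(/PL/)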

Special character uppercase

I have strings with a bunch of special characters. This works:
myString.upcase.tr('æ-ý','Æ-Ý')
However, it does not really work cross-platform. My Ruby installation on Windows won't accept it (on my Mac and Linux machines it works like a charm). Any pointers, workarounds, or solutions would be really appreciated!
Try the mb_chars method if you are using Rails >= 3. For example:
'æ-ý'.mb_chars.upcase
=> "Æ-Ý"
If you're not using Rails, try the unicode gem:
Unicode::upcase('æ-ý')
Or you can override the String class methods as well:
require "unicode"

class String
  def downcase
    Unicode::downcase(self)
  end

  def downcase!
    self.replace downcase
  end

  def upcase
    Unicode::upcase(self)
  end

  def upcase!
    self.replace upcase
  end

  def capitalize
    Unicode::capitalize(self)
  end

  def capitalize!
    self.replace capitalize
  end
end
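With the patch in place, the expression from the question works without tr (a quick check, assuming the unicode gem is installed):
"æ-ý".upcase
# => "Æ-Ý"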
Unfortunately, it is impossible to correctly upcase/downcase a string without knowing the language and, in some cases, even the contents of the string.
For example, in English the uppercase variant of i is I and the lowercase variant of I is i, but in Turkish the uppercase variant of i is İ and the lowercase variant of I is ı. In German, the uppercase variant of ß is SS, but so is the uppercase variant of ss, so to downcase, you need to understand the text, because e.g. MASSE could be downcased to either masse (mass) or maße (measurements).
Ruby (before 2.4) takes the easy way out and simply only uppercases/downcases within the ASCII alphabet; since Ruby 2.4, String#upcase and String#downcase perform full Unicode case mapping.
However, that only explains why your workaround is needed, not why it sometimes works and sometimes doesn't. Provided that you use the same Ruby version, the same Ruby implementation, and the same version of that implementation on all platforms, it should work. YARV doesn't use the underlying platform's string manipulation routines much (the same is true for most Ruby implementations; even JRuby doesn't use Java's powerful string libraries but rolls its own for maximum compatibility), and it doesn't use any third-party libraries (such as ICU) except Onigmo, so it's unlikely that platform differences are to blame. Different versions of Ruby use different versions of the Unicode Character Database, though (I believe it was updated at least once somewhere between 1.9 and 2.2), so if you have a version mismatch, that might explain it.
Or it might be a genuine bug in YARV on Windows. Maybe try JRuby? It tends to be more consistent between platforms; in fact, on Windows it is more compatible with Ruby than Ruby (i.e. YARV) itself!
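For readers on current Rubies: since 2.4 the built-in methods do full Unicode case mapping, including a :turkic option, so the workaround becomes unnecessary. A quick check:
"æ-ý".upcase        # => "Æ-Ý" (no tr needed on 2.4+)
"ß".upcase          # => "SS"
"i".upcase(:turkic) # => "İ"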

How can I make a #gsub call work with encodings in both Ruby 1.8 and Ruby 1.9?

I have the following code which requires Ruby 1.9, and I need to achieve the same functionality in Ruby 1.8. How can I accomplish this?
# encoding: UTF-8
... [code omitted]
body.force_encoding("UTF-8")
body = body.gsub(/^(?=>)/, ">").gsub(/^(?!>)/, "> ")
body is a string obtained from an external source.
I think what I need is called a "shim" but I'm not sure.
James Gray wrote a series of articles about dealing with encodings in Ruby. They're very good reading.
For 1.8.7, the jcode library can help:
$KCODE = "U"
require 'jcode'
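Putting both paths together, a minimal version-guarded sketch (my own wiring, not from the original answer; it assumes body holds UTF-8 bytes):
if "".respond_to?(:force_encoding) # Ruby 1.9+
  body.force_encoding("UTF-8")
else                               # Ruby 1.8
  $KCODE = "U" # make regexps treat strings as UTF-8
  require 'jcode'
end
body = body.gsub(/^(?=>)/, ">").gsub(/^(?!>)/, "> ")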

Detect encoding

I'm getting some string data from the web, and I suspect that it's not always what it says it is. I don't know where the problem is, and I just don't care any more. From day one on this project I've been fighting Ruby string encoding. I really want some way to say: "Here's a string. What is it?", and then use that data to get it to UTF-8 so that it doesn't explode gsub() 2,000 lines down in the depths of my app. I've checked out rchardet, but even though it supposedly works for 1.9 now, it just blows up given any input with multiple bytes... which is not helpful.
You can't really detect the encoding. You can only assume it.
For most Western-language applications, the following construct will work. The traditional encoding is usually ISO-8859-1; the new and preferred encoding is UTF-8. So why not simply try to encode the string as UTF-8 and fall back to the old encoding on failure?
def detect_encoding(str)
  str.encode("UTF-8")
  "UTF-8"
rescue
  "ISO-8859-1"
end
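One caveat with this sketch: if str is already tagged as UTF-8, str.encode("UTF-8") is a no-op and won't raise on invalid bytes. A stricter check (an alternative sketch) re-tags the bytes and validates them:
def utf8?(str)
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end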
It is impossible to tell from a string what encoding it is in. You always need some additional metadata that tells you what the string's encoding is.
If you get the string from the web, that metadata is in the HTTP headers. If the HTTP headers are wrong, there is absolutely nothing that you or Ruby or anyone else can do. You need to file a bug with the webmaster of the site where you got the string from and wait until they fix it. If you have a Service Level Agreement with the website, file a bug, wait a week, then sue them.
Old question, but chardet works on 1.9: http://rubygems.org/gems/chardet
Why not try using https://github.com/brianmario/charlock_holmes to detect the exact encoding, and then use it to convert to UTF-8 as well?
require 'charlock_holmes'

class EncodeParser
  def initialize(text)
    @text = text
  end

  def detected_encoding
    CharlockHolmes::EncodingDetector.detect(@text)[:encoding]
  end

  def convert_to_utf8
    CharlockHolmes::Converter.convert(@text, detected_encoding, "UTF-8")
  end
end
Then just use EncodeParser.new(text).detected_encoding or EncodeParser.new(text).convert_to_utf8.
We had good experience with ensure_encoding. It did the job for us, converting resource files of unknown encoding to UTF-8.
The README will give you some hints which options would be a good fit for your situation.
I have never tried chardet since ensure_encoding did the job just fine for us.
I covered here how we use ensure_encoding.
Try setting these in your environment.
export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
Try ruby -EBINARY or ruby -EASCII-8BIT on the command line.
Try adding -Ku or -Kn to your ruby command line.
Could you paste the error message?
Also try this: http://github.com/candlerb/string19/blob/master/string19.rb
Might try reading this: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
I know it's an old question, but in modern versions of Ruby it's as simple as str.encoding. You get a return value something like this: #<Encoding:UTF-8>
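Alongside str.encoding, a couple of related built-ins help when the bytes don't match that label (String#scrub needs Ruby >= 2.1; the byte values here are just an illustration):
s = "caf\xC3\xA9 \xFF".force_encoding("UTF-8")
s.encoding        # => #<Encoding:UTF-8>
s.valid_encoding? # => false ("\xFF" is not valid UTF-8)
s.scrub("?")      # => "café ?" -- replaces the invalid byte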
