Create your own encoding - Ruby

How can I create my own encoding in Ruby (1.9)? The encoding would be for converting strings while reading from/writing to a file, i.e. generally for manipulating data in strings with a nonstandard encoding (http://en.wikipedia.org/wiki/Mazovia_encoding).

To your updated question: At the moment all you can do is write some custom code which handles file reading/writing at the byte level and does the needed conversions.
If you are asking how to use different character encodings in Ruby 1.9 in general, I'd point you to
Working with Encodings in Ruby 1.9 and
Understanding M17n

I couldn't find any references in the Ruby docs about using proprietary encodings, and the Encoding class doesn't have any initializers (though Encoding.find() can dynamically load some of the encodings Iconv supports). Unfortunately, AFAIK Mazovia is unsupported even in iconv, so you're stuck with implementing your own class...
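For illustration, here is a minimal sketch of such a byte-level converter (Ruby does not let you register a new Encoding from Ruby code, so a plain conversion method is as close as you can get). The MAZOVIA_MAP table shows only a handful of entries and should be completed and verified against the Wikipedia table:

MAZOVIA_MAP = {
  0x86 => "ą", 0x8D => "ć", 0x90 => "Ę", 0x91 => "ę",
  0x92 => "ł", 0x9C => "Ł", 0xA4 => "ń", 0xA7 => "ż"
  # ... fill in the remaining non-ASCII code points from the Wikipedia table
}

# Convert a Mazovia-encoded byte string to a UTF-8 string.
def mazovia_to_utf8(bytes)
  bytes.each_byte.map { |b|
    if b < 0x80
      b.chr                   # the ASCII range is identical in both encodings
    else
      MAZOVIA_MAP[b] || "?"   # replace unmapped bytes
    end
  }.join.force_encoding("UTF-8")
end

# Usage: read the file in binary mode, then convert.
raw  = File.open("data.txt", "rb") { |f| f.read }
text = mazovia_to_utf8(raw)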

Related

Ruby internal and external encoding

I have gone through various material and am unable to find the difference between the default internal encoding and the external encoding in Ruby. Can anyone help me in this regard?
When reading strings from external sources (such as files, network sockets, ...), Ruby may assume that this data is in a specific string encoding. This is the external encoding. For example, if you are reading text files and know that they are encoded in UTF-8, you may set the external encoding to UTF-8 to hint to Ruby that the data is supposed to be UTF-8 encoded.
Now, when reading the data, Ruby can also convert it to a different encoding which might be more useful for your program. For example, if you are assembling data from different sources such as files and an HTTP request, it's often useful to make sure that your strings all have the same encoding regardless of their source.
For this, you can set the internal encoding. If you set the correct external encoding for your data source and e.g. your internal encoding to UTF-8, you can be fairly sure that all your strings (regardless of where they come from) are correctly UTF-8 encoded Strings and can be manipulated, merged and changed at will without worrying about encoding issues deep in your business logic.
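For example, a minimal sketch (assuming a Latin-1 encoded file named legacy.txt):

# External encoding ISO-8859-1, internal encoding UTF-8: the file is
# read as Latin-1 and transcoded to UTF-8 on the fly.
File.open("legacy.txt", "r:ISO-8859-1:UTF-8") do |f|
  line = f.gets
  line.encoding   # => #<Encoding:UTF-8>
end

# The same defaults can also be set process-wide:
Encoding.default_external = Encoding::ISO_8859_1
Encoding.default_internal = Encoding::UTF_8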

UTF-8 encoding does not work with gets method in Ruby

I need to read the string "öçğü" with the gets method but I can't. I can read it from a file correctly, but gets doesn't accept these characters. I use # encoding: UTF-8 and I am running this code in the Windows cmd shell.
When I try to type ç, I get the following error:
`downcase': input string invalid (ArgumentError)
input = gets.chomp.downcase.split
Setting the file encoding using the "magic" comment on top of the file only specifies the encoding of your source code in the file (that is: the encoding of string literals created directly from the parser in your code).
Ruby knows two other default encodings:
the external encoding - this specifies the default encoding of data read from external sources (such as the console, opened files, network sockets, ...)
the internal encoding - data read from external sources will be transformed into the default internal encoding after reading to ensure you can use compatible encodings everywhere (this is not used by default, the external encoding is thus preserved).
In your case, you have not set the external encoding. On Windows and with Ruby before version 3.0, Ruby assumes the local console encoding of your Windows installation here (such as cp850 in Western Europe).
When Ruby reads your String, it assumes it to be in cp850 encoding (or whatever your default encoding is) while you likely provide utf-8 encoded data. As soon as you start to operate on this incorrectly encoded data, you will get errors similar to the one you have seen.
Thus, to be able to correctly read data you need to either provide it with an encoding matching your shell encoding, or you need to tell Ruby which encoding it should assume there.
If you are providing UTF-8 encoded data, you can set the expected encoding using the -E switch when invoking ruby, e.g.:
ruby -E utf-8 your_program.rb
You can also set this in an environment variable of your Windows shell using
set RUBYOPT=-Eutf-8
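Alternatively, you can set the default external encoding from inside your program before reading any input; a minimal sketch:

# encoding: UTF-8
# Declare that data read from external sources is UTF-8 encoded:
Encoding.default_external = Encoding::UTF_8
# or, for the console only:
$stdin.set_encoding(Encoding::UTF_8)
input = gets.chomp.downcase.split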
In Ruby 3.0, the default external encoding on Windows was changed so that it now defaults to UTF-8 on Windows, similar to other platforms. See https://bugs.ruby-lang.org/issues/16604 for details.

ICU requires intermediate UTF16 conversion step

Why is libicu using UTF-16 as its "common denominator" format instead of UTF-8? I need to convert from UTF-8 to UTF-32 and back, and libicu seems to make it unnecessarily difficult by requiring this two-step utf8->utf16->utf32 conversion, although its own functions like u_tolower also require a UChar32 input.
It doesn't seem memory is the determining factor here, otherwise they could just use utf8 for their "base" format as well.
UTF-16 is the default encoding form of the Unicode Standard, so I suspect that answers the "why" there. See this ICU page for some additional information.

Using Ruby's fastercsv with character encodings

Using Ruby 1.8.7, I want to accept CSVs into my system. Even though this is an admin application, it seems I can get several different types of CSVs. On my Mac, if I export from Excel using the "Windows CSV" option, then FasterCSV can read it by default. On Windows I seem to be getting UTF-16 encoded CSVs (which I haven't figured out how to parse yet).
It seems like a pretty common thing to allow users to upload a CSV that could be in UTF-8, UTF-16, ASCII, etc., and to detect and parse it. Has anyone figured this out?
I started looking at UniversalDetector to help me detect the encoding and then use Iconv to convert, but this seems tricky and I was hoping someone had figured it out :)
According to FasterCSV's docs, the initialize method takes an :encoding option:
The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: n or N for none, e or E for EUC, s or S for SJIS, and u or U for UTF-8 (see Regexp.new()).
Because its list is limited, you might want to look into using iconv to pre-process the contents before passing them to CSV. You can use Ruby's interface to iconv (Iconv) or the command-line version of it. Iconv is very powerful and flexible and capable of converting UTF-16 among other things.
Actually detecting the encoding of the document is more problematic, but the command-line version can help you there; if I remember right, it can help identify the encoding. It can also convert between encodings, or, if you want, it can be told to convert to ASCII, either approximating unmappable characters with the closest match or ignoring them entirely.
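A rough sketch of that pre-processing step (assuming the upload turns out to be UTF-16LE; the file name is hypothetical):

require 'iconv'
require 'fastercsv'

raw = File.read('upload.csv')
# Transcode to UTF-8; //TRANSLIT approximates unmappable characters
# and //IGNORE drops anything that still cannot be converted.
utf8 = Iconv.conv('UTF-8//TRANSLIT//IGNORE', 'UTF-16LE', raw)
FasterCSV.parse(utf8) do |row|
  # process each row here
end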
Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues of dealing with character-sets and multibyte characters you should read James Gray's blogs.

JSON encoding issue with Ruby 1.9 and HTTParty

I've created a WebAPI that returns JSON.
The initial data is as follow (UTF-8 encoded):
#text="Rosenborg har ikke h\xC3\xB8rt hva Steffen"
Then, after calling .to_json on my object, here is what is sent by the API (I think it is ISO-8859-1 encoded):
"text":"Rosenborg har ikke h\ufffd\ufffdrt hva Steffen"
I'm using HTTParty on the client side, and this is what I finally get:
"text":"Rosenborg har ikke h��rt hva"
Both WebAPI and client app are using Ruby 1.9.2 and Rails 3.
I'm a bit lost with this encoding issue... I tried to add the UTF-8 encoding header to my Ruby files but it didn't change anything.
I guess that I'm missing an encoding / decoding step somewhere... does anyone have an idea?
Thank you very much!!!
Vincent
In Ruby 1.9, encoding is explicit now. However, Rails may or may not be configured to send the responses in the encoding you expect. You'll have to set the global configuration setting:
Encoding.default_external = "utf-8"
I believe the encoding that Ruby specifies by default for serialization is the platform default. On Windows in America that would be Windows-1252. Other countries would have an alternate encoding.
Edit: Also see this URL if the JSON is executed against MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response
Edit 2: Rails core and its suite of libraries (ActiveRecord, et al.) will respect the Encoding.default_external configuration setting, which encodes all the values it sends. Unfortunately, because encoding is a relatively new concept in Ruby, not every third-party library has been adjusted for proper encoding. The ones that have may require additional configuration settings for those libraries. This includes MySQL and the RSolr library you were using.
In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.
The String#force_encoding method is designed to treat the byte sequence with that new encoding, but does not change any of the underlying data, so it is possible to end up with invalid byte sequences. There is another method called .encode() that will transform the bytes from one encoding to another and guarantees valid byte sequences. For more information read this:
http://blog.grayproductions.net/articles/ruby_19s_string
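A short sketch of the difference:

# force_encoding only relabels the bytes; the data is untouched.
s = "Rosenborg har ikke h\xC3\xB8rt".force_encoding("ASCII-8BIT")
s = s.force_encoding("UTF-8")    # same bytes, now interpreted as UTF-8
s.valid_encoding?                # => true, the bytes really are UTF-8

# encode transcodes to new, guaranteed-valid bytes in the target encoding.
latin1 = s.encode("ISO-8859-1")  # "ø" becomes the single byte 0xF8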
Ok, I finally found out what the problem is...
I'm using RSolr to get my data from Solr, and the default encoding for all results is unfortunately 'US-ASCII', as mentioned here (and verified myself):
http://groups.google.com/group/rsolr/browse_thread/thread/2d4890fa7737e7ef#
So you need to force the encoding as follows:
my_string.force_encoding(Encoding::UTF_8)
Maybe there is a nice encoding option to pass to RSolr!
