Ruby internal and external encoding

I have gone through various materials and am unable to find the difference between the default internal encoding and the default external encoding in Ruby. Can anyone help me in this regard?

When reading strings from external sources (such as files, network sockets, ...), Ruby may assume that this data is encoded in a specific string encoding. This is the external encoding. For example, if you are reading text files and know that they are encoded in UTF-8, you may set the external encoding to UTF-8 to hint to Ruby that the data is supposed to be UTF-8 encoded.
Now, when reading the data, Ruby can also convert it to a different encoding which might be more useful for use within your program. For example, if you are assembling data from different sources such as files you read and an HTTP request, it's often useful if you can make sure that your strings all have the same encoding regardless of their source.
For this, you can set the internal encoding. If you set the correct external encoding for your data source and e.g. your internal encoding to UTF-8, you can be fairly sure that all your strings (regardless of where they come from) are correctly UTF-8 encoded Strings and can be manipulated, merged and changed at will without worrying about encoding issues deep in your business logic.
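As a minimal sketch (assuming a hypothetical file legacy.txt that is Latin-1 on disk), both encodings can be given directly when opening a file; the external encoding labels the incoming bytes and the internal encoding is what they are transcoded to:
File.open("legacy.txt", "r:ISO-8859-1:UTF-8") do |f|   # external:internal
  line = f.gets
  puts line.encoding   # => UTF-8, transcoded from ISO-8859-1 on read
end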

Related

UTF-8 encoding not working with gets method in Ruby

I need to read the string "öçğü" with the gets method but I can't. I can read it from a file correctly, but gets doesn't accept these characters. I use # encoding: UTF-8 and I am running this code in the Windows cmd shell.
When I try to type ç, I get the following error:
`downcase': input string invalid (ArgumentError)
input = gets.chomp.downcase.split
Setting the file encoding using the "magic" comment at the top of the file only specifies the encoding of your source code in that file (that is: the encoding of string literals created directly by the parser from your code).
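For illustration (a trivial sketch, not from the original answer):
# encoding: UTF-8
# The magic comment above only affects literals in this source file;
# it says nothing about data read at runtime via gets or File#read.
puts "öçğü".encoding   # => UTF-8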
Ruby knows two other default encodings:
- the external encoding - this specifies the default encoding of data read from external sources (such as the console, opened files, network sockets, ...)
- the internal encoding - data read from external sources will be transformed into the default internal encoding after reading, to ensure you can use compatible encodings everywhere (this is not used by default; the external encoding is thus preserved)
In your case, you have not set the external encoding. On Windows and with Ruby before version 3.0, Ruby assumes the local console encoding of your Windows installation here (such as cp850 in Western Europe).
When Ruby reads your String, it assumes it to be in cp850 encoding (or whatever your default encoding is) while you likely provide utf-8 encoded data. As soon as you start to operate on this incorrectly encoded data, you will get errors similar to the one you have seen here.
Thus, to be able to read data correctly, you either need to provide it in an encoding matching your shell encoding, or you need to tell Ruby which encoding it should assume.
If you are providing UTF-8 encoded data, you can set the expected encoding using the -E switch when invoking ruby, e.g.:
ruby -E utf-8 your_program.rb
You can also set this in an environment variable of your Windows shell using
set RUBYOPT=-Eutf-8
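Equivalently (a sketch; the original answer only shows the -E switch), you can set the same default from inside your program before reading any input:
# Assume all external input is UTF-8 (same effect as ruby -E utf-8):
Encoding.default_external = Encoding::UTF_8
input = gets.chomp.downcase.split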
In Ruby 3.0, the default external encoding on Windows was changed so that it now defaults to UTF-8 on Windows, similar to other platforms. See https://bugs.ruby-lang.org/issues/16604 for details.

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text files of random encoding format: ucs-2le, ansi, utf-8, ucs-2be, etc. I have to convert these files to utf-8.
For the conversion I am using the following command:
iconv [options] -f from-encoding -t utf-8 < inputfile > outputfile
But if an incorrect from-encoding is provided, an incorrect file is generated.
I want a way to find the input file encoding type.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it will guess the encoding correctly. Otherwise, try several encodings with iconv until you "feel" that the result is acceptable (for example, if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc. until you recognize a good Russian poem).
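The same trial-and-error idea can be sketched in Ruby (the file name and candidate list are illustrative). Note that permissive single-byte encodings such as ISO-8859-1 accept any byte sequence, so order candidates from strict to lax:
candidates = %w[UTF-8 UTF-16LE KOI8-R ISO-8859-1]
bytes = File.binread("unknown.txt")
guess = candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
puts guess || "no candidate matched"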
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding they used (and there is no way to detect that encoding reliably in all cases: there are byte sequences which are valid in, and interpreted differently by, several encodings).
(Notice that the HTTP protocol can mention and make the encoding explicit.)
In 2017, it is better to use UTF-8 everywhere (and you should follow that http://utf8everywhere.org/ link), so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8 already, since today they all should be).
(so encoding is more a social issue than a technical one)
I get text file of random encoding format
Notice that "random encoding" don't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean "unknown encoding", not "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of text file, beyond a file without zero bytes, or a file with few control characters? LaTeX, C source, Markdown, SQL, UUencoded, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know this if you have some expectations about the result (e.g. that it contains Russian poetry, not junk characters; but perhaps those junk characters are bytecode for some secret interpreter, or some music represented in a weird fashion, or encrypted data, etc.). Raw files are just sequences of bytes; you need some extra knowledge to use them (even if you know that they use UTF-8).
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It works fine; there is no need to give the source encoding.
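For completeness, the same known-source conversion can be done in plain Ruby (a sketch; the file names and the UTF-16LE source encoding are assumptions):
text = File.binread("input.txt").force_encoding("UTF-16LE")  # source must be known
File.write("output.txt", text.encode("UTF-8"))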

Tcl UTF-8 characters not displaying properly in UI

Objective: to have multi-language characters in the user ID in Enovia V6.
I am using utf-8 encoding in a tcl script and it seems it saves multi-language characters properly in the database (after some conversion). But in the UI, I literally see the saved information from the database.
While doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something in the tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: KÃ¡tai-PÃ¡l
In the UI I see the name as: KÃ¡tai-PÃ¡l
In Tcl I use below syntax
set encoded [encoding convertto utf-8 "Kátai-Pál"]
Now the user name becomes: KÃ¡tai-PÃ¡l
In the UI I see the name as "KÃ¡tai-PÃ¡l"
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)

JSON encoding issue with Ruby 1.9 and HTTParty

I've created a WebAPI that returns JSON.
The initial data is as follows (UTF-8 encoded):
#text="Rosenborg har ikke h\xC3\xB8rt hva Steffen"
Then, with a .to_json on my object, here is what is sent by the API (I think it is ISO-8859-1 encoded):
"text":"Rosenborg har ikke h\ufffd\ufffdrt hva Steffen"
I'm using HTTParty on the client side, and this is what I finally get:
"text":"Rosenborg har ikke h��rt hva"
Both WebAPI and client app are using Ruby 1.9.2 and Rails 3.
I'm a bit lost with this encoding issue... I tried to add the utf8 encoding header to my ruby files but it didn't change anything.
I guess that I'm missing an encoding/decoding step somewhere... does anyone have an idea?
Thank you very much !!!
Vincent
In Ruby 1.9, encoding is explicit now. However, Rails may or may not be configured to send the responses in the encoding you expect. You'll have to set the global configuration setting:
Encoding.default_external = "utf-8"
I believe the encoding that Ruby specifies by default for serialization is the platform default. In America on Windows that would be Windows-1252. Other countries would have an alternate encoding.
Edit: Also see this url if the json is executed against MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response
Edit 2: Rails core and its suite of libraries (ActiveRecord, et al.) will respect the Encoding.default_external configuration setting, which encodes all the values it sends. Unfortunately, because encoding is a relatively new concept in Ruby, not every 3rd-party library has been adjusted for proper encoding. The ones that have may require additional configuration settings for those libraries. This includes MySQL, and the RSolr library you were using.
In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.
The String#force_encoding method is designed to treat the byte sequence as having that new encoding, but it does not change any of the underlying data. So it is possible to have invalid byte sequences. There is another method called String#encode that will transform the bytes from one encoding to another and guarantees valid byte sequences. For more information read this:
http://blog.grayproductions.net/articles/ruby_19s_string
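A short sketch of that difference (the byte string is illustrative, not from the original post):
bytes = "h\xC3\xB8rt".b            # raw bytes, tagged ASCII-8BIT
s = bytes.force_encoding("UTF-8")  # same bytes, merely relabeled as UTF-8
puts s                             # => hørt
t = s.encode("ISO-8859-1")         # transcodes: the underlying bytes change
puts s.bytesize                    # => 5 (ø is two bytes in UTF-8)
puts t.bytesize                    # => 4 (ø is a single byte in Latin-1)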
Ok, I finally found out what the problem is...
I'm using RSolr to get my data from Solr, and unfortunately the default encoding for all results is 'US-ASCII', as mentioned here (and checked by myself):
http://groups.google.com/group/rsolr/browse_thread/thread/2d4890fa7737e7ef#
So you need to force the encoding as follows:
my_string.force_encoding(Encoding::UTF_8)
Maybe there is a nice encoding option to pass to RSolr!

Create own encoding

How can I create my own encoding in Ruby (1.9)? The encoding would be used for converting strings while reading/writing from/to a file, i.e. generally for manipulating data in nonstandard-encoded strings (http://en.wikipedia.org/wiki/Mazovia_encoding).
To your updated question: at the moment, all you can do is write some custom code which handles file reading/writing at the byte level and does the needed conversions.
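A rough sketch of that byte-level approach, assuming a hypothetical file name and an illustrative (incomplete, unverified) subset of the Mazovia table; consult the real code page for the actual byte values:
# Map of non-ASCII Mazovia bytes to UTF-8 strings (illustrative subset only).
MAZOVIA_TO_UTF8 = { 0x86 => "ą", 0x8D => "ć", 0x91 => "ę", 0x92 => "ł" }.freeze

def decode_mazovia(bytes)
  bytes.each_byte.map { |b|
    b < 0x80 ? b.chr : MAZOVIA_TO_UTF8.fetch(b, "?")   # "?" for unmapped bytes
  }.join
end

puts decode_mazovia(File.binread("legacy_mazovia.txt"))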
If you are asking how to use different character encodings in Ruby 1.9, I point you to
Working with Encodings in Ruby 1.9 and
Understanding M17n
I couldn't find any references in the ruby-docs about using proprietary encodings, and the Encoding class doesn't have any initializers (though Encoding.find() can dynamically load some of the encodings Iconv supports). Unfortunately, AFAIK Mazovia is unsupported even in iconv, so you're stuck with implementing your own class...
