Why doesn't Encoding.default_external respect LANG? - ruby

It's my understanding that Ruby's Encoding.default_external is given a default value based on the environment variables LC_ALL and LANG, giving precedence to the former. I've run into several bugs where the default external encoding somehow ends up set to ASCII even though the environment variables are set to UTF-8.
For example:
$ irb
irb(main):001:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):002:0> ENV['LC_ALL']
=> nil
irb(main):003:0> ENV['LANG']
=> "en_US.UTF-8"
In the environments where this has happened, I've also grepped through all the gems being loaded for any code manually setting the default external encoding, but haven't found anything. How is what I'm seeing above possible? I'm using Ruby 2.2 above, but I've seen this happen on all Ruby 2.x versions.

I figured it out. Not only does the LANG environment variable need to be set, but the locale it specifies must also have been generated for the OS. On a stock Linux image, the default locale may be something that is not UTF-8. In my particular case, I'm using Debian 7.7 and the default locale is "POSIX". I was able to set the default locale by installing the locales package and following the interactive prompts to generate the en_US.UTF-8 locale:
$ apt-get -y install locales
If the locales package is already installed, you can just reconfigure it instead:
$ dpkg-reconfigure locales
Now setting LANG will change the current system locale, and Ruby's Encoding.default_external will be set properly:
$ export LANG=en_US.UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ irb
irb(main):001:0> Encoding.default_external
=> #<Encoding:UTF-8>
For an example of how to automate the generation and configuration of the default locale instead of doing it interactively, take a look at this Docker image.
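For reference, the non-interactive route on Debian usually comes down to uncommenting the wanted entry in /etc/locale.gen and then running locale-gen. The following is only a sketch under that assumption, not the contents of the linked image: enable_locale is a hypothetical helper name, and the commented commands at the end assume a Debian system and root privileges.

```shell
# enable_locale FILE NAME (hypothetical helper)
# Uncomment the entry for NAME in a locale.gen-style file, where a disabled
# entry looks like "# en_US.UTF-8 UTF-8". Other lines are left untouched.
# Note: NAME is used inside a regular expression, so "." matches any character.
enable_locale() {
  file=$1
  name=$2
  tmp=$(mktemp) || return 1
  # Drop the leading "#" only on the line whose entry starts with NAME.
  if sed "s/^#[[:space:]]*\(${name}[[:space:]]\)/\1/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Assumed usage on Debian, as root (not executed here):
#   enable_locale /etc/locale.gen en_US.UTF-8
#   locale-gen
#   update-locale LANG=en_US.UTF-8
```

Writing through a temporary file instead of sed -i keeps the helper portable between GNU and BSD sed.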

Related

Why do I get a locale error even though it is set?

When I run bitbake, I'm getting the following:
$ bitbake core-image-base
Please use a locale setting which supports utf-8.
Python can't change the filesystem locale after loading so we need a utf-8 when python starts or things won't work.
Why is this happening even though my locale is set to en_US.UTF-8?
$ echo $LC_ALL
en_US.UTF-8
For additional background information, please also see https://unix.stackexchange.com/questions/626916/how-to-set-locale-correctly-manually/626919
UPDATE:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
My ~/.bashrc looks like:
$ cat ~/.bashrc
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
and when opening a new shell I get:
$ bash
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
If the shell was started before you added the locale, you need to open a new one (by running bash as a child, opening a new terminal, starting a new ssh session, and so on).
Then this should work:
$ export LC_ALL="en_US.UTF-8"
$ bitbake core-image-base
The export might not be needed; that depends on the default for your system.
Making the "comment crowned by success" an answer:
sudo locale-gen en_US en_US.UTF-8
sudo dpkg-reconfigure locales
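The "Cannot set LC_CTYPE to default locale" messages above indicate that the variables name a locale that does not exist on the system, which is why generating it fixes the error. One way to confirm this before and after is to compare the requested name with the output of locale -a. The sketch below assumes glibc-style output, where en_US.UTF-8 is listed as en_US.utf8; normalize_locale and locale_is_generated are hypothetical helper names.

```shell
# normalize_locale NAME (hypothetical helper)
# Lowercase the codeset and drop hyphens so that "en_US.UTF-8" and
# "en_US.utf8" compare equal. Names without a codeset are returned as-is.
normalize_locale() {
  case $1 in
    *.*)
      codeset=$(printf '%s' "${1#*.}" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
      printf '%s.%s\n' "${1%%.*}" "$codeset"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

# locale_is_generated NAME (hypothetical helper)
# Reads a list of locale names on stdin, one per line, and succeeds
# if NAME is among them after normalization.
locale_is_generated() {
  wanted=$(normalize_locale "$1")
  while IFS= read -r name; do
    if [ "$(normalize_locale "$name")" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# Assumed usage (not executed here):
#   locale -a | locale_is_generated "$LC_ALL" || echo "locale not generated" >&2
```

Reading the list from stdin rather than calling locale -a inside the function makes the check easy to try against a known list.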

Displaying Telugu in the Terminal or iTerm2 applications on Mac OS

Terminal and iTerm2 on my Mac don't display Telugu characters properly. The characters get all jumbled up. I see the same issue with other languages like Kannada and Sanskrit. Some characters seem fine, but others are jumbled (as if one character were superimposed on another).
I set the text encoding of Terminal to UTF-8 and ran export LC_CTYPE=en_US.UTF-8 as suggested by other answers, but nothing seems to work. Here is my locale:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Enabling the "double width" setting did not solve the problem either. I also checked "set locale environment variable on startup", which did not help.
Note that the characters are displayed properly in other applications like browsers, word processors, etc. So the problem is specific to terminal apps like Terminal and iTerm2.
This is how the word "Telugu" is being displayed

Lynx UTF-8 support

I am using Lynx on OS X 10.11. However, it does not print UTF-8 for non-ASCII characters, but rather either an ASCII representation of them, or the ef bf bd "replacement" character (?).
I have been studying this guide for help.
The output from the locale command:
locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
When I run Lynx with
lynx http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
this is what the display looks like:
According to the posts in the article, Lynx should print UTF-8 properly.
lynx -dump ... prints the same.
(running export LC_ALL="en_US.UTF-8" doesn't help either.)
What is strange is that if I run it with the -mime_header argument, e.g.:
lynx -mime_header http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
it prints the characters properly (albeit as a dump rather than opening in a browser environment):
EDIT:
Forgot to mention,
-assume_charset=utf8 and -assume_unrec_charset=utf8
don't help either.
EDIT:
Well, I am able to get the output I want by hard-setting CHARACTER_SET in lynx.cfg. This seems like a bit of a workaround, though, as the documentation states:
# ... The 'o'ptions menu setting will be stored in the user's RC
# file whenever those settings are saved, and thereafter will be used as the
# default. ...
However, the setting only persists for the session it is set in. That won't work for me, as I am primarily using lynx -dump in a script. But since I work almost exclusively with UTF-8, I guess I can live with the hard setting for now.
I do think you should use
lynx -dump --display_charset=utf-8
rather than hard-setting the value in the config file. So:
lynx --display_charset=utf-8 http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
Alternatively, check out https://www.brow.sh/

File missing upon launching EC2 using Amazon Linux AMI

I've just launched a t2.micro free tier EC2 instance and SSHed into it from my local machine. I was greeted by the following error message:
-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8): No such file or directory
This instance was launched using the console, with every option left at its default and nothing special configured. What could be the reason for it?
Below is the locale of this instance:
[root@ip-xxx .aws]# locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
In my case, I had to add the following lines to /etc/environment:
LANG=en_US.utf-8
LC_ALL=en_US.utf-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Also, for later use, add the same two export commands to ~/.bash_profile.
Based on @ardit's comment.
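One detail in the locale output above is worth spelling out: LC_CTYPE is set to the bare value UTF-8, while LANG is en_US.UTF-8. For each category, LC_ALL takes precedence over the category's own variable, which in turn takes precedence over LANG. The LC_CTYPE value therefore wins for character handling unless LC_ALL overrides it, which would explain why setting LC_ALL makes the warning go away. A minimal sketch of that lookup order follows; pick_ctype is a hypothetical helper name.

```shell
# pick_ctype LC_ALL LC_CTYPE LANG (hypothetical helper)
# Print the first non-empty value in POSIX precedence order for the
# character-type category, or "POSIX" when all three are empty.
pick_ctype() {
  for candidate in "$1" "$2" "$3"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  printf 'POSIX\n'
}

# Assumed usage with the current environment (not executed here):
#   pick_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

With the values from the question, pick_ctype "" UTF-8 en_US.UTF-8 prints UTF-8; once LC_ALL is set, pick_ctype en_US.UTF-8 UTF-8 en_US.UTF-8 prints en_US.UTF-8.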

Changing Locale in Docker Stops Many Commands From Executing?

How come when I change my locale in a Dockerfile using this...:
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
...so that I can achieve a change in locale from this...
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
..to this..
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Then I get issues such as this, with commands failing to run:
root@820760edeb77:/# irb
bash: irb: command not found
But if I comment out those locale changes and rebuild the container, then everything works as expected:
# ENV LANG en_US.UTF-8
# ENV LANGUAGE en_US:en
# ENV LC_ALL en_US.UTF-8
root@820760edeb77:/# irb
2.3.1 :001 >
I'm not sure what is causing this. The changed locale appears to stop commands from working, but I suspect that could just be a side effect of changing the locale within the Docker container, and possibly not the real issue.
So I just figured it out. It turns out I was using the wrong locale type for Docker, which is related to this issue here.
A small but critical distinction, C vs en_US:
WRONG
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
RIGHT
ENV LANG C.UTF-8
ENV LANGUAGE C.UTF-8
ENV LC_ALL C.UTF-8
Can anyone tell me why Docker uses C as a locale, vs en_US, or any other?
Now Ruby/irb is working (albeit with 4 hours of my life lost...).
