What is the difference between locales that end in UTF-8 and those that don't?
In particular, between it_IT and it_IT.UTF-8, and, what interests me most, between C and C.UTF-8. Which of C and C.UTF-8 should I put in the "LC_ALL" variable, for example?
Here is the list that appears when I run the locale -a command, so you can better understand my concern.
C
C.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
it_CH.utf8
it_IT.utf8
POSIX
The main difference between locales that end in UTF-8 and those that don't is that the former use UTF-8, an encoding that can represent the full range of Unicode characters from different scripts. This allows for a more internationalized environment, since text can be displayed in a wide variety of languages.
For Italian text, LC_ALL should be set to "it_IT.UTF-8" to get Italian conventions together with full Unicode support. As for C versus C.UTF-8: C.UTF-8 keeps the C locale's behaviour (POSIX sort order, English messages, default number and date formats) but uses UTF-8 as its character encoding, so non-ASCII text is handled correctly.
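A quick way to see the practical difference between C and C.UTF-8 is to compare their character maps (a sketch assuming a glibc-based system like the one that produced the locale -a list above; the charmap name for plain C may appear as US-ASCII elsewhere):
LC_ALL=C locale charmap
# ANSI_X3.4-1968   (i.e. plain ASCII)
LC_ALL=C.UTF-8 locale charmap
# UTF-8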
I'd recommend using a UTF-8 locale, which is more versatile.
For example, in Git Bash:
LC_ALL=C grep -P hello /dev/null
# output:
# grep: -P supports only unibyte and UTF-8 locales
LC_ALL=C.UTF-8 grep -P hello /dev/null
# No output: -P is accepted, and the pattern simply doesn't match anything in /dev/null
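Another practical difference is character classification, i.e. whether non-ASCII letters count as letters at all. A small sketch assuming GNU grep, bash, and a UTF-8-encoded é in the input:
LC_ALL=C grep -oE '[[:alpha:]]+' <<< 'perché'
# output:
# perch            (the bytes of é are not letters in the C locale)
LC_ALL=C.UTF-8 grep -oE '[[:alpha:]]+' <<< 'perché'
# output:
# perché           (é is recognized as a letter under C.UTF-8)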
We have some Groovy scripts that we run from Git Bash (MINGW64) on Windows.
Some scripts print the bullet character • (or similar).
To make it work we set this variable:
export LC_ALL=en_US.UTF-8
But for some people this is not enough: their console prints ΓÇó instead of •.
Any idea how to make it print properly, and why it prints that even after setting the LC_ALL variable?
Update
The key point is that the output from the Groovy scripts prints incorrectly, while plain bash scripts have no problems.
Here is an example that queries the character map used by the system locale (locale charmap) and filters the output with recode to render it with a compatible character mapping:
#!/usr/bin/env sh
cat <<EOF | recode -qf "UTF-8...$(locale charmap)"
• These are
• UTF-8 bullets in source
• But it can gracefully degrade with recode
EOF
With a charmap=ISO-8859-1 it renders as:
o These are
o UTF-8 bullets in source
o But it can gracefully degrade with recode
An alternate method uses iconv instead of recode, and the results may even be better.
#!/usr/bin/env sh
cat <<EOF | iconv -f 'UTF-8' -t "$(locale charmap)//TRANSLIT"
• These are
• UTF-8 bullets followed by a non-breaking space in source
• But it can gracefully degrade with iconv
• Europe's currency sign is € for Euro.
EOF
iconv output with an fr_FR.ISO-8859-15@euro locale:
o These are
o UTF-8 bullets followed by a non-breaking space in source
o But it can gracefully degrade with iconv
o Europe's currency sign is € for Euro.
When I have the files a.txt, b.txt and c.txt, is it guaranteed that
cat *.txt > all_files.txt
or
cat ?.txt > all_files.txt
will combine the files in alphabetical order?
(In all my tests the alphabetical order was preserved, but I'm not sure, because with ls, for example, the order is undefined and need not be alphabetical - though it often is, because the files have often been written to the directory in alphabetical order.)
No, it depends on the locale. The order is dictated by the collation sequence in the locale, which can be changed using the LC_COLLATE or LC_ALL environment variables. Note that bash behaves differently in this respect from some other shells (e.g. the Korn shell).
If you have a locale setting of C or POSIX, the expansion will be in character-set order. Otherwise you will probably only notice a difference with mixed-case letters, e.g. the collation sequence for en_ locales is aAbBcC ... xXyYzZ. For an example, see http://collation-charts.org/fc6/fc6.en_GB.iso885915.html.
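A quick way to see the effect on globbing (a sketch assuming bash and a system with the en_US.UTF-8 locale installed):
touch a.txt B.txt
LC_ALL=C bash -c 'echo *.txt'
# B.txt a.txt      (character-set order: uppercase before lowercase)
LC_ALL=en_US.UTF-8 bash -c 'echo *.txt'
# a.txt B.txt      (aAbB... collation)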
Available locales may be listed using locale -a.
Edit: another variable, LANG, is available, but it is generally not used much nowadays. According to the Single UNIX Specification it is used "in the absence of the LC_ALL and other LC_* ... environment variables".
I'm unable to correctly pass UTF-8 string values as arguments to command line apps.
Approaches I've tried:
pass the value between double quotes: "café"
pass with single quotes: 'café'
use the char code: 'caf\233'
use a $ sign before the string: $'café'
I'm using Mac OS 10.10 and iTerm, and my current locale output is:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
It is doubtful this has anything to do with the shell. I would make sure that your tools (both the writer tools and whatever you're reading with) deal with UTF-8 correctly at all. My main suspicion is that whatever you're reading your tags with is interpreting and printing them as Latin-1. You should look inside the file with a hex editor and find the tag. I'm betting it will be correct (C3 A9, which is é in UTF-8 and displays as é when interpreted as Latin-1). Your output tool is probably the problem, not the writer (and definitely not the shell).
If your reading tool demands Latin-1, then you need to encode é as E9. The iconv tool can be useful in making those conversions for scripts.
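For instance, a minimal sketch of such a conversion, assuming a UTF-8 terminal and that xxd is available to inspect the result:
printf 'café' | iconv -f UTF-8 -t ISO-8859-1 | xxd
# 00000000: 6361 66e9    caf.
# the é is now the single Latin-1 byte E9 instead of the UTF-8 pair C3 A9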
When I use a Unicode 6.0 character (for example, 'beer mug') in Bash (4.3.11), it doesn't display correctly.
Just copying and pasting the character works, but if I use the UTF-16 hex codes, like
$ echo -e '\ud83c\udf7a'
the output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets Unicode escapes of the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
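If you want to double-check which bytes actually reach the terminal (assuming bash >= 4.2 and that xxd is available), pipe the same escape through printf, which interprets \U in its format string the same way:
printf '\U0001f37a' | xxd
# 00000000: f09f 8dba    ....
# F0 9F 8D BA is the UTF-8 encoding of U+1F37A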
I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? On Linux-based systems you can check this by running the locale command and change it with, e.g.:
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding, and this is causing Ruby to assume that the text files were created according to UTF-8 encoding rules. You can see this by trying:
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or a similar shell - C-shell derivatives are different)