What is the difference between languages that end in UTF-8 and those that don't?
In particular, between it_IT and it_IT.UTF-8, and then the one that interests me most, which is between C and C.UTF-8. Which of C and C.UTF-8 should I put in the LC_ALL variable, for example?
Here is the list that appears when I run the locale -a command, so you can better understand what my concern is.
C
C.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
it_CH.utf8
it_IT.utf8
POSIX
The main difference between locales that end in UTF-8 and those that don't is that the former use UTF-8, a Unicode character encoding that can represent a very wide range of characters from different scripts. This allows for a more internationalized environment, since text in many languages can be handled and displayed correctly.
To get Italian conventions with Unicode support, set LC_ALL to "it_IT.UTF-8"; for a language-neutral environment with Unicode support, use "C.UTF-8" rather than plain "C".
In general, I'd recommend using a UTF-8 locale, which is more versatile.
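To apply that in the current shell session, a minimal sketch (locale -a spells the name it_IT.utf8, but on glibc systems the canonical spelling it_IT.UTF-8 is normally accepted as well):
export LC_ALL=it_IT.UTF-8
locale    # every LC_* category should now report it_IT.UTF-8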
As an example of the difference between C and C.UTF-8, in Git Bash:
LC_ALL=C grep -P hello /dev/null
# output:
# grep: -P supports only unibyte and UTF-8 locales
LC_ALL=C.UTF-8 grep -P hello /dev/null
# No output
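The two also differ in character classification. A quick sketch you can try (exact behaviour may vary with your grep and libc versions):
printf 'é\n' | LC_ALL=C grep '[[:alpha:]]'        # typically no match: in the C locale [[:alpha:]] covers only ASCII letters
printf 'é\n' | LC_ALL=C.UTF-8 grep '[[:alpha:]]'  # typically prints é: the UTF-8 locale classifies it as a letter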
I want to replace the ISO-8859-1 characters in the file below so that it is valid UTF-8.
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</HEAD>
<BODY>
<A NAME="top"></A>
<TABLE border=0 width=609 cellspacing=0 cellpadding=0>
<TR><td rowspan=2><img src="http://www.example.com" width=10></td>
<TD width=609 valign=top>
<p>'</p>
<p>*</p>
<p>-</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
</TD>
</TR>
</TABLE>
</body>
</html>
Doing some research, I found that the issue is related to the locale language, and I was able to build this awk program, but it only replaces the first 2 characters (' and *):
LC_ALL=ISO_8859-1 awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8" , $0)
gsub(/\047/, "\\'" , $0)
gsub(/*/, "\\*" , $0)
gsub(/–/, "\\–" , $0)
gsub(/—/, "\\—" , $0)
gsub(/§/, "\\§" , $0)
gsub(/«/, "\\«" , $0)
gsub(/»/, "\\»" , $0)
gsub(/¿/, "\\¿" , $0)
gsub(/Á/, "\\Á" , $0)
print
}' t.html | iconv -f ISO_8859-1 -t UTF-8
This is the current output (partial output below, showing only the lines affected by the program):
<p>'</p>
<p>*</p>
<p>-</p>
<p>-</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
and expected output is:
<p>*</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
I've already tried similar code using sed, but I hit the same issue.
How can I fix this?
Below is my locale config:
Ubuntu 18.04.1 LTS
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
This issue is likely due to an encoding mismatch between the input file and the awk script.
Please first note that there is probably a (very common) confusion between ISO-8859-1 and Windows-1252 here. The HTML sample in the original post contains em/en dash characters, which are not part of the ISO-8859-1 layout, so it almost certainly uses another encoding, probably Windows-1252 (a superset of ISO-8859-1 that includes the dash characters), since the OP reported using Ubuntu through the Windows Subsystem for Linux.
I'll therefore assume that the HTML input file is indeed encoded in Windows-1252, so non-ASCII characters (code points ≥ 128) use only one byte each.
If the awk program is loaded from a file encoded in UTF-8, or even typed directly in a terminal window that uses the UTF-8 encoding, then the regular expressions and literal strings embedded in the program are also encoded in UTF-8, so non-ASCII characters use multiple bytes.
For example, the character § (code point 167 = 0xA7) is represented by the byte A7 in Windows-1252 and by the byte sequence C2 A7 in UTF-8. If you use gsub(/§/, "S") in your UTF-8 encoded awk program, awk looks for the sequence C2 A7 in an input file that only contains A7, so it will not match, unless you are (un)lucky enough to have an Â character (code point 194 = 0xC2) hanging out just before your §.
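You can see those byte sequences for yourself with a quick sketch like this, assuming a UTF-8 terminal and that iconv and hexdump are available:
printf '§' | hexdump -C                                   # UTF-8: c2 a7
printf '§' | iconv -f UTF-8 -t WINDOWS-1252 | hexdump -C  # Windows-1252: a7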
Changing the locale does not help here because it only tells awk how to parse its input (data and program), whereas what you need here is to transcode either the data or the regular expressions. For this to work you would have to be able to specify the locale of the data independently of the locale of the program, which is not supported.
So, assuming that your system is set up with a UTF-8 locale and that your awk script uses this locale (no matter whether it is loaded from a file or typed in a terminal), here are several methods you can use to align the input file and the regular expressions on the same encoding so that gsub works as expected.
Please note that these suggestions stick to your first awk command, since it is the source of the issue. The final pipe to iconv is needed only if you intentionally do not transform all the special characters in the input into HTML entities; otherwise the output of awk is plain ASCII and therefore already valid UTF-8.
Option 1: convert the input file from Windows-1252 to UTF-8
No need for another iconv step after that in any case.
iconv -f WINDOWS-1252 t.html | awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}'
Option 2: convert the awk program from UTF-8 to Windows-1252
Because the awk program may want to have fun too. Let's use process substitution.
awk -f <(iconv -t WINDOWS-1252 <<'EOS'
{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/'/, "\\'")
gsub(/\*/, "\\*")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}
EOS
) t.html
Option 3: save the awk/shell script in a file encoded in Windows-1252
... with your favorite tool.
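For instance, you could convert an existing UTF-8 script with iconv (replace.awk is a hypothetical file name):
iconv -f UTF-8 -t WINDOWS-1252 replace.awk > replace.cp1252.awk
awk -f replace.cp1252.awk t.html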
Option 4: switch the encoding of your terminal session to Windows-1252
In case you type or paste the awk command into a terminal, of course.
Note that this is different from setting the locale (LC_CTYPE). I'm not aware of a way to do this programmatically; if somebody knows, feel free to contribute.
Option 5: avoid non-ASCII characters altogether in the awk program
This sounds like good practice anyway, in my opinion. The octal escapes below are the Windows-1252 byte values of the corresponding characters (for example, \226 is 0x96, the en dash).
awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\'")
gsub(/\*/, "\\*")
gsub(/\226/, "\\–")
gsub(/\227/, "\\—")
gsub(/\247/, "\\§")
gsub(/\253/, "\\«")
gsub(/\273/, "\\»")
gsub(/\277/, "\\¿")
gsub(/\301/, "\\Á")
print
}' t.html
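If you need the octal value of another Windows-1252 character for this kind of pattern, a quick sketch (assuming iconv and od are available):
printf '«' | iconv -f UTF-8 -t WINDOWS-1252 | od -An -to1    # prints 253, matching the \253 used above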
I'm unable to correctly pass UTF-8 string values as arguments to command line apps.
Approaches I've tried:
pass the value between double quotes: "café"
pass with single quotes: 'café'
use the char code: 'caf\233'
use a $ sign before the string: $'café'
I'm using Mac OS 10.10 and iTerm, and my current locale output is:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
It is doubtful this has anything to do with the shell. I would make sure that your tools (both the writer tools and whatever you're reading with) correctly deal with UTF-8 at all. I would most suspect that whatever you're reading your tags with is interpreting and printing this as Latin-1. You should look inside the file with a hex editor and look for the tag. I'm betting it will be correct (C3 A9, which is é in UTF-8 and Ã© in Latin-1). Your output tool is probably the problem, not the writer (and definitely not the shell).
If your reading tool demands Latin-1, then you need to encode é as E9. The iconv tool can be useful in making those conversions for scripts.
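For instance, a sketch (the file names are hypothetical):
iconv -f UTF-8 -t ISO-8859-1 tags-utf8.txt > tags-latin1.txt   # convert a UTF-8 file for a Latin-1-only reader
printf 'café' | hexdump -C                                     # inspect the raw bytes: 63 61 66 c3 a9 if the terminal sends UTF-8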
When I use a Unicode 6.0 character (for example, 'beer mug') in Bash (4.3.11), it doesn't display correctly.
Just copying and pasting the character is okay, but if you use the UTF-16 hex codes, like
$ echo -e '\ud83c\udf7a'
the output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets Unicode escapes of the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
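If you want to double-check, you can also spell out the raw UTF-8 bytes yourself, or inspect what the escape actually produces (assuming hexdump is available):
printf '\xf0\x9f\x8d\xba\n'            # the same mug as explicit UTF-8 bytes
echo -e '\U0001f37a' | hexdump -C      # expect: f0 9f 8d ba 0a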
I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? On Linux-based systems you can check this by running the locale command, and change it with, e.g.,
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding, and this is causing Ruby to assume that the text files were created according to UTF-8 encoding rules. You can see this by trying:
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
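So, as a sketch, you could run a script under a non-UTF-8 locale for a single invocation, or set Ruby's default external encoding explicitly (my_script.rb is a placeholder name):
$ LANG=en_GB ruby my_script.rb
$ ruby -E ASCII-8BIT my_script.rb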
For a more general treatment of how string encoding has changed in Ruby 1.9, I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)