Transliteration from Ethiopic (and others) to ASCII (ሀ -> ha; ü -> ue) - utf-8

I am not yet very good at reading Amharic (Ge'ez / Ethiopic) letters.
If I have a text in Ge'ez (Ethiopic) letters ( http://en.wikipedia.org/wiki/Ge%27ez_language ), I want to transliterate it to ASCII.
When I go with the Lynx text-mode browser to http://www.addismap.com/am/ (a webpage in Amharic), it shows me "edis map: yeedis ebeba karta". How can I access this functionality, for example in Python, Bash or PHP? Which API do they use?
It does not seem to be iconv:
$ iconv -f UTF-8 -t ASCII//TRANSLIT
Input: ሀ ለ ሐ መ ሠ ረ ሰ
Output: ? ? ? ? ? ? ?

ICU ( http://icu-project.org/ ) has an Amharic-Latin transform, which will turn your text into "hā le ḥā me še re se". You can use it from the command line with uconv -x 'Amharic/BGN-Latin', or from Python with PyICU.
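A minimal PyICU sketch (pip install PyICU); the transform ID below is taken from the uconv command above and is an assumption, since the exact ID can differ between ICU builds:
# Transliterate Amharic text to Latin using ICU's transform engine.
from icu import Transliterator
tr = Transliterator.createInstance('Amharic-Latin/BGN')
print(tr.transliterate('ሀ ለ ሐ መ ሠ ረ ሰ'))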

The Unicode Common Locale Data Repository (CLDR) defines some transliterations. Text::Unidecode (or its Python port, unidecode) has even more of them.
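A quick unidecode sketch (pip install unidecode); note that it maps ü to "u" rather than "ue", and its coverage of Ethiopic depends on the transliteration tables shipped with the package:
# Rough ASCII approximation of arbitrary Unicode text.
from unidecode import unidecode
print(unidecode('ü'))          # 'u'
print(unidecode('ሀ ለ ሐ'))      # ASCII approximation, if a mapping exists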

Related

Bash: Strange characters even after setting locale to UTF-8: "•" prints as "ΓÇó"

We have some Groovy scripts that we run from Git Bash (MINGW64) on Windows.
Some scripts print the bullet character • (or similar).
To make it work we set this variable:
export LC_ALL=en_US.UTF-8
But for some people this is not enough: their console prints ΓÇó instead of •.
Any idea how to make it print properly, and why it prints that even after setting the LC_ALL variable?
Update
The key part is that the output from the Groovy scripts prints incorrectly, while the plain bash scripts cause no problems.
Here is an example that queries the current character map used by the system locale (locale charmap) and filters the output through recode so that it renders with a compatible character mapping:
#!/usr/bin/env sh
cat <<EOF | recode -qf "UTF-8...$(locale charmap)"
• These are
• UTF-8 bullets in source
• But it can gracefully degrade with recode
EOF
With a charmap=ISO-8859-1 it renders as:
o These are
o UTF-8 bullets in source
o But it can gracefully degrade with recode
An alternative method uses iconv instead of recode, and the results may even be better:
#!/usr/bin/env sh
cat <<EOF | iconv -f 'UTF-8' -t "$(locale charmap)//TRANSLIT"
• These are
• UTF-8 bullets followed by a non-breaking space in source
• But it can gracefully degrade with iconv
• Europe's currency sign is € for Euro.
EOF
iconv output with an fr_FR.iso-8859-15#Euro locale:
o These are
o UTF-8 bullets followed by a non-breaking space in source
o But it can gracefully degrade with iconv
o Europe's currency sign is € for Euro.
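As an aside on where ΓÇó comes from: it is the three UTF-8 bytes of • (E2 80 A2) being displayed through a legacy console code page. A quick check, sketched in Python 3:
# Decode the UTF-8 bytes of the bullet as code page 437 (a common Windows console default).
bullet_bytes = '•'.encode('utf-8')
print(bullet_bytes.hex())              # e280a2
print(bullet_bytes.decode('cp437'))    # ΓÇó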

How to get correct replacement of ISO-8859-1 characters to UTF-8?

I want to replace the ISO-8859-1 characters in the file below so that it is valid UTF-8.
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</HEAD>
<BODY>
<A NAME="top"></A>
<TABLE border=0 width=609 cellspacing=0 cellpadding=0>
<TR><td rowspan=2><img src="http://www.example.com" width=10></td>
<TD width=609 valign=top>
<p>'</p>
<p>*</p>
<p>-</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
</TD>
</TR>
</TABLE>
</body>
</html>
Doing some research, I found that the issue is related to the locale language, and I was able to build this awk program, but it only replaces the first two characters (' and *):
LC_ALL=ISO_8859-1 awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8" , $0)
gsub(/\047/, "\\&apos;" , $0)
gsub(/*/, "\\&ast;" , $0)
gsub(/–/, "\\–" , $0)
gsub(/—/, "\\—" , $0)
gsub(/§/, "\\§" , $0)
gsub(/«/, "\\«" , $0)
gsub(/»/, "\\»" , $0)
gsub(/¿/, "\\¿" , $0)
gsub(/Á/, "\\Á" , $0)
print
}' t.html | iconv -f ISO_8859-1 -t UTF-8
This is the current output (partial; only the lines affected by the program are shown below):
<p>&apos;</p>
<p>&ast;</p>
<p>-</p>
<p>-</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
and expected output is:
<p>&ast;</p>
<p>–</p>
<p>—</p>
<p>§</p>
<p>«</p>
<p>»</p>
<p>¿</p>
<p>Á</p>
I've already tried similar code using sed, but I hit the same issue.
How can I fix this?
Below is my locale config (Ubuntu 18.04.1 LTS):
$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
This issue is likely due to an encoding mismatch between the input file and the awk script.
Please note first that there is probably a (very common) confusion between ISO-8859-1 and Windows-1252 here. The HTML sample in the original post contains em/en dash characters which are not part of the ISO-8859-1 layout, so it almost certainly uses another encoding, probably Windows-1252 (a superset of ISO-8859-1 that includes the dash characters), since the OP reports using Ubuntu through the Windows subsystem layer.
I'll then assume that the html input file is indeed encoded with Windows-1252. So non-ASCII characters (code points ≥ 128) use only one byte.
If the awk program is loaded from a file encoded in UTF-8, or typed directly into a terminal window which uses the UTF-8 encoding, then the regular expressions and literal strings embedded in the program are also encoded in UTF-8, so non-ASCII characters use multiple bytes.
For example, the character § (code point 167 = 0xA7) is represented by the byte A7 in Windows-1252 and by the byte sequence C2 A7 in UTF-8. If you use gsub(/§/, "S") in your UTF-8 encoded awk program, then awk looks for the sequence C2 A7 in the input file, which only contains A7. It will not match, unless you are (un)lucky enough to have a character Â (code point 194 = 0xC2) hanging out just before your §.
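To see the mismatch concretely, here is a quick check (assuming a Python 3 interpreter is at hand):
# '§' is one byte in Windows-1252 but two bytes in UTF-8.
print('§'.encode('cp1252').hex())   # a7
print('§'.encode('utf-8').hex())    # c2a7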
Changing the locale does not help here because it only tells awk how to parse its input (data and program), whereas what you need here is to transcode either the data or the regular expressions. For this to work you would have to be able to specify the locale of the data independently of the locale of the program, which is not supported.
So, assuming that your system is set up with a UTF-8 locale and that your awk script uses this locale (no matter whether it is loaded from a file or typed in a terminal), here are several methods you can use to align the input file and the regular expressions on the same encoding so that gsub works as expected.
Please note that these suggestions stick to your first awk command, since it is the source of the issue. The final pipe to iconv is needed only if you intentionally do not transform all the special characters in the input into HTML entities; otherwise the output of awk is plain ASCII, which is already valid UTF-8.
Option 1 : convert the input file from Windows-1252 to UTF-8
No need for another iconv step after that in any case.
iconv -f WINDOWS-1252 t.html | awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\&apos;")
gsub(/\*/, "\\&ast;")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}'
Option 2 : convert the awk program from UTF-8 to Windows-1252
Because the awk program may want to have fun too. Let's use process substitution.
awk -f <(iconv -t WINDOWS-1252 <<'EOS'
{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/'/, "\\&apos;")
gsub(/\*/, "\\&ast;")
gsub(/–/, "\\–")
gsub(/—/, "\\—")
gsub(/§/, "\\§")
gsub(/«/, "\\«")
gsub(/»/, "\\»")
gsub(/¿/, "\\¿")
gsub(/Á/, "\\Á")
print
}
EOS
) t.html
Option 3 : save the awk/shell script in a file encoded in Windows-1252
... with your favorite tool.
Option 4 : switch the encoding of your terminal session to Windows-1252
In case you type/paste the awk command in a terminal of course.
Note that this is different from setting the locale (LC_CTYPE). I'm not aware of a way to do this programmatically; if somebody knows, feel free to contribute.
Option 5 : avoid non-ASCII characters altogether in the awk program
That sounds like good practice anyway, in my opinion.
awk '{
gsub(/charset=iso-8859-1/, "charset=UTF-8")
gsub(/\047/, "\\&apos;")
gsub(/\*/, "\\&ast;")
gsub(/\226/, "\\–")
gsub(/\227/, "\\—")
gsub(/\247/, "\\§")
gsub(/\253/, "\\«")
gsub(/\273/, "\\»")
gsub(/\277/, "\\¿")
gsub(/\301/, "\\Á")
print
}' t.html
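If you need to work out the octal escapes for other characters, here is a small sketch (Python 3; the loop is purely illustrative) that prints the Windows-1252 byte of each character as an awk-style octal escape:
# Map each character to its single Windows-1252 byte, printed in octal.
for ch in '–—§«»¿Á':
    byte = ch.encode('cp1252')[0]
    print(ch, '\\' + format(byte, 'o'))   # e.g. – \226, § \247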

Does Bash support Unicode 6.0?

When I use a Unicode 6.0 character (for example, 'beer mug') in Bash (4.3.11), it doesn't display correctly.
Simply copying and pasting the character is okay, but if I use the UTF-16 hex codes, like
$ echo -e '\ud83c\udf7a'
the output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
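For reference, the pair \ud83c\udf7a really does encode U+1F37A; the UTF-16 decoding arithmetic, sketched in Python 3:
# Combine a UTF-16 surrogate pair into a single code point.
high, low = 0xD83C, 0xDF7A
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))   # 0x1f37a
print(chr(code_point))   # the beer mug, if your font has it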

How can I convert a string from windows-1252 to utf-8 in Ruby?

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
It turns out the Windows string data is encoded as windows-1252, and Rails and MySQL both assume UTF-8 input, so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with accents over them and stuff like that.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According to this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you're on Ruby 1.9...
string_in_windows_1252 = database.get(...)
# => "Fåbulous"
string_in_windows_1252.encoding
# => "windows-1252"
string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fabulous"
string_in_utf_8.encoding
# => 'UTF-8'
Hi,
I had the exact same problem.
These tips helped me get going:
Always check for the proper encoding name in order to feed your conversion tools correctly.
If in doubt, you can get a list of supported encodings for iconv or recode using:
$ recode -l
or
$ iconv -l
Always start from your original file and encode a sample to work with:
$ recode windows-1252..u8 < original.txt > sample_utf8.txt
or
$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.
File.open has a new 'mode' parameter in Ruby 1.9. Use it!
This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
File.open('original.txt', 'r:windows-1252:utf-8')
# This opens a file specifying all encoding options. r:windows-1252 means read it as windows-1252. :utf-8 means treat it as utf-8 internally.
Have fun and swear a lot!
If you want to convert a file named win1252_file on a unix OS, run:
$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
You should probably be able to do the same on Windows with cygwin.
If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try
require 'fileutils'

File.open('/tmp/w1252', 'wb') do |file|   # binary mode so Windows does not mangle bytes
  my_windows_1252_string.each_byte do |byte|
    file.putc byte   # write the raw byte; << would write its decimal representation
  end
end
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`
my_utf_8_string = File.read('/tmp/utf8')
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end

How to fix a weird issue with iconv on Mac OS X

I am on Mac OS X 10.5 (but I reproduced the issue on 10.4).
I am trying to use iconv to convert a UTF-8 file to ASCII.
The UTF-8 file contains characters like 'éàç', and I want the accented characters to be turned into their closest ASCII equivalents.
So my command is this:
iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE myutf8file.txt
This works fine on a Linux machine, but on my local Mac OS X I get this, for instance:
è => 'e
à => `a
I really don't understand why iconv returns this weird output on Mac OS X while everything is fine on Linux.
Any help or directions?
Thanks in advance.
The problem is that Mac OSX uses another implementation of iconv called libiconv. Most Linux distributions have an implementation of iconv which is part of libc. Unfortunately libiconv transliterates characters such as ö, è and ñ as "o, `e and ~n. The only way to fix this is to download the source and modify the translit.h file in the lib directory. Find lines that look like this:
2, '"', 'o',
and replace them with something like this:
1, 'o',
I spent hours on google trying to figure out the answer to this problem and finally decided to download the source and hack around with it. Hope this helps someone!
I found a workaround suitable for my needs (just to clarify: a script gets a string and converts it into a “permalink” URL).
My workaround consists of piping the iconv output through a sed filter:
echo á é ç this is a test | iconv -f utf8 -t ascii//TRANSLIT | sed 's/[^a-zA-Z 0-9]//g'
The result for the above in OS X Yosemite is:
a e c this is a test
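If a scripting language is an option for the permalink use case, a Unicode-decomposition approach sidesteps the libiconv quirk entirely. A minimal sketch in Python 3 (strip_accents is just an illustrative helper name):
import unicodedata

def strip_accents(s):
    # Decompose accented characters (NFKD), then drop everything non-ASCII,
    # including the combining marks left over from the decomposition.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

print(strip_accents('á é ç this is a test'))   # a e c this is a test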
My guess is that on your Linux machine the locale is set differently...
As far as I can remember, iconv uses the current locale when transliterating, and by default Mac OS X has the locale set to "C", which (obviously) does not handle accents and language-specific characters. Maybe try doing this before running iconv:
setlocale(LC_ALL, "en_EN");
|K<
Another option is to use unaccent which is installed by brew install unac:
$ unaccent utf-8<<<é
e
unaccent does not convert characters in decomposed form (such as LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), but you can use uconv to convert characters to composed form:
$ unaccent utf-8<<<$'e\u0301'
é
$ uconv -f utf-8 -t utf-8 -x NFC<<<$'e\u0301'|unaccent utf-8
e
brew install icu4c;ln -s /usr/local/opt/icu4c/bin/uconv /usr/local/bin installs uconv.
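The composed/decomposed distinction that unaccent trips over is easy to see with Python's unicodedata module, for example:
# 'e' + COMBINING ACUTE ACCENT (two code points) vs the precomposed 'é' (one).
import unicodedata
decomposed = 'e\u0301'
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))   # 2 1
print(composed == '\u00e9')             # True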
