I am on Mac OS X 10.5 (but I reproduced the issue on 10.4).
I am trying to use iconv to convert a UTF-8 file to ASCII.
The UTF-8 file contains characters like 'éàç', and I want the accented characters to be turned into their closest ASCII equivalents.
So my command is this:
iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE myutf8file.txt
This works fine on a Linux machine, but on my local Mac OS X I get this, for instance:
è => 'e
à => `a
I really don't understand why iconv returns this weird output on Mac OS X when everything is fine on Linux.
Any help or directions?
Thanks in advance.
The problem is that Mac OS X uses another implementation of iconv called libiconv, while most Linux distributions use the implementation that is part of glibc. Unfortunately libiconv transliterates characters such as ö, è and ñ as "o, `e and ~n. The only way to fix this is to download the libiconv source and modify the translit.h file in the lib directory. Find lines that look like this:
2, '"', 'o',
and replace them with something like this:
1, 'o',
I spent hours on google trying to figure out the answer to this problem and finally decided to download the source and hack around with it. Hope this helps someone!
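A quick way to tell which implementation a machine has is to look at the version banner (the exact wording varies by version, but the project name appears on the first line):

```shell
# GNU libiconv (used on Mac OS X) identifies itself as "libiconv";
# the glibc implementation identifies itself as part of GNU libc.
iconv --version | head -n 1
```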
I found a workaround suitable for my needs (just to clarify: a script gets a string and converts it to a “permalink” URL).
My workaround consists of piping the iconv output through a sed filter:
echo á é ç this is a test | iconv -f utf8 -t ascii//TRANSLIT | sed 's/[^a-zA-Z 0-9]//g'
The result for the above in OS X Yosemite is:
a e c this is a test
My guess is that on your Linux machine the locale is set differently...
As far as I can remember, iconv uses the current locale when transliterating, and by default Mac OS has the locale set to "C", which (obviously) does not handle accents and language-specific characters... maybe try doing this before running iconv:
setlocale(LC_ALL, "en_EN");
Another option is to use unaccent, which is installed by brew install unac:
$ unaccent utf-8<<<é
e
unaccent does not convert characters in decomposed form (such as LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), but you can use uconv to convert characters to composed form:
$ unaccent utf-8<<<$'e\u0301'
é
$ uconv -f utf-8 -t utf-8 -x NFC<<<$'e\u0301'|unaccent utf-8
e
brew install icu4c; ln -s /usr/local/opt/icu4c/bin/uconv /usr/local/bin installs uconv.
We have some Groovy scripts that we run from Git Bash (MINGW64) on Windows.
Some scripts print the bullet character • (or similar).
To make it work we set this variable:
export LC_ALL=en_US.UTF-8
But for some people this is not enough: their console prints ΓÇó instead of •.
Any idea how to make it print properly, and why it prints that even after setting the LC_ALL variable?
Update
The key part is that the output from Groovy scripts is printing incorrectly, but there are no problems with the plain bash scripts.
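The ΓÇó output itself identifies the culprit: those are the bullet's UTF-8 bytes decoded as code page 437, the default OEM code page of the Windows console. The mismatch can be reproduced with iconv alone:

```shell
# • is U+2022; its UTF-8 encoding is the three bytes E2 80 A2.
# Decoded as CP437, those bytes are exactly the mojibake in question.
printf '\xe2\x80\xa2' | iconv -f CP437 -t UTF-8
# prints: ΓÇó
```

This suggests the Groovy side is emitting UTF-8 correctly, and it is the console's decoding that needs changing.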
An example that queries the character map used by the system locale with locale charmap, and filters the output with recode to render it with a compatible character mapping:
#!/usr/bin/env sh
cat <<EOF | recode -qf "UTF-8...$(locale charmap)"
• These are
• UTF-8 bullets in source
• But it can gracefully degrade with recode
EOF
With a charmap=ISO-8859-1 it renders as:
o These are
o UTF-8 bullets in source
o But it can gracefully degrade with recode
An alternate method uses iconv instead of recode, and the results may even be better:
#!/usr/bin/env sh
cat <<EOF | iconv -f 'UTF-8' -t "$(locale charmap)//TRANSLIT"
• These are
• UTF-8 bullets followed by a non-breaking space in source
• But it can gracefully degrade with iconv
• Europe's currency sign is € for Euro.
EOF
iconv output with an fr_FR.iso-8859-15#Euro locale:
o These are
o UTF-8 bullets followed by a non-breaking space in source
o But it can gracefully degrade with iconv
o Europe's currency sign is € for Euro.
I'm unable to correctly pass UTF-8 string values as arguments to command line apps.
Approaches I've tried:
pass the value between double quotes: "café"
pass with single quotes: 'café'
use the char code: 'caf\233'
use a $ sign before the string: $'café'
I'm using Mac OS 10.10, iTerm and my current locale outputs:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
It is doubtful this has anything to do with the shell. I would make sure that your tools (both the writer tools and whatever you're reading with) correctly deal with UTF-8 at all. I would most suspect that whatever you're reading your tags with is interpreting and printing this as Latin-1. You should look inside the file with a hex editor and look for the tag. I'm betting it will be correct (C3 A9, which is é in UTF-8, and Ã© in Latin-1). Your output tool is probably the problem, not the writer (and definitely not the shell).
If your reading tool demands Latin-1, then you need to encode é as E9. The iconv tool can be useful in making those conversions for scripts.
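If a hex editor is not at hand, od works for the same check; a small sketch with 'café' standing in for the real tag value:

```shell
# In UTF-8 the é must appear as the byte pair c3 a9
printf 'café' | od -An -tx1
# shows: 63 61 66 c3 a9
```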
I'm having a problem trying to read, on Windows, a CSV file generated on a Mac.
My question is how can I convert the encoding to UTF-8 or even ISO-8859-1.
I've already tried iconv with no success.
Inside vim I can see that in this file line breaks are marked with ^M and the accent ã is marked with <8b>, Ç = <82>, and so on.
Any ideas?
To convert from encoding a to encoding b, you need to know what encoding a is.
Given that
ã is marked with <8b>, Ç = <82>
encoding a is very likely Mac OS Roman.
So call iconv with macintosh* as the from argument and utf-8 as the to argument.
*try macroman, x-mac-roman etc if macintosh is not found
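Put together, the conversion might look like this (mac_export.csv is a hypothetical filename; the tr step additionally turns the classic Mac CR line endings, the ^M seen in vim, into Unix LF):

```shell
# Decode Mac OS Roman to UTF-8, then normalize CR line endings to LF
iconv -f MACINTOSH -t UTF-8 mac_export.csv | tr '\r' '\n' > export_utf8.csv
```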
I am not yet so good with reading Amharic (Geez / Ethiopic) letters.
If I have a text in Ge'ez (Ethiopia) letters ( http://en.wikipedia.org/wiki/Ge%27ez_language ) I want to transliterate them to ASCII.
When I go with the Lynx text-mode browser to http://www.addismap.com/am/ (a webpage in Amharic), it shows me "edis map: yeedis ebeba karta". How can I access this functionality, for example, in Python, Bash or PHP? Which API do they use?
It seems not to be iconv:
$ iconv -f UTF-8 -t ASCII//TRANSLIT
Input: ሀ ለ ሐ መ ሠ ረ ሰ
Output: ? ? ? ? ? ? ?
ICU ( http://icu-project.org/ ) has an Amharic-Latin transform, which will turn your text into "hā le ḥā me še re se". You can use it with uconv -x 'Amharic/BGN-Latin' from the command line, or with pyicu.
The Unicode Common Locale Data Repository defines some transliterations. Unidecode (or its Python port) has even more of them.
I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
Turns out the Windows string data is encoded as windows-1252, and Rails and MySQL are both assuming UTF-8 input, so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with accents over them and things like that.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According to this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you're on Ruby 1.9...
string_in_windows_1252 = database.get(...)
# => "Fåbulous"
string_in_windows_1252.encoding
# => "windows-1252"
string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fåbulous"
string_in_utf_8.encoding
# => 'UTF-8'
Hi,
I had the exact same problem.
These tips helped me get going:
Always check for the proper encoding name in order to feed your conversion tools correctly.
In doubt you can get a list of supported encodings for iconv or recode using:
$ recode -l
or
$ iconv -l
Always start from your original file and encode a sample to work with:
$ recode windows-1252..u8 < original.txt > sample_utf8.txt
or
$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.
File.open has a new 'mode' parameter in Ruby 1.9. Use it!
This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
File.open('original.txt', 'r:windows-1252:utf-8')
# This opens a file specifying all encoding options: r:windows-1252 means read it as windows-1252, and :utf-8 means treat it as UTF-8 internally.
Have fun and swear a lot!
If you want to convert a file named win1252_file on a Unix OS, run:
$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
You should probably be able to do the same on Windows with cygwin.
If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try
File.open('/tmp/w1252', 'w') do |file|
  my_windows_1252_string.each_byte do |byte|
    file.putc byte  # putc writes the raw byte; << would write the integer's decimal string
  end
end
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`
my_utf_8_string = File.read('/tmp/utf8')
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end