iconv: cannot convert some strings from gb2312 to UTF-8

Good day! I have a problem converting this string from gb2312: "е – с"
My actions:
[i.remen#win74 ~]$ iconv -f gb2312 -t utf-8 tst.txt
е iconv: illegal input sequence at position 3
[i.remen#win74 ~]$
I tried many different versions (both the standalone iconv and the one that is part of glibc). Is there any way to do this conversion?

Maybe some characters are not in gb2312. Try gb18030; it's a "bigger" charset than gb2312.
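For example, re-running the failing conversion with gb18030 as the source encoding (assuming the same tst.txt file as above) would look like this:
iconv -f gb18030 -t utf-8 tst.txt > tst.utf8.txt
Since gb18030 is a superset of gb2312, anything that decodes as gb2312 will still decode, and whatever character is failing at position 3 may well be covered too.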

Related

Why can't I convert my text from UTF-8 to ASCII using iconv?

I'm actually having a hard time trying to convert UTF-8 text to ASCII (I'm a beginner).
My input is a text file named text.txt with these words in UTF-8:
Être
Éloignée
Éloigné
Église
So I used iconv to convert them to ASCII:
iconv -f UTF8 -t ASCII texte.txt > text_ascii.txt
When using this command I have this error
iconv: illegal input sequence at position 0
So I checked the encoding of the file using
file -i text.txt
text.txt: text/plain; charset=utf-8
So here I am sure that the encoding is UTF-8, but it's still not working.
Does anyone know the solution to this, please?
Thanks
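The error here is expected: characters such as Ê and É have no ASCII code points, so a strict conversion has nothing to map them to and stops at the first one. A possible workaround, assuming a GNU iconv that supports the //TRANSLIT suffix (and the -c option discussed further down this page), is to ask for approximate replacements instead of failing:
iconv -f UTF-8 -t ASCII//TRANSLIT text.txt > text_ascii.txt
With //TRANSLIT, Être should come out as something like Etre (the exact result depends on the iconv implementation and the current locale); with -c instead of //TRANSLIT, the accented characters would simply be dropped.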

Convert UTF-8 characters to NCR with iconv

I'm trying to convert UTF-8 characters from a file to NCR Hexadecimal. I tried the following:
iconv -f UTF-8 -t //TRANSLIT file --unicode-subst='&#x%04X;'
However, it doesn't do anything, and I can't even find the appropriate encoding name for NCR in iconv --list.
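As a point of reference, --unicode-subst appears to be an option of GNU libiconv's iconv front-end rather than glibc's (which matches the note further down this page that Linux iconv lacks some of these options), and NCR is not an encoding name you will find in iconv -l. A sketch of the usual approach, assuming GNU libiconv is the iconv being run: convert to a narrow target such as ASCII and let --unicode-subst fill in every character the target cannot represent, reusing the format string from the command above:
iconv -f UTF-8 -t ASCII --unicode-subst='&#x%04X;' file
Any character outside ASCII should then be emitted as a hexadecimal numeric character reference rather than causing an error.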

How to convert a UTF-8 file to CP1252 on Unix

I'm trying to convert a text file from UTF-8 to ANSI (CP1252).
I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead.
I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion.
For example, the POSIX iconv command doesn't seem to offer this choice; in fact, UTF8 is listed only as a "to" encoding (-t), never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF8 is always only in the second column.
How can I convert my file to CP1252 by UNIX?
If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.
iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:
Convert anyway, and have the converter omit the problematic characters
Convert anyway, and have the converter replace the problematic characters
This should be a conscious choice, so out of the box, iconv doesn't allow you to do this; but there are options to enable this behavior. Look at the -c option for the first behavior, and --unicode-subst for the second.
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y
This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.
#!/usr/bin/env python3
import sys

# Decode each input line as UTF-8 and re-encode it as CP1252;
# 'replace' turns characters CP1252 cannot represent into '?'.
for line in sys.stdin.buffer:
    sys.stdout.buffer.write(line.decode('utf-8').encode('cp1252', 'replace'))
Put 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ? like in the iconv example above.
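Assuming the script above is saved as, say, utf8_to_cp1252.py (the name is only for illustration), it can be used as a filter in the same way as iconv:
python3 utf8_to_cp1252.py < file.utf8 > file.txt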
Have a look at this Java converter: native2ascii
It is part of JDK installation.
The conversion is done in two steps:
native2ascii -encoding UTF-8 <your_file.txt> <your_file.txt.ascii>
native2ascii -reverse -encoding windows-1252 <your_file.txt.ascii> <your_file_new.txt>
Characters which are used in UTF-8 but not supported in CP1252 (including the BOM) are replaced by ?.

Replacing special characters

I have a document which contains various special characters such as é ÿ ° Æ oºi
I've written the following two commands which both work on 'single looking' characters such as à ± È.
However, neither of them works with the special characters listed above.
This command works using two-byte hex values (to replace é with A):
sed -i 's/\xc3\xA9/A/g' test.csv
This command uses utf8 to replace characters:
CHARS=$(python -c 'print u"\u00a9".encode("utf8")') sed -i 's/['"$CHARS"']/A/g' $filename
Either of these commands should work, but neither does.
It looks like you are viewing UTF-8 data as ISO-8859-1 (aka Latin-1).
This is what you'd experience when handling a UTF-8 encoded file in an ISO-8859-1 terminal:
$ cat file
The cafÃ© has crÃ¨me brÃ»lÃ©e.
$ iconv -f utf-8 -t iso-8859-1 < file
The café has crème brûlée.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
This usually only happens for PuTTY users, because PuTTY is one of the few terminal emulators that still uses ISO-8859-1 by default. You can set it to use UTF-8 in the PuTTY configuration.
Here's the same example in a UTF-8 terminal:
$ cat file
The café has crème brûlée.
$ iconv -f utf-8 -t iso-8859-1 < file
The caf� has cr�me br�l�e.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
The only correct solution is to fix your setup so that it uses UTF-8 throughout. ISO-8859-1 does not support the languages and features we take for granted today, and is not a useful option.
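As a quick sanity check (assuming a reasonably standard Linux/Unix environment), you can see which encoding your session is actually using with:
locale
and look for a UTF-8 value in LANG and LC_CTYPE, e.g. en_US.UTF-8.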

Remove invalid non-ASCII characters in Bash

In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?
I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.
I can use sed, awk or similar utilities if needed.
The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to use UTF-8 in writing to standard output, which you can do by using the -CO flag. So:
perl -CIO -pe 's/[^[:print:]]//g'
If you want a simpler alternative to Perl, try iconv as follows:
iconv -c <<<$'Mot\xf6rhead' # -> 'Motrhead'
Both the input and output encodings default to UTF-8, but can be specified explicitly: the input encoding with -f (e.g., -f UTF8); the output encoding with -t (e.g., -t UTF8) - run iconv -l to see all supported encodings.
-c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where ö is represented by the two bytes \xc3\xb6).
Note (after discovering a comment by the OP): If your output still contains garbled characters:
"� (question mark) or ߻ (box with hex numbers in it)"
the implication is indeed that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.
