Replacing special characters - shell

I have a document which contains various special characters such as é ÿ ° Æ oºi.
I've written the following two commands, which both work on 'single-looking' characters such as à ± È.
However, neither of them works with the special characters listed above.
This command works using two-byte hex escapes (to replace é with A):
sed -i 's/\xc3\xA9/A/g' test.csv
This command uses UTF-8 to replace characters:
CHARS=$(python -c 'print u"\u00a9".encode("utf8")') sed -i 's/['"$CHARS"']/A/g' $filename
Either of these commands should work, but neither does.
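As an aside, a quick way to see the exact UTF-8 byte sequence of a character (so the hex form of the sed command can target it) is to dump it with od; a minimal sketch:
$ printf 'é' | od -An -tx1
 c3 a9
$ printf 'ÿ' | od -An -tx1
 c3 bf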

It looks like you are viewing UTF-8 data as ISO-8859-1 (aka latin1).
This is what you'd experience when handling a UTF-8 encoded file in an ISO-8859-1 terminal:
$ cat file
The cafÃ© has crÃ¨me brÃ»lÃ©e.
$ iconv -f utf-8 -t iso-8859-1 < file
The café has crème brûlée.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
This usually only happens for PuTTY users, because PuTTY is one of the few terminal emulators that still uses ISO-8859-1 by default. You can set it to use UTF-8 in the PuTTY configuration.
Here's the same example in a UTF-8 terminal:
$ cat file
The café has crème brûlée.
$ iconv -f utf-8 -t iso-8859-1 < file
The caf� has cr�me br�l�e.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
The only correct solution is to fix your setup so that it uses UTF-8 throughout. ISO-8859-1 does not support the languages and features we take for granted today, and is not a useful option.
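To check whether the rest of the environment already uses UTF-8, inspecting the locale is usually enough; a minimal sketch (the locale name en_US.UTF-8 is only an example):
$ locale | grep -E 'LANG|LC_CTYPE'
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
# If a latin1 locale shows up instead, switch the session to a UTF-8 one:
$ export LANG=en_US.UTF-8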

Related

Character Encoding Issue on Ubuntu/Bash

I'd like to cat a Swedish txt file.
For special characters (like ä or é) I get back these characters: �.
eg.
�r han fr�n Apornas planet.
I have multiple files from multiple sources; some of them give back the correct results (e.g. Det här är fel!), while others produce the above-mentioned issue.
Based on that I'm pretty sure the issue is with the file's character encoding, but I simply can't figure out how to convert the file at the command line.
I've tried:
iconv -f UTF-8 -t UTF-16 file.txt
and similar variations.
But I end up with an error message every time.
Do you have any tips?
Thanks!
Based on the comments, the solution was:
First execute:
chardet file.txt
to find out the character encoding.
Then:
iconv -f iso-8859-1 -t utf-8 file.txt
to create the "translation".
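Note that iconv writes the result to standard output, so redirect it to a new file rather than expecting the original to be modified in place; a minimal sketch (the output filename is just an example):
chardet file.txt
iconv -f iso-8859-1 -t utf-8 file.txt > file.utf8.txt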

How to convert Utf8 file to CP1252 by Unix

I'm trying to convert a txt file's encoding from UTF-8 to ANSI (CP1252).
I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead.
I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web but I can't find any way to do this conversion.
For example, the POSIX iconv command doesn't offer this conversion: UTF8 is used only as a "to" encoding (-t) but never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF8 only ever appears in the second column.
How can I convert my file to CP1252 by UNIX?
If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.
iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:
Convert anyway, and have the converter omit the problematic characters
Convert anyway, and have the converter replace the problematic characters
This should be a conscious choice, so out of the box, iconv doesn't allow you to do this; but there are options to enable this behavior. Look at the -c option for the first behavior, and --unicode-subst for the second.
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y
This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.
#!/usr/bin/env python3
import sys
# Decode UTF-8 from stdin and re-encode as CP1252, replacing characters
# that CP1252 cannot represent with '?'.
for line in sys.stdin.buffer:
    sys.stdout.buffer.write(line.decode('utf-8').encode('cp1252', 'replace'))
Put 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ? like in the iconv example above.
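One possible way to use the script as a filter, assuming it is saved under the hypothetical name utf8_to_cp1252.py:
# Hypothetical filename; the script reads stdin and writes stdout.
python3 utf8_to_cp1252.py < file.utf8 > file.txt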
Have a look at this Java converter: native2ascii
It is part of JDK installation.
The conversion is done in two steps:
native2ascii -encoding UTF-8 <your_file.txt> <your_file.txt.ascii>
native2ascii -reverse -encoding windows-1252 <your_file.txt.ascii> <your_file_new.txt>
Characters which occur in the UTF-8 input but are not supported in CP1252 (including the BOM) are replaced by a question mark (?).

Remove invalid non-ASCII characters in Bash

In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?
I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.
I can use sed, awk or similar utilities if needed.
The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to use UTF-8 in writing to standard output, which you can do by using the -CO flag. So:
perl -CIO -pe 's/[^[:print:]]//g'
If you want a simpler alternative to Perl, try iconv as follows:
iconv -c <<<$'Mot\xf6rhead' # -> 'Motrhead'
Both the input and output encodings default to the current locale's encoding (typically UTF-8), but they can be specified explicitly: the input encoding with -f (e.g., -f UTF-8); the output encoding with -t (e.g., -t UTF-8) - run iconv -l to see all supported encodings.
-c simply discards input chars that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO-8859-1) representation of ö, which is invalid in UTF-8 (where it's represented as \xc3\xb6).
Note (after discovering a comment by the OP): If your output still contains garbled characters:
"� (question mark) or ߻ (box with hex numbers in it)"
the implication is indeed that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.

iconv in Mac OS X 10.7.3 does nothing

I am trying to convert a php file (client.php) from utf-8 to iso-8859-1 and the following command does nothing on the file:
iconv -f UTF-8 -t ISO-8859-1 client.php
Upon execution the original file contents are displayed.
In fact, when I check for the file's encoding after executing iconv with:
file -I client.php
The same old utf-8 is shown:
client.php: text/x-php; charset=utf-8
The iconv utility shall convert the encoding of characters in file from one codeset to another and write the results to standard output.
Here's a solution: write stdout to a temporary file and then rename the temporary file:
iconv -f UTF-8 -t ISO_8859-1 client.php > client_temp.php && mv -f client_temp.php client.php
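If the moreutils package is installed, its sponge command soaks up all of its input before writing the output file, which avoids naming an explicit temporary file; a sketch of the same conversion:
# sponge (from moreutils) buffers stdin completely before overwriting client.php.
iconv -f UTF-8 -t ISO_8859-1 client.php | sponge client.php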
ASCII, UTF-8 and ISO-8859 are 100% identical encodings for the lowest 128 characters. If your file only contains characters in that range (which is basically the set of characters you find on a common US English keyboard), there's no difference between these encodings.
My guess what's happening: A plain text file has no associated encoding meta data. You cannot know the encoding of a plain text file just by looking at it. What the file utility is doing is simply giving its best guess, and since there's no difference it prefers to tell you the file is UTF-8 encoded, which technically it may well be.
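A quick way to see this guessing in action, assuming a UTF-8 locale (the filenames are examples, and file's output may vary slightly between versions):
$ printf '<?php echo "hello";\n' > a.php
$ file -I a.php
a.php: text/x-php; charset=us-ascii
$ printf '<?php echo "café";\n' > b.php
$ file -I b.php
b.php: text/x-php; charset=utf-8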
In addition to jackjr300's answer: with the following one-liner you can do it for all PHP files in the current folder:
for filename in *.php; do iconv -f ISO_8859-1 -t UTF-8 "$filename" > "temp_$filename" && mv -f "./temp_$filename" "./$filename"; done

Unaccent string in bash script (RHEL)

On Debian-based distributions, there is a utility called unaccent which can be used to remove accents from accented letters in a text.
I was looking for a package containing this on Red Hat distros, but the only one I found was unac, which is available for Mandriva only.
I tried to use iconv, but it doesn't seem to support my case.
What is the best, lightweight approach that is easily usable in a bash script?
Are there any secret options to iconv that allow this?
You can use the -c option of iconv to discard the non-ASCII chars it cannot convert:
$ echo 'été' | iconv -c -f utf8 -t ascii
t
If you just want to remove the accent:
$ echo 'été' | iconv -f utf8 -t ascii//TRANSLIT
ete
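If //TRANSLIT isn't available (it's an extension, not required by POSIX, and its output can vary with the iconv implementation and locale), a hedged fallback is to spell out a transliteration table with sed's y command, using a sed that handles multibyte characters (e.g. GNU sed in a UTF-8 locale); the character list below is only illustrative:
# Fallback sketch: explicit accent-stripping table (extend as needed).
echo 'été à Noël' | sed 'y/àâäéèêëîïôöùûüç/aaaeeeeiioouuuc/'
# -> ete a Noel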
