Unaccent string in bash script (RHEL)

On Debian-based distributions, there is a utility called unaccent which can be used to remove accents from accented letters in a text.
I was looking for a package that provides this on Red Hat distros, but the only one I found was unac, which is available only for Mandriva.
I tried to use iconv, but it did not seem to support my case.
What is the best lightweight approach that is easily usable in a bash script?
Are there any secret options to iconv that allow this?

You can use the -c option of iconv to silently discard any characters that cannot be represented in the target encoding:
$ echo 'été' | iconv -c -f utf8 -t ascii
t
If you just want to remove the accents, use transliteration:
$ echo 'été' | iconv -f utf8 -t ascii//TRANSLIT
ete
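In a script, you might wrap this in a small helper; a minimal sketch (the unaccent function name is just illustrative, and glibc's //TRANSLIT output can vary with the active locale, as discussed below):
#!/bin/bash
# Strip accents by transliterating UTF-8 input to plain ASCII.
# Note: the //TRANSLIT result depends on the active locale.
unaccent() {
  printf '%s\n' "$1" | iconv -f utf-8 -t ascii//TRANSLIT
}

unaccent 'crème brûlée'   # prints: creme brulee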

Related

How to convert UTF-8 into binary

I have recently been studying binary code, and I want to know: how do I convert text that has been encoded in UTF-8 into binary?
I recommend using the command-line tool iconv.
For example:
$ iconv [options] -f from-encoding -t to-encoding inputfile(s) -o outputfile
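A concrete invocation might look like this (filenames are illustrative):
$ iconv -f iso-8859-1 -t utf-8 input.txt -o output.txt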
Here is an online tutorial that might be of help:
https://www.tecmint.com/convert-files-to-utf-8-encoding-in-linux/
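If by "binary" you mean the actual bit patterns of the UTF-8 bytes, you can inspect them with xxd, which usually ships with vim; for example, for a single character:
$ printf 'é' | xxd -b
00000000: 11000011 10101001                                      ..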

How to replace non-ASCII characters with their respective characters in Ruby 1.8.7

I used Iconv to replace the characters:
<%= Iconv.iconv("ascii//translit", "utf-8", "ENDÜœSTRIYEL").to_s %>
It displays:
END?oeSTRIYEL
whereas in irb it shows this:
irb(main):006:0> Iconv.iconv('ascii//translit', 'utf-8', 'ENDÜœSTRIYEL').to_s
=> "ENDUoeSTRIYEL"
How do I get the full transliteration of the non-ASCII characters, as in irb?
Thanks.
The iconv facility of glibc performs transliteration that depends on the locale:
$ echo "ENDÜœSTRIYEL" | LC_ALL=C iconv -f utf-8 -t ascii//translit
END?oeSTRIYEL
$ echo "ENDÜœSTRIYEL" | LC_ALL=de_DE.UTF-8 iconv -f utf-8 -t ascii//translit
ENDUEoeSTRIYEL
$ echo "ENDÜœSTRIYEL" | LC_ALL=ja_JP.UTF-8 iconv -f utf-8 -t ascii//translit
ENDUoeSTRIYEL
As you can see, three different results for three different locales.
If you are hosting a server that is meant to handle input from users in different countries, you have two options:
Use a single locale for all users, and hope it's good enough for all.
Switch the locale temporarily for each conversion (using uselocale, not setlocale). However, I don't know whether uselocale is available in Ruby.
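If shelling out from Ruby is an option, you can instead pin the locale per conversion at the shell level; a minimal sketch (the to_ascii name is illustrative, and it assumes the named locale is installed; see locale -a):
# Transliterate UTF-8 input to ASCII under an explicit locale,
# so the result does not depend on the caller's environment.
to_ascii() {
  LC_ALL="$1" iconv -f utf-8 -t ascii//translit
}

echo 'ENDÜœSTRIYEL' | to_ascii de_DE.UTF-8   # prints: ENDUEoeSTRIYEL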

How to convert a UTF-8 file to CP1252 on Unix

I'm trying to convert a text file's encoding from UTF-8 to ANSI (CP1252).
I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead.
I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion.
For example, the POSIX iconv command doesn't offer this choice; in fact, UTF-8 is used only as a "to" encoding (-t), never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF-8 only ever appears in the second column.
How can I convert my file to CP1252 on Unix?
If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.
iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:
Convert anyway, and have the converter omit the problematic characters
Convert anyway, and have the converter replace the problematic characters
This should be a conscious choice, so out of the box iconv doesn't do either; but there are options to enable each behavior. Look at the -c option for the first, and --unicode-subst for the second.
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y
This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.
#!/usr/bin/env python2
import sys

# Re-encode stdin from UTF-8 to CP1252, line by line.
for line in sys.stdin:
    # 'replace' substitutes '?' for characters CP1252 cannot represent
    sys.stdout.write(line.decode('utf-8').encode('cp1252', 'replace'))
Put 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ? like in the iconv example above.
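Usage would look like this (the script name to_cp1252.py is illustrative):
$ python2 to_cp1252.py <file.utf8 >file.txt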
Have a look at this Java converter: native2ascii.
It is part of the JDK installation.
The conversion is done in two steps:
native2ascii -encoding UTF-8 your_file.txt your_file.txt.ascii
native2ascii -reverse -encoding windows-1252 your_file.txt.ascii your_file_new.txt
Characters which are used in UTF-8 but not supported in CP1252 (including the BOM) are replaced by ?.

Replacing special characters

I have a document which contains various special characters such as é ÿ ° Æ oºi
I've written the following two commands, which both work on 'single-looking' characters such as à ± È.
However, neither of them works with the special characters listed above.
This command works using two-byte hex escapes (to replace é with A):
sed -i 's/\xc3\xA9/A/g' test.csv
This command uses UTF-8 to replace characters:
CHARS=$(python -c 'print u"\u00a9".encode("utf8")') sed -i 's/['"$CHARS"']/A/g' $filename
Either of these commands should work, but neither does.
It looks like you are viewing UTF-8 data as ISO-8859-1 (aka latin1).
This is what you'd experience when handling a UTF-8 encoded file in a ISO-8859-1 terminal:
$ cat file
The cafÃ© has crÃ¨me brÃ»lÃ©e.
$ iconv -f utf-8 -t iso-8859-1 < file
The café has crème brûlée.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
This usually only happens for PuTTY users, because PuTTY is one of the few terminal emulators that still uses ISO-8859-1 by default. You can set it to use UTF-8 in the PuTTY configuration.
Here's the same example in a UTF-8 terminal:
$ cat file
The café has crème brûlée.
$ iconv -f utf-8 -t iso-8859-1 < file
The caf� has cr�me br�l�e.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
The only correct solution is to fix your setup so that it uses UTF-8 throughout. ISO-8859-1 does not support the languages and features we take for granted today, and is not a useful option.
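To verify what your environment is actually using, check the locale's character map (the values shown here are examples and will differ per system):
$ locale charmap
UTF-8
$ echo "$LANG"
en_US.UTF-8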

How can I use the Linux sed command to process a little-endian UTF-16 file

I am working on an application involving Windows RDP. I ran into a problem when I tried to use the sed command to replace the IP address string directly in the .rdp file: after executing this command, the original .rdp file is garbled.
sed -i "s/address:s:.*/address:s:$(cat check-free-ip.to.rdpzhitong.rdp)/" rdpzhitong.rdp
I found that the file's format is little-endian UTF-16 Unicode.
Can I still use the sed command to replace the text in the file correctly? Or is there another method to handle this problem?
If the file is UTF-16 encoded text (as RDP files are), and that is not your current encoding (it is unlikely to be on Linux), then you can pre- and post-process the file with iconv. For example:
iconv -f utf-16 -t us-ascii <rdpzhitong.rdp |
sed 's/original/modified/' |
iconv -f us-ascii -t utf-16 >rdpzhitong.rdp.modified
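You can confirm the file's encoding before and after with file (the exact wording varies between file versions):
$ file rdpzhitong.rdp
rdpzhitong.rdp: Little-endian UTF-16 Unicode text, with CRLF line terminators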
If you can cat the file, then you may use sed; no harm in trying before you ask the question.
If the check-free-ip.to.rdpzhitong.rdp file contains text, you may want to do this:
address=$(sed 1q check-free-ip.to.rdpzhitong.rdp)
sed -i "s/address:s:.*/address:s:$address/" rdpzhitong.rdp
Also, a little advice: try without the -i switch until you know it's working.
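Putting the two answers together, a minimal sketch of a safe edit cycle (using UTF-8 rather than US-ASCII as the intermediate encoding so any non-ASCII text in the file survives the round trip; the .tmp filename is illustrative):
address=$(sed 1q check-free-ip.to.rdpzhitong.rdp)
# note: -t utf-16 writes a BOM, which matches typical .rdp files
iconv -f utf-16 -t utf-8 <rdpzhitong.rdp |
sed "s/address:s:.*/address:s:$address/" |
iconv -f utf-8 -t utf-16 >rdpzhitong.rdp.tmp &&
mv rdpzhitong.rdp.tmp rdpzhitong.rdp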
