Unaccent string in bash script (RHEL)

On Debian-based distributions, there is a utility called unaccent which can be used to remove accents from accented letters in a text.
I was looking for a package that provides this on Red Hat distros, but the only one I found was unac, which is available only for Mandriva.
I tried to use iconv, but it did not seem to support my case.
What is the best lightweight approach that is easily usable in a bash script?
Are there any secret options to iconv that allow this?

You can use the -c option of iconv to silently discard any characters that cannot be represented in the target encoding:
$ echo 'été' | iconv -c -f utf8 -t ascii
t
If you just want to remove the accents, use transliteration:
$ echo 'été' | iconv -f utf8 -t ascii//TRANSLIT
ete
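In a script, you might wrap this in a small helper; a minimal sketch (the unaccent function name is just illustrative, and glibc's //TRANSLIT output can vary with the active locale, as discussed below):
#!/bin/bash
# Strip accents by transliterating UTF-8 input to plain ASCII.
# Note: the //TRANSLIT result depends on the active locale.
unaccent() {
  printf '%s\n' "$1" | iconv -f utf-8 -t ascii//TRANSLIT
}

unaccent 'crème brûlée'   # prints: creme brulee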

Related

How to convert UTF-8 into binary

I have recently been studying binary code, and I want to know: how do I convert text that has been encoded in UTF-8 into binary?
I recommend using the command-line tool iconv.
For example:
$ iconv [options] -f from-encoding -t to-encoding inputfile(s) -o outputfile
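A concrete invocation might look like this (filenames are illustrative):
$ iconv -f iso-8859-1 -t utf-8 input.txt -o output.txt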
Here is an online tutorial that might be of help:
https://www.tecmint.com/convert-files-to-utf-8-encoding-in-linux/
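If by "binary" you mean the actual bit patterns of the UTF-8 bytes, you can inspect them with xxd, which usually ships with vim; for example, for a single character:
$ printf 'é' | xxd -b
00000000: 11000011 10101001                                      ..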

How to replace non-ASCII characters with their respective characters in Ruby 1.8.7

I used Iconv to replace the characters:
<%= Iconv.iconv("ascii//translit", "utf-8", "ENDÜœSTRIYEL").to_s %>
It displays:
END?oeSTRIYEL
whereas in irb it shows this:
irb(main):006:0> Iconv.iconv('ascii//translit', 'utf-8', 'ENDÜœSTRIYEL').to_s
=> "ENDUoeSTRIYEL"
How do I get the full transliteration of the non-ASCII characters, as in irb?
Thanks.
The iconv facility of glibc performs transliteration that depends on the locale:
$ echo "ENDÜœSTRIYEL" | LC_ALL=C iconv -f utf-8 -t ascii//translit
END?oeSTRIYEL
$ echo "ENDÜœSTRIYEL" | LC_ALL=de_DE.UTF-8 iconv -f utf-8 -t ascii//translit
ENDUEoeSTRIYEL
$ echo "ENDÜœSTRIYEL" | LC_ALL=ja_JP.UTF-8 iconv -f utf-8 -t ascii//translit
ENDUoeSTRIYEL
As you can see, three different results for three different locales.
If you are hosting a server that is meant to handle input from users in different countries, you have two options:
Use a single locale for all users, and hope it's good enough for all.
Switch the locale temporarily for each conversion (using uselocale, not setlocale). However, I don't know whether uselocale is available in Ruby.
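If shelling out from Ruby is an option, you can instead pin the locale per conversion at the shell level; a minimal sketch (the to_ascii name is illustrative, and it assumes the named locale is installed; see locale -a):
# Transliterate UTF-8 input to ASCII under an explicit locale,
# so the result does not depend on the caller's environment.
to_ascii() {
  LC_ALL="$1" iconv -f utf-8 -t ascii//translit
}

echo 'ENDÜœSTRIYEL' | to_ascii de_DE.UTF-8   # prints: ENDUEoeSTRIYEL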

How to convert a UTF-8 file to CP1252 on Unix

I'm trying to convert a text file's encoding from UTF-8 to ANSI (CP1252).
I need this because the file is used in a fixed-position Oracle import (external table) which apparently only supports CP1252. If I import a UTF-8 file, some special characters turn up as two incorrect characters instead.
I'm working on a Unix machine (my OS is HP-UX). I have been looking for an answer on the web, but I can't find any way to do this conversion.
For example, the POSIX iconv command doesn't offer this choice; in fact, UTF-8 is used only as a "to" encoding (-t), never as a "from" encoding (-f). iconv -l returns a long list of conversion pairs, but UTF-8 only ever appears in the second column.
How can I convert my file to CP1252 on Unix?
If your UTF-8 file only contains characters which are also representable as CP1252, you should be able to perform the conversion.
iconv -f utf-8 -t cp1252 <file.utf8 >file.txt
If, however, the UTF-8 text contains some characters which cannot be represented as CP1252, you have a couple of options:
Convert anyway, and have the converter omit the problematic characters
Convert anyway, and have the converter replace the problematic characters
This should be a conscious choice, so out of the box iconv doesn't do either; but there are options to enable each behavior. Look at the -c option for the first, and --unicode-subst for the second.
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252
x
iconv: (stdin):1:1: cannot convert
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 -c
xy
bash$ echo 'x≠y' | iconv -f utf-8 -t cp1252 --unicode-subst='?'
x?y
This is on OS X; apparently, Linux iconv lacks some of these options. Maybe look at recode and/or write your own simple conversion tool if you don't get the behavior you need out of iconv on your platform.
#!/usr/bin/env python2
import sys

# Re-encode stdin from UTF-8 to CP1252, line by line.
for line in sys.stdin:
    # 'replace' substitutes '?' for characters CP1252 cannot represent
    sys.stdout.write(line.decode('utf-8').encode('cp1252', 'replace'))
Put 'ignore' instead of 'replace' to drop characters which cannot be represented. The default replacement character is ? like in the iconv example above.
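Usage would look like this (the script name to_cp1252.py is illustrative):
$ python2 to_cp1252.py <file.utf8 >file.txt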
Have a look at this Java converter: native2ascii.
It is part of the JDK installation.
The conversion is done in two steps:
native2ascii -encoding UTF-8 your_file.txt your_file.txt.ascii
native2ascii -reverse -encoding windows-1252 your_file.txt.ascii your_file_new.txt
Characters which are used in UTF-8 but not supported in CP1252 (including the BOM) are replaced by ?.

Replacing special characters

I have a document which contains various special characters such as é ÿ ° Æ oºi
I've written the following two commands, which both work on 'single-looking' characters such as à ± È.
However, neither of them works with the special characters listed above.
This command works using two-byte hex escapes (to replace é with A):
sed -i 's/\xc3\xA9/A/g' test.csv
This command uses UTF-8 to replace characters:
CHARS=$(python -c 'print u"\u00a9".encode("utf8")') sed -i 's/['"$CHARS"']/A/g' $filename
Either of these commands should work, but neither does.
It looks like you are viewing UTF-8 data as ISO-8859-1 (aka latin1).
This is what you'd experience when handling a UTF-8 encoded file in a ISO-8859-1 terminal:
$ cat file
The cafÃ© has crÃ¨me brÃ»lÃ©e.
$ iconv -f utf-8 -t iso-8859-1 < file
The café has crème brûlée.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
This usually only happens for PuTTY users, because PuTTY is one of the few terminal emulators that still uses ISO-8859-1 by default. You can set it to use UTF-8 in the PuTTY configuration.
Here's the same example in a UTF-8 terminal:
$ cat file
The café has crème brûlée.
$ iconv -f utf-8 -t iso-8859-1 < file
The caf� has cr�me br�l�e.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
The only correct solution is to fix your setup so that it uses UTF-8 throughout. ISO-8859-1 does not support the languages and features we take for granted today, and is not a useful option.
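To verify what your environment is actually using, check the locale's character map (the values shown here are examples and will differ per system):
$ locale charmap
UTF-8
$ echo "$LANG"
en_US.UTF-8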

How can I use the Linux sed command to process a little-endian UTF-16 file

I am working on an application involving Windows RDP. I ran into a problem when I tried to use the sed command to replace the IP address string directly in the .rdp file: after executing this command, the original .rdp file is garbled.
sed -i "s/address:s:.*/address:s:$(cat check-free-ip.to.rdpzhitong.rdp)/" rdpzhitong.rdp
I found that the file's format is little-endian UTF-16 Unicode.
Can I still use the sed command to replace the text in the file correctly? Or is there another method to handle this problem?
If the file is UTF-16 encoded text (as RDP files are), and that is not your current encoding (it is unlikely to be on Linux), then you can pre- and post-process the file with iconv. For example:
iconv -f utf-16 -t us-ascii <rdpzhitong.rdp |
sed 's/original/modified/' |
iconv -f us-ascii -t utf-16 >rdpzhitong.rdp.modified
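You can confirm the file's encoding before and after with file (the exact wording varies between file versions):
$ file rdpzhitong.rdp
rdpzhitong.rdp: Little-endian UTF-16 Unicode text, with CRLF line terminators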
If you can cat the file, then you may use sed; no harm in trying before you ask the question.
If the check-free-ip.to.rdpzhitong.rdp file contains text, you may want to do this:
address=$(sed 1q check-free-ip.to.rdpzhitong.rdp)
sed -i "s/address:s:.*/address:s:$address/" rdpzhitong.rdp
Also, a little advice: try without the -i switch until you know it's working.
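Putting the two answers together, a minimal sketch of a safe edit cycle (using UTF-8 rather than US-ASCII as the intermediate encoding so any non-ASCII text in the file survives the round trip; the .tmp filename is illustrative):
address=$(sed 1q check-free-ip.to.rdpzhitong.rdp)
# note: -t utf-16 writes a BOM, which matches typical .rdp files
iconv -f utf-16 -t utf-8 <rdpzhitong.rdp |
sed "s/address:s:.*/address:s:$address/" |
iconv -f utf-8 -t utf-16 >rdpzhitong.rdp.tmp &&
mv rdpzhitong.rdp.tmp rdpzhitong.rdp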
