Why can't I convert my text from UTF-8 to ASCII using iconv?

I'm actually having a hard time trying to convert UTF-8 text to ASCII (I'm a beginner).
My input is a text file named text.txt with these words in UTF-8:
Être
Éloignée
Éloigné
Église
So I used iconv to convert them to ASCII:
iconv -f UTF8 -t ASCII texte.txt > text_ascii.txt
When using this command I have this error
iconv: illegal input sequence at position 0
So I checked the encoding of the file using
file -i text.txt
text.txt: text/plain; charset=utf-8
So here I am sure that the encoding is UTF-8, but it's still not working.
Does anyone know the solution to this, please?
Thanks
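Since É and Ê simply have no ASCII equivalents, the usual fix, which a later answer on this page also suggests, is to ask iconv to transliterate them. A minimal sketch, assuming a glibc iconv that supports the //TRANSLIT suffix:
# Replace accented characters with their closest ASCII approximation (É -> E, ê -> e, ...)
iconv -f UTF-8 -t ASCII//TRANSLIT text.txt > text_ascii.txt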

Related

Convert UTF-8 characters to NCR with iconv

I'm trying to convert UTF-8 characters from a file to NCR Hexadecimal. I tried the following:
iconv -f UTF-8 -t //TRANSLIT file --unicode-subst='&#x%04X;'
However, it doesn't do anything, and I can't even find the appropriate encoding name for NCR in iconv --list.
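For reference, the --unicode-subst option belongs to the standalone GNU libiconv program, not to glibc's iconv, and it is typically combined with an explicit target encoding such as ASCII. A hedged sketch, assuming GNU libiconv is installed and reusing the question's format string (file_ncr is a hypothetical output name):
# Convert to ASCII; every character outside ASCII is emitted as a hexadecimal NCR such as &#x00C9;
iconv -f UTF-8 -t ASCII --unicode-subst='&#x%04X;' file > file_ncr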

How to read UTF-8 encoding with fscanf

Octave 4.2.2 reads ISO-8859-1 chars with the fscanf command (formatted to read white spaces):
foo = fscanf(foofile1, "%*s %[^\n]");
while fgetl reads native UTF-8:
foo = fgetl(foofile2);
Both files are shown to be encoded as UTF-8:
$ file -i foofile1.csv
foofile1.csv: text/plain; charset=utf-8
$ file -i foofile2.html
foofile2.html: text/html; charset=utf-8
Is there any way to read the HTML file in UTF-8 format with fscanf?
Update: As pointed out by @TS, this has been reported as a bug on savannah.gnu.org.
No change in the code is needed. UTF-8 is designed to work with most non-UTF-8-aware single-byte string functions like the ones above, as long as you don't have to work with decoded code points, for example to print the string to the screen.

bash: fixing noisy French characters in data

I have this sample data in my csv:
VallÌÎÌãÌ´å©e,100
JoffÌÎÌãÌ´å©,240
I think this is because the CSV doesn't support UTF-8. How would I fix that using bash? I think it's a French name.
What I have tried so far is using sed to change all the French characters to plain letters:
sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' data.csv
but it doesn't work, so I'm not sure how to fix it.
Could you please try the following and let me know if it helps.
iconv -f utf8 -t ascii//TRANSLIT Input_file
Here is what the man page of iconv says:
DESCRIPTION
The iconv program reads in text in one encoding and outputs the text in another encoding. If no input files are given, or if it is given as a dash (-), iconv reads from standard input. If no output file is given, iconv writes to standard output.
If no from-encoding is given, the default is derived from the current locale's character encoding. If no to-encoding is given, the default is derived from the current locale's character encoding.
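Applied to the sample file from the question (data.csv) and writing the result back through a temporary file, the same pattern another answer on this page uses, this might look like the following sketch; data_ascii.csv is a hypothetical temporary name:
# Transliterate anything without an ASCII equivalent, then replace the original file
iconv -f utf8 -t ascii//TRANSLIT data.csv > data_ascii.csv && mv -f data_ascii.csv data.csv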

iconv: cannot convert some strings from gb2312 to UTF-8

Good day! I have a problem converting this string in gb2312: "е – с"
My actions:
[i.remen@win74 ~]$ iconv -f gb2312 -t utf-8 tst.txt
е iconv: illegal input sequence at position 3
[i.remen@win74 ~]$
I tried many different versions (both the standalone iconv and the one that is part of glibc). Is there any way to do this conversion?
Maybe some characters are not in GB2312. Try GB18030; it's a 'bigger' charset than GB2312.
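In other words, retry the conversion with the larger source charset. A minimal sketch using the file name from the question:
# GB18030 is a superset of GB2312, so characters missing from GB2312 may still convert
iconv -f gb18030 -t utf-8 tst.txt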

iconv in Mac OS X 10.7.3 does nothing

I am trying to convert a PHP file (client.php) from UTF-8 to ISO-8859-1, and the following command does nothing to the file:
iconv -f UTF-8 -t ISO-8859-1 client.php
Upon execution the original file contents are displayed.
In fact, when I check for the file's encoding after executing iconv with:
file -I client.php
The same old utf-8 is shown:
client.php: text/x-php; charset=utf-8
The iconv utility shall convert the encoding of characters in file from one codeset to another and write the results to standard output.
Here's a solution: write stdout to a temporary file, then rename the temporary file over the original:
iconv -f UTF-8 -t ISO_8859-1 client.php > client_temp.php && mv -f client_temp.php client.php
ASCII, UTF-8 and ISO-8859 are 100% identical encodings for the lowest 128 characters. If your file only contains characters in that range (which is basically the set of characters you find on a common US English keyboard), there's no difference between these encodings.
My guess at what's happening: a plain text file has no associated encoding metadata. You cannot know the encoding of a plain text file just by looking at it. The file utility simply gives its best guess, and since there's no difference, it prefers to tell you the file is UTF-8 encoded, which technically it may well be.
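One way to check this with the tools already used in this thread: if the file converts to plain ASCII without an error, it contains only characters from that shared 128-character range, so its UTF-8 and ISO-8859-1 byte representations are identical. A hedged sketch:
# If this succeeds silently, the file is pure ASCII and no conversion is needed
iconv -f UTF-8 -t ASCII client.php > /dev/null && echo 'client.php contains only ASCII'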
In addition to jackjr300's answer: with the following one-liner you can do it for all PHP files in the current folder:
for filename in *.php; do iconv -f ISO_8859-1 -t UTF-8 "$filename" > "temp_$filename" && mv -f "./temp_$filename" "./$filename"; done
