This is my problem: a ".csv" file whose delimiter is a tab and whose content is "Little-endian UTF-16 Unicode text". If I open it with the LibreOffice GUI, the columns are separated correctly.
If I try it via the shell with
unoconv -f ods -e FilterOptions="9,34,UNICODE,1" [FILE]
the result is a file with no separation. What's wrong?
And what is the best shell command to convert the resulting ods file back to a well-formed csv (Unicode UTF-8, comma-separated, etc.)?
This is my definitive solution:
# 1. Re-encode the source file from UTF-16 to UTF-8
iconv -f UTF-16 -t UTF-8 /original/folder/file.csv > /tmp/file.csv
# 2. Import the tab-separated csv into ods; -i (not -e) sets the *import* filter
#    options: 9 = tab separator, 34 = double-quote text delimiter
unoconv -f ods -i FilterOptions="9,34,UNICODE,1" /tmp/file.csv
# 3. Export the ods back to csv over the original file
unoconv -f csv -o /original/folder/file.csv -i FilterOptions="9,34,UNICODE,1" /tmp/file.ods
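For the "UTF-8, comma-separated" output, the FilterOptions tokens are: field separator as an ASCII code, text delimiter as an ASCII code, character set, and first line to convert. A hedged sketch of the export step (44 = comma, 34 = double quote; the numeric charset ID 76 for UTF-8 and the use of -e for export options are assumptions to verify against the LibreOffice/unoconv documentation):
unoconv -f csv -e FilterOptions="44,34,76,1" /tmp/file.ods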
Related
I am trying to convert a .csv from UTF-16LE to UTF-8. The file is too large to be opened in Excel, and I am encountering the "incomplete character or shift sequence" error when using the following command:
iconv -f utf-16le -t -c utf-8 myfilename.csv > mynewfilename.csv
How do I get past this?
I'm using Bash on Mac OS Mojave.
Thanks!
Edit to add:
iconv -c -f utf-16le -t utf-8//IGNORE myfilename.csv > mynewfilename.csv
also didn't work, per suggestion below.
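(Not from the thread, but a diagnostic sketch: with UTF-16 this error usually means iconv hit a truncated 2-byte code unit, often at the very end of the file. Checking the BOM and the byte-count parity narrows it down; both tools ship with macOS.)
xxd -l 2 myfilename.csv    # ff fe = UTF-16LE BOM, as expected
wc -c myfilename.csv       # an odd byte count means a truncated final character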
I have a 500 MB+ file that was generated by saving a large Excel spreadsheet as Unicode text.
I am running windows 7.
I need to open the file with Python pandas. So far I would convert the file from ANSI to UTF-8 with Notepad++ and then open it, but the file is now too large for Notepad++.
I have Hebrew, French, Swedish, Norwegian, Danish special characters.
Pandas' read_excel is just too slow: I let it go for several minutes without seeing any output.
iconv: apparently I cannot get the encoding right; I just get a list of tab-separated nulls when I try:
iconv -f "CP858" -t "UTF-8" file1.txt > file2.txt
iconv -f "windows-1252" -t "UTF-8" file1.txt > file2.txt
Edit
iconv -f "UTF-16le" -t "UTF-8" file1.txt > file2.txt leads to a very weird behaviour: a row in between lines is cut. All looks fine but only 80K rows are actually converted.
Edit 2
read_csv with encoding='utf-16le' reads the file properly. However, I still don't get why iconv messes it up.
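(A minimal sketch of that working route, assuming a POSIX shell with pandas installed and that the file is the tab-separated UTF-16LE text that Excel's "Unicode Text" export produces; the file name is illustrative:)
python3 - <<'EOF'
import pandas as pd
# pandas decodes UTF-16LE itself, so no iconv pre-conversion is needed
df = pd.read_csv("file1.txt", encoding="utf-16le", sep="\t")
print(df.head())
EOF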
I have a document which contains various special characters such as é ÿ ° Æ oºi
I've written the following two commands which both work on 'single looking' characters such as à ± È.
However neither of which work with the special characters listed above.
This command works using two-byte hex values (to replace é with A):
sed -i 's/\xc3\xA9/A/g' test.csv
This command uses utf8 to replace characters:
CHARS=$(python -c 'print u"\u00a9".encode("utf8")') sed -i 's/['"$CHARS"']/A/g' $filename
Either of these commands should work, but neither does.
It looks like you are viewing UTF-8 data as ISO-8859-1 (aka latin1).
This is what you'd experience when handling a UTF-8 encoded file in an ISO-8859-1 terminal:
$ cat file
The cafÃ© has crÃ¨me brÃ»lÃ©e.
$ iconv -f utf-8 -t iso-8859-1 < file
The café has crème brûlée.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
This usually only happens for PuTTY users, because PuTTY is one of the few terminal emulators that still uses ISO-8859-1 by default. You can set it to use UTF-8 in the PuTTY configuration.
Here's the same example in a UTF-8 terminal:
$ cat file
The café has crème brûlée.
$ iconv -f utf-8 -t iso-8859-1 < file
The caf� has cr�me br�l�e.
$ iconv -c -f utf-8 -t ascii//ignore < file
The caf has crme brle.
The only correct solution is to fix your setup so that it uses UTF-8 throughout. ISO-8859-1 does not support the languages and features we take for granted today, and is not a useful option.
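(A quick sanity check, as a sketch: confirm what the terminal and locale actually use before blaming sed. The printf test shows which bytes your terminal sends for é.)
locale              # LANG/LC_CTYPE should end in .UTF-8
printf 'é' | xxd    # c3 a9 = the terminal speaks UTF-8; e9 = ISO-8859-1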
I have some problems when I want to replace non-ASCII characters in a file name.
When I try to copy the file to do some tests, it answers with "cannot open `FileName' for reading: No such file or directory".
And all of the non-ASCII characters in the file name are shown as "_".
Do you know how to get the real name, or how to replace it with a good shell script?
Thank you a lot.
To remove the non-ASCII characters from a file, you can use the following sed statement.
sed 's/[^\d32-\d126]//g' <file_name>
The instruction above prints the file to stdout with everything outside the printable ASCII range (decimal 32-126) removed; with the -i option, sed removes those characters from the file in place. (The \dNNN decimal escapes are a GNU sed extension.)
To replace the non-ASCII characters with a particular character, use:
sed 's/[^\d32-\d126]/<replacing_char>/g' <file_name>
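(A quick demonstration, assuming GNU sed; the file name and content are illustrative:)
printf 'na\xc3\xafve caf\xc3\xa9\n' > demo.txt    # "naïve café" in UTF-8
sed 's/[^\d32-\d126]//g' demo.txt                 # prints: nave caf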
If you know the encoding that was used on the MacOS or Windows machine creating the file, you can use convmv to change that encoding to your liking:
Re-encode a single file name from UTF-8 to ASCII:
$ convmv -f utf8 -t ascii --notest <FILE NAME>
Re-encode a whole directory recursively from ISO8859-1 to UTF-8 with the Linux-default NFC normalization (note that UTF-16 is not a valid encoding for Linux file names, since it contains NUL bytes):
$ convmv -f iso8859-1 -t utf8 --nfc -r --notest <DIRECTORY NAME>
For details see man convmv and man charsets.
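(Also worth knowing: convmv runs in no-act mode by default, so you can preview the renames first and re-run with --notest once the output looks right:)
$ convmv -f iso8859-1 -t utf8 -r <DIRECTORY NAME>            # preview only
$ convmv -f iso8859-1 -t utf8 -r --notest <DIRECTORY NAME>   # actually rename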
Addendum:
If you do not have convmv installed, you can get it on its project page on freecode.com.
I am trying to convert a php file (client.php) from utf-8 to iso-8859-1 and the following command does nothing on the file:
iconv -f UTF-8 -t ISO-8859-1 client.php
Upon execution the original file contents are displayed.
In fact, when I check for the file's encoding after executing iconv with:
file -I client.php
The same old utf-8 is shown:
client.php: text/x-php; charset=utf-8
The iconv utility shall convert the encoding of characters in file from one codeset to another and write the results to standard output.
Here's a solution:
Write stdout to a temporary file, then rename the temporary file over the original:
iconv -f UTF-8 -t ISO_8859-1 client.php > client_temp.php && mv -f client_temp.php client.php
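(If moreutils is available, sponge does the same temp-file dance in one step, since it soaks up all of its input before writing:)
iconv -f UTF-8 -t ISO_8859-1 client.php | sponge client.php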
ASCII, UTF-8 and ISO-8859 are 100% identical encodings for the lowest 128 characters. If your file only contains characters in that range (which is basically the set of characters on a common US English keyboard), there's no difference between these encodings.
My guess at what's happening: a plain text file has no associated encoding metadata, so you cannot know the encoding of a plain text file just by looking at it. The file utility simply gives its best guess, and since there's no difference it prefers to tell you the file is UTF-8 encoded, which technically it may well be.
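(To see it concretely, a sketch: an ASCII-only file is byte-for-byte unchanged by the conversion, so file has nothing to go on.)
printf 'hello world\n' > a.txt
iconv -f UTF-8 -t ISO-8859-1 a.txt > b.txt
cmp a.txt b.txt && echo "byte-for-byte identical"
file -I a.txt    # charset is only a best guess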
In addition to jackjr300's answer: with the following one-liner you can do it for all php files in the current folder (variables quoted so file names with spaces survive):
for filename in *.php; do iconv -f UTF-8 -t ISO_8859-1 "$filename" > "./temp_$filename" && mv -f "./temp_$filename" "./$filename"; done