Character ก changed to Ď during xls2csv -d utf-8 /source.xls > destination.csv

When I convert an XLS file to CSV using xls2csv, the Thai character 'ก' is converted to 'Ď'.
My command is:
xls2csv -d utf-8 /test.xls > /test.csv
Other Thai characters, and characters from other languages, convert normally.
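If xls2csv's built-in character tables are at fault, one possible workaround (an assumption, not a confirmed fix) is to convert with a tool that goes through Unicode internally, for example Gnumeric's ssconvert or LibreOffice in headless mode:

# Sketch of two alternative conversions; either tool must be installed,
# and neither is guaranteed to match xls2csv's exact CSV formatting.
ssconvert /test.xls /test.csv
libreoffice --headless --convert-to csv --outdir /tmp /test.xls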

Related

Decoding files containing Hebrew characters and a German eszett using iconv

I have a file that I'm pretty sure is in a weird encoding. I've previously converted similar files to UTF-8 by assuming they were encoded in windows-1255 and running iconv (iconv -f windows-1255 -t utf-8 $file), and this worked.
My current file contains a ß character that is throwing me off - iconv breaks when it hits this (with an "illegal input sequence" error). Is there a different kind of encoding I should be using?
WINDOWS-1255 (= Hebrew) does not have an eszett (ß), so iconv behaves correctly. Other legacy codepages that do have that character at code point 00DF:
WINDOWS-1250 = Latin 2 / Central European
WINDOWS-1252 = Latin 1 / Western European
WINDOWS-1254 = Turkish
WINDOWS-1257 = Baltic
WINDOWS-1258 = Vietnamese
Only the document owner knows which codepage is the correct one, if it is one of the WINDOWS-125x codepages at all.
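One way to narrow it down is to try each candidate and keep only the codepages that decode the whole file without errors (a minimal sketch, assuming the file name is in $file as in the question):

# Try every codepage that has ß at 0xDF; the ones that exit cleanly are candidates.
for cp in windows-1250 windows-1252 windows-1254 windows-1257 windows-1258; do
  iconv -f "$cp" -t utf-8 "$file" > /dev/null 2>&1 && echo "$cp decodes without errors"
done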

shell: how to fix garbled Spanish text in a txt file so it displays normally

I am reading a txt file with some Spanish characters into PostgreSQL, but I get the error message "invalid byte sequence for encoding "UTF8": 0xdc 0x45".
I used the following command to detect the file's encoding:
file -bi CAT_CELDAS_20190626.txt
The result is:
text/plain; charset=iso-8859-1
Then I used iconv to convert the encoding from iso-8859-1 to utf-8:
iconv -f iso-8859-1 -t utf-8 CAT_CELDAS_20190626.txt -o CAT_CELDAS_20190626_new.txt
After conversion, I checked the encoding of the new file and it is utf-8, but the garbled text is still there:
503|706010004403418|3418|3418|13.959919|-89.1149|275|1900|GSM|3418|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3418|470||
503|706010004403417|3417|3417|13.959919|-89.1149|30|1900|GSM|3417|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3417|470||
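iconv only changes how valid characters are represented as bytes; it cannot repair text that was already mis-converted before it reached this file, which is what "Asunci髇" (presumably "Asunción") suggests. A quick diagnostic sketch, assuming grep and hexdump are available (the search word is simply taken from the sample rows):

# Dump the raw bytes of one affected line to see what the source file actually contains.
grep -m1 'Hacienda' CAT_CELDAS_20190626.txt | hexdump -C | head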

Converting from ANSI to UTF-8 using a script

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post.
But the iconv command doesn't seem to be working.
(Screenshots of the input file in Notepad++, the first CSV file, and the second CSV file omitted.)
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?
You can use the Python chardet character-encoding detector to determine the file's existing encoding.
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether any junk characters exist in the file; they can cause errors during conversion.
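For example (a sketch, assuming the chardet package is installed, which provides a chardetect command, and reusing the variable names from the question):

chardetect "$csvname"                                          # prints the detected encoding and a confidence value
iconv -f WINDOWS-1252 -t UTF-8 -c "$csvname" > "$newcsvname"   # -c silently drops bytes iconv cannot convert

Dropping bytes with -c loses data, so it is only a way to confirm that a few stray bytes are the problem, not a fix.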

Unexpected encoding error using JSON.parse

I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
  :external_encoding => 'iso-8859-1',
  :internal_encoding => 'utf-8'
)
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
  File.read('data.json',
    :external_encoding => 'iso-8859-1'
  )
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
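For example, on the CentOS machine (a sketch, assuming GNU grep and a UTF-8 locale; hexdump is only used to look at the first bytes):

head -c 3 data.json | hexdump -C     # EF BB BF at the start would be a UTF-8 BOM
grep -naxv '.*' data.json | head     # prints lines that contain bytes that are not valid UTF-8

If the second command prints anything, those are the spots where Latin-1 (or otherwise non-UTF-8) bytes are mixed into the file.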

Mac 10.8 > Cannot encode text file with UTF8

I cannot get a text or PHP file to have the UTF-8 charset! I tried NetBeans, PhpStorm, the iconv CLI, etc., but the file charset is still us-ascii:
iconv -f us-ascii -t utf-8 toto.php > toto2.php
file -I toto2.php
toto2.php: text/plain; charset=us-ascii
What can I do?
What's the content of toto2.php?
If the file only contains ASCII-compatible characters (i.e. mostly Latin letters and the few "common" special characters), then there's no way for file (or any other tool) to distinguish an ASCII-encoded file from a UTF-8-encoded one, because they will be byte-for-byte identical.
And since iconv converted from "us-ascii" without complaint, that is exactly the case here!
In other words: converting from a (real) ASCII file to a UTF-8 file is a no-op!
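You can see this for yourself with two throwaway files (the file names are made up for the demonstration):

printf 'hello\n' > ascii.txt
file -I ascii.txt      # text/plain; charset=us-ascii
printf 'héllo\n' > utf8.txt
file -I utf8.txt       # text/plain; charset=utf-8

As soon as the file contains even one character outside ASCII, file reports it as UTF-8 (assuming your terminal writes the é as UTF-8, which is the default on Mac OS X 10.8).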
