iconv does not change encoding, no errors - bash

I have a "gzip" byte in my plain text file (I can read it, there is not weird symbol, I cant cat or nano) that causes an error with a tool in python:
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte in line 49
My file is actually encoded in us-ascii, and I want to try to get rid of the error with an encoding conversion, from us-ascii to utf-8.
I used the command:
iconv -f US-ASCII -t UTF-8 id_to_genome_file.tsv -o id_to_genome_file2.tsv
It ran without error, but checking the file shows it is still us-ascii:
file -i id_to_genome_file2.tsv
id_to_genome_file2.tsv: text/plain; charset=us-ascii
Why isn't the conversion taking effect? And would the conversion really solve the issue?
EDIT: the error wasn't in the file but somewhere else. The errors shown by the tool and its documentation are poorly written. Solved.
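For anyone hitting the same message: a 0x8b byte in position 1 is characteristic of the gzip magic number (1f 8b), so it is worth checking whether the input the tool actually reads is gzip-compressed rather than plain text. A quick diagnostic sketch, assuming GNU grep and od are available:
head -c 2 id_to_genome_file.tsv | od -A n -t x1          # 1f 8b here would mean gzip
LC_ALL=C grep -nP '[^\x00-\x7F]' id_to_genome_file.tsv   # list lines containing non-ASCII bytes
If the second command prints nothing, the file really is pure ASCII and no iconv invocation will change anything.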

Related

Ruby invalid multibyte char error (Sep 2019)

My script fails on this bad encoding. Even though I brought all the files to UTF-8, some still won't convert or simply have wrong characters inside.
It actually fails at the variable-assignment step.
Can I set up some kind of error handling for this case, like below, so my loop will continue? That ¿ causes the whole problem.
I need to run this script all the way through without errors. I have already tried encoding, force_encoding, and the shebang line. Does Ruby have any kind of error-handling routine so I can handle the bad case and continue with the rest of the script? How do I get rid of this error: invalid multibyte char (UTF-8)?
line = '¿USE [Alpha]'
lineOK = ' USE [Alpha] OK line'
>ruby ReadFile_Test.rb
ReadFile_Test.rb:15: invalid multibyte char (UTF-8)
I could reproduce your issue by saving the file with ISO-8859-1 encoding.
Running your code with the file in this non-UTF-8 encoding, the error popped up. My solution was to save the file as UTF-8.
I am using Sublime as text editor and there is the option 'file > save with encoding'. I have chosen 'UTF-8' and was able to run the script.
puts line.encoding then showed me UTF-8, and there was no error anymore.
I suggest re-checking the encoding of your saved script file.
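If file and iconv are available (on Linux or macOS, or via WSL/Git Bash on Windows), the same check and fix can be done from the command line; a sketch using the script name from the question:
file -i ReadFile_Test.rb                                   # shows the charset the script is saved in
iconv -f iso-8859-1 -t utf-8 ReadFile_Test.rb > ReadFile_Test_utf8.rb
Adjust the -f argument to whatever encoding file reports.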

shell: how to make garbled Spanish text in a txt file display normally

I am reading a txt file with some Spanish characters into PostgreSQL, but got the error message "invalid byte sequence for encoding "UTF8": 0xDC, 0x45".
I used the following command to get the encoding of the file:
file -bi CAT_CELDAS_20190626.txt
The result is:
text/plain; charset=iso-8859-1
Then I used iconv to convert the encoding from iso-8859-1 to utf-8:
iconv -f iso-8859-1 -t utf-8 CAT_CELDAS_20190626.txt -o CAT_CELDAS_20190626_new.txt
After the conversion I checked the encoding of the new file: it is utf-8, but the garbled text is still there:
503|706010004403418|3418|3418|13.959919|-89.1149|275|1900|GSM|3418|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3418|470||
503|706010004403417|3417|3417|13.959919|-89.1149|30|1900|GSM|3417|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3417|470||
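One thing worth keeping in mind: iconv re-encodes characters from one byte representation to another, but it cannot repair characters that were corrupted before this file was written (Asunci髇 presumably started out as Asunción and was mangled by an earlier conversion). To inspect the actual bytes behind a garbled span, a diagnostic sketch like this may help, assuming grep and od are available:
grep -m 1 'Hacienda' CAT_CELDAS_20190626.txt | od -c | head
If the garbled characters already occupy multi-byte sequences in the source file, the damage predates this conversion and no iconv call will undo it.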

Converting from ANSI to UTF-8 using script

I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post, but the iconv command doesn't seem to be working.
(Snapshots of the input file contents in Notepad++ and of the first and second CSV files were attached to the original post.)
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?
You can use the Python chardet character-encoding detector to determine the file's existing encoding.
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether any junk characters exist in the file; they may cause errors during conversion.
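The chardet package also ships a chardetect command-line tool, so the whole workflow can be scripted; a sketch, with placeholder file names:
pip install chardet                  # provides the chardetect command
chardetect input.csv                 # prints the detected encoding with a confidence score
iconv -f WINDOWS-1252 -t UTF-8 input.csv > output.csv
Feed whatever encoding chardetect reports to iconv's -f option.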

Unexpected encoding error using JSON.parse

I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines:
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
  :external_encoding => 'iso-8859-1',
  :internal_encoding => 'utf-8'
)
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
  File.read('data.json',
    :external_encoding => 'iso-8859-1'
  )
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
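Both checks can be done from the shell; a sketch assuming od and iconv are available:
head -c 3 data.json | od -A n -t x1              # ef bb bf here would be a UTF-8 BOM
iconv -f utf-8 -t utf-8 data.json > /dev/null    # errors out at the first invalid UTF-8 sequence
The second command makes a handy validator: if it fails, the file is not pure UTF-8 no matter what file(1) claims.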

Mac 10.8 > Cannot encode text file with UTF8

I cannot get a text or PHP file with a utf-8 charset! I tried NetBeans, PhpStorm, the iconv CLI, etc., and file still reports charset=us-ascii:
iconv -f us-ascii -t utf-8 toto.php > toto2.php
file -I toto2.php
toto2.php: text/plain; charset=us-ascii
What can I do?
What's the content of toto2.php?
If the file only contains ASCII-compatible characters (i.e. mostly Latin letters and the few "common" special characters), then there's no way for file (or any other tool) to distinguish an ASCII-encoded from a UTF-8-encoded file, because they will be byte-for-byte identical.
And since iconv accepted the file as "us-ascii" input, that's guaranteed to be the case here!
In other words: converting from a (real) ASCII file to a UTF-8 file is a no-op!
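A quick way to see this in action, using the file name from the question (and assuming a UTF-8 terminal):
file -I toto2.php            # text/plain; charset=us-ascii
printf 'é\n' >> toto2.php    # append a single non-ASCII character
file -I toto2.php            # text/plain; charset=utf-8
As soon as the file contains one byte outside the ASCII range, file starts reporting it as utf-8.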
