I have a "gzip" byte in my plain text file (I can read it, there is not weird symbol, I cant cat or nano) that causes an error with a tool in python:
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte in line 49
My file is actually encoded in us-ascii, and I want to get rid of the error with an encoding conversion from us-ascii to utf-8.
I used the command:
iconv -f US-ASCII -t UTF-8 id_to_genome_file.tsv -o id_to_genome_file2.tsv
It ran without error, but checking the file shows it is still in us-ascii:
file -i id_to_genome_file2.tsv
id_to_genome_file2.tsv: text/plain; charset=us-ascii
Why isn't the conversion taking effect? And would the conversion really solve the issue?
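A quick way to confirm whether the file itself really contains a non-ASCII byte is to scan its raw bytes and print the location of anything outside the ASCII range; a minimal sketch in Ruby (the file name is the one from the iconv command, everything else is just an illustration):
# Scan the raw bytes and report anything outside the ASCII range
File.open('id_to_genome_file.tsv', 'rb') do |f|
  f.each_line.with_index(1) do |line, lineno|
    line.each_byte.with_index(1) do |byte, pos|
      puts "line #{lineno}, byte #{pos}: 0x#{format('%02x', byte)}" if byte > 0x7f
    end
  end
end
If nothing is printed, the file is pure ASCII and the 0x8b byte is coming from somewhere else (0x1f 0x8b happens to be the gzip magic number, which is why the byte looks like "a gzip byte").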
EDIT: the error wasn't in the file but somewhere else. The errors shown by the tool and its documentation are poorly written. Solved.
Related
My script fails on this bad encoding; even though I brought all the files to UTF-8, some still won't convert or just have wrong characters inside.
It actually fails at the variable assignment step.
Can I set some kind of error handling for this case, like below, so my loop will continue? That ¿ causes all the problems.
I need to run this script all the way through without errors. I have already tried encoding, force_encoding, and the shebang line. Does Ruby have any kind of error handling routine so I can handle this bad case and continue with the rest of the script? How do I get rid of this error: invalid multibyte char (UTF-8)?
line = '¿USE [Alpha]'
lineOK = ' USE [Alpha] OK line'
>ruby ReadFile_Test.rb
ReadFile_Test.rb:15: invalid multibyte char (UTF-8)
I could reproduce your issue by saving the file with ISO-8859-1 encoding.
Running your code with the file in this non-UTF-8 encoding, the error popped up. My solution was to save the file as UTF-8.
I am using Sublime Text as my editor, and it has the option 'File > Save with Encoding'. I chose 'UTF-8' and was able to run the script.
Using puts line.encoding then showed me UTF-8, and the error was gone.
I suggest re-checking the encoding of your saved script file.
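If the invalid bytes ever come from data the script reads (rather than from the script source file itself), Ruby can also replace them at runtime so a loop keeps going; a minimal sketch using String#scrub, with an illustrative file name:
# Read raw bytes, tag them as UTF-8, and replace any invalid sequences
File.open('input.sql', 'rb') do |f|
  f.each_line do |raw|
    line = raw.force_encoding('UTF-8').scrub('?')   # invalid bytes become '?'
    puts line if line.include?('USE')               # the loop continues either way
  end
end
For the error in the question itself, though, the fix above is the right one: the parser chokes on the script file's own bytes before any of this code runs.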
I am reading a txt file with some Spanish characters into PostgreSQL, but got the error message "invalid byte sequence for encoding "UTF8": 0xdc, 0x45".
I used the following command to get the encoding of the file:
file -bi CAT_CELDAS_20190626.txt
The result is:
text/plain; charset=iso-8859-1,
Then I used iconv to convert the encoding from iso-8859-1 to utf-8:
iconv -f iso-8859-1 -t utf-8 CAT_CELDAS_20190626.txt -o CAT_CELDAS_20190626_new.txt
After the conversion I checked the encoding of the new file: it is utf-8, but the garbled text is still there:
503|706010004403418|3418|3418|13.959919|-89.1149|275|1900|GSM|3418|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3418|470||
503|706010004403417|3417|3417|13.959919|-89.1149|30|1900|GSM|3417|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3417|470||
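One way to see what is actually stored where the accented characters should be is to dump the raw bytes of one of those lines before any conversion; a minimal Ruby sketch (the file name comes from the commands above, the rest is illustrative):
# Print the raw bytes of the first line containing the garbled name,
# without letting any tool guess the encoding
File.open('CAT_CELDAS_20190626.txt', 'rb') do |f|
  f.each_line do |line|
    next unless line.include?('Hacienda')
    puts line.bytes.map { |b| format('%02x', b) }.join(' ')
    break
  end
end
If the accented letters are not single bytes in the 0xC0-0xFF range, the file was never really ISO-8859-1: iconv will then faithfully convert the wrong characters, and the text will still look garbled afterwards.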
I have created a script (.sh file) to convert a CSV file from ANSI encoding to UTF-8.
The command I used is:
iconv -f "windows-1252" -t "UTF-8" $csvname -o $newcsvname
I got this from another Stack Overflow post, but the iconv command doesn't seem to be working.
Snapshot of input file contents in Notepad++
Snapshot of the first CSV file below.
Snapshot of the second CSV file below.
EDIT: I tried reducing the problematic input CSV file contents to a few lines (similar to the first file), and now it gets converted fine. Is there something wrong with the file contents itself then? How do I check that?
You can use the Python chardet Character Encoding Detector to determine the file's existing character encoding.
iconv -f {character encoding} -t utf-8 {FileName} > {Output FileName}
This should work. Also check whether any junk characters exist in the file; they may cause errors during the conversion.
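For the junk-character check, here is a minimal sketch that lists every byte outside printable ASCII together with its offset (plain Ruby, purely as an illustration; chardet itself is a Python library, and the file name is hypothetical):
# Report every byte outside printable ASCII (tab, LF and CR excluded) with its offset
data = File.binread('input.csv')   # illustrative file name
data.each_byte.with_index do |byte, offset|
  next if [0x09, 0x0a, 0x0d].include?(byte)   # tab, LF, CR
  puts format('offset %d: 0x%02x', offset, byte) if byte < 0x20 || byte > 0x7e
end
Any hits tell you exactly which bytes iconv has to interpret, and whether the "windows-1252" guess is plausible for them.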
I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
  :external_encoding => 'iso-8859-1',
  :internal_encoding => 'utf-8'
)
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
  File.read('data.json',
    :external_encoding => 'iso-8859-1'
  )
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM even though it is not UTF-8, or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
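To check the BOM suspicion, you can look at the first bytes of the file and ask whether the whole thing is valid UTF-8; a minimal sketch (data.json as in the question):
# A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file
bom = File.binread('data.json', 3)
puts bom.bytes == [0xef, 0xbb, 0xbf] ? 'UTF-8 BOM present' : 'no UTF-8 BOM'
# And ask Ruby directly whether the file content is valid UTF-8
puts File.binread('data.json').force_encoding('UTF-8').valid_encoding?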
I cannot get a text or PHP file with a UTF-8 charset! I tried NetBeans, PhpStorm, the iconv CLI, etc., but the file charset is still us-ascii:
iconv -f us-ascii -t utf-8 toto.php > toto2.php
file -I toto2.php
toto2.php: text/plain; charset=us-ascii
What can I do?
What's the content of toto2.php?
If the file only contains characters from the ASCII range (i.e. mostly unaccented Latin letters and the few "common" special characters), then there's no way for file (or any other tool) to distinguish an ASCII-encoded file from a UTF-8 encoded one, because the two are byte-for-byte identical.
Given that iconv converted from "us-ascii", that is guaranteed to be the case here!
In other words: converting from a (real) ASCII file to a UTF-8 file is a no-op!
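You can see the "byte-for-byte identical" point directly in Ruby, just as an illustration:
# A pure-ASCII string has exactly the same bytes in US-ASCII and in UTF-8
p 'hello'.encode('US-ASCII').bytes == 'hello'.encode('UTF-8').bytes   # => true
# Only a non-ASCII character changes the bytes (é becomes the two bytes C3 A9)
p 'héllo'.encode('UTF-8').bytes   # => [104, 195, 169, 108, 108, 111]
So until toto2.php actually contains a character outside the ASCII range, file will keep reporting us-ascii, and that is nothing to worry about.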