I cannot get a text or PHP file to have the UTF-8 charset! I tried NetBeans, PhpStorm, the iconv CLI, etc., but the file charset is still us-ascii:
iconv -f us-ascii -t utf-8 toto.php > toto2.php
file -I toto2.php
toto2.php: text/plain; charset=us-ascii
What can I do?
What's the content of toto2.php?
If the file only contains characters from the ASCII range (i.e. unaccented Latin letters, digits, and the usual punctuation), then there is no way for file (or any other tool) to distinguish an ASCII-encoded file from a UTF-8-encoded one, because the two are byte-for-byte identical: UTF-8 encodes every ASCII character as the same single byte.
Since you told iconv the input was "us-ascii", that is guaranteed here!
In other words: converting a (real) ASCII file to UTF-8 is a no-op!
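You can verify this yourself by adding a single non-ASCII character (toto3.php is a hypothetical file name; this assumes a UTF-8 terminal locale and the BSD file whose -I flag you are already using):
printf 'hello\n' > toto3.php
file -I toto3.php
toto3.php: text/plain; charset=us-ascii
printf 'héllo\n' > toto3.php
file -I toto3.php
toto3.php: text/plain; charset=utf-8
As soon as the file contains a byte outside the ASCII range, file starts reporting utf-8.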
I have a file that I'm pretty sure is in a weird encoding. I've previously converted similar files to UTF-8 by assuming they were encoded in windows-1255 and running iconv (iconv -f windows-1255 -t utf-8 $file), and that worked.
My current file contains a ß character that is throwing me off: iconv breaks when it hits it (with an "illegal input sequence" error). Is there a different encoding I should be using?
WINDOWS-1255 (= Hebrew) does not contain an eszett (ß), so iconv behaves correctly. Other legacy code pages that do have that character at code point 0xDF:
WINDOWS-1250 = Latin 2 / Central European
WINDOWS-1252 = Latin 1 / Western European
WINDOWS-1254 = Turkish
WINDOWS-1257 = Baltic
WINDOWS-1258 = Vietnamese
Only the document owner knows which codepage is the correct one. If it's one of the WINDOWS-125x at all.
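If the owner confirms, say, Western European, then your original command with the corrected source encoding should go through (WINDOWS-1252 is only an assumption here):
iconv -f WINDOWS-1252 -t utf-8 $file
With the right source code page, the 0xDF byte maps to U+00DF and iconv no longer complains about an illegal input sequence.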
I am reading a txt file with some Spanish characters into PostgreSQL, but I get the error message "invalid byte sequence for encoding "UTF8": 0xdc, 0x45".
I used the following command to get the encoding of the file:
file -bi CAT_CELDAS_20190626.txt
The result is:
text/plain; charset=iso-8859-1,
Then I used iconv to convert the file from iso-8859-1 to utf-8:
iconv -f iso-8859-1 -t utf-8 CAT_CELDAS_20190626.txt -o CAT_CELDAS_20190626_new.txt
After the conversion I checked the encoding of the new file; it is utf-8, but the garbled text is still there:
503|706010004403418|3418|3418|13.959919|-89.1149|275|1900|GSM|3418|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3418|470||
503|706010004403417|3417|3417|13.959919|-89.1149|30|1900|GSM|3417|Hacienda Asunci髇|1|CUSCATLAN|SUCHITOTO|706|1|44|3417|470||
According to Mac OS X, I have a file with ISO-8859 encoding:
$ file filename.txt
filename.txt: ISO-8859 text, with CRLF line terminators
I try to read it with that encoding:
> filename = "/Users/myuser/Downloads/filename.txt"
> content = File.read(filename, encoding: "ISO-8859")
> content.encoding
=> #<Encoding:UTF-8>
It doesn't work. And consequently:
> content.split("\n")
ArgumentError: invalid byte sequence in UTF-8
Why doesn't it read the file as ISO-8859?
With your code, Ruby emits the following warning when reading the file:
warning: Unsupported encoding ISO-8859 ignored
This is because there is not just one ISO 8859 encoding but a whole family of variants. You need to specify the correct one explicitly, e.g.
content = File.read(filename, encoding: "ISO-8859-1")
# or equivalently
content = File.read(filename, encoding: Encoding::ISO_8859_1)
When dealing with text files produced on Windows machines (which the CRLF line endings hint at), you might want to use Encoding::Windows_1252 (or, as a string, "Windows-1252") instead. This is a superset of ISO 8859-1 and used to be the default encoding of many Windows programs and of the system itself.
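If you are unsure which names your Ruby accepts, you can list the known ISO 8859 variants (a quick check; the exact list depends on your Ruby version):
ruby -e 'puts Encoding.name_list.grep(/8859/i)'
This prints ISO-8859-1 through ISO-8859-16 (plus some aliases); a plain "ISO-8859" is not among them, which is why Ruby ignored it.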
Try to use Encoding::ISO_8859_1 instead.
I have a question about converting UTF-8 to CP1252 on Ubuntu with PHP or shell.
Background: converting a CSV file from UTF-8 to CP1252 on Ubuntu with PHP or shell, copying the file from Ubuntu to Windows, and opening it with Notepad++.
Environment:
Ubuntu 10.04
PHP 5.3
a CSV file with letters (œ, à, ç)
Methods used:
With PHP
iconv("UTF-8", "CP1252", "content of file")
or
mb_convert_encoding("content of file", "UTF-8", "CP1252")
If I check the generated file with
file -i name_of_the_file
It displayed:
name_of_the_file: text/plain; charset=iso-8859-1
I copied the converted file to Windows and opened it with Notepad++; in the bottom-right corner, the encoding shown is ANSI.
When I changed the encoding from ANSI to Windows-1252, the special characters were displayed correctly.
With Shell
iconv -f UTF-8 -t CP1252 name_of_the_file
The rest is the same.
Questions:
1. Why does the file command report ISO-8859-1 rather than CP1252 or ANSI?
2. Why are the special characters displayed correctly when I change the encoding from ANSI to Windows-1252?
Thank you in advance!
1.
CP1252 and ISO-8859-1 are very similar; quite often a file encoded in one of them is byte-for-byte identical to the same file encoded in the other. See Wikipedia for the characters that exist in Windows-1252 but not in ISO-8859-1.
The letters à and ç are encoded identically in both encodings. While ISO-8859-1 doesn't have œ and CP1252 does, file might have missed that; AFAIK it doesn't analyse the entire file.
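You can reproduce the misdetection with characters that are identical in both encodings (assuming GNU iconv and file; the exact output can vary with your file version):
printf 'à ç\n' | iconv -f UTF-8 -t CP1252 | file -bi -
text/plain; charset=iso-8859-1
Since à (0xE0) and ç (0xE7) have the same bytes in CP1252 and ISO-8859-1, file has no evidence for preferring one name over the other and reports the ISO one.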
2.
"ANSI" is a misnomer used for the default non-Unicode encoding in Windows. In case of Western European languages, ANSI means Windows-1252. In case of Central European, it's Windows-1250, in case of Russian it's Windows-1251, and so on. Nothing apart from Windows uses the term "ANSI" to refer to an encoding.
I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
  :external_encoding => 'iso-8859-1',
  :internal_encoding => 'utf-8'
)
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
  File.read('data.json',
    :external_encoding => 'iso-8859-1'
  )
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
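Two quick checks can help with that (assuming GNU grep and a UTF-8 locale; data.json is the file from your question):
grep -naxv '.*' data.json
head -c 3 data.json | hexdump -C
The grep invocation prints, with line numbers, every line that is not valid in the current encoding, i.e. the lines carrying the stray Latin-1 bytes; the head | hexdump pipeline shows whether the file starts with the UTF-8 byte order mark EF BB BF.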