I'm using a .jl file that has data as rows of strings. It seems when the file was created, the data was not saved in utf-8, and this is causing me some problems.
Python 3 should be automatically converting strings to utf-8, and my development environment is PyCharm, which I've configured to display utf-8 encoding.
It looks like the data itself is the problem, being saved as raw strings.
Any ideas what I can do? Any packages I can use? .decode() doesn't seem to work?
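A minimal Python 3 sketch of two things worth checking. In Python 3, .decode() exists only on bytes, not on str, so the file has to be read as bytes (or opened with an explicit encoding) before decoding. The cp1252 source encoding and both helper names are assumptions for illustration; the question does not say how the file was actually saved.

```python
def decode_bytes(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw bytes using the encoding the file is ASSUMED to be saved in."""
    return raw.decode(source_encoding)


def repair_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo UTF-8 text that was mistakenly decoded as `wrong_encoding`."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; return it unchanged.
        return text


if __name__ == "__main__":
    print(decode_bytes(b"caf\xe9"))            # bytes saved as Windows-1252
    print(repair_mojibake("caf\u00c3\u00a9"))  # UTF-8 bytes that were read as cp1252
```

If the strings already look right after decode_bytes, the file was simply saved in a legacy code page; if they only look right after repair_mojibake, the data was decoded with the wrong encoding once before it was written.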
I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything will look fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation on why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection in applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
// Assumption: neitanod/forceutf8 was installed with Composer.
require 'vendor/autoload.php';
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \pL without the u modifier had a side effect I had not expected: the pattern was applied byte by byte rather than to whole UTF-8 characters, so parts of multi-byte characters could be stripped, leaving bytes that are no longer valid UTF-8. A quick solution is to add the u flag to the regex (see the regex pattern modifiers reference), which makes PCRE treat both the pattern and the subject string as UTF-8. See also this answer.
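The detection step above (finding which parts of the export are not valid UTF-8) can also be sketched without a diff tool. This is a hedged Python illustration, not the method the answer used: the function name is hypothetical, and it reports whole lines rather than single characters.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad = []
    # Splitting on b"\n" is safe: byte 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Two rows: the first encoded as UTF-8, the second as Windows-1252.
    sample = "sku,name\n1,M\u00fcller\n".encode("utf-8") + b"2,M\xfcller\n"
    for number, line in find_invalid_utf8_lines(sample):
        print(number, line)
```

For a real export, the bytes would come from something like open("export.csv", "rb").read(); the reported line numbers then point at the rows (and, by inspection, the column) that were damaged.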
I am trying to process some Google AdWords CSV files. The files are available in UNICODE format. When I use the Ruby CSV parser, I am not able to read the file: the characters display as \x00a \x00b etc.
I ended up having to open the file in OpenOffice, choose UTF-8 to render the file, and then save it. After that, Ruby CSV can process the file. I also had to remove the first character in the CSV file, which looks like the number 8 in a black circle, because it is not a valid UTF-8 character. This special character was the result of the UNICODE to UTF-8 conversion in OpenOffice.
So what is the best way to convert the csv file to a Ruby friendly encoding without illegal characters?
To see what I mean, you can try to open this file with Ruby CSV and parse the lines.
https://github.com/zben/encoding_test/blob/master/encoding_test.csv
This page suggests using Iconv.iconv to convert:
doc = Iconv.iconv('UTF-8', 'UTF-16', doc)
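The \x00 bytes in the question are typical of UTF-16 data being read one byte at a time, which is consistent with the UTF-16 to UTF-8 conversion suggested above. As a rough sketch of the same conversion in Python (the function name is an assumption, and the input is assumed to be UTF-16 with a BOM):

```python
import codecs


def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode UTF-16 input (BOM-aware) and re-encode it as BOM-less UTF-8."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it,
    # so no stray first character is left in the output.
    text = raw.decode("utf-16")
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "a\tb\n".encode("utf-16-le")
    print(utf16_to_utf8(sample))
```

Because the BOM is consumed during decoding, the manual step of deleting the first character after conversion should not be needed with this approach.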
If I have a text file consisting solely of the word "NESTLÉ", how do I open this in Excel without mangling the accent?
This question isn't quite covered by other questions on the site, so far as I can tell. I don't see any difference in any import option. I try to tell Excel it's UTF-8 when I import it, and the best that happens is that the É => _.
If I create a Google Docs spreadsheet with just that word and save it out to Excel format, then open in Excel, I get the data un-mangled, so that's good, it's possible to represent the data.
I've never seen Excel 2011 do anything smart with a UTF8 BOM indicator at the start of a file.
Does anyone else have different experience there, or know how to get this data from a text file to Excel without any intermediate translation tools?
I saved a file with that word in multiple formats. The results when opened with Excel 2010 by simply dragging and dropping the appropriate .txt file on it:
Correct
ANSI [1] (Windows-1252 encoding on my system, which is US Windows)
UTF-8 with BOM
UTF-16BE without BOM
UTF-16LE without BOM
UTF-16LE with BOM
Incorrect
UTF-8 without BOM (result NESTLÃ‰)
UTF-16BE with BOM (result þÿNESTLÉ)
Do you know the encoding of your text file? Interestingly, UTF-16BE with BOM failed. Excel is probably using a heuristic function such as IsTextUnicode.
[1] The so-called ANSI mode on Windows is a locale-specific encoding.
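Given that "UTF-8 with BOM" is in the Correct list above, one way to produce such a file is Python's "utf-8-sig" codec, which prepends the BOM when encoding. This is only a sketch: the helper name is hypothetical, and the result above was observed with Excel 2010 on Windows, so other Excel versions may behave differently.

```python
import codecs


def encode_for_excel(text: str) -> bytes:
    """Encode text as UTF-8 with a leading BOM (Python's "utf-8-sig" codec)."""
    return text.encode("utf-8-sig")


if __name__ == "__main__":
    data = encode_for_excel("NESTL\u00c9")
    # The output starts with the three BOM bytes EF BB BF.
    print(data.startswith(codecs.BOM_UTF8))
    print(data)
```

The same codec name can be passed to open(path, "w", encoding="utf-8-sig") when writing the file directly.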
I have been reading about the issue with trying to figure out the actual encoding of a file and all its complications.
But I just need to know what the encoding of a file was set to when it was saved. Does Windows store this information somewhere, similar to file type, date modified, etc.?
That's not available. The Windows file system (NTFS) doesn't store any metadata for a file beyond the trivial stuff like name, extension, last written date, etcetera. Nothing that's specific for the file type.
All you have available is the BOM, bytes at beginning of the file that indicate the UTF encoding and byte order. It only exists for files encoded in UTF and, unfortunately, is optional. The real troublemakers however are text files that were encoded with a particular 8-bit non-Unicode code page. Usually created by a legacy application. Nothing you can do for that but hope that the file wasn't created too far away from your machine so that the default system code page is a match.
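Checking for a BOM, as described above, can be sketched in a few lines of Python. The function name and the returned encoding labels are choices made for this illustration; a result of None only means that no BOM was found, not that the file is not Unicode.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"hello"))
```

Only the first four bytes of a file are needed, so in practice the argument could come from open(path, "rb").read(4).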
No operating system stores information about a file's encoding. The encoding is a property of text files only. Since some text files do not have a .txt extension, and some .txt files are not really text files, associating an encoding with a file does not make much sense.
Some UTF-8 files store the byte order mark (BOM) at the beginning of the file, which can be used to check whether it is a UTF-8 file or not. However, the BOM is not always present, and a UTF-8 file does not need to have one. So the only way to determine the encoding of a text file is to open it with different encodings until you can read the file.
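The trial-and-error approach described above can be sketched as follows in Python. The candidate list and the function name are assumptions for illustration. Note that a successful decode does not prove the guess is right: latin-1 accepts every byte sequence, so it only makes sense as a last resort, and the order of candidates matters.

```python
def guess_encoding(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    print(guess_encoding("caf\u00e9".encode("utf-8")))
    print(guess_encoding(b"caf\xe9"))
```

UTF-8 is tried first because text in a legacy 8-bit code page that contains non-ASCII characters rarely happens to be valid UTF-8, whereas the reverse mistake (UTF-8 read as an 8-bit code page) almost always "succeeds" and produces garbled text.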
I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. when in Ruby, I open the file, encode the string to utf8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is encoded as the same single byte in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless the file contains "special" (non-ASCII) characters, which UTF-8 represents using multiple bytes. An editor such as Notepad++ therefore has nothing to go on when a file contains only ASCII characters and no BOM.
It's quite possible that your conversion is actually working, but you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings.
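The point about ASCII and UTF-8 being indistinguishable without special characters can be checked directly. This is a small Python sketch rather than Ruby, and the helper name is made up for the example:

```python
def is_pure_ascii(raw: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the data is valid ASCII and UTF-8 alike."""
    return all(byte < 0x80 for byte in raw)


if __name__ == "__main__":
    plain = "hello"
    accented = "h\u00e9llo"
    # ASCII-only text produces identical bytes in both encodings.
    print(plain.encode("ascii") == plain.encode("utf-8"))
    print(is_pure_ascii(plain.encode("utf-8")))
    # A non-ASCII character becomes a multi-byte sequence in UTF-8.
    print(is_pure_ascii(accented.encode("utf-8")))
    print(accented.encode("utf-8"))
```

If the converted files pass is_pure_ascii, an editor reporting them as "ANSI" is not evidence that the conversion failed.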