If I have a text file consisting solely of the word "NESTLÉ", how do I open it in Excel without mangling the accent?
This question isn't quite covered by other questions on the site, as far as I can tell. None of the import options seem to make any difference. I try to tell Excel it's UTF-8 when I import it, and the best that happens is that the É => _.
If I create a Google Docs spreadsheet with just that word, save it out to Excel format and then open that in Excel, the data comes through unmangled, so at least it's possible to represent the data.
I've never seen Excel 2011 do anything smart with a UTF-8 BOM at the start of a file.
Does anyone else have different experience there, or know how to get this data from a text file to Excel without any intermediate translation tools?
I saved a file with that word in multiple formats. The results when opened with Excel 2010 by simply dragging and dropping the appropriate .txt file on it:
Correct
ANSI¹ (Windows-1252 encoding on my system, which is US Windows)
UTF-8 with BOM
UTF-16BE without BOM
UTF-16LE without BOM
UTF-16LE with BOM
Incorrect
UTF-8 without BOM (result NESTLÃ‰)
UTF-16BE with BOM (result þÿNESTLÉ)
Do you know the encoding of your text file? Interesting that UTF-16BE with BOM failed. Excel is probably using a heuristic function such as IsTextUnicode.
¹ The so-called "ANSI" encoding on Windows is a locale-specific encoding.
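If you want to reproduce these results, a minimal Python sketch along these lines should generate the same set of test files (the file names are arbitrary, and "ANSI" is assumed to be CP1252 as on US Windows):

# Writes the test word in each encoding listed above so the files can be
# dragged onto Excel one at a time.
word = "NESTLÉ"

tests = {
    "ansi.txt":          ("cp1252",    b""),          # "ANSI" on US Windows
    "utf8_bom.txt":      ("utf-8-sig", b""),          # codec writes the BOM itself
    "utf8_nobom.txt":    ("utf-8",     b""),
    "utf16le_bom.txt":   ("utf-16",    b""),          # BOM in native byte order (LE on typical PCs)
    "utf16le_nobom.txt": ("utf-16-le", b""),
    "utf16be_nobom.txt": ("utf-16-be", b""),
    "utf16be_bom.txt":   ("utf-16-be", b"\xfe\xff"),  # prepend the big-endian BOM manually
}

for name, (codec, bom) in tests.items():
    with open(name, "wb") as f:
        f.write(bom + word.encode(codec))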
Related
I have a CSV with content that is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).
I can see that Sublime Text (on Windows) for example also automatically detects the wrong Windows-1252 encoding, when opening the file for the first time, showing garbled text where special characters are supposed to be.
When I choose Reopen with Encoding » UTF-8, everything looks fine, as expected.
Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.
The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encodings broke and I am currently trying to figure out what happened - hence my investigation into why this export is detected as Windows-1252.
Is there any reliable way of figuring out why the automatic detection in applications like Sublime Text assumes the wrong character encoding?
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script to force-convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader for neitanod/forceutf8

// Force-convert anything that is not already UTF-8 to UTF-8.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression without the u modifier had an unwanted side effect: the multi-byte UTF-8 sequences of the special characters were mangled, leaving bytes that are no longer valid UTF-8. A quick solution is to add the u flag to the regex (see the regex pattern modifiers reference), which makes preg_replace treat the pattern and the subject as UTF-8. See also this answer.
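As a cross-check that doesn't depend on PHP, a short Python sketch like the following can report the exact byte offsets that are not valid UTF-8 (the file name is the one from the script above):

# Scans export.csv and reports every byte offset that is not valid UTF-8,
# which points directly at the affected rows/columns.
data = open("export.csv", "rb").read()

pos = 0
while pos < len(data):
    try:
        data[pos:].decode("utf-8")
        break                                # everything from pos onward decodes cleanly
    except UnicodeDecodeError as e:
        bad = pos + e.start                  # absolute offset of the offending byte
        print(f"invalid UTF-8 byte 0x{data[bad]:02x} at offset {bad}")
        pos = bad + 1                        # skip it and keep scanning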
I know that if I use the Unicode character set in VS, I can use L"There is a string" to represent a Unicode string. I think "There is a string" is read from the source file during lexical parsing, and decoded to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always get the correct Unicode data from the L prefix. Does VS detect the encoding of the source file to convert "There is a string" to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question should be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM¹ signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF-8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF-8 (with BOM) for VS to decode your source code correctly.
¹ https://en.wikipedia.org/wiki/Byte_order_mark
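To see which case you are in, a small Python sketch like this can check whether a source file starts with one of the BOM signatures (the file name main.cpp is just a placeholder):

import codecs

# BOM signatures and the encodings they announce.
BOMS = [
    (codecs.BOM_UTF8,     "UTF-8"),      # EF BB BF
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),  # FF FE 00 00 (test before UTF-16 LE)
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),  # 00 00 FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
]

def detect_bom(path):
    head = open(path, "rb").read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: VS falls back to the localized ANSI code page

print(detect_bom("main.cpp") or "no BOM found")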
Can anyone please advise me on the below issue.
I have an Oracle program which takes a .CSV file as input and processes it. We are now facing an issue: when an extended ASCII character appears in the input file, the letter immediately after that special character gets trimmed.
We are using the file utility functions Utl_File.Fopen_Nchar() to open the file and Utl_File.Get_Line_Nchar() to read the characters in the file. The program is written in such a way that it should handle multiple languages (Unicode characters) in the input file.
Our analysis found that when the character encoding of the CSV file is UTF-8, the file is processed successfully even when it contains extended ASCII characters as well as Unicode characters. But sometimes we receive the file in Windows-1252 (ANSI - Latin I) format, which causes the trimming problem for extended ASCII characters.
So is there any way to handle this issue? Can we open a (CSV) file in Oracle and save it in UTF-8 format if it arrives in another format?
Please let me know if any more info is needed.
Thanks in anticipation.
The problem is: if you don't know which encoding your CSV file is saved in, then it is not possible to determine the right conversion either. You would screw up your CSV file.
What do you mean by "1252 (ANSI - Latin I)"?
Windows-1252 and ISO-8859-1 are not equal, see the difference here: ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode
(Sorry for posting the German Wikipedia, however the English version does not show such a nice table)
You could use the fix_latin command-line tool to convert a file from an unknown mixture of ASCII / Latin-1 / CP1252 / UTF-8 into UTF-8:
fix_latin < input.csv > output.csv
The fix_latin utility is a simple Perl script which is shipped with the Encoding::FixLatin module on CPAN.
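If Perl/CPAN is not available and you already know the incoming file is Windows-1252, a rough Python equivalent (a sketch under that assumption; unlike fix_latin it does not cope with mixed encodings) is to re-encode the file before the Oracle program reads it:

# Re-encodes a CSV that is known to be Windows-1252 into UTF-8.
# The file names are placeholders.
with open("input.csv", "r", encoding="cp1252", newline="") as src, \
     open("output.csv", "w", encoding="utf-8", newline="") as dst:
    for line in src:
        dst.write(line)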
I have a problem with Chinese characters when I export them from Oracle Forms 10g to Excel on Windows 7. Although they look like Chinese, they are not Chinese characters. Take into consideration that I have already changed the language of my computer to Chinese and restarted it. I use the owa_sylk utility and call the Excel report like:
v_url := 'http://....../excel_reports.rep?sqlString=' ||
v_last_query ||
'&font_name=' ||
'Arial Unicode MS'||
'&show_null_as=' ||
' ' ;
web.show_document(v_url,'_self');
Here you can see what it looks like:
Interestingly, when I change the language of my computer to English, this column is empty. Besides, I realized that if I open the file with a text editor it has the right Chinese words, but when I open it with Excel there is a problem.
Does anyone have a clue?
Thanks
Yes, the problem comes from different encodings. If the DB uses UTF-8 and you need to send data to Excel in a Chinese character set, you can convert the data right inside owa_sylk. Use the convert function.
For example, in the function owa_sylk.print_rows change
p( line );
to
p(convert(line, 'ZHS32GB18030','AL32UTF8'));
where 'ZHS32GB18030' is one of the Chinese character sets and 'AL32UTF8' is UTF-8.
To choose encoding parameters use Appendix A
You can also do
SELECT * FROM V$NLS_VALID_VALUES WHERE parameter = 'CHARACTERSET'
to see all the supported encodings.
This is a character encoding issue. What you need to make sure is that all tools in the whole chain (database, web service, Excel, text editor and web browser) use the same character encoding.
Changing your language can help here but a better approach is to nail the encoding down for each part of the chain.
The web browser, for example, will prefer the encoding supplied by the web server over the OS's language settings.
See this question for how to set UTF-8 encoding (which can properly represent Chinese in any form) for Oracle: export utf-8 data to text file with oracle sql developer
I'm not sure how to set the encoding for owa_sylk, you will have to check the documentation (couldn't find any, though). If you can't find anything, ask a question here or use a different tool.
So you need to find out who executes excel_reports.rep and configure that correctly. Use your web browser's developer tools and check the "charset" or "encoding" of the page.
The problems in Excel are based on the file format which you feed into it. Excel's native formats (.xls and .xlsx files) are Unicode-safe; .csv isn't. So if you can read the file in your text editor, chances are that this is a non-Excel file format which Excel can parse but which doesn't contain the necessary encoding information.
If you were able to generate a UTF-8 encoded file with the steps above, you can load it by choosing "65001: Unicode (UTF-8)" from the drop-down list that appears next to "File origin" in the Text Import Wizard (source).
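If you control the export, one way to skip the Text Import Wizard altogether is to write the CSV as UTF-8 with a BOM, which current Excel versions take as an encoding hint when the file is opened directly. A minimal Python sketch (the file name and rows are made up for illustration):

import csv

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF), which Excel uses as an encoding hint.
rows = [["name", "city"], ["NESTLÉ", "北京"]]

with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)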
I'm trying to export some text to a UTF-8 file with LotusScript. I checked the documentation, and the following lines should output my text as UTF-8, but Notepad++ says it's ANSI.
Dim streamCompanies As NotesStream
Dim sesCurrent as New NotesSession
Set streamCompanies = sesCurrent.CreateStream
Call streamCompanies.Open("C:\companies.txt", "UTF-8")
Call streamCompanies.WriteText("Test")
streamCompanies.Close
When I try the same with UTF-16 instead of UTF-8, the generated file format is correct. Could anyone point me in the right direction on how to write a UTF-8 file with LotusScript on a Windows platform?
Notes is most likely doing its job and encoding properly. It is likely that Notepad++ is interpreting the UTF-8 file as ANSI because no UTF-8-only characters exist in the file. There is no way to determine the encoding in this case other than to analyze the file's contents.
See this SO answer: How to avoid inadvertent encoding of UTF-8 files as ASCII/ANSI?
So a simple test to make sure Notes is working would be to output a non-ANSI character and then open in Notepad++ to confirm.
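A quick way to see which case you are in is to look at the raw bytes: if every byte is below 0x80 the file is plain ASCII, and its ANSI and UTF-8 representations are byte-for-byte identical, so Notepad++ has nothing to distinguish them by. A small Python check (the path is the one from the question):

# Reports whether the exported file contains any bytes outside the ASCII range.
# If it does not, the "ANSI" and "UTF-8" versions of the file are identical.
data = open(r"C:\companies.txt", "rb").read()
non_ascii = [b for b in data if b > 0x7F]
print("pure ASCII" if not non_ascii else f"{len(non_ascii)} non-ASCII byte(s) found")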
Closed - down the line while coding I stumbled across some data with Asian characters which were displayed correctly in my text editor. Rechecking file encodings, I found the following:
If the output text only includes ASCII-chars, it is decoded as ANSI with Notepad++
If the output text contains e.g. Katakana, it is decoded as UTF-8 with Notepad++
-> problem solved for me.