I'm trying to export some text to a UTF-8 file with LotusScript. I checked the documentation, and the following lines should output my text as UTF-8, but Notepad++ says it's ANSI.
Dim streamCompanies As NotesStream
Dim sesCurrent As New NotesSession
Set streamCompanies = sesCurrent.CreateStream
Call streamCompanies.Open("C:\companies.txt", "UTF-8")
Call streamCompanies.WriteText("Test")
streamCompanies.Close
When I try the same with UTF-16 instead of UTF-8, the generated file format is correct. Could anyone point me in the right direction on how to write a UTF-8 file with LotusScript on a Windows platform?
Notes is most likely doing its job and encoding properly. Notepad++ is probably interpreting the UTF-8 file as ANSI because no characters outside the ASCII range exist in the file; in that case there is no way to determine the encoding other than to analyze the file's contents.
See this SO answer: How to avoid inadvertent encoding of UTF-8 files as ASCII/ANSI?
So a simple test to make sure Notes is working would be to output a non-ANSI character and then open in Notepad++ to confirm.
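To see why the two encodings are indistinguishable for ASCII-only text, here is a small illustration in Python (not LotusScript): the bytes for "Test" are identical in Windows-1252 and UTF-8, while a non-ASCII character produces different bytes.

text = "Test"
print(text.encode("cp1252"))   # b'Test'
print(text.encode("utf-8"))    # b'Test' - identical bytes, nothing to detect
text = "Täst"
print(text.encode("cp1252"))   # b'T\xe4st'
print(text.encode("utf-8"))    # b'T\xc3\xa4st' - now the encodings differ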
Closed - down the line while coding I stumbled across some data with Asian characters which were displayed correctly in my text editor. Rechecking the file encodings I found the following:
If the output text only includes ASCII characters, it is detected as ANSI by Notepad++.
If the output text contains e.g. Katakana, it is detected as UTF-8 by Notepad++.
-> problem solved for me.
I know that if I use the Unicode character set in VS, I can use L"There is a string" to represent a Unicode string. I think "There is a string" will be read from the source file when VS does its lexical parsing, and it will be decoded to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always get the correct Unicode data from the L macro. Does VS detect the encoding of the source file to convert "There is a string" to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question should be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM1 signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
1https://en.wikipedia.org/wiki/Byte_order_mark
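For illustration, a small sketch in Python of what such a BOM check amounts to (the file name is a placeholder):

# Hypothetical helper: peek at the first bytes of a file and report a BOM, if any.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xff\xfe", "UTF-16 LE"),
    (b"\xfe\xff", "UTF-16 BE"),
]

def detect_bom(path):
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: the compiler falls back to the localized ANSI code page

print(detect_bom("main.cpp"))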
Can anyone please advise me on the below issue.
I have an Oracle program which takes a .CSV file as input and processes it. We are now facing an issue: when an extended ASCII character appears in the input file, it trims the letter following that special character.
We are using the file utility functions Utl_File.Fopen_Nchar() to open the file and Utl_File.Get_Line_Nchar() to read the characters in the file. The program is written in such a way that it should handle multiple languages (Unicode characters) in the input file.
During analysis we found that when the character encoding of the CSV file is UTF-8, the file is processed successfully even when extended ASCII characters as well as other Unicode characters are present. But sometimes we receive the file in Windows-1252 (ANSI - Latin I) format, which causes the trimming problem for the extended ASCII characters.
So is there any way to handle this issue? Can we open a (CSV) file in Oracle and save it in UTF-8 format if it's in another format?
Please let me know if any more info is needed.
Thanks in anticipation.
The problem is that when you don't know which encoding your CSV file is saved in, it is not possible to determine the correct conversion either. You would corrupt your CSV file.
What do you mean by "1252 (ANSI - Latin I)"?
Windows-1252 and ISO-8859-1 are not equal, see the difference here: ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode
(Sorry for posting the German Wikipedia, however the English version does not show such a nice table)
You could use the fix_latin command-line tool to convert a file from an unknown mixture of ASCII / Latin-1 / CP1252 / UTF-8 into UTF-8:
fix_latin < input.csv > output.csv
The fix_latin utility is a simple Perl script which is shipped with the Encoding::FixLatin module on CPAN.
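If, on the other hand, you know the file really is Windows-1252, a plain re-encoding pass is enough. A minimal sketch in Python (file names are placeholders):

# Re-encode a file from Windows-1252 to UTF-8; only valid if the source encoding is known.
with open("input.csv", "r", encoding="cp1252") as src:
    data = src.read()
with open("output.csv", "w", encoding="utf-8") as dst:
    dst.write(data)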
I have been writing a vbs script in notepad that adds text to an excel file, and this is working fine.
I then needed to write Unicode characters to the Excel file, so I saved the .vbs file as Unicode and again all worked fine.
I am now trying to write the file dynamically from another program, which is possible, but it writes the Unicode .vbs file with UTF-8 encoding, and then when I try to run the .vbs file, it gives an error
saying: Error: Invalid character
Code: 800A0408
Source: Microsoft VBScript compilation error
Does this mean I cannot run a file saved with UTF-8 encoding, or am I missing something?
Any help would be gratefully received!
Dave
Use UCS-2 Little Endian; that accepts Unicode characters and runs VBS properly! You can convert any existing VBS file to this format with Notepad++, for example.
C/WScript.exe can't run UTF-8 encoded .vbs files. If you can't change the encoding/write mode of that 'another program', you either have to convert the UTF-8 source to UTF-16, or write/generate the code in plain ASCII and inject the Unicode data via ChrW() or via an external file encoded as UTF-16 (easy) or UTF-8 (read it with ADODB.Stream).
WRT comment:
As long as you don't use non-ASCII characters in string literals - and you can avoid that for a few of them by using ".." & ChrW(..) & ".." - you can save the .vbs as ASCII. If your 'another program' loads and saves such a file as UTF-8 (without BOM!) it doesn't matter; but if it adds a BOM you must convert the source file.
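If you control the 'another program' that generates the .vbs file, one way to follow this advice is to have it emit ASCII-only VBScript and replace every non-ASCII character with a ChrW() call. A rough sketch of such a generator in Python (the output file name and the sample text are made up for illustration):

# Emit ASCII-only VBScript: every non-ASCII character becomes a ChrW(codepoint) call.
def vbs_literal(text):
    parts, ascii_run = [], []
    for ch in text:
        if ord(ch) < 128 and ch != '"':
            ascii_run.append(ch)
        else:
            if ascii_run:
                parts.append('"%s"' % "".join(ascii_run))
                ascii_run = []
            parts.append("ChrW(%d)" % ord(ch))   # also handles the quote char as ChrW(34)
    if ascii_run:
        parts.append('"%s"' % "".join(ascii_run))
    return " & ".join(parts) or '""'

# e.g. produces: MsgBox "Gr" & ChrW(252) & ChrW(223) & "e, Dave"
script = "MsgBox %s\r\n" % vbs_literal("Grüße, Dave")
with open("generated.vbs", "w", encoding="ascii") as f:  # plain ASCII, no BOM issues
    f.write(script)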
Perhaps you should add some more details/code to improve your chances of getting better advice.
I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the file in Notepad and it displayed ぁ. I didn't specify the encoding of the file; I just created the bytes and Notepad interpreted them correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a particular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
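That kind of heuristic boils down to "valid UTF-8 wins, otherwise fall back to the ANSI code page". A minimal sketch of the idea in Python (the cp1252 fallback is an assumption; real editors do more than this):

# Guess a text file's encoding the way an editor might: BOM first, then try UTF-8,
# otherwise assume the ANSI code page.
def guess_encoding(path, ansi_codepage="cp1252"):
    with open(path, "rb") as f:
        data = f.read(64 * 1024)       # the first few kilobytes are usually enough
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"             # explicit UTF-8 BOM
    try:
        data.decode("utf-8")
        return "utf-8"                 # decodes cleanly; very unlikely by accident
    except UnicodeDecodeError:
        return ansi_codepage           # not valid UTF-8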
How do I identify which encoding a file is in?
Open the file and try Save As...; there you can see the default (current) encoding of the file, i.e. the encoding in which it is saved.
I'm encountering a little problem with my file encodings.
Sadly, I am still not on good terms with everything where encoding matters, although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. in Ruby, I open the file, encode the string to UTF-8 and save it in another place.
Unfortunately that doesn't do the trick - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, in Notepad++ I can convert manually like this:
Select everything,
copy,
set the encoding to UTF-8 (here, \x escape sequences can be seen),
paste everything from the clipboard.
Done! The escape characters vanish and the file can be processed.
Unfortunately that doesn't do the trick - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that the ASCII characters are encoded identically in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless the file contains "special" characters. These special characters are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working; you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings.
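For a single file with a known source encoding, the command-line usage looks roughly like this (file names and the source encoding are placeholders):
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt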