How does Visual Studio resolve Unicode strings from source files with different encodings? - visual-studio

I know that if I use the Unicode character set in VS, I can write L"There is a string" to represent a Unicode string. I think There is a string is read from the source file during lexical parsing and decoded to Unicode according to the source file's encoding.
I have saved the source file with several different encodings, but I always get the correct Unicode data from the L literal. Does VS detect the encoding of the source file in order to convert There is a string to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question belongs on SO; if not, where should I ask? Thanks in advance.

VS won't detect the encoding without a BOM[1] signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
[1] https://en.wikipedia.org/wiki/Byte_order_mark
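A quick way to see what the compiler actually decoded is to dump the code units of a wide literal that contains a non-ASCII character. This is only a minimal sketch; /source-charset:utf-8 and /utf-8 are MSVC switches that let you state the source encoding explicitly instead of relying on BOM guessing.

// Minimal check of what MSVC decoded: print the code units of a wide literal
// that contains a non-ASCII character on purpose.
#include <cstdio>

int main() {
    const wchar_t text[] = L"é is U+00E9";
    for (const wchar_t* p = text; *p; ++p)
        std::printf("U+%04X ", static_cast<unsigned>(*p));
    std::printf("\n");
    // Expected first unit: U+00E9. Seeing U+00C3 U+00A9 instead means the
    // compiler read a UTF-8 (BOM-less) file using the ANSI code page.
    // To stop relying on BOM detection, pass the encoding explicitly, e.g.:
    //   cl /source-charset:utf-8 test.cpp   (or simply /utf-8)
}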

Related

How do WritePrivateProfileStringA() and Unicode go together?

I'm working on a legacy project that uses INI files and I'm currently trying to understand how Microsoft deals with INI files.
The documentation of WritePrivateProfileStringA() [MSDN] says for lpFileName:
If the file was created using Unicode characters, the function writes Unicode characters to the file. Otherwise, the function writes ANSI characters.
What exactly does that mean? What is a file "created using Unicode characters"? How does Microsoft determine whether a file was created using Unicode characters or not?
Since this is documented under lpFileName, do they refer to Unicode characters in the file name, like "if the file has a Japanese file name, we'll read it as Unicode"?
By default neither the ...A() nor the ...W() method supports Unicode as file contents for INI files. If e.g. the file does not exist, they will both create a file with ANSI content.
However, if you create the INI file first and you give it a UTF-16 BOM (byte-order-mark), both ...A() and ...W() will respect that BOM and write UTF-16 characters to the file.
Other than the BOM, the file can be empty, so a 2 byte file with 0xFF 0xFE content is enough to get the Microsoft API to write Unicode characters.
Both methods will not recognize and respect a UTF-8 BOM. In fact, a UTF-8 BOM can break an existing file if the UTF-8 BOM and the first section are both in line 1. In that case you can't access any of the keys in the affected section. If the first section is in line 2, the UTF-8 BOM will have no effect.
My tests on Windows 10 21H1 cannot confirm a statement about UTF16-BE support from 2006:
Just for fun, you can even reverse the BOM bytes and WritePrivateProfileString will write to it as a UTF-16 BE (Big Endian) file!
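To make the BOM trick concrete, here is a small Win32 sketch: it seeds an empty INI file with the two UTF-16 LE BOM bytes and then lets the profile API write Unicode content to it. The path, section and key names are made up for the example.

// Sketch: create an INI file that starts with a UTF-16 LE BOM, then let the
// profile API store Unicode characters in it.
#include <windows.h>
#include <stdio.h>

int main() {
    const wchar_t* path = L"C:\\temp\\settings.ini";  // hypothetical path

    // Two bytes, 0xFF 0xFE, are enough to switch the profile API to UTF-16
    // for this file (only do this when creating the file from scratch).
    FILE* f = nullptr;
    if (_wfopen_s(&f, path, L"wb") == 0 && f) {
        const unsigned char bom[] = { 0xFF, 0xFE };
        fwrite(bom, 1, sizeof(bom), f);
        fclose(f);
    }

    // Both the A and W variants will now write UTF-16 characters to this file.
    WritePrivateProfileStringW(L"Section", L"Key", L"caf\u00e9 latte", path);
    return 0;
}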

What encoding type does VB6 use to encode forms, classes and modules?

I am writing a C# script that instruments some VB6 code.
I need to output the generated code using the same encoding as VB6 to avoid losing character data. Right now it keeps changing Latin characters, like é, to null characters.
Notepad++ guesses that VB6 files are ASCII by default, but since they don't have a BOM that's just an educated guess. I have tried encoding with ASCII and it still loses character data.
Any ideas?
After testing, encoding as windows-1252 preserves the characters.
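The asker's tool is a C# script, but the underlying point is simply "encode as Windows code page 1252, not ASCII". A hedged Win32/C++ sketch of what that means at the byte level; the output file name and the sample line are invented for the example.

// Sketch: encode text as Windows-1252 so characters such as 'é' survive in a
// VB6 .frm/.bas/.cls file. File name and content are placeholders.
#include <windows.h>
#include <fstream>
#include <string>

int main() {
    std::wstring line = L"Private Const Msg = \"Caf\u00e9\"";  // sample VB6 line

    int len = WideCharToMultiByte(1252, 0, line.data(), (int)line.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string bytes(len, '\0');
    WideCharToMultiByte(1252, 0, line.data(), (int)line.size(),
                        &bytes[0], len, nullptr, nullptr);

    std::ofstream out("Module1.bas", std::ios::binary);
    out.write(bytes.data(), (std::streamsize)bytes.size());
}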

What does rb:bom|utf-8 mean in CSV.open in Ruby?

What does the 'rb:bom|utf-8' mean in:
CSV.open(csv_name, 'rb:bom|utf-8', headers: true, return_headers: true) do |csv|
I can understand that:
r means read
bom is a file format with \xEF\xBB\xBF at the start of a file to indicate endianness.
utf-8 is a file format
But:
I don't know how these fit together, and why it is necessary to write all of this just to read a CSV.
I'm struggling to find the documentation for this. It doesn't seem to be documented in https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html
Update:
Found a very useful documentation:
https://ruby-doc.org/core-2.6.3/IO.html#method-c-new-label-Open+Mode
(The accepted answer is not incorrect but incomplete)
rb:bom|utf-8 converted to a human readable sentence means:
Open the file for reading (r) in binary mode (b) and look for a Unicode BOM marker (bom) to detect the encoding or, in case no BOM marker is found, assume UTF-8 encoding (utf-8).
A BOM marker can be used to detect whether a file is UTF-8 or UTF-16 and, if it is UTF-16, whether it is little- or big-endian UTF-16. There is also a BOM marker for UTF-32, but Ruby doesn't support UTF-32 as of today. A BOM marker is just a special reserved byte sequence in the Unicode standard that is only used to detect the encoding of a file, and it must be the first "character" of that file. It's recommended and typically used for UTF-16, since UTF-16 exists in two different variants; it's optional for UTF-8, and usually, if a file is Unicode but has no BOM marker, it is assumed to be UTF-8.
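For illustration only (this is not Ruby code or Ruby's actual implementation), the detection step that bom| performs boils down to peeking at the first few bytes of the file and falling back to the named encoding when no BOM is present. A small C++ sketch of that logic, with a hypothetical file name:

// Illustration of BOM sniffing: look at the leading bytes, otherwise fall
// back to UTF-8 (the part after the '|').
#include <cstddef>
#include <cstdio>

const char* sniff(const unsigned char* b, std::size_t n) {
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 (BOM)";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)                 return "UTF-16 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)                 return "UTF-16 BE";
    return "UTF-8 (assumed, no BOM)";
}

int main() {
    unsigned char head[4] = {0};
    FILE* f = std::fopen("data.csv", "rb");  // hypothetical file name
    std::size_t n = f ? std::fread(head, 1, sizeof head, f) : 0;
    if (f) std::fclose(f);
    std::printf("%s\n", sniff(head, n));
}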
When reading a text file in Ruby you need to specify the encoding or it will revert to the default, which might be wrong.
If you're reading CSV files that are BOM encoded then you need to do it that way.
Pure UTF-8 encoding can't deal with the BOM header so you need to read it and skip past that part before treating the data as UTF-8. That notation is how Ruby expresses that requirement.

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. in Ruby, I open the file, encode the string to UTF-8, and save it somewhere else.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, in Notepad++ I can convert manually like this:
Select everything,
copy,
set the encoding to UTF-8 (at this point, \x escape sequences become visible),
paste everything back from the clipboard.
Done! The escape characters vanish and the file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that plain ASCII text is byte-for-byte identical in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" characters. These special characters are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working, but you can double-check by trying your program with special characters.
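One way to do that double-check is to scan the file for any byte >= 0x80: a file containing only 7-bit bytes is simultaneously valid ASCII, "ANSI" and UTF-8, so no conversion can (or needs to) change it. A small C++ sketch; the file name is a placeholder.

// Report whether a file contains any non-ASCII bytes, i.e. whether the
// question "ANSI or UTF-8?" even arises for it.
#include <cstdio>

bool has_non_ascii(const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool found = false;
    for (int c; (c = std::fgetc(f)) != EOF; ) {
        if (c >= 0x80) { found = true; break; }
    }
    std::fclose(f);
    return found;
}

int main() {
    std::printf("needs a real conversion: %s\n",
                has_non_ascii("input.txt") ? "yes" : "no");
}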
Also, one of the best utilities for converting between encodings is iconv, which also has Ruby bindings.

Converting Multibyte characters to UTF-8

My application has to write data to an XML file which will be read by a swf file. The swf expects the data in the XML to be in UTF-8 encoding. I have to convert some Multibyte characters in my app(Chinese simplified, Japanese, Korean etc..) to UTF-8.
Are there any API calls which would allow me to do this? I would prefer not to use any 3rd-party DLLs. I need to do it both on Windows and on Mac, and would prefer system APIs if available.
Thanks
jbsp72
UTF-8 is a multibyte encoding (well, a variable byte-length encoding to be precise). Stating that you need to convert from a multibyte encoding is not enough; you need to specify which multibyte encoding your source is in.
I have to convert some Multibyte characters in my app (Chinese simplified, Japanese, Korean etc..) to UTF-8.
If your original string is in a multibyte encoding (Chinese/Arabic/Thai/etc.) and you need to convert it to another multibyte encoding (UTF-8), one way is to convert to wide characters (UTF-16) first, then convert back to multibyte:
multibyte (Chinese/Arabic/Thai/etc.) -> widechar (UTF-16) -> multibyte (UTF-8)
If your original string is already in Unicode (UTF-16), you can skip the first conversion in the above illustration.
You can look up the code page identifiers on MSDN.
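A sketch of that two-step conversion using only the Win32 calls mentioned above (this covers the Windows side; on the Mac you would reach for something like CFString or iconv instead). Code page 936 is used here merely as an example for Simplified Chinese, and the helper name to_utf8 is invented for the sketch; error handling is omitted for brevity.

// Step 1: source code page -> UTF-16; Step 2: UTF-16 -> UTF-8.
#include <windows.h>
#include <string>

std::string to_utf8(const std::string& input, UINT source_cp /* e.g. 936 */) {
    // Source code page -> UTF-16 ("wide characters")
    int wlen = MultiByteToWideChar(source_cp, 0, input.data(), (int)input.size(),
                                   nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(source_cp, 0, input.data(), (int)input.size(),
                        &wide[0], wlen);

    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                        &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}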
Google Chrome has some string conversion implementations for Windows, Linux, and Mac in the Chromium source. The files are under src/base:
+ sys_string_conversions.h
+ sys_string_conversions_linux.cc
+ sys_string_conversions_win.cc
+ sys_string_conversions_mac.mm
The code uses BSD license so you can use it for commercial projects.
