I'm working on a legacy project that uses INI files and I'm currently trying to understand how Microsoft deals with INI files.
The documentation of WritePrivateProfileStringA() [MSDN] says for lpFileName:
If the file was created using Unicode characters, the function writes Unicode characters to the file. Otherwise, the function writes ANSI characters.
What exactly does that mean? What is a file "created using Unicode characters"? How does Microsoft determine whether a file was created using Unicode characters or not?
Since this is documented under lpFileName, do they refer to Unicode characters in the file name, like "if the file has a Japanese file name, we'll read it as Unicode"?
By default neither the ...A() nor the ...W() method supports Unicode as file contents for INI files. If e.g. the file does not exist, they will both create a file with ANSI content.
However, if you create the INI file first and you give it a UTF-16 BOM (byte-order-mark), both ...A() and ...W() will respect that BOM and write UTF-16 characters to the file.
Other than the BOM, the file can be empty, so a 2 byte file with 0xFF 0xFE content is enough to get the Microsoft API to write Unicode characters.
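For illustration, here is a minimal Python sketch of how such a BOM-only file could be prepared before the INI API is first called on it. The function names and the file name settings.ini are hypothetical; only the two bytes 0xFF 0xFE come from the description above.

```python
import os
import tempfile

UTF16_LE_BOM = b"\xff\xfe"  # the two bytes described above


def create_bom_only_ini(path):
    """Create a new file that contains nothing but the UTF-16 LE BOM.

    Mode "xb" refuses to overwrite an existing file.
    """
    with open(path, "xb") as f:
        f.write(UTF16_LE_BOM)


def starts_with_utf16_le_bom(path):
    """Return True if the first two bytes of the file are FF FE."""
    with open(path, "rb") as f:
        return f.read(2) == UTF16_LE_BOM


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    ini_path = os.path.join(demo_dir, "settings.ini")  # hypothetical name
    create_bom_only_ini(ini_path)
    print(os.path.getsize(ini_path), starts_with_utf16_le_bom(ini_path))
```

The sketch only prepares the file; whether the Windows API then writes UTF-16 into it is the behavior reported in this answer, not something the script verifies.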
Both methods will not recognize and respect a UTF-8 BOM. In fact, a UTF-8 BOM can break an existing file if the UTF-8 BOM and the first section are both in line 1. In that case you can't access any of the keys in the affected section. If the first section is in line 2, the UTF-8 BOM will have no effect.
My tests on Windows 10 21H1 cannot confirm a statement about UTF-16 BE support from 2006:
Just for fun, you can even reverse the BOM bytes and WritePrivateProfileString will write to it as a UTF-16 BE (Big Endian) file!
What does the 'rb:bom|utf-8' mean in:
CSV.open(csv_name, 'rb:bom|utf-8', headers: true, return_headers: true) do |csv|
I can understand that:
r means read
bom is a file format with \xEF\xBB\xBF at the start of a file to indicate endianness.
utf-8 is a file format
But:
I don't know how they fit together and why it is necessary to write all of this to read a CSV.
I'm struggling to find the documentation for this. It doesn't seem to be documented in https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html
Update:
Found some very useful documentation:
https://ruby-doc.org/core-2.6.3/IO.html#method-c-new-label-Open+Mode
(The accepted answer is not incorrect but incomplete)
rb:bom|utf-8 converted to a human readable sentence means:
Open the file for reading (r) in binary mode (b) and look for a Unicode BOM (bom) to detect the encoding or, in case no BOM is found, assume UTF-8 encoding (utf-8).
A BOM can be used to detect whether a file is UTF-8 or UTF-16 and, if it is UTF-16, whether it is little-endian or big-endian. There is also a BOM for UTF-32, yet Ruby doesn't support UTF-32 as of today. A BOM is just a special reserved byte sequence in the Unicode standard that is only used to detect the encoding of a file, and it must be the first "character" of that file. It is recommended and typically used for UTF-16, which exists in two variants. It is optional for UTF-8, and a Unicode file without a BOM is usually assumed to be UTF-8.
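The "check for a BOM, otherwise fall back" rule can be sketched in a few lines. This is a Python analogue written for illustration, not Ruby's actual implementation; the function name decode_bom_or is made up, and the sketch also checks the UTF-32 signatures, which goes beyond what is said about Ruby above.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_bom_or(data, fallback="utf-8"):
    """Decode bytes by their BOM if one is present, else use the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the signature, then decode the rest with that encoding.
            return data[len(bom):].decode(name)
    return data.decode(fallback)


if __name__ == "__main__":
    print(decode_bom_or(b"\xef\xbb\xbfa,b"))  # UTF-8 with BOM
    print(decode_bom_or(b"\xff\xfea\x00"))    # UTF-16 LE with BOM
    print(decode_bom_or(b"a,b"))              # no BOM, falls back to UTF-8
```

One known weakness of this ordering: a UTF-16 LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE.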
When reading a text file in Ruby you need to specify the encoding or it will revert to the default, which might be wrong.
If you're reading CSV files that are BOM encoded then you need to do it that way.
Pure UTF-8 encoding can't deal with the BOM header so you need to read it and skip past that part before treating the data as UTF-8. That notation is how Ruby expresses that requirement.
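The effect of not skipping the BOM is easy to reproduce. The sketch below uses Python rather than Ruby, purely as an illustration: Python's "utf-8-sig" codec plays roughly the role of bom|utf-8 for UTF-8 input, and header_row is a hypothetical helper name.

```python
import csv
import io

# A tiny CSV file as raw bytes, starting with the UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,age\nAda,36\n"


def header_row(data, encoding):
    """Decode the bytes with the given encoding and return the first CSV row."""
    return next(csv.reader(io.StringIO(data.decode(encoding))))


if __name__ == "__main__":
    # Plain UTF-8 keeps the BOM as U+FEFF glued to the first header name.
    print(header_row(raw, "utf-8"))      # ['\ufeffname', 'age']
    # The BOM-aware codec drops it.
    print(header_row(raw, "utf-8-sig"))  # ['name', 'age']
```

With plain UTF-8 the first column can no longer be looked up by the name "name", which is the kind of surprise the BOM-aware mode avoids.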
I have a file whose character encoding is set to ANSI, yet I can still copy a character from the UTF-8 character set into it. Is the character set defined for a file enforced on the entire file? I am trying to understand how character sets work. Thanks
Files are bytes. They are long sequences of numbers. In most operating systems, that's all they are. There is no "encoding" attached to the file. The file is bytes.
It is up to software to interpret those bytes as having some meaning. For example, there is nothing fundamentally different between a "picture file" and a "text file." Both are just long sequences of numbers. But software interprets the "picture file" using some encoding rules to create a picture. Similarly, software interprets the "text file" using some encoding rules.
Most text file formats do not include their encoding anywhere in the format. It's up to the software to know or infer what it is. Sometimes the operating system assists here and provides additional metadata that's not in the file, like filename extensions. This generally doesn't help for text files, since in most systems text files do not have different extensions based on their encoding.
Many characters are encoded identically in ANSI and in UTF-8 (the entire ASCII range, for example). So just looking at a file, it may be impossible to tell which encoding it was written with, since the bytes could be identical in both. There are byte sequences that are illegal in UTF-8, so it is possible to determine that a file is not valid UTF-8, but all byte sequences are valid ANSI (though there are byte sequences that are very rare, and so can be used to guess that it's not ANSI).
(I assume you mean Windows-1252; there isn't really such a thing as "ANSI" encoding.)
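A short Python sketch makes the overlap concrete. It assumes, as above, that "ANSI" means Windows-1252 (Python's "cp1252" codec); the helper name is_valid_utf8 is made up for the example.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_text = "key=value"
accented = "caf\u00e9"  # "café"

# Pure ASCII text produces identical bytes in both encodings.
same_bytes = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# The Windows-1252 bytes for "café" end in a lone 0xE9, which is not
# a legal UTF-8 sequence.
cp1252_bytes = accented.encode("cp1252")  # b'caf\xe9'
utf8_bytes = accented.encode("utf-8")     # b'caf\xc3\xa9'

if __name__ == "__main__":
    print(same_bytes)                   # True
    print(is_valid_utf8(cp1252_bytes))  # False
    # The UTF-8 bytes do decode as Windows-1252, just to the wrong text.
    print(utf8_bytes.decode("cp1252"))  # cafÃ©
```

The last line is the typical symptom of reading UTF-8 data as an 8-bit code page: no error, just garbled characters.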
I know that if I use the Unicode charset in VS, I can use L"There is a string" to represent a Unicode string. I think There is a string will be read from the source file when VS does lexical parsing, and it will be decoded to Unicode from the source file's encoding.
I have changed the source file to several different encodings, but I always got the correct Unicode data from the L macro. Does VS detect the encoding of the source file to convert There is a string to the correct Unicode? If not, how does VS achieve this?
I'm not sure whether this question can be asked on SO; if not, where should I ask? Thanks in advance.
VS won't detect the encoding without a BOM [1] signature at the start of a source file. It will just assume the localized ANSI encoding if no BOM is present.
A BOM signature identifies the UTF8/16/32 encoding used. So if you save something as UTF-8 (VS will add a BOM) and remove the first 3 bytes (EF BB BF), then the file will be interpreted as CP1252 on US Windows, but GB2312 on Chinese Windows, etc.
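To see why the result depends on the machine, the same BOM-less bytes can be decoded under both code pages. This is only a Python illustration of the byte-level effect, not of what the compiler does internally; the sample string is an arbitrary choice.

```python
text = "\u4e2d\u6587"       # the two characters of "中文", chosen as a sample
raw = text.encode("utf-8")  # bytes of a UTF-8 file whose BOM was removed

# Each system decodes the same bytes with its own default code page.
# errors="replace" keeps the demo from raising on undecodable bytes.
as_cp1252 = raw.decode("cp1252", errors="replace")  # e.g. US Windows
as_gb2312 = raw.decode("gb2312", errors="replace")  # e.g. Chinese Windows

if __name__ == "__main__":
    print(repr(as_cp1252))
    print(repr(as_gb2312))
    print(as_cp1252 == text, as_gb2312 == text)  # False False
```

Neither code page recovers the original text, and the two wrong results differ from each other, so a string literal in such a file would end up with different contents depending on where it was compiled.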
You are on Chinese Windows, so either save as GB2312 (without BOM) or UTF8 (with BOM) for VS to decode your source code correctly.
[1] https://en.wikipedia.org/wiki/Byte_order_mark
Can anyone please advise me on the below issue.
I have an Oracle program which takes a .CSV file as input and processes it. We are now facing an issue: when an extended ASCII character appears in the input file, the program trims the letter that follows that special character.
We are using the file utility function Utl_File.Fopen_Nchar() to open the file and Utl_File.Get_Line_Nchar() to read the characters in the file. The program is written in such a way that it should handle multiple languages (Unicode characters) in the input file.
Our analysis found that when the character encoding of the CSV file is UTF-8, the file is processed successfully even when it contains extended ASCII characters as well as Unicode characters. But sometimes we get the file in 1252 (ANSI - Latin I) format, which causes the trimming problem for extended ASCII characters.
So is there any way to handle this issue? Can we open a (CSV) file in Oracle and save it in UTF-8 format if it's in any other format?
Please let me know if any more info is needed.
Thanks in anticipation.
The problem is that when you don't know which encoding your CSV file is saved in, it is not possible to determine the right conversion either. You would screw up your CSV file.
What do you mean by "1252 (ANSI - Latin I)"?
Windows-1252 and ISO-8859-1 are not equal, see the difference here: ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode
(Sorry for posting the German Wikipedia, however the English version does not show such a nice table)
You could use the fix_latin command-line tool to convert a file from an unknown mixture of ASCII / Latin-1 / CP1252 / UTF8 into UTF8:
fix_latin < input.csv > output.csv
The fix_latin utility is a simple Perl script which is shipped with the Encoding::FixLatin module on CPAN.
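If installing a Perl module is not an option, a much cruder pre-processing step could look like the Python sketch below. It is not the fix_latin algorithm: it treats the whole input as one encoding, tries UTF-8 first, and assumes Windows-1252 otherwise. The function name to_utf8 and the cp1252 fallback are assumptions made for this example.

```python
def to_utf8(data, fallback="cp1252"):
    """Return the input re-encoded as UTF-8 (without a BOM).

    Valid UTF-8 is passed through (a leading BOM is dropped); anything
    else is assumed to be in the fallback code page.
    """
    try:
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Bytes the fallback code page leaves undefined become U+FFFD.
        text = data.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    latin_input = b"caf\xe9,1\n"     # Windows-1252 bytes
    utf8_input = b"caf\xc3\xa9,1\n"  # already UTF-8
    print(to_utf8(latin_input) == utf8_input)  # True
    print(to_utf8(utf8_input) == utf8_input)   # True
```

The fallback is a guess: a file in some other 8-bit code page would be converted without any error but to the wrong characters, which is the risk described above.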
I have been reading about the issue with trying to figure out the actual encoding of a file and all its complications.
But I just need to know what the encoding of a file was set to when it was saved. Does Windows store this information somewhere, similar to file type, date modified, etc.?
That's not available. The Windows file system (NTFS) doesn't store any metadata for a file beyond the trivial stuff like name, extension, last written date, etcetera. Nothing that's specific for the file type.
All you have available is the BOM, bytes at the beginning of the file that indicate the UTF encoding and byte order. It only exists for files encoded in UTF and, unfortunately, is optional. The real troublemakers, however, are text files that were encoded with a particular 8-bit non-Unicode code page, usually created by a legacy application. There is nothing you can do about those but hope that the file wasn't created too far away from your machine, so that the default system code page is a match.
No operating system stores information about the encoding with a file. Encoding is a property of text files only. Since some text files do not have a .txt extension and some .txt files are not really text files, associating an encoding with a file does not make much sense.
Some UTF-8 files store the byte order mark (BOM) at the beginning of the file, which can be used to check whether it is a UTF-8 file or not. However, the BOM is not always present, and a UTF-8 file does not need to have one. So the only way to determine the encoding of a text file is to open it with different encodings until you can read the file.
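The trial-and-error approach can be sketched as follows, again in Python and only as an illustration. The candidate list is an arbitrary assumption, and the function name encodings_that_decode is made up.

```python
def encodings_that_decode(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return every candidate encoding that decodes the bytes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    # Pure ASCII bytes are readable under every candidate.
    print(encodings_that_decode(b"hello"))
    # A lone 0xE9 rules out UTF-8 but leaves both 8-bit encodings.
    print(encodings_that_decode(b"caf\xe9"))
```

Decoding without an error only rules encodings out. Several candidates usually remain, and choosing between them still requires looking at the text or knowing where the file came from.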