Does a file on Windows have an encoding attribute?

I have been reading about the issue of trying to figure out the actual encoding of a file and all its complications.
But I just need to know what encoding a file was saved with. Does Windows store this information somewhere, similar to file type, date modified, etc.?

That's not available. The Windows file system (NTFS) doesn't store any metadata for a file beyond the trivial stuff like name, extension, last-written date, etcetera. Nothing that's specific to the file type.
All you have available is the BOM, the bytes at the beginning of the file that indicate the UTF encoding and byte order. It only exists for files encoded in one of the UTF encodings and, unfortunately, it is optional. The real troublemakers, however, are text files that were encoded with a particular 8-bit non-Unicode code page, usually created by a legacy application. There is nothing you can do about those but hope the file wasn't created too far away from your machine, so that the default system code page is a match.
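As a minimal sketch of what you can do with that, here is a BOM check in C; the helper name detect_bom is just for illustration, and a file with no BOM tells you nothing, as the answer says:

#include <stdio.h>

/* Reports which BOM, if any, starts the file. */
const char *detect_bom(const char *path)
{
    unsigned char b[4] = { 0 };
    FILE *f = fopen(path, "rb");
    if (!f)
        return "cannot open file";
    size_t n = fread(b, 1, sizeof b, f);
    fclose(f);

    /* Check the longer marks first: FF FE 00 00 would otherwise match UTF-16 LE. */
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32 LE BOM";
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32 BE BOM";
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8 BOM";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16 LE BOM";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16 BE BOM";
    return "no BOM (UTF-8 without BOM, or some 8-bit code page)";
}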

No operating system stores encoding information for a file; encoding is a property of text files only. Since some text files do not have a .txt extension and some .txt files are not really text files, associating an encoding with every file does not make much sense.
Some UTF-8 files store a byte order mark (BOM) at the beginning of the file, which can be used to check whether the file is UTF-8 or not. However, the BOM is not always present, and a UTF-8 file does not need to have one. So the only way to determine the encoding of a text file is to open it with different encodings until you can read the file.
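On Windows, one cheap trial is to ask the API whether the bytes are valid UTF-8 at all. A minimal sketch (the helper name is made up; note that legacy code pages accept almost any byte sequence, so they cannot be ruled out this way):

#include <windows.h>

/* Returns nonzero if the buffer decodes cleanly as UTF-8.
   MB_ERR_INVALID_CHARS makes the call fail on malformed sequences. */
int decodes_as_utf8(const char *data, int len)
{
    return len > 0 &&
           MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                               data, len, NULL, 0) != 0;
}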

Related

How do WritePrivateProfileStringA() and Unicode go together?

I'm working on a legacy project that uses INI files and I'm currently trying to understand how Microsoft deals with INI files.
The documentation of WritePrivateProfileStringA() [MSDN] says for lpFileName:
If the file was created using Unicode characters, the function writes Unicode characters to the file. Otherwise, the function writes ANSI characters.
What exactly does that mean? What is a file "created using Unicode characters"? How does Microsoft determine whether a file was created using Unicode characters or not?
Since this is documented under lpFileName, do they refer to Unicode characters in the file name, like "if the file has a Japanese file name, we'll read it as Unicode"?
By default, neither the ...A() nor the ...W() function supports Unicode file contents for INI files. If, for example, the file does not exist, both will create a file with ANSI content.
However, if you create the INI file first and give it a UTF-16 BOM (byte order mark), both ...A() and ...W() will respect that BOM and write UTF-16 characters to the file.
Other than the BOM, the file can be empty, so a 2-byte file with the content 0xFF 0xFE is enough to get the Microsoft API to write Unicode characters.
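A minimal sketch of that trick in C, assuming a writable path such as C:\temp\test.ini (the path, section, and key names here are invented for illustration):

#include <windows.h>

int main(void)
{
    const wchar_t *path = L"C:\\temp\\test.ini";  /* hypothetical path */

    /* Create the file containing nothing but the UTF-16 LE BOM (0xFF 0xFE).
       CREATE_NEW means this step is skipped if the file already exists. */
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        const unsigned char bom[2] = { 0xFF, 0xFE };
        DWORD written = 0;
        WriteFile(h, bom, sizeof bom, &written, NULL);
        CloseHandle(h);
    }

    /* Because the BOM is already there, the profile API now writes
       UTF-16 content instead of ANSI. */
    WritePrivateProfileStringW(L"Section", L"Key", L"caf\u00e9", path);
    return 0;
}

Opening the resulting file in a hex viewer should show UTF-16 LE text after the BOM, non-ANSI characters included.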
Neither function recognizes or respects a UTF-8 BOM. In fact, a UTF-8 BOM can break an existing file if the UTF-8 BOM and the first section are both on line 1; in that case you can't access any of the keys in the affected section. If the first section is on line 2, the UTF-8 BOM has no effect.
My tests on Windows 10 21H1 could not confirm this statement about UTF-16 BE support from 2006:
Just for fun, you can even reverse the BOM bytes and WritePrivateProfileString will write to it as a UTF-16 BE (Big Endian) file!

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text files in random encoding formats: UCS-2LE, ANSI, UTF-8, UCS-2BE, etc. I have to convert these files to UTF-8.
For the conversion I am using the following command:
iconv options -f from-encoding -t utf-8 < inputfile > outputfile
But if an incorrect from-encoding is provided, an incorrect file is generated.
I want a way to find the input file's encoding.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it will guess the encoding correctly. Otherwise, try several encodings with iconv until you "feel" that the result is acceptable (for example, if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc. until you recognize a good Russian poem).
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding was used (and there is no way to determine that encoding reliably in all cases: some byte sequences are valid in, and interpreted differently by, several encodings).
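If you want to script that trial-and-error instead of eyeballing it, here is a rough sketch using the POSIX iconv(3) API; the candidate list is only an example, and a clean conversion merely proves the bytes are valid in that encoding, not that it is the right one:

#include <iconv.h>
#include <stdio.h>
#include <errno.h>

/* Returns 1 if the whole buffer converts to UTF-8 without errors.
   The converted output is thrown away; we only care about validity. */
static int converts_cleanly(const char *enc, char *buf, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", enc);
    if (cd == (iconv_t)-1)
        return 0;                          /* encoding name not supported */

    char out[4096];
    int ok = 1;
    while (len > 0 && ok) {
        char *outp = out;
        size_t outleft = sizeof out;
        if (iconv(cd, &buf, &len, &outp, &outleft) == (size_t)-1
                && errno != E2BIG)         /* E2BIG only means "output buffer full" */
            ok = 0;
    }
    iconv_close(cd);
    return ok;
}

int main(void)
{
    /* Candidate source encodings to try; this list is just an example. */
    const char *candidates[] = { "UTF-8", "UTF-16LE", "UTF-16BE", "CP1252" };
    char data[8192];
    size_t n = fread(data, 1, sizeof data, stdin);

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (converts_cleanly(candidates[i], data, n))
            printf("%s is a plausible source encoding\n", candidates[i]);
    return 0;
}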
(notice that the HTTP protocol can state the encoding explicitly)
In 2017, it is better to use UTF-8 everywhere (see http://utf8everywhere.org/), so ask your human partners to send you UTF-8 (hopefully most of your files are already UTF-8, since today they all should be).
(so encoding is more a social issue than a technical one)
I get text files in random encoding formats
Notice that a "random encoding" doesn't exist. You want and need to find out what character encoding (and file format) was used by the provider of that file (so you mean an "unknown encoding", not a "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of a text file, beyond "a file without zero bytes" or "a file with few control characters"? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as a hint. Knowing the media type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then an incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about the result (e.g. that it contains Russian poetry, not junk characters; but perhaps those junk characters are bytecode for some secret interpreter, or music represented in a weird fashion, or encrypted data, etc.). Raw files are just sequences of bytes; you need some extra knowledge to use them (even if you know that they use UTF-8).
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It works fine, and there is no need to give the source encoding.

How does Windows Notepad interpret characters?

I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the file in Notepad and it displayed ぁ. I didn't specify the encoding of the file; I just created the bytes, and Notepad interpreted them correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?
If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.
Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.
...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/
There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
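The check referred to above exploits the rigid bit patterns of UTF-8: a lead byte announces the sequence length and every continuation byte looks like 10xxxxxx. A simplified sketch of such a validator (not Notepad's actual code, just the idea):

#include <stddef.h>

/* Returns 1 if the buffer is structurally valid UTF-8.
   (Simplified: it validates lead/continuation bit patterns,
   not overlong forms or surrogate code points.) */
int looks_like_utf8(const unsigned char *p, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t len;
        if (p[i] < 0x80)                 len = 1;  /* plain ASCII        */
        else if ((p[i] & 0xE0) == 0xC0)  len = 2;  /* 110xxxxx lead byte */
        else if ((p[i] & 0xF0) == 0xE0)  len = 3;  /* 1110xxxx lead byte */
        else if ((p[i] & 0xF8) == 0xF0)  len = 4;  /* 11110xxx lead byte */
        else return 0;                             /* invalid lead byte  */
        if (i + len > n)
            return 0;                              /* truncated sequence */
        for (size_t k = 1; k < len; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;                          /* not a 10xxxxxx continuation */
        i += len;
    }
    return 1;
}

The three bytes from the question, E3 81 81, pass this check (an 1110xxxx lead byte followed by two 10xxxxxx continuation bytes), which is why guessing UTF-8 is the natural choice here.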
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a particular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.
How do I identify which encoding a file is in?
Open the file and try Save As...; you can see the default (current) encoding of the file (the encoding it is saved in).

Unexpected Newlines in files uploaded to z/OS

I'm uploading a file from Windows to z/OS; it was originally ASCII and was converted to EBCDIC before the upload. My problem is that when I checked the file after uploading it, I see a lot of new lines.
When I checked its hex dump, I discovered that whenever the mainframe sees a 0x15 it translates it into a newline. The file contains packed decimals, so the hex can legitimately contain, say, 0x001500001C, but when I upload it the mainframe mistakes that 0x15 for a newline. Can anyone help me with this problem?
You should put your FTP client (or library, if the upload is done by your code) into binary (IMAGE TYPE) mode instead of ASCII/EBCDIC mode if you are sending a file that is already in EBCDIC, I believe.
It depends on the type of target "file" that you're uploading to.
If you're uploading to a member that has fixed block size (e.g., FB80), you'll need to ensure all the lines are padded out with spaces before you transmit it up (in binary mode).
Text mode transfers are not suitable for binary files (and your files are binary if they contain packed decimals - there's no reliable way for FTP to detect real line-end characters).
You'll need to fix your Windows ASCII-to-EBCDIC converter to be able to generate fixed length records.
The only other option is with a REXX script on the mainframe but this would still require being able to tell the difference between a real end-of-line marker and that marker within the binary data.
You could possibly tell the presence of a packed decimal by virtue of the fact that it consisted of BCD nybbles, the last of which is 0xC or 0xD, but that could also cause false positives or negatives.
My advice: when you convert it from ASCII to EBCDIC, pad out the lines to the desired record length at the same time.
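A minimal sketch of that padding step, assuming an 80-byte fixed record length and data that is already EBCDIC (0x40 is the EBCDIC blank); the function name is made up:

#include <string.h>

#define LRECL 80           /* record length of the target FB80 member (example) */
#define EBCDIC_SPACE 0x40  /* EBCDIC blank */

/* Copies one logical record into a fixed-length record, padding the
   remainder with EBCDIC spaces. Returns -1 if the record doesn't fit. */
int pad_record(const unsigned char *rec, size_t len, unsigned char out[LRECL])
{
    if (len > LRECL)
        return -1;
    memcpy(out, rec, len);
    memset(out + len, EBCDIC_SPACE, LRECL - len);
    return 0;
}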
The other point I'd like to raise is that if you just want to look at the files on the mainframe (not use them from any code that requires EBCDIC), the ISPF editor includes a few new commands (as of z/OS 1.9 if I remember correctly).
SOURCE ASCII will display the data as ASCII rather than EBCDIC. In addition, the LF command allows you to massage the ASCII stream in an FB member to correctly fix up line endings.

OS X file duplication converts text encoding by default

All the PHP files in my workspace are encoded in Unicode (UTF-8, no BOM). I often duplicate an existing source file to use as a base for a new script. Invariably (with Path Finder or the original Finder), OS X will convert the encoding of the duplicate file to Western (Mac OS Roman).
Is there any way to make OS X behave and not convert the text encoding when duplicating a text file? Or make it use a specific text encoding (other than Western!) by default for all files with .php extension?
I highly doubt that the "conversion" you're seeing is an actual difference in the two copies of the file. OS X doesn't store any sort of encoding value for a file's content, so it's much more likely that it's whatever tool you're using to view/edit your files that's interpreting the content differently, whether it be Xcode, TextWrangler, or whatever other editor you're using. Doing a straight copy of the file, whether it be from the Finder, Path Finder, or the command line, won't ever change the actual data in the file in the process of copying.
UTF-8 and Mac OS Roman are actually the same if you don't have any characters above 0x7f, so it's very possible that a valid UTF-8 file with no BOM could be interpreted as a Mac OS Roman file. In any case, I would look into whatever program it is that's showing you that encoding to find the solution, not the file copying procedure of OS X itself.
Please note that there is indeed an extended attribute in OS X that stores the file's encoding for text files produced through -[NSString writeToFile:...] or similar facilities, which may not be getting copied along with the file.
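For reference, that attribute is com.apple.TextEncoding. A minimal sketch of reading it with the macOS xattr API (assuming the file actually carries the attribute):

#include <sys/types.h>
#include <sys/xattr.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    char value[256];
    /* macOS getxattr takes an extra position argument and an options flag. */
    ssize_t n = getxattr(argv[1], "com.apple.TextEncoding",
                         value, sizeof value - 1, 0, 0);
    if (n < 0) {
        printf("no com.apple.TextEncoding attribute on %s\n", argv[1]);
        return 0;
    }
    value[n] = '\0';
    printf("%s: %s\n", argv[1], value);  /* typically something like "utf-8;134217984" */
    return 0;
}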
Are you using the Finder to copy your files, or some other tool (an IDE)?
It's possible your original file has some extended Finder metadata in its Resource Fork that's being lost in the copying, if you don't use Finder to copy it.
-W
