How does Windows Notepad interpret characters? - windows

I was wondering how Windows interprets characters.
I made a file with a hex editor with the 3 bytes E3 81 81.
Those bytes are the ぁ character in UTF-8.
I opened the notepad and it displayed ぁ. I didn't specify the encoding of the file, I just created the bytes and the notepad interpreted it correctly.
Is notepad somehow guessing the encoding?
Or is the hex editor saving those bytes with a specific encoding?

If the file only contains these three bytes, then there is no information at all about which encoding to use.
A byte is just a byte, and there is no way to include any encoding information in it. Besides, the hex editor doesn't even know that you intended to decode the data as text.
Notepad normally uses ANSI encoding, so if it reads the file as UTF-8 then it has to guess the encoding based on the data in the file.
If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file.

Notepad makes an educated guess. I don't know the details, but loading the first few kilobytes and trying to convert them from UTF-8 is very simple, so it probably does something similar to that.

...and sometimes it gets it wrong...
https://ychittaranjan.wordpress.com/2006/06/20/buggy-notepad/

There is an easy and efficient way to check whether a file is in UTF-8. See Wikipedia: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=581360767#Advantages, fourth bullet point. Notepad probably uses this.
Wikipedia claims that Notepad used the IsTextUnicode function, which checks whether a patricular text is written in UTF-16 (it may have stopped using it in Windows Vista, which fixed the "Bush hid the facts" bug): http://en.wikipedia.org/wiki/Bush_hid_the_facts.

how to identify the file is in which encoding ....?
Go to the file and try to Save As... and you can see the default (current) encoding of the file (in which encoding it is saved).

Related

How to find file encoding type or convert any encoding type to UTF-8 in shell?

I get text file of random encoding format, usc-2le, ansi, utf-8, usc-2be etc. I have to convert this files to utf8.
For conversion am using the following command
iconv options -f from-encoding -t utf-8 <inputfile > outputfile
But if incorrect from-encoding is provided, then incorrect file is generated.
I want a way to find the input file encoding type.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it would guess the encoding correctly. Or else try several encodings to iconv till you "feel" that the result is acceptable (for example if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc.... till you recognize a good Russian poem).
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding he used (and there is no way to get that encoding reliably and in all cases : there are some byte sequences which would be valid and interpreted differently with various encodings).
(notice that the HTTP protocol mentions and explicits the encoding)
In 2017, better use UTF-8 everywhere (and you should follow that http://utf8everywhere.org/ link) so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8, since today they all should be).
(so encoding is more a social issue than a technical one)
I get text file of random encoding format
Notice that "random encoding" don't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean "unknown encoding", not "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of text file, beyond file without zero bytes, or files with few control characters? LaTeX, C source, Markdown, SQL, UUencoding, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are some bytecode to some secret interpreter, or some music represented in weird fashion, or encrypted, etc....). Raw files are just sequences of bytes, you need some extra knowledge to use them (even if you know that they use UTF-8).
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It's working fine , no need to give source encoding.

How to open a (CSV) file in oracle and save it in UTF-8 format if it is in another formats

Can anyone please advise me on the below issue.
I have an oracle program which will take a .CSV file as the input and will process it. We are now facing an issue that when there is an extended ASCII character appear in the input file, its trimming the next letter after that special character.
We are using the File utility function Utl_File.Fopen_Nchar() to open the file and Utl_File.Get_Line_Nchar() for reading the characters in the file. The program is written in such a way that it should handle multiple languages(Unicode characters) in the input file.
In the analysis its found that when the character encoding of the CSV file is UTF-8 its processing the file successfully even when extended ASCII characters as well as Unicode characters are there. But some times we are getting the file in 1252 (ANSI - Latin I) format which makes the trimming problem for extended ASCII characters.
So is there any way to handle this issue? Can we open a (CSV) file in oracle and save it in UTF-8 format if it's in any another formats?
Please let me know if any more info is needed.
Thanks in anticipation.
The problem is when you don't know in which encoding your CSV file is saved then it is not possible to determine any conversion either. You would screw up your CSV file.
What do you mean by "1252 (ANSI - Latin I)"?
Windows-1252 and ISO-8859-1 are not equal, see the difference here: ISO 8859-1 vs. ISO 8859-15 vs. Windows-1252 vs. Unicode
(Sorry for posting the German Wikipedia, however the English version does not show such a nice table)
You could use the fix_latin command-line tool convert a file from an unknown mixture of ASCII / Latin-1 / CP1251 / UTF8 into UTF8:
fix_latin < input.csv > output.csv
The fix_latin utility is a simple Perl script which is shipped with the Encoding::FixLatin module on CPAN.

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. when in Ruby, I open the file, encode the string to utf8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that most of the printable ASCII characters are the same in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" characters. These special characters are represented using multiple bytes in UTF-8.
It's well possible that your conversion is actually working, but you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.

Possible to repair garbled Chinese filenames?

I'm downloading via FTP some files with chinese names (BIG5 encoded), and Filezilla displays those filenames as garbage (as FTP cannot handle any encoding other than ASCII and UTF-8, as least the standard compliant ones).
Given a filename with garbled characters, is it possible for me to repair the encoding and get a proper filename String given that I already know the source encoding? Will the FTP client misinterpreting BIG5 as UTF-8 insert bytes that make conversion back to BIG5 difficult?
My proposed steps (in Java):
1. get the garbled filename using File object.
2. getbytes using UTF-8.
3. create a new string using those bytes in BIG5.
4. Write the decoded filename back to the file.
Will the above method work?
Not every sequence of bytes is a valid ASCII or UTF-8 string so it's quite likely that some of the bytes will have been discarded, converted to the replacement character, or otherwise irreversibly mangled. So it looks like you won't be able to retrieve the original filenames if they have been modified by FileZilla to become correctly formed UTF-8 or ASCII.
You might be lucky to be able to get a certain percentage of the original characters back, where they just happened to be both valid BIG5 and valid UTF-8, but I doubt you will be able to recover the entire filename.
You could post a few examples of your garbled filenames (as raw bytes encoded in hex) to get a more definite answer. That way we can see exactly what the damage is.

Unexpected Newlines in files uploaded to z/OS

I'm uploading a file that was originally ASCII and converted to EBCDIC from Windows OS to z/OS. My problem is that when I checked the file after uploading it, I see a lot of new lines.
When I tried to check it with its hex dump I discovered that when mainframe sees a x'15' it translates it into a newline. In the file there are packed decimals so the hex could contain let say a x'001500001c' but when I upload it, mainframe mistook it as a new line. Can anyone help me with this problem?
You should put your FTP client (or library if the upload is done by your code) into binary (IMAGE TYPE) mode instead of ascii/EBCDIC if you are sending a file already in EBCDIC i believe.
It depends on the type of target "file" that you're uploading to.
If you're uploading to a member that has fixed block size (e.g., FB80), you'll need to ensure all the lines are padded out with spaces before you transmit it up (in binary mode).
Text mode transfers are not suitable for binary files (and your files are binary if they contain packed decimals - there's no reliable way for FTP to detect real line-end characters).
You'll need to fix your Windows ASCII-to-EBCDIC converter to be able to generate fixed length records.
The only other option is with a REXX script on the mainframe but this would still require being able to tell the difference between a real end-of-line marker and that marker within the binary data.
You could possibly tell the presence of a packed decimal by virtue of the fact that it consisted of BCD nybbles, the last of which is 0xC or 0xD, but that could also cause false positives or negatives.
My advice: when you convert it from ASCII to EBCDIC, pad out the lines to the desired record length at the same time.
The other point I'd like to raise is that if you just want to look at the files on the mainframe (not use them from any code that requires EBCDIC), the ISPF editor includes a few new commands (as of z/OS 1.9 if I remember correctly).
SOURCE ASCII will display the data as ASCII rather than EBCDIC. In addition, the LF command allows you to massage the ASCII stream in an FB member to correctly fix up line endings.

Resources