Does FTP support the UTF character set?

I want to transfer a file that contains non-English characters from one system to another. Does FTP support a UTF character set?

What we're talking about is filename Unicode support. To transfer a file which is UTF-8 encoded, use "binary" mode.
Although RFC 2640 extended the original FTP specification to support non-ASCII filenames, not every FTP server or FTP client supports it.
You can check whether your server supports it by sending the following command from your FTP client:
FEAT
and check for:
UTF8
in the response. If it is not listed, you will have to guess the 8-bit encoding used by the remote side or restrict your filenames to ASCII.
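For example, a minimal check with Python's standard ftplib might look like this (the host name is a placeholder; anonymous login is assumed):

from ftplib import FTP

with FTP("ftp.example.com") as ftp:          # hypothetical server
    ftp.login()                              # anonymous login
    features = ftp.sendcmd("FEAT")           # multi-line reply listing server features
    if "UTF8" in features.upper():
        print("Server advertises UTF-8 filename support (RFC 2640)")
    else:
        print("No UTF8 feature; filenames use the server's local 8-bit encoding")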

If you are talking about the FTP protocol itself, it seems that it is supported.
The FTP protocol is specified in RFC 959, which was published in 1985. It is designed on top of the original Telnet protocol, which is specified in RFC 854. The relevant sections of the Telnet specification for FTP are those covering the Network Virtual Terminal (NVT). According to RFC 854, the NVT requires the use of (7-bit) ASCII as the character set; use of any other character set requires explicit negotiation. ASCII contains only 128 different characters: English letters and digits, punctuation characters and a few control characters. Accented letters, umlauts or other scripts are not contained in the ASCII character set. In order to support non-English characters, the FTP specification was extended in 1999 by RFC 2640. This extension requires the use of UTF-8 as the character encoding. UTF-8 is a strict superset of ASCII: every valid ASCII character is encoded as the same byte in UTF-8, and UTF-8 can represent any valid Unicode character, including umlauts, accented letters and other scripts. This extension is fully backwards compatible with RFC 959. As long as you're using only English characters, it doesn't matter whether the software you are using supports RFC 2640 or not. However, if you use non-English characters without RFC 2640 compatible software, there will be problems--problems which are entirely self-made by not obeying the specifications.
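As a small illustration of that superset relationship, here is a quick Python sketch (the filenames are made up):

# ASCII text is byte-for-byte identical in UTF-8; non-ASCII characters need multi-byte sequences.
"report.txt".encode("ascii")      # b'report.txt'
"report.txt".encode("utf-8")      # b'report.txt'  - identical bytes
"Übersicht.txt".encode("utf-8")   # b'\xc3\x9cbersicht.txt' - the Ü becomes two bytes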

Related

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know that the:
"‚" character (0x82) should originally be "é" (e with acute accent)
"‹" character (0x8B) should originally be "ő" (o with double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple encoding/decoding steps happened, but I was not able to reproduce the original characters.
I put an expanded version of my comment as an answer:
Your viewer uses CP1252 (English and Western Europe, also called "ANSI" on Windows), CP1250 (Eastern Europe), or another similar code page. Most characters are coded in the same way in these code pages, with just a few language-specific differences. Your example does not include any character that differs between the two encodings, so I cannot say precisely which one it is.
Those code pages are used on Microsoft Windows, and they are based on (but not 100% compatible with) Latin-1, so it is common to see text interpreted with such an encoding. macOS and Linux now use UTF-8 almost everywhere. Windows uses Unicode internally (as UTF-16).
The old encoding is probably CP437, the standard code page in DOS, which was therefore also used frequently for CSV files. Other common old encodings are CP850 (Western Europe) and CP852 (Central Europe).
As for the other questions you put in the comments: if you are asking about tools, you should go to Super User. Some editors allow you to specify the encoding; browsers (opening a local file) also let you choose the encoding and may let you copy the text out as Unicode [not sure]; other tools sometimes have hidden options to import files with a specific encoding. If you want to do it programmatically, ask a new question on this site and specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): Python has many encodings built in, and you just specify them when reading and writing files. R can also be instructed which input encoding to use.
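A small Python sketch that reproduces this diagnosis; the code pages are a guess based on the answer above (CP852 as the original DOS encoding, CP1252 as the viewer's encoding):

# Reproduce the mojibake: encode the Hungarian originals as CP852 (DOS, Central Europe)
# and decode the resulting bytes as CP1252 (Windows "ANSI").
for original in ("Német", "Tető"):
    garbled = original.encode("cp852").decode("cp1252")
    print(original, "->", garbled)       # Német -> N‚met, Tető -> Tet‹
# To repair, apply the inverse: garbled.encode("cp1252").decode("cp852")

This matches the observed strings exactly, which supports the CP852-viewed-as-CP1252 theory.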
I wrote my own utility that helped me to diagnose and fix many thorny encoding issues. It is available as part of an open-source library. The utility converts any String to a Unicode escape sequence and vice versa. All you have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
And it will return the String "\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064"
The same works for any String in any language, including special characters. The article "Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison" explains the library and where to get it (it is available both on Maven Central and GitHub); in the article, search for the paragraph "String Unicode converter". So when you read your String, convert it and see what comes up. This way you will see which symbols are actually there and whether the information is correct and only distorted by a wrong encoding, or whether the information itself is lost. You can easily find tables on the internet that map any symbol to its Unicode code point.
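If you prefer not to pull in a library, a similar diagnostic is easy in plain Python: dump the actual code points of the suspect strings (the sample strings below are the ones from the question):

# Show exactly which Unicode code points a string contains.
def dump_codepoints(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)

print(dump_codepoints("N‚met"))   # U+004E U+201A U+006D U+0065 U+0074
print(dump_codepoints("Tet‹"))    # U+0054 U+0065 U+0074 U+2039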

Why does Windows use ANSI code pages instead of Unicode?

When I run the command chcp in a cmd.exe window, it reports the code page used by Windows.
I think Windows uses the Unicode character set.
So, my questions are:
Why does Windows use ANSI code pages instead of Unicode?
Does Windows use UTF-16 or UCS-2? Can I check this (with a command or an MSDN link)?
Are UTF-16 and UCS-2 just encodings, or are they also character sets?
Do UTF-8, UTF-16, UTF-32, etc. have different character set sizes?
I'm confused; please, somebody, define these terms.
Historical reasons, and backwards compatibility. Windows itself is a Unicode-based OS, and has been since the NT days. But many legacy (and even current) apps are not written for Unicode. Unicode-enabled apps do not use ANSI codepages, unless they need to convert runtime data between ANSI and Unicode.
Microsoft switched to UTF-16 in Windows 2000. Before that, it used UCS-2. See Unicode in Microsoft Windows.
Both UTF-16 and UCS-2 are just encodings of the same Unicode character set. UTF-16 was invented to support encoding codepoints above U+FFFF, which UCS-2 cannot handle.
All UTFs (including many you haven't named) are just encodings of the same Unicode character set. The number in the name is the size in bits of the encoded code units (UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, etc.).
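A short Python illustration of the code-unit sizes and of a code point above U+FFFF, which forces UTF-16 to use a surrogate pair (the emoji is just an arbitrary example):

ch = "\U0001F600"               # 😀, code point U+1F600, above U+FFFF
print(ch.encode("utf-8"))       # b'\xf0\x9f\x98\x80'  - four 8-bit code units
print(ch.encode("utf-16-le"))   # b'=\xd8\x00\xde'     - surrogate pair: two 16-bit code units
print(ch.encode("utf-32-le"))   # b'\x00\xf6\x01\x00'  - one 32-bit code unit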

Oracle server encoding and file encoding

I have an Oracle server with a DAD defined with PlsqlNLSLanguage DANISH_DENMARK.WE8ISO8859P1.
I also have a JavaScript file that is loaded in the browser. The JavaScript file contains the Danish letters æøå. When the js file is saved as UTF-8 the Danish letters are garbled. When I save the js file as UTF-8 with BOM or as ANSI, the letters are shown correctly.
I am not sure what is wrong.
Try setting your DAD to
PlsqlNLSLanguage DANISH_DENMARK.UTF8
or even better
PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8
When you save your file as ANSI it typically means "Windows code page 1252" on Western Windows; see the column "ANSI codepage" in the National Language Support (NLS) API Reference. CP1252 is very similar to ISO-8859-1, see ISO 8859-1 vs. Windows-1252 (it is the German Wikipedia, but that table shows the differences much better than the English one). Hence, for a 100% correct setting you would have to set PlsqlNLSLanguage DANISH_DENMARK.WE8MSWIN1252.
Now, why do you get correct characters when you save your file as UTF8-BOM, although there is a mismatch with .WE8ISO8859P1?
When the browser opens the file it first reads the BOM 0xEF,0xBB,0xBF and assumes the file is encoded as UTF-8. However, this may fail in some circumstances, e.g. when you insert text from an input field into the database.
With PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8 you tell the Oracle database: "The web server uses UTF-8." No more, no less (in terms of character set encoding). So, when your database uses the character set WE8ISO8859P1, the Oracle driver knows it has to convert ISO-8859-1 characters coming from the database to UTF-8 for the browser, and vice versa.
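To see what the different "save as" options actually write to disk, here is a small Python sketch of the raw bytes for the Danish letters (this only illustrates the encodings and the BOM; the DAD setting itself is Oracle configuration):

import codecs

s = "æøå"
print(s.encode("cp1252"))                    # b'\xe6\xf8\xe5'  - what "ANSI" saves
print(s.encode("iso-8859-1"))                # b'\xe6\xf8\xe5'  - same bytes as CP1252 for these letters
print(codecs.BOM_UTF8 + s.encode("utf-8"))   # b'\xef\xbb\xbf\xc3\xa6\xc3\xb8\xc3\xa5' - "UTF-8 with BOM"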

Windows encoding clarification

I would like to translate a game, this game loads the strings from a text file.
The destination language uses non-ASCII characters, so I naïvely saved my file in UTF-8, but it does not work: letters with diacritics are not shown correctly.
Looking more closely at the configuration file where the name of the strings text file is stored, I found a CHARSET option that can take any of these values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
As far as I understand, these are fairly standard values in the Windows APIs, where "charset" and "character encoding" are synonymous.
So my question is: is there a correspondence between these names and standard names like UTF-8 or ISO-8859-2? If so, what is it?
Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.
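A quick Python check of one of the rearranged characters mentioned in the quote (Ą) shows why a file saved as ISO-8859-2 is not interchangeable with Windows-1250:

# The same letter gets a different byte value in the two encodings.
print("Ą".encode("iso-8859-2"))   # b'\xa1'
print("Ą".encode("cp1250"))       # b'\xa5'
print("ő".encode("iso-8859-2"))   # b'\xf5' - unchanged between the two
print("ő".encode("cp1250"))       # b'\xf5'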
The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.
You can get the standard encoding names (as registered by IANA) from the table in the Remarks section of this MSDN page.
Just find the Character set row and read the code page number; the standard name is windows-[code page number].

How to verify that a browser properly supports UTF-8 characters?

Is there a way to identify from JavaScript whether the browser encoding is set to, or supports, "UTF-8"?
I want to send "UTF-8" or plain English letters based on the browser setting, transparently (i.e. without asking the user).
Edit: Sorry, I was not very clear in the question. In a browser the encoding is normally specified as Auto-Detect, Western (Windows/ISO-8859-1) or Unicode (UTF-8). If the user has set the default to Western, then the characters I send are not readable. In this situation I want to inform the user to set the encoding to either "Auto-Detect" or "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
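For instance, a minimal sketch with Python's standard http.server that declares UTF-8 in both the HTTP header and the meta tag (host, port and page content are arbitrary):

from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = ('<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        '<body>æøå - served as UTF-8</body></html>')

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # The charset parameter tells the browser how to decode the bytes.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()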
I assume the length of the output (which you read back after outputting it) can tell you what happened. Or, without JavaScript, use the Accept-Charset HTTP header and assume UTF-8 is supported when Unicode is accepted.
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.
