I am dealing with ESC/POS commands.
With the ESC R n command you can select an international character set.
Epson offers, among others, Spain II and Latin America. In those character sets, for example, hex 5E / decimal 94 means "é" (LATIN SMALL LETTER E WITH ACUTE), whereas in ASCII/UTF-8 0x5E means "^" (CIRCUMFLEX ACCENT).
I am looking for a way to translate those character sets, or to know their entire tables (without looping over the codes and printing them on the printer; how would you even tell which one is the space?). It is not ISO 8859-1/2/3/9, nor OEM 858, nor Latin-2/code page 852, or similar charsets/codepages that I've checked.
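If you can transcribe the substitution table from Epson's ESC/POS reference (the ESC R tables typically override only a handful of code positions on top of the printer's active code table), a small lookup layer is enough to turn printer bytes back into readable text. Below is a minimal Python sketch under that assumption; the SPAIN_II_OVERRIDES table, the printer_bytes_to_text helper and the cp437 base code page are illustrative names and guesses, and only the 0x5E -> "é" entry comes from the example above, so the rest must be filled in from the Epson manual.

    # Hypothetical, partial override table for "Spain II"; only the 0x5E
    # entry is taken from the question, the other positions must be
    # transcribed from Epson's ESC/POS documentation.
    SPAIN_II_OVERRIDES = {
        0x5E: "é",   # in plain ASCII this byte would be "^"
        # 0x5B: ..., 0x5C: ..., 0x5D: ..., etc.
    }

    def printer_bytes_to_text(data, overrides, base_codepage="cp437"):
        """Decode printer bytes, applying the ESC R overrides first and
        falling back to the base code page for everything else."""
        out = []
        for b in data:
            if b in overrides:
                out.append(overrides[b])
            else:
                out.append(bytes([b]).decode(base_codepage))
        return "".join(out)

    print(printer_bytes_to_text(b"Jos\x5e", SPAIN_II_OVERRIDES))  # -> José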
Related
In Windows:
when I press Alt + 251, I get a √ character
when I press Alt + 0251, I get the û character!
A leading zero shouldn't change the value.
Actually, I want to get a check mark (√) from the Chr(251) function in a Client Report Definition (RDLC) report, but it gives me û!
I think it interprets the four digits as hex, not decimal.
Using a leading zero forces Windows to interpret the code in the Windows-1252 set. Without the 0, the code is interpreted using the OEM set.
Alt+251:
You'll get √, because you'll use OEM 437, where 251 is for square root.
I'll get ¹, because I'll use OEM 850, where 251 is for superscript 1.
Alt+0251:
Both of us will get û, because we'll use Windows-1252, where 251 is for u-circumflex.
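This is easy to reproduce with any library that ships these code pages; a quick Python sketch (cp437, cp850 and cp1252 are just Python's names for the code pages mentioned above):

    b = bytes([251])                 # the value typed after Alt

    print(b.decode("cp437"))         # √  -- OEM 437, no leading zero (US)
    print(b.decode("cp850"))         # ¹  -- OEM 850, no leading zero (Europe)
    print(b.decode("cp1252"))        # û  -- Windows-1252, with the leading zero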
This is historical.
From ASCII to Unicode
At the beginning of DOS/Windows, characters were one byte wide and came from the American alphabet; the byte-to-character conversion was defined by the ASCII encoding.
Additional characters were needed as soon as PCs were used outside the US (many languages use accents, for instance), so different codepages were designed and different encoding tables were used for the conversion.
But a computer in the US wouldn't use the same codepage as one in Spain. This required the user and the programmer to keep track of the currently active codepage, and this has been a great period in the history of computing...
Around the same period it was determined that a single byte was not going to be enough: more than 256 characters needed to be available at the same time. Different encoding systems were designed by a consortium and are collectively known as Unicode.
In Unicode encodings, "characters" can be one to four bytes wide, and the number of bytes per character may vary within the same string.
Other notions were introduced, such as code point and glyph, to deal with the complexity of written language.
While Unicode was being adopted as a standard, Windows retained the old one-byte codepages for efficiency, simplicity and backward compatibility. Windows also added codepages to deal with glyphs found only in Unicode.
Windows has:
A default OEM codepage, which is usually 437 in the US (your case) or 850 in Europe (my case), used with the command line ("DOS"),
the Windows-1252 codepage (often called Latin-1 or ISO 8859-1, but that name is a misuse; a quick check of the difference is sketched below) to ease conversion to/from Unicode. The current tendency is to replace all such extended codepages with Unicode. Java's designers made the drastic decision of using only Unicode to represent strings.
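That "misuse" remark is easy to verify: Windows-1252 and ISO 8859-1 agree everywhere except the range 0x80-0x9F, where ISO 8859-1 defines C1 control characters and Windows-1252 defines printable glyphs. A minimal check in Python, using 0x80 as a convenient sample byte:

    b = bytes([0x80])

    print(b.decode("cp1252"))          # €       (EURO SIGN, a printable glyph)
    print(repr(b.decode("latin-1")))   # '\x80'  (a C1 control character)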
When entering a character with the Alt method, you need to tell Windows which codepage you want to use for its interpretation:
No leading zero: You want the OEM codepage to be used.
Leading zero: You want the Windows codepage to be used.
Note on OEM codepages
OEM codepages are so called because on the first PC/PC-compatible computers the display of characters was hard-wired, not done in software. The computer had a character generator with a fixed encoding and the graphical definitions stored in a ROM. The BIOS would send a byte and a position (line, position in the line) to the generator, and the generator would draw the corresponding glyph at that position. This was called "text mode" at the time.
A computer sold in the US would have a different character ROM than one sold in Germany. This was really dependent on the manufacturer, and the BIOS was able to read the value of the installed codepage(s).
Later the generation of glyphs became software-based, to deal with unlimited fonts, style, and size. It was possible to define a set of glyphs and its corresponding encoding table at the OS level. This combination could be used on any computer, independently of the installed OEM generator.
Software-generated glyphs started with VGA display adapters; the code required for drawing the glyphs was part of the VGA driver.
As you have understood, Alt+0251 enters a character code; it does not represent a number.
A leading zero has no value when you write ordinary numbers, but these are character codes, not numbers, so here the zero matters.
We've recently had a user enter English text, but it appears to have been done on a computer set up for Cyrillic, as some of the letters such as "a" are actually CYRILLIC SMALL LETTER A, as opposed to LATIN SMALL LETTER A.
I thought that normalising would convert the Cyrillic to the Latin equivalent, but it does not (I guess they are only equivalent in how they are displayed, rather than in their meaning).
Is this a common problem: users whose computers are set up for Cyrillic writing English, but with the Cyrillic alphabet instead?
What would be a safe way to spot this in general, and convert it appropriately?
To detect Cyrillic, just use a regex match on [\p{IsCyrillic}]. A more generic approach would be to search for any characters which are non-Latin.
Once you've got a match, you'll need to replace the characters with their Latin equivalents.
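As a rough illustration of both steps, here is a Python sketch; the regex range and the look-alike table are my own picks and nowhere near complete, so treat them as a starting point rather than a full solution.

    import re

    # Cyrillic and Cyrillic Supplement blocks (U+0400-U+052F).
    CYRILLIC_RE = re.compile(r"[\u0400-\u052F]")

    # Common Cyrillic letters that look like Latin ones; extend as needed.
    CYRILLIC_TO_LATIN = {
        "а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "у": "y", "х": "x",
        "А": "A", "В": "B", "Е": "E", "К": "K", "М": "M", "Н": "H",
        "О": "O", "Р": "P", "С": "C", "Т": "T", "Х": "X",
    }

    def latinize(text):
        """Replace known Cyrillic look-alikes with their Latin equivalents."""
        return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)

    s = "Тhis is а test"               # Cyrillic Т and а mixed into English
    if CYRILLIC_RE.search(s):
        print(latinize(s))             # -> "This is a test"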
Commonly used, of course; Klingon doesn't count :-)
Thanks, guys, let me run the willItFit() test cases.
OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.
Characters requiring 3 bytes start at U+0800, and every character from there on needs at least 3 bytes, so that's a HUGE number of potential characters. This includes East Asian scripts such as Japanese, Chinese, Korean, and Thai.
For a complete list of script ranges, you can refer to Unicode's block data. Only the following blocks can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes (a quick way to check the byte count of any character is sketched after the list):
0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
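As promised above, a quick way to check the byte count of any character; a small Python sketch with arbitrarily chosen sample characters from different blocks:

    # UTF-8 length of a few sample characters from different blocks.
    for ch in ("A", "é", "Ж", "ع", "€", "あ", "한", "😀"):
        print(f"U+{ord(ch):04X} {ch} -> {len(ch.encode('utf-8'))} byte(s)")

    # U+0041 A -> 1 byte(s)
    # U+00E9 é -> 2 byte(s)
    # U+0416 Ж -> 2 byte(s)
    # U+0639 ع -> 2 byte(s)
    # U+20AC € -> 3 byte(s)
    # U+3042 あ -> 3 byte(s)
    # U+D55C 한 -> 3 byte(s)
    # U+1F600 😀 -> 4 byte(s)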
Here we go:
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
More details:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes (Basic Multilingual Plane, codes from U+0800).
Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all East Asian scripts and some others.
You even need three bytes just for English. For example, the typographically correct apostrophe is encoded in UTF-8 as 0xE2 0x80 0x99, opening quote marks are 0xE2 0x80 0x9C and closing quote marks are 0xE2 0x80 0x9D. The ellipsis is 0xE2 0x80 0xA6. And that's not even talking about all the different dashes, spaces or the inch and feet signs.
“It’s kinda hard to write English without the apostrophe’s help …”
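Those byte sequences are easy to verify; a short Python check:

    # The UTF-8 encodings of the punctuation mentioned above.
    for name, ch in (("apostrophe", "’"), ("opening quote", "“"),
                     ("closing quote", "”"), ("ellipsis", "…")):
        print(name, ch.encode("utf-8").hex(" "))

    # apostrophe e2 80 99
    # opening quote e2 80 9c
    # closing quote e2 80 9d
    # ellipsis e2 80 a6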
There are representations of many Asian languages that use more than 2 bytes. While it's true that they probably don't specifically need to, Japanese and Korean (at least) are often represented in multi-byte form.
I believe there is an algorithm that can treat two strings as equal when their characters have similar shapes but are different symbols (digits, Cyrillic, Latin or other alphabets). For example:
"hello" (Latin letters) equals "he11o" (digits and Latin letters)
"HELLO" (Latin letters) equals "НЕLLО" (Cyrillic and Latin letters)
"really" (Latin letters) equals "геа11у" (digits and Cyrillic letters)
You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.
Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.
I am not exactly sure what you are asking for.
If you want to know whether two characters look the same under a given typeface then you need to render each character in the chosen fonts into bitmaps and compare them to see if they are close to being identical.
If you just want to always consider the lower-case Latin 'l' to be the same as the digit '1', regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this would be to pick a canonical value for each set of characters that look the same, and map all members of the set to that character. When you compare the strings, compare the canonical form of each character from the table.
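A minimal Python sketch of that canonical-table idea; the table below only covers the examples from the earlier question and would need to be much larger in practice (Unicode's confusables data is a good source for seeding it):

    # Map every member of a confusable set to one canonical representative.
    CANONICAL = {
        "1": "l",                                # digit one looks like 'l'
        "Н": "H", "Е": "E", "О": "O",            # Cyrillic capitals that look Latin
        "г": "r", "е": "e", "а": "a", "у": "y",  # Cyrillic lowercase look-alikes
    }

    def canonicalize(s):
        return "".join(CANONICAL.get(ch, ch) for ch in s)

    def confusably_equal(a, b):
        return canonicalize(a) == canonicalize(b)

    print(confusably_equal("hello", "he11o"))    # True
    print(confusably_equal("HELLO", "НЕLLО"))    # True  (Cyrillic Н, Е, О)
    print(confusably_equal("really", "геа11у"))  # True  (Cyrillic г, е, а, у)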
I was teaching C to my younger brother, who is studying engineering. I was explaining to him how different data types are actually stored in memory. I explained the reasoning behind signed/unsigned numbers and the bits used to store floating-point numbers. While telling him about the char type in C, I also took him through the ASCII code system and how a char is stored as a 1-byte number.
He asked me why 'A' was given ASCII code 65 and not something else, and similarly why 'a' was given code 97 specifically. Why is there a gap of 6 ASCII codes between the range of capital letters and the range of small letters? I had no idea. Can you help me understand this? It has made me quite curious as well, and I've never found any book so far that discusses this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) have the binary prefix 110000:
0 is 110000
1 is 110001
2 is 110010
etc.
So if you wipe out the prefix (the first two '1's), you end up with the digit in binary coded decimal.
Capital letters have the binary prefix 1000000:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters have the binary prefix 1100000:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
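The practical payoff of this layout is that case conversion and digit parsing become single bit operations. A quick illustration (Python here, but the same tricks work in C or any other language):

    # Bit 5 (value 32) is the only difference between upper and lower case,
    # and masking off the 011 prefix of a digit leaves its numeric value.
    print(bin(ord("A")), bin(ord("a")))   # 0b1000001 0b1100001

    print(chr(ord("A") | 0b100000))       # a  -- set bit 5: to lower case
    print(chr(ord("q") & ~0b100000))      # Q  -- clear bit 5: to upper case
    print(ord("7") & 0b1111)              # 7  -- strip the prefix off the digit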
This chart from Wikipedia shows it quite well: notice the two columns of control characters, the two columns of upper case and the two of lower case, with the gaps filled in with miscellaneous symbols.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is a very detailed history and description of the ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
the first 32 characters are "non-printable" control characters, used for text formatting
then the printable characters follow, roughly in the order they appear on a keyboard. Check your keyboard:
space,
the shifted (upper-case) symbols on the number keys: !, ", #, ...,
numbers
symbols usually placed at the end of the keyboard row with numbers (upper case)
capital letters, alphabetically
symbols usually placed at the end of the keyboard rows with letters (upper case)
small letters, alphabetically
symbols usually placed at the end of the keyboard rows with letters (lower case)
The distance between 'A' and 'a' is 32. That's quite a round number, isn't it?
The gap of 6 characters between the capital letters and the small letters is because 32 - 26 = 6. (Note: there are 26 letters in the English alphabet.)
If you look at the binary representations of 'a' and 'A', you'll see that they differ by only one bit, which is pretty useful (turning upper case into lower case, or vice versa, is just a matter of flipping a bit). Why it starts there specifically, I have no idea.
'A' is 0x41 in hexadecimal.
'a' is 0x61 in hexadecimal.
'0' through '9' are 0x30 to 0x39 in hexadecimal.
So at least it is easy to remember the numbers for 'A', 'a' and 0-9. I have no idea about the symbols. See the Wikipedia article on ASCII Ordering.
Wikipedia:
The code itself was structured so that most control codes were together, and all graphic codes were together. The first two columns (32 positions) were reserved for control characters.[14] The "space" character had to come before graphics to make sorting algorithms easy, so it became position 0x20.[15] The committee decided it was important to support upper case 64-character alphabets, and chose to structure ASCII so it could easily be reduced to a usable 64-character set of graphic codes.[16] Lower case letters were therefore not interleaved with upper case. To keep options open for lower case letters and other graphics, the special and numeric codes were placed before the letters, and the letter 'A' was placed in position 0x41 to match the draft of the corresponding British standard.[17] The digits 0–9 were placed so they correspond to values in binary prefixed with 011, making conversion with binary-coded decimal straightforward.