Does Unicode have a code point for a garbled digit? - user-interface

I'm looking for a placeholder glyph to display "insert any digit here", to tersely communicate in limited GUI space that a range of numbers is meant.
For decimal numbers I would use x, e.g.
1xx - room numbers on first floor
2xx - room numbers on second floor
but my ranges are hexadecimal, so
0x00xx - IDs reserved for future use
0x01xx - IDs reserved for development
0x02xx - IDs managed by team Bravo
looks a bit odd, as the x would have two different meanings.

There is no Unicode character that simply means "any digit here". Unicode does offer an extensive range of symbols to choose from though, which will not be confused with 'x'. An underscore has the benefit that, in many fonts, it has the same width as a numeral. If you choose something more exotic, like ◌ DOTTED CIRCLE or ⯑ UNCERTAINTY SIGN, just ensure that it will be present in the font used for your interface.

Related

What are Unicode codepoint types for?

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Text has many different meanings and uses, so the question is difficult to answer.
First, about the term codepoint. We use it because it is easy, it implies a number (a code), and it is not easily confused with other terms. Unicode itself says that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context is usually clear, and the two are often interchangeable (except for the few codepoints that are not characters, such as surrogates and some reserved codepoints). Note: Unicode is mostly about characters, whereas ISO 10646 was mostly about codepoints. The original ISO standard was a table of numbers (codepoints) and names, while Unicode is about the properties of characters. So we may say codepoint where Unicode character would be more accurate, but character is easily confused with the C char and with font glyphs or graphemes.
Codepoints are a basic unit, so they are useful for most programs: storing text in databases, exchanging it with other programs, saving files, sorting, and so on. For exactly this reason, programming languages use the codepoint as a type. UTF-8 code units would be an alternative, but they are harder to navigate (think of UTF-8 as a tape that you must read sequentially, and of codepoint text as a hard disk where you can jump into the middle of a text). The analogy is not 100% apt, because you may still need some context bytes. If you are receiving user text and will just store it in a database, your program probably does not need to split it into graphemes, form ligatures, etc. The codepoint is really low level, and so fast for most operations.
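As a rough illustration of the difference between the two units, here is a small Python sketch. Python's str happens to be a sequence of code points; the sample string is made up for this example.

```python
# Sample text, written with escapes so the code points are unambiguous:
# "naive" with a precomposed i-diaeresis (U+00EF), a space, the euro sign
# (U+20AC) and the digit 5.
text = "na\u00efve \u20ac5"

# A Python str is indexed by code point, so any index lands on a whole one.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The UTF-8 form is a sequence of code units (bytes); U+00EF takes 2 bytes
# and U+20AC takes 3.
utf8 = text.encode("utf-8")

print(len(text), "code points,", len(utf8), "UTF-8 bytes")
print(code_points)

# Cutting the byte sequence at an arbitrary offset can split a code point.
try:
    utf8[:3].decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print("first 3 bytes decode cleanly:", cut_is_valid)
```

Slicing the str can never produce an invalid sequence, while slicing the bytes can, which is roughly the "context bytes" caveat mentioned above.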
The other part of text is displaying it (or speaking it). This part is very complex, because there are many scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries: text layout (word separation and so on, e.g. Pango), a shaping engine (to find which glyph to use, combine characters, and decide where to put the next character, e.g. HarfBuzz), and a font library that renders the font (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because display is complex and involves many operations anyway, we can afford complex structures there (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note: fonts contain codepoint tables that are used to perform various operations before the glyph index is found. Various APIs take Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).
Naturally, things get more complex, and require a lot of knowledge of different parts of Unicode, if you are writing an editor (WYSIWYG, but also terminal-based): you mix both worlds, and you need much more information (e.g. for text selection). But in that case you must create your own structures.
And really, things are complex: do you want to show just the first x characters on your blog (maybe about assessment), or split at words (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job across all languages, so there is not yet a need for a supporting type in programming languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code-points as per context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.
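One possible way to build such a search key, sketched in Python: compatibility normalization (NFKC) from the standard unicodedata module folds the small and fullwidth forms to the plain dollar sign. The helper name fold_for_search and the extra mapping for the heavy dollar sign are assumptions of this sketch, not something Unicode prescribes.

```python
import unicodedata

# Assumption for this sketch: HEAVY DOLLAR SIGN is folded by hand, on the
# expectation that NFKC alone does not map it to "$".
EXTRA_FOLDS = str.maketrans({"\U0001F4B2": "$"})

def fold_for_search(text: str) -> str:
    """Return a search key in which compatibility variants collapse."""
    return unicodedata.normalize("NFKC", text).translate(EXTRA_FOLDS)

# Plain, small, fullwidth and heavy dollar signs.
variants = ["$", "\uFE69", "\uFF04", "\U0001F4B2"]
keys = {fold_for_search(v) for v in variants}
print(keys)
```

Whether such folding is desirable depends on the application; it deliberately throws away distinctions that the code points encode.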
Sometimes multiple code points can be combined to form a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor, trying to delete á (from the right side) will usually delete only the acute accent first. Searching for either variant should find both variants.
Storage-wise, you avoid having to search for both variants by performing Unicode normalization, e.g. NFC, which always favors precomposed code points over a sequence of code points that combine into the same character.
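A minimal Python sketch of that normalization step, using the standard unicodedata module (the variable names are only illustrative):

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The two strings render alike but are different code point sequences.
raw_equal = precomposed == decomposed

# NFC composes where a precomposed form exists; NFD does the opposite.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(raw_equal, nfc == precomposed, nfd == decomposed)
```

Normalizing both the stored text and the search term to the same form lets a plain comparison find either variant.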
As for homoglyphs, code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the Greek or Cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise, the Latin letter A won't find the Greek or Cyrillic one.
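The same point can be checked programmatically. This short Python sketch prints the names of the three look-alikes and shows that a plain substring search does not treat them as equal:

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A, written as escapes to keep them apart.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Searching for the Latin letter in text that contains only the Cyrillic one.
haystack = "\u0410"
found = "\u0041" in haystack
print("Latin A found in Cyrillic text:", found)
```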
Writing-system-wise, code points can be shared by several scripts and languages: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
There are valid reasons for a programmer to deal with code points. Programming languages that support them may (or may not) handle encodings correctly (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. As for text, users should not need to be concerned with code points, although knowing about them would help in distinguishing homoglyphs.

ASCII - What's the point of it?

I always wanted to ask this. I know that ASCII uses numbers to represent characters, like 65 = A.
What's the point? The computer understands that when I press A it is A, so why do we need to convert it to 65?
You have it backwards: computers understand when you press an A because of codes like ASCII. Or rather, one part of the computer is able to tell another part of the computer that you pressed an A because they agree on conventions of binary signals like ASCII.
At its lowest level, each part of the computer "knows" that it is in one of two states - maybe off and on, maybe high voltage and low voltage, maybe two directions of magnetism, and so on. For convenience, we label these two states 0 and 1. We then build elaborate (and microscopic) sequences of machinery that each say "if this thing's a 1, then do this, if it's a 0 do this".
If we string a sequence of 1s and 0s together, we can write a number, like 1010; and we can make machinery that does maths with those numbers, like 1010 + 0001 = 1011. Alternatively, we can string a much longer sequence together to represent the brightness of pixels from the top left to bottom right of a screen, in order - a bitmap image. The computer doesn't "know" which sequences are numbers and which are images, we just tell it "draw the screen based on this sequence" and "calculate my wages based on this sequence".
If we want to represent not numbers or images, but text, we need to come up with a sequence of bits for each letter and symbol. It doesn't really matter what sequence we use, we just need to be consistent - we could say that 000001 is A, and as long as we remember that's what we chose, we can write programs that deal with text. ASCII is simply one of those mappings of sequences of bits to letters and symbols.
Note that A is not defined as "65" in ASCII, it's defined as the 7 bit sequence 1000001; it just happens that that's the same sequence of bits we generally use for the number 65. Note also that ASCII is a very old mapping, and almost never used directly in modern computers; it is however very influential, and a lot of more recent mappings are designed to use the same or similar sequences for the letters and symbols that it covers.
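To make the "same bits, different interpretation" point concrete, a small Python sketch:

```python
letter = "A"

# The number Python associates with the character, and its 7-bit pattern.
code = ord(letter)
bits = format(code, "07b")

# Encoding to ASCII yields a single byte holding that same pattern.
raw = letter.encode("ascii")

# Reading the bit pattern back as a number and then as a character.
same_bits_as_text = chr(int("1000001", 2))

print(code, bits, raw, same_bits_as_text)
```

Nothing in the byte itself says "letter" or "number"; the program decides which reading to apply.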

Bash string compression

I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part is, since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of 297 combinations, so you could save 6 characters (plus 2 for the colons) worth of information into 2+ characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from 2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1, where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes have 297 values and the last 5 have 2^(8*5)=2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
Base n encoding
Make a list of n characters that you want to use in your final output.
Print the character in your list at index i%n.
Subtract the modulus from the integer encoding and divide by n.
Repeat the previous two steps (print a character, then divide) until the integer becomes 0.
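The two procedures above can be sketched in Python roughly as follows. The function names, the 64-character alphabet, and the group sizes in the usage lines are assumptions made for illustration; real group sizes would come from the analysis of the serial and MAC formats.

```python
import string

def pack(values, sizes):
    """Combine per-group numbers into one integer (mixed-radix)."""
    i = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError(f"{value} is out of range for a group of size {size}")
        i = i * size + value
    return i

def unpack(i, sizes):
    """Inverse of pack: recover the per-group numbers."""
    values = []
    for size in reversed(sizes):
        i, value = divmod(i, size)
        values.append(value)
    return values[::-1]

def to_base_n(i, alphabet):
    """Emit characters least-significant first, as in the steps above."""
    n = len(alphabet)
    if i == 0:
        return alphabet[0]
    out = []
    while i > 0:
        i, digit = divmod(i, n)
        out.append(alphabet[digit])
    return "".join(out)

def from_base_n(text, alphabet):
    """Inverse of to_base_n."""
    n = len(alphabet)
    i = 0
    for ch in reversed(text):
        i = i * n + alphabet.index(ch)
    return i

# Hypothetical 64-character output alphabet.
ALPHABET = string.ascii_letters + string.digits + "-_"

# Hypothetical groups: three fields with 36^3, 20 and 27 possible values.
sizes = [36**3, 20, 27]
values = [12345, 7, 26]
packed = pack(values, sizes)
encoded = to_base_n(packed, ALPHABET)
print(encoded, unpack(from_base_n(encoded, ALPHABET), sizes))
```

The output has variable length because high-order zero digits are dropped; padding with the first alphabet character would give a fixed width if that is needed.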
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n=64 for a custom base64 encoding (without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
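The character counts can be double-checked with exact integer arithmetic instead of floating-point logarithms. This is a small sketch; chars_needed is a helper name invented here.

```python
def chars_needed(combinations, alphabet_size):
    """Smallest k such that alphabet_size**k >= combinations."""
    k, capacity = 0, 1
    while capacity < combinations:
        capacity *= alphabet_size
        k += 1
    return k

# The product quoted in the result above.
total = 36**3 * 20 * 27 * 36 * 7 * 297 * 2**40

print(total)
print(chars_needed(total, 64), chars_needed(total, 95))
```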
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to distinguish between serial number formats. Adding an extra field at the end of the abstract number encoding lets you track which version it uses.
Think long and carefully about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.

String typeface comparison algorithm

I believe there is an algorithm that can treat two strings as equal when their characters look alike but are actually different symbols (digits, Cyrillic, Latin or other alphabets). For example:
"hello" (Latin symbols) equals to "he11o" (digits and Latin symbols)
"HELLO" (Latin symbols) equals to "НЕLLО" (Cyrillic and Latin symbols)
"really" (Latin symbols) equals to "геа11у" (digits and Cyrillic symbols)
You may be thinking of the algorithm that Paul E. Black developed for ICANN that determines whether two TLDs are "confusingly similar", though it currently does not work with mixed-script input (e.g. Latin and Cyrillic). See "Algorithm Helps ICANN Manage Top-level Domains" and the ICANN Similarity Assessment Tool.
Also, if you are interested in extending this algorithm, then you might want to incorporate information from the Unicode code charts, which commonly list similar glyphs and sequences of code points that render similarly.
I am not exactly sure what you are asking for.
If you want to know whether two characters look the same under a given typeface then you need to render each character in the chosen fonts into bitmaps and compare them to see if they are close to being identical.
If you just want to always consider lower-case Latin 'l' to be the same as the digit '1' regardless of the font used, then you can simply define a character mapping table. Probably the easiest way to do this would be to pick a canonical value for each set of characters that look the same and map all members of the set to that character. When you compare the strings, compare the canonical instance of each character from the table.
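A minimal Python sketch of such a mapping table, covering only the characters from the question's three examples. The table contents and the names CANONICAL, skeleton and looks_same are assumptions of this sketch; a real table would need far more entries.

```python
# Map each look-alike to one canonical character (here: the Latin one).
CANONICAL = str.maketrans({
    "1": "l",        # digit one          -> Latin small l
    "\u041d": "H",   # Cyrillic capital En -> Latin H
    "\u0415": "E",   # Cyrillic capital Ie -> Latin E
    "\u041e": "O",   # Cyrillic capital O  -> Latin O
    "\u0433": "r",   # Cyrillic small ghe  -> Latin r
    "\u0435": "e",   # Cyrillic small ie   -> Latin e
    "\u0430": "a",   # Cyrillic small a    -> Latin a
    "\u0443": "y",   # Cyrillic small u    -> Latin y
})

def skeleton(text: str) -> str:
    """Replace every character by the canonical member of its set."""
    return text.translate(CANONICAL)

def looks_same(a: str, b: str) -> bool:
    """Compare two strings by their canonical forms."""
    return skeleton(a) == skeleton(b)

print(looks_same("hello", "he11o"))
print(looks_same("HELLO", "\u041d\u0415LL\u041e"))
print(looks_same("really", "\u0433\u0435\u043011\u0443"))
```

Note that this table also makes "1" and "l" equal in contexts where they are meant to differ, so it suits spoof detection better than general comparison.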

Is there any logic behind ASCII codes' ordering?

I was teaching C to my younger brother, who is studying engineering. I was explaining to him how different data types are actually stored in memory. I explained the logic behind signed and unsigned numbers and the floating-point representation of decimal numbers. While telling him about the char type in C, I also took him through the ASCII code system and how a char is stored as a 1-byte number.
He asked me why 'A' has been given the ASCII code 65 and not anything else. Similarly, why is 'a' given the code 97 specifically? Why is there a gap of 6 ASCII codes between the range of capital letters and small letters? I had no idea. Can you help me understand this, since it has made me very curious as well? I've never found a book that discusses this topic.
What is the reason behind this? Are ASCII codes logically organized?
There are historical reasons, mainly to make ASCII codes easy to convert:
Digits (0x30 to 0x39) share the high bits 011 (written below without the leading zero):
0 is 110000
1 is 110001
2 is 110010
etc.
So if you clear the prefix (the two leading '1's), you end up with the digit in binary-coded decimal.
Capital letters have the binary prefix 1000000:
A is 1000001
B is 1000010
C is 1000011
etc.
Same thing, if you remove the prefix (the first '1'), you end up with alphabet-indexed characters (A is 1, Z is 26, etc).
Lowercase letters have the binary prefix 1100000:
a is 1100001
b is 1100010
c is 1100011
etc.
Same as above. So if you add 32 (100000) to a capital letter, you have the lowercase version.
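These prefix tricks can be written as simple bit masks. A short Python sketch, with helper names chosen only for this example and valid for ASCII input only:

```python
def digit_value(ch: str) -> int:
    """Keep the low 4 bits of an ASCII digit: '7' -> 7."""
    return ord(ch) & 0x0F

def letter_index(ch: str) -> int:
    """Keep the low 5 bits of an ASCII letter: 'A' or 'a' -> 1."""
    return ord(ch) & 0x1F

def to_lower(ch: str) -> str:
    """Add 32 (binary 100000) to an ASCII capital letter."""
    return chr(ord(ch) + 32)

print(digit_value("7"), letter_index("Z"), letter_index("a"), to_lower("Q"))
```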
The ASCII chart on Wikipedia shows this quite well: notice the two columns of control characters, two of upper case and two of lower case, with the gaps filled in with miscellaneous symbols.
Also bear in mind that ASCII was developed based on what had passed before. For more detail on the history of ASCII, see this superb article by Tom Jennings, which also includes the meaning and usage of some of the stranger control characters.
Here is a very detailed history and description of ASCII codes: http://en.wikipedia.org/wiki/ASCII
In short:
ASCII is based on teleprinter encoding standards
the first 32 characters are "nonprintable" control characters - used for text formatting and device control
then come the printable characters, roughly in the order they are placed on the keyboard. Check your keyboard:
space,
upper-case (shifted) signs on the number keys: !, ", #, ...,
numbers
signs usually placed at the end of keyboard row with numbers - upper case
capital letters, alphabetically
signs usually placed at the end of keyboard rows with letters - upper case
small letters, alphabetically
signs usually placed at the end of keyboard rows with letters - lower case
The distance between A and a is 32. That's quite a round number, isn't it?
The gap of 6 characters between capital letters and small letters is because (32 - 26) = 6. (Note: there are 26 letters in the English alphabet).
If you look at the binary representations for 'a' and 'A', you'll see that they only differ by 1 bit, which is pretty useful (turning upper case to lower case or vice-versa is just a matter of flipping a bit). Why start there specifically, I have no idea.
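The single differing bit is 0x20 (decimal 32), so toggling case for an ASCII letter is one XOR. A small Python sketch; flip_case is a name made up here, and it is only meaningful for A-Z and a-z:

```python
CASE_BIT = 0x20  # the one bit that differs between 'A' and 'a'

def flip_case(ch: str) -> str:
    """Toggle the case bit of an ASCII letter."""
    return chr(ord(ch) ^ CASE_BIT)

distance = ord("a") - ord("A")   # distance between the two ranges
gap = ord("a") - ord("Z") - 1    # codes lying between 'Z' and 'a'

print(flip_case("A"), flip_case("a"), distance, gap)
```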
'A' is 0x41 in hexadecimal.
'a' is 0x61 in hexadecimal.
'0' through '9' are 0x30 - 0x39 in hexadecimal.
So at least it is easy to remember the numbers for A, a and 0-9. I have no idea about the symbols. See The Wikipedia article on ASCII Ordering.
Wikipedia:
The code itself was structured so that most control codes were together, and all graphic codes were together. The first two columns (32 positions) were reserved for control characters.[14] The "space" character had to come before graphics to make sorting algorithms easy, so it became position 0x20.[15] The committee decided it was important to support upper case 64-character alphabets, and chose to structure ASCII so it could easily be reduced to a usable 64-character set of graphic codes.[16] Lower case letters were therefore not interleaved with upper case. To keep options open for lower case letters and other graphics, the special and numeric codes were placed before the letters, and the letter 'A' was placed in position 0x41 to match the draft of the corresponding British standard.[17] The digits 0–9 were placed so they correspond to values in binary prefixed with 011, making conversion with binary-coded decimal straightforward.

Resources