Why is there no super or subscript "q" or "Q" characters defined in UTF-8? - utf-8

The title says it all. I would just like to be able to write e^(p/q) inline, but the closest I can get is eᵖᐟᵠ. Is there any way at all to achieve this simple thing?

There are superscript q and Q characters, defined as of Unicode version 14 (the CodePoint column gives the Unicode codepoint (U+hhhh) and its UTF-8 bytes; the UTF-16 surrogate pair, where applicable, is shown in parentheses):
Char CodePoint Description
---- --------- -----------
ꟴ {U+A7F4, 0xEA,0x9F,0xB4} MODIFIER LETTER CAPITAL Q
𐞥 {U+107A5, 0xF0,0x90,0x9E,0xA5} MODIFIER LETTER SMALL Q (0xd801,0xdfa5)
They appear in UnicodeData.txt as follows (the file was previously downloaded from the Unicode Character Database):
findstr "Q;Lm;" D:\Utils\CodePages\UnicodeData.txt
A7F4;MODIFIER LETTER CAPITAL Q;Lm;0;L;<super> 0051;;;;N;;;;;
107A5;MODIFIER LETTER SMALL Q;Lm;0;L;<super> 0071;;;;N;;;;;
You need to find a font containing a glyph for MODIFIER LETTER SMALL Q (maybe the Last Resort font family?). Try my answer to another question, How to determine if a Glyph can be displayed?
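As a quick check, here is a Ruby sketch that builds the superscript p/q string from those codepoints directly (it assumes your terminal font actually has a glyph for U+107A5; the "ᵖ" (U+1D56) and slash-like "ᐟ" characters are reused verbatim from the question):

```ruby
# e followed by MODIFIER LETTER SMALL P (U+1D56), the question's
# slash-like character, and MODIFIER LETTER SMALL Q (U+107A5, Unicode 14).
s = "e\u1D56ᐟ\u{107A5}"
puts s

# Print the codepoints to confirm what the string actually contains.
s.each_codepoint { |cp| printf("U+%04X ", cp) }
puts
```

Even if the glyph renders as a placeholder box, the codepoint dump confirms the string itself is correct.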

Related

How to encode a TAB character in a Code128 barcode using only raw ZPL

In the past, we've used ZPL to create Code39 barcodes with a TAB character encoded in the middle using something similar to the following:
*USERNAME$IPASSWORD*
The $I in the middle gets translated to a TAB by the barcode scanners we use.
Now we have a need to do the same thing, but using Code128. With Code39, all the text needs to be uppercase (unless you're using Code39Extended, which supports lowercase letters). Because some of the data that is going to be encoded will be lowercase, we need to use Code128 B for most of the barcode, switching to Code128 A in the middle to encode the TAB character, then back to Code128 B for the final part.
Looking through the "ZPL II Programming Guide", it should be as easy as:
>:username>7{TAB}>6PA55w0rd
The >: at the beginning sets the subset to B, the >7 changes the subset to A, and the >6 changes the subset back to B. The problem I'm having (and haven't found a solution after almost a week of searching) is: How do I encode a TAB character using only text?
Use the ^FH (field hexadecimal encoding) command immediately prior to your field data. Based on your example:
^FH_^FD>:username>7_09>6PA55w0rd^FS
Where the underscore '_' is used as the escape character and 09 is the hex value for tab.
Also note that if the chosen escape character appears in the user name or password, you will need to escape it as well.
I tried what Mark Warren suggested, but unfortunately, it didn't work. It did, however, get me looking back through the ZPL II Programming Guide and I found the following, which I had overlooked before:
Code 128, Subsets A and C are programmed in pairs of digits, 00 to 99, in the field data string.
...
In Subset A, each pair of digits results in a single character being encoded in the bar code...
So, since 73 equates to a TAB in Subset A, I tried the following:
>:username>773>6PA55w0rd
And it worked!
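The pair 73 lines up with the usual Code 128 Subset A value mapping (my own summary of the symbology, not taken from the ZPL guide): printable ASCII 32–95 map to symbol values 0–63, and control characters ASCII 0–31 map to 64–95. A Ruby sketch of that arithmetic (the helper name is made up for illustration):

```ruby
# Code 128 Subset A symbol values:
#   ASCII 32..95 (printable) -> values 0..63  (value = ascii - 32)
#   ASCII  0..31 (control)   -> values 64..95 (value = ascii + 64)
def code128a_value(ch)
  ascii = ch.ord
  ascii >= 32 ? ascii - 32 : ascii + 64
end

puts code128a_value("\t")  # TAB (ASCII 9) -> 73, the pair used above
puts code128a_value("A")   # 'A' (ASCII 65) -> 33
```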

How do I use a Unicode table to print a Tibetan character

Here is the Unicode characters table for the Tibetan language,
https://en.m.wikipedia.org/wiki/Tibetan_(Unicode_block)
How do I use the codes in that chart in a fmt.Printf statement in order to print, say, the Tibetan letter ཏ, which is located in row U+0F4x, column F of that Unicode chart?
Do I have to write:
Fmt.Printf(“U+0F4xF”)
or something like that, or do I have to drop the “U” or the “U+“ ?
To print ཏ (U+0F4F TIBETAN LETTER TA), or any other Unicode character, you can put the character directly into your string literal, use a \u0F4F escape, or use the corresponding rune (Unicode codepoint):
fmt.Printf("Direct: ཏ\n")
fmt.Printf("Escape: \u0F4F\n")
fmt.Printf("Rune: %c\n", rune(0x0F4F))
The Go blog has some details...

^ meaning in "let's build a compiler" code

In Jack Crenshaw's "Let's Build a Compiler", what does the ^I mean in this statement:
const TAB = ^I;
He also uses ^G in one of his functions.
From the Free Pascal Language Reference:
Also, the caret character ( ^ ) can be used in combination with a
letter to specify a character with ASCII value less than 27. Thus ^G
equals #7 - G is the seventh letter in the alphabet.
The compiler is rather sloppy about the characters it allows after the caret, but in general
one should assume only letters.
The result is a one-byte ASCII character constant. I is the 9th letter in the alphabet. And the ASCII value 9 is – no surprise – the TAB character.
It's Control-I, which translates to ASCII char-9, the character for Tab. Similarly, Ctrl-G is ASCII char-7, the character for BEL (literally, bell), which usually generates a beep from a console.
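The same arithmetic in a short Ruby sketch (the helper name is made up for illustration):

```ruby
# Caret notation: ^X denotes the control character whose code is the
# letter's ASCII value minus 64 (equivalently, its position in the
# alphabet: I is the 9th letter, G the 7th).
def caret_char(letter)
  (letter.ord - 64).chr
end

p caret_char("I")  # => "\t" (TAB, ASCII 9)
p caret_char("G")  # => "\a" (BEL, ASCII 7)
```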

Replace accented character in Ruby

I have a Ruby script, and when I try to replace accented characters, gsub doesn't work for me.
My folder name is "Réé Ab":
name = File.basename(Dir.getwd)
name.downcase!
name.gsub!(/[àáâãäå]/,'a')
name.gsub!(/æ/,'ae')
name.gsub!(/ç/, 'c')
name.gsub!(/[èéêë]/,'e')
name.gsub!(/[ìíîï]/,'i')
name.gsub!(/[ýÿ]/,'y')
name.gsub!(/[òóôõö]/,'o')
name.gsub!(/[ùúûü]/,'u')
The output is "réé ab". Why are the accented characters still there?
The é in your name are actually two different Unicode codepoints: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).
p 'é'.each_codepoint.map{|e|"U+#{e.to_s(16).upcase.rjust(4,'0')}"} * ' ' # => "U+0065 U+0301"
However the é in your regex is only one: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Wikipedia has an article about Unicode equivalence. The official Unicode FAQ also contains explanations and information about this topic.
How to normalize Unicode strings in Ruby depends on your Ruby version. Ruby has had built-in Unicode normalization support since 2.2, so you don't have to require a library or install a gem as in previous versions (here's an overview). To normalize name, simply call String#unicode_normalize with :nfc or :nfkc as its argument to compose é (U+0065 and U+0301) into é (U+00E9):
name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!
Of course, you could also use decomposed characters in your regular expressions but that probably won't work on other file systems and then you would also have to normalize: NFD or NFKD to decompose.
I should also point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (mull / forest humus).
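To make the failure mode concrete, here is a small Ruby sketch contrasting the decomposed form (as in the folder name) with the composed form (as in the regex):

```ruby
decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT, as in the folder name
composed   = "\u00E9"   # single-codepoint é, as in the regex

# Different codepoint sequences, so they compare as unequal and the
# regex character class does not match the decomposed form.
p decomposed == composed                           # => false

# After NFC normalization, the two forms are identical and gsub works.
p decomposed.unicode_normalize(:nfc) == composed   # => true
p decomposed.unicode_normalize(:nfc).gsub(/[èéêë]/, 'e')  # => "e"
```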

Why 'аnd' == 'and' is false?

I tagged character-encoding and text because I know that if you type 'and' == 'and' into the Rails console, or into most any other programming language, you will get true. However, when one of my users pastes his text into my website, I can't spell-check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding?).
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'аnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'аnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn’t really an encoding problem; in the first case the strings compare as false simply because they are different.
The first character of the first string isn’t a “normal” a: it is actually U+0430 CYRILLIC SMALL LETTER A, and the first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a “normal” Latin a, which is U+0061 LATIN SMALL LETTER A.
Here’s the “normal” a: a, and this is the Cyrillic a: а, they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that there are many other characters that look similar to more familiar ones. If you choose this route you would be better off looking for a library/gem that does it for you, though you may find such conversions are stricter than you intend.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby’s regular expression script support; e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
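A minimal Ruby sketch of that script check (the variable names are made up for illustration):

```ruby
pasted = "\u0430nd"  # "аnd" with CYRILLIC SMALL LETTER A, as pasted by the user
typed  = "and"       # all Latin letters, as typed

# The strings differ even though they render identically in most fonts.
p pasted == typed                 # => false

# Flag input containing Cyrillic characters when only Latin is expected.
p pasted.match?(/\p{Cyrillic}/)   # => true
p typed.match?(/\p{Cyrillic}/)    # => false
```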
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
Really, there is no problem at all! The two strings are different. With the font you are using, a cyrillic a looks the same as a latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that cyrillic a come from?
Maybe it was meant to be a cyrillic a and really should be treated not-equal to a latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why browsers display the Punycode form of the domain in addition to the Unicode domain name.
I think it is an encoding issue; you can have a try like this:
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'аnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "аnd"
