Why is facepalm emoji 🤦‍♀️ followed by U+200D♀ - utf-8

I am trying to send UTF-8 symbols via a serial device to a browser and display them. I found that when I type the facepalm emoji 🤦‍♀️ (on Windows 10, Win+.) it has U+200D and ♀ characters after it. Other emojis don't have that. I was using the View non-printable unicode characters tool. I also found that if you type it in Notepad it shows ♀; if you type it in the browser address bar, ♀ is invisible, but pressing backspace deletes it. And finally, if you type it in an HTML text input, you can delete the whole emoji with a single backspace. Why is that?

Emoji sequences have more than one code point to signify variations (below may or may not look different for each sequence depending on browser):
🤦 PERSON FACEPALMING U+1F926
🤦‍♂️ MAN FACEPALMING U+1F926 U+200D U+2642 U+FE0F
🤦‍♀️ WOMAN FACEPALMING U+1F926 U+200D U+2640 U+FE0F
References:
Emoji List, v13.1 No. 260-262.
Full Emoji List, v13.1, No. 260-262 (With browser-specific images)
Unicode® Standard Annex #29, UNICODE TEXT SEGMENTATION
Some editors/browsers handle the sequences better than others. They may not show differences between all variations, or may not recognize the latest Unicode specification and newer emojis.
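To see the sequence for yourself, here is a minimal Python sketch that splits the woman-facepalming emoji into its code points. The helper name `describe` is made up for this illustration; the names printed come from Python's `unicodedata` module and are the formal Unicode character names, which can differ from the emoji list labels above.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# WOMAN FACEPALMING written with explicit escapes so no code point is hidden:
# base emoji + ZERO WIDTH JOINER + FEMALE SIGN + VARIATION SELECTOR-16
woman_facepalming = "\U0001F926\u200D\u2640\uFE0F"

code_points = [cp for cp, _ in describe(woman_facepalming)]
utf8_length = len(woman_facepalming.encode("utf-8"))

for cp, name in describe(woman_facepalming):
    print(cp, name)
print("UTF-8 bytes:", utf8_length)
```

A tool that works per code point (Notepad, or a serial link that forwards bytes) sees four separate items here, while a renderer that understands emoji sequences treats them as one grapheme, which is why a single backspace can remove the whole thing in some inputs.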

Related

How to insert/copy+paste unicode whitespace into a text file using editors like Textmate?

I am trying to create a test csv file for a file cleaning script that is supposed to normalize all whitespace into "normal"/ "regular" whitespace character. The idea is I will insert a bunch of these oddball whitespace characters into this test file in some various locations.
Here are some sites that show these various and oddball whitespaces
https://en.wikipedia.org/wiki/Whitespace_character
http://jkorpela.fi/chars/spaces.html
I've tried to copy and paste from sources like those websites, but the characters always seem to paste in as a normal space in Textmate. It could be that I am not copying what I think I am copying. In the past I've been able to copy and paste special/unicode characters into Textmate when I could clearly see what I was copying, but with whitespace characters I can't confirm it since I can't see them. So I am not sure whether the problem is the source I am copying from or whether Textmate is converting the character to a normal space when I paste it in.
If it is easier to use Textedit (the built in editor) or nano (command line editor) to do this I could use those. Or if there is another way other than copying and pasting that is better to get these into Textmate that would be an option.
I am on a MacbookPro running High Sierra MacOS.
If you have LibreOffice installed, you can use the spreadsheet application to create these characters: put the hexadecimal code point in one cell, then convert it in another cell using
=unichar(hex2dec(cell_ref_to_1rst_cell)).
Far less confusing and you can save the spreadsheet complete with comments as a handy reference. Then you should just be able to copy paste the cell with the unicode character when required.
If you’re using TextMate, various functions provided by the Unicode bundle could be helpful here (install via Preferences → Bundles → Unicode).
With this bundle installed you can use Insert Unicode Character ⌃⌥⌘I to insert a character by name. Search for “space” to get a list of all space characters, then simply click on the desired character (the full title of a character is shown on hover):
Of course, once inserted, all the space characters look almost identical. To identify them, use Show Unicode Name(s) ⌃⌥⌘U 6. This will display a tooltip showing the Unicode name of the character directly before the cursor (or the names of all selected characters, if a selection is active).
Also have a look at Show Character Inventory (press ⌃⌥⌘U and then select the command from the popup menu): This provides a convenient overview of all the characters in your document (or in the selected text, if a selection is active).
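If copying and pasting keeps being unreliable, another option is to generate the test file from code, so you know exactly which code points end up in it. A minimal Python sketch, assuming a two-column layout and the hypothetical file name `whitespace_test.csv` (neither comes from the question):

```python
import csv
import io
import unicodedata

# A handful of non-ASCII space characters, written as escapes so that no
# editor can silently replace them with U+0020.
ODD_SPACES = ["\u00A0", "\u2002", "\u2003", "\u2009", "\u200A", "\u202F", "\u3000"]

def build_rows():
    """One row per space: the space between two words, and as padding."""
    return [[f"left{sp}right", f"{sp}padded{sp}"] for sp in ODD_SPACES]

def csv_text():
    """Render the rows as CSV text in memory."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(build_rows())
    return buffer.getvalue()

if __name__ == "__main__":
    # Hypothetical output path; change as needed.
    with open("whitespace_test.csv", "w", newline="", encoding="utf-8") as f:
        f.write(csv_text())
    for sp in ODD_SPACES:
        print(f"U+{ord(sp):04X}", unicodedata.name(sp))
```

Running it prints the code point and Unicode name of each space that was written, which doubles as a reference when checking the output of the cleaning script.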

Has anyone heard of this strange bug with the standard Windows message box?

Years ago, I was messing around with Visual Basic and I discovered a bug with the MsgBox function. I tried searching for it, but nobody had ever said anything about it. It's not just with Visual Basic though; it's with anything that uses the standard Windows MessageBox API call.
The bug is triggered when the title text has more than one character, and the first character is a lowercase 'y' with an umlaut ('ÿ'). What's so special about this character? It is almost definitely not the character itself, but rather its character code that's special. 'ÿ' is character 255 (0xFF), meaning it's the highest value that can be stored in an unsigned byte, and all its bits are set to 1.
What does this bug do? Well, there are two different possibilities, which depend on the number of characters in the title text. If there is an even number of characters (other than 2) in the title text, no message box appears, and you just hear the alert sound. If there are two characters in the title text, or any odd number other than 1 (in which case the bug wouldn't be triggered)...then this happens:
And that's not all--the message will also be truncated to one line. It seems like the kind of bug that would occur in at least one semi-high-profile incident, considering how often this API call is used. Are there any reports of this on the Internet, or anything showing what could cause it? Maybe it's a Unicode-related glitch, like that "bush hid the facts" glitch in Notepad?
I made a program in case you want to play around with this; download it here.
Alternatively, copy the following into Notepad, save it with a .vbs extension, and double-click it to display the dialog box seen above:
MsgBox "Windows 3.1 font, anyone?", 0, "ÿ ODD NUMBER!"
Or for a different font:
MsgBox "I CAN HAS CHEEZBURGER?", 0, "ÿ HImpact"
EDIT: It seems that if the first four characters are ÿ's, it doesn't ever display the message, even if there's an odd number of characters.
This is a bug with dialog templates generally. It is not a message box bug as such.
For example, in Visual Studio create the default win32 application. In the .rc file, change the caption in the template for the about box from
CAPTION "About sampleapp"
to
CAPTION "ÿT"
and the bug will manifest itself when you display the about box.
In the DLGTEMPLATEEX documentation note that the menu and class name have type sz_Or_Ord which means either a null-terminated string or 0xFFFF followed by a single word resource identifier.
Windows incorrectly applies a similar scheme to the dialog title: if the first character is 0xFF then it treats the title as being two WORDs long, but only when it is trying to locate the font information. When it is displaying the title it correctly treats the title as a string.
In other words, Windows is looking for the font information inside the title string. In most cases this won't specify a valid font, so Windows defaults to the system font.
To prove this, I constructed a dialog template in memory (based on this). Once this was working I deleted the code that writes the font information to the template and used the dialog title "ÿa\xd\x200\x21SimSun". This displays the dialog in italic SimSun because Windows is reading the font information from the title string.
This bug is likely a hangover from 16-bit Windows, where (I guess) 0xFF was used as the resource ID marker.
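The mechanism described above can be modelled with a toy parser. This is only a sketch of the behaviour as this answer describes it, not actual Windows code: the function names are invented, the field order (point size, weight, italic/charset, typeface) follows the DLGTEMPLATEEX layout, and the trailing "real" font block is an assumption added for contrast.

```python
import struct

def utf16_words(text):
    """Encode text as a list of 16-bit UTF-16LE code units."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def skip_string(words, pos):
    """Correct rule for the title: skip up to and including the NUL terminator."""
    while words[pos] != 0:
        pos += 1
    return pos + 1

def skip_title_buggy(words, pos):
    """Hypothetical model of the bug: a leading 0x00FF is treated like an
    ordinal marker, so the title is assumed to be exactly two words long."""
    if words[pos] == 0x00FF:
        return pos + 2
    return skip_string(words, pos)

def read_font(words, pos):
    """Read point size, weight, italic flag (low byte of the third word)
    and the NUL-terminated typeface name starting at pos."""
    pointsize, weight, italic_charset = words[pos:pos + 3]
    end = skip_string(words, pos + 3)
    typeface = "".join(chr(w) for w in words[pos + 3:end - 1])
    return pointsize, weight, bool(italic_charset & 0xFF), typeface

# The title from the answer, followed by an assumed ordinary font block.
title = "\u00ffa\x0d\u0200\x21SimSun"
words = utf16_words(title) + [0]
words += [8, 400, 0] + utf16_words("MS Shell Dlg") + [0]

font_correct = read_font(words, skip_string(words, 0))
font_buggy = read_font(words, skip_title_buggy(words, 0))
print("correct:", font_correct)
print("buggy:  ", font_buggy)
```

With the correct skip the parser reaches the font block after the title; with the modelled bug it stops two words in and reads point size 13, weight 0x200, a non-zero italic byte and the typeface "SimSun" straight out of the title string.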
A strange bug. I suspect the symptoms are the result of the way the MessageBox() actually displays the dialog.
Internally, MessageBox() builds a dialog template dynamically. If you look at the description of a DLGTEMPLATE structure you'll find the following nugget of information:
In a standard template for a dialog box, the DLGTEMPLATE structure is
always immediately followed by three variable-length arrays that
specify the menu, class, and title for the dialog box. When the
DS_SETFONT style is specified, these arrays are also followed by a
16-bit value specifying point size and another variable-length array
specifying a typeface name.
So, the in-memory layout of a dialog template has the font specification immediately following the dialog box title.
Visual Basic does not use Unicode and so the function you're calling is actually MessageBoxA(). This is simply a thunk that converts the passed-in strings from multibyte to Unicode and then calls MessageBoxW().
I believe what's happening is that, for some reason, the conversion of that string from multibyte to Unicode is either going wrong, or returning a spurious length value. This has the knock-on effect, when the dialog template is built in memory, of corrupting the memory immediately following the title string - which, as we know, is the font specification.

View special chars in Sublime Text

I am using both Notepad2 and Sublime Text 3 and I prefer ST3 over Notepad2 as it has a lot of great features. One thing I miss very much though is the possibility to view special characters in a logfile.
If I have a logfile with this one line in it (<null> is the HEX char 0x00):
ERROR: Received invalid data string [<null><null>e<null><null>test</null>]
If I open it in Notepad2 I get this view:
If I open it in ST3 I get this HEX view:
Is it possible to get the same view in ST3 as in Notepad2, so I can see the special characters?
I just found this option which can be set in the User Settings:
// Files containing null bytes are opened as hexadecimal by default
"enable_hexadecimal_encoding": false
This gives exactly what I wanted:
I've been using this:
https://sublime.wbond.net/packages/HexViewer
But that does not map \0 to NUL, which may cause alignment issues (unless you have a fixed-width NUL glyph in your font).
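If no editor setting gives the view you want, the log can also be preprocessed so that control bytes become visible glyphs. A minimal Python sketch, assuming the log is read as raw bytes and that your font has the Unicode Control Pictures block (U+2400 onward); the function name `show_control_chars` is made up for this example:

```python
def show_control_chars(data: bytes) -> str:
    """Replace C0 control bytes and DEL with their Unicode Control Pictures."""
    # latin-1 maps every byte to the code point of the same value,
    # so nothing is lost or rejected during decoding.
    text = data.decode("latin-1")
    table = {code: 0x2400 + code for code in range(0x20)}  # NUL..US
    table[0x7F] = 0x2421                                   # DEL
    return text.translate(table)

sample = b"ERROR: Received invalid data string [\x00\x00e\x00\x00test]"
visible = show_control_chars(sample)
print(visible)
```

Note that this also replaces tabs and line breaks with pictures; remove those entries from the table if you want to keep the line structure of the log.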

Control Characters and How OS/TextEditors interprets them?

I was going through some material on control characters, especially the newline character (which I will focus on). After reading http://en.wikipedia.org/wiki/Control_characters, I learned that \n is the line terminator on Unix, while it is \r\n on Windows. That raised the question of how the OS comes into the picture when interpreting ASCII codes, because I was under the impression that when we type a given character on the keyboard, every OS sends the same bits, and the editor interprets those bits and displays the corresponding character. It looks like this understanding is wrong, because different bits are sent on Unix (\n) and Windows (\r\n) when we press ENTER (the line terminator). My new understanding is that pressing ENTER on different OSes (say Unix and Windows) sends different bits to the editor, and it is the text editor's responsibility to show the typed text on a new line, taking the underlying OS into account. Please let me know if my understanding is correct, as this will help me understand other basics as well.
Next question: if the above is correct, why do different OSes treat some control characters differently when they treat all other characters the same? Is it because specific bits are already reserved in specific OSes?
How an application treats keyboard input varies a bit, actually. When you press return the application is under no obligation to actually generate LF or CR+LF anywhere. E.g. it might decide to just end the current paragraph object and start a new one (e.g. in a word processor). If it's a Windows text editor then it will probably just write CR+LF into the file, while on Unix it just writes an LF.
The keyboard itself is very, very far removed from the things you see on the screen or even on the disk. A key press goes through scan codes, keyboard layouts and other transformations before it ends up as text or markup somewhere.
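The point that the application, not the keyboard, decides which bytes represent a line break can be seen in a small Python sketch. The helper `bytes_written` is made up for this illustration; it writes the same text through a text layer configured with each convention and returns the resulting bytes.

```python
import io

def bytes_written(text: str, newline: str) -> bytes:
    """Write text through a text layer that uses the given line terminator
    and return the bytes that would end up on disk."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(buffer, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = buffer.getvalue()
    wrapper.detach()  # leave the buffer alone when the wrapper goes away
    return data

text = "first\nsecond\n"          # what the program holds in memory
unix_bytes = bytes_written(text, "\n")
windows_bytes = bytes_written(text, "\r\n")
print(unix_bytes)
print(windows_bytes)
```

The in-memory text is identical in both cases; only the layer that writes it out chooses between LF and CR+LF.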

How to display unicode control characters in visual studio text visualizer?

I get a text string from a service which contains Unicode control characters
(e.g. \u202B or \u202A and others for Arabic language support).
But while debugging I can't see them in the default text visualizer. So I need to enable display of such characters to determine which of them my text contains. There is a "show all characters" checkbox in the text visualizer, but it doesn't work as I expect.
Any suggestions?
Thanks in advance
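Those are the explicit directional embedding codes: U+202A is LEFT-TO-RIGHT EMBEDDING (LRE) and U+202B is RIGHT-TO-LEFT EMBEDDING (RLE). They are used, for example, when a run inside right-to-left text should be displayed in left-to-right order.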
http://unicode.org/reports/tr9/#Directional_Formatting_Codes
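If the visualizer will not show these characters, one workaround is to convert the string to an escaped form before inspecting it. The sketch below is in Python rather than a Visual Studio feature, and both function names are invented for this example; it treats every character in the Unicode categories Cf (format) and Cc (control) as invisible.

```python
import unicodedata

def reveal_format_chars(text: str) -> str:
    """Replace invisible format/control characters with \\uXXXX escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in ("Cf", "Cc"):
            out.append(f"\\u{ord(ch):04X}")
        else:
            out.append(ch)
    return "".join(out)

def list_format_chars(text: str):
    """Return (index, code point, name) for every format character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

sample = "\u202Babc\u202C"
escaped = reveal_format_chars(sample)
found = list_format_chars(sample)
print(escaped)
for item in found:
    print(item)
```

The same idea carries over to any language with access to Unicode character categories: walk the string and print the code point of anything that is not a visible character.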

Resources