I'm using the GDI GetGlyphOutlineW function to get the outlines of Unicode characters, and it works fine except that it does not work with surrogate pairs (U+10000 and higher). I've tried converting the surrogate pair into a UTF-32 value, but that does not appear to work.
How can I get glyph outlines of Supplementary Multilingual Plane characters?
Some Suggestions:
Does the particular Unicode code point you are trying to get actually exist in the font that is selected in the DC passed to the GetGlyphOutlineW function?
Follow the directions on this page to enable surrogate pairs in Windows.
Use the Uniscribe functions for character manipulation; a rough sketch follows below.
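If you go the Uniscribe route, the usual trick is to let it map the surrogate pair to a glyph index and then call GetGlyphOutlineW with GGO_GLYPH_INDEX. A rough sketch (the code point U+1D11E, the helper name, and the buffer sizes are only illustrative; the font selected into the DC must actually contain the glyph):

#include <windows.h>
#include <usp10.h>   // Uniscribe; link with usp10.lib (and gdi32.lib)

// Sketch: turn the surrogate pair for U+1D11E (MUSICAL SYMBOL G CLEF) into a
// glyph index with Uniscribe, then ask GetGlyphOutlineW for that glyph index.
bool GetSupplementaryGlyphOutline(HDC hdc)
{
    const WCHAR text[] = { 0xD834, 0xDD1E };   // UTF-16 surrogate pair for U+1D11E

    SCRIPT_ITEM items[4];
    int cItems = 0;
    if (FAILED(ScriptItemize(text, 2, 3, NULL, NULL, items, &cItems)))
        return false;

    SCRIPT_CACHE cache = NULL;
    WORD glyphs[32], clusters[2];
    SCRIPT_VISATTR visattr[32];
    int cGlyphs = 0;
    if (FAILED(ScriptShape(hdc, &cache, text, 2, 32, &items[0].a,
                           glyphs, clusters, visattr, &cGlyphs)))
        return false;

    // GGO_GLYPH_INDEX tells GetGlyphOutlineW that uChar is a glyph index,
    // not a UTF-16 code unit, so supplementary-plane glyphs work.
    GLYPHMETRICS gm;
    MAT2 identity = { {0,1}, {0,0}, {0,0}, {0,1} };
    DWORD size = GetGlyphOutlineW(hdc, glyphs[0], GGO_NATIVE | GGO_GLYPH_INDEX,
                                  &gm, 0, NULL, &identity);
    ScriptFreeCache(&cache);
    return size != GDI_ERROR;
}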
Related
When I test using emoji as variable names in Swift (Xcode 7.2), I have major problems: only some emoji work. I've determined that it is because some emoji are inserted as their base Unicode code point followed by a "variation selector".
Is there a way to insert such characters into my source code without the unprintable variation selector coming along for the ride and ruining my code?
This mostly seems to be the fault of Apple's implementation of Unicode input, where using the character selection pane to insert ♣️ inserts the code point for Clubs followed by the variation selector for "use the emoji presentation, not the plain-text style" (U+FE0F). Inserting ♣ injects both the Clubs code point and the "use the text-style" selector (U+FE0E).
If not, is there a list of characters to avoid using in variable names? So far, I've found anything involving a card suit to be problematic, as well as any human-like emoji that Apple has enabled for skin tone selection.
You can input individual code points using the "Unicode Hex Input" input source that comes with Mac OS X. See this guide for enabling and using it.
In general, you probably want to avoid non-ASCII identifiers for exactly the reasons you mention. There are many other examples where non-ASCII identifiers can cause problems, and they offer almost no practical benefit.
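If you want to see what a pasted identifier actually contains, a small check for lurking variation selectors can help. A minimal sketch (C++ for illustration; the function name is mine):

#include <string>

// Hypothetical check: does a UTF-32 string contain a variation selector?
// U+FE00..U+FE0F are the standard variation selectors; U+FE0F is the one
// the emoji picker appends ("emoji presentation"), U+FE0E the text one.
bool containsVariationSelector(const std::u32string& s)
{
    for (char32_t c : s)
        if (c >= 0xFE00 && c <= 0xFE0F)
            return true;
    return false;
}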
On Windows, if you insert a UTF-16 sequence containing surrogate pairs into a RichEdit control, the control handles this well: each surrogate pair is displayed as a single character.
The difficulty I'm facing is that when I query the selection, I get the position in the UTF-16 stream, not the character position as the number of visible characters in the control. I have a slow solution to find the actual position, but it requires retrieving the text up to the selection in UTF-16 and then counting the actual characters myself.
Did I miss something? Is there anything more efficient than that?
Thanks,
Manu
PS: To query the selection I'm using the EM_EXGETSEL message to fill a CHARRANGE structure.
The problem is real enough and it's only going to get more frequent. A single UTF-16 code unit can only represent the 64K code points of the Basic Multilingual Plane, and Unicode now defines well over 100,000 characters.
What you will see is a pair of UTF-16 code units (16-bit values) that display as a single character. Under the current standard there will only ever be two per character.
In .NET there are specific methods to do this work for you (for example, Char.IsSurrogatePair). I am not aware of equivalents in the Win32 API, but you can process the text yourself using the IS_HIGH_SURROGATE, IS_LOW_SURROGATE, and IS_SURROGATE_PAIR macros. I see no reason such code should be any slower than built-in functions, but you do have to write it yourself (unless you can find some source code somewhere).
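As a rough sketch of that approach (the helper name and the way you fetch the control's text are assumptions):

#include <windows.h>

// Hypothetical helper: convert a UTF-16 code-unit index (e.g. CHARRANGE.cpMin
// returned by EM_EXGETSEL) into a count of displayed characters (code points).
// Assumes `text` is the control's null-terminated text, fetched once.
static int CodeUnitIndexToCharCount(const wchar_t* text, int codeUnitIndex)
{
    int chars = 0;
    for (int i = 0; i < codeUnitIndex && text[i] != L'\0'; ++i) {
        if (IS_SURROGATE_PAIR(text[i], text[i + 1]))
            ++i;                  // the pair displays as one character
        ++chars;
    }
    return chars;
}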
This article may be helpful: Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?.
I'm currently building a hash key string (collapsed from a map) where the values are delimited by the ASCII unit separator, character 31 (0x1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However, reading about its history, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application; however, I'm interested to know how this applies more generally in other languages such as Java and C#, and with Unicode.
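For concreteness, this is roughly what I'm doing (the names are illustrative):

#include <map>
#include <string>

// Join the map's values into one key string, separated by the
// ASCII unit separator (0x1F).
std::string buildKey(const std::map<std::string, std::string>& fields)
{
    const char US = '\x1F';
    std::string key;
    for (const auto& kv : fields) {
        if (!key.empty())
            key += US;
        key += kv.second;
    }
    return key;
}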
The lower 128 characters of ASCII are set in stone in the Unicode standard, and that includes the control characters 0–31. The only reason you don't often see the special ASCII control characters used in strings is human-interface limitations: they don't display well (if at all) on screen or in files, and you can't easily type them on a keyboard. They're also not allowed in unescaped form in various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
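For example, if a value could ever contain the delimiter byte itself, a simple escaping pass keeps the format unambiguous. The scheme below is only an illustration (the escape byte is an arbitrary choice):

#include <string>

// Prefix any occurrence of the delimiter or the escape byte with the
// escape byte before joining; reverse the process when splitting.
std::string escapeField(const std::string& value)
{
    const char US  = '\x1F';   // unit separator (delimiter)
    const char ESC = '\x1B';   // escape byte chosen for this sketch
    std::string out;
    for (char c : value) {
        if (c == US || c == ESC)
            out += ESC;
        out += c;
    }
    return out;
}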
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So, consider the ramifications if you think your hash key could end up on a barcode.
Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.
We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF-8. However, some platforms unfortunately use extended ASCII (typically some form of Windows codepage). We're concerned that when converting something with, say, an umlaut from a Windows codepage to UTF-8, there are multiple possible choices within UTF-8 for the character.
On a different platform (Linux, Mac OS), how do we ensure that the UTF-8 character chosen there is consistent?
As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.
Thank you all in advance.
As long as you properly convert the original text to Unicode first and then use UTF-8 to store/transfer the data, there should be no problems.
The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.
Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its representation in each Unicode encoding form, such as UTF-8, is defined unambiguously.
So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.
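For example, a minimal sketch with ICU (the codepage name and the function name are placeholders for whatever your platforms actually use):

#include <unicode/unistr.h>   // ICU; link with icuuc
#include <string>

// Decode a Windows-1252 byte string via ICU's mapping tables and
// re-encode it as UTF-8.
std::string cp1252ToUtf8(const std::string& in)
{
    icu::UnicodeString u(in.data(), static_cast<int32_t>(in.size()), "windows-1252");
    std::string out;
    u.toUTF8String(out);
    return out;
}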
P.S. There is no such thing as extended ASCII. There are various character encodings, some of which coincide with ASCII in the range 0 to 0x7F and some of which don't.
I am working on an application that allows users to input Japanese-language characters. I am trying to come up with a way to determine whether the user's input is a Japanese character (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but there are Unicode ranges for hiragana and katakana listed on Wikipedia. (Which I would expect are also available from unicode.org.)
Hiragana: U+3040–U+309F
Katakana: U+30A0–U+30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I can't say whether Simplified Chinese and Traditional Chinese are included in the same range...)
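Putting those ranges into code, a minimal sketch (assuming the UTF-8 input has already been decoded to code points; only the basic CJK block is used for kanji, so the extension blocks are not covered):

// Rough classification by Unicode block, given a decoded code point.
inline bool is_hiragana(char32_t c) { return c >= 0x3040 && c <= 0x309F; }
inline bool is_katakana(char32_t c) { return c >= 0x30A0 && c <= 0x30FF; }
// CJK Unified Ideographs (basic block only; extensions not covered here).
inline bool is_kanji(char32_t c)    { return c >= 0x4E00 && c <= 0x9FFF; }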
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
Regex is great because you double the problems. And I did it in PHP, my language of choice for extra-strong automatic problem generation.
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great (it matches runs of anything that isn't a word character, hiragana ぁ-ゔ, katakana ァ-ヺ, the prolonged sound mark ー, or a kanji in the basic CJK range \x{4E00}-\x{9FAF})... I'll keep looking.
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.