Getting the cursor position in RichEdit when text has surrogates - winapi

On Windows, if you have a UTF-16 sequence containing surrogate and that you insert that sequence in a RichEdit control, the RichEdit control handles this well and for each surrogate pair, it will only show one character.
The difficulty I'm facing is that when I query the selection, I'm getting the position in the UTF-16 stream, and not the character position as the number of visible characters in the control. I have a slow solution to find out the actual position, but it requires retrieving the text up to the selection in UTF-16 and then count myself the number of actual characters.
Did I miss something? Is there anything more efficient than that?
Thanks,
Manu
PS: To query the selection I'm using the EM_EXGETSEL message to fill a CHARRANGE structure.

The problem is real enough and it's only going to get more frequent. Single code points in UTF-16 only reach 64K characters and there are nearly 300K of them now.
What you will see is a pair of character positions (short values) that display as a single character. There will only ever be two, under the current standards.
In .Net code there are particular functions to do this work for you. I am not aware of any in WinApi. You can process the text using functions that test using the macros IS_HIGH_SURROGATE, IS_LOW_SURROGATE, and IS_SURROGATE_PAIR. I see no reason they should be any slower than built-in functions, but you have to write them (unless you can find some source code somewhere).
This article may be helpful: Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?.

Related

standard separator between datablocks in a data stream?

I have several TSV tables (for any characters below ASCII 32 only common characters are contained, such as '\a\b\t\n\v\f\r\e'). I'd like to put them into a single stream. I think that ASCII control characters can be used to separate them. But I am not sure which ASCII control character (other than the ones already used as shown above) is standard for this purpose. Does anybody know what is the standard?
I don't know of standards for what you suggest.
Many protocols use envelopes with a count for what is to follow (TCP) or an index with offsets and lengths (tar, zip) or an arbitrary unique separator that is defined in the header (multipart MIME). There are some that are a simple series like yours but that have formatting that makes item separation obvious (XML element stream [which differs from a document because it has no root element]).
Is the idea that you might not know from the start how many tables and/or rows there are? (That is the use case for XML element streams.)
␀ seems reasonable, perhaps with final extra ␀ to mark the end of the stream (DOS/Windows program environment block).
␊ making an empty line also seems reasonable (like in SMTP).
␚(^Z) is a common file terminator from the days of CP/M and DOS.

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values that are delimited by the special ASCII unit delimiter 31 (1F).
This nicely solves the problem of trying to guess what ASCII characters won't be used in the string values and I don't need to worry about escaping or quoting values etc.
However reading about the history of this is it appears to be a relic from the 1960s and I haven't seen many examples where strings are built and tokenised using this special character so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 char map of ASCII is fully set in stone into the Unicode standard, this including characters 0->31. The only reason you don't see special ASCII chars in use in strings very often is simply because of human interfacing limitations: they do not visualize well (if at all) when displayed to screen or written to file, and you can't easily type them in from a keyboard either. They're also not allowed in un-escaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012, by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field-separator so that a single scan can populate multiple data fields with a single scan. So, consider the ramifications if you think your hash key could end up on a barcode.

GetGlyphOutline() and UTF-16 surrogate pairs

I'm using the GDI GetGlyphOutlineW function to get the outline of unicode characters, and it works fine except that it does not work with surrogate pairs (U+10000 and higher). I've tried converting the surrogate pair into a UTF-32 character, but this does not appear to work.
How can I get glyph outlines of Supplementary Multilingual Plane characters?
Some Suggestions:
Does the particular Unicode code point you are trying to get actually exist in the font that is selected in the DC passed to the GetGlyphOutlineW function?
Follow the directions on this page to enable surrogate pairs in Windows.
Use the Uniscribe functions for character manipulation.

tcl utf-8 characters not displaying properly in ui

Objective : To have multi language characters in the user id in Enovia v6
I am using utf-8 encoding in tcl script and it seems it saves multi language characters properly in the database (after some conversion). But, in ui i literally see the saved information from the database.
While doing the same excercise throuhg Power Web, saved data somehow gets converted back into proper multi language character and displays properly.
Am i missing something while taking tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In UI I see name as: Kátai-Pál
In Tcl I use below syntax
set encoded [encoding convertto utf-8 Kátai-Pál];
Now user name becomes: Kátai-Pál
In UI I see name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)

manually finding the size of a block of text (ASCII format)

Is there an easy way to manually (ie. not through code) find the size (in bytes, KB, etc) of a block of selected text? Currently I am taking the text, cutting/pasting into a new text document, saving it, then clicking "properties" to get an estimate of the size.
I am developing mainly in visual studio 2008 but I need any sort of simple way to manually do this.
Note: I understand this is not specifically a programming question but it's related to programming. I need this to compare a few functions and see which one is returning the smallest amount of text. I only need to do it a few times so figured writing a method for it would be overkill.
This question isn't meaningful as asked. Text can be encoded in different formats; ASCII, UTF-8, UTF-16, etc. The memory consumed by a block of text depends on which encoding you decide to use for it.
EDIT: To answer the question you've stated now (how do I determine which function is returning a "smaller" block of text) -- given a single encoding, the shorter text will almost always be smaller as well. Why can't you just compare the lengths?
In your comment you mention it's ASCII. In that case, it'll be one byte per character.
I don't see the difference between using the code written by the app you're pasting into, and using some other code. Being a python person myself, whenever I want to check length of some text I just do it in the interactive interpreter. Surely some equivalent solution more suited to your tastes would be appropriate?
ended up just cutting/pasting the text into MS Word and using the char count feature in there

Resources