I am writing a text editor which has an option to display a bullet in place of any invisible Unicode character. Unfortunately there appears to be no easy way to determine whether a Unicode character is invisible.
I need to find a text file containing every Unicode character so that I can look through it for invisible characters. Would anyone know where I can find such a file?
EDIT: I am writing this app in Cocoa for Mac OS X.
Oh, I see... actual invisible characters ;) This FAQ will probably be useful:
http://www.unicode.org/faq/unsup_char.html
It lists the current invisible codepoints and has other information that you might find helpful.
EDIT: Added some Cocoa-specific information
Since you're using Cocoa, you can get the unicode character set for control characters and compare against that:
NSCharacterSet* controlChars = [NSCharacterSet controlCharacterSet];
You might also want to take a look at the FAQ link I posted above and add any characters that you think you may need based on the information there to the character set returned by controlCharacterSet.
EDIT: Added an example of creating a Unicode string from a Unicode character
unichar theChar = 0x000D;
NSString* thestring = [NSString stringWithCharacters:&theChar length:1];
Let me know if this code helps at all:
- (NSString *)stringByReplacingControlCharacters:(NSString *)originalString
{
    NSUInteger length = [originalString length];
    // Copy the string's UTF-16 code units into a scratch buffer.
    unichar *strAsUnichar = (unichar *)malloc(length * sizeof(unichar));
    NSCharacterSet *controlChars = [NSCharacterSet controlCharacterSet];
    unichar bullet = 0x2022; // U+2022 BULLET
    [originalString getCharacters:strAsUnichar range:NSMakeRange(0, length)];
    // Replace every control character with a bullet.
    for (NSUInteger i = 0; i < length; i++) {
        if ([controlChars characterIsMember:strAsUnichar[i]])
            strAsUnichar[i] = bullet;
    }
    NSString *newString = [NSString stringWithCharacters:strAsUnichar length:length];
    free(strAsUnichar);
    return newString;
}
Important caveats:
This probably isn't the most efficient way of doing this, so you will have to decide how you want to optimize after you get it working. This only works with characters in the BMP (Basic Multilingual Plane); support for composed characters would have to be added if you have such a requirement. This does no error checking at all.
A good place to start is the Unicode Consortium itself which provides a large body of data, some of which would be what you're looking for.
I'm also in the process of producing a DLL which you give a string and it gives back the UCNs of each character. But don't hold your breath.
The current official Unicode version is 5.1.0, and text files describing all of the code points in that can be found at http://www.unicode.org/standard/versions/components-latest.html
For Java, java.lang.Character.getType. For C, u_charType() or u_isgraph().
you might find this code to be of interest: http://gavingrover.blogspot.com/2008/11/unicode-for-grerlvy.html
It's an impossible task. Unicode supports even Klingon, so it's not going to work. However, most text editors use the standard ANSI invisible characters. And if your Unicode library is good, it will support finding equivalent characters and/or categories; you can use these two features to do it as well as any editor out there.
Edit: Yes I was being silly about Klingon support, but that doesn't make it not true... of course Klingon is not supported by the Consortium, however there is a movement for Klingon in the Unicode's "Private Use Area" defined for Klingon alphabet (U+F8D0 - U+F8FF). Link here for those interested :)
Note: Wonder what editor Klingon programmers use...
Related
I'm parsing some nasty files - you know, mix comma, space and tab delimiters in a single line, and then run it through a text editor that word wraps at column 65 with CRLF. Ugh.
As part of my efforts to parse this in Cocoa, I use Apple's whitespaceAndNewlineCharacterSet. But what, exactly, is in that set? The documentation says "Unicode General Category Z*, U000A ~ U000D, and U0085". I was able to find the last three (85 is interesting), but what does the ~ mean, and what is General Category Z*?
Any Unicode gurus out there?
The ~ means "thru"; thus, U000A, B, C, and D.
The phrase "General Category Z*" is shorthand for "any character whose General Category property is one of the three categories that start with Z." Thus, various forms of space (0020, 00A0, 1680, 2000 thru 200A, 202F, 205F, 3000), plus the line separator (2028) and the paragraph separator (2029).
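As a rough illustration of the "membership rule" described above, here is a minimal C sketch that hard-codes only the code points named in this answer. The function names `is_listed_whitespace_or_newline` and `skip_listed_whitespace` are hypothetical, and the table is hand-copied from the list above rather than taken from Apple's implementation, so the real character set may contain more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if cp is one of the code points named in the
 * answer above. Assumption: this mirrors that list only, not Apple's set. */
bool is_listed_whitespace_or_newline(uint32_t cp)
{
    if (cp >= 0x000A && cp <= 0x000D) return true;  /* U000A ~ U000D */
    if (cp == 0x0085) return true;                  /* U0085 */
    if (cp == 0x0020 || cp == 0x00A0 || cp == 0x1680) return true;  /* spaces */
    if (cp >= 0x2000 && cp <= 0x200A) return true;  /* 2000 thru 200A */
    if (cp == 0x2028 || cp == 0x2029) return true;  /* line / paragraph separator */
    if (cp == 0x202F || cp == 0x205F || cp == 0x3000) return true;  /* spaces */
    return false;
}

/* Hypothetical helper: returns the index of the first code point at or
 * after i that is not in the list, or len if there is none. */
size_t skip_listed_whitespace(const uint32_t *cps, size_t len, size_t i)
{
    assert(cps != NULL || len == 0);
    while (i < len && is_listed_whitespace_or_newline(cps[i]))
        i++;
    return i;
}
```

A parser for mixed delimiters could call `skip_listed_whitespace` between fields to step over any run of these characters, including a CRLF pair left behind by word wrapping.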
NSCharacterSet is an opaque class that does not expose its content easily. You have to see it more as a "membership" rule service than a list of characters.
This may be a somewhat brutal approach, but you can get the list of members in an NSCharacterSet by going through all 16 bit scalar values and checking for membership in the set:
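import Foundation

let charSet = CharacterSet.whitespacesAndNewlines
for i in 0..<65536 {
    // Surrogate code units (0xD800-0xDFFF) are not valid scalars, so skip them.
    guard let scalar = Unicode.Scalar(UInt16(i)) else { continue }
    if charSet.contains(scalar) {
        print("\(i): \(Character(scalar))")
    }
}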
This gives surprising results for non-displayable character sets but it can probably answer your question.
Some characters have ambiguous directionality, like whitespace and punctuation marks. This can lead to text layout situations where there doesn't appear to be a single correct layout without access to additional data to resolve the ambiguity. Consider this text:
\u05e9\u05e0\u05d1\u05d2abcd!
That's four Hebrew characters (unambiguously right-to-left), four English characters (unambiguously left-to-right), and one punctuation mark (ambiguous). If I lay out that string in an IDWriteTextLayout with DWRITE_READING_DIRECTION_RIGHT_TO_LEFT, I get the following:
The punctuation mark appears to be treated as a right-to-left character which is starting a new right-to-left block to the left of the English, which seems perfectly reasonable, especially considering that right-to-left was the specified reading direction. However, it's also entirely reasonable to expect the punctuation mark to be treated as a left-to-right character associated with the embedded left-to-right English text, which would mean it should appear to the right of the 'd'.
My app knows exactly how it wants this character to be treated. How do I pass that data to IDWriteTextLayout to resolve this ambiguity?
I found the SetLocaleName method and thought that it must be the answer, but I can't seem to get it to affect the result at all. I also found the localeName parameter when creating an IDWriteTextFormat (which is then used to create the IDWriteTextLayout).
If my goal is for this to generally be Hebrew text with a string of embedded US English, I would think I'd want to use locale he on the IDWriteTextFormat and then use SetLocaleName to override that with locale en-US on character range [4-9]. However, doing so has no effect. In fact, I can't get any combination of locales used in those places to have any effect on the layout at all, whether I restrict them to a subrange or apply them to the entire string.
Am I wrong in thinking that these APIs should serve this purpose? If so, what APIs should I be using? Or is there really no way to tell IDWriteTextLayout to resolve this ambiguity differently? Am I maybe using the APIs wrong? Here is the test code I'm using to create this IDWriteTextLayout:
TestTextRenderer::TestTextRenderer(const std::shared_ptr<DX::DeviceResources>& deviceResources) :
m_deviceResources(deviceResources),
m_text(L"\u05e9\u05e0\u05d1\u05d2abcd!"),
m_readingDirection(DWRITE_READING_DIRECTION_RIGHT_TO_LEFT),
m_formatLocale(L"en-US"),
m_layoutLocale(L"en-US")
{
ComPtr<IDWriteTextFormat> textFormat;
DX::ThrowIfFailed(
m_deviceResources->GetDWriteFactory()->CreateTextFormat(
L"Segoe UI",
nullptr,
DWRITE_FONT_WEIGHT_MEDIUM,
DWRITE_FONT_STYLE_NORMAL,
DWRITE_FONT_STRETCH_NORMAL,
24.0f,
m_formatLocale.c_str(),
&textFormat
)
);
DX::ThrowIfFailed(textFormat->SetReadingDirection(m_readingDirection));
DX::ThrowIfFailed(
m_deviceResources->GetDWriteFactory()->CreateTextLayout(
m_text.c_str(),
(uint32) m_text.length(),
textFormat.Get(),
250.0f,
100.0f,
&m_textLayout
)
);
DWRITE_TEXT_RANGE all{0u, static_cast<UINT32>(m_text.size())}; // explicit cast avoids a narrowing conversion
DX::ThrowIfFailed(m_textLayout->SetLocaleName(m_layoutLocale.c_str(), all));
DX::ThrowIfFailed(m_deviceResources->GetD2DFactory()->CreateDrawingStateBlock(&m_stateBlock));
CreateDeviceDependentResources();
}
I don't think there's any ambiguity from the point of view of the Unicode BiDi algorithm. The initial direction set on the IDWriteTextFormat or IDWriteTextLayout is crucial, but after that, run directions are derived strictly from the code points.
Setting the locale won't change direction, but it can affect shaping; the end result depends on which features the run's font has.
I think you can accomplish abcd!... output using LRE/PDF controls around this part of the text.
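To make the LRE/PDF suggestion concrete, here is a minimal C sketch that wraps a range of UTF-16 code units in U+202A (LEFT-TO-RIGHT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING). The function name `wrap_range_in_lre` is hypothetical, and the sketch only builds the string; whether the layout then comes out as hoped still has to be checked in the actual text layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LRE 0x202A  /* LEFT-TO-RIGHT EMBEDDING */
#define PDF 0x202C  /* POP DIRECTIONAL FORMATTING */

/* Hypothetical helper: copies src (len UTF-16 units) into dst, inserting LRE
 * before index start and PDF after the last unit of the range.
 * dst must have room for len + 2 units.
 * Returns the number of units written, or 0 if the range is out of bounds. */
size_t wrap_range_in_lre(const uint16_t *src, size_t len,
                         size_t start, size_t count, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    if (start > len || count > len - start)
        return 0;

    size_t n = 0;
    memcpy(dst + n, src, start * sizeof *src);          /* text before the range */
    n += start;
    dst[n++] = LRE;
    memcpy(dst + n, src + start, count * sizeof *src);  /* the embedded run */
    n += count;
    dst[n++] = PDF;
    size_t tail = len - start - count;
    memcpy(dst + n, src + start + count, tail * sizeof *src);  /* text after */
    n += tail;
    return n;
}
```

For the string in the question, wrapping start 4, count 5 would put the controls around `abcd!`, so the punctuation mark sits inside the left-to-right embedding.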
For historical reasons, Cocoa's Unicode implementation is 16-bit: it handles Unicode characters above 0xFFFF via "surrogate pairs". This means that the following code is not going to work:
NSString *myString = @"𠬠";
uint32_t codepoint = [myString characterAtIndex:0];
printf("%04x\n", codepoint); // incorrectly prints "d842"
Now, this code works 100% of the time, but it's ridiculously verbose:
NSString *myString = @"𠬠";
uint32_t codepoint;
[myString getBytes:&codepoint maxLength:4 usedLength:NULL
encoding:NSUTF32StringEncoding options:0
range:NSMakeRange(0,2) remainingRange:NULL];
printf("%04x\n", codepoint); // prints "20b20"
And this code using mbtowc works, but it's still pretty verbose, affects global state, isn't thread-safe, and probably fills up the autorelease pool on top of all that:
setlocale(LC_CTYPE, "UTF-8");
wchar_t codepoint;
mbtowc(&codepoint, [@"𠬠" UTF8String], 16);
printf("%04x\n", codepoint); // prints "20b20"
Is there any simple Cocoa/Foundation idiom for extracting the first (or Nth) Unicode codepoint from an NSString? Preferably a one-liner that just returns the codepoint?
The answer given in this otherwise excellent summary of Cocoa Unicode support (near the end of the article) is simply "Don't try it. If your input contains surrogate pairs, filter them out or something, because there's no sane way to handle them properly."
A single Unicode code point might be a surrogate pair, but not all language characters are single code points either. That is, not all language characters are represented by one or two UTF-16 units; many characters are represented by a sequence of Unicode code points.
This means that unless you are dealing with ASCII, you have to think of language characters as substrings, not Unicode code points at indexes.
To get the substring for the character at index 0:
NSRange r = [myString rangeOfComposedCharacterSequenceAtIndex:0];
[myString substringWithRange:r];
This may or may not be what you want depending on what you are actually hoping to do. e.g. although this will give you 'character boundaries' these won't correspond to cursor insertion points, which are language specific.
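If a single code point really is what you need, the surrogate-pair arithmetic itself is short. Below is a minimal C sketch over a buffer of UTF-16 code units (for example, one filled by getCharacters:range:). The function name `code_point_at` is hypothetical, and the sketch decodes one code point only; it does nothing about composed character sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the code point that starts at UTF-16 index i.
 * Stores the number of units consumed (1 or 2) in *consumed.
 * An unpaired surrogate is returned unchanged with *consumed set to 1. */
uint32_t code_point_at(const uint16_t *units, size_t len, size_t i,
                       size_t *consumed)
{
    assert(units != NULL && consumed != NULL && i < len);

    uint16_t hi = units[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < len) {
        uint16_t lo = units[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            /* Combine the 10 payload bits of each half, then add 0x10000. */
            *consumed = 2;
            return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                            + ((uint32_t)lo - 0xDC00u);
        }
    }
    *consumed = 1;
    return hi;
}
```

Walking a string then means advancing the index by `*consumed` after each call instead of by one.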
I'm writing a piece of software using RealBASIC 2011r3 and need a reliable, cross-platform way to break a string out into paragraphs. I've been using the following but it only seems to work on Linux:
dim pTemp() as string
pTemp = Split(txtOriginalArticle.Text, EndOfLine + EndOfLine)
When I try this on my Mac it returns it all as a single paragraph. What's the best way to make this work reliably on all three build targets that RB supports?
EndOfLine changes depending upon the platform, and the line endings in a string depend upon the platform that created it. You'll need to check for the type of EndOfLine in the string. I believe it's sMyString.EndOfLineType. Once you know what it is, you can split on it.
There are further properties for the EndOfLine. It can be EndOfLine.Macintosh/Windows/Unix.
EndOfLine docs: http://docs.realsoftware.com/index.php/EndOfLine
I almost always search for and replace the combinations of line break characters before continuing. I'll usually do a few lines of:
yourString = replaceAll(yourString,chr(10)+chr(13),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13)+chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(10),"<someLineBreakHolderString>")
yourString = replaceAll(yourString,chr(13),"<someLineBreakHolderString>")
The order here matters (do 10+13 before an individual 10) because you don't want to end up replacing a line break that contains a 10 and a 13 with two of your line break holders.
It's a bit cumbersome and I wouldn't recommend using it to actually modify the original string, but it definitely helps to convert all of the line breaks to the same item before attempting to further parse the string.
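The same idea can be done in a single pass instead of four replacements. Here is a minimal C sketch, assuming a NUL-terminated byte string and a hypothetical function name `normalize_line_breaks`; it collapses 13+10, 10+13, a lone 13 and a lone 10 into a single 10, treating pairs first for the reason given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrites s in place so that CR LF, LF CR, a lone CR
 * and a lone LF each become one LF. Returns the new length; the result
 * stays NUL-terminated. */
size_t normalize_line_breaks(char *s)
{
    assert(s != NULL);

    size_t r = 0;  /* read position */
    size_t w = 0;  /* write position, never ahead of r */
    while (s[r] != '\0') {
        char c = s[r++];
        if (c == '\r' || c == '\n') {
            char other = (c == '\r') ? '\n' : '\r';
            if (s[r] == other)
                r++;  /* swallow the second half of a two-character break */
            s[w++] = '\n';
        } else {
            s[w++] = c;
        }
    }
    s[w] = '\0';
    return w;
}
```

After normalizing, a paragraph boundary is simply two consecutive LF characters, whichever platform produced the text.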
The below example should work with Unicode strings but it doesn't.
CFStringRef aString = CFSTR("one"); // in real life this is a Unicode string
CFStringRef formatString = CFSTR("This is %s example"); // also tried %S but without success
CFStringRef resultString = CFStringCreateWithFormat(NULL, NULL, formatString, aString);
// Here I should have a valid sentence in resultString, but the actual result looks as if aString contained garbage.
Use %@ if you want to include a CFStringRef via CFStringCreateWithFormat.
See the Format Specifiers section of Strings Programming Guide for Core Foundation.
%@ is for Objective-C objects or CFTypeRef objects (CFStringRef is compatible with CFTypeRef).
%s is for a null-terminated array of 8-bit unsigned characters (i.e. normal C strings).
%S is for a null-terminated array of 16-bit Unicode characters.
A CFStringRef object is not the same as “a null-terminated array of 16-bit Unicode characters”.
As an answer to the comment in the other answer, I would recommend the poster to
generate a UTF8 string in a portable way into char*
and, at the last minute, convert it to CFString using CFStringCreateWithCString with kCFStringEncodingUTF8 as the encoding.
Please, please do not use %s in CFStringCreateWithFormat. Please do not rely on the "system encoding", which is MacRoman in Western European environments, but not in other languages. The concept of the system encoding is inherently brain-dead, especially in East Asian environments (which I came from), where even characters inside the ASCII range (below 127!) are modified. All hell breaks loose if you rely on the "system encoding". Fortunately, since 10.4, all of the methods which use the "system encoding" are deprecated, except %s.
I'm sorry I write this much for this small topic, but it was a real pity a few years ago when there were many nice apps which didn't work on Japanese/Korean Macs because of just this "system encoding." Please refer to this detailed explanation which I wrote a few years ago, if you're interested.