How do I specify directionality of ambiguous characters to IDWriteTextLayout?

Some characters have ambiguous directionality, like whitespace and punctuation marks. This can lead to layout situations where there doesn't appear to be a single correct layout without additional data to resolve the ambiguity. Consider this text:
\u05e9\u05e0\u05d1\u05d2abcd!
That's four Hebrew characters (unambiguously right-to-left), four English characters (unambiguously left-to-right), and one punctuation mark (ambiguous). If I lay out that string in an IDWriteTextLayout with DWRITE_READING_DIRECTION_RIGHT_TO_LEFT, I get the following:
The punctuation mark appears to be treated as a right-to-left character starting a new right-to-left block to the left of the English, which seems perfectly reasonable, especially considering that right-to-left was the specified reading direction. However, it's also entirely reasonable to expect the punctuation mark to be treated as a left-to-right character associated with the embedded left-to-right English text, which would mean it should appear to the right of the 'd'.
My app knows exactly how it wants this character to be treated. How do I pass that data to IDWriteTextLayout to resolve this ambiguity?
I found the SetLocaleName method and thought that it must be the answer, but I can't seem to get it to affect the result at all. I also found the localeName parameter when creating an IDWriteTextFormat (which is then used to create the IDWriteTextLayout).
If my goal is for this to generally be Hebrew text with a string of embedded US English, I would think I'd want to use locale he on the IDWriteTextFormat and then use SetLocaleName to override that with locale en-US on character range [4-9]. However, doing so has no effect. In fact, I can't get any combination of locales used in those places to have any effect on the layout at all, whether I restrict them to a subrange or apply them to the entire string.
Am I wrong in thinking that these APIs should serve this purpose? If so, what APIs should I be using? Or is there really no way to tell IDWriteTextLayout to resolve this ambiguity differently? Am I maybe using the APIs wrong? Here is the test code I'm using to create this IDWriteTextLayout:
TestTextRenderer::TestTextRenderer(const std::shared_ptr<DX::DeviceResources>& deviceResources) :
    m_deviceResources(deviceResources),
    m_text(L"\u05e9\u05e0\u05d1\u05d2abcd!"),
    m_readingDirection(DWRITE_READING_DIRECTION_RIGHT_TO_LEFT),
    m_formatLocale(L"en-US"),
    m_layoutLocale(L"en-US")
{
    ComPtr<IDWriteTextFormat> textFormat;
    DX::ThrowIfFailed(
        m_deviceResources->GetDWriteFactory()->CreateTextFormat(
            L"Segoe UI",
            nullptr,
            DWRITE_FONT_WEIGHT_MEDIUM,
            DWRITE_FONT_STYLE_NORMAL,
            DWRITE_FONT_STRETCH_NORMAL,
            24.0f,
            m_formatLocale.c_str(),
            &textFormat
        )
    );

    DX::ThrowIfFailed(textFormat->SetReadingDirection(m_readingDirection));

    DX::ThrowIfFailed(
        m_deviceResources->GetDWriteFactory()->CreateTextLayout(
            m_text.c_str(),
            (uint32) m_text.length(),
            textFormat.Get(),
            250.0f,
            100.0f,
            &m_textLayout
        )
    );

    // DWRITE_TEXT_RANGE's fields are UINT32, so narrow size_t explicitly.
    DWRITE_TEXT_RANGE all{ 0u, static_cast<UINT32>(m_text.size()) };
    DX::ThrowIfFailed(m_textLayout->SetLocaleName(m_layoutLocale.c_str(), all));

    DX::ThrowIfFailed(m_deviceResources->GetD2DFactory()->CreateDrawingStateBlock(&m_stateBlock));
    CreateDeviceDependentResources();
}

I don't think there's any ambiguity from the Unicode BiDi algorithm's point of view. The initial direction set on the IDWriteTextFormat or IDWriteTextLayout is crucial, but after that, run directions are derived strictly from the codepoints.
Setting the locale won't change direction, but it can affect shaping; the end result depends on the particular features the run's font has.
I think you can accomplish the abcd! ordering you want by using LRE/PDF controls around that part of the text.
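For example, a minimal sketch against the test code above (the rest of the constructor is unchanged): U+202A is LRE and U+202C is PDF, so the '!' resolves against the embedded left-to-right run instead of the outer right-to-left paragraph:

// LRE ... PDF: embed "abcd!" as an explicit left-to-right run.
m_text(L"\u05e9\u05e0\u05d1\u05d2\u202Aabcd!\u202C"),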

Related

GS1-128 barcode with ZPL does not put the AI in ()

I was expecting this command
^FO15,240^BY3,2:1^BCN,100,Y,N,Y,^FD>:>842011118888^FS
to generate a
(420) 11118888
interpretation line; instead it generates
~n42011118888
Does anyone have an idea how to generate the expected output?
TIA!
Joey
If the firmware is up to date, D mode can be used.
^BCo,h,f,g,e,m
^XA
^FO15,240
^BY3,2:1
^BCN,100,Y,N,Y,D
^FD(420)11118888^FS
^XZ
D = UCC/EAN Mode (x.11.x and newer firmware)
This allows dealing with UCC/EAN with and without chained application identifiers. The code starts in the appropriate subset followed by FNC1 to indicate a UCC/EAN 128 bar code. The printer automatically strips out parentheses and spaces for encoding, but prints them in the human-readable section. The printer automatically determines if a check digit is required, calculates it, and prints it. It automatically sizes the human-readable line.
The ^BC command's "interpretation line" feature does not support auto-insertion of the parentheses. (I think it's safe to assume this is partly because it has no way of determining what your data identifier is by just looking at the data provided - it could be 420, could be 4, could be any other portion of the data starting from the first character.)
My recommendation is that you create a separate text field which handles the logic for the parentheses, and place it just above or below the barcode itself. This is the way I've always approached these in the past - I prefer this method because I have direct control over the font, font size, and formatting of the interpretation line.
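If you go that route, a sketch in ZPL (the field positions and font size are placeholders; the barcode data is copied verbatim from the question, with the built-in interpretation line switched off):
^XA
^FO15,200^A0N,30,30^FD(420) 11118888^FS
^FO15,240^BY3,2:1
^BCN,100,N,N,Y
^FD>:>842011118888^FS
^XZ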

Why do xterm's docs call ' ' a control character?

I'm writing a parser for ANSI escape codes using xterm's docs as a guideline. Under the list of single character functions, they include:
SP Space.
Now, for most of the single character functions, I understand the purpose: BEL, for example, is going to require some special help from your terminal emulator to process, and TAB is likely to be involved in autocompletion rather than being printed as a normal character.
I can't imagine any situation where SP would need to be treated as anything other than a literal space character, so I'm considering dropping the SP control code from my parser. Would I risk anything by doing so? Is there a use for SP in the console that I'm not aware of?
Space isn't a "control" character. In ASCII, the control characters are codes 0 through 31 (space is 32) and 127 (DEL). The POSIX locale uses the same data, not coincidentally.
They are called control characters because they allow the host (computer) to control (tell) the terminal to perform functions rather than simply print text.
A space is actually "printing" in this regard because, like all of the other printable ASCII characters, it advances the carriage position by one column. In the C language, of course, a space is treated as non-graphic, which is a different shade of meaning: "graphic" characters are visible.
In contrast, a TAB requires the terminal to do something special: move the carriage position by an amount that depends on where it happens to be at the moment.
"Carriage position" of course refers to printing terminals (such as those on which Unix was originally developed), or typewriters. The "carriage" (noun) is the mechanism which moved left/right to allow the terminal (or typewriter) to print at different positions along the line. "Carriage controls" in turn refer to the control characters which move the carriage left and right (other than as a side-effect of printing individual characters). It's obvious if you have ever used a typewriter...
In XTerm Control Sequences, SP is shown for clarity, so that the name can be reused in other places, e.g., where a 32 is actually part of a control sequence. That wording was added in patch #25 to support the description of the controls S7C1T, S8C1T, and DECSCL (setting ANSI conformance level), none of which fall within ECMA-48.
A quick check shows 8 control sequences containing a space, which happens to be a valid intermediate byte per ECMA-48. (The semicolon also appears in control sequences, but being visually distinct it does not require a name in the descriptions; you might find the PDF clearer than the HTML.) None of those sequences are used in the obscure sense referred to in ECMA-48:
ECMA 48 section 6.1.1 is talking about overstriking one character on another to render a mixture of the two. This is very rare in video terminals, but assumed in most printing devices. The closest to this in a terminfo description might be ul (underline character overstrikes), and reviewing the few possibilities, some of those appear to be incorrect. xterm doesn't do that.
ECMA 48 section 8.3.140 in its comment about "character escapement" is referring to proportional fonts or variable-width character pitch (again, very rare in video terminals, but implemented in some printing devices). There are a few terminfo capabilities referring to pitch, and all of those are marked as "printer support". ncurses has one entry (att5310) using the cpi capability.
So: if you are referring to xterm's documentation, it is unlikely that you intend your parser for any other use than video terminals. But if you intend it to be more general, then reading about printers would be a good way to improve your application.
ECMA 48 sheds some light on this.
tl;dr:
Some terminals may choose to differentiate between erased characters and space characters.
In terminals with variable width fonts, SP can be considered a control character that introduces a configurable amount of horizontal spacing.
Neither is really relevant today, so you're entirely free to treat SP as just another character.
ECMA 48 section 6.1.1:
Depending on the implementation, there may or may not be a distinction between a character position in the erased state and a character position imaging SPACE
ECMA 48 section 8.3.140:
SSW is used to establish for subsequent text the character escapement associated with the character SPACE. The established escapement remains in effect until the next occurrence of SSW in the data stream or until it is reset to the default value by a subsequent occurrence of CARRIAGE RETURN/LINE FEED (CR/LF), CARRIAGE RETURN/FORM FEED (CR/FF), or of NEXT LINE (NEL) in the data stream, see annex C.

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。', where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How would I check for the presence of the string as well as the number that can change depending on the test that's running?
I should have added that I'm checking these strings on a web page, which will display in the relevant language depending on the location from which it is viewed. I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, keyed by a variable with a name like 'language', which lets the tests change the language on the fly. Validations would then look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but falls within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially using Unicode regular expressions in Ruby, as well as some experimenting with a tool like Rubular to test regexes.
In the case above, a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above, expecting one or more digits in the middle. (Note that it would fail if no digits appear; if you want zero or more digits, you would need a * in place of the +.)
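Putting that together with the Watir validation style above (a sketch; the element id and the language variable are placeholders):

WAIT_TIME_REGEX = {
  'en-US' => /Waiting time is \d+ minutes\./,
  'ja'    => /待ち時間は\d+分です。/
}

# Match against the localized pattern for whichever language is under test.
b.div(:id => "wait-time-message").text.should match(WAIT_TIME_REGEX[language])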
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
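A sketch of that idea in Ruby (the message key and format strings are hypothetical):

MESSAGES = {
  :waiting_time_in_minutes => {
    'en-US' => 'Waiting time is %d minutes.',
    'ja'    => '待ち時間は%d分です。'
  }
}

# Render an intermediate form like [:waiting_time_in_minutes, 10]
# into the final localized text.
def render(message, locale)
  key, *args = message
  MESSAGES[key][locale] % args
end

render([:waiting_time_in_minutes, 10], 'ja')  # => "待ち時間は10分です。"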
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose the "Unit Separator" (US), ASCII 31 (0x1F).
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded in ASCII:
ASCII 28 (0x1C) File Separator - used to indicate separation between files in a data input stream.
ASCII 29 (0x1D) Group Separator - used to indicate separation between tables (called groups back then) in a data input stream.
ASCII 30 (0x1E) Record Separator - used to indicate separation between records within a table (within a group). These roughly map to tuples in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a small "US" in a single glyph), but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
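For instance, in C# (a minimal sketch, assuming the fields themselves never contain 0x1F):

string[] fields = { "alpha", "beta", "gamma" };

// Join the fields with the Unit Separator (0x1F)...
string record = string.Join("\u001f", fields);

// ...and split the record back apart later.
string[] roundTrip = record.Split('\u001f');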
Assuming for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice, get a bigger data set. It won't take much time to write, and you'll get the answer that's best for you.
The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice, but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
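The suggested character count is only a few lines of C# (a sketch; the sample file name is a placeholder):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class DelimiterScan
{
    static void Main()
    {
        // Record every character below 128 that occurs in the sample data...
        var used = new HashSet<char>(
            File.ReadAllText("sample.txt").Where(c => c < 128));

        // ...then list the printable ASCII characters that never occur.
        var candidates = Enumerable.Range(32, 95).Select(i => (char)i)
                                   .Where(c => !used.Contains(c));
        Console.WriteLine(string.Join(" ", candidates));
    }
}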
When using different languages, this symbol: ¬
proved to be the best. However, I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough, the ASCII table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about using a CSV-style format? Characters can be escaped in a standard CSV format, and there are already a lot of parsers written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
Say you want to concatenate str1, str2 and str3. What I do is escape '#' first and then the delimiter '|', so the delimiter can never appear in the data:

string delimitedStr =
    str1.Replace("#", "#a").Replace("|", "#p") + "|" +
    str2.Replace("#", "#a").Replace("|", "#p") + "|" +
    str3.Replace("#", "#a").Replace("|", "#p");

Then, to retrieve the originals, split and undo the replacements in reverse order:

string[] splitStr = delimitedStr.Split('|');
str1 = splitStr[0].Replace("#p", "|").Replace("#a", "#");
str2 = splitStr[1].Replace("#p", "|").Replace("#a", "#");
str3 = splitStr[2].Replace("#p", "|").Replace("#a", "#");

Note: the order of the replaces is important. It's unbreakable and easy to implement.
Pipe for the win! |
We use ASCII 0x7F (DEL), which is pseudo-printable and hardly ever comes up in regular usage.
Well, it's going to depend on the nature of your text to some extent, but a vertical bar (0x7C) doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then loop over the file checking for the delimiter you want, and if it exists, double the delimiter string until the file no longer has a match. It doesn't matter if there are similar strings, because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64-encode the whole thing. Then you don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 character set.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside them without breaking the structure.
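A sketch of the Base64 approach in C# (the item list is a placeholder):

using System;
using System.Linq;
using System.Text;

class Base64Pack
{
    static void Main()
    {
        string[] items = { "one|two", "three#four", "五六" };

        // Base64 output never contains '|', so it is safe as a delimiter.
        string packed = string.Join("|", items.Select(s =>
            Convert.ToBase64String(Encoding.UTF8.GetBytes(s))));

        string[] unpacked = packed.Split('|').Select(s =>
            Encoding.UTF8.GetString(Convert.FromBase64String(s))).ToArray();
    }
}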
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable character works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used instead: you don't even have to scan the file, you pull records from it by offset. This is how databases do some of their storage, but they also have to manage the space between records, and it introduces the problem of a maximum data element width. (In the old days, an index attached a header defining the width and data type of each element; later, compression with character remapping was introduced, which can shrink a text file to about 1/8 its size in transmission. Variable-length character encoding for the win.)
Make it dynamic :)
Announce your control characters in the file header, for example:
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
I have implemented something similar: a "plaintar" text container format, to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages.
see https://github.com/milahu/live-diff-html-editor

Putting spaces back into a string of text with unreliable space information

I need to parse some text from PDFs, but the PDF formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and work with a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into it by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the PDF files are nicely formatted and some are not. An example of text mixed with positioning data:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
and if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks, so if I print each string as it is returned, it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination, it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys; just getting to the point where I am asking the question clearly helped me answer it!
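Based on that observation, a minimal sketch in Ruby (assuming the callbacks yield alternating strings and numeric offsets, and that -300 is the right threshold for these documents):

# tokens is the flattened callback output, e.g.:
#   ["The", -571.3, "neural", -573.7, "system", ...]
def join_tokens(tokens, space_threshold = -300)
  text = ""
  tokens.each do |t|
    if t.is_a?(Numeric)
      text << " " if t < space_threshold  # big negative offset = real space
    else
      text << t
    end
  end
  text
end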
Hmmmm... I'd have to say that guessing is never a good idea. Looking at the problem's root cause and solving that is the answer; anything else is a kludge.
If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to reliably space the text, so the data is there somewhere; you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option, really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word, etc.) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined, whereas English is somewhat woolly.
Why not go down the route of existing solutions like ps2ascii on Linux: call it from your Ruby and pick up the result.
PDF doesn't only store spaces as space characters; it also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF Reference (the big PDF at the bottom of the site); Chapter 9, "Text", should be what you're looking for.
EDIT: After reading your comment on Lazarus' answer, this doesn't seem to be what you're looking for. I think you should get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because, for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it must be "me". The same goes for the next word, which could be "a", "an", or "and"; only "and" makes sense.
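A sketch of that recursion in Ruby (the word list is a placeholder; a real dictionary file would stand in for it):

require 'set'

WORDS = %w[me mean a an and you].to_set  # placeholder dictionary

# Returns the words of `text` if it can be fully segmented, else nil.
def segment(text)
  return [] if text.empty?
  (1..text.length).each do |len|
    word = text[0, len]
    next unless WORDS.include?(word)
    rest = segment(text[len..-1])
    return [word] + rest if rest
  end
  nil
end

segment("meandyou")  # => ["me", "and", "you"]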
If it were me, I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or some kind of PDF-to-HTML or PDF-to-text conversion tool.
