DT_WORDBREAK: list of word break symbols - winapi

I use DT_WORDBREAK flag when I call DrawTextEx. About this flag MSDN says:
Lines are automatically broken between words if a word extends past
the edge of the rectangle specified by the lprc parameter. A carriage
return-line feed sequence also breaks the line.
But I cannot find "official" list of symbols that are used as word break symbols. Is it exist?

If you get the TEXTMETRICs for the font you're using, it corresponds to the tmBreakChar field.
For any Latin font, this is almost certainly just the plain old space character (Unicode U+0020 SPACE or ASCII 32).
I don't think DrawTextEx does anything fancier. You'd have to use a more advanced API to get more sophisticated behavior such as breaking after hyphens, soft-hyphens, other kinds of spaces, etc.

Related

ABCPDF.Net AddText Control hyphenation

I'm using ABCPDF.net for generating PDF Pages. We've got a problem with the hyphenation system.
For example if we add a text with long words using
doc.AddText("This is a Verylongwordwhichdoesntfit");
and the Rect is too small, we get:
this is a verylongwo
rdwhichdoesntfit.
My Question now is:
Can i control where it starts a new line. to have it break between long and word.
And can i tell it to use a - before the break like this?
this is a verylongwo-
rdwhichdoesntfit.
Thanks a lot.
Details in the documentation here:
http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm
Firstly, with .AddText() there is no possibility of hyphenation at all. You'd have to switch to .AddHtml().
Secondly, no, abcpdf has no intelligence about hyphenating at all; it can be told to break lines after certain characters (default is space), but it has no knowledge of English words or syllables.
See http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm#stylerun (search for canBreakAfter at that link)
If you're able to edit your text, you can use soft hyphen characters
http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm#stylerun, last line of the "Chars" section
If you require fine control over hyphenation you can make use of the soft hyphen character – ­. This character is invisible and indicates a point at which a chunk of text may reasonably be broken.
For example, you'd use this command, and it might break at any of the places where the ­ appears:
doc.AddHtml("This is a Very­long­word­which­doesnt­fit");
But even this won't add the visible hyphens at the break, I don't think.

Why do xterm's docs call ' ' a control character?

I'm writing a parser for ANSI escape codes using xterm's docs as a guideline. Under the list of single character functions, they include:
SP Space.
Now, for most of the single character functions, I understand the purpose: BEL, for example, is going to require some special help from your terminal emulator to process, and TAB is likely to be involved in autocompletion rather than being printed as a normal character.
I can't imagine any situation where SP would need to be treated as anything other than a literal space character, so I'm considering dropping the SP control code from my parser. Would I risk anything by doing so? Is there a use for SP in the console that I'm not aware of?
Space isn't a "control" character. In ASCII, the control characters are codes 0 to 31 (space is 32), and 127 (DEL). The POSIX locale uses the same data, not coincidentally.
They are called control characters, because they allow the host (computer) to control (tell) the terminal to perform functions rather than simply print text:
A space is actually "printing" in this regard because (like all of the other ASCII characters), it advances the carriage position by one column. In the C language of course, a space is treated as non-graphic, which is a different shade of meaning. "Graphic" characters are visible.
In contrast, a TAB requires the terminal to do something special: move the carriage position by an amount that depends on where it happens to be at the moment.
"Carriage position" of course refers to printing terminals (such as those on which Unix was originally developed), or typewriters. The "carriage" (noun) is the mechanism which moved left/right to allow the terminal (or typewriter) to print at different positions along the line. "Carriage controls" in turn refer to the control characters which move the carriage left and right (other than as a side-effect of printing individual characters). It's obvious if you have ever used a typewriter...
In XTerm Control Sequences, SP is shown for clarity (to be able to reuse that name in other places, e.g., where a 32 is actually part of a control sequence). That wording was added in patch #25 to support the description of the group of controls S7C1T, S8C1T, and DECSCL — setting ANSI conformance level, none of which fall within ECMA-48.
A quick check shows 8 control sequences containing a space (which happens to be a valid intermediate byte, per ECMA-48, just like semicolon, which is visually distinct and does not require a name in the control sequences descriptions — you might find the PDF clearer than the HTML). None of those sequences are used in the obscure sense referred to in ECMA-48:
ECMA 48 section 6.1.1 is talking about overstriking one character on another to render a mixture of the two. This is very rare in video terminals, but assumed in most printing devices. The closest to this in a terminfo description might be ul (underline character overstrikes), and reviewing the few possibilities, some of those appear to be incorrect. xterm doesn't do that.
ECMA 48 section 8.3.140 in its comment about "character escapement" is referring to proportional fonts or variable-width character pitch (again, very rare in video terminals, but implemented in some printing devices). There are a few terminfo capabilities referring to pitch, and all of those are marked as "printer support". ncurses has one entry (att5310) using the cpi capability.
So: if you are referring to xterm's documentation, it is unlikely that you intend your parser for any other use than for video terminals. But if you intend it to be more general, then reading about printers would be a good way to improve your application.
ECMA 48 sheds some light on this.
tl;dr:
Some terminals may choose to differentiate between erased characters and space characters.
In terminals with variable width fonts, SP can be considered a control character that introduces a configurable amount of horizontal spacing.
Neither is really relevant today, so you're entirely free to just treat as just another character.
ECMA 48 section 6.1.1:
Depending on the implementation, there may or may not be a distinction between a character position in
the erased state and a character position imaging SPACE
ECMA 48 section 8.3.140:
SSW is used to establish for subsequent text the character escapement associated with the character
SPACE. The established escapement remains in effect until the next occurrence of SSW in the data
stream or until it is reset to the default value by a subsequent occurrence of CARRIAGE RETURN/LINE
FEED (CR/LF), CARRIAGE RETURN/FORM FEED (CR/FF), or of NEXT LINE (NEL) in the data
stream, see annex C.

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it matches all headings which does not contain chars as Č, Š, Ž(slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I get utf8?
You are right in that when you define the ASCII range A-Z, the match is made literally only for those characters. This is to do with the history of characters on computers, more and more characters have been added over time, and they are not always structured in an encoding in ways that are easy to use.
You could make a larger character class that matches the slovenian characters you need, by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further, for instance it would not match the heading "I AM A HEADING" due to the match insisting each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
You can use unicode upper case letter:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b
RegEx Demo

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
☂☃☽☀︎☔︎
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed ☔︎ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font can distinguish the two styles, the following might be the emoji version: ☔︉ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variant selector in PS1 with the "non-printing" escape sequence \[…\].
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

Ruby CGI.unescapeHTML generating strange charactors

I have backed up a bunch of markdown formatted comments into an XML document. This of course meant I needed to HTMLescape them. When I try to use CGI.unescapeHTML it adds a bunch of strange characters into the markup that do not render well in all browsers.
Specifically, it replaces two spaces with "\302\240 ", but not consistently. How do I get it to stop this behavior?
eg:
s = "I am seeing more and more <a href="http://github.com/aslakhellesoy/cucumber /tree/master">Cucumber</a> usage.  This is a good thing!  But I'm also seeing people who are not using regular expressions to their fullest.  Here are some quick regex tips to keep you features readable:
* `(?:a|an)` -- using a this construct you can group things wihout actually matching them.  I'm seeing a lot of steps that have unused params because someone needed a group but didn't know how to avoid capturing it&#x000A"
CGI.unescapeHTML s
# => "I am seeing more and more Cucumber usage.\302\240 This is a good thing!\302\240 But I'm..."
Those are non-breaking spaces. Read up on wikipedia.
In computer-based text processing and digital typesetting, a
non-breaking space, also known as a no-break space or
non-breakable space (NBSP), is a variant of the space character
that prevents an automatic line break (line wrap) at its position.
In certain formats (such as HTML), it also prevents the
“collapsing” of multiple consecutive whitespace characters into a
single space. The non-breaking space is also known as a hard space
or fixed space. In Unicode, it is encoded as U+00A0 no-break space
(HTML:   ).

Resources