I discovered this while using Ruby's printf, but it also applies to C's printf.
If you include ANSI colour escape codes in an output string, it messes up the alignment.
Ruby:
ruby-1.9.2-head > printf "%20s\n%20s\n", "\033[32mGreen\033[0m", "Green"
Green # 6 spaces to the left of this one
Green # correctly padded to 20 chars
=> nil
The same line in a C program produces the same output.
Is there any way to get printf (or something else) to align the output without counting the non-printing characters?
Is this a bug, or is there a good reason for it?
Update: Since printf can't be relied upon to align data when there are ANSI codes and wide chars involved, is there a best-practice way of lining up coloured tabular data in the console in Ruby?
It's not a bug: there's no way Ruby should know (at least within printf; it would be a different story for something like curses) that its stdout is going to a terminal that understands VT100 escape sequences.
If you're not adjusting background colours, something like this might be a better idea:
GREEN = "\033[32m"
NORMAL = "\033[0m"
printf "%s%20s%s\n", GREEN, "Green", NORMAL
I disagree with your characterization of '9 spaces after the green Green'. I use Perl rather than Ruby, but if I use a modification of your statement, printing a pipe symbol after the string, I get:
perl -e 'printf "%20s|\n%20s|\n", "\033[32mGreen\033[0m", "Green";'
Green|
Green|
This shows to me that the printf() statement counted 14 characters in the string, so it prepended 6 spaces to produce 20 characters right-aligned. However, the terminal swallowed 9 of those characters, interpreting them as colour changes. So, the output appeared 9 characters shorter than you wanted it to. However, the printf() did not print 9 blanks after the first 'Green'.
Regarding the best practices for aligned output (with colourization), I think you'll need to have each sized-and-aligned field surrounded by simple '%s' fields which deal with the colourization:
printf "%s%20.20s%s|%s%-10d%s|%s%12.12s%s|\n",
co_green, column_1_data, co_plain,
co_blue, column_2_data, co_plain,
co_red, column_3_data, co_plain;
Where, obviously, the co_XXXX variables (constants?) contain the escape sequences to switch to the named colour (and co_plain might be better as co_black). If it turns out that you don't need colourization on some field, you can use the empty string in place of the co_XXXX variables (or call it co_empty).
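For instance, a minimal Ruby sketch of that pattern, with made-up colour values and column data (and using the ANSI reset sequence for co_plain rather than literal black):
co_green = "\033[32m"   # ANSI green foreground
co_blue  = "\033[34m"   # ANSI blue foreground
co_red   = "\033[31m"   # ANSI red foreground
co_plain = "\033[0m"    # reset all attributes

column_1_data, column_2_data, column_3_data = "apples", 42, "in stock"

printf "%s%20.20s%s|%s%-10d%s|%s%12.12s%s|\n",
       co_green, column_1_data, co_plain,
       co_blue,  column_2_data, co_plain,
       co_red,   column_3_data, co_plain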
printf field width specifiers are not useful for aligning tabular data, interface elements, etc. Aside from the issue of control characters which you have already discovered, there are also nonspacing and double-width characters which your program will have to deal with if you don't want to limit things to legacy character encodings (which many users consider deprecated).
If you insist on using printf this way, you probably need to do something like:
printf("%*s\n%*s\n", bytestopad("\033[32mGreen\033[0m", 20), "\033[32mGreen\033[0m", bytestopad("Green", 20), "Green");
where bytestopad(s,n) is a function you write that computes how many total bytes are needed (string plus padding spaces) to result in the string s taking up n terminal columns. This would involve parsing escapes and processing multibyte characters and using a facility (like the POSIX wcwidth function) to lookup how many terminal columns each takes. Note the use of * in place of a constant field width in the printf format string. This allows you to pass an int argument to printf for runtime-variable field widths.
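A rough Ruby equivalent of that idea, sketched under the assumption that only SGR colour escapes appear in the string and that there are no double-width or combining characters (those would need wcwidth-style lookups, as noted above); pad_visible and ANSI_ESCAPE are names made up for this example:
ANSI_ESCAPE = /\e\[[0-9;]*m/   # matches SGR colour escapes only

# Hypothetical helper: left-pad `s` so it occupies `width` terminal columns,
# counting only the characters that will actually be displayed.
def pad_visible(s, width)
  visible = s.gsub(ANSI_ESCAPE, "").length
  " " * [width - visible, 0].max + s
end

puts pad_visible("\033[32mGreen\033[0m", 20)
puts pad_visible("Green", 20)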
I would separate out any escape sequences from actual text to avoid the whole matter.
# in Ruby
printf "%s%20s\n%s%20s\n", "\033[32m", "Green", "\033[0m", "Green"
or
/* In C */
printf("%s%20s\n%s%20s\n", "\033[32m", "Green", "\033[0m", "Green");
Since ANSI escape sequences are not part of either Ruby or C, neither thinks it needs to treat these characters specially, and rightfully so.
If you are going to be doing a lot of terminal color work, you should look into curses and ncurses, which provide functions for color changes that work across many different types of terminals. They also provide much, much more functionality, like text-based windows, function keys, and sometimes even mouse interaction.
Here's a solution I came up with recently. This allows you to use color("my string", :red) in a printf statement. I like using the same formatting string for headers and the data -- DRY. This makes that possible. Also, I use the rainbow gem to generate the color codes; it's not perfect but gets the job done. The CPAD hash contains two values for each color, corresponding to left and right padding, respectively. Naturally, this solution should be extended to facilitate other colors and modifiers such as bold and underline.
CPAD = {
  :default => [0, 2],
  :green   => [0, 3],
  :yellow  => [0, 2],
  :red     => [0, 1],
}

def color(text, color)
  "%*s%s%*s" % [CPAD[color][0], '', text.color(color), CPAD[color][1], '']
end
Example:
puts "%-10s %-10s %-10s %-10s" % [
color('apple', :red),
color('pear', :green),
color('banana', :yellow)
color('kiwi', :default)
]
This produces newlines:
%(https://api.foursquare.com/v2/venues/search
?ll=80.914207,%2030.328466&radius=200
&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
&intent=browse)
This produces spaces:
"https://api.foursquare.com/v2/venues/search
?ll=80.914207,%2030.328466&radius=200
&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
&intent=browse"
This produces one string:
"https://api.foursquare.com/v2/venues/search"\
"?ll=80.914207,%2030.328466&radius=200"\
"&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259"\
"&intent=browse"
When I want to split one string across multiple lines to make it easier to read on screen, is it preferred to use the escape character?
My IDE complains that I should use single quoted strings rather than double quoted strings since there is no interpolation.
Normally you'd put something like this on one line, readability be damned, because the alternatives are going to be problematic. There's no way of declaring a string with whitespace ignored, but you can do this:
url = %w[ https://api.foursquare.com/v2/venues/search
          ?ll=80.914207,%2030.328466&radius=200
          &v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259
          &intent=browse
        ].join
Where you explicitly remove the whitespace.
I'd actually suggest avoiding this whole mess by properly composing this URI:
uri = url("https://api.foursquare.com/v2/venues/search",
          ll: [ 80.914207, 30.328466 ],
          radius: 200,
          v: 20161201,
          m: 'foursquare',
          categoryId: '4d4b7105d754a06374d81259',
          intent: 'browse'
         )
Where you have some kind of helper function that properly encodes that using URI or other tools. By keeping your parameters as data, not as encoded strings, for as long as possible you make it easier to spot bugs as well as make last-second changes to them.
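A minimal sketch of what such a helper might look like, assuming the url name used above and Ruby's standard URI library (the parameter list is trimmed for brevity, and ll is passed here as a pre-joined string rather than an array):
require 'uri'

# Hypothetical helper: build a URL from a base and a hash of query parameters.
def url(base, params)
  uri = URI(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

url("https://api.foursquare.com/v2/venues/search",
    ll: "80.914207,30.328466",
    radius: 200,
    intent: 'browse')
# => "https://api.foursquare.com/v2/venues/search?ll=80.914207%2C30.328466&radius=200&intent=browse"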
The answer by @tadman definitely suggests the proper way to do it; I'll post another approach just for the sake of diversity:
query = "https://api.foursquare.com/v2/venues/search"
"?ll=80.914207,%2030.328466&radius=200"
"&v=20161201&m=foursquare&categoryId=4d4b7105d754a06374d81259"
"&intent=browse"
Yes, without any visible concatenation: four strings in quotes, one after another in a row. This example won't work in irb/pry (due to its REPL nature), but the above is the most efficient way to concatenate strings in Ruby without producing any intermediate results.
Contrived example to test in pry/irb:
value = "a" "b" "c" "d"
My question is similar to this one, but not really.
The issue is that I have variables in my script that will echo/printf control characters directly next to the previous one. Unfortunately I have to put spaces between the variables or everything gets misinterpreted, but that's not going to work either, as I can't have spaces between them.
str="25 cents"
one=1
two=2
printf "\x3${one},${two}${str}\x30"\
(without spaces this string messes up)
printf "\x3${one},${two}%s\x30" "${str}" # outputs "5 cents"
So it ends up being either " 25 cents " (wrong), or "5 cents" (wrong x 2)... It should be:
25 cents
I've tried just about everything -- escaping the variables, putting them in quotes -- and no luck. Evidently there's a correct way to handle this that I'm unaware of, so any help is great - thanks.
If what you are trying to do is insert mIRC colour codes into a string -- and you would have made it easier to be helped if you had said so -- then you need to be aware of two things:
The C-style hexadecimal escapes interpreted by GNU printf have the format \x followed by two hexadecimal digits. (You can use just one digit, but only if the next character is not a hexadecimal digit, so it's better to think of it as always being two digits.) A control-C (character code 3) is written \x03. \x30 through \x39 are the character codes for the digits 0 through 9. The translation of the escape code is done by printf, not by the shell, so parameter substitution happens first. So if the value of $one is 1, printf "\x3${one}" will be expanded to printf "\x31" by the shell, and then printf will print the digit 1. I presume that is not what you want, since there are obviously much less round-about ways to insert the value of a variable, which don't limit the variable to be a single decimal digit.
Not all printf implementations handle hexadecimal escapes, and not all shells have a built-in printf. So while you can use \x03 with bash, you might find that it is not portable. All printf implementations should handle octal escapes, though, and 3 is still 3 in octal, but now you need three digits: \003.
The mIRC colour codes have the form control-C followed by up to two numbers separated by a comma. These numbers have a maximum of two digits, and if the next character after the colour code is a digit, you must use the two-digit form. (Coincidentally similar to the hex escape codes above, but it is truly just a coincidence.) So if you wanted the text 25 with foreground colour 1 and background colour 2, you would need to send ^C1,0225^C; if you sent ^C1,225^C, that would be interpreted as foreground colour 1 and background colour 22 (which is not a valid colour code), with the text being 5.
This is mentioned in the mIRC documentation linked above:
Note: if you want to color text that begins with numbers, this syntax requires that you specify the color value as two digits.
So a better printf invocation might be:
printf "\003%02d,%02d%s\003" "$one" "$two" "$str"
Note: It is, of course, possible that my guess about what string you are seeking to produce is completely wrong; it is just a guess based on an off-hand comment which was not deleted. If so, and if you are serious about getting your question answered, I strongly suggest you provide a clearer explanation of precisely what byte-string you are attempting to produce with your printf statement.
I'm echoing the serial port input to a CRichEditCtrl, one char at a time as it arrives. The problem I've come across is that when I receive '\r' followed by '\n' I end up two lines further down the page, not one. Debugging it a little, I realise that sending "\r\n" results in (what I'd consider to be) the correct single new-line insertion, but sending '\r' and '\n' separately yields two new lines.
Simple example, where m_Output is obviously a rich edit control variable:
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("X\r\n"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("Y"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("\r"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("\n"));
m_Output.SetSel(-1, -1);
m_Output.ReplaceSel(_T("Z"));
The output from the above is:
X
Y

Z
Why the extra line?!?!
I figure it's maybe something about the behaviour of SetSel()/ReplaceSel(), but it doesn't insert lines between regular characters in this way, e.g. if I send 'a' followed by 'b' the output is simply "ab" ...
The various versions of the RichEdit control are documented as using different characters for paragraph breaks; RichEdit 1.0 used \r\n, RichEdit 2.0 is documented as using \r and RichEdit 3.0 (and presumably higher) can use both.
What this looks like though is that the control is actually seeing a solitary \n as a break as well (i.e. it sounds like it accepts \r, \n and \r\n as all representing a single break). This doesn't match the documentation but then again it wouldn't be the first time Microsoft documentation was somewhat inaccurate.
Internally the control probably doesn't store the actual break character verbatim, so when you feed it a \r and then separately a \n it isn't able to join them together into a single break.
It sounds like the easiest solution for you would be to filter out \n characters rather than sending them to the control. That way all the control will see are the \r characters and you'll only end up with a single break in the text.
I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?
In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
To use a Unicode character in Ruby, use the "\uXXXX" escape, where XXXX is the code point in hexadecimal. See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/
If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.
There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {}                # hash to store mappings in
s.split("").each do |c|       # loop through each "character"
  begin
    c.encode("ASCII")         # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError  # but if that fails
    encodings[c] = $!.error_char.dump        # capture a dump, mapped to the source character
  end
end

# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443".
̆ encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.
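For example, the following should be reproducible in any UTF-8 Ruby source file or irb session:
puts "\u2713"                        # ✓
puts "\u{1f60d}"                     # 😍
puts "\u{1F64B 1F3FE}"               # 🙋🏾  (base character plus modifier)
puts "\u{20AC A3 A5 24}" == "€£¥$"   # true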
You can use Unicode characters directly if you just add # encoding: UTF-8 at the top of your file (UTF-8 is the default source encoding from Ruby 2.0 onwards). Then you can freely use ä, ǹ, ú, and so on in your source code.
Try this gem. It converts Unicode or non-ASCII punctuation and symbols to the nearest ASCII punctuation and symbols:
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
The gem uses the reference at https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.
For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a small "US" in a single glyph), but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
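To illustrate the idea, a minimal Ruby sketch (the field values are made up):
US = "\x1F"   # ASCII 31, Unit Separator

fields   = ["first item", "second item", "third, with a comma"]
packed   = fields.join(US)    # single string to store
unpacked = packed.split(US)   # => ["first item", "second item", "third, with a comma"]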
Assuming that for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice, get a bigger data set. It won't take much time to write, and you'll get the answer that's best for you.
The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
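One ready-made escaping scheme is to let a CSV library do the work; a minimal sketch using Ruby's standard csv library:
require 'csv'

fields = ["plain", "with, comma", "with \"quotes\""]
line   = fields.to_csv        # => "plain,\"with, comma\",\"with \"\"quotes\"\"\"\n"
back   = CSV.parse_line(line) # => ["plain", "with, comma", "with \"quotes\""]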
When using different languages, this symbol: ¬
proved to be the best. However, I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough, the ASCII table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about using a CSV-style format? Characters can be escaped in a standard CSV format, and there are already plenty of parsers written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this.
Say you want to concatenate str1, str2 and str3; what I do is:
delimitedStr = str1.Replace("#","#a").Replace("|","#p") + "|" +
               str2.Replace("#","#a").Replace("|","#p") + "|" +
               str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
Note: the order of the replaces is important.
It's unbreakable and easy to implement.
Pipe for the win! |
We use ASCII 0x7F, which is pseudo-printable and hardly ever comes up in regular usage.
Well, it's going to depend on the nature of your text to some extent, but a vertical bar (0x7C) doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then loop over the file checking for the delimiter you want, and if it exists, double the string until the file no longer has a match. It doesn't matter if there are similar strings, because your program will only look for exact delimiter matches.
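A minimal Ruby sketch of that doubling loop (the helper name is made up; the chosen delimiter would then need to be recorded, e.g. in a header, so the reader knows what to split on):
# Grow the delimiter until it no longer occurs anywhere in the text to be stored.
def unique_delimiter(text, base = "|")
  delim = base
  delim += base while text.include?(delim)
  delim
end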
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64-encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
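If you do go the Base64 route, here is a minimal Ruby sketch of the idea (any delimiter outside the Base64 alphabet would work; | is used here):
require 'base64'

fields = ["any text at all", "even | pipes and \x1F control characters"]

# strict_encode64 output only contains A-Z, a-z, 0-9, +, / and =,
# so | can never collide with field contents.
packed   = fields.map { |f| Base64.strict_encode64(f) }.join("|")
unpacked = packed.split("|").map { |f| Base64.strict_decode64(f) }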
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable char works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used: you don't even have to read the file sequentially, you're literally pulling from the file by reference. This is how databases do some storage, but they also manage the spaces between records and such, and it introduces the problem of a maximum data element width. (In the old days, an index attached a header which defined the width of each element and its data type; later they introduced compression by remapping chars, which allows a text file to get to about 1/8 the size in transmission. Variable-length char encoding for the win.)
Make it dynamic :)
Announce your control characters in the file header.
For example:
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
I have implemented something similar: a plaintar text container format, to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages.
See https://github.com/milahu/live-diff-html-editor