Ruby: print light, medium and dark shade characters

Hi, I would like to print the old DOS characters 176 to 178 (filled cursor with gradients), i.e. Unicode U+2591, U+2592 and U+2593 (light, medium and dark shade), in Ruby on a Windows console. How can I do it, please?
I tried this:
p "\u2592" #=> "\u2592"
p [176].pack('U*') #=> °

Don't use p; use print (or puts if you want a trailing newline). p displays things using #inspect, which gives you something you can copy and paste into source code, including quotation marks, etc. print and puts are the normal way to output text.
Also note that [176].pack('U*') builds the Unicode code point U+00B0 (the degree sign), not the DOS character 176, which is why you got °.
Assuming you have your encodings set up right in your program and console, print "\u2592" and similar should work fine. It can be tricky to set up a Windows console for Unicode, though, so you might want to look at some third-party console applications.
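As a minimal sketch of the difference between p and puts, and of how the Unicode shade characters relate to the old DOS codes: the snippet below assumes a UTF-8 source file and, for the byte conversion only, that the console uses code page 437 (an assumption; the question does not say which code page is active).

```ruby
# Light, medium and dark shade: U+2591, U+2592, U+2593.
SHADES = "\u2591\u2592\u2593"

puts SHADES   # writes the three characters followed by a newline
p SHADES      # writes the #inspect form, quotation marks included

# In DOS code page 437 the same characters are the single bytes 176..178.
# CP437 is an assumption here; check your console with the `chcp` command.
cp437 = SHADES.encode("IBM437")
p cp437.bytes # => [176, 177, 178]
```

Whether the characters actually show up correctly still depends on the console font and code page, which this sketch does not configure.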

Related

Cursor shifted after some spanish accent marks in RStudio editor

When editing lines of code in RStudio that contain Spanish accents (e.g. á, é...), the text I type appears one space before the cursor position. For example, in:
a <- tibble(b = c("01", "02", "03", "04", "05"),
c = c("Amazonas", "Áncash", "Apurímac","Arequipa", "Ayacucho"))
If I place the cursor after the c in "Apurímac" and type an "o", I get "Apurímaoc" instead of "Apurímaco".
I've seen this happen in lines with Spanish accents (e.g. á, é...) and only after the accented characters. Surprisingly, it doesn't seem to happen after capitalized accented characters, like Á in "Áncash". I've tried changing the font in RStudio settings as stated here, here and here with no luck. I suspect it might be related to copying from the clipboard, but I'm not sure about it. Though the code runs fine, it's quite annoying.
I'm running RStudio 1.4.1103 on macOS 11.4.
This occurs because RStudio's editor is not able to properly position the cursor in Unicode text that uses combining marks. The example in your case is í, which is made up of the code points:
LATIN SMALL LETTER I
COMBINING ACUTE ACCENT
See https://apps.timwhitlock.info/unicode/inspect?s=i%CC%81 for more details.
Compare this to the NFC normalization of that same character, í, which looks the same but is made up of a single code point:
LATIN SMALL LETTER I WITH ACUTE
See https://apps.timwhitlock.info/unicode/inspect?s=%C3%AD for more details.
Unfortunately, until this is resolved, the best solution is to use the NFC-normalized version of this character; that is, LATIN SMALL LETTER I WITH ACUTE. Alternatively, use a Unicode escape (e.g. "\u00ed") in place of that character.
See also: https://github.com/rstudio/rstudio/issues/8859
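To see the two forms side by side, here is a small sketch in Ruby (the language used for the other examples on this page, not R). It relies only on String#unicode_normalize from the standard library and shows that NFC turns the two-code-point sequence into the single precomposed character.

```ruby
decomposed = "i\u0301"   # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc)

p decomposed.codepoints  # => [105, 769]
p composed.codepoints    # => [237]
p composed == "\u00ed"   # => true
```

Running pasted text through a normalization step like this before it reaches the editor is one way to end up with the NFC form the answer recommends.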

Prawn with some emojis for ttf-font not rendering text correctly

I have a ruby script to generate a pdf document with some text. The text contains emojis in it.
The problem with the first line of text is that it prints the three emojis separated by something that looks like a cross, when they should be a single emoji (a family of three members).
The problem with the second line is that it just prints a square instead of the intended emoji (shush face).
I've tried with some other fonts but it still won't work. These are the fonts:
DejaVuSans
ipam
NotoSans-Medium
I can't find the problem
Is there anything missing?
Am I doing something wrong?
The gems are installed and the fonts are in the right place
require "prawn"
require "prawn/emoji"
require "prawn/measurement_extensions"
$pdf = Prawn::Document.new(:page_size => [200.send(:mm),200], :margin => 0)
$pdf.font "./resources/Montserrat-Medium.ttf"
st = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}".encode("UTF-8")
st2="\u{1F92B}".encode("UTF-8")
$pdf.draw_text st,:at => [10, 100]
$pdf.draw_text st2,:at => [10, 80]
$pdf.render_file "test.pdf"
It turns out Prawn doesn't know how to parse joined emojis (those formed by a set of simple emojis joined by \u200D). prawn/emoji is supposed to do that, but there is a bug in the regex used to identify emojis that causes the joined emojis to be drawn separately.
Also, the index and the image gallery it uses are somewhat outdated.
The solution is to replace #emoji_index.to_regexp in the class Drawer, in the prawn/emoji source code, with a regex that can recognize joined emojis, and to update the emoji gallery. After that, run the task that updates the index and you are good to go.
The fonts have nothing to do with it.
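To illustrate why the regex matters, here is a simplified sketch. The two patterns below are hypothetical and are not the ones used by prawn-emoji; the code point range is a rough stand-in for "an emoji". A pattern that matches single emojis splits the family into three parts, while one that also allows \u200D-joined continuations keeps it whole.

```ruby
# Simplified, illustrative patterns (NOT prawn-emoji's actual regex).
SINGLE_EMOJI = /[\u{1F300}-\u{1FAFF}]/
JOINED_EMOJI = /[\u{1F300}-\u{1FAFF}](?:\u200D[\u{1F300}-\u{1FAFF}])*/

family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}"

p family.scan(SINGLE_EMOJI).length # => 3
p family.scan(JOINED_EMOJI).length # => 1
```

A real implementation would need a far more complete pattern (variation selectors, skin tones, keycaps and so on), which is presumably why the library works from a generated index.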
I'm the creator of prawn-emoji.
Certainly prawn-emoji v2.1 or older can't draw joined-emojis like 👨‍👨‍👦 and 1️⃣.
https://github.com/hidakatsuya/prawn-emoji/issues/24
So today I released prawn-emoji v3.0. This release includes support for joined emoji like 👨‍👨‍👦 (ZWJ sequence) and 1️⃣ (combining sequence), and switches to Twemoji.
Please see below for further details.
https://github.com/hidakatsuya/prawn-emoji/blob/master/CHANGELOG.md
Please try to use prawn-emoji v3.0 if you'd like.
Hope this helps.
It does work. You can look up the character codes for deja vu sans.
You can also search for which fonts support which Unicode characters. If you are seeing an empty box with Montserrat-Medium, that means the font does not support that Unicode character, for example \u200D.
Here is a helpful link to search which fonts support that character: http://www.fileformat.info/info/unicode/char/200d/fontsupport.htm
Here is another link for the code \u{1F92B}, which is your shush emoji: http://www.fileformat.info/info/unicode/char/1F92B/fontsupport.htm
Neither DejaVuSans nor Montserrat-Medium supports it.
require 'prawn'
require 'prawn/emoji'
Prawn::Document.generate 'foo.pdf' do
  font "./resources/Montserrat-Medium.ttf"
  text "For Montserrat-Medium"
  text "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}".encode("UTF-8")
  text "\u{1F92B}"
  text " "
  font './resources/DejaVuSans.ttf'
  text " For DejaVuSans"
  text "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}".encode("UTF-8")
  text "\u{1F92B}"
end

Is it possible to separate STDOUT context by its colour?

I'm using the output of the excellent package icdiff (https://github.com/jeffkaufman/icdiff) to check for differences between updated iterations of files. I'd like to parse out just the significant differences though. From the package --help I can't see any in-built options (and for full disclosure I've 'cross posted' at the github issues page to see if it can be added or I've missed something).
This has got me wondering whether a hacky solution might be to parse out the lines by their colour, since they are also colour coded by 'severity of difference'. Is this at all possible in bash? (Alternative approaches are welcome too!)
Here's a sample of the output (I can only think to add a picture here, since the markup wouldn't show colour). I'd like to get just the lines where the whole line is solid red/green, for instance. Excuse some of the screen wrapping; my monitor isn't wide enough and the text is small enough already.
With GNU grep, for example, you can use
grep -Po $'\e\[31m\K.*(?=\e\[\d+m)'
to extract the text in red. Here \K drops everything matched so far from the reported match (much like a lookbehind), and (?=..) is a zero-length lookahead assertion.
You can grep on the ANSI escape sequences, e.g. (with 31 for red):
grep '^[\[31m' # make the escape character (^[) by typing ctrl+v ESC
But you need to make sure the output stays coloured when it is not sent to a terminal: many programs switch to black-and-white output when the output is not a terminal. You can check this with less, which will show you the escape sequences.
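The same idea can be sketched in Ruby. This is only an illustration under assumptions: it presumes that red text is introduced by exactly ESC[31m and ended by the next escape sequence, which may not hold if the tool emits combined codes such as bold plus colour. The helper name red_segments is made up for this example.

```ruby
# Matches ESC[31m, captures the text after it, and stops at the next
# SGR escape sequence. Lazy (.*?) so each red run is captured separately.
RED = /\e\[31m(.*?)\e\[\d*m/

# Hypothetical helper: returns every red-coloured run found in a line.
def red_segments(line)
  line.scan(RED).flatten
end

sample = "same \e[31mremoved text\e[0m same"
p red_segments(sample) # => ["removed text"]
```

To filter a whole stream, the helper could be applied to each line read from standard input, keeping only lines for which it returns a non-empty array.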

How to get display width of a string from Linux command line?

I am working on an AWK script that processes a text file line by line, formats them and stuffs them into an SVG file text field. The SVG takes care of text wrapping automatically, but I want to predict where each line will wrap. (I need some characters to repeat and extend close to the end of the line). I know the exact font, font size, and width of the text field.
Is there a standard utility in Linux or easily available in Ubuntu that will give a width in pixels or inches given a string, font, and font size?
For example:
get-width 'Nimbus Sans L' 18 "test string"
returns "x pixels"
You can do this with the ghostscript interpreter, assuming you have that and the fonts are set up correctly.
Here is the possibly mysterious incantation:
gs -dQUIET -sDEVICE=nullpage 2>/dev/null - \
<<<'18 /NimbusSanL-Regu findfont exch scalefont setfont
(test string) stringwidth pop =='
Using -dQUIET suppresses warnings about font substitution, which is probably not a good idea until you have some idea about how to name the fonts you're looking for.
ghostscript is not a layout engine, and you may find the measurement doesn't work with complicated bidirectional text, combining characters, or East Asian languages. (I tested it with a little Arabic, and it was OK, but no guarantees.) It does not kern, so it will normally produce measurements a little larger than a good layout engine, and possibly a lot larger if the font positions diacritics using kerning.
Finally, if your text includes unbalanced parentheses or backslashes, you'll need to escape them. I use the following:
"$(sed 's/[()\\]/\\&/g' <<<"$text")"
That's because Postscript strings are enclosed in (...) -- (test string) -- and are allowed to include balanced parentheses. Unbalanced parentheses will usually generate a syntax error, unless they are backslash-escaped.
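If the string is prepared in Ruby rather than in the shell, the same escaping can be sketched as below. The helper name ps_escape is made up for this example; it mirrors the sed expression above by putting a backslash in front of every parenthesis and backslash.

```ruby
# Hypothetical helper: backslash-escape the characters that are special
# inside a PostScript (...) string literal.
def ps_escape(text)
  text.gsub(/[()\\]/) { |ch| "\\#{ch}" }
end

puts ps_escape('width (of text') # prints: width \(of text
```

The block form of gsub is used so that the replacement is taken literally, which avoids a second layer of backslash interpretation in the replacement string.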
If you have access to inkscape:
FONT="Nimbus Sans L"
SIZE=18
STRING="test string"
inkscape --without-gui --query-id=id1 -W <(echo '<svg><text id="id1" style="font-size:'$SIZE'px;font-family:'$FONT';">'$STRING'</text></svg>') 2>/dev/null
Output (e.g.):
76.577344

Ruby extract arabic text from PDF

I usually use this code to extract text from PDFs:
require 'rubygems'
require 'pdf/reader'
filename = File.expand_path(File.dirname(__FILE__)) + "/myfile.pdf"
PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end
This time I want to parse an Arabic PDF, but, using this code, I get a bunch of weird characters. For example: ±πNuô ≠ö ¥πbËÊ ´Lö Ë«_°u«» ±GKIW √±U±Nr ËîUÅW √Ê ´bœ Ë≠w «∞LπLuŸ, ¥L
I have already read that coding: utf-8 is fine for Arabic, so, is there any solution?
The text in this PDF is not properly encoded: the relation between what appears on the screen and what character code it represents is not stored in this PDF. That's why you get 'random' text.
Also notable: the text appears in the correct order, but that is because the font characters are drawn mirrored and the text itself is also drawn mirrored, a typical hack-ish workaround to properly typeset Arabic using Quark XPress (there used to be an XTension (sp.?) that 'enabled' this).
As it seems this wrong encoding is actually defined as such inside the fonts ("Font uses built-in encoding", according to Acrobat Pro's "Inventory" function), you might be able to find a translation table between the characters you are reading and what they actually should be. Be aware that these tables may very well differ for each of the fonts in this document, so you have to check what font each of your text strings is using.
Addition
I did some further investigations, and they agree with your own, and Acrobat Pro's, findings. Your sample text appears like this:
/F1 1 Tf % set font and size "HGKECF+PHBagdad"
...
[ (´Mb ) -24.4 (¢'b¥b ) -24.4 («®{05}d«ØU¢Nr, ) -24.4 (Ë«ù´öÂ ) -24.4 (°LDU{03}&Nr.) ] TJ
Usually, font entries in a PDF contain a table that 'translates' into actual character codes. That is also true for this font (and all others):
<<
/Type /Font
/Subtype /Type1
/BaseFont /HGKECF+PHBagdad
/Encoding 66 0 R
/ToUnicode 69 0 R
>>
(only relevant entries listed). The /Encoding entry points to a simple list that maps indexes to character codes, and /ToUnicode to a more formal table, which essentially contains the same information. Both lists translate to the same text.
As you can see in the top image, the font contains Arabic glyphs (mirrored), but the code linked to these glyphs is not the correct one for Arabic. It's like the old "Symbol" font hack: type 'a' to get an alpha, 'b' for a beta, 'g' for a gamma: text on your screen appears to be "ɑβɣ" but in truth it says "abg".
Addition 2
See also this Adobe Forum thread: Arabic - ToUnicode Map incorrect?
Quote:
Arabic XT fonts are not Arabic fonts from the operating system point of view (MacOS or Windows). They use the Mac Roman encoding; the Arabic glyphs are placed in place of the Roman glyphs.
I tried to find a "correcting" encoding for your fonts but have thus far not been successful. If I could locate a translation table, it ought to be possible to exchange the existing /ToUnicode table with a corrected one, and you'd get the correct text when extracting. (Although it may be simpler to use the same table to change the text strings after extraction in your programming language of choice.)
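If such a table were found, applying it after extraction could look roughly like the sketch below. Everything in it is hypothetical: the two table entries are invented placeholders, not a real mapping for PHBagdad or any other font, and the helper name remap is made up. A real table would have to be built per font, as noted above.

```ruby
# HYPOTHETICAL table: placeholder entries for illustration only.
# They do NOT describe the real encoding of any font in this PDF.
PLACEHOLDER_TABLE = {
  "a" => "\u0627", # pretend "a" stands for ARABIC LETTER ALEF
  "b" => "\u0628"  # pretend "b" stands for ARABIC LETTER BEH
}.freeze

# Replace each extracted character using the table; characters without
# an entry are passed through unchanged.
def remap(text, table)
  text.each_char.map { |ch| table.fetch(ch, ch) }.join
end

p remap("ab!", PLACEHOLDER_TABLE) == "\u0627\u0628!" # => true
```

Because the text in this PDF is also drawn mirrored, the remapped strings might additionally need their character order checked before they read correctly.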

Resources