Render non english characters in asciidoctor-pdf - ruby

I am trying to write documentation with asciidoctor-pdf and I need to use characters like ă, â, î, ş, ţ. The PDF output is generated, but those characters are rendered as blanks. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc@example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
The result was the word Introducţie rendered as Introduc ie, followed by this warning:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set a different encoding configuration in Ruby?
Thank you.

I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter T with cedilla it is: &#355;
Check this table for the complete list:
List of Unicode characters
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
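For longer passages, the conversion to decimal entity references can be scripted rather than looked up by hand. A minimal Ruby sketch, assuming the input is UTF-8; the helper name to_decimal_entities is made up for illustration and is not part of Asciidoctor:

```ruby
# Hypothetical helper (not part of Asciidoctor): replace every non-ASCII
# character with its decimal entity reference, leaving ASCII untouched.
def to_decimal_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

# "\u0163" is the Latin small letter T with cedilla from the question.
puts to_decimal_entities("== Introduc\u0163ie")
```

This only rewrites the source text; whether the PDF then shows the glyph still depends on the font used for rendering.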

One possible way to write such special characters in titles is to declare them as attributes in the preamble of your AsciiDoc document, for example,
:t-cedil: &#355;
and to reference the attribute in the main text
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ

Related

Symbol # in variable cannot be handled

I got a CSV file from my front end as an XString, and after I convert it into a String it looks as follows:
In the next step I'm trying to perform SPLIT lv_string AT '##' INTO TABLE itab so I can get my data but it doesn't split anything, itab contains one line equal to lv_string.
If I try REPLACE '#' IN lv_string WITH space, lv_string doesn't change and sy-subrc is 4.
From my point of view I have this problem because the symbol # is used by SAP in this context as a symbol for non-printable symbols (that result from the conversion byte->string).
My question is: how may I use SPLIT/REPLACE with # in this case?
I also thought that I can change the SAP code page when converting XString to String but I already use the SAP code page 4110 (utf-8) and don't know a better alternative...
When you display a variable with the debugger, it displays the generic character # (U+0023) for all control characters which are not assigned a glyph ("non-printable symbols" as you say).
If the variable corresponds to the contents of a text file, and ## frequently occurs, there is a big chance that it's the combination of the control characters U+000D and U+000A which correspond to "newline" in Windows files.
In the backend debugger, you can check the hexadecimal values of those characters by clicking the button "Hexadezimal" (shown in your screenshot).
You may use the variable CL_ABAP_CHAR_UTILITIES=>CR_LF which contains those two control characters.

How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these. I'm not sure where I could use regexes to escape these odd characters.
Honestly, you are overcomplicating your validation.
I would recommend simplifying what you are trying to do. Start by asking yourself: is the part in quotes a critical part of your validation?
If it is, isolate it by checking error_str.include?("EMS")
If it is not, then you are probably doing too much work; only check for exactly what you need in the validation:
error_str.start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string with utf-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It also seems pretty likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Unicode      Windows-1252   Expected    Actual       UTF-8 bytes
code point   byte           character   characters
----------   ------------   ---------   ----------   -----------
U+201C       0x93           “           â€œ          %E2 %80 %9C
U+201D       0x94           ”           â€           %E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might want to have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see: 3 Steps to Fix Encoding Problems in Ruby and How to Get From Theyâ€™re to They’re.
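The mismatch in the table can be reproduced in a few lines. The following is a sketch using only core String methods; the variable names are illustrative, and it covers only U+201C, because the third UTF-8 byte of U+201D (0x9D) has no assigned character in Windows-1252 and may not round-trip this way:

```ruby
# "\u201C" is the left smart quote; its UTF-8 bytes are E2 80 9C.
smart_quote = "\u201C"

# Simulate the bug: reinterpret the UTF-8 bytes as Windows-1252 text.
garbled = smart_quote.dup.force_encoding("Windows-1252").encode("UTF-8")
puts garbled  # the three characters from the "Actual" column

# Undo it: write the characters back out as Windows-1252 bytes, then
# relabel those bytes as the UTF-8 they were all along.
repaired = garbled.encode("Windows-1252").force_encoding("UTF-8")
puts repaired == smart_quote
```

Note the difference between the two calls: force_encoding only changes the label on the bytes, while encode actually transcodes them.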

C# MVC3 and non-latin characters

I have my database results (áéíóúàâêô...) and when I display any of these characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use any rich text editor such as TinyMCE, all non-Latin characters come out like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery you can take advantage of the encoding functions built-in into it. For example:
$('<div/>').html('& #225;gil').html()
gives you "ágil" (notice that I added an extra space between the & and the # so that stackoverflow does not encode it, you won't need it)
This other question has more information about this.
HTML-encoding lost when attribute read from input field

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an excel document into a .txt file. There are a few characters that I assume mean something to excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors.). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could perhaps use http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters, which means your '?'s and rectangles could actually be unrecognized multibyte sequences.
If you want to handle the spreadsheet contents properly, I recommend that you first export the data to CSV using (Open|Libre)Office, choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
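If you are not worried about multibyte sequences, I find this regex handy (it replaces everything except ASCII letters, digits, hyphens and underscores, so spaces are replaced too):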
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
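If the goal is to keep legitimate accented text and drop only the unreadable parts, a gentler variant is possible. This is a sketch, assuming the file is meant to be UTF-8; clean_line is a made-up helper name, and String#scrub requires Ruby 2.1 or later:

```ruby
# Hypothetical helper: keep readable text, including non-ASCII letters,
# and remove only what cannot be displayed.
def clean_line(raw)
  # Treat the bytes as UTF-8 and replace invalid byte sequences with "?".
  text = raw.dup.force_encoding("UTF-8").scrub("?")
  # Drop control characters and other non-printable code points; keep tabs.
  text.gsub(/[^[:print:]\t]/, "")
end

# Sample input: "café", a NUL byte, " ok", and a stray 0xFF byte.
puts clean_line("caf\xC3\xA9\x00 ok\xFF".b)
```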

How do I correctly deal with non-breaking spaces using Nokogiri?

I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache inducing attempts.
Here is the HTML snippet in question:
<td>Amount 15,300 at dollars</td>
Note the change for the representation after I use Nokogiri:
<td>Amount 15,300 at dollars</td>
And outputting the inner_text:
Amount 15,300 at dollars
This is my base Nokogiri grab, I did try a few alternatives to solve but failed miserably:
doc = Nokogiri::HTML(open(url))
And then I do a doc.search for the item in question.
Note that if I look at the doc, the line shows up with the &#xA0; on that line.
Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.
Unless you really, really want to keep the &nbsp; notation, there shouldn't be a problem here.
A0 is the hex character code for a non-breaking space. As such, &#xA0; prints a non-breaking space, and is exactly equivalent to &nbsp;. &#160; does the same thing, too.
What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting it back to an HTML-friendly version of the text node, it represents the non-breaking space by its hex code, rather than taking the performance overhead of looking it up in an entity table, since it's equivalent anyway.
Assuming that Â was what you were seeing and wasn't just an issue with pasting into Stack Overflow, this is a text encoding issue: the output software (browser?) isn't in UTF-8 mode, so it doesn't know how to handle character code A0 and does the best it can. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
If you really, really want &nbsp;, use gsub to replace them in your final output. Otherwise, don't worry about it.
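The stray character has a simple origin: a non-breaking space encoded as UTF-8 is the two bytes C2 A0, and software that decodes those bytes as Latin-1 shows them as Â followed by a non-breaking space. A small sketch using only core String methods (the variable names are illustrative):

```ruby
nbsp = "\u00A0"  # the non-breaking space as a single UTF-8 character

# Its UTF-8 encoding is two bytes.
puts nbsp.bytes.map { |b| format("%02X", b) }.join(" ")  # C2 A0

# Simulate a consumer that wrongly assumes Latin-1 (ISO-8859-1).
misread = nbsp.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts misread.length       # 2: one character became two
puts misread.chars.first  # Â
```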
I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know. Just pass your string to this function and it will be "de-nbsp-fied".
require 'nokogiri'

# Remove every non-breaking space (U+00A0) from str.
def strip_html(str)
  nbsp = Nokogiri::HTML("&nbsp;").text
  str.gsub(nbsp, '')
end
You could also replace it with a space if you wished. May many of you find this answer!
As @sawa says, the main problem is what you see when writing to the console. It's not correctly displaying the non-breaking space after Nokogiri converts it to the appropriate binary value.
The usual way to fix the problem is to preprocess the content:
require 'nokogiri'
html = '<td>Amount 15,300 at dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html
Which outputs:
<td>Amount 15,300 at dollars</td>
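The same cleanup can also be done after parsing, on the extracted text, without involving Nokogiri. This is a sketch in plain Ruby; normalize_spaces is a made-up helper name, and the position of the non-breaking space in the sample string is an assumption, since it cannot be recovered from the snippet above:

```ruby
NBSP = "\u00A0"

# Hypothetical helper: turn non-breaking spaces into ordinary spaces.
def normalize_spaces(text)
  text.gsub(NBSP, " ")
end

# Assumed sample: a non-breaking space between "15,300" and "at".
cell_text = "Amount 15,300#{NBSP}at dollars"
puts normalize_spaces(cell_text)

# In Ruby, \s does not match U+00A0, but the POSIX class [[:space:]] does.
puts NBSP.match?(/\s/)
puts NBSP.match?(/[[:space:]]/)
```

This matters when cleaning up with strip or /\s+/: neither touches a non-breaking space, so it has to be named explicitly or matched with [[:space:]].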
