How to make Zend Form element attributes encode HTML elements using correct encoding? - utf-8

When a Zend\Form\Element\Select option contains HTML entities, how do I get it encoded correctly?
Try 1:
I pass in 90&deg; and see it unconverted (the literal text 90&deg;) in my HTML select box, instead of the expected degree symbol (°)
Try 2:
I use ° directly in my label name, I see this: 90�
Zend Code
Chasing the Zend Form code appears to yield these lines:
https://github.com/zendframework/zend-form/blob/master/src/View/Helper/AbstractHelper.php#L248,
where $escape is the $this->getEscapeHtmlHelper() method.
and actual conversion happens here:
https://github.com/zendframework/zend-escaper/blob/master/src/Escaper.php#L369

Re-saving the source file that contained the degree symbol directly (°) using UTF-8 encoding seems to have done the job.
(Before it was encoded using ANSI)
Incidentally that degree symbol now shows up like this in my IDE: °
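To illustrate why re-saving the file helped, here is a minimal Python sketch (not Zend code; the label text is just an example) of what happens to the degree sign when the file encoding and the declared page encoding disagree. It assumes "ANSI" means Windows-1252, which is the usual meaning on Western European Windows systems.

```python
# Example label text; "\u00b0" is the degree sign.
label = "90\u00b0"

# The same text saved in the two encodings discussed above.
ansi_bytes = label.encode("cp1252")   # degree sign is the single byte 0xB0
utf8_bytes = label.encode("utf-8")    # degree sign is the two bytes 0xC2 0xB0

# ANSI bytes interpreted as UTF-8: a lone 0xB0 is invalid, so a decoder
# substitutes the replacement character U+FFFD.
ansi_read_as_utf8 = ansi_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes interpreted as Windows-1252: each byte becomes its own character.
utf8_read_as_ansi = utf8_bytes.decode("cp1252")

print(ascii(ansi_read_as_utf8))  # '90\ufffd'
print(ascii(utf8_read_as_ansi))  # '90\xc2\xb0'
```

The first result matches the broken output in Try 2. The second shows how a tool that still assumes Windows-1252 would display a correctly saved UTF-8 file with an extra character in front of the degree sign.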

Related

Symbol # in variable cannot be handled

I got a CSV file from my front-end as a XString and after I convert it into String it looks as follows:
In the next step I'm trying to perform SPLIT lv_string AT '##' INTO TABLE itab so I can get my data but it doesn't split anything, itab contains one line equal to lv_string.
If I try REPLACE '#' IN lv_string WITH space, lv_string doesn't change and sy-subrc is 4.
From my point of view I have this problem because the symbol # is used by SAP in this context as a symbol for non-printable symbols (that result from the conversion byte->string).
My question is: how may I use SPLIT/REPLACE with # in this case?
I also thought that I can change the SAP code page when converting XString to String but I already use the SAP code page 4110 (utf-8) and don't know a better alternative...
When you display a variable with the debugger, it displays the generic character # (U+0023) for all control characters which are not assigned a glyph ("non-printable symbols" as you say).
If the variable corresponds to the contents of a text file and ## occurs frequently, there is a good chance that it is the combination of the control characters U+000D and U+000A, which together form the "newline" in Windows files.
In the backend debugger, you can check the hexadecimal values of those characters by clicking the button "Hexadezimal" (shown in your screenshot).
You may use the variable CL_ABAP_CHAR_UTILITIES=>CR_LF which contains those two control characters.
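The same effect can be reproduced outside SAP. The following Python sketch is only an illustration of the idea, not ABAP, and the CSV content is made up: splitting on a literal "##" finds nothing because the data contains carriage return and line feed, not hash characters.

```python
# Made-up CSV payload with Windows line endings (CR LF = 0x0D 0x0A).
raw = b"id;name\r\n1;Anna\r\n2;Ben"
text = raw.decode("utf-8")

# A debugger may show the two control characters as "##", but the string
# contains no '#', so splitting on it returns the whole string unchanged.
split_on_hashes = text.split("##")

# Splitting on the real control characters gives one entry per line.
rows = text.split("\r\n")

# A hexadecimal view makes the control characters visible.
hex_view = " ".join(f"{b:02X}" for b in raw[:9])

print(split_on_hashes == [text])  # True
print(rows)                       # ['id;name', '1;Anna', '2;Ben']
print(hex_view)                   # 69 64 3B 6E 61 6D 65 0D 0A
```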

Render non english characters in asciidoctor-pdf

I am trying to write documentation with asciidoctor-pdf and I need to use characters like ă, â, î, ş, ţ. The PDF output is rendered, but these characters come out empty. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc@example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
And the result was the word Introducţie rendered as Introduc ie and finally the error:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set a different encoding configuration in Ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter T with cedilla it is: &#355;
Check this table for the complete list:
List of Unicode characters
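If you would rather compute the references than look them up, the decimal form is just the character's Unicode code point. A small Python sketch (the helper name decimal_reference is made up for this example):

```python
import html


def decimal_reference(ch: str) -> str:
    """Return the decimal character reference for a single character."""
    return f"&#{ord(ch)};"


# The characters from the question, written as escapes: ă â î ş ţ
chars = "\u0103\u00e2\u00ee\u015f\u0163"
refs = [decimal_reference(ch) for ch in chars]

print(refs)  # ['&#259;', '&#226;', '&#238;', '&#351;', '&#355;']

# Decoding each reference gives back the original character.
round_trip = "".join(html.unescape(ref) for ref in refs)
print(round_trip == chars)  # True
```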
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
One possible way to write such special characters in titles is to declare them in the preamble of your AsciiDoc document, for example:
:t-cedil: ţ
and to call it in the main text
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ

Ruby 2: Recognizing decomposed utf8 in XML entities (NFD)

Problem
Problem is simple: I have XML containing this value
Mu¨ller
This appears to be valid XML format for representing a u with an umlaut, like this.
Müller
But all the parsers we have tried so far result in u¨ -- two distinct characters.
Background
This form of Unicode uses two code points to represent a single character and is called Normalization Form D (NFD); for u followed by U+0308 the UTF-8 bytes are 75 CC 88.
Most characters can also be represented as a single code point and entity, including this case. The XML could also have included &#252; or &#xFC; or &uuml;, which in UTF-8 is the byte pair C3 BC (\303\274 in octal, 195 188 in decimal). This is called Normalization Form C (NFC). Any of these would work fine.
Getting Right to the Question
So I think the question is one of:
Is there a parser (doesn't seem to be nokogiri) that can detect and normalize to our preferred form?
Is there a reasonable way for us to reliably detect entities in the NFD form and convert them to the NFC form (or is there something that will do that out there?)
Thanks!
The character you’re using, U+00A8 (DIAERESIS) isn’t a combining character – it is distinct from U+0308 (COMBINING DIAERESIS). (I’ve only just discovered this myself – I don’t know what the use for the non-combining diaeresis is).
It looks like in this case this behaviour is correct and your XML is wrong (it should be using a reference to U+0308, such as &#x308;, and not one to U+00A8).
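The difference between the two characters is easy to check. The sketch below uses Python's standard unicodedata module only to illustrate the point; it is not the Ruby or Nokogiri code from the question. Normalizing to NFC composes u followed by U+0308 into a single ü, but leaves u followed by U+00A8 as two characters.

```python
import unicodedata

with_combining = "Mu\u0308ller"  # u + COMBINING DIAERESIS (true NFD form)
with_spacing = "Mu\u00a8ller"    # u + DIAERESIS (the spacing character)

print(unicodedata.name("\u0308"))  # COMBINING DIAERESIS
print(unicodedata.name("\u00a8"))  # DIAERESIS

# NFC composes the combining sequence into the single code point U+00FC.
composed = unicodedata.normalize("NFC", with_combining)
print(composed == "M\u00fcller")  # True
print(len(composed))              # 6

# The spacing diaeresis has no canonical composition, so NFC changes nothing.
unchanged = unicodedata.normalize("NFC", with_spacing)
print(unchanged == with_spacing)  # True
print(len(unchanged))             # 7
```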

How Do Validators Differentiate Between '&' and '&amp'?

Knowing that &amp; is the HTML entity value of & - how do validators like w3c know this? Even when I look at my source code it's already been parsed into the correct value.
Your question is based on a false premise -- as Co_42 noted, &amp; is not the "ASCII value" of '&'. It's an HTML character reference representing the character '&'. The ASCII value of '&' is 38 (or 0x26).
Your source code almost certainly consists of ASCII or Unicode text files. Those don't use HTML entities. If you have a string with an ampersand stored in the source code, it'll probably be stored with a bare "&". If there's a string literal somewhere containing actual HTML data, it may contain "&amp;".
When you use some sort of tool or function to convert strings to text ready to put into an HTML or XML document, any "&" will be (should be!) converted into "&amp;".
When a program that reads HTML documents encounters an ASCII "&", it can assume that that's the beginning of an HTML character reference. This is okay because all ampersands in the actual text should have been converted into "&amp;".
As a somewhat perverse example, if you open your source code in a word processor and save it as an HTML document, you'll find that in the actual file, "&" has been converted into "&amp;" (and "&amp;" has been converted into "&amp;amp;"). If you then open that document in a browser, you'll find that the ampersands are displayed the same way they are when you view your source code in a text editor. The encoding step that happened when you saved the HTML document corresponds to the decoding step that happens when the browser displays it.
If you put something like "Fish & chips" directly into an actual HTML document, your HTML document will be invalid. Complicating the matter is the fact that programs such as browsers tend to try to recover from errors in documents and display them anyway. As such, your browser may still display "Fish & chips" on the screen when you open your invalid document. However, a program such as the W3C validator, which is specifically meant to discover errors in HTML documents, will notify you that your document is invalid.
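The encode and decode steps described above can be tried out with any HTML escaping function. As a small illustration, here is a sketch using Python's standard html module (chosen only for the example; the answer itself does not depend on any particular tool):

```python
import html

text = "Fish & chips"

# Encoding step: what a tool does when writing text into an HTML document.
encoded = html.escape(text)
print(encoded)  # Fish &amp; chips

# Text that already looks like a reference is escaped again, so it survives
# the round trip as literal text.
encoded_twice = html.escape("&amp;")
print(encoded_twice)  # &amp;amp;

# Decoding step: what a browser does when displaying the document.
print(html.unescape(encoded))        # Fish & chips
print(html.unescape(encoded_twice))  # &amp;
```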

C# MVC3 and non-latin characters

I have database results containing characters such as áéíóúàâêô..., and when I display any of these characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use a rich text editor such as TinyMCE, all non-Latin characters come out like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery you can take advantage of the encoding functions built-in into it. For example:
$('<div/>').html('& #225;gil').html()
gives you "ágil" (notice that I added an extra space between the & and the # so that stackoverflow does not encode it, you won't need it)
This other question has more information about this.
HTML-encoding lost when attribute read from input field

Resources