Robot Framework: validating text that's not UTF-8

I have a few tests in RF (using the Java port by Hi-Fi) with Selenium 3. The thing is, I need to validate some texts that contain non-UTF-8 characters.
My usual test to validate text compares against a variable where I previously stored the text I expect to find. Something like this:
*** Variables ***
${SOME TEXT}    Some text I need to find in the web application I'm testing

*** Test Cases ***
Validate main paragraph text
    Open Home Page
    Open Browser    ${URL}    ${BROWSER}
    Element Text Should Be    ${ELEMENT LOCATOR}    ${SOME TEXT}
Problem is, the expected text I'm given contains some cp-1252 characters (e.g.: the 0x92 character that looks like an apostrophe: ’), so if I just store that in my ${SOME TEXT} variable I'll get this error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 84".
I can't have the text changed either, I need to work with it as it is.
Any clues on what would be the best approach? Thanks.

As A. Kootstra said, I just needed to change the encoding of my files. Since I'm working in Eclipse, I went to Window > Preferences > General > Workspace and changed “Text file encoding” to UTF-8. This way I'm now able to enter all the expected characters and my files are saved in the right encoding.
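To see why the original file failed, here is a minimal Python sketch (not part of the Robot Framework setup above; the sample bytes are made up for illustration). It shows that the byte 0x92 is not valid UTF-8 on its own, that it decodes to the right single quotation mark under cp1252, and what the same text looks like once it is saved as UTF-8.

```python
# Hypothetical sample: the word "It's" as a cp1252 editor would save it,
# with the curly apostrophe stored as the single byte 0x92.
raw = b"It\x92s"

error = None
try:
    raw.decode("utf-8")          # what a UTF-8 reader attempts
except UnicodeDecodeError as exc:
    error = str(exc)             # mentions "byte 0x92"

# Decoding with the encoding the bytes were really written in works.
text = raw.decode("cp1252")

# Saving the file as UTF-8 stores the same character as three bytes.
utf8_bytes = text.encode("utf-8")

print(error)
print(repr(text), utf8_bytes)
```

Changing the editor's file encoding to UTF-8 has the same effect as the last step: the character stays the same, only the bytes on disk change.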


Render non english characters in asciidoctor-pdf

I am trying to write documentation with asciidoctor-pdf and I need to use characters like ă, â, î, ş, ţ. The PDF output is generated, but these characters come out blank. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc@example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
The result was the word Introducţie rendered as "Introduc ie", followed by this warning:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set a different encoding configuration in Ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter t with cedilla it is: &#355;
Check this table for the complete list:
List of Unicode characters
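If you would rather compute the decimal reference than look it up, a small Python sketch like the following works (the function name is made up for illustration; this is not part of asciidoctor-pdf):

```python
def to_decimal_entity(char):
    """Return the decimal character reference for a single character."""
    # ord() gives the Unicode code point as an integer.
    return "&#{};".format(ord(char))


for ch in "ăâîşţ":
    print(ch, to_decimal_entity(ch))
```

For ţ (U+0163) this gives &#355;, matching the value above.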
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
One possible way to write such special characters in titles is to declare them in the preamble of your AsciiDoc document, for example,
:t-cedil: &#355;
and to reference it in the main text
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ

W3C unable to validate

Sorry, I am unable to validate this document because on line 1200 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xD8" does not map to Unicode
I would be thankful to know what exactly I should do. My website is: http://dailysahara.com/
The issue, as stated by the validator, is that you have some invalid UTF-8 in your document. It appears to be in the box on the left of the site with the four tabs "Tags", "Comments", "Recents", and "Popular". It shows up to me as a black square like this: �. If you remove that, you should be able to validate your site.
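To locate such bytes yourself instead of hunting for them visually, a rough Python sketch along these lines can report the line number and offset of every byte that is not valid UTF-8. The function name and the sample data are made up; for a real page you would read the saved HTML file in binary mode.

```python
def find_invalid_utf8(data):
    """Return (line_number, byte_offset, byte_value) for each invalid byte."""
    problems = []
    offset = 0
    while offset < len(data):
        try:
            data[offset:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as exc:
            bad = offset + exc.start              # absolute offset of the bad byte
            line = data.count(b"\n", 0, bad) + 1  # 1-based line number
            problems.append((line, bad, data[bad]))
            offset = bad + 1                      # resume after the bad byte
    return problems


sample = b"fine\nbad \xd8 byte\nfine"
for line, pos, value in find_invalid_utf8(sample):
    print("line {}, offset {}: byte 0x{:02X}".format(line, pos, value))
```

This re-slices the data after every error, so it is only meant for documents of ordinary size.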

How Do Validators Differentiate Between '&' and '&amp'?

Knowing that &amp; is the HTML entity value of & - how do validators like the W3C's know this? Even when I look at my source code it's already been parsed into the correct value.
Your question is based on a false premise -- as Co_42 noted, &amp; is not the "ASCII value" of '&'. It's an HTML character reference representing the character '&'. The ASCII value of '&' is 38 (or 0x26).
Your source code almost certainly consists of ASCII or Unicode text files. Those don't use HTML entities. If you have a string with an ampersand stored in the source code, it'll probably be stored with a bare "&". If there's a string literal somewhere containing actual HTML data, it may contain "&amp;".
When you use some sort of tool or function to convert strings to text ready to put into an HTML or XML document, any "&" will be (should be!) converted into "&amp;".
When a program that reads HTML documents encounters an ASCII "&", it can assume that that's the beginning of an HTML character reference. This is okay because all ampersands in the actual text should have been converted into "&amp;".
As a somewhat perverse example, if you open your source code in a word processor and save it as an HTML document, you'll find that in the actual file, "&" has been converted into "&amp;" (and "&amp;" has been converted into "&amp;amp;"). If you then open that document in a browser, you'll find that the ampersands are displayed the same way they are when you view your source code in a text editor. The encoding step that happened when you saved the HTML document corresponds to the decoding step that happens when the browser displays it.
If you put something like "Fish & chips" directly into an actual HTML document, your HTML document will be invalid. Complicating the matter is the fact that programs such as browsers tend to try to recover from errors in documents and display them anyway. As such, your browser may still display "Fish & chips" on the screen when you open your invalid document. However, a program such as the W3C validator, which is specifically meant to discover errors in HTML documents, will notify you that your document is invalid.
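The encode and decode steps described above can be tried out with Python's standard html module. This is only an illustration of the round trip, not what any particular validator or browser does internally:

```python
import html

plain = "Fish & chips"

# Encoding step: what a tool does before writing text into an HTML file.
encoded = html.escape(plain)

# Escaping text that already contains a reference escapes its ampersand again.
double_encoded = html.escape("&amp;")

# Decoding step: what a reader of the HTML file does to get the text back.
decoded = html.unescape(encoded)

print(encoded)
print(double_encoded)
print(decoded)
```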

Extended charset characters not recognized and converted to a ? mark

I have a string containing a special character such as "\u2012", i.e. FIGURE DASH. When I try to print it to the console I get a '?' mark instead of the symbol. I have an editor in which I can insert the symbol using Alt+numpad, e.g. Alt+2012. In the editor I can see the symbol, but when I save it in an XML file and read the value back using nodevalue, I get a '?' mark.
To summarize, I am having trouble reading characters from the extended Latin A charset. What I need is that when I insert such symbols and read them back, I get something like &#xXXXX;.
Please help!
TIA :)
Simply put, I have a String inpath = "À"; and I want to get its Unicode value, like &#xXXXX;
The default console encoding in Windows is some MS-DOS code page and they don't support the character. You can try running chcp 65001 before running the program but you might also need to change the console font as well.
You don't need to do anything you wouldn't do with any other character, as long as you use UTF-8. You aren't doing that in many places. You need to explicitly write in your code to save and read the file in UTF-8, and not rely on the platform default encoding.
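As a rough illustration of both points, here is a Python sketch (the question appears to be about Java, so treat this as the idea rather than the exact API; the function names and the temporary file are made up). It writes and reads a file with an explicit UTF-8 encoding instead of the platform default, and converts non-ASCII characters to the &#xXXXX; form the question asks for.

```python
import os
import tempfile


def to_hex_entities(text):
    """Replace every non-ASCII character with a hexadecimal reference."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:04X};".format(ord(ch))
        for ch in text
    )


def round_trip_utf8(text):
    """Write text to a temporary file and read it back, always as UTF-8."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        # The explicit encoding is the important part: never rely on the default.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    finally:
        os.remove(path)


value = round_trip_utf8("\u2012")
print(to_hex_entities(value))
print(to_hex_entities("\u00c0"))
```

Printing the entity form also sidesteps the console problem, because the output is plain ASCII.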

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an excel document into a .txt file. There are a few characters that I assume mean something to excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors.). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could maybe work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters, which means your '?'s and rectangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend first exporting the data to CSV using (Open|Libre)Office and choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi byte sequences I find this regex to be handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
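For comparison, here is a rough Python equivalent of that substitution, plus a gentler variant that only drops bytes that are not valid UTF-8. Both function names are made up for illustration. Note that the regex is very aggressive: it also replaces spaces and punctuation, since they are not in the allowed set.

```python
import re


def mask_unusual(line):
    """Replace everything except ASCII letters, digits, '-' and '_' with '*'."""
    return re.sub(r"[^0-9a-zA-Z\-_]", "*", line)


def drop_invalid_bytes(data):
    """Decode bytes as UTF-8, silently discarding bytes that are not valid."""
    return data.decode("utf-8", errors="ignore")


print(mask_unusual("ab\ufffdc d"))
print(drop_invalid_bytes(b"abc\x92def"))
```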
