W3C unable to validate - utf-8

Sorry, I am unable to validate this document because on line 1200 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xD8" does not map to Unicode
I would be thankful to know what exactly I should do. My website is: http://dailysahara.com/

The issue, as stated by the validator, is that you have some invalid UTF-8 in your document. It appears to be in the box on the left of the site with the four tabs "Tags", "Comments", "Recents", and "Popular". It shows up to me as a black square like this: �. If you remove that, you should be able to validate your site.
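To locate bytes like the one the validator complains about, a small script can decode the page line by line and report where decoding fails. This is only a sketch: the function name `find_invalid_utf8` and the sample bytes are made up for illustration, and a real check would read the saved page from disk in binary mode.

```python
def find_invalid_utf8(data):
    """Return (line_number, byte_offset_in_line, offending_byte) for each bad line."""
    problems = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded
            problems.append((line_no, exc.start, line[exc.start]))
    return problems


# Hypothetical sample: the second line contains a stray 0xD8 byte.
sample = b"<p>fine</p>\n<p>\xd8 broken</p>\n"
problems = find_invalid_utf8(sample)
for line_no, offset, byte in problems:
    print(f"line {line_no}, offset {offset}: byte 0x{byte:02X} is not valid UTF-8")
```

Only the first bad byte per line is reported, which is usually enough to find the offending character in an editor.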


Robot Framework: validating text that's not UTF-8

I have a few tests in RF (using the Java port by Hi-Fi) with Selenium 3. The thing is, I need to validate some texts that contain non UTF-8 characters.
My usual test to validate text compares against a variable where I previously stored the text I expect to find. Something like this:
*** Variables ***
${SOME TEXT}    Some text I need to find in the web application I'm testing

*** Test Cases ***
Validate main paragraph text
    Open Home Page
    Open Browser    ${URL}    ${BROWSER}
    Element Text Should Be    ${ELEMENT LOCATOR}    ${SOME TEXT}
Problem is, the expected text I'm given contains some cp-1252 characters (e.g.: the 0x92 character that looks like an apostrophe: ’), so if I just store that in my ${SOME TEXT} variable I'll get this error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 84".
I can't have the text changed either, I need to work with it as it is.
Any clues on what would be the best approach? Thanks.
As A. Kootstra said, I just needed to change the encoding in my files. Since I'm working on Eclipse, I went to Window > Preferences > General > Workspace and changed “Text file encoding” to UTF-8. This way I'm now able to enter all the expected characters and my files are saved in the right encoding.
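For anyone wondering why the byte 0x92 triggers that error, here is a minimal Python sketch (not Robot Framework code). The sample bytes and the helper name `decode_with_fallback` are assumptions for illustration: the byte is a valid Windows-1252 character but not a valid UTF-8 sequence on its own, so a file saved as cp-1252 cannot be read as UTF-8 until it is re-saved.

```python
def decode_with_fallback(data):
    """Try UTF-8 first; fall back to Windows-1252 if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")


# Hypothetical text as it would be stored in a cp-1252 encoded file.
raw = b"It\x92s"

text = decode_with_fallback(raw)      # 0x92 becomes U+2019, the curly apostrophe
as_utf8 = text.encode("utf-8")        # what the file contains after re-saving as UTF-8
print(text, as_utf8)
```

Once the file is stored as UTF-8, the apostrophe takes three bytes (E2 80 99) instead of the single byte 0x92.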

How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
I'm not quite sure how to escape these - I'm not sure where I could use Regexs and be able to escape these odd characters.
Honestly, you are overcomplicating your validation.
I would recommend simplifying what you are trying to do. Start by asking yourself: is the part in quotes a critical part of your validation?
If it is, isolate it with a substring check such as text.include?("EMS"), where text is whatever variable holds the string pulled from the page.
If it is not, then you are probably doing too much work; check only for exactly what you need in the validation:
text.start_with?("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string as UTF-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that something along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It also seems likely that the quotes were converted to "smart quotes" somewhere. This table compares Windows-1252 to UTF-8:
Code Point             Characters          UTF-8 Bytes
Unicode  Windows-1252  Expected  Actual
-------  ------------  --------  ------    -----------
U+201C   0x93          “         “       %E2 %80 %9C
U+201D   0x94          ”         â€        %E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see: "3 Steps to Fix Encoding Problems in Ruby" and "How to Get From Theyâ€™re to They’re".
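The "Expected" versus "Actual" columns in the table above can be reproduced with a few lines of Python (used here instead of Ruby only as an illustration; the helper names are made up). The sketch assumes the damage came from UTF-8 bytes being read as Windows-1252, which matches the table but is not confirmed for the asker's page.

```python
def misread_utf8_as_cp1252(text):
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    # errors="replace" is needed because byte 0x9D is undefined in Windows-1252
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair(garbled):
    """Undo the misreading: turn the characters back into bytes, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


left = misread_utf8_as_cp1252("\u201c")    # left double quote
right = misread_utf8_as_cp1252("\u201d")   # right double quote
repaired = repair(left)

print(ascii(left), ascii(right), ascii(repaired))
```

The right-hand quote cannot be repaired this way, because its third byte has no Windows-1252 character and is lost at the moment of misreading. That is one reason to fix the encoding at the source rather than patch the string afterwards.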

Render non english characters in asciidoctor-pdf

I am trying to write documentation with asciidoctor-pdf and I need to use characters like : ă,â,î,ş,ţ. The pdf output is rendered but the mentioned characters are rendered empty. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc#example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
And the result was the word Introducţie rendered as Introduc ie and finally the error:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set a different encoding configuration in Ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter t with cedilla (ţ) it is: &#355;
Check this table for the complete list:
List of Unicode characters
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
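If you need the decimal entity reference for several characters, it is simply the character's Unicode code point written in decimal between `&#` and `;`. A small Python sketch, assuming the cedilla forms of the letters from the question (the helper name is made up):

```python
def to_decimal_entity(char):
    """Return the decimal character reference for a single character."""
    return f"&#{ord(char)};"


# The characters from the question, written as escapes to avoid any doubt
# about which code points are meant (cedilla forms are assumed for s and t).
chars = ["\u0103", "\u00e2", "\u00ee", "\u015f", "\u0163"]
entities = {c: to_decimal_entity(c) for c in chars}

for char, entity in entities.items():
    print(char, entity)
```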
One possible way to write such special characters in titles is to declare them in the preamble of your AsciiDoc document, for example,
:t-cedil: &#355;
and to reference the attribute in the main text:
== pass:normal[Test-{t-cedil}]
So your title will look like: Test-ţ

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; users can set the language via a dropdown on the site itself, the default value being German.
When the user logs in, some settings are being set depending on the language (charset, codepage and LCID). At this point I also want to point out, that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fire up Visual Studio 2010, edit the files in question and upload them to my server using Filezilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are being displayed incorrectly (something like Ã¤) - but only in the files I opened with VS2010.
I checked the charset on the site itself and also displaying it with Response.CharSet and it was ISO-8859-1, which is correct.
So I tried some converting with notepad++, but no success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database-value and b) it kind of messes things up in other languages.
You are displaying a UTF-8 encoded file with an ISO-8859-1 view. Why do you see two characters instead of one? Because in UTF-8 the German small letter 'a' with two dots (ä) is a 2-byte sequence (0xC3 and 0xA4). If this is displayed as ISO-8859-1 instead of UTF-8, where one byte is one character, you get exactly what you described: the start byte 0xC3 shown as one ISO-8859-1 character and the following byte 0xA4 shown as another. In UTF-8 this 2-byte sequence must be decoded by extracting the payload bits of the start byte and the following byte, like this:
Startbyte: 11000011
Following: 10100100
The leading 110 of the start byte is stripped off, leaving 00011 (that is, 11).
The leading 10 of the following byte is stripped off, leaving 100100.
Chained together this becomes 11100100, which is decimal 228 (U+00E4), the Unicode code point of the German character 'a with two dots'.
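The bit arithmetic above can be checked with a short Python sketch. It is only an illustration of the 2-byte case; the variable names are made up, and a real decoder also handles 1-, 3- and 4-byte sequences.

```python
start, following = 0xC3, 0xA4   # the two bytes of the UTF-8 sequence

# A 2-byte sequence starts with the bits 110, a continuation byte with 10.
assert start >> 5 == 0b110
assert following >> 6 == 0b10

# Keep the 5 payload bits of the start byte and the 6 of the following byte.
code_point = ((start & 0b00011111) << 6) | (following & 0b00111111)

decoded_as_utf8 = bytes([start, following]).decode("utf-8")
decoded_as_latin1 = bytes([start, following]).decode("iso-8859-1")

print(code_point, decoded_as_utf8, decoded_as_latin1)
```

Decoding the same two bytes as ISO-8859-1 yields two characters, which is the doubled output described in the question.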
I recommend leaving the encoding as it is, UTF-8. It is only the encoding setting of your viewer/editor that needs to change, so that UTF-8 encoded files are displayed as UTF-8 and not as ISO-8859-1. In other words, configure the viewer's/editor's encoding according to the encoding of the file's content (which in your case is UTF-8 and NOT ISO-8859-1).
To convert your files or check them for a certain encoding, just use madedit. madedit has a built-in hex-editor which wraps a rectangle around utf-8 sequences, displaying just one character on the right side (the encoded codepoint). It's easy to identify single-byte characters and/or 2/3/4-byte sequences within utf-8 encoded files. It also wraps a rectangle around the 3-byte utf-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are all consistent with what you want to output.
Also note that Notepad++ has both an "Encode as..." and a "Convert to..." option.
The first one reads the file using the specified encoding (the bytes stay the same); the second reads the file and writes it back in the selected encoding (changing the file).
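The difference between the two options can be shown in Python. This is an analogy only, not a description of how Notepad++ works internally: reinterpreting leaves the bytes alone and changes only how they are read, while converting produces different bytes.

```python
data = "\u00e4".encode("utf-8")          # the bytes C3 A4, i.e. the letter a-umlaut in UTF-8

# "Encode as": same bytes, read with a different encoding (nothing is rewritten).
viewed_as_latin1 = data.decode("iso-8859-1")

# "Convert to": decode with the correct encoding, then write new bytes.
converted = data.decode("utf-8").encode("iso-8859-1")

print(ascii(viewed_as_latin1), converted)
```

Choosing the wrong one is a common way to make things worse: converting a file that is merely being viewed with the wrong encoding bakes the garbled characters into the file.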

RSS reader Error : Input is not proper UTF-8 when use simplexml_load_file()

I'm using simplexml_load_file method for parsing feed from external source.
My code looks like this:
$rssFeed['DAILYSTAR'] = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($url);
The output is as follows :
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
It ultimately stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.
Is it possible to read this feed using this method (the SimpleXML API)?
If not, is any other method available?
I've searched through Google but found no answer. Every method I tried returns this error.
Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
Now it says it's ISO-8859-1, but \x92 is not a printable character in that character set; it is instead the closing curly single quote, used as an apostrophe, in Windows-1252. So the page throws an encoding error, and as per the XML spec, clients should be "strict" and not fix errors.
You can retrieve it and filter out the non-ISO-8859-1 characters in some fashion, or better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8, though this is English, which might not even require any different character encodings, if all turns out to be ASCII after all.
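The "convert first, then parse" approach can be sketched in Python with the standard library. The feed bytes below are a shortened, made-up sample modelled on the output above, and `parse_feed` with its Windows-1252 fallback is an assumption for illustration; in PHP the conversion step would be done by mb_convert_encoding().

```python
import xml.etree.ElementTree as ET


def parse_feed(raw_bytes, fallback_encoding="cp1252"):
    """Parse XML strictly; if that fails, decode with a fallback encoding and retry."""
    try:
        return ET.fromstring(raw_bytes)
    except ET.ParseError:
        # The bytes are not valid UTF-8, so decode them explicitly first.
        text = raw_bytes.decode(fallback_encoding, errors="replace")
        return ET.fromstring(text)


# Hypothetical sample with the same stray 0x92 byte as in the feed.
raw = (
    b'<rss version="2.0" encoding="ISO-8859-1"><channel><item>'
    b"<description>The results of this year\x92s Higher Secondary Certificate</description>"
    b"</item></channel></rss>"
)

root = parse_feed(raw)
description = root.find("./channel/item/description").text
print(description)
```

Decoding as Windows-1252 rather than ISO-8859-1 is deliberate here: it turns 0x92 into a real apostrophe character instead of an invisible control character.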
We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
