Why are my search results not in the same charset as my page encoding?

I am using UTF-8 encoding for an html page.
<head>
<meta charset="utf-8">
In the debugger console, document.characterSet returns "UTF-8".
On the page, I have metadata (keywords, description, title) with a valid UTF-8 character: '®', which is UTF-8: 'c2ae'
The character displays correctly in the view source, and on the page title.
But Google and Bing search results show it as 'Â®'. That is, during the web crawl it appears to be converted to ISO-8859-1 or Windows-1252, so both bytes, 'c2' and 'ae', are displayed as separate characters.
If I replace the character with the HTML entity &reg; (\u00ae), it shows correctly.
Short of converting my meta data to ISO-8859-1, is there a best practice I should be using for this?

The issue was on the back end: the data was not being transcoded to UTF-8 properly when read from cache. So I feel the best practice is to use the native UTF-8 BMP character with the proper page encoding, rather than being required to use HTML entity values.
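The symptom described in the question can be reproduced in a few lines. This is a minimal Python sketch, not the crawler's or the back end's actual code; the helper names `misdecode` and `repair` are made up for illustration. It shows how the two UTF-8 bytes of the registered-trademark sign become two characters when read as ISO-8859-1, and how the text can be recovered as long as no bytes were lost along the way.

```python
def misdecode(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo misdecode(): recover the original bytes, then decode as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


registered = "\u00ae"  # the registered-trademark sign

print(registered.encode("utf-8").hex())             # c2ae
print(ascii(misdecode(registered)))                 # '\xc2\xae' (two characters)
print(repair(misdecode(registered)) == registered)  # True
```

If a cache layer stores UTF-8 bytes but hands them back as ISO-8859-1 text, the page would contain the two-character form before any crawler sees it.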

Look at the page's meta tags and confirm that it is not using this:
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
For HTML5 Google recommends:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
Also note this:
Note:
<meta charset="">
Another Note:
Some characters are reserved in HTML. "Html Entities"
These reserved characters in HTML must be replaced with character entities.
e.g.
& ampersand &amp; &#38;
® registered trademark &reg; &#174;
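To see how named and numeric character references relate to the characters themselves, here is a small sketch using Python's standard `html` module (chosen only for illustration; the helper `same_character` is a made-up name). Note that `html.escape` only touches the markup-significant characters such as `&`, `<`, `>` and quotes; it leaves the registered-trademark sign alone, which fits the earlier point that a correctly declared UTF-8 page can carry that character directly.

```python
import html


def same_character(named: str, numeric: str) -> bool:
    """Check that two character references decode to the same character."""
    return html.unescape(named) == html.unescape(numeric)


# Markup-significant characters are replaced with references.
print(html.escape('AT&T <b>"quoted"</b>'))

# The named and numeric forms are two spellings of one character.
print(same_character("&amp;", "&#38;"))   # True
print(same_character("&reg;", "&#174;"))  # True

# The registered-trademark sign itself is not escaped.
print(html.escape("\u00ae") == "\u00ae")  # True
```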

Related

UTF-8 Special Characters not working in header file

I am facing a strange issue with wkhtmltopdf. In the footer and the content, all special characters are shown as expected, but in the header file they either don't show up or get replaced by blocks with a question mark. All three files are built the same way:
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
</body>
</html>
Is there some kind of trick to tell the header to use UTF-8 too?
Btw, I am already telling wkhtmltopdf to use utf-8 not only as meta but also in the script call:
--encoding utf-8
EDIT: I am using an HTML header (as you can see in the code I posted). All three HTML files are built the same way. But while it works in the content, the header doesn't like my special chars. Maybe the problem is that the content in the header comes from a $_POST variable, while the text in the content is built from the db?

wkhtmltopdf prints "μ" incorrectly

wkhtmltopdf converts μ to Î¼ when converting the HTML document to PDF. The HTML document itself renders μ perfectly.
charset meta tag is set to utf-8 in the HTML document.
<meta charset="utf-8">
Setting the encoding to UTF-8 on the server side fixed the issue: encoding: "UTF-8".

How to escape to htmlentities except for html tags in smarty

Example:
$smarty->assign('string', '<p>Germans use "Ümlauts" and pay in €uro</p>');
{$string|escape|unescape:"html"}
results in:
<p>Germans use 'Ümlauts' and pay in €uro</p>
What am I doing wrong...
You should also add UTF-8 to escape function as in documentation: http://www.smarty.net/docsv2/en/language.modifier.escape
There is more than one reason why this can occur.
Check the encoding of
your php files,
your template files and
your html output (doctype and meta tags),
usually it is one of those which provokes this.
To avoid this kind of issue, in many cases the best way is to use utf8 throughout your project, which means converting smarty templates and php to utf8 and use proper utf8 tags in your html header.
HTML 4.01:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
HTML5:
<meta charset="UTF-8">
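The "check the encoding of your files" step can be partly automated. The following is a rough Python sketch under stated assumptions: the function names and the suffix list (`.php`, `.tpl`, `.html`) are hypothetical and should be adapted to the project. It only tests whether a file's bytes are valid UTF-8; pure ASCII files pass too, and it cannot tell which legacy encoding a failing file actually uses.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(root: str, suffixes=(".php", ".tpl", ".html")) -> list:
    """List files under root (assumed project layout) that are not valid UTF-8."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file()
        and path.suffix in suffixes
        and not is_valid_utf8(path.read_bytes())
    )


# The same text saved in two encodings: only one is valid UTF-8.
print(is_valid_utf8("\u00dcmlauts".encode("utf-8")))       # True
print(is_valid_utf8("\u00dcmlauts".encode("iso-8859-1")))  # False
```

Calling `find_non_utf8` on the project directory would give a list of candidates to re-save as UTF-8.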

UTF8 -- still showing weird characters?

My webpage's response headers show this:
Content-Type:text/html; charset=UTF-8
However, I still get a black diamond with white question mark for characters like é. What am I supposed to do exactly? It's my .htaccess that's setting UTF-8.
If it's a script or HTML file, check the encoding of the file itself, which should be saved as UTF-8.
In Zend, it's something like: Edit->Set encoding->Other: UTF-8.
If you are serving an HTML page, you need to indicate in the HTML file that the content is UTF-8.
You can do this by adding a meta html tag to your header section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
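The black diamond with a question mark is the Unicode replacement character (U+FFFD), which a decoder emits when bytes are not valid in the declared encoding. The sketch below is an illustration in Python, not browser code; `decode_as_utf8` is a made-up helper that mimics a client told `charset=UTF-8`. It shows why a file saved as ISO-8859-1 but served as UTF-8 produces that symbol for é.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode as a client told 'charset=UTF-8' would: bad bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


e_acute = "\u00e9"  # the character 'e with acute accent'

saved_as_latin1 = e_acute.encode("iso-8859-1")  # one byte: 0xE9
saved_as_utf8 = e_acute.encode("utf-8")         # two bytes: 0xC3 0xA9

print(ascii(decode_as_utf8(saved_as_latin1)))  # '\ufffd' (replacement character)
print(ascii(decode_as_utf8(saved_as_utf8)))    # '\xe9'   (the intended character)
```

So the header alone is not enough: the bytes on disk have to match what the header claims.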

Firefox and UTF-16 encoding

I'm building a website with the encoding UTF-16. It means that every files (html,jsp) is encoded in UTF-18 and I set in the head of every HTML page :
<meta http-equiv="content-type" content="text/html; charset=UTF-16">
My index page is correctly displayed by Chrome and IE. However, Firefox doesn't render the index. It displays 2 strange characters and the full index page code:
��<!DOCTYPE html> <html> <head> <meta http-equiv="content-type" content="text/html; charset=UTF-16"> ...
Do you know the reason? It should be a problem of encoding, but I don't know where it's located...
Thanks
(Disclosure: I’m the developer responsible for the relevant code in Firefox.)
I'm building a website with the encoding UTF-16.
Please don’t. The short rules are:
1. Never use UTF-16 for interchange.
2. Always use UTF-8 for interchange.
3. If you break rules 1 & 2 and still use UTF-16, at least use the BOM (the right one).
But seriously, don’t break rules 1 and 2.
If you include user-provided content on your pages, using UTF-16 means that your site is vulnerable to socially engineered XSS at least in older browsers. Try this demo in an old version of Firefox (version 20 or older) or in a Presto-based version of Opera.
To avoid the vulnerability, use UTF-8.
It means that every files (html,jsp) is encoded in UTF-18
Uh oh. :-)
and I set in the head of every HTML page :
<meta http-equiv="content-type" content="text/html; charset=UTF-16">
A meta tag works as an internal encoding declaration only when the encoding being used maps the bytes of the meta tag to the same bytes ASCII would. That’s not the case for UTF-16.
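A quick way to see this is to look at the bytes. The Python sketch below is only an illustration (the helper name `meta_is_ascii_visible` is made up, and real browsers' prescan logic is more involved): it checks whether the ASCII byte sequence of the meta tag appears unchanged in the encoded document.

```python
META = '<meta charset="UTF-16">'
ASCII_BYTES = META.encode("ascii")


def meta_is_ascii_visible(encoding: str) -> bool:
    """True if the tag's ASCII bytes appear unchanged in the given encoding."""
    return ASCII_BYTES in META.encode(encoding)


print(meta_is_ascii_visible("utf-8"))       # True
print(meta_is_ascii_visible("iso-8859-1"))  # True
print(meta_is_ascii_visible("utf-16-le"))   # False

# In UTF-16 every ASCII character is padded with a zero byte.
print(META.encode("utf-16-le")[:10])        # b'<\x00m\x00e\x00t\x00a\x00'
```

A parser scanning for the ASCII bytes of `<meta` will not find them in a UTF-16 file, so the declaration cannot announce its own encoding.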
Do you know the reason?
Not without full response headers and the original response body in a hex editor. The general solution, as noted above, is to always use UTF-8 and never use UTF-16 over HTTP.
If your content is in a language for which UTF-16 is more compact than UTF-8, two things:
All the HTML, JS and CSS on the page is more compact in UTF-8.
gzip makes the difference go away.
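Both points can be checked with a rough measurement. This Python sketch uses a made-up sample page (a short Japanese phrase wrapped in ASCII markup and repeated 200 times), so the numbers it prints are illustrative only; real pages will differ.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Map each encoding to (raw size, gzip-compressed size) in bytes."""
    sizes = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        sizes[encoding] = (len(raw), len(gzip.compress(raw)))
    return sizes


# Bare CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
phrase = "\u3053\u3093\u306b\u3061\u306f\u4e16\u754c"
print(len(phrase.encode("utf-8")), len(phrase.encode("utf-16-le")))  # 21 14

# Once ASCII markup surrounds it, UTF-8 is the smaller encoding here.
page = ('<p class="x">' + phrase + "</p>\n") * 200
for encoding, (raw, compressed) in encoded_sizes(page).items():
    print(encoding, raw, compressed)
```

On this sample the markup outweighs the text, and after compression both encodings shrink to a small fraction of their raw size.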
Check that the server sends a Content-Type header with the correct encoding.
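As a sketch of that check, the charset can be pulled out of a Content-Type value with Python's standard `email.message.Message` class, which parses MIME-style header parameters. The function name `charset_from_content_type` is made up for this example, and fetching the header from a live server is left out.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lowercased charset parameter, or None if there is none."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))      # utf-8
print(charset_from_content_type("text/html;charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                     # None
```

If the value is missing or names a different encoding than the one the files are saved in, that mismatch is the first thing to fix.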
