HtmlUnit displaying wrong characters

I'm using HtmlUnit. I can access the pages, but special (Maltese) characters are displayed incorrectly. For example, ġuvni is displayed as ?uvni:
HtmlPage page = submit_button.click();
System.out.println(page.asText());
I suspect it's an encoding problem, but I can't find a page.setPageEncoding or similar method. Has anyone had this problem before?
Thanks!

Make sure your page is in UTF-8 by putting this meta tag in your <head>:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
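If the page is already served as UTF-8, the '?' may come from the console rather than from HtmlUnit. This is an assumption that the question does not confirm. System.out encodes text with the platform default charset, and a character that charset cannot represent is replaced with '?'. The sketch below uses only the Java standard library; the class and method names (ConsoleEncodingDemo, roundTrip) are made up for illustration.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {
    // Encodes text to bytes and decodes it again with the same charset,
    // which is roughly what happens when a console stream writes the text.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "ġuvni", written with an escape so the source file encoding does not matter.
        String word = "\u0121uvni";

        // ISO-8859-1 has no mapping for U+0121, so the encoder substitutes '?'.
        System.out.println(roundTrip(word, StandardCharsets.ISO_8859_1)); // ?uvni

        // UTF-8 can represent the character, so the text survives unchanged.
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word)); // true

        // Wrapping System.out forces UTF-8 output; the terminal must also display UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(word);
    }
}
```

If the UTF-8 stream prints the word correctly in your terminal, the page text was fine and only the console output was lossy.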

Related

Hidden characters appear out of nowhere

I have HTML code that keeps coming up with hidden characters. At first, it was A characters in all the extra spaces, which I removed. I then tried <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />, which seemed to fix the A character issue.
However, now the code comes up with hidden ?????. I have rebuilt the code, but once it's put through our system it comes up with hidden ????? again. How do I fix this? Or could it just be that our system is messing it up?

Why are my search results not in the same charset as my page encoding?

I am using UTF-8 encoding for an html page.
<head>
<meta charset="utf-8">
In the debugger console, document.characterSet returns "UTF-8".
On the page, I have metadata (keywords, description, title) containing a valid character, '®', which is encoded in UTF-8 as the two bytes c2 ae.
The character displays correctly in the view source, and on the page title.
But Google and Bing search results show it as 'Â®'. That is, during the web crawl it appears to be decoded as ISO-8859-1 or Windows-1252, so both bytes, 'c2' and 'ae', are displayed as separate characters.
If I replace the character with the entity &reg; (\u00ae), it shows correctly.
Short of converting my meta data to ISO-8859-1, is there a best practice I should be using for this?
The issue turned out to be on the back end: the data was not being transcoded to UTF-8 properly when read from cache. So I feel the best practice is to use the native UTF-8 BMP character with the proper page encoding, rather than being required to use HTML entity values.
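The symptom described above can be reproduced in a few lines. This is only a sketch of the general mechanism, not the actual back-end code, which the post does not show. The names MojibakeDemo, toHex, misdecode and repair are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hex dump of a byte array, e.g. {0xC2, 0xAE} becomes "c2ae".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Simulates a consumer that reads UTF-8 bytes as if they were ISO-8859-1.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage. This is lossless only because ISO-8859-1 maps every byte value.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String registered = "\u00ae"; // the registered trademark sign

        System.out.println(toHex(registered.getBytes(StandardCharsets.UTF_8))); // c2ae

        // One character became two: U+00C2 followed by U+00AE.
        String garbled = misdecode(registered);
        System.out.println(garbled.length()); // 2

        System.out.println(repair(garbled).equals(registered)); // true
    }
}
```

Seeing two characters where one was expected, with the second one being the original, is a typical sign that UTF-8 bytes were decoded with a single-byte charset somewhere in the pipeline.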
Look at the page's meta tags and confirm that it is not using this:
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
For HTML5 Google recommends:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
Also note this:
Note:
<meta charset="">
Another Note:
Some characters are reserved in HTML ("HTML entities").
These reserved characters must be replaced with character entities, e.g.:
Character   Description            Entity name   Entity number
&           ampersand              &amp;         &#38;
®           registered trademark   &reg;         &#174;
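As a rough illustration of replacing reserved characters with entities, here is a hand-rolled sketch. It is not a library API, and the names HtmlEscapeDemo and escapeHtml are made up. It escapes only the four characters that have syntactic meaning in HTML and leaves characters such as ® untouched, which is fine as long as the page is served in an encoding that can represent them.

```java
public class HtmlEscapeDemo {
    // Replaces the characters that have syntactic meaning in HTML with named entities.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c); // everything else, including non-ASCII, passes through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("a & b <c>")); // a &amp; b &lt;c&gt;
        System.out.println(escapeHtml("Acme\u00ae").equals("Acme\u00ae")); // true
    }
}
```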

UTF-8 Special Characters not working in header file

I am facing a strange issue with wkhtmltopdf. In the footer and content, all special characters are shown as expected, but in the header file they either don't show up or are replaced by blocks with a question mark. All three files are built the same way:
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
</body>
</html>
Is there some kind of trick to tell the header to use UTF-8 too?
Btw, I am already telling wkhtmltopdf to use utf-8 not only as meta but also in the script call:
--encoding utf-8
EDIT: I am using an HTML header (as you can see in the code I posted). All three HTML files are built the same way. But while it works in the content, the header doesn't like my special characters. Could the problem be that the content in the header comes from a $_POST variable, while the text in the content is built from the database?

How to escape to htmlentities except for html tags in smarty

Example:
$smarty->assign('string', '<p>Germans use "Ümlauts" and pay in €uro</p>');
{$string|escape|unescape:"html"}
results in:
<p>Germans use 'Ümlauts' and pay in €uro</p>
What am I doing wrong...
You should also pass UTF-8 to the escape modifier, as described in the documentation: http://www.smarty.net/docsv2/en/language.modifier.escape
There is more than one reason why this can occur.
Check the encoding of
your php files,
your template files and
your html output (doctype and meta tags),
Usually one of these causes the problem.
To avoid this kind of issue, the best approach in many cases is to use UTF-8 throughout your project, which means converting your Smarty templates and PHP files to UTF-8 and using the proper UTF-8 tags in your HTML head.
HTML 4.01:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
HTML5:
<meta charset="UTF-8">

How to Properly Define UTF-8 Charset in the <head> Section of a Web Document

If my doctype is <!DOCTYPE html>, is it better or more correct to use
<meta charset="utf-8" />
or
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
to define utf-8?
Thanks.
The first one is only valid with HTML5.
The second one is also valid for older (X)HTML versions.
With this doctype (indicating HTML5) both are valid. I prefer the first, as it is shorter. :)
