How to fix text encoding: ðø? I'm using Laravel and MySQL database.
I tried
mb_convert_encoding($var,$to,$from);
But I don't know which encodings to pass as $from and $to.
Create the database with UTF-8 encoding like this:
CREATE DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
If you do this, you should never need to convert anything. To ensure incoming user data has the right encoding, declare the page encoding inside the head tag:
<head>
<meta charset="UTF-8">
</head>
If you work with existing data, there is no way around figuring out what encoding it has before converting it with mb_convert_encoding.
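One rough way to figure out the encoding of existing data is to decode the same raw bytes under a few candidate encodings and see which result reads correctly. The sketch below uses Python rather than PHP; the helper name `preview_decodings` and the candidate list are assumptions made for illustration, not part of the answer above.

```python
def preview_decodings(raw, candidates=("utf-8", "iso-8859-1", "cp1252")):
    """Decode raw bytes under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The two bytes that UTF-8 uses for "ø".
sample = "ø".encode("utf-8")
for name, text in preview_decodings(sample).items():
    # Only the UTF-8 row shows a single "ø"; the others show two characters.
    print(name, repr(text))
```

A human still has to judge which row looks right: single-byte encodings such as ISO-8859-1 accept any byte sequence, so a successful decode alone proves nothing.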
My client's web app has a large database with millions of records. All tables use the latin1 encoding.
When I fetch a text field that holds a lot of data and mail that string, strange characters appear. For example, in the email I receive, spaces are turned into this character: Â.
It is not permissible to change the DB encoding.
I tried the following PHP function but no outcome ;(
$msg = mb_convert_encoding($msg, "UTF-8", "latin1");
Please help
I would first check which encoding PHP thinks the string is in:
echo mb_detect_encoding($str);
And then do
iconv("detectedEncoding", "UTF-8", $str);
Or, if iconv is not installed, check whether the source encoding you passed to mb_convert_encoding was actually right. ;)
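A stray Â in front of spaces is the typical result of a non-breaking space (U+00A0) that was encoded as UTF-8 and then read as latin1. Whether that is what happens in this particular database is an assumption; the thread does not confirm it. The Python sketch below shows the effect, and why converting from latin1 to UTF-8 a second time makes it permanent instead of fixing it.

```python
NBSP = "\u00a0"  # non-breaking space

# UTF-8 stores U+00A0 as the two bytes C2 A0.
utf8_bytes = NBSP.encode("utf-8")

# Reading those bytes as latin1 yields two characters: "Â" plus U+00A0.
misread = utf8_bytes.decode("latin-1")

# Converting the misread text "from latin1 to UTF-8" bakes the error in.
double_encoded = misread.encode("utf-8")

# Reversing the misread step recovers the original character.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(misread), double_encoded, repr(repaired))
```

If the stored bytes are already UTF-8 even though the column is declared latin1, the fix is to stop converting and to label the mail as UTF-8, not to convert again.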
French characters in HTML with a UTF-8 charset still display incorrectly. I have a small sample page at ShopAndBind.com/Sample.asp with META HTTP-EQUIV='Content-Type' CONTENT='text/html;charset=utf-8' that still does not display Véhicules Terrestres à Moteur correctly, whether the text is in the page source or loaded from a MySQL database. It displays fine everywhere else. I'm using Visual InterDev 6.0 from Visual Studio 2008 for development. Notepad and KEdit show it correctly. The hex bytes in the file are 'E9' and 'E0' for é and à respectively.
The page http://shopandbind.com/Sample.asp is served with HTTP headers that do not specify a character encoding, and the data does not start with a BOM, but it contains a meta tag that specifies UTF-8 as the character encoding. However, the data contains bytes that are invalid in UTF-8. This explains the failure.
The data is in fact in ISO-8859-1 (or a compatible) encoding, as you can see by manually selecting that encoding (often under the name “Western European”) in the View → Encoding menu of your browser. Bytes E9 and E0 denote é and à in ISO-8859-1, but definitely not in UTF-8.
Thus, the minimal fix is to replace UTF-8 with ISO-8859-1 in the meta tag. A better fix might be to make the process that produces the HTML file generate UTF-8 encoded data.
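The mismatch can be checked directly on the bytes. The Python sketch below rebuilds the phrase with the E9 and E0 bytes reported in the question and shows that it is rejected as UTF-8 but decodes cleanly as ISO-8859-1. The helper name `is_valid_utf8` is an assumption for illustration.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The phrase as stored in the file: single bytes E9 (é) and E0 (à).
raw = b"V\xe9hicules Terrestres \xe0 Moteur"

print(is_valid_utf8(raw))           # the page's declared charset does not fit
text = raw.decode("iso-8859-1")     # the encoding the file really uses
print(text)

# Re-encoding the decoded text gives bytes that match a UTF-8 declaration.
utf8_raw = text.encode("utf-8")
print(is_valid_utf8(utf8_raw))
```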
We are currently converting our webapp from ISO-8859-1 to UTF-8. Everything works great except for GET/POST variables sent to us from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site's charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't get %F8 to decode correctly when our site uses the UTF-8 charset. I only get the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD).
Any tips on how to fix this would be much appreciated :)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess the encoding you want. For that it usually takes the encoding of the page in which the form is. So if you serve the HTML files as UTF-8 your forms will be sent back as UTF-8, too.
I'd suggest you analyse the inputs before converting them. Essentially, scan for the ISO-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace over the entire request, swapping the ISO-8859-1 character codes for the UTF-8 ones.
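A related approach, which is a substitute for the search/replace idea and not the same technique, is to percent-decode each parameter to raw bytes, try strict UTF-8 first, and fall back to ISO-8859-1 only when that fails. The Python sketch below assumes the raw, still percent-encoded value is available; the function name `decode_param` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_param(value):
    """Decode a percent-encoded value that may be UTF-8 or ISO-8859-1.

    Strict UTF-8 is tried first because bytes produced by ISO-8859-1
    text are very rarely valid UTF-8. Note that unquote_to_bytes does
    not turn '+' into a space; form bodies need that handled separately.
    """
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


print(decode_param("%F8"))         # sent by an ISO-8859-1 page
print(decode_param("%C3%B8"))      # sent by a UTF-8 page
print(decode_param("Torbj%F8rn"))
```

The fallback is a heuristic: a few ISO-8859-1 byte pairs are also valid UTF-8, so an explicit accept-charset on the sending form remains the more reliable option where you control it.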
I'm using Ruby to read a web page whose content is:
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GB2312" />
</HEAD>
<BODY>
中文
</BODY>
</HTML>
From the meta, we can see it uses a GB2312 encoding.
My code is:
res = Net::HTTP.post_form(URI.parse("http://xxx/check"),
{:query=>'xxx'})
Then I use:
res.include?("中文")
to check whether the content contains that word. But it returns false.
I don't know why it is false or what I should do. What encoding does Ruby 1.8.7 use? If I need to convert the encoding, how do I do it?
Ruby 1.8 doesn't use encodings; it uses plain byte strings. If you want the byte string in your program to match the byte string in the web page, you'd have to save the .rb file in the same encoding the web page uses (GB2312) so that Ruby sees the same bytes.
Probably better would be to write the byte string explicitly, avoiding issues to do with the encoding of the .rb file:
res.include?("\xD6\xD0\xCE\xC4")
However, matching byte strings doesn't match characters reliably when multibyte encodings are in use (except for UTF-8, which is deliberately designed to allow it). If the web page had the string:
兄形男
in it, that would be encoded as "\xD0\xD6\xD0\xCE\xC4\xD0", which contains the byte sequence "\xD6\xD0\xCE\xC4". The include? call would therefore return true even though the characters 中文 are not present.
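The false positive is easy to reproduce in a language whose strings are encoding-aware. The Python sketch below (Python is used here only for illustration, not Ruby 1.8) encodes both strings as GB2312 and compares a byte-level search with a character-level one.

```python
# GB2312 bytes for the page text and for the word being searched.
page_bytes = "兄形男".encode("gb2312")
needle_bytes = "中文".encode("gb2312")

# Byte-level search: the needle's bytes straddle character boundaries
# in the page, so the search reports a match that is not really there.
byte_match = needle_bytes in page_bytes

# Character-level search: decode first, then compare characters.
char_match = "中文" in page_bytes.decode("gb2312")

print(byte_match, char_match)
```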
If you need to handle non-ASCII characters fully reliably, you'd need a language with Unicode support.
So I have a Ruby script that parses HTML pages and saves the extracted strings into a DB...
but I'm getting weird characters (usually question marks) instead of plain text...
Eg : ‘SOME TEXT’ instead of 'Some Text'
I've tried HTML entities and CGI::unescape... but to no avail...
I did some googling and set $KCODE = 'u' and added require 'jcode',
but it's still not working...
Any suggestions or pointers would be great.
Thanks
PS : using mysql 5.1
Your script is storing the Unicode escape sequences for quotation marks (instead of ASCII quotation marks) in the database.
That's actually good: it shows that the DB itself is working fine, although for best results you should ensure that the table uses a UTF-8 collation (for example utf8_unicode_ci) so that string sorting works properly.
The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
If it's terminal output, make sure that the LANG environment variable (ENV['LANG'] in Ruby) is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
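To see how a wrong output encoding garbles correctly stored text, the Python sketch below takes a string with curly quotes, encodes it as UTF-8, and reads the bytes back as Windows-1252. Windows-1252 is an assumed example of a mismatched display encoding, and the helper name `misread_as_cp1252` is hypothetical.

```python
def misread_as_cp1252(text):
    """Show what UTF-8 text looks like when displayed as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


stored = "\u2018SOME TEXT\u2019"   # curly quotes around the text
shown = misread_as_cp1252(stored)
print(shown)  # each curly quote becomes three unrelated characters

# The stored data is intact: reversing the misread step recovers it.
recovered = shown.encode("cp1252").decode("utf-8")
print(recovered == stored)
```

Because the round trip recovers the original, the data in the database does not need repair in this scenario; only the declared output encoding does.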
Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.
It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.
Ruby has an iconv implementation to convert between encoding types. See here for more information.