Find out character encoding of straße - utf-8

I'm struggling with the encoding of content from an external interface. In the MySQL database the collation is latin1_swedish_ci, and the collation of the field is also latin1_swedish_ci. The PHP script is encoded in UTF-8 and the output in the browser is UTF-8. Everything works fine except the content from this database. The database connection should be UTF-8 (TYPO3 4.7), and the content is
straße
but it should be straße.
mb_detect_encoding($data['street'],'UTF-8') says it is UTF-8. If I use utf8_decode() I get
stra�?e
If I use utf8_encode() I get
straße
My assumption was that UTF-8 encoded data is stored in an ISO-8859-1 column, but if that were the case it shouldn't cause such problems here. How do I find out what the real encoding is?
PS: I cannot change the encoding of the source!
My solution for my initial problem:
I had to switch the database connection from UTF-8 to ISO-8859-1 (latin1) with this line of code:
$res = $GLOBALS['TYPO3_DB']->sql_query("SET NAMES latin1");

The character ß 'LATIN SMALL LETTER SHARP S' (U+00DF) is encoded in UTF-8 as the two bytes 0xC3 and 0x9F, as per the linked site:
UTF-8 (hex) 0xC3 0x9F (c39f)
If we look at the ISO-8859-1 codepage layout, those bytes represent the character Ã and a character not defined in the ISO-8859-1 codepage layout, so this is not it. Another common character encoding which overlaps with ISO-8859-1 is Windows CP1252 (also known as ANSI, used by default when saving a text file in Notepad, which can be overridden in the Save As dialog). If we look at the CP1252 codepage layout, those bytes represent the characters Ã and Ÿ, which matches what you're initially retrieving.
So, it's most likely CP1252 encoded.
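The comparison above can be reproduced with a minimal Python 3 sketch. Python is used here only for illustration, and the variable names are hypothetical. It takes the two UTF-8 bytes of ß and decodes them under both candidate encodings.

```python
# UTF-8 encoding of U+00DF (LATIN SMALL LETTER SHARP S).
raw = "ß".encode("utf-8")
assert raw == b"\xc3\x9f"

# Python's latin-1 codec maps byte 0x9F to the invisible C1 control U+009F,
# so this reading cannot produce a visible second character.
as_latin1 = raw.decode("latin-1")

# In CP1252 the same byte is U+0178 (LATIN CAPITAL LETTER Y WITH DIAERESIS).
as_cp1252 = raw.decode("cp1252")

print([hex(ord(c)) for c in as_latin1])   # ['0xc3', '0x9f']
print([hex(ord(c)) for c in as_cp1252])   # ['0xc3', '0x178']
```

Only the CP1252 reading yields two printable characters, which is consistent with the garbled text in the question.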

What you see as “ß” is really the windows-1252 (also known as CP1252) interpretation of the two bytes 0xC3 and 0x9F that constitute the UTF-8 encoding of “ß”. This means that the data is actually UTF-8 encoded and merely gets misinterpreted as windows-1252. So I think it should simply be processed as UTF-8, with due precautions.
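If the text has already been garbled once, the misinterpretation can often be reversed. The following is a hedged Python 3 sketch, not part of the original answer: `repair_mojibake` is a hypothetical helper that assumes exactly one round of "UTF-8 bytes decoded as windows-1252" has happened.

```python
def repair_mojibake(text, wrong="cp1252", right="utf-8"):
    """Undo one round of 'UTF-8 bytes decoded as cp1252'.

    Returns the input unchanged when it does not look like mojibake.
    """
    try:
        # Re-create the original bytes, then decode them as intended.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

fixed = repair_mojibake("stra\u00c3\u0178e")   # the garbled form of "straße"
untouched = repair_mojibake("straße")          # already correct, left alone
```

This is one of the "due precautions": text that is already correct fails the round trip and is returned as-is. Mojibake that involved bytes undefined in CP1252 (such as 0x81) cannot be repaired this way.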

I recommend that you verify which charset is being used by your SQL connection. It is NOT necessarily the same as the charset that you define for your database.
FROM PHP
// Opens a connection to a MySQL server
$connection = mysql_connect ($server, $username, $password);
$charset = mysql_client_encoding($connection);
$flagChange = mysql_set_charset('utf8', $connection);
echo "The character set is: $charset<br>mysql_set_charset result: $flagChange<br>";
INSIDE PHPMYADMIN
Open the database information_schema.
Open the table schemata.
Check your MySQL default collation.
You may or may not be able to change these parameters, depending on user privileges.
As shown above, I solved my conflicting character set problems in MySQL by appending the following line to my connection.php file (which I call at the beginning of every page that uses DB access):
$flagChange = mysql_set_charset('utf8', $connection);

Related

AWS SAM throws UnicodeEncodeError when invoking NodeJS 12.x lambda function [duplicate]

What could be causing this error when I try to insert a foreign character into the database?
>>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!
I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everything to latin-1.
This can be fixed by executing the following commands right after
you've established the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().
Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.
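The trade-offs described above can be seen side by side in a short Python 3 sketch (the answer's own session is Python 2; the variable names here are illustrative only).

```python
text = "He said \u201cHello\u201d"

# ISO-8859-1 has no slot for U+201C / U+201D, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

as_cp1252 = text.encode("cp1252")              # curly quotes become 0x93 / 0x94
dropped = text.encode("iso-8859-1", "ignore")  # quotes are silently removed
as_utf8 = text.encode("utf-8")                 # three bytes per curly quote

print(latin1_ok, as_cp1252, dropped, as_utf8)
```

Only the UTF-8 result keeps every character and can be decoded back to the original string without knowing which Windows code page was involved.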
The best solution is:
Set MySQL's charset to utf8.
Pass use_unicode=True and charset="utf8" when connecting, as suggested in this comment (KyungHoon Kim, Mar 13 '14 at 17:04):
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="testdb", use_unicode=True, charset="utf8")
For details, see the docstring of the MySQLdb connection class:
class Connection(_mysql.connection):
    """MySQL Database Connection Object"""
    default_cursor = cursors.Cursor
    def __init__(self, *args, **kwargs):
        """
        Create a connection to the database. It is strongly recommended
        that you only use keyword parameters. Consult the MySQL C API
        documentation for more information.
        host
          string, host to connect
        user
          string, user to connect as
        passwd
          string, password to use
        db
          string, database to use
        port
          integer, TCP/IP port to connect to
        unix_socket
          string, location of unix_socket to use
        conv
          conversion dictionary, see MySQLdb.converters
        connect_timeout
          number of seconds to wait before the connection attempt
          fails.
        compress
          if set, compression is enabled
        named_pipe
          if set, a named pipe is used to connect (Windows only)
        init_command
          command which is run once the connection is created
        read_default_file
          file from which default client values are read
        read_default_group
          configuration group to use from the default file
        cursorclass
          class object, used to create cursors (keyword only)
        use_unicode
          If True, text-like columns are returned as unicode objects
          using the connection's character set. Otherwise, text-like
          columns are returned as normal strings. Unicode objects
          will always be encoded to the connection's character set
          regardless of this setting.
        charset
          If supplied, the connection character set will be changed
          to this character set (MySQL-4.1 and newer). This implies
          use_unicode=True.
        sql_mode
          If supplied, the session SQL mode will be changed to this
          setting (MySQL-4.1 and newer). For more details and legal
          values, see the MySQL documentation.
        client_flag
          integer, flags to use or 0
          (see MySQL docs or constants/CLIENTS.py)
        ssl
          dictionary or mapping, contains SSL connection parameters;
          see the MySQL documentation for more details
          (mysql_ssl_set()). If this is set, and the client does not
          support SSL, NotSupportedError will be raised.
        local_infile
          integer, non-zero enables LOAD LOCAL INFILE; zero disables
        autocommit
          If False (default), autocommit is disabled.
          If True, autocommit is enabled.
          If None, autocommit isn't set and server default is used.
        There are a number of undocumented, non-standard methods. See the
        documentation for the MySQL C API for some hints on what they do.
        """
I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.
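Use the snippet below to strip accents (combining marks) from Latin-script text, reducing it to unaccented letters:
import unicodedata
def strip_accents(text):
    # NFKD splits each accented letter into base letter + combining mark;
    # category 'Mn' (nonspacing mark) identifies the marks to drop.
    return "".join(char for char in
                   unicodedata.normalize('NFKD', text)
                   if unicodedata.category(char) != 'Mn')
strip_accents('áéíñóúü')
Output:
'aeinouu'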
You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!
SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
Look up \u2013 to identify which character is actually causing the error (it is the en dash, –). You can then replace that character in the string with another one that is part of the encoding you are using.
Solution 2:
Change the target encoding to one that includes every character in your string. Encoding the string will then work fine.
The code below encodes a string with such an encoding; it is borrowed from @bobince's answer:
u'He said \u201CHello\u201D'.encode('cp1252')
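For Solution 1, the offending characters can be located programmatically instead of searching by hand. This is a minimal Python 3 sketch; `find_unencodable` is a hypothetical helper, not a library function.

```python
import unicodedata

def find_unencodable(text, encoding="latin-1"):
    """Return (position, character, Unicode name) for each character
    that the given encoding cannot represent."""
    problems = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((pos, ch, unicodedata.name(ch, "UNKNOWN")))
    return problems

# An en dash and a pair of curly quotes, none of which exist in Latin-1.
report = find_unencodable("pages 10\u201320 \u201cdraft\u201d")
```

Each reported position lines up with the "in position N" part of the UnicodeEncodeError message, which makes it easier to decide whether to replace the character or switch encodings.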
The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8')  # not available in mysql.connector
I ran into the same problem when I was using PyMySQL. I checked the package version; it was 0.7.9.
Then I uninstalled it and installed PyMySQL 1.0.2, and the issue was solved.
pip uninstall PyMySQL
pip install PyMySQL
Python: You will need to add
# -*- coding: UTF-8 -*-
as the first line of the Python file, and then call .encode('ascii', 'xmlcharrefreplace') on the text to encode. This replaces every non-ASCII character with its XML numeric character reference (for example &#8220; for “), so the result is pure ASCII.
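A short Python 3 sketch of the xmlcharrefreplace error handler mentioned above. In Python 3 the source file is UTF-8 by default, so the coding line is only needed for Python 2 or for non-UTF-8 source files.

```python
text = "He said \u201cHello\u201d"

# Non-ASCII characters become decimal XML character references.
escaped = text.encode("ascii", "xmlcharrefreplace")
print(escaped)  # b'He said &#8220;Hello&#8221;'
```

The result is safe to store in a Latin-1 column, but it only renders correctly when the consumer interprets it as HTML or XML; anything else will show the raw `&#8220;` text.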

Play framework JDBC ebean mysql exception with characters řů but accepts áõ

I'm trying to save models and I get:
java.sql.SQLException: Incorrect string value: ...
Saving a text like "jedna dva tři kachna dům a kachní maso"
I'm using default.url="jdbc:mysql://[url]/[database]?characterEncoding=UTF-8"
řů have no encoding in latin1; áõ do. That suggests that CHARACTER SET latin1 is involved somewhere. Let's see SHOW CREATE TABLE.
C599, etc, are valid utf8 encodings for the corresponding characters.
? occurs when the destination character set cannot represent the character. Again, this points to the column/table being latin1, when it should be utf8 (or utf8mb4).
More discussion, and for debugging similar situations: Trouble with utf8 characters; what I see is not what I stored
The text probably contains some special characters, and the UTF-8 encoding that you are forcing may cause an error.
Here is the string and its UTF-8 byte sequence, written with escapes:
String:
jedna dva tři kachna dům a kachní maso
UTF-8 bytes:
'jedna dva t\xc5\x99i kachna d\xc5\xafm a kachn\xc3\xad maso'
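Both observations in this thread can be checked with a few lines of Python 3: which characters of the sample text have no latin1 encoding, and what the UTF-8 bytes of the text look like. This is only an illustrative sketch; `fits` is a hypothetical helper.

```python
sample = "jedna dva tři kachna dům a kachní maso"

def fits(ch, encoding):
    """True if the single character can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Characters a latin1 column cannot hold.
not_latin1 = [ch for ch in sample if not fits(ch, "latin-1")]

# The same byte sequence as the escaped string above.
utf8_bytes = sample.encode("utf-8")
print(utf8_bytes)
```

Only ř and ů fail the latin1 check, while í (like á and õ) passes, which matches the symptom that some accented letters are accepted and others are rejected.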

Special French characters in HTML

French characters in HTML with the utf-8 charset still display incorrectly. I have a small sample page at ShopAndBind.com/Sample.asp with META HTTP-EQUIV='Content-Type' CONTENT='text/html;charset=utf-8' that still does not display Véhicules Terrestres à Moteur correctly, whether the text is in the source or loaded from a MySQL database. It displays fine everywhere else. I'm using Visual InterDev 6.0 from Visual Studio 2008 for development. Notepad and Kedit show it correctly. The hex values in the file are 'E9' and 'E0' respectively for é and à.
The page http://shopandbind.com/Sample.asp is served with HTTP headers that do not specify character encoding, the data does not start with BOM, but it contains a meta tag that specifies UTF-8 as the character encoding. However, the data contains bytes that are invalid in UTF-8. This explains the failure.
The data is in fact in ISO-8859-1 (or compatible) encoding, as you can see by manually selecting that encoding (often under the name “Western European”) in the View → Encoding menu of your browser. Bytes E0 and E9 denote à and é in ISO-8859-1, but definitely not in UTF-8.
Thus, the minimal fix is to replace UTF-8 with ISO-8859-1 in the meta tag. A better fix might be to make the process that produces the HTML file generate UTF-8 encoded data.
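The claim that the bytes are invalid as UTF-8 can be verified mechanically. Below is a minimal Python 3 sketch; `guess_encoding` is a hypothetical helper, and the fallback to ISO-8859-1 is an assumption that fits this page, not a general detection method.

```python
def guess_encoding(data):
    """Return 'utf-8' if the bytes decode cleanly, else fall back to ISO-8859-1."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

# The phrase from the question, with the single-byte values E9 and E0.
page_bytes = b"V\xe9hicules Terrestres \xe0 Moteur"
encoding = guess_encoding(page_bytes)
text = page_bytes.decode(encoding)
```

A lone 0xE9 or 0xE0 followed by an ASCII letter is never valid UTF-8, so the strict decode fails and the bytes are read as ISO-8859-1 instead.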

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; users can set the language via a dropdown on the site itself, the default value being German.
When the user logs in, some settings are set depending on the language (charset, codepage and LCID). At this point I also want to point out that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fire up Visual Studio 2010, edit the files in question and upload them to my server using Filezilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are being displayed incorrectly (something like ä) - but only on the files I opened with VS2010.
I checked the charset on the site itself and also displaying it with Response.CharSet and it was ISO-8859-1, which is correct.
So I tried some converting with notepad++, but no success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database-value and b) it kind of messes things up in other languages.
You are displaying a UTF-8 encoded file with an ISO-8859-1 view. Why do you see two characters instead of one? Because in UTF-8 the German letter ä (small a with two dots) is a 2-byte sequence (0xC3 and 0xA4). If this is NOT displayed as UTF-8 but as ISO-8859-1, where one byte is one character, you get what you described: the start byte 0xC3 as a single ISO-8859-1 character (Ã) and the following byte 0xA4 as another single ISO-8859-1 character (¤). In UTF-8 this 2-byte sequence must be decoded by extracting the payload bits of the start byte and the following byte like this:
Startbyte: 11000011
Following: 10100100
The marker bits 110 of the start byte are stripped off, so the payload 00011 is left.
The marker bits 10 of the following byte are stripped off, so the payload 100100 is left.
Chained together this becomes 00011100100, i.e. 11100100, which is decimal 228 (0xE4), the Unicode code point of ä (U+00E4).
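The bit manipulation above can be written out as a small Python 3 sketch. `decode_two_byte_utf8` is a hypothetical helper for illustration; it handles only 2-byte sequences and does not reject overlong encodings the way a full decoder would.

```python
def decode_two_byte_utf8(lead, cont):
    """Decode one 2-byte UTF-8 sequence into its Unicode code point."""
    if (lead & 0b11100000) != 0b11000000:
        raise ValueError("not a 2-byte UTF-8 start byte")
    if (cont & 0b11000000) != 0b10000000:
        raise ValueError("not a UTF-8 continuation byte")
    # Keep the 5 payload bits of the start byte and the 6 of the following byte.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)

code_point = decode_two_byte_utf8(0xC3, 0xA4)
print(code_point, bin(code_point))   # 228 0b11100100
```

The result agrees with Python's built-in decoder: `bytes([0xC3, 0xA4]).decode("utf-8")` gives the same single character, U+00E4.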
I recommend leaving the encoding as it is, UTF-8. It is the viewer or editor that should display UTF-8 encoded files as UTF-8 and not as ISO-8859-1. In other words, configure the viewer's or editor's encoding to match the encoding of the file's content (which in your case is UTF-8 and NOT ISO-8859-1).
To convert your files or check them for a certain encoding, just use madedit. madedit has a built-in hex-editor which wraps a rectangle around utf-8 sequences, displaying just one character on the right side (the encoded codepoint). It's easy to identify single-byte characters and/or 2/3/4-byte sequences within utf-8 encoded files. It also wraps a rectangle around the 3-byte utf-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are coherent to what you want to output.
Also note that Notepad++ has an "Encode as..." and a "Convert to..." option.
The first one reads the file's existing bytes as the specified encoding; the second one reads the file and writes it back in the selected encoding (changing the file).

Different querystring urlencoding based on codepage. ASP classic

We are currently converting our web app from ISO-8859-1 to UTF-8. Everything works great except reading GET/POST variables sent from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 encoding and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site's charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't get %F8 decoded correctly when I use the UTF-8 charset. I only get the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD).
Any tips on how to fix this would be much appreciated :)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess the encoding you want. For that it usually takes the encoding of the page in which the form is. So if you serve the HTML files as UTF-8 your forms will be sent back as UTF-8, too.
I'd suggest you do a pre-analysis of the inputs before converting them. Essentially, scan for the ISO-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace over the entire request, swapping the ISO-8859-1 character codes for the UTF-8 ones.
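A related approach, swapped in here instead of scanning for specific character codes, is to percent-decode to raw bytes, try UTF-8 first, and fall back to ISO-8859-1 when the bytes are not valid UTF-8. The sketch below is in Python 3 for illustration only (the original application is classic ASP), and `decode_param` is a hypothetical helper.

```python
from urllib.parse import unquote_to_bytes

def decode_param(raw):
    """Percent-decode a query value, preferring UTF-8 over ISO-8859-1."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the sending page used ISO-8859-1.
        return data.decode("iso-8859-1")

from_latin1_site = decode_param("Torbj%F8rn")
from_utf8_site = decode_param("Torbj%C3%B8rn")
```

This works because ISO-8859-1 text with accented letters is rarely valid UTF-8. It is still a heuristic: a few ISO-8859-1 byte pairs happen to form valid UTF-8 and would be decoded the wrong way.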

Resources