getting utf-8 error when importing snowflake data in power bi - utf-8

I'm using the power bi snowflake connector to import data from various tables.
While it works for some tables, it fails for a particular table with special character.
This is the error I get.
Can you help?
Best

I suspect that you have Windows-1252 "Latin 1 Windows" encoded data, Microsoft's embrace-and-extend version of iso-8859-1/ECMA-94. Somehow the data presents itself to the Power BI connector as utf8 when it isn't. When everything is correctly declared, the right software (ICU?) will correctly convert into Unicode and encode into utf8 before shipping the data to Snowflake.
You've got two choices:
Fix at the source (eg correct or declare correct encoding), or
Import as binary data and try to fix after arrival in Snowflake.
My best advise is 1. - to reencode it into utf8 before importing to Snowflake.
You can't put something into a text field that isn't a sequence of valid characters. And in this case, you've got erroneous data that are not valid characters, so it is not possible to store as text.
How can this be? It is all about encoding. An utf8 character is a chained byte sequence of up to 6 bytes that is decoded into a 1-5 significant byte Unicode character codepoint (skintone emojis are examples of long byte sequences). The starting byte tells how long the utf8 sequence is, and the following bytes all contain two continuation bits 10*. If the starting byte is invalid or the correct number of follow-up bytes don't have the continuation bits, you have an invalid utf8 encoding.
And how can this happen? There are character encodings where every byte sequence is legal, like the 8-bit iso-8859-1 "ISO latin 1" or its extended cousin Windows-1252. If you declare that this sequence of byte is utf8 and not iso-8859-1, you've suddenly got a sequence of bytes that may contain invalid utf8 (because it's really Windows-1252 encoding).
As of your error message, there is no legal utf8 character encoding starting with the byte HEX(92), which is a "follow-up" byte.

Related

How do I handle encoding for importing foreign movie titles with diacritics into a database?

There may not be one rule to this, but I am trying to find a way to deal with encoding between a PSQL import and retrieving and displaying records in bash or other programs. I am really new to the idea of encoding, so please bear with me! I'm running into issues with the character 'é'. I've gotten the error ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 on import (I believe the default on it is UTF-8) and was able to temporarily fix it by changing the encoding to Windows-1251 on the import. However, trying to retrieve the data in bash gave me the error ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252", so I assumed bash was using 1252 encoding.
I deleted it all and re-imported with WIN1252 encoding, and it worked for both import and retrieval. My concern is whether I may run into issues down the line with displaying this character or trying to retrieve it on a browser. Currently, if I select the movie by id in bash, I get Les MisΘrables. Honestly, not ideal, but it's okay with me if there won't be errors. It did scare me, though, when the query couldn't be completed because of an encoding mismatch.
From my little understanding of UTF-8, I feel like the character 'é' should have been accepted in the first place. I searched online for a character set and that was on it. Can anyone tell me a little more about any of this? My inclination is to go with UTF-8 as it seems the most ubiquitous, but I don't know why it was giving me trouble. Thanks!
EDIT: My complete lack of knowledge surrounding encoding led me to never save the file specifically encoded as UTF-8. That solved it. Thanks to all who looked into this!
Described three different individual problems in the question. Code examples given in Python (commonly comprehensible, I hope).
1st ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x72 0x61 means exactly what said: it's not UTF-8:
>>> print(bytearray([0xe9,0x72,0x61]).decode('latin1'))
éra
2nd ERROR: character with byte sequence 0xd0 0xb9 in encoding "UTF8" has no equivalent in encoding "WIN1252". Source is UTF-8, result is Cyrillic ('WIN1251'):
>>> print(bytearray([0xd0,0xb9]).decode('UTF-8'))
й
3rd I get Les MisΘrables (probably instead of Les Misérables?) It's a flagrant mojibake case:
>>> print('Les Misérables'.encode('latin1').decode('cp437'))
Les MisΘrables
In fine, here's a recapitulation of discussed byte values and corresponding characters. Note that the column 'CodePoint' contains Unicode (U+hhhh) and UTF-8 bytes:
Char CodePoint Category Description
---- --------- -------- -----------
é {U+00E9, 0xC3,0xA9} LowercaseLetter Latin Small Letter E With Acute
r {U+0072, 0x72} LowercaseLetter Latin Small Letter R
a {U+0061, 0x61} LowercaseLetter Latin Small Letter A
Ð {U+00D0, 0xC3,0x90} UppercaseLetter Latin Capital Letter Eth
¹ {U+00B9, 0xC2,0xB9} OtherNumber Superscript One
й {U+0439, 0xD0,0xB9} LowercaseLetter Cyrillic Small Letter Short I
Θ {U+0398, 0xCE,0x98} UppercaseLetter Greek Capital Letter Theta

Is there a term for characters that appear from ios logs like `\M-C\M-6` or `\134`

I'm trying to figure out the term for these types of characters:
\M-C\M-6 (corresponds to german "ö")
\M-C\M-$ (corresponds to german "ä")
\M-C\M^_ (corresponds to german "ß")
I want to know the term for these outputs so that I can easily convert them into the utf-8 character they actually are in golang instead of creating a mapping of each I come across.
What is the term for these? unicode? What would be the best way to convert these "characters" to their actual human readable character in golang?
It is the vis encoding of UTF-8 encoded text.
Here's an example:
The UTF-8 encoding of the rune ö in bytes is [0303, 0266].
vis encodes the byte 0303 as the bytes \M-C and the byte 0266 as the bytes \M-6.
Putting the two levels of encoding together, the rune ö is encoded as the bytes \M-C\M-6.
You can either write an decoder using the documentation on the man page or search for a decoding package. The Go standard library does not include such a decoder.

Oracle convert from utf-16 hex to utf-8 character

My database character set is AL32UTF8 and national character set AL16UTF16. I need to store in a table numeric values of some characters according to db character set and later on display a specific character using numeric value. I had some problems with understanding how this encoding works (differences between unistr, chr, ascii functions and so on), but eventually I found website where the following code was used:
chr(ascii(convert(unistr(hex), AL32UTF8)))
And it works fine when hex code is smaller than 1000 when I use for example:
chr(ascii(convert(unistr('\1555'), AL32UTF8)))
chr(ascii(convert(unistr('\1556'), AL32UTF8)))
it returns the same ascii value (ascii(convert(unistr('\hex >= 1000'), AL32UTF8))). Could anyone look at this and try to explain what's the reason? I really thought I understood how it works, but now I'm confused a bit.

Play framework JDBC ebean mysql exception with characters řů but accepts áõ

Trying to save models and i get a:
java.sql.SQLException: Incorrect string value: ...
Saving a text like "jedna dva tři kachna dům a kachní maso"
I'm using default.url="jdbc:mysql://[url]/[database]?characterEncoding=UTF-8"
řů have no encoding in latin1; áõ do. That suggests that CHARACTER SET latin1 is involved somewhere. Let's see SHOW CREATE TABLE.
C599, etc, are valid utf8 encodings for the corresponding characters.
? occurs when the destination character set cannot represent the character. Again, this points to the column/table being latin1, when it should be utf8 (or utf8mb4).
More discussion, and for debugging similar situations: Trouble with utf8 characters; what I see is not what I stored
Probably has some special character, and the UTF-8 encode that you are forcing may cause some error.
This ASCII string has the following text:
String:
jedna dva tři kachna dům a kachní maso
ASCII:
'jedna dva t\xc5\x99i kachna d\xc5\xafm a kachn\xc3\xad maso'

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 127 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.

Resources