Importing extended ASCII into Oracle - oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.

There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points (0 through 127), that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
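Once the source encoding is known, the conversion itself is mechanical. Here is a minimal sketch in Ruby (not Oracle), assuming the bytes are ISO-8859-1; the helper name single_byte_to_utf8 is made up for illustration. The same byte would decode to a different character under a different single-byte encoding, which is why the encoding has to be known first.

```ruby
# Tag the raw bytes with the encoding they are really in, then transcode.
# `source_encoding` is an assumption the caller must supply; nothing in the
# bytes themselves says which single-byte encoding produced them.
def single_byte_to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

raw = "\xE0".b                                   # one byte, value 224
text = single_byte_to_utf8(raw, "ISO-8859-1")
puts text                                        # à
p text.bytes                                     # [195, 160], i.e. 0xC3 0xA0
```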

Related

getting utf-8 error when importing snowflake data in power bi

I'm using the power bi snowflake connector to import data from various tables.
While it works for some tables, it fails for a particular table with special character.
This is the error I get.
Can you help?
Best
I suspect that you have Windows-1252 "Latin 1 Windows" encoded data, Microsoft's embrace-and-extend version of iso-8859-1/ECMA-94. Somehow the data presents itself to the Power BI connector as utf8 when it isn't. When everything is correctly declared, the right software (ICU?) will correctly convert into Unicode and encode into utf8 before shipping the data to Snowflake.
You've got two choices:
1. Fix at the source (e.g. correct or declare the correct encoding), or
2. Import as binary data and try to fix after arrival in Snowflake.
My best advice is option 1: re-encode the data into utf8 before importing it to Snowflake.
You can't put something into a text field that isn't a sequence of valid characters. And in this case, you've got erroneous data that are not valid characters, so it is not possible to store as text.
How can this be? It is all about encoding. A utf8 character is a chained sequence of 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 caps it at 4) that decodes into a single Unicode code point of at most 21 bits. Skin-tone emojis look longer only because they are made of several code points, each encoded separately. The starting byte tells how long the utf8 sequence is, and each following byte must begin with the two continuation bits 10. If the starting byte is invalid, or the expected number of follow-up bytes don't carry the continuation bits, you have an invalid utf8 encoding.
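To make the role of the starting byte concrete, here is a rough sketch in Ruby that classifies a single byte by its high bits. It assumes the modern RFC 3629 layout, where sequences are at most 4 bytes long; the method name utf8_byte_kind is invented for this example.

```ruby
# Classify one byte according to the RFC 3629 bit layout.
# Returns the expected sequence length for a starting byte,
# :continuation for a 10xxxxxx follow-up byte,
# or nil for bytes that can never appear in valid UTF-8.
def utf8_byte_kind(byte)
  case byte
  when 0x00..0x7F then 1              # 0xxxxxxx: plain ASCII
  when 0x80..0xBF then :continuation  # 10xxxxxx: follow-up byte
  when 0xC2..0xDF then 2              # 110xxxxx: starts a 2-byte sequence
  when 0xE0..0xEF then 3              # 1110xxxx: starts a 3-byte sequence
  when 0xF0..0xF4 then 4              # 11110xxx: starts a 4-byte sequence
  end                                 # 0xC0, 0xC1, 0xF5..0xFF fall through to nil
end

p utf8_byte_kind(0x92)                # :continuation
```

A decoder that meets :continuation or nil where it expects a starting byte has to reject the input or substitute a replacement character.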
And how can this happen? There are character encodings where every byte sequence is legal, like the 8-bit iso-8859-1 "ISO latin 1" or its extended cousin Windows-1252. If you declare that such a sequence of bytes is utf8 rather than iso-8859-1, you've suddenly got a sequence of bytes that may contain invalid utf8 (because it's really Windows-1252 encoding).
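As for your error message: there is no legal utf8 character encoding starting with the byte HEX(92), which is a "follow-up" byte. In Windows-1252, that byte is the right single quotation mark (’), which fits the suspicion above.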
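The same situation can be reproduced outside Snowflake and Power BI. The sketch below is in Ruby and assumes the data really is Windows-1252, which is only a suspicion here; the helper names valid_utf8? and repair_cp1252 are hypothetical.

```ruby
# Does this byte string hold up when it is declared to be UTF-8?
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Re-declare the bytes as Windows-1252 (an assumption about the source)
# and transcode them to real UTF-8.
def repair_cp1252(bytes)
  bytes.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

raw = "It\x92s".b            # 0x92 is a lone continuation byte in UTF-8
p valid_utf8?(raw)           # false
puts repair_cp1252(raw)      # It’s
```

This corresponds to fixing the data before import: the bytes are unchanged until they are transcoded, only the declared encoding differs.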

How to decode/encode ASCII_8BIT in Scala?

I have an external system written in Ruby which sends data over the wire encoded with ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure that I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8bit-encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks depending on the options you give to encode) unless it only contains ASCII characters.
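A short Ruby sketch of that behaviour, for illustration only (the method names try_encode and encode_lossy are made up):

```ruby
# Strict conversion: returns nil when the binary string contains bytes
# outside the ASCII range, because ASCII-8BIT has no mapping for them.
def try_encode(binary)
  binary.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError
  nil
end

# Lossy conversion: unmappable bytes are replaced with a question mark.
def encode_lossy(binary)
  binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

data = "caf\xE9".b                               # String#b tags the bytes as ASCII-8BIT
p data.encoding == Encoding::ASCII_8BIT          # true
p try_encode(data)                               # nil
p encode_lossy(data)                             # "caf?"
```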
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding, doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
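If you want to test that on the Ruby side, something along these lines should work. It is a sketch that assumes the standard Base64 alphabet (RFC 4648, with + and /), not the URL-safe variant; the method name base64? is invented here.

```ruby
# True when the string is pure ASCII and decodes under strict Base64 rules.
# The "m0" directive of String#unpack1 raises ArgumentError on anything
# outside the standard alphabet or with broken padding.
def base64?(str)
  return false unless str.ascii_only?
  str.unpack1("m0")
  true
rescue ArgumentError
  false
end

p base64?("aGVsbG8=".b)      # true  ("hello")
p base64?("caf\xE9".b)       # false (contains a non-ASCII byte)
p base64?("not base64!")     # false (space and ! are not in the alphabet)
```

Passing this check only shows that the string could be Base64; ordinary text made of letters and digits can pass it too.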

Octal, Hex, Unicode

I have a character appearing over the wire that has a hex value and octal value \xb1 and \261.
This is what my header looks like:
From: "\261Central Station <sip#...>"
Looking at the ASCII table the character in the picture is "±":
What I don't understand:
If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1".
If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.
From the page you linked to:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.
That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).
Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because 0xb1 on its own isn't a valid UTF-8 character.
If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.
As to your questions:
If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?
The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?
We have no idea how you're testing this, so we can't answer this question. In general, though: Either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby), or whatever encoding the original text was in, or as raw bytes (Encoding::ASCII_8BIT or Encoding::BINARY, which are aliases for each other).
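As a rough illustration of the first option, assuming ISO 8859-1 is what the receiving side expects (the helper name latin1_wire_bytes is made up):

```ruby
# Transcode the UTF-8 text to ISO 8859-1, then view the result as raw bytes.
# Characters with no ISO 8859-1 equivalent raise
# Encoding::UndefinedConversionError.
def latin1_wire_bytes(text)
  text.encode(Encoding::ISO_8859_1).b
end

name = "\u00b1Central Station"
p name.bytes.first(2)                      # [194, 177], i.e. 0xC2 0xB1 in UTF-8
p latin1_wire_bytes(name).bytes.first      # 177, i.e. 0xB1
```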
If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)
That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print � for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
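You can check in irb that the three spellings produce the same bytes. This is a small sketch that assumes a UTF-8 source file (the Ruby default since 2.0); both method names are invented for the example.

```ruby
# Three ways of writing the same character in a Ruby string literal.
def plus_minus_spellings
  {
    codepoint:   "\u00b1",      # Unicode code point escape
    hex_bytes:   "\xC2\xB1",    # the two UTF-8 bytes in hex
    octal_bytes: "\302\261"     # the same two bytes in octal
  }
end

# A lone 0xB1 is a continuation byte with no starting byte before it.
def lone_b1_valid_utf8?
  "\xB1".b.force_encoding(Encoding::UTF_8).valid_encoding?
end

plus_minus_spellings.each { |label, str| p [label, str.bytes] }
p lone_b1_valid_utf8?           # false
```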
Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.

Convert unicode codepoint to string character in Ruby

I have these values from a unicode database but I'm not sure how to translate them into the human readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to their readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
Identifiers like U+2B71F are referred to as codepoints.
The unicode system defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies etc. This character set is steadily growing.
For example, U+221E is infinity.
The codepoints are hexadecimal numbers. There is always exactly one number defined per character.
There are many ways to arrange this in memory. This is known as an encoding of which the common ones are UTF-8 and UTF-16. The conversion to and fro is well defined.
Here you are most probably looking for converting the unicode codepoint to UTF-8 characters.
codepoint = "U+2B71F"
You need to extract the hex part coming after U+ and get only 2B71F. This will be the first capture group.
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And your UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
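Putting the steps together for the whole list might look like the sketch below. The method name codepoint_to_char is made up, and it uses a slightly looser pattern than the one above (4 to 6 hex digits plus an explicit range check) rather than the exact regex shown.

```ruby
# Convert a label such as "U+2B71F" into the UTF-8 character it names.
def codepoint_to_char(label)
  match = /\AU\+([0-9a-fA-F]{4,6})\z/.match(label)
  raise ArgumentError, "not a codepoint label: #{label.inspect}" unless match

  value = match[1].hex
  raise ArgumentError, "beyond U+10FFFF: #{label}" if value > 0x10FFFF

  [value].pack("U")   # "U" packs an integer as a UTF-8 character
end

labels = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
puts labels.map { |label| codepoint_to_char(label) }.join(" ")
```

Whether the resulting characters display properly depends on the terminal font, since these codepoints are all outside the Basic Multilingual Plane.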
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

UTF-8 string delimiter

I am parsing a binary protocol which has UTF-8 strings interspersed among raw bytes. This particular protocol prefaces each UTF-8 string with a short (two bytes) indicating the length of the following UTF-8 string. This gives a maximum string length of 2^16 - 1 = 65,535 bytes (assuming the short is unsigned), which is more than adequate for the particular application.
My question is, is this a standard way of delimiting UTF-8 strings?
I wouldn't call that delimiting, more like "length prefixing". Some people call them Pascal strings since in the early days the language Pascal was one of the popular ones that stored strings that way in memory.
I don't think there's a formal standard specifically for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes for that matter). It's defined over and over as a part of many standards that deal with messages that contain strings, though.
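For illustration, here is roughly what reading and writing such a field looks like in Ruby. The question doesn't say whether the two-byte length is signed or which byte order it uses, so this sketch assumes an unsigned big-endian count of bytes (not characters); both method names are hypothetical.

```ruby
require "stringio"

# Read one length-prefixed UTF-8 string from an IO-like object.
def read_prefixed_string(io)
  header = io.read(2)
  raise EOFError, "missing length prefix" if header.nil? || header.bytesize < 2

  length = header.unpack1("n")          # "n" = 16-bit unsigned, big-endian
  payload = io.read(length)
  raise EOFError, "truncated string" if payload.nil? || payload.bytesize < length

  text = payload.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "payload is not valid UTF-8" unless text.valid_encoding?
  text
end

# Build the wire form: two length bytes followed by the UTF-8 bytes.
def write_prefixed_string(text)
  bytes = text.encode(Encoding::UTF_8).b
  raise ArgumentError, "string longer than 65535 bytes" if bytes.bytesize > 0xFFFF
  [bytes.bytesize].pack("n") + bytes
end

frame = write_prefixed_string("\u00b1Central")
io = StringIO.new(frame + "\x01\x02".b)  # string followed by other raw bytes
puts read_prefixed_string(io)            # ±Central
p io.read.bytes                          # [1, 2]
```

Because the length is known up front, the reader never has to scan the payload for a terminator, and the string may contain any byte value.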
UTF-8 is not normally delimited. You should be able to spot the multibyte characters in there by using the rules described here: http://en.wikipedia.org/wiki/UTF-8#Description
I would use a delimiter which starts with 0x11......
But if you send raw bytes, you will have to exclude this delimiter from the data/messages processed. This means that if there is user input similar to that delimiter, you will have to convert (escape) it.
If the user inputs any UTF-8 represented character, you may simply send it as is.
