In Go a byte is the same as a uint8, which means a byte can store a value between 0 and 255.
A string can also be represented as a slice of bytes. I've read that there are almost no differences between a string and a slice of bytes (except mutability).
So how is it possible in Go to write something like "世界" when these characters clearly don't fit in the 0–255 range of a single byte? How does Go handle characters that fall outside that range?
Go uses the UTF-8 encoding for source files, string literals, []rune-to-string conversions, string-to-[]rune conversions, integer-to-string conversions, and when ranging over a string.
UTF-8 uses one to four bytes to encode a character.
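To make that concrete, here is a minimal sketch showing both levels at once: the string "世界" occupies six bytes, while ranging over it decodes two runes.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "世界"
    fmt.Println(len(s))                    // 6: the string holds six bytes
    fmt.Println(utf8.RuneCountInString(s)) // 2: those bytes encode two runes
    for i, r := range s {
        // range decodes UTF-8: i is the byte offset, r is the code point
        fmt.Printf("byte offset %d: %U %c\n", i, r, r)
    }
}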
I have an external system written in Ruby which sends data over the wire encoded with ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128–255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
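For illustration, a minimal sketch of that fallback (in Go, for consistency with the rest of this page; ISO-8859-1 is a convenient arbitrary choice because its 256 byte values map one-to-one onto the first 256 Unicode code points):

package main

import "fmt"

// latin1ToString decodes bytes as ISO-8859-1. Every byte is a valid
// character, so this never fails; non-ASCII bytes may simply come out
// as the "wrong" character if the data was really in another encoding.
func latin1ToString(b []byte) string {
    runes := make([]rune, len(b))
    for i, c := range b {
        runes[i] = rune(c) // ISO-8859-1 byte values equal Unicode code points
    }
    return string(runes)
}

func main() {
    fmt.Println(latin1ToString([]byte{0x68, 0x69, 0xE0})) // "hià"
}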
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
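A small sketch of that test (again in Go): if encoding/base64 accepts the bytes, they were necessarily all ASCII to begin with; if it rejects them, the data wasn't Base64.

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    raw := []byte("aGVsbG8=") // Base64 for "hello"; every byte is ASCII
    decoded, err := base64.StdEncoding.DecodeString(string(raw))
    if err != nil {
        fmt.Println("not valid Base64:", err)
        return
    }
    fmt.Printf("%s\n", decoded) // hello
}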
I'm trying to figure out the term for these types of characters:
\M-C\M-6 (corresponds to german "ö")
\M-C\M-$ (corresponds to german "ä")
\M-C\M^_ (corresponds to german "ß")
I want to know the term for these outputs so that I can easily convert them into the UTF-8 characters they actually are in Go, instead of creating a mapping for each one I come across.
What is the term for these? Unicode? What would be the best way to convert these "characters" to their actual human-readable characters in Go?
It is the vis encoding of UTF-8 encoded text.
Here's an example:
The UTF-8 encoding of the rune ö is the two bytes [0303, 0266] (octal, i.e. 0xC3 0xB6).
vis encodes the byte 0303 as the bytes \M-C and the byte 0266 as the bytes \M-6.
Putting the two levels of encoding together, the rune ö is encoded as the bytes \M-C\M-6.
You can either write a decoder using the documentation in the vis(3) man page or search for a decoding package. The Go standard library does not include such a decoder.
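For illustration, here is a minimal sketch of such a decoder, handling only the two meta forms seen above (a real vis(3) decoder handles more escapes, such as \^c and octal sequences):

package main

import "fmt"

// decodeVis undoes the two vis(3) "meta" forms seen above:
//   \M-c  ->  c | 0x80            (meta character)
//   \M^c  ->  (c ^ 0x40) | 0x80   (meta-control character)
// It is a sketch, not a full vis decoder: other escapes pass through untouched.
func decodeVis(s string) string {
    var out []byte
    for i := 0; i < len(s); i++ {
        if i+3 < len(s) && s[i] == '\\' && s[i+1] == 'M' {
            switch s[i+2] {
            case '-':
                out = append(out, s[i+3]|0x80)
                i += 3
                continue
            case '^':
                out = append(out, (s[i+3]^0x40)|0x80)
                i += 3
                continue
            }
        }
        out = append(out, s[i])
    }
    return string(out)
}

func main() {
    fmt.Println(decodeVis(`\M-C\M-6`)) // ö (bytes c3 b6)
    fmt.Println(decodeVis(`\M-C\M^_`)) // ß (bytes c3 9f)
}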
According to this Go blog post on strings (https://blog.golang.org/strings), there is a line that states "...string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes."
What's the difference?
This distinction rests on the definitions in the Go language specification of a string literal and a string value:
String literal: A string literal represents a string constant obtained from concatenating a sequence of characters.
String value: A string value is a (possibly empty) sequence of bytes.
and the requirement that source code is represented in source files encoded as UTF-8 text:
Source code representation: Source code is Unicode text encoded in UTF-8.
Most crucially, note that a string literal is defined to be a constant, so its value is defined entirely at compile time.
Consider the set of valid Go source files to be those files which adhere to the language spec. By definition, such source files must be encoded using valid UTF-8. As a string literal's value is known at compile time from the contents of the source files, we see by construction that any string literal must contain valid UTF-8 text.
If we ignore byte-level escape sequences, we can see further that the compiled output of the string literal in the resulting binary must also contain valid UTF-8, as it is compiled from a UTF-8 value in the source file.
Sequences of bytes? A generalisation
However, string values are defined to be sequences of bytes,1 so there is no requirement that they themselves contain valid UTF-8. Such non-UTF-8 text may arise from values received by the program from inputs other than string literals defined at compile time, such as data received over sockets or local pipes, read from a file, or string literals which contain byte-level escape sequences.
1As the blog post says, this sequence of bytes may be considered to be a byte slice ([]byte) for most purposes. However, this is not strictly correct, as the underlying implementation does not use a slice; strings are immutable and do not need to track their capacity separately. Calling cap(str) where str is of type string is an error.
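A minimal illustration of that footnote:

package main

import "fmt"

func main() {
    s := "hello"
    fmt.Println(len(s)) // fine: len is defined for strings
    // fmt.Println(cap(s)) // does not compile: cap is not defined for strings
}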
Byte-level escape sequences
Byte-level escape sequences provide a mechanism for encoding non-UTF-8 values using a UTF-8 representation. Such sequences allow arbitrary bytes to be represented in a UTF-8 format to produce valid Go source files and satisfy the language spec.
For example, the byte sequence b2 bd (where bytes are here represented as space-delimited two-digit hexadecimal numbers) is not valid UTF-8. Attempting to decode this byte sequence using a UTF-8 decoder will produce an error:
var seq = "\xb2\xbd"
fmt.Println("Is 'seq' valid UTF-8 text?", utf8.Valid([]byte(seq)))
$ go run ./main.go
Is 'seq' valid UTF-8 text? false
Consequently, while such a byte sequence can be stored in a string in Go, it is not possible to express directly in a Go source file. Any source file which does so is invalid Go and the lexer will reject it (see below).
A backslash-escape sequence provides a mechanism for decomposing this byte string into a valid UTF-8 representation which satisfies the toolchain. The interpreted Go string literal "\xb2\xbd" represents the string using a sequence of ASCII characters, which can be expressed in UTF-8. When compiled, this sequence is parsed to produce a string in the compiled output containing the byte sequence b2 bd.
Example
I'll provide a working example to demonstrate this more concretely. As we will be generating invalid source files, I cannot readily use the Go playground for this; I used the go toolchain on my machine. The toolchain is go version go1.11 darwin/amd64.
Consider the following simple Go program where values may be specified for the string myStr. For convenience, we replace the whitespace (the byte sequence 20 20 20 after the string my value:) with the desired bytes. This allows us to easily find the compiled output when disassembling binaries later.
package main

import "fmt"

func main() {
    myStr := "my value:   " // note: three trailing spaces (bytes 20 20 20)
    fmt.Println(myStr)
}
I used a hex editor to insert the invalid UTF-8 byte sequence b2 bd into the final bytes of the whitespace in myStr. Representation below; other lines elided for brevity:
00000030: 203a 3d20 22b2 bd20 220a 0966 6d74 2e50 := ".. "..fmt.P
Attempting to build this file results in the error:
$ go build ./test.go
# command-line-arguments
./test.go:6:12: invalid UTF-8 encoding
If we open the file in an editor, it interprets the bytes in its own way (likely to be platform-specific) and, on save, emits valid UTF-8. On this OS X-based system, using vim, the sequence is interpreted as Unicode U+00B2 U+00BD, or ²½. This satisfies the compiler, but the source file and hence compiled binary now contains a different byte sequence than the one originally intended. The code dump below exhibits the sequence c2 b2 c2 bd, as does a hex dump of the source file. (See the final two bytes of line 1 and first two bytes of line 2.)
$ gobjdump -s -j __TEXT.__rodata test | grep -A1 "my value"
10c4c50 63746f72 796d7920 76616c75 653ac2b2 ctorymy value:..
10c4c60 c2bd206e 696c2065 6c656d20 74797065 .. nil elem type
To recover the original byte sequence, we can change the string definition thus:
myStr := "my value: \xb2\xbd"
Dumping the source file produced by the editor yields a valid UTF-8 sequence 5c 78 62 32 5c 78 62 64, the UTF-8 encoding of the ASCII characters \xb2\xbd:
00000030: 203a 3d20 226d 7920 7661 6c75 653a 205c := "my value: \
00000040: 7862 325c 7862 6422 0a09 666d 742e 5072 xb2\xbd"..fmt.Pr
Yet the binary produced by building this source file demonstrates the compiler has transformed this string literal to contain the desired b2 bd sequence (final two bytes of the fourth column):
10c48b0 6d792076 616c7565 3a20b2bd 6e6f7420 my value: ..not
In general, there is no difference between a string value and a string literal. A string literal is just one of the ways to initialize a new string variable (the string as-is).
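A small sketch of that point: a value built at run time is indistinguishable from one written as a literal.

package main

import "fmt"

func main() {
    lit := "héllo"                                              // string literal
    built := string([]byte{0x68, 0xc3, 0xa9, 0x6c, 0x6c, 0x6f}) // same bytes, built at run time
    fmt.Println(lit == built)                                   // true: the values are indistinguishable
}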
I am trying to fit a Pinyin string into a character array. For example, if I have a string like the one below:
std::string str = "转换汉字为拼音音"; // needs at least 25 bytes to store (24 + terminating zero)
char destination[22];
strncpy(destination, str.c_str(), 20); // copies 20 bytes; no terminator is appended
destination[20] = '\0';                // terminate immediately after the copied bytes
Since a Chinese character takes 3 bytes, I can do strncpy(destination, str.c_str(), (20/3)*3); but if str contains any character other than Chinese (one that takes 2 or 4 bytes in the UTF-8 encoding), this logic will fail.
Later, if I try to print destination, only the first 6 Chinese characters are printed properly and the remaining 2 bytes are printed as hexadecimal.
Is there any way I can shorten the string before copying it to destination, so that when destination is printed, proper Chinese characters come out (without any stray hex bytes)? Perhaps using the Poco::TextEncoding or Poco::UTF8Encoding class?
Thanks in advance.
Nothing short of creating your own way to encode the text would work. But even in that case you would have to create a 25-character array (don't forget the zero at the end!) to store the string so it can be printed properly, unless you also create your own printing routines.
That is, the amount of work required doesn't balance out the win of an extra 3 bytes.
Note that the code is practically C. In C++ you wouldn't use that style of code.
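That said, the "shorten the string" half of the question has a simple general answer: truncate at a UTF-8 rune boundary instead of at a fixed byte count. A minimal sketch of the idea in Go (the language used elsewhere on this page); truncateUTF8 is a hypothetical helper, and the same backward scan translates directly to C++:

package main

import (
    "fmt"
    "unicode/utf8"
)

// truncateUTF8 returns the longest prefix of s that fits in max bytes
// without splitting a multi-byte UTF-8 sequence.
func truncateUTF8(s string, max int) string {
    if len(s) <= max {
        return s
    }
    // Back up from max until we land on the start of a rune.
    for max > 0 && !utf8.RuneStart(s[max]) {
        max--
    }
    return s[:max]
}

func main() {
    s := "转换汉字为拼音音"            // 8 CJK runes, 3 bytes each = 24 bytes
    fmt.Println(truncateUTF8(s, 20)) // 转换汉字为拼 (6 whole runes, 18 bytes)
}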
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF-8 character.
Update: if I happen to know the equivalent Oracle character set for the incoming text, can I then convert the raw bytes to UTF-8? The source text will always be single-byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points (0 through 127), that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
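As a byte-level sketch of that translation in Go (assuming, for illustration, that the source encoding really is ISO-8859-1, where byte 0xE0 is 'à', the question's CHR(224)):

package main

import "fmt"

func main() {
    // In ISO-8859-1 the single byte 0xE0 is 'à' (the question's CHR(224)).
    b := byte(0xE0)
    s := string(rune(b))           // treat the byte as a code point, then encode as UTF-8
    fmt.Printf("% x\n", []byte(s)) // c3 a0: the same character is two bytes in UTF-8
    fmt.Println(s)                 // à
}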