According to this golang blog post (https://blog.golang.org/strings) on strings, there is a line that states "...string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes."
What's the difference?
This distinction rests on the definitions in the Go language specification of a string literal and a string value:
String literal: A string literal represents a string constant obtained from concatenating a sequence of characters.
String value: A string value is a (possibly empty) sequence of bytes.
and the requirement that source code is represented in source files encoded as UTF-8 text:
Source code representation: Source code is Unicode text encoded in UTF-8.
Most crucially, note that a string literal is defined to be a constant, so its value is defined entirely at compile time.
Consider the set of valid Go source files to be those files which adhere to the language spec. By definition, such source files must be encoded using valid UTF-8. As a string literal's value is known at compile time from the contents of the source files, we see by construction that any string literal must contain valid UTF-8 text.
If we ignore byte-level escape sequences, we can see further that the compiled output of the string literal in the resulting binary must also contain valid UTF-8, as it is compiled from a UTF-8 value in the source file.
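For instance, a quick check (a minimal sketch; the variable name is mine, and it mirrors the utf8.Valid example further down) confirms that an escape-free literal is valid UTF-8:

var lit = "héllo, 世界" // no byte-level escapes, so this is UTF-8 by construction
fmt.Println("Is 'lit' valid UTF-8 text?", utf8.ValidString(lit))
// Is 'lit' valid UTF-8 text? true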
Sequences of bytes? A generalisation
However, string values are defined to be sequences of bytes,¹ so there is no requirement that they themselves contain valid UTF-8. Such non-UTF-8 content may arise from values the program receives at runtime, such as data read from sockets, local pipes, or files, or from string literals which contain byte-level escape sequences.
¹ As the blog post says, this sequence of bytes may be considered to be a byte slice ([]byte) for most purposes. However, this is not strictly correct, as the underlying implementation does not use a slice: strings are immutable and do not need to track their capacity separately. Calling cap(str) where str is of type string is a compile-time error.
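To make this concrete, here is a small sketch (the byte values are arbitrary) of a string value built from bytes obtained at runtime rather than from a literal:

// raw could equally have come from a socket, pipe or file at runtime.
raw := []byte{0xb2, 0xbd}
str := string(raw)
fmt.Println(len(str))                // 2: length in bytes, not runes
fmt.Println(utf8.Valid([]byte(str))) // false: the string holds non-UTF-8 bytes
// fmt.Println(cap(str))             // does not compile: invalid argument str (type string) for cap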
Byte-level escape sequences
Byte-level escape sequences provide a mechanism for encoding non-UTF-8 values using a UTF-8 representation. Such sequences allow arbitrary bytes to be represented in a UTF-8 format to produce valid Go source files and satisfy the language spec.
For example, the byte sequence b2 bd (where bytes are here represented as space-delimited two-digit hexadecimal numbers) is not valid UTF-8. Attempting to decode this byte sequence using a UTF-8 decoder will produce an error:
var seq = "\xb2\xbd"
fmt.Println("Is 'seq' valid UTF-8 text?", utf8.Valid([]byte(seq)))
$ go run ./main.go
Is 'seq' valid UTF-8 text? false
Consequently, while such a byte sequence can be stored in a string in Go, it is not possible to express it directly in a Go source file. Any source file containing those raw bytes is not valid Go, and the lexer will reject it (see below).
A backslash-escape sequence provides a mechanism for decomposing this byte string into a valid UTF-8 representation which satisfies the toolchain. The interpreted Go string literal "\xb2\xbd" represents the string using a sequence of ASCII characters, which can be expressed in UTF-8. When compiled, this sequence is parsed to produce a string in the compiled output containing the byte sequence b2 bd.
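We can verify this with a small sketch using the same literal: the escapes in the source produce exactly two bytes in the resulting string value, which we can inspect with the % x format verb:

seq := "\xb2\xbd"
fmt.Println(len(seq))    // 2: the escapes produce two bytes, not eight characters
fmt.Printf("% x\n", seq) // b2 bd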
Example
I'll provide a working example to demonstrate this more concretely. As we will be generating invalid source files, I cannot readily use the Go playground for this; I used the go toolchain on my machine. The toolchain is go version go1.11 darwin/amd64.
Consider the following simple Go program where values may be specified for the string myStr. For convenience, we replace the whitespace (the byte sequence 20 20 20 after the string my value:) with the desired bytes. This allows us to easily find the compiled output when disassembling binaries later.
package main
import "fmt"
func main() {
	myStr := "my value:   "
	fmt.Println(myStr)
}
I used a hex editor to insert the invalid UTF-8 byte sequence b2 bd into the final bytes of the whitespace in myStr. Representation below; other lines elided for brevity:
00000030: 203a 3d20 22b2 bd20 220a 0966 6d74 2e50 := ".. "..fmt.P
Attempting to build this file results in the error:
$ go build ./test.go
# command-line-arguments
./test.go:6:12: invalid UTF-8 encoding
If we open the file in an editor, it interprets the bytes in its own way (likely to be platform-specific) and, on save, emits valid UTF-8. On this OS X-based system, using vim, the sequence is interpreted as Unicode U+00B2 U+00BD, or ²½. This satisfies the compiler, but the source file (and hence the compiled binary) now contains a different byte sequence than the one originally intended. The object dump below exhibits the sequence c2 b2 c2 bd, as does a hex dump of the source file. (See the final two bytes of line 1 and the first two bytes of line 2.)
$ gobjdump -s -j __TEXT.__rodata test | grep -A1 "my value"
10c4c50 63746f72 796d7920 76616c75 653ac2b2 ctorymy value:..
10c4c60 c2bd206e 696c2065 6c656d20 74797065 .. nil elem type
To recover the original byte sequence, we can change the string definition thus:
myStr := "my value: \xb2\xbd"
Dumping the source file produced by the editor yields a valid UTF-8 sequence 5c 78 62 32 5c 78 62 64, the UTF-8 encoding of the ASCII characters \xb2\xbd:
00000030: 203a 3d20 226d 7920 7661 6c75 653a 205c := "my value: \
00000040: 7862 325c 7862 6422 0a09 666d 742e 5072 xb2\xbd"..fmt.Pr
Yet the binary produced by building this source file demonstrates the compiler has transformed this string literal to contain the desired b2 bd sequence (final two bytes of the fourth column):
10c48b0 6d792076 616c7565 3a20b2bd 6e6f7420 my value: ..not
In general, there is no difference between a string value and a string literal: a string literal is just one of the ways to initialize a new string variable.
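For example (a minimal sketch; the variable names are mine), all of the following produce ordinary string values, and only the first uses a literal:

a := "hello"                    // from a string literal
b := string([]byte{0x68, 0x69}) // from a byte slice: "hi"
c := a + b                      // from concatenation at runtime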
Related
I'm making my way through the Building Git book, but attempting to build my implementation in JavaScript. I'm stumped at the part about reading data in a format that apparently only Ruby uses. Here is the excerpt from the book about this:
Note that we set the string’s encoding to ASCII_8BIT, which is Ruby’s way of saying that the string represents arbitrary binary data rather than text per se. Although the blobs we’ll be storing will all be ASCII-compatible source code, Git does allow blobs to be any kind of file, and certainly other kinds of objects — especially trees — will contain non-textual data. Setting the encoding this way means we don’t get surprising errors when the string is concatenated with others; Ruby sees that it’s binary data and just concatenates the bytes and won’t try to perform any character conversions.
Is there a way to emulate this encoding in JS?
or
Is there an alternative encoding I can use that JS and Ruby share that won't break anything?
Additionally, I've tried using Buffer.from(< text input >, 'binary') but it doesn't result in the same number of bytes that Ruby's ASCII-8BIT returns, because in Node.js binary maps to ISO-8859-1.
Node certainly supports binary data; that's kind of what Buffer is for. However, it is crucial to know what you are converting into what. For example, the emoji "☺️" is encoded as six bytes in UTF-8:
// UTF-16 (JS) string to UTF-8 representation
Buffer.from('☺️', 'utf-8')
// => <Buffer e2 98 ba ef b8 8f>
If you happen to have a string that is not a native JS string (i.e. its encoding is different), you can use the encoding parameter to make Buffer interpret each character in a different manner (though only a few such conversions are supported). For example, if we have a string of six characters whose code points correspond to the six byte values above, it is not a smiley face to JavaScript, but Buffer.from can help us repackage it:
Buffer.from('\u00e2\u0098\u00ba\u00ef\u00b8\u008f', 'binary')
// => <Buffer e2 98 ba ef b8 8f>
JavaScript itself has only one encoding for its strings; thus, the parameter 'binary' is not really a binary encoding, but a mode of operation for Buffer.from, telling it to treat the string as if each character were one byte (whereas, since JavaScript internally uses UCS-2, each character is always represented by two bytes). If you use it on something that is not a string of characters in the range U+0000 to U+00FF, it will not do the correct thing, because there is no such thing (GIGO principle). What it will actually do is take the lower byte of each character, which is probably not what you want:
Buffer.from('STUFF', 'binary') // 8BIT range: U+0000 to U+00FF
// => <Buffer 53 54 55 46 46> ("STUFF")
Buffer.from('ＳＴＵＦＦ', 'binary') // U+FF33 U+FF34 U+FF35 U+FF26 U+FF26
// => <Buffer 33 34 35 26 26> (garbage)
So, Node's Buffer structure exactly corresponds to Ruby's ASCII-8BIT "encoding" (binary is an encoding like "bald" is a hair style — it simply means no interpretation is attached to bytes; e.g. in ASCII, 65 means "A"; but in binary "encoding", 65 is just 65). Buffer.from with 'binary' lets you convert weird strings where one character corresponds to one byte into a Buffer. It is not the normal way of handling binary data; its function is to un-mess-up binary data when it has been read incorrectly into a string.
I assume you are reading a file as a string, then trying to convert it to a Buffer — but your string is not actually in what Node considers to be the "binary" form (a sequence of characters in the range U+0000 to U+00FF; thus "in Node.js binary maps to ISO-8859-1" is not really true, because ISO-8859-1 is a sequence of characters in the range 0x00 to 0xFF — a single-byte encoding!).
Ideally, to have a binary representation of file contents, you would want to read the file as a Buffer in the first place (by using fs.readFile without an encoding), without ever touching a string.
(If my guess here is incorrect, please specify what the contents of your < text input > is, and how you obtain it, and in which case "it doesn't result in the same amount of bytes".)
EDIT: I seem to like typing Array.from too much. It's Buffer.from, of course.
In Go a byte is the same as a uint8, which means a byte can store a value between 0 and 255.
A string can also be written as a slice of bytes. I've read that there are almost no differences between a string and a slice of bytes (except the mutability).
So how is it possible in Go to write something like "世界" when this is clearly not in the first 255 characters in the UTF-8 encoding table? How does Go handle characters that are not within the first 255 rows in the UTF8 encoding table?
Go uses the UTF-8 encoding for source files, string literals, []rune-to-string conversions, string-to-[]rune conversions, integer-to-string conversions, and when ranging over a string.
UTF-8 uses one to four bytes to encode a character.
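For example (a minimal sketch using the standard unicode/utf8 package), "世界" is two runes but six bytes, and ranging over the string decodes the UTF-8 bytes back into runes:

s := "世界"
fmt.Println(len(s))                    // 6: length in bytes
fmt.Println(utf8.RuneCountInString(s)) // 2: number of runes
for i, r := range s {
	fmt.Printf("byte offset %d: %c (%U)\n", i, r, r)
}
// byte offset 0: 世 (U+4E16)
// byte offset 3: 界 (U+754C)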
I am using IMultilanguage2::ConvertStringFromUnicode to convert from UTF-16. For some languages (Japanese, Chinese, Korean), I am getting an escape sequence (e.g. 0x1B, 0x24, 0x29, 0x43 for codepage 50225 (ISO-2022 Korean)). WideCharToMultiByte exhibits the same behavior.
I am building a MIME message, so the encoding is specified in the header itself and the escape prefix is displayed as-is.
Is there a way to convert without the prefix?
Thank you!
I don't really see a problem here. That is a valid byte sequence in ISO 2022:
Escape sequences to designate character sets take the form ESC I [I...] F, where there are one or more intermediate I bytes from the range 0x20–0x2F, and a final F byte from the range 0x40–0x7F. (The range 0x30–0x3F is reserved for private-use F bytes.) The I bytes identify the type of character set and the working set it is to be designated to, while the F byte identifies the character set itself.
...
Code: ESC $ ) F
Hex: 1B 24 29 F
Abbr: G1DM4
Name: G1-designate multibyte 94-set F
Effect: selects a 94n-character set to be used for G1.
As F is 0x43 (C), this byte sequence tells a decoder to switch to ISO-2022-KR:
Character encodings using ISO/IEC 2022 mechanism include:
...
ISO-2022-KR. An encoding for Korean.
ESC $ ) C to switch to KS X 1001-1992, previously named KS C 5601-1987 (2 bytes per character) [designated to G1]
In this case, you have to specify iso-2022-kr as the charset in a MIME Content-Type or RFC 2047-encoded header. But an ISO 2022 decoder still has to be able to switch charsets dynamically while decoding, so it is valid for the data to include an initial switch sequence to the Korean charset.
Is there a way to convert without the prefix?
Not with IMultiLanguage2 and WideCharToMultiByte(), no. They have no clue how you are going to use their output, so it makes sense that they include an initial switch sequence to the Korean charset - that way, a decoder without access to charset info from MIME (or another source) would still know what charset to use initially.
When you put the data into a MIME message, you will have to manually strip off the charset switch sequence when you set the MIME charset to iso-2022-kr. If you do not want to strip it manually, you will have to find (or write) a Unicode encoder that does not output that initial switch sequence.
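The stripping itself is only a few lines in any language. For illustration only, a sketch in Go (the function name is mine; the same check is trivial to write in Delphi or C++) that removes the leading designation sequence ESC $ ) C before the bytes are placed in a MIME part declared as iso-2022-kr:

// stripKRDesignation removes the initial ESC $ ) C designation sequence, if present.
// The remaining bytes are what belong in the body of an iso-2022-kr MIME part.
func stripKRDesignation(encoded []byte) []byte {
	return bytes.TrimPrefix(encoded, []byte{0x1b, 0x24, 0x29, 0x43})
}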
That was a red herring - it turned out the escape sequence is necessary. The problem was with my code, which was trimming the names and addresses using the Delphi Trim() function; Trim() removes all characters less than or equal to space (0x20), and that includes the escape character (0x1B).
Switching to my own trimming function that removes only spaces fixed the problem.
I'm trying to store the literal ascii value of hex FFFF which in decimal is 65535 and is ÿ when written out in VB6. I want to store this value in a buffer which is defined by:
Type HBuff
txt As String * 16
End Type
Global WriteBuffer As HBuff
in the legacy code I inherited.
I want to do something like WriteBuffer.txt = Asc(hex$(-1)) but VB6 stores it as 70
I need to store this value, ÿ in the string, even though it is not printable.
how can I do this?
I'm not sure what your problem is.
If you want to store character number 255 in a string, then do so:
WriteBuffer.txt = Chr$(255)
Be warned though that the result depends on the current locale.
ChrW$(255) does not, but it may not yield the character you want.
For the reference, the code you used returns ASCII code of the first character of the textual hex representation of the number -1. Hex(-1) is FFFF when -1 is typed as Integer (which it is by default), so you get the ASCII code of letter F, which is 70.
I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one.
The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values meaning something else in UTF8.
I think what I need is a function that takes an ASCII character index and returns the appropriate UTF8 character.
Update: If I happen to know the equivalent Oracle character set for the incoming text can I then convert the raw bytes to UTF8? The source text will always be single byte.
There's no such thing as "extended ASCII." Or, to be more precise, so many encodings are supersets of ASCII, sharing the same first 128 code points, that the term is too vague to be meaningful. You need to find out if the strings in this file are encoded using UTF-8, ISO-8859-whatever, MacRoman, etc.
The answer to the second part of your question is the same. UTF-8 is, by design, a superset of ASCII. Any ASCII character (i.e. 0 through 127) is also a UTF-8 character. To translate some non-ASCII character (i.e. >= 128) into UTF-8, you first need to find out what encoding it's in.
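Once you do know the source encoding, the conversion itself is mechanical. As an illustration outside Oracle, here is a sketch in Go using the golang.org/x/text/encoding/charmap package, assuming the source bytes are ISO-8859-1 (the byte 0xE0 is the 'à' from the question):

// Decoding with the correct source charset turns the single Latin-1 byte into UTF-8.
out, err := charmap.ISO8859_1.NewDecoder().Bytes([]byte{0xE0})
if err != nil {
	// handle the error
}
fmt.Printf("% x %s\n", out, out) // c3 a0 à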