ability to read and write unicode data without adding any extra encoding - go

I have some EBCDIC mainframe data (non-DBCS) that I am getting from a third-party utility, which delivers the EBCDIC data to me as Unicode: x'00' = \u0000, etc. I run into a problem once the data reaches \u0080 and above. When I convert this data into a byte array, I get 2 bytes instead of 1. For example, when I receive \u0080, my byte array contains the 2 bytes 0xc2 0x80. I only want my byte array to have 0x80. Below is an example:
func allUnicode() (text string, bytes []byte, err error) {
text = "\u0080\u0081\u0082\u0083\u0084\u0085\u0086\u0087\u0088\u0089\u008a\u008b\u008c\u008d\u008e\u008f\u0090\u0091\u0092\u0093\u0094\u0095\u0096\u0097\u0098\u0099\u009a\u009b\u009c\u009d\u009e\u009f\u00a0\u00a1\u00a2\u00a3\u00a4\u00a5\u00a6\u00a7\u00a8\u00a9\u00aa\u00ab\u00ac\u00ad\u00ae\u00af\u00b0\u00b1\u00b2\u00b3\u00b4\u00b5\u00b6\u00b7\u00b8\u00b9\u00ba\u00bb\u00bc\u00bd\u00be\u00bf\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7\u00c8\u00c9\u00ca\u00cb\u00cc\u00cd\u00ce\u00cf\u00d0\u00d1\u00d2\u00d3\u00d4\u00d5\u00d6\u00d7\u00d8\u00d9\u00da\u00db\u00dc\u00dd\u00de\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f0\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f7\u00f8\u00f9\u00fa\u00fb\u00fc\u00fd\u00fe\u00ff"
bytes = []byte(text)
return
}
When I inspect the bytes array, I get 0xc2 0x80 ... My process has to parse this array by position, and those extra bytes throw everything off.
Any help would be appreciated.
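The extra bytes appear because a Go string literal such as "\u0080" is stored as UTF-8, and []byte(text) copies that UTF-8 encoding unchanged. A minimal sketch of one way to get one byte per code point, assuming every code point in the input is in the range U+0000 to U+00FF (runesToBytes is a hypothetical helper name, not a library function):

```go
package main

import "fmt"

// runesToBytes (hypothetical helper) maps each code point in s to a single
// byte with the same numeric value. It fails if a code point does not fit
// in one byte, which also covers invalid UTF-8 (decoded as U+FFFD).
func runesToBytes(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for i, r := range s { // range decodes the string rune by rune
		if r > 0xFF {
			return nil, fmt.Errorf("code point %U at byte offset %d does not fit in one byte", r, i)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	b, err := runesToBytes("\u0000\u0080\u00ff")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b) // 00 80 ff
}
```

With this approach the byte positions line up with the original EBCDIC record, since each code point contributes exactly one byte.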

Related

Base64 encoding doesn't fail with invalid characters [closed]

I am trying to ensure a string coming from an http request is valid for use in a base64 url param. I've been experimenting with base64.RawURLEncoding as I assumed encoding an invalid string would throw an err, or at least decoding the result of this would fail, however it quite happily encodes/decodes the string regardless of the input.
https://play.golang.org/p/3sHUfl2NSJK
I have created the above playground showing the issue I'm having (albeit an extreme example). Is there another way of ascertaining whether a string consists entirely of valid base64 characters?
To clarify, Base64 is an encoding scheme which allows you to take arbitrary binary data and safely encode it into ASCII characters which can later be decoded into the original binary string.
That means that the "Base64-encode" operation can take literally any input and produce valid, encoded data. However, the "Base64-decode" operation will fail if its input string contains characters outside the set of ASCII characters that the encoding uses (meaning that the given string was not produced by a valid Base64 encoder).
To test if a string contains a valid Base64 encoded sequence, you just need to call base64.Encoding.DecodeString(...) and test if the error is "nil".
For example (Go Playground):
func IsValidBase64(s string) bool {
_, err := base64.StdEncoding.DecodeString(s)
return err == nil
}
func main() {
ss := []string{"ABBA", "T0sh", "Foo=", "Bogus\x01"}
for _, s := range ss {
if IsValidBase64(s) {
fmt.Printf("OK: valid Base64 %q\n", s)
} else {
fmt.Printf("ERR: invalid Base64 %q\n", s)
}
}
// OK: valid Base64 "ABBA"
// OK: valid Base64 "T0sh"
// OK: valid Base64 "Foo="
// ERR: invalid Base64 "Bogus\x01"
}
base64 encoding works by interpreting an arbitrary bit stream as a string of 6-bit integers, which are then mapped one-by-one to the chosen base64 alphabet.
Your example string starts with these 8-bit bytes:
11000010 10111010 11000010 10101010 11100010 10000000
Re-arrange them into 6-bit numbers:
110000 101011 101011 000010 101010 101110 001010 000000
And map them to a base64 alphabet (here URL encoding):
w r r C q u K A
Since every 6-bit number can be mapped to a character in the alphabet (there's exactly 64 of them), there are no invalid inputs to base64. This is precisely what base64 is used for: turn arbitrary input into printable ASCII characters.
Decoding, on the other hand, can and will fail if the input contains bytes outside of the base64 alphabet — they can't be mapped back to the 6-bit integer.
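Since the question is about a URL parameter, the same decode-and-check idea can be applied with the unpadded URL-safe alphabet. This is a small sketch, assuming the parameter is expected to be base64.RawURLEncoding output; isValidRawURLBase64 is a hypothetical helper name:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// isValidRawURLBase64 (hypothetical helper) reports whether s decodes
// cleanly as unpadded, URL-safe base64.
func isValidRawURLBase64(s string) bool {
	_, err := base64.RawURLEncoding.DecodeString(s)
	return err == nil
}

func main() {
	// "wrrCquKA" is the encoded form worked out above.
	// "a+b/" uses the standard alphabet, and "Zm9v=" carries padding,
	// so both should be rejected by the raw URL-safe decoder.
	for _, s := range []string{"wrrCquKA", "T0sh", "a+b/", "Zm9v="} {
		fmt.Printf("%q valid: %t\n", s, isValidRawURLBase64(s))
	}
}
```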

Get UTF-8 encoded string from byte[]

I'm looking to convert a slice of bytes []byte into an UTF-8 string.
I want to write a function like that :
func bytesToUTF8string(bytes []byte)(string){
// Take the slice of bytes and encode it to UTF-8 string
// return the UTF-8 string
}
What is the most efficient way to do this?
EDIT :
Specifically I want to convert the output of crypto.rsa.EncryptPKCS1v15 or the output of SignPKCS1v15 to an UTF-8 encoded string.
How can I do it ?
func bytesToUTF8string(bytes []byte) string {
return string(bytes)
}
It's such a common, simple operation that it's arguably not worth wrapping in a function. Unless, of course, you need to translate the data from a different source encoding; that is an entirely different issue, with which the golang.org/x/text/encoding package might help.
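One caveat for the RSA case mentioned in the edit: string(bytes) copies the bytes as they are and does not validate or transcode them, so ciphertext or signature bytes will generally not form valid UTF-8. The sketch below illustrates this with made-up bytes standing in for RSA output, and assumes that a text-safe representation such as base64 is what is actually needed (that is an assumption about the asker's goal; binaryToText is a hypothetical helper name):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// binaryToText (hypothetical helper) turns arbitrary bytes into an ASCII
// string using standard base64; the result is always valid UTF-8.
func binaryToText(b []byte) string {
	return base64.StdEncoding.EncodeToString(b)
}

func main() {
	// Made-up bytes standing in for ciphertext or a signature.
	binary := []byte{0xff, 0xfe, 0x80, 0x01}

	s := string(binary) // no validation: the same 4 bytes, now in a string
	fmt.Println(len(s), utf8.ValidString(s))

	encoded := binaryToText(binary)
	fmt.Println(encoded, utf8.ValidString(encoded))
}
```

The receiver would reverse the step with base64.StdEncoding.DecodeString to recover the original bytes.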

How to be definite about the number of whitespace fmt.Fscanf consumes?

I am trying to implement a PPM decoder in Go. PPM is an image format that consists of a plaintext header and then some binary image data. The header looks like this (from the spec):
Each PPM image consists of the following:
A "magic number" for identifying the file type. A ppm image's magic number is the two characters "P6".
Whitespace (blanks, TABs, CRs, LFs).
A width, formatted as ASCII characters in decimal.
Whitespace.
A height, again in ASCII decimal.
Whitespace.
The maximum color value (Maxval), again in ASCII decimal. Must be less than 65536 and more than zero.
A single whitespace character (usually a newline).
I try to decode this header with the fmt.Fscanf function. The following call to
fmt.Fscanf parses the header (not addressing the caveat explained below):
var magic string
var width, height, maxVal uint
fmt.Fscanf(input,"%2s %d %d %d",&magic,&width,&height,&maxVal)
The documentation of fmt states:
Note: Fscan etc. can read one character (rune) past the input they
return, which means that a loop calling a scan routine may skip some
of the input. This is usually a problem only when there is no space
between input values. If the reader provided to Fscan implements
ReadRune, that method will be used to read characters. If the reader
also implements UnreadRune, that method will be used to save the
character and successive calls will not lose data. To attach ReadRune
and UnreadRune methods to a reader without that capability, use
bufio.NewReader.
As the very next character after the final whitespace is already the beginning of the image data, I have to be certain about how much whitespace fmt.Fscanf consumed after reading MaxVal. My code must work on whatever reader was provided by the caller, and parts of it must not read past the end of the header. Wrapping the input in a buffered reader is therefore not an option; the buffered reader might read more from the input than I actually want to read.
Some testing suggests that parsing a dummy character at the end solves the issues:
var magic string
var width, height, maxVal uint
var dummy byte
fmt.Fscanf(input,"%2s %d %d %d%c",&magic,&width,&height,&maxVal,&dummy)
Is that guaranteed to work according to the specification?
No, I would not consider that safe. While it works now, the documentation states that the function reserves the right to read past the value by one character unless you have an UnreadRune() method.
By wrapping your reader in a bufio.Reader, you can ensure the reader has an UnreadRune() method. You will then need to read the final whitespace yourself.
buf := bufio.NewReader(input)
fmt.Fscanf(buf,"%2s %d %d %d",&magic,&width,&height,&maxVal)
buf.ReadRune() // remove next rune (the whitespace) from the buffer.
Edit:
As we discussed in the chat, you can assume the dummy char method works and then write a test so you know when it stops working. The test can be something like:
func TestFmtBehavior(t *testing.T) {
// use multireader to prevent r from implementing io.RuneScanner
r := io.MultiReader(bytes.NewReader([]byte("data ")))
n, err := fmt.Fscanf(r, "%s%c", new(string), new(byte))
if n != 2 || err != nil {
t.Error("failed scan", n, err)
}
// the dummy char read 1 extra char past "data".
// one byte should still remain
if n, err := r.Read(make([]byte, 5)); n != 1 {
t.Error("assertion failed", n, err)
}
}

Removing NUL characters from bytes

To teach myself Go I'm building a simple server that takes some input, does some processing, and sends output back to the client (that includes the original input).
The input can vary in length from around 5 - 13 characters + endlines and whatever other guff the client sends.
The input is read into a byte array and then converted to a string for some processing. Another string is appended to this string and the whole thing is converted back into a byte array to get sent back to the client.
The problem is that the input is padded with a bunch of NUL characters, and I'm not sure how to get rid of them.
So I could loop through the array and when I come to a nul character, note the length (n), create a new byte array of that length, and copy the first n characters over to the new byte array and use that. Is that the best way, or is there something to make this easier for me?
Some stripped down code:
data := make([]byte, 16)
c.Read(data)
s := strings.Replace(string(data[:]), "\n", "", -1)
s = strings.Replace(s, "\r", "", -1)
s += "some other string"
response := []byte(s)
c.Write(response)
c.Close()
Also if I'm doing anything else obviously stupid here it would be nice to know.
In package "bytes", func Trim(s []byte, cutset string) []byte is your friend:
Trim returns a subslice of s by slicing off all leading and trailing UTF-8-encoded Unicode code points contained in cutset.
// Remove any NULL characters from 'b'
b = bytes.Trim(b, "\x00")
Your approach sounds basically right. Some remarks:
When you have found the index of the first nul byte in data, you don't need to copy, just truncate the slice: data[:idx].
bytes.Index should be able to find that index for you.
There is also bytes.Replace so you don't need to convert to string.
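The remarks above can be sketched as follows; trimAtNUL is a hypothetical helper name, and the sample input is made up:

```go
package main

import (
	"bytes"
	"fmt"
)

// trimAtNUL (hypothetical helper) truncates data at the first NUL byte.
// It returns a subslice of data, so nothing is copied.
func trimAtNUL(data []byte) []byte {
	if i := bytes.IndexByte(data, 0); i >= 0 {
		return data[:i]
	}
	return data
}

func main() {
	// A 16-byte buffer holding a short message followed by zero padding.
	data := make([]byte, 16)
	copy(data, "hello\r\n")

	msg := trimAtNUL(data)
	// Strip line endings without converting to string.
	msg = bytes.ReplaceAll(msg, []byte("\r"), nil)
	msg = bytes.ReplaceAll(msg, []byte("\n"), nil)
	fmt.Printf("%q\n", msg) // "hello"
}
```

Unlike bytes.Trim, this cuts at the first NUL, so it also drops anything that follows an embedded NUL.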
The io.Reader documentation says:
Read reads up to len(p) bytes into p. It returns the number of bytes read (0 <= n <= len(p)) and any error encountered.
If the call to Read in the application does not read 16 bytes, then data will have trailing zero bytes. Use the number of bytes read to trim the zero bytes from the buffer.
data := make([]byte, 16)
n, err := c.Read(data)
if err != nil {
// handle error
}
data = data[:n]
There's another issue. There's no guarantee that Read slurps up all of the "message" sent by the peer. The application may need to call Read more than once to get the complete message.
You mention endlines in the question. If the message from the client is terminated by a newline, then use bufio.Scanner to read lines from the connection:
s := bufio.NewScanner(c)
if s.Scan() {
data = s.Bytes() // data is next line, not including end lines, etc.
}
if s.Err() != nil {
// handle error
}
You could utilize the return value of Read:
package main
import "strings"
func main() {
r, b := strings.NewReader("north east south west"), make([]byte, 16)
n, e := r.Read(b)
if e != nil {
panic(e)
}
b = b[:n]
println(string(b) == "north east south")
}
https://golang.org/pkg/io#Reader

Go string to ascii byte array

How can I encode my string as ASCII byte array?
If you're looking for a conversion, just do byteArray := []byte(myString)
The language spec details conversions between strings and certain slice types ([]byte for bytes, []rune for Unicode code points).
You may not need to do anything. If you only need to read bytes of a string, you can do that directly:
c := s[3]
cthom06's answer gives you a byte slice you can manipulate:
b := []byte(s)
b[3] = c
Then you can create a new string from the modified byte slice if you like:
s = string(b)
But you mentioned ASCII. If your string is ASCII to begin with, then you are done. If it contains something else, you have more to deal with and might want to post another question with more details about your data.
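If it matters that the string really is ASCII, one way to check before converting is sketched below; asciiBytes is a hypothetical helper name, and rejecting non-ASCII input with an error is just one possible policy:

```go
package main

import "fmt"

// asciiBytes (hypothetical helper) returns the bytes of s if every byte is
// in the ASCII range (0x00 to 0x7F), and an error otherwise.
func asciiBytes(s string) ([]byte, error) {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return nil, fmt.Errorf("non-ASCII byte 0x%02x at offset %d", s[i], i)
		}
	}
	return []byte(s), nil
}

func main() {
	b, err := asciiBytes("hello")
	fmt.Println(b, err)

	_, err = asciiBytes("héllo") // é is encoded as two non-ASCII bytes
	fmt.Println(err)
}
```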
