Trouble converting image to []uint8 - go

How can I extract raw []uint8 from an image?

The size of an 8-bit integer is, by definition, 8 bits (or 1 byte).
(Editing to remove erroneous information, which I'll repost for posterity below.)
The string representation being output is not one character per number; it's several. For example, the int you list first, 65, is being represented by three characters: a 6, a 5, and a space. That would increase the expected size roughly threefold, from 300k to 900k.
As for the rest, I would think that (as icza said in a comment) image compression may be the culprit.
(The irrelevant information I'd initially posted as part of my answer was:
Go has two character types, byte and rune. A byte being used to store a character is the same size as a byte being used to store an integer - both are 8 bits. But a character being stored as a rune is going to be 32 bits (see https://www.callicoder.com/golang-basic-types-operators-type-conversion/). So, if the characters being written out are runes, that would explain a four-fold increase in size (because 32 / 8 = 4)...

Related

How does UTF-8 represent characters?

I'm reading UTF-8 Encoding, and I don't understand the following sentence.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
If I'm not mistaken, this means UTF-8 requires two bytes to represent 2048 characters. In other words, we need to choose 2048 candidates from 2 to the power of 16 to represent each character.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes.
Choosing 2048 candidates out of 65,536 would be no problem. However, UTF-8 explicitly restricts the range of each byte.
According to the following statement, the number of combinations is 30 (0xDF - 0xC2 + 0x01) for the first byte and 64 (0xBF - 0x80 + 0x01) for the second byte.
The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
How do 1920 numbers (64 times 30) accommodate 2048 combinations?
As you already know, 2047 (0x07FF) contains the raw bits
00000111 11111111
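If you look at the bit distribution chart for UTF-8:
U+0000 to U+007F: 0xxxxxxx
U+0080 to U+07FF: 110xxxxx 10xxxxxx
U+0800 to U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx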
You will see that 0x07FF falls in the second line, so it is encoded as 2 bytes using this bit pattern:
110xxxxx 10xxxxxx
Substitute the raw bits into the xs and you get this result:
11011111 10111111 (0xDF 0xBF)
Which is exactly as the description you quoted says:
The first byte will have the two high bits set and the third bit clear (11011111). The second byte will have the top bit set and the second bit clear (10111111).
Think of it as a container, where the encoding reserves a few bits for its own synchronization, and you get to use the remaining bits.
So for the range in question, the encoding "template" is
110 abcde 10 fghijk
(where I have left a single space to mark the boundary between the template and the value from the code point we want to encode, and two spaces between the actual bytes)
and you get to use the 11 bits abcdefghijk for the value you actually want to transmit.
So for the code point U+07EB you get
0x07 00000111
0xEB 11101011
where the top five zero bits are masked out (remember, we only get 11 bits, because the maximum value that the encoding can accommodate in two bytes is 0x07FF; a larger value uses a different template, which is three bytes) and so
0x07 = _____ 111 (template: _____ abc)
0xEB = 11 101011 (template: de fghijk)
abc de = 111 11 (where the first three come from 0x07, and the next two from 0xEB)
fghijk = 101011 (the remaining bits from 0xEB)
yielding the value
110 11111 10 101011
aka 0xDF 0xAB.
Wikipedia's article on UTF-8 contains more examples with nicely colored numbers to see what comes from where.
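The template substitution described above can be sketched in a few lines of Go. The function name encodeTwoByte is an assumption for illustration; the standard library's utf8.EncodeRune is used only as a cross-check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeTwoByte (hypothetical helper) fills the template
// 110abcde 10fghijk with the low 11 bits of r.
// It assumes 0x80 <= r <= 0x7FF.
func encodeTwoByte(r rune) [2]byte {
	return [2]byte{
		0xC0 | byte(r>>6),   // 110 prefix + top five bits (abcde)
		0x80 | byte(r&0x3F), // 10 prefix + low six bits (fghijk)
	}
}

func main() {
	for _, r := range []rune{0x80, 0x7EB, 0x7FF} {
		got := encodeTwoByte(r)
		ref := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(ref, r)
		fmt.Printf("U+%04X -> % X (standard library: % X)\n", r, got[:], ref[:n])
	}
}
```

For U+07EB this reproduces the 0xDF 0xAB result worked out by hand above.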
The range 0x00-0x7F, which can be represented in a single byte, contains 128 code points; the two-byte range thus needs to accommodate 1920 = 2048-128 code points.
The raw encoding would allow values in the range 0xC0-0xDF in the first byte, but the values 0xC0 and 0xC1 are never needed because those would represent code points which can be represented in a single byte, and thus are invalid as per the encoding spec. In other words, the 0x02 in 0xC2 comes from the fact that at least one bit in the high four bits out of the 11 that this segment of the encoding can represent (one of abcd) needs to be a one bit in order for the value to require two bytes.
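The count can be checked mechanically. The sketch below (the function name twoByteStats is an assumption) walks every code point up to 0x07FF, keeps those that the standard library encodes in exactly two bytes, and records the range of lead bytes it sees.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// twoByteStats (hypothetical helper) returns how many code points in
// 0..0x7FF need exactly two bytes, plus the smallest and largest lead
// byte among those encodings.
func twoByteStats() (count int, minLead, maxLead byte) {
	minLead = 0xFF
	buf := make([]byte, utf8.UTFMax)
	for r := rune(0); r <= 0x7FF; r++ {
		if utf8.EncodeRune(buf, r) != 2 {
			continue // 0x00-0x7F fit in one byte
		}
		count++
		if buf[0] < minLead {
			minLead = buf[0]
		}
		if buf[0] > maxLead {
			maxLead = buf[0]
		}
	}
	return count, minLead, maxLead
}

func main() {
	count, lo, hi := twoByteStats()
	fmt.Printf("two-byte code points: %d, lead bytes 0x%X-0x%X\n", count, lo, hi)
}
```

The 2048 values up to 0x07FF split into 128 one-byte code points and 1920 two-byte code points, which matches 30 lead bytes times 64 continuation bytes.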

LZ4 compression algorithm explanation

Description from Wikipedia:
The LZ4 algorithm represents the data as a series of sequences. Each sequence begins with a one byte token that is broken into two 4 bit fields. The first field represents the number of literal bytes that are to be copied to the output. The second field represents the number of bytes to copy from the already decoded output buffer (with 0 representing the minimum match length of 4 bytes). A value of 15 in either of the bitfields indicates that the length is larger and there is an extra byte of data that is to be added to the length. A value of 255 in these extra bytes indicates that yet another byte to be added. Hence arbitrary lengths are represented by a series of extra bytes containing the value 255. After the string of literals comes the token and any extra bytes needed to indicate string length. This is followed by an offset that indicates how far back in the output buffer to begin copying. The extra bytes (if any) of the match-length come at the end of the sequence
I didn't understand that at all! Does anyone have an easy-to-understand example?
For example, in the above explanation what is a literal byte and what is a match? How can we have a decoded output buffer when we're just beginning to compress? Length of what?
The explanation here was also impenetrable for me.
A simple example would be nice unless you have a better way of explaining it.
First, read about LZ77, the core approach being used. The text is a description of a particular way to code a series of literals and string matches in the preceding data.
A match is when the next bytes in the uncompressed data occur in the previously decompressed data. So instead of sending those bytes directly, a length and an offset are sent. Then you go offset bytes backwards and copy length bytes to the output.
Yes, you can't have a match at the beginning of the stream. You have to start with literals. (Unless there is a preset dictionary, which is another topic.)
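Since the question asked for a simple example, here is a minimal sketch of a decoder for one block of such sequences, written from the description above. It assumes the offset is stored as two bytes, little-endian, and that the final sequence holds literals only; the names decodeBlock, readLength and errCorrupt are hypothetical, the input is hand-built, and none of this has been checked against a real LZ4 library.

```go
package main

import (
	"errors"
	"fmt"
)

var errCorrupt = errors.New("lz4 sketch: corrupt input")

// readLength extends a 4-bit length field. A field value of 15 means
// "add the following byte", and that repeats while each extra byte is 255.
func readLength(src []byte, pos, n int) (length, next int, err error) {
	if n < 15 {
		return n, pos, nil
	}
	for {
		if pos >= len(src) {
			return 0, 0, errCorrupt
		}
		b := src[pos]
		pos++
		n += int(b)
		if b != 255 {
			return n, pos, nil
		}
	}
}

// decodeBlock decodes a series of sequences:
// token, [extra literal length], literals, offset, [extra match length].
func decodeBlock(src []byte) ([]byte, error) {
	var out []byte
	pos := 0
	for pos < len(src) {
		token := src[pos]
		pos++

		// High nibble: number of literal bytes to copy straight to the output.
		litLen, p, err := readLength(src, pos, int(token>>4))
		if err != nil {
			return nil, err
		}
		pos = p
		if pos+litLen > len(src) {
			return nil, errCorrupt
		}
		out = append(out, src[pos:pos+litLen]...)
		pos += litLen
		if pos == len(src) {
			break // last sequence: literals only, no match
		}

		// Offset: how far back in the already decoded output the match starts.
		if pos+2 > len(src) {
			return nil, errCorrupt
		}
		offset := int(src[pos]) | int(src[pos+1])<<8
		pos += 2

		// Low nibble: match length minus the minimum match length of 4.
		matchLen, p, err := readLength(src, pos, int(token&0x0F))
		if err != nil {
			return nil, err
		}
		pos = p
		matchLen += 4
		if offset == 0 || offset > len(out) {
			return nil, errCorrupt
		}
		// Copy byte by byte so a match may overlap the bytes it produces.
		start := len(out) - offset
		for i := 0; i < matchLen; i++ {
			out = append(out, out[start+i])
		}
	}
	return out, nil
}

func main() {
	// Token 0x31: 3 literals ("abc"), match length 1+4 = 5 at offset 3.
	// Token 0x10: 1 literal ("d"), end of block.
	block := []byte{0x31, 'a', 'b', 'c', 0x03, 0x00, 0x10, 'd'}
	out, err := decodeBlock(block)
	fmt.Printf("%q %v\n", out, err)
}
```

In the hand-built block, "abc" are the literals, and the match says "go back 3 bytes and copy 5", which repeats the output as it is being produced and yields "abcabcab" before the final literal "d".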

Efficient encoding of basic numeric datatypes in protobuffers

The Protobuffers documentation says:
"For historical reasons, repeated fields of basic numeric types aren't encoded as
efficiently as they could be. New code should use the special option [packed=true] to get
a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];"
Can someone clearly explain how the option "packed=true" improves the efficiency of encoding basic numeric datatypes?
Basically, under the original encoding the field header (which is composed of the wire type combined with the field-number, bit-shifted and or'd) occurs for every element. Because the header is varint encoded, it is at least one byte per element, but possibly more. So 10 4-byte floats would be at least 50 bytes and quite possibly 90 bytes if the header takes 5 bytes (large field numbers take more space than small field numbers).
With the packed encoding, the field header occurs only once, followed by a varint that indicates the number of bytes to follow. So for 10 floats, the payload length is 40, which is varint-encoded in a single byte for the length prefix. At deserialization time it simply consumes that-many bytes, reading elements as it does so. Therefore for the same data (50 to 90 bytes previously) we are now using 42 to 46 bytes (again, for the range of field numbers that take 1 to 5 bytes each).
These two layouts are very different on the wire, and code expecting one cannot usually decode the other. As such, it needs to be explicitly enabled to prevent breaking existing messages.
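The arithmetic above can be reproduced with a short Go sketch. It only computes sizes and does not use a protobuf library; the function names are assumptions for illustration, and the wire type values (5 for 32-bit fixed, 2 for length-delimited) and the maximum field number of 2^29 - 1 are taken from the protobuf encoding rules as I understand them.

```go
package main

import "fmt"

const (
	wireLengthDelimited = 2 // used by packed repeated fields
	wireFixed32         = 5 // used by each individual float
)

// varintLen returns how many bytes a base-128 varint needs for x.
func varintLen(x uint64) int {
	n := 1
	for x >= 0x80 {
		x >>= 7
		n++
	}
	return n
}

// tagLen is the size of a field header: (fieldNumber << 3) | wireType,
// varint encoded.
func tagLen(fieldNumber, wireType uint64) int {
	return varintLen(fieldNumber<<3 | wireType)
}

// unpackedFloatSize: one header plus 4 payload bytes per element.
func unpackedFloatSize(fieldNumber uint64, n int) int {
	return n * (tagLen(fieldNumber, wireFixed32) + 4)
}

// packedFloatSize: one header, one varint byte count, then the payload.
func packedFloatSize(fieldNumber uint64, n int) int {
	payload := 4 * n
	return tagLen(fieldNumber, wireLengthDelimited) + varintLen(uint64(payload)) + payload
}

func main() {
	for _, field := range []uint64{1, 1<<29 - 1} {
		fmt.Printf("field %d: unpacked %d bytes, packed %d bytes\n",
			field, unpackedFloatSize(field, 10), packedFloatSize(field, 10))
	}
}
```

For 10 floats this gives 50 versus 42 bytes for a small field number and 90 versus 46 bytes for the largest one, matching the figures in the answer.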

Is there one-byte type in protobuf?

I can't find whether it is possible to have a char / byte type in proto.
I can see various types here:
https://developers.google.com/protocol-buffers/docs/proto
https://developers.google.com/protocol-buffers/docs/encoding
but I can't find a byte type or even an int16 type there.
No, there is no fixed 1-byte type. Fixed length has 4 and 8 byte variants only. Most other numeric values are encoded as "varint"s, which is variable length depending on magnitude (and sign, but "zigzag" comes into play there). So you can store bytes with value 0-127 in one byte, and 128-255 in two bytes. 16-bit values will take between 1 and 3 bytes depending on magnitude (and sign /zigzag etc).
For multiples, there is "bytes" for the 8-bit version, and "packed" for the rest; this avoids the cost of a field-header per value.
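The size claims above can be checked with a small Go sketch. It relies on the assumption that Go's binary.PutUvarint produces the same base-128 varint layout that protobuf uses; uvarintSize and zigzag32 are hypothetical helper names, and zigzag32 follows the usual (n << 1) ^ (n >> 31) definition.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uvarintSize returns how many bytes x occupies as a varint.
func uvarintSize(x uint64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, x)
}

// zigzag32 maps signed values to unsigned ones so that values of small
// magnitude, positive or negative, stay small: 0, -1, 1, -2 -> 0, 1, 2, 3.
func zigzag32(n int32) uint32 {
	return uint32(n<<1) ^ uint32(n>>31)
}

func main() {
	for _, v := range []uint64{0, 127, 128, 255, 16383, 16384, 65535} {
		fmt.Printf("%5d -> %d byte(s)\n", v, uvarintSize(v))
	}
	for _, v := range []int32{-1, 1, -64, 64, -32768} {
		z := zigzag32(v)
		fmt.Printf("%6d -> zigzag %5d -> %d byte(s)\n", v, z, uvarintSize(uint64(z)))
	}
}
```

Each varint byte carries 7 payload bits, which is why the one-byte range ends at 127 and a full 16-bit value needs 3 bytes.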

How to represent an n-byte array in fewer than 2*n characters

Given that an n-byte array can be represented as a 2*n character string using hex, is there a way to represent the n-byte array in fewer than 2*n characters?
For example, an integer (int32) can typically be considered as a 4-byte array of data.
The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:
Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into 4 6-bit base-64 digits, where the "digits" are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that there are some variant forms of base-64 that use different characters for the last two "digits".
Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.
(Why 85? Because 84^5 < 2^32 < 85^5.)
Yes, using binary (in which case it takes n bytes, not surprisingly), or using any base higher than 16; a common one is base 64.
It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use the product of primes representation (which is "2^63").
How about base-64?
It all depends on what characters you're willing to use in your encoding (i.e. representation).
Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.
Using 65536 of the roughly 90000 defined Unicode characters, you can represent a binary string of N bytes in N/2 characters.
Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.
If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

Resources