how to interpret a two-byte value?

I am trying to read a 2-byte value that is stored lower order byte first. So if the 2 bytes are 10 and 85 in that order (decimal) what is the number they represent? 8510? 851? Or something different?
These values represent the length of an encoded sequence and I need to know the length to properly handle the information it contains. Most only use the first byte and as a decimal number it accurately represents the total number of characters (or bytes) in the sequence... but some use both bytes and I don't understand how to interpret them.
If anyone can help me with this I would appreciate it.
Thanks

What you're referring to is the "endianness" of the two-byte value (also called a WORD). It's generally not helpful to discuss these in terms of decimal values, because no, if the bytes are 10 and 85 in decimal, the value is not any digit-wise combination of 10 and 85. Showing them as hexadecimal values is far clearer:
+----+----+
|0x0a|0x55|
+----+----+
To interpret this as Little Endian, the value would be 0x550a (21770 in decimal). And in Big Endian (highest order byte first), the value is 0x0a55 (2645 decimal).
It's very important, for this reason, to know the endianness of your system and be able to properly handle the two. It is also noteworthy that "Network Byte Order" is Big Endian.
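For a concrete check, here is a minimal Python sketch (using the standard struct module) that decodes the two bytes from the table above under each byte order:

import struct

data = bytes([0x0A, 0x55])  # the two bytes from the table above

little = struct.unpack('<H', data)[0]  # '<H' = little-endian unsigned 16-bit
big = struct.unpack('>H', data)[0]     # '>H' = big-endian unsigned 16-bit

print(little)  # 21770 (0x550A)
print(big)     # 2645  (0x0A55)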

Related

When to use fixed value protobuf type? Or under what scenarios?

I want to transfer a serialized protobuf message over TCP and I've tried to use the first field to indicate the total length of the serialized message.
I know that the int32 will change the length after encoding. So, maybe a fixed32 is a good choice.
But at the end of the Encoding chapter, I found that I can't depend on it even if I use a fixed32 with field_num #1, because the Field Order section says that the order may change.
My question is when do I use fixed value types? Are there any example scenarios?
"My question is when do I use fixed value types?"
When it comes to serializing values, there's always a tradeoff. If we look at the Protobuf-documentation, we see we have a few options when it comes to 32-bit integers:
int32: Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.
uint32: Uses variable-length encoding.
sint32: Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.
fixed32: Always four bytes. More efficient than uint32 if values are often greater than 2^28.
sfixed32: Always four bytes.
int32 is a variable-length data-type. Any information that is not specified in the type itself needs to be expressed somehow. To deserialize a variable-length number, we need to know what its length is; that is contained in the serialized message as well, which requires additional storage space. The same goes for an optional negative sign. The resulting message may be smaller because of this, but may be larger as well.
Say we have a lot of integers between 0 and 255 to encode. It would be cheaper to send this information as two bytes (one byte with the actual value, and one byte to indicate that we just have one byte) than to send a full 32-bit (4 bytes) integer [fictional values, actual implementation may differ]. On the other hand, if we want to serialize a large value that only just fits in 4 bytes, the result may be larger (4 bytes and an additional byte to indicate that the value is 4 bytes; a total of 5 bytes). In this case it will be more efficient to use a fixed32: we simply know a fixed32 is 4 bytes, so we don't need to serialize the fact that it is a 4-byte number.
And if we look at fixed32 it actually mentions that the tradeoff point is around 2^28 (for unsigned integers).
So some types are good [as in, more efficient in terms of storage space] for large values, some for small values, some for positive/negative values. It all depends on what the actual values represent.
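As a rough sketch of the varint idea in Python (this mirrors protobuf's base-128 varint scheme, where each byte carries 7 bits of the value and the high bit flags a continuation; it is an illustration, not the protobuf library itself):

def varint_encode(n):
    # Base-128 varint: 7 payload bits per byte, high bit = "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

print(len(varint_encode(200)))    # 2 bytes: varint beats a 4-byte fixed32
print(len(varint_encode(2**28)))  # 5 bytes: fixed32 (always 4 bytes) wins

This is exactly the 2^28 tradeoff point the documentation mentions: any value needing more than 28 bits costs a fifth varint byte.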
"Are there any example scenarios?"
32-bit hashes (e.g. CRC-32), IPv4 addresses/masks. Predictable message sizes could also be relevant.

bits and bytes and what form they take

I'm still confused about bits and bytes although I've been searching the internet. Is one ASCII character = 1 byte = 8 bits? So 8 bits have 256 unique patterns, which covers all of the ASCII codes; in what form is it stored in our computer?
And if I typed "Hello" does that mean this consists of 5 bytes?
Yes to everything you wrote. "Bit" is a binary digit: a 0 or a 1. Historically there existed bytes of smaller sizes; now "byte" only ever means "8 bits of information", or a number between 0 and 255.
No. ASCII is a character set with 128 codepoints stored as the values 0-127. Modern computers predominantly address 8-bit memory and disk locations so a 7-bit ASCII value takes up 8 bits.
There is no text but encoded text. An encoding maps a member of a character set to one or more bytes. Unless you absolutely know you are using ASCII, you probably aren't. There are quite a few character sets with encodings that cover all 256 byte values and use any combination of byte values to encode a string.
There are several character sets that are similar but have slightly fewer than 256 characters, and others that use more than one byte to encode a codepoint and don't use every combination of byte values.
Just so you know, Unicode is the predominant character set except in very specialized situations. It has several encodings. UTF-8 is often used for storage and streams. UTF-16 is often used in memory, particularly in Java, .NET, JavaScript, XML, …. When text is communicated between systems, there has to be an agreement, specification, standard, or indication about which character set and encoding it uses so a sequence of bytes can be interpreted as characters.
To add to the confusion, programming languages have data types called char, Character, etc. You have to look at the specific language's reference manual to see what they mean. For example, in C, char is simply an integer type whose size is the size of the character encoding used by that C implementation. (C also calls this a "byte" and it is not necessarily 8 bits. In all other contexts, people mean 8 bits when they say "byte". If they want to be exceedingly unambiguous they might say "octet".)
"Hello" is five characters. In a specific character set, it is five codepoints. In a specific encoding for that character set, it could be 5, 10 or 20, or ??? bytes.
Also, in the source code of a specific language, a literal string like that might be "null-terminated". This means that you could say it is 6 "characters". Other languages might store a string as a counted sequence of code units. Again, you have to look at the language reference to know the underlying data structure of strings. Or, if the language and the libraries used with it are sufficiently high-level, you might never need to know such internals.
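To make the byte counts concrete, here is a quick Python illustration (the UTF-16/UTF-32 sizes use the little-endian codecs so no byte-order mark is included):

s = "Hello"

print(len(s.encode("ascii")))      # 5 bytes: one byte per codepoint
print(len(s.encode("utf-8")))      # 5 bytes: ASCII range is 1 byte each in UTF-8
print(len(s.encode("utf-16-le")))  # 10 bytes: 2 bytes per code unit
print(len(s.encode("utf-32-le")))  # 20 bytes: 4 bytes per codepoint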

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers and get them back out. I was thinking about having the sequence 0-9 as a reference and letting one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
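If you want to reproduce this, here is a small Python sketch (the exact output depends on how the integer is serialized to bytes, so it won't match the string above byte-for-byte):

import base64

n = 10**200  # any very large integer

raw = n.to_bytes((n.bit_length() + 7) // 8, "big")  # integer -> bytes
b64 = base64.b64encode(raw).decode("ascii")         # bytes -> Base64 text

# Reverse the process to recover the original number.
decoded = int.from_bytes(base64.b64decode(b64), "big")
assert decoded == n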
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number 10 bits long, and you want to represent that number with 5 bits: a 50% compression scheme. First, we work out how many possible numbers you can represent with 10 bits, which is:
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 bits?
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle, and in this example our value for n is greater than our value for m, hence we would need to map each compressed value to more than one normal value, which makes things incredibly complex. (Thank you Oli for reminding me.)
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
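A Python sketch of that arithmetic, packing twenty 27-bit values into one big integer and then into 68 bytes (the names here are illustrative):

import random

BITS = 27  # 2^27 = 134217728 > 10^8, so 27 bits per number suffice
numbers = [random.randrange(10**8) for _ in range(20)]

# Pack: shift each value into one big integer, 27 bits apart.
packed = 0
for n in numbers:
    packed = (packed << BITS) | n

data = packed.to_bytes((20 * BITS + 7) // 8, "big")  # 540 bits -> 68 bytes
print(len(data))  # 68

# Unpack: read the values back out, last one first, then restore the order.
value = int.from_bytes(data, "big")
unpacked = []
for _ in range(20):
    unpacked.append(value & ((1 << BITS) - 1))
    value >>= BITS
unpacked.reverse()
assert unpacked == numbers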
Store all numbers as strings in a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

Efficient encoding of basic numeric datatypes in protobuffers

In Protobuffers documentation, it has been given
"For historical reasons, repeated fields of basic numeric types aren't encoded as
efficiently as they could be. New code should use the special option [packed=true] to get
a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];"
Can someone clearly explain how the option [packed=true] improves the efficiency of encoding basic numeric datatypes?
Basically, under the original encoding the field header (which is composed of the wire type combined with the field-number, bit-shifted and or'd) occurs for every element. Because the header is varint encoded, it is at least one byte per element, but possibly more. So 10 4-byte floats would be at least 50 bytes and quite possibly 90 bytes if the header takes 5 bytes (large field numbers take more space than small field numbers).
With the packed encoding, the field header occurs only once, followed by a varint that indicates the number of bytes to follow. So for 10 floats, the payload length is 40, which is varint-encoded in a single byte for the length prefix. At deserialization time it simply consumes that-many bytes, reading elements as it does so. Therefore for the same data (50 to 90 bytes previously) we are now using 42 to 46 bytes (again, for the range of field numbers that take 1 to 5 bytes each).
These 2 layouts are very different on the wire, and code expecting one cannot usually decode the other. As such, packed encoding needs to be explicitly enabled to prevent breaking existing messages.
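To illustrate the arithmetic, here is a rough Python sketch that builds protobuf-style bytes by hand to compare the two layouts (an illustration of the wire format, not a use of the protobuf library; field number 4 and ten floats match the example above):

import struct

def varint(n):
    # Base-128 varint used for headers and length prefixes.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def header(field_number, wire_type):
    return varint((field_number << 3) | wire_type)

floats = [1.5] * 10  # ten 4-byte floats

# Unpacked: one header (wire type 5 = 32-bit) before every element.
unpacked = b"".join(header(4, 5) + struct.pack("<f", f) for f in floats)

# Packed: one header (wire type 2 = length-delimited), a length, then the payload.
payload = b"".join(struct.pack("<f", f) for f in floats)
packed = header(4, 2) + varint(len(payload)) + payload

print(len(unpacked))  # 50 bytes (field number 4 gives 1-byte headers)
print(len(packed))    # 42 bytes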

most efficient way to encode a 64-character sequence for less writing time to memory

The problem is as follows: given a 64-character sequence built from the English alphabet, which has 26 characters (a single case only), where any character has an equal chance of occurring at a given position.
I have some computation to do on the sequences, which requires writing them to text files, since the number of sequences exceeds the available RAM. So I thought of encoding each sequence so that I would have fewer bytes to write to the text file per sequence.
With that reasoning I thought of LZ, which would allow me to go down to 40 bytes. Is there any way I can go lower to encode a 64-character sequence?
With a large(-ish) lookup table you could encode each of the possible 26^64 character sequences in 301 (actually 300.8281==log2(26^64)) bits. This is slightly less than the 320 bits your straightforward compression would use. It is also the theoretical minimum given that any of the 26 characters occurs with equal probability.
Since you could derive the lookup table at any time you don't even need to store it. I suppose the bits used to represent the functions to encode a character string into a 301-bit integer and vice-versa ought to be counted into your compression ratio.
This is, of course, a long-winded restatement of #lhf's comment.
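A minimal Python sketch of that approach, computing the base-26 value directly instead of storing a lookup table (the names here are illustrative):

import string

ALPHABET = string.ascii_lowercase  # the 26 allowed characters

def encode(seq):
    # Interpret the 64-character sequence as a base-26 integer.
    n = 0
    for ch in seq:
        n = n * 26 + ALPHABET.index(ch)
    return n.to_bytes(38, "big")  # ceil(301 / 8) = 38 bytes

def decode(data, length=64):
    n = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, 26)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

seq = "abcdefghijklmnop" * 4  # any 64-character sequence
assert decode(encode(seq)) == seq
print(len(encode(seq)))  # 38 bytes, versus 40 bytes for 5-bit packing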
