Efficiency of sharing an integer variable - performance

First of all, sorry for my English, I'm not a native speaker. That said, I've lately been working on some programs to compress string data into integer values using a Huffman tree structure. The compressed data is meant to be shared over LoRa from one board to another.
The problem I'm facing is that I don't know whether compressing the data is actually more efficient: I've now successfully converted a char into a unique integer code, but in some extreme cases the code is a 5-digit number. So I don't know if it's more efficient to send a char value or an integer value.
Basically, I'm trying to understand if communication byte by byte is more efficient than communication char by char, and if so, by how much.
I've tried searching to see whether this question is a duplicate, but the only thing I've found is that for the CPU it's better to work with integers.

I am assuming that your "5 digit number" is all zeros and ones. In order to compress, each digit becomes a single bit in your data stream. You pack the bits, eight at a time, into bytes. So eight 5-bit values are packed into five bytes.
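A minimal sketch of that packing in C, assuming fixed 5-bit codes (real Huffman codes vary in length, so you would pass each code's own bit count); BitWriter and put_bits are illustrative names, not an established API:

    /* Pack 5-bit codes into bytes, most significant bit first. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint8_t buf[64];   /* packed output */
        size_t  bitpos;    /* number of bits written so far */
    } BitWriter;

    /* Append the nbits low-order bits of code, most significant first. */
    static void put_bits(BitWriter *w, uint32_t code, int nbits) {
        for (int i = nbits - 1; i >= 0; i--) {
            if (code & (1u << i))
                w->buf[w->bitpos / 8] |= (uint8_t)(0x80u >> (w->bitpos % 8));
            w->bitpos++;
        }
    }

    int main(void) {
        BitWriter w = {{0}, 0};
        uint32_t codes[8] = { 3, 17, 9, 31, 0, 22, 5, 12 };  /* example 5-bit codes */
        for (int i = 0; i < 8; i++)
            put_bits(&w, codes[i], 5);
        /* 8 codes * 5 bits = 40 bits = 5 bytes, instead of 8 bytes as chars. */
        size_t nbytes = (w.bitpos + 7) / 8;
        printf("%zu bytes:", nbytes);
        for (size_t i = 0; i < nbytes; i++)
            printf(" %02X", w.buf[i]);
        printf("\n");
        return 0;
    }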

Related

Why do people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting something to bytes. Are there cases where it is vitally important to convert to bytes, and what would I convert something to bytes for?
In most languages the most common string functions come as part of the language, or in a library/include/import that comes pre-made, often employing object code to take advantage of processor-based string functions. However, sometimes you need to do something with a string that isn't natively supported by the language. So, since the 8-bit days, people have viewed strings as arrays of 7- or 8-bit characters, which fit within a byte, and used conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')", this methodology can be painstakingly slow, because each call to the replaceChar method incurs processing overhead that may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms, but like I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and the size of these numbers is greater than the 64-bit bus size of the processor. The RSA encryption/decryption behind the SSL browser padlocks employs numbers which don't fit into the word size of a desktop computer, but nonetheless the programs on a desktop which deal with RSA certificates and keys must be able to process this data, which is actually strings.
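A minimal sketch of that kind of string arithmetic, assuming non-negative decimal strings (the helper name add_decimal_strings is mine, purely for illustration):

    /* Add two arbitrarily long non-negative decimal strings, digit by digit. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Returns a newly allocated string holding a + b; the caller frees it. */
    char *add_decimal_strings(const char *a, const char *b) {
        size_t la = strlen(a), lb = strlen(b);
        size_t lr = (la > lb ? la : lb) + 1;      /* room for a final carry */
        char *r = malloc(lr + 1);
        r[lr] = '\0';

        int carry = 0;
        for (size_t i = 0; i < lr; i++) {
            int da = (i < la) ? a[la - 1 - i] - '0' : 0;  /* digits from the right */
            int db = (i < lb) ? b[lb - 1 - i] - '0' : 0;
            int s = da + db + carry;
            r[lr - 1 - i] = (char)('0' + s % 10);
            carry = s / 10;
        }
        if (r[0] == '0' && lr > 1)                /* drop the leading zero */
            memmove(r, r + 1, lr);                /* also moves the '\0' */
        return r;
    }

    int main(void) {
        char *sum = add_decimal_strings("18446744073709551616", "1"); /* 2^64 + 1 */
        printf("%s\n", sum);  /* 18446744073709551617: too big for a 64-bit int */
        free(sum);
        return 0;
    }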
There are many and varied reasons you would want to deal with strings as arrays of bytes, but each of these reasons is fairly specialised.

When to use fixed value protobuf type? Or under what scenarios?

I want to transfer a serialized protobuf message over TCP and I've tried to use the first field to indicate the total length of the serialized message.
I know that the int32 will change the length after encoding. So, maybe a fixed32 is a good choice.
But at the end of the Encoding chapter, I found that I can't depend on that even if I use a fixed32 with field_num #1, because the Field Order section says that the order may change.
My question is when do I use fixed value types? Are there any example scenarios?
"My question is when do I use fixed value types?"
When it comes to serializing values, there's always a tradeoff. If we look at the Protobuf-documentation, we see we have a few options when it comes to 32-bit integers:
int32: Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.
uint32: Uses variable-length encoding.
sint32: Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.
fixed32: Always four bytes. More efficient than uint32 if values are often greater than 2^28.
sfixed32: Always four bytes.
int32 is a variable-length data type. Any information that is not specified in the type itself needs to be expressed somehow. To deserialize a variable-length number, we need to know what the length is. That is contained in the serialized message as well, which requires additional storage space. The same goes for an optional negative sign. The resulting message may be smaller because of this, but may be larger as well.
Say we have a lot of integers between 0 and 255 to encode. It would be cheaper to send this information as two bytes (one byte with the actual value, and one byte to indicate that we just have one byte) than to send a full 32-bit (4-byte) integer [fictional values, actual implementation may differ]. On the other hand, if we want to serialize a large value that only just fits in 4 bytes, the result may be larger (4 bytes plus an additional byte to indicate that the value is 4 bytes; a total of 5 bytes). In this case it will be more efficient to use a fixed32. We simply know a fixed32 is 4 bytes; we don't need to serialize the fact that fixed32 is a 4-byte number.
And if we look at fixed32, the documentation actually mentions that the tradeoff point is around 2^28 (for unsigned integers).
So some types are good [as in, more efficient in terms of storage space] for large values, some for small values, some for positive/negative values. It all depends on what the actual values represent.
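For concreteness, here is a minimal sketch of base-128 varints, the variable-length scheme protobuf actually uses (seven payload bits per byte; the high bit flags that more bytes follow). The function names are mine:

    /* Encode/decode unsigned 32-bit values as base-128 varints. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Writes v into buf, low 7 bits first; returns bytes written (1..5). */
    static size_t varint_encode(uint32_t v, uint8_t *buf) {
        size_t n = 0;
        while (v >= 0x80) {
            buf[n++] = (uint8_t)(v | 0x80);  /* 7 payload bits + continuation bit */
            v >>= 7;
        }
        buf[n++] = (uint8_t)v;               /* final byte, continuation bit clear */
        return n;
    }

    /* Reads one varint from buf into *out; returns bytes consumed. */
    static size_t varint_decode(const uint8_t *buf, uint32_t *out) {
        uint32_t v = 0;
        size_t n = 0;
        int shift = 0;
        do {
            v |= (uint32_t)(buf[n] & 0x7F) << shift;
            shift += 7;
        } while (buf[n++] & 0x80);
        *out = v;
        return n;
    }

    int main(void) {
        uint8_t buf[5];
        uint32_t samples[] = { 127, 300, 1u << 28 };  /* 2^28 needs all 5 bytes */
        for (size_t i = 0; i < 3; i++) {
            uint32_t back;
            size_t len = varint_encode(samples[i], buf);
            varint_decode(buf, &back);
            printf("%u -> %zu byte(s), round-trip %u\n", samples[i], len, back);
        }
        return 0;
    }

This also shows why the documentation puts the fixed32 tradeoff at 2^28: values of 2^28 and above need five varint bytes, one more than a fixed32.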
"Are there any example scenarios?"
32-bit hashes (i.e. CRC-32), IPv4 addresses/masks. A predictable message size could also be relevant.

Fastest algorithm to convert hexadecimal numbers into decimal form without using a fixed length variable to store the result

I want to write a program to convert hexadecimal numbers into their decimal forms without using a variable of fixed length to store the result because that would restrict the range of inputs that my program can work with.
Let's say I were to use a variable of type long long int to calculate, store and print the result. Doing so would limit the range of hexadecimal numbers that my program can handle to between 8000000000000001 and 7FFFFFFFFFFFFFFF. Anything outside this range would cause the variable to overflow.
I did write a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations but it runs much slower, even for numbers that are as big as 7FFFFFFFF!
Then I stumbled onto this site which could take numbers that are way outside the range of a 64 bit variable. I tried their converter with numbers much larger than 16^65 - 1 and still couldn't get it to overflow. It just kept on going and printing the result.
I figured that they must be using a much better algorithm for hex to decimal conversion, one that isn't limited to 64 bit values.
So far, Google's search results have only led me to algorithms that use some fixed-length variable for storing the result.
That's why I am here. I wanna know if such an algorithm exists and if it does, what is it?
Well, it sounds like you already did it when you wrote "a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations".
Converting from base 16 (hexadecimal) to base 10 means implementing multiplication and addition of numbers in a base-10 (or power-of-ten) representation. Then for each hex digit d, you calculate result = result*16 + d. When you're done you have the same number in a base-10 representation that is easy to write out as a decimal string.
There could be any number of reasons why your string-based method was slow. If you provide it, I'm sure someone could comment.
The most important trick for making it reasonably fast, though, is to pick the right base to convert to and from. I would probably do the multiplication and addition in base 10^9, so that each digit will be as large as possible while still fitting into a 32-bit integer, and process 7 hex digits at a time, which is as many as I can while only multiplying by single digits (16^7 < 10^9).
For every 7 hex digits, I'd convert them to a number d, and then do result = result * 16^7 + d.
Then I can get the 9 decimal digits for each resulting digit in base 10^9.
This process is pretty easy, since you only have to multiply by single digits. I'm sure there are faster, more complicated ways that recursively break the number into equal-sized pieces.
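A minimal sketch of that scheme, assuming an unsigned hex string as input (names are mine; error handling omitted):

    /* Convert a hex string to decimal using base-10^9 limbs,
       7 hex digits (16^7 < 10^9) at a time. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    #define LIMB_BASE 1000000000u   /* 10^9: nine decimal digits per limb */

    int main(void) {
        const char *hex = "7FFFFFFFFFFFFFFFFFFFFFFF";  /* example: 2^95 - 1 */
        size_t n = strlen(hex);

        /* About n/7 + 2 limbs suffice; n + 1 is generous. */
        uint32_t *limbs = calloc(n + 1, sizeof *limbs);
        size_t used = 1;

        for (size_t i = 0; i < n; i += 7) {
            size_t chunk_len = (n - i < 7) ? n - i : 7;
            uint32_t chunk = 0, factor = 1;   /* factor becomes 16^chunk_len */
            for (size_t j = 0; j < chunk_len; j++) {
                char c = hex[i + j];
                uint32_t v = isdigit((unsigned char)c)
                               ? (uint32_t)(c - '0')
                               : (uint32_t)(toupper((unsigned char)c) - 'A' + 10);
                chunk = chunk * 16 + v;
                factor *= 16;
            }
            /* result = result * 16^chunk_len + chunk, limb by limb with carry */
            uint64_t carry = chunk;
            for (size_t k = 0; k < used; k++) {
                uint64_t t = (uint64_t)limbs[k] * factor + carry;
                limbs[k] = (uint32_t)(t % LIMB_BASE);
                carry = t / LIMB_BASE;
            }
            while (carry) {
                limbs[used++] = (uint32_t)(carry % LIMB_BASE);
                carry /= LIMB_BASE;
            }
        }

        /* Most significant limb unpadded, the rest zero-padded to 9 digits. */
        printf("%u", limbs[used - 1]);
        for (size_t k = used - 1; k-- > 0; )
            printf("%09u", limbs[k]);
        printf("\n");

        free(limbs);
        return 0;
    }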

Advantages of the varint encoding used by protobuf

What are the advantages of varint encoding as used in protobuf? Is it simply that for smallish messages varints waste less space and time than variable size encodings that add sublinear space?
I suppose someone thought about typical or average message sizes when they picked varints for protobuf? Any public comments on the reasoning?
Varints save space. This is pretty much the only consideration. Imagine how good it would be if your small integers took up only 1 byte instead of 8 every time you sent a gigantic array of integers.
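A back-of-envelope illustration (my own numbers, not from the answer): the wire size of a million small integers stored as fixed 8-byte values versus varints.

    /* Compare fixed 8-byte integers with base-128 varints for small values. */
    #include <stdio.h>
    #include <stdint.h>

    /* Bytes a varint needs for v: one per started 7-bit group. */
    static int varint_len(uint64_t v) {
        int n = 1;
        while (v >= 0x80) { v >>= 7; n++; }
        return n;
    }

    int main(void) {
        const long count = 1000000;   /* a gigantic array of integers */
        const uint64_t value = 42;    /* every element fits in 7 bits */
        printf("fixed 8-byte ints: %ld bytes\n", count * 8);                  /* 8000000 */
        printf("varints:           %ld bytes\n", count * varint_len(value)); /* 1000000 */
        return 0;
    }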

Algorithm to find most efficient base to store large integer

Very large integers are often stored as variable-length arrays of digits in memory, as opposed to a straightforward binary representation as is the case with most primitive 'int' or 'long' types, as in Java or C. With this in mind, I would be interested to know algorithm(s) that can compute:
What size an integer must reach before it becomes more efficient to store it as a BigInteger (or equivalent arbitrary-precision arithmetic construct) with a given radix for the integer's digits;
Which radix would be most efficient to store the digits of this large integer.
I have mentioned 'efficiency'; by this, I mean I am mainly concerned with the amount of space such a BigInteger would consume, though I would also be interested to hear any comments on processing speed or time complexity.
An integer consumes the least space if stored in a raw binary format (unless it is a small integer and the data type is far too wide for it, e.g. storing 1 in a 128-bit long long). Storing it differently does not save any memory; other representations are used to make working with such integers easier.
If stored byte by byte, this translates into radix 256: 256 possible values, as many as a byte can hold.
BigInt is never more efficient than one of the integer types directly supported by hardware. If you can use what's supported directly, use it.
What the hardware supports most efficiently is likely a power of 2 or, often equivalently, binary.
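A back-of-envelope illustration of that point (my own numbers): bytes needed to store a 100-decimal-digit integer under a few representations.

    /* Storage cost of a 100-decimal-digit integer under different radices. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double D = 100.0;                    /* decimal digits in the number */
        double bits = D * log2(10.0);        /* information content in bits */

        printf("raw binary (radix 256):  %3.0f bytes\n", ceil(bits / 8.0));  /*  42 */
        printf("one decimal digit/byte:  %3.0f bytes\n", D);                 /* 100 */
        printf("base 10^9 in uint32s:    %3.0f bytes\n", ceil(D / 9.0) * 4); /*  48 */
        return 0;
    }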
