VIntWritable vs IntWritable - Hadoop

I understand that VIntWritable can significantly reduce the size needed to store an integer, when compared to IntWritable.
My questions are: what is the cost of using VIntWritable instead of IntWritable? Is it (only) the time needed to encode and decode the value? In other words, when should I use IntWritable instead of VIntWritable?

How do you choose between a fixed-length and a variable-length
encoding?
Fixed-length encodings are good when the distribution of values is
fairly uniform across the whole value space, such as a (well-designed)
hash function. Most numeric variables tend to have nonuniform
distributions, and on average the variable-length encoding will save
space. Another advantage of variable-length encodings is that you can
switch from VIntWritable to VLongWritable, because their encodings are
actually the same. So by choosing a variable-length representation,
you have room to grow without committing to an 8-byte long
representation from the beginning.
I just picked this up from Hadoop: The Definitive Guide, page 98.
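To make the trade-off concrete, here is a minimal sketch (assuming Hadoop's org.apache.hadoop.io classes are on the classpath; class and method names are mine) comparing the serialized sizes of the two writables:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.VIntWritable;
    import org.apache.hadoop.io.Writable;

    public class VIntSizeDemo {
        // Serialize a writable into a buffer and report how many bytes it took.
        static int serializedSize(Writable w) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            w.write(new DataOutputStream(bytes));
            return bytes.size();
        }

        public static void main(String[] args) throws IOException {
            for (int v : new int[] {1, 127, 128, 1_000_000, Integer.MAX_VALUE}) {
                System.out.printf("value=%-10d IntWritable=%d bytes, VIntWritable=%d bytes%n",
                        v, serializedSize(new IntWritable(v)),
                        serializedSize(new VIntWritable(v)));
            }
        }
    }

IntWritable always occupies 4 bytes, while VIntWritable ranges from 1 byte for small values up to 5 bytes for the largest ones. The cost is a little extra CPU to encode and decode each value, plus giving up a fixed record width: you can no longer seek to the n-th value by arithmetic alone.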

Related

Why do people sometimes convert numbers or strings to bytes?

Sometimes I encounter questions about converting something to bytes. Are there cases where it is vitally important to convert to bytes, and what would I convert something to bytes for?
In most languages, the most common string functions come as part of the language or in a library/include/import that comes pre-made, often employing object code to take advantage of processor-based string instructions. However, sometimes you need to do something with a string that isn't natively supported by the language. So, since 8-bit days, people have viewed strings as arrays of 7- or 8-bit characters, each of which fits within a byte, and used conventions like ASCII to determine which byte value represents which character.
While standard languages often have functions like "string.replaceChar(OFFSET,'a')", this methodology can be painstakingly slow, because each call to the replaceChar method incurs processing overhead that may be greater than the processing that actually needs to be done.
There is also the simplicity factor when designing your own string algorithms, but as I said, most of the common algorithms come prebuilt in modern languages (stringCompare, trimString, reverseString, etc.).
Suppose you want to perform an operation on a string which doesn't come as standard.
Suppose you want to add two numbers which are represented as decimal digits in strings, and the size of these numbers is greater than the 64-bit word size of the processor. The RSA encryption/decryption behind the SSL browser padlock employs numbers which don't fit into the word size of a desktop computer, but nonetheless the programs on a desktop which deal with RSA certificates and keys must be able to process this data, which is actually strings.
There are many and varied reasons you would want to deal with strings as arrays of bytes, but each of these reasons would be fairly specialised.
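For instance, here is a minimal sketch (the method name is mine) of adding two arbitrarily long non-negative decimal numbers held in strings, digit by digit, exactly the kind of operation that doesn't come as standard:

    // Adds two non-negative decimal numbers given as ASCII digit strings,
    // so the result can exceed the range of any native integer type.
    static String addDecimalStrings(String a, String b) {
        StringBuilder sum = new StringBuilder();
        int i = a.length() - 1, j = b.length() - 1, carry = 0;
        while (i >= 0 || j >= 0 || carry != 0) {
            int digit = carry;
            if (i >= 0) digit += a.charAt(i--) - '0';  // byte value minus '0' gives the digit
            if (j >= 0) digit += b.charAt(j--) - '0';
            sum.append((char) ('0' + digit % 10));
            carry = digit / 10;
        }
        return sum.reverse().toString();
    }

Each character is treated as nothing more than a small integer offset from '0', which is the "strings as arrays of bytes" view in action.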

Why is hash output fixed in length?

Hash functions always produce a fixed-length output regardless of the input (e.g. MD5 -> 128 bits, SHA-256 -> 256 bits), but why?
I know that is how the designers designed them to be, but why did they design the output to have the same length?
So that it can be stored in a consistent fashion? To make comparison easier? To be less complicated?
Because that is what the definition of a hash is. Refer to Wikipedia:
A hash function is any function that can be used to map digital data
of arbitrary size to digital data of fixed size.
If your question relates to why it is useful for a hash to be a fixed size there are multiple reasons (non-exhaustive list):
Hashes typically encode a larger (often arbitrary size) input into a smaller size, generally in a lossy way, i.e. unlike compression functions, you cannot reconstruct the input from the hash value by "reversing" the process.
Having a fixed size output is convenient, especially for hashes designed to be used as a lookup key.
You can predictably (pre)allocate storage for hash values and index them in a contiguous memory segment such as an array.
For hashes of "native word sizes", e.g. 16, 32 and 64 bit integer values, you can do very fast equality and ordering comparisons.
Any algorithm working with hash values can use a single set of fixed size operations for generating and handling them.
You can predictably combine hashes produced with different hash functions, e.g. in a Bloom filter (see the sketch after this list).
You don't need to waste any space to encode how big the hash value is.
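To make the Bloom-filter point concrete, here is a minimal sketch (all names are mine; the h1 + i*h2 double-hashing trick is a standard way to combine two hash values): fixed-size hashes make it trivial to derive k probe positions into one bit array.

    import java.util.BitSet;

    // Two fixed-size hash values are combined as h1 + i*h2 (double hashing)
    // to derive k probe positions in a single bit array.
    class TinyBloomFilter {
        private final BitSet bits;
        private final int size, k;

        TinyBloomFilter(int size, int k) {
            this.bits = new BitSet(size);
            this.size = size;
            this.k = k;
        }

        private int probe(Object item, int i) {
            int h1 = item.hashCode();
            int h2 = Integer.reverse(h1) | 1;        // a cheap second hash, forced odd
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(Object item) {
            for (int i = 0; i < k; i++) bits.set(probe(item, i));
        }

        boolean mightContain(Object item) {  // false positives possible, never false negatives
            for (int i = 0; i < k; i++) if (!bits.get(probe(item, i))) return false;
            return true;
        }
    }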
There do exist special hash functions that are capable of producing an output hash of any specified length, such as the so-called sponge functions.
As you can see, this is the standard.
Also, what you want is specified in the standard:
Some application may require a hash function with a message digest
length different than those provided by the hash functions in this
Standard. In such cases, a truncated message digest may be used,
whereby a hash function with a larger message digest length is applied
to the data to be hashed, and the resulting message digest is
truncated by selecting an appropriate number of the leftmost bits.
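A minimal sketch of that truncation, using the JDK's MessageDigest (the helper name is mine):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    // Derive a shorter digest by keeping the leftmost bytes of SHA-256,
    // as the standard quoted above permits.
    static byte[] truncatedDigest(byte[] message, int bytesWanted)
            throws NoSuchAlgorithmException {
        byte[] full = MessageDigest.getInstance("SHA-256").digest(message); // 32 bytes
        return Arrays.copyOf(full, bytesWanted);                            // leftmost bytes
    }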
Often it's because you want to use the hash value, or some part of it, to quickly store and look up values in a fixed-size array. (This is how a non-resizable hashtable works, for example.)
And why use a fixed-size array instead of some other, growable data structure (like a linked list or binary tree)? Because accessing them tends to be both theoretically and practically fast: provided that the hash function is good and the fraction of occupied table entries isn't too high, you get O(1) lookups (vs. O(log n) lookups for tree-based data structures or O(n) for lists) on average. And these accesses are fast in practice: after calculating the hash, which usually takes linear time in the size of the key with a low hidden constant, there's often just a bit shift, a bit mask and one or two indirect memory accesses into a contiguous block of memory that (a) makes good use of cache and (b) pipelines well on modern CPUs because few pointer indirections are needed.
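To make that last step concrete, here is a minimal sketch of the lookup arithmetic (the bit-spreading line mirrors what java.util.HashMap does; the names are mine):

    // Map a fixed-size hash value to a slot in a power-of-two-sized table.
    static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();
        h ^= (h >>> 16);            // spread the high bits into the low bits
        return h & (capacity - 1);  // the mask replaces a slower h % capacity
    }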

Efficient storage of large matrix on HDD

I have many large (1 GB+) matrices of doubles (floats), many of whose entries are 0.0, that need to be stored efficiently. I intend to keep the double type, since some of the elements do need to be doubles (but I can consider changing this if it could lead to a significant space saving). A string header is optional. The matrices have no missing elements, NaNs, NAs, nulls, etc.: they are all doubles.
Some columns will be sparse, others will not be. The proportion of columns that are sparse will vary from file to file.
What is a space-efficient alternative to CSV? For my use, I need to parse this matrix quickly into R, Python and Java, so a file format specific to a single language is not appropriate. Access may need to be by row or column.
I am also not looking for a commercial solution.
My main objective is to save HDD space without blowing out I/O times. RAM usage once imported is not the primary consideration.
The most important question is whether you always expand the whole matrix into memory or whether you need random access to the compacted form (and how). Expanding is way simpler, so I'm concentrating on that.
You could use a bitmap stating whether a number is present or zero. This costs 1 bit per entry and thus can increase the file size by 1/64 in case of no zeros, or shrink it to about 1/64 in case of all zeros. If there are runs of zeros, you may store the number of following zeros and the number of non-zeros instead, e.g., by packing two 4-bit counts into one byte.
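A minimal sketch of that bitmap layout (the format and names here are improvised for illustration, not an established file format): write the entry count, a presence bitmap, then only the non-zero doubles.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.BitSet;

    // Writes: entry count, bitmap length, the bitmap (1 bit per entry),
    // then the non-zero doubles in order.
    static void writeSparse(double[] values, String path) throws IOException {
        BitSet present = new BitSet(values.length);
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) present.set(i);
        }
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            out.writeInt(values.length);
            byte[] bitmap = present.toByteArray();
            out.writeInt(bitmap.length);
            out.write(bitmap);
            for (int i = present.nextSetBit(0); i >= 0; i = present.nextSetBit(i + 1)) {
                out.writeDouble(values[i]);  // only the non-zeros cost 8 bytes each
            }
        }
    }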
As the double representation is standard (IEEE 754), you can use the binary representation in all three languages. If many of your numbers are actually ints, you may consider something like what I did.
If consecutive numbers are related, you could consider storing their differences.
I intend to keep the double type, since some of the elements do need to be doubles (but I can consider changing this if it could lead to a significant space saving).
Obviously, switching to float would trade half the precision for half the memory. This is probably too imprecise, so you could instead omit a few bits from the mantissa and get e.g. 6 bytes per entry. Alternatively, you could reduce the exponent to a single byte, as the range 1e-38 to 3e38 should suffice.
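A hedged sketch of the mantissa-truncation variant (names are mine): drop the 16 least-significant mantissa bits so each value fits in 6 bytes.

    // A double is 1 sign + 11 exponent + 52 mantissa bits. Shifting out the
    // 16 low mantissa bits leaves a 48-bit (6-byte) lossy payload.
    static long toSixBytePayload(double x) {
        return Double.doubleToLongBits(x) >>> 16;
    }

    static double fromSixBytePayload(long payload) {
        return Double.longBitsToDouble(payload << 16);  // lost bits come back as zeros
    }

The remaining 36 mantissa bits still give roughly 11 significant decimal digits, which is often enough.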

What is the name of this text compression scheme?

A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.
It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.
I recall that it was invented at Google, but it was not Snappy.
I think what you're on about is BOCU, Binary-Ordered Compression for Unicode, or one of its predecessors/successors. In particular:
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF (hexadecimal). The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
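The real BOCU byte encoding has more machinery (it adjusts the "previous" state and maps deltas to byte sequences), but the delta core can be sketched like this (a toy illustration, not the actual algorithm):

    // Replace each code point with its signed difference from the previous one;
    // runs of characters from the same alphabet produce small deltas.
    static int[] codePointDeltas(String text) {
        int[] cps = text.codePoints().toArray();
        int[] deltas = new int[cps.length];
        int prev = 0;
        for (int i = 0; i < cps.length; i++) {
            deltas[i] = cps[i] - prev;
            prev = cps[i];
        }
        return deltas;
    }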

Algorithm to find most efficient base to store large integer

Very large integers are often stored as variable-length arrays of digits in memory, as opposed to the straightforward binary representation used by most primitive 'int' or 'long' types, as in Java or C. With this in mind, I would be interested to know of algorithm(s) that can compute:
At what magnitude an integer becomes more efficient to store as a BigInteger (or equivalent arbitrary-precision arithmetic construct) with a given radix for the integer's digits;
Which radix would be most efficient to store the digits of this large integer.
I have mentioned 'efficiency'; by this, I mean I am mainly concerned with the amount of space such a BigInteger would consume, though I would also be interested to hear any comments on processing speed or time complexity.
An integer consumes the least space if stored in a raw binary format (unless it is a small integer and the data type is far too wide for it, e.g. storing 1 in a 128-bit long long). Storing it differently does not save any memory; other representations are used to make working with such integers easier.
If stored byte by byte, this translates into radix 256: 256 possible values, as many as a byte can hold.
BigInt is never more efficient than one of the integer types directly supported by hardware. If you can use what's supported directly, use it.
The most efficient radix is whatever the hardware supports directly: most likely a power of 2 or, often equivalently, binary.
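In Java terms, BigInteger already behaves this way: it stores its magnitude in base-2^32 words internally, and toByteArray() exposes a raw base-256 (binary) form. A small sketch of the space comparison:

    import java.math.BigInteger;

    public class RadixDemo {
        public static void main(String[] args) {
            BigInteger n = new BigInteger("123456789012345678901234567890");
            byte[] raw = n.toByteArray();  // base-256 digits, two's complement, big-endian
            System.out.println(raw.length + " bytes in binary vs "
                    + n.toString(10).length() + " bytes as decimal text");
            System.out.println(new BigInteger(raw).equals(n));  // round-trips: true
        }
    }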
