What do the three encodings on a Parquet column mean? - parquet

After reading the documentation, I know what each individual encoding means.
But I can't understand why one column has three encodings.
For example:
ENC:BIT_PACKED,PLAIN,RLE
ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE

This has to do with the fact that each column has at least three different arrays that are serialized:
repetition levels: integer array that is used to specify (roughly) whether we have a single value in a row or the row consists of an array. Either BIT_PACKED or RLE encoded.
definition levels: integer array to specify if a row is null and if so, on which nesting level. Either BIT_PACKED or RLE encoded.
data: The actual data that is stored. Depending on the data, this is one of the other encodings like PLAIN or RLE_DICTIONARY. As the data can also be split into several pages, you may get different encodings for each page. For example, when a column is dictionary encoded, the first pages will be PLAIN_DICTIONARY or RLE_DICTIONARY. When the dictionary grows too large, the Parquet implementation may switch to a different encoding, e.g. PLAIN, for all following pages.

row group 0
--------------------------------------------------------------------------------
x: DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y: BINARY SNAPPY DO:0 FPO:1636 SZ:864/16573/19.18 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
z: DOUBLE SNAPPY DO:0 FPO:2500 SZ:560097/560067/1.00 VC:70000 ENC:PLAIN,BIT_PACKED ST:[min: 2.0828331581679294E-7, max: 0.9999892375625329, num_nulls: 0]
x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
I guess ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE is just the set of all encodings (repetition levels, definition levels, data) used across all pages of a column chunk in a row group. Not sure if it is ordered.
The data encoding is probably the info you are interested in. You can see it at the page level from VLE:PLAIN_DICTIONARY, or at the column chunk level from DS: 5 DE:PLAIN_DICTIONARY, which means the dictionary has 5 keys.
DLE - Definition level encoding.
RLE - Repetition level encoding.
VLE - Value encoding.
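If you want to check these encodings programmatically rather than with parquet-tools, pyarrow exposes the same column chunk metadata. A minimal sketch (the file name is just a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")  # placeholder path
meta = pf.metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # `encodings` is the same set that parquet-tools prints after ENC:
        print(chunk.path_in_schema, chunk.encodings)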

Related

String indexes converted to onehot vector are blank (no index set to 1) for some rows?

I have a pyspark dataframe with a categorical column that is being converted into a onehot encoded vector via...
si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)
When looking at the dataframe afterwards, I see some of the onehot encoded vectors looking like...
(1,[],[])
I would expect the sparse vectors to either look like (1,[0],[1.0]) or (1,[1],[1.0]), but here the vectors are just zeros.
Any idea what could be happening here?
This has to do with how the values are encoded in mllib.
The 1hot is not encoding the binary value like...
[1, 0] or [0, 1]
in a [this, that] fashion but rather
[1] or [0]
In the sparse vector format the [0] case looks like (1,[],[]), meaning length = 1, no positions hold a nonzero value, and (thus) there are no nonzero values to list (you can see more about how mllib represents sparse vectors here). So, just as a binary category only needs a single bit to represent both choices, the 1hot encoding uses a single index in the vector. From another article on encoding...
One Hot Encoding is very popular. We can represent all categories by N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included [... But note that] for classification the recommendation is to use all N columns without dropping one, as most of the tree-based algorithms build the tree based on all available columns.
If you don't want the onehot encoder to drop the last category to simplify the representation, the mllib class has a dropLast param you can set; see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator
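For example, a minimal sketch of keeping all N categories (this reuses the df and LABEL_IDX column from the question; in Spark 3.x the class is simply called OneHotEncoder):

from pyspark.ml.feature import OneHotEncoderEstimator  # OneHotEncoder in Spark 3.x

# Keep every category instead of dropping the last one.
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"],
                            outputCols=["LABEL_OH"],
                            dropLast=False).fit(df)
df = oh.transform(df)
# A two-category column now yields vectors like (2,[0],[1.0]) and (2,[1],[1.0]).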

Why does making one enum variant an `f64` increase the size of this enum?

I have created three enums that are nearly identical:
#[derive(Clone, Debug)]
pub enum Smoller {
    Int(u8),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}

#[derive(Clone, Debug)]
pub enum Smol {
    Float(f32),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}

#[derive(Clone, Debug)]
pub enum Big {
    Float(f64),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}

pub fn main() {
    println!("Smoller: {}", std::mem::size_of::<Smoller>()); // => Smoller: 17
    println!("Smol: {}", std::mem::size_of::<Smol>());       // => Smol: 20
    println!("Big: {}", std::mem::size_of::<Big>());         // => Big: 24
}
What I expect, given my understanding of computers and memory, is that these should be the same size. The biggest variant is the [u8; 16] with a size of 16. Therefore, while these enums have differently sized first variants, they have the same size for their biggest variant and the same total number of variants.
I know that Rust can do some optimizations when types have gaps (e.g. an enum wrapping a pointer can fold the discriminant into it, because we know a valid pointer won't be 0), but this is really the opposite of that. I think if I were constructing this enum by hand, I could fit it into 17 bytes (only one byte being necessary for the discriminant), so both the 20 bytes and the 24 bytes are perplexing to me.
I suspect this might have something to do with alignment, but I don't know why and I don't know why it would be necessary.
Can someone explain this?
Thanks!
The size must be at least 17 bytes, because its biggest variant is 16 bytes and it needs an extra byte for the discriminant (the compiler can be smart in some cases and put the discriminant in unused bits of the variants, but it can't do that here).
Also, the size of Big must be a multiple of 8 bytes to align f64 properly. The smallest multiple of 8 that is at least 17 is 24.
Similarly, Smol cannot be only 17 bytes, because its size must be a multiple of 4 bytes (the alignment of f32). Smoller only contains u8, so it can be aligned to 1 byte.
As mcarton mentions, this is an effect of alignment of internal fields and alignment/size rules.
Alignment
Specifically, common alignments for built-in types are:
1: i8, u8.
2: i16, u16.
4: i32, u32, f32.
8: i64, u64, f64.
Do note that I say common: in practice, alignment is dictated by hardware, and on 32-bit architectures you could reasonably expect f64 to be 4-byte aligned. Further, the alignment of isize, usize and pointers varies between 32-bit and 64-bit architectures.
In general, for ease of use, the alignment of a compound type is the largest alignment of any of its fields, recursively.
Access to unaligned values is generally architecture specific: on some architectures it will crash (SIGBUS) or return erroneous data, on some it will be slower (x86/x64 until not so long ago), and on others it may be just fine (newer x64, for some instructions).
Size and Alignment
In C, the size must always be a multiple of the alignment, because of the way arrays are laid out and iterated over:
Each element in the array must be at its correct alignment.
Iterating is done by incrementing the pointer by sizeof(T) bytes.
Thus the size must be a multiple of the alignment.
Rust has inherited this behavior.^1
It is interesting to note that Swift decided to define a separate intrinsic, strideof, to represent the stride between elements in an array, which allowed them to remove any tail padding from the result of sizeof. It did cause some confusion, as people expected sizeof to behave as in C, but it allows packing memory more efficiently.
Thus, in Swift, your enums could be represented as:
Smoller: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 17 bytes, alignof 1 byte.
Smol: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 20 bytes, alignof 4 bytes.
Big: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 24 bytes, alignof 8 bytes.
Which clearly shows the difference between the size and the stride, which are conflated in C and Rust.
^1 I seem to remember some discussions over the possible switch to strideof, which did not come to fruition as we can see, but could not find a link to them.
I think that it is because of the alignment requirements of the inner values.
u8 has an alignment of 1, so all works as you expect, and you get a whole size of 17 bytes.
But f32 has an alignment of 4 (technically it is arch-dependent, but that is the most likely value). So even if the discriminant is just 1 byte you get this layout for Smol::Float:
[discriminant x 1] [padding x 3] [f32 x 4] = 8 bytes
And then for Smol::Sixteen:
[discriminant x 1] [u8 x 16] [padding x 3] = 20 bytes
Why is this padding really necessary? Because it is a requirement that the size of a type be a multiple of its alignment, or else arrays of this type would be misaligned.
Similarly, the alignment of f64 is 8, so you get a full size of 24, which is the smallest multiple of 8 that can hold the whole enum.
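In every case the arithmetic is just rounding the unpadded size (16-byte payload plus a 1-byte discriminant) up to the next multiple of the alignment. A quick sketch of that rule, in Python purely as a calculator:

def padded_size(unpadded, align):
    # Round the unpadded size up to the next multiple of the alignment,
    # as the size-must-be-a-multiple-of-alignment rule requires.
    return (unpadded + align - 1) // align * align

print(padded_size(17, 1))  # 17 -> Smoller
print(padded_size(17, 4))  # 20 -> Smol
print(padded_size(17, 8))  # 24 -> Big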

Convert 32-bit HEX from a GPS plot in Ruby

I am working with the following HEX values representing different values from a GPS/GPRS plot. All are given as 32-bit integers.
For example:
296767 is the decimal value (unsigned) reported for hex number: 3F870400
Another one:
34.96987500 is the decimal float value (signed), given in radians at a resolution of 10^(-8), reported for hex number: DA4DA303.
What is the process for transforming the hex numbers into their corresponding values in Ruby?
I've already tried unpack/pack with the directives L, H & h. I also tried applying two's complement and converting them to binary and then decimal, with no success.
If you are expecting an Integer value:
input = '3F870400'
output = input.scan(/../).reverse.join.to_i( 16 )
# 296767
If you are expecting degrees:
input = 'DA4DA303'
temp = input.scan(/../).reverse.join.to_i( 16 )
temp = ( temp & 0x80000000 > 1 ? temp - 0x100000000 : temp ) # Handles negatives
output = temp * 180 / (Math::PI * 10 ** 8)
# 34.9698751282937
Explanation:
The hexadecimal string represents the bytes of an Integer stored least-significant-byte first (little-endian). To store such an Integer as raw bytes you might use [296767].pack('V'), and if you had the raw bytes in the first place you would simply reverse that with binary_string.unpack('V'). However, you have a hex representation instead. There are a few different approaches you might take (including converting the hex back into bytes and unpacking it), but in the above I have chosen to manipulate the hex string into the most-significant-byte-first form and use Ruby's String#to_i.
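For comparison, the same little-endian interpretation can be sketched outside Ruby (here in Python) as a quick cross-check of the two example values:

import math, struct

# Unsigned 32-bit little-endian integer.
print(struct.unpack("<I", bytes.fromhex("3F870400"))[0])   # 296767

# Signed 32-bit little-endian integer holding radians at 10^-8 resolution;
# divide by 10^8 and convert to degrees.
radians_e8 = struct.unpack("<i", bytes.fromhex("DA4DA303"))[0]
print(math.degrees(radians_e8 / 10**8))                    # ~34.96987...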

Preallocating arrays of structures in Matlab for efficiency

In Matlab I wish to preallocate a 1x30 array of structures named P with the following structure fields:
imageSize: [128 128]
orientationsPerScale: [8 8 8 8]
numberBlocks: 4
fc_prefilt: 4
boundaryExtension: 32
G: [192x192x32 double]
G might not necessarily be 192x192x32, it could be 128x128x16 for example (though it will have 3 dimensions of type double).
I am doing the preallocation the following way:
P(30) = struct('imageSize', 0, 'orientationsPerScale', [0 0 0 0], ...
'numberBlocks', 0, 'fc_prefilt', 0, 'boundaryExtension', 0, 'G', []);
Is this the correct way of preallocating such a structure, or will there be performance issues relating to G being set to empty []? If there is a better way of allocating this structure please provide an example.
Also, the above approach seems to work (performance issues aside); however, the order of the field name / value pairs seems to be important, since rearranging them leads to an error upon assignment after preallocation. Why is this so, given that the items/values are referenced by name (not position)?
If G is set to empty, the interpreter has no way of knowing what size of data will be assigned to it later, so it will probably pack the array items tightly in memory and have to redo the layout when the data doesn't fit.
It's probably more efficient to define upper bounds for the dimensions of G beforehand and preallocate it at that size. The zeros function could help.

How does d3.scale.quantile work?

What is the meaning of this statement?
quantize = d3.scale.quantile().domain([0, 15]).range(d3.range(9));
I saw that the domain is:
0 - 0
1 - 15
The range is from 0 to 8, and quantize.quantiles() is:
0 - 1.6
1 - 3.3
2 - 4.9
3 - 6.6
4 - 8.3
5 - 9.9
6 - 11.6
7 - 13.3
How are the values of quantize.quantiles() calculated? I tried calling quantize(2) and the result was 1. How does the quantile scale work?
The motivation of the quantile scale is to obtain classes which are representative of the actual distribution of the values in the dataset. Therefore, it is necessary to provide it during construction with the full list of values. The scale then splits the input domain (defined by these values) into intervals (quantiles) in such a way that about the same number of values falls into each of the intervals.
From the documentation:
To compute the quantiles, the input domain is sorted, and treated as a population of discrete values.
Hence, when specifying the domain we hand in the scale the whole list of values:
var scale = d3.scale.quantile()
.domain([1, 1, 2, 3, 2, 3, 16])
.range(['blue', 'white', 'red']);
If we then run:
scale.quantiles()
It will output [2, 3] which means that our population of values was split into these three subsets represented by 'blue', 'white', and 'red' respectively:
[1, 1] [2, 2] [3, 3, 16]
Note that this scale should be avoided when there are outliers in the data which you want to show. In the above example 16 is an outlier falling into the upper quantile. It is assigned the same class as 3, which is probably not the desired behavior:
scale(3) // will output "red"
scale(16) // will output "red"
I would recommend reading over the quantile scale documentation, especially the part on quantize.quantiles().
But basically, d3 sees that there are 9 values in the output range for this scale, so it creates 9 quantile bins based on the two-value data set [0, 15].
This leads to the quantize.quantiles() values that you show in your question: [1.6, 3.3, ..., 13.3]. These represent the bounds of the quantiles: anything less than 1.6 will be mapped to the first element of the output range (in this case 0), and anything greater than 1.6 but less than 3.3 will be mapped to the second element of the output range (1). Hence quantize(2) = 1, as expected.
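To make the threshold computation concrete, here is a small sketch (in Python rather than JavaScript) of the linear-interpolation quantile rule that d3 v3's d3.quantile applies at k/9 for k = 1..8:

import math

def quantile(values, p):
    # Linear interpolation over a sorted "population" of values (R-7 rule).
    H = (len(values) - 1) * p + 1
    h = math.floor(H)
    v = values[h - 1]
    e = H - h
    return v + e * (values[h] - v) if e else v

domain = [0, 15]   # the two-value domain from the question
m = 9              # length of the output range, d3.range(9)
print([quantile(domain, k / m) for k in range(1, m)])
# approximately [1.67, 3.33, 5.0, 6.67, 8.33, 10.0, 11.67, 13.33]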
