Dremel, null value in repeated field - Parquet

I have a structure like this (I used JSON to represent the data here, but it can be an object in any form):
[
  {
    "DocID": ["A", "B"]
  },
  {}
]
Based on the Dremel paper, the repetition levels for the only data field here, "DocID" (which is repeated), are {0, 1, 0}, and the definition levels are {1, 1, 0}, since the last item is null.
Now if I have something like this:
[
  {
    "DocID": ["A", "B"]
  },
  {
    "DocID": [null]
  }
]
Then again, the repetition levels are {0, 1, 0}, but the definition levels are {1, 1, 1}, since DocID itself is defined in the last record.
When storing Dremel data in Parquet, null values are never stored in the data pages.
So in this case we store only two values, "A" and "B" (the encoding doesn't matter). To reconstruct the structure: the first RLevel is 0, so this is the start of a new object, and the first DLevel is 1, so it is not null; we read the first value, which is "A" (correct). The second RLevel is 1, meaning we are still in the same object and the field is repeated, and the DLevel is 1, so it is not null; we read the second value, which is "B" (correct).
The third RLevel is 0, which means a new object. In the first example the DLevel is 0, so it is null; we don't need to read anything (there is nothing left), and it works.
But in the second case the DLevel is 1, so we need to read something, and there is nothing left to read.
What should we do in this case?
Just for context, I am a co-author of the fraugster/parquet-go library, and this is an issue we faced recently.
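
For reference, here is a minimal Go sketch of the assembly loop described above; the names and signatures are illustrative, not parquet-go's actual API. It is parameterized by the maximum definition level, which is how Parquet normally resolves this ambiguity: when the elements of a repeated field are themselves nullable, the maximum definition level increases by one, so [null] is encoded one level below the maximum and, once again, no value is stored for it.

package main

import "fmt"

// assemble is a simplified record-assembly loop for a single repeated field.
// maxDef is the definition level at which an actual value was stored; every
// entry below maxDef is null at some level of the path, so nothing is read.
func assemble(rLevels, dLevels []int, values []string, maxDef int) [][]string {
	var records [][]string
	var cur []string
	vi := 0
	for i := range rLevels {
		if rLevels[i] == 0 && i > 0 { // repetition level 0 starts a new record
			records = append(records, cur)
			cur = nil
		}
		if dLevels[i] == maxDef { // fully defined: consume a stored value
			cur = append(cur, values[vi])
			vi++
		}
		// A real reader would also inspect lower definition levels here,
		// e.g. to emit a list containing null (dLevel 1 of 2) rather than
		// a missing field (dLevel 0 of 2).
	}
	return append(records, cur)
}

func main() {
	// Example 1: [{"DocID": ["A", "B"]}, {}] with maxDef = 1.
	fmt.Println(assemble([]int{0, 1, 0}, []int{1, 1, 0}, []string{"A", "B"}, 1))
	// Example 2: [{"DocID": ["A", "B"]}, {"DocID": [null]}]. With a nullable
	// element, maxDef becomes 2 and the null element is encoded as dLevel 1,
	// so there is never an attempt to read a value that was not stored.
	fmt.Println(assemble([]int{0, 1, 0}, []int{2, 2, 1}, []string{"A", "B"}, 2))
}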

Related

Performing aggregate operations on two JSON-like datasets

My application periodically consumes data from an API:
[
  {
    "name": "A",
    "val": 12
  },
  {
    "name": "B",
    "val": 22
  },
  {
    "name": "C",
    "val": 32
  }
]
Its task is to perform some operations on this data, for example subtracting the "A" value from its previous sample, so we always keep the current sample for later use.
The incoming data has two constraints:
In some iterations, parts of the data may be missing, i.e. the "A" object may not be present in some payloads.
The order of objects is not guaranteed to be consistent, i.e. one payload may contain A, B, C and another C, A, B.
The intuitive way to perform these operations is a linear search: loop through the current JSON and, for each object, search for its counterpart in the previous JSON, then perform the calculation and put the result into another JSON.
What is an efficient way to do this task? I would prefer to do it in Go, but the language is not important.
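
One way to avoid the O(n^2) linear search is to index the previous sample by name, which makes each lookup O(1) and the whole pass O(n). Here is a minimal Go sketch; the Sample and diff names are mine, not from any particular library:

package main

import "fmt"

// Sample mirrors one element of the API response.
type Sample struct {
	Name string  `json:"name"`
	Val  float64 `json:"val"`
}

// diff subtracts each previous value from its current counterpart,
// skipping names that are missing from the previous sample.
func diff(current, previous []Sample) map[string]float64 {
	prev := make(map[string]float64, len(previous))
	for _, s := range previous {
		prev[s.Name] = s.Val
	}
	out := make(map[string]float64, len(current))
	for _, s := range current {
		if p, ok := prev[s.Name]; ok {
			out[s.Name] = s.Val - p
		}
	}
	return out
}

func main() {
	previous := []Sample{{"A", 12}, {"B", 22}, {"C", 32}}
	current := []Sample{{"C", 40}, {"A", 15}, {"B", 25}}
	fmt.Println(diff(current, previous)) // map[A:3 B:3 C:8]
}

Because the lookup map is rebuilt from the previous sample on every iteration, neither missing objects nor a different ordering affects the result.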

JSONata prevent array flattening

Q: How do I prevent JSONata from "auto-flattening" arrays in an array constructor?
Given JSON data:
{
  "w": true,
  "x": ["a", "b"],
  "y": [1, 2, 3],
  "z": 9
}
the JSONata query seems to select 4 values:
[$.w, $.x, $.y, $.z]
The nested arrays at $.x and $.y are getting flattened/inlined into my outer wrapper, resulting in more than 4 values:
[ true, "a", "b", 1, 2, 3, 9 ]
The result I would like to achieve is
[ true, ["a", "b"], [1, 2, 3], 9 ]
I can achieve this by using
[$.w, [$.x], [$.y], $.z]
But this requires me to know a priori that $.x and $.y are arrays.
I would like to select 4 values and have the resulting array contain exactly 4 values, independent of the types of values that are selected.
There are clearly some things about the interactions between JSONata sequences and arrays that I can't get my head around.
In common with XPath/XQuery sequences, JSONata flattens the results of a path expression into the output array. It is possible to avoid this in your example by using the $each higher-order function to iterate over the object's key/value pairs. The following expression gets what you want without any flattening of results:
$each($, function($v) {
  $v
})
This just returns the value for each property in the object.
UPDATE: Extending this answer for your updated question:
I think this is related to a previous GitHub question on how to combine several independent queries into a single expression. That uses an object to hold all the queries, in a similar manner to the one you arrived at. Perhaps a slightly clearer expression would be this:
{
  "1": t,
  "2": u.i,
  "3": u.j,
  "4": u.k,
  "5": u.l,
  "6": v
} ~> $each(λ($v){$v})
The λ is just a shorthand for function, if you can find it on your keyboard (F12 in the JSONata Exerciser).
I am struggling to rephrase my question in such a way as to describe the difficulties I am having with JSONata's sequence-like treatment of arrays.
I need to run several queries to extract several values from the same JSON tree. I would like to construct one JSONata query expression which extracts n data items (or runs n subqueries) and returns exactly n values in an ordered array.
This example queries 6 values, but because of array flattening the result array does not have 6 values.
This example explicitly wraps each query in an array constructor so that the result has 6 values. However, the values which are not arrays end up wrapped in an extraneous and undesirable array, and one can no longer determine what the original type was ...
This example shows the result I am trying to accomplish ... I asked for 6 things and I got 6 values back. However, I must know the datatypes of the values I am fetching and explicitly wrap the arrays in an array constructor to work around the sequence flattening.
This example shows what I want. I queried 6 things and got back 6 answers without knowing the datatypes. But I have to introduce an object as a temporary container in order to work around the array flattening behavior.
I have not found any predicates that allow me to test the type of a value in a query ... which might have let me use the ?: operator to dynamically decide whether or not to wrap arrays in an array constructor. e.g. $isArray($.foo) ? [$.foo] : $.foo
Q: Is there an easier way for me to (effectively) submit 6 "path" queries and get back 6 values in an ordered array without knowing the data types of the values I am querying?
Building on the example from Acoleman, here is a way to pass in n "query" strings (that represent paths):
(['t', 'u.i', 'u.j', 'u.k', 'u.l', 'v'] {
  $: $eval('$$.' & $)
}).$each(function($o) {$o})
and get back an array of n results in their original data format:
[
  12345,
  [
    "i",
    "ii",
    "iii"
  ],
  [],
  "K",
  {
    "L": "LL"
  },
  null
]
It seems that using $each is the only way to avoid any flattening...
Granted, this is probably not the most efficient of expressions, since each path string has to be evaluated from the root of the data structure -- but there ya go.

Ruby custom sorting of first n elements

I would like to sort an array of strings based on a custom ordering. The problem is that I don't know all the elements in the array, but I am sure it contains 3 particular strings (high/med/low). I would like those 3 to be the first 3 values, and the rest to come after.
Eg:
Incoming arrays
array1 = ["high", "not impt", "med", "kind of impt", "low"]
array2 = ["low", "rand priority", "med", "high"]
Only high, med and low are fixed; the rest keep changing or might not be present at all.
Required output:
["high", "med", "low", rest (order doesn't matter)]
I know I can delete and merge, but it would be confusing in code as to why I'm deleting and merging. Is there a better way?
You can use the sort_by method and implement something like this:
["high", "not impt" , "med" , "kind of impt" , "low" ].sort_by do |a|
["high", "med", "low"].index(a) || Float::INFINITY
end
The index method returns 0, 1 and 2 for "high", "med" and "low" respectively, and nil for any other value. Thus "high", "med" and "low" end up at the beginning and the others at the end, since those three indexes are all less than Float::INFINITY.

RethinkDB: does reduce iterate all grouped data?

I thought I had the hang of RethinkDB :) but now I'm a bit confused.
for this query, counting grouped data:
groupedRql.count()
I'm getting the expected results (numbers):
[{"group": "a", "reduction": 41}, {"group": "b", "reduction": 39}...]
all reduction results are ~40, which is expected (and correct), but when I count using reduce like this:
groupedRql.map(function(row) {
  return row.merge({
    count: 0
  })
}).reduce(function(left, right) {
  return {count: left("count").add(1)}
})
I'm getting much lower results (~10), which makes no sense:
[{"group": "a", "reduction": 10}, {"group": "b", "reduction": 9}...]
I need to use reduce, of course, for further manipulation.
Am I missing something?
I'm using v2.0.3 on the server; the queries were tested directly in the Data Explorer.
The problem lies here:
return {count: left("count").add(1)}
It should be:
return {count: left("count").add(right("count"))}
Note that for the totals to come out right, the map step also needs to seed count: 1 rather than count: 0, so that each row contributes one to the sum.
The reduce runs in parallel across multiple shards and multiple CPU cores. When you do
return {count: left("count").add(1)}
you ignore the count accumulated on the right-hand side.
This is noted in the documentation: https://www.rethinkdb.com/docs/map-reduce/#how-gmr-queries-are-executed
it’s important to keep in mind that the reduce function is not called on the elements of its input stream from left to right. It’s called on either the elements of the stream in any order or on the output of previous calls to the function.
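
To see why this loses counts, here is a small Go sketch (mine, not RethinkDB code) that reduces 40 rows the way a sharded cluster might: pairwise over partial results rather than strictly left to right. The buggy combiner drops everything accumulated on the right, while the correct one does not:

package main

import "fmt"

// reduceTree combines elements pairwise in tree order, the way a
// map-reduce over multiple shards and cores might, instead of folding
// strictly left to right.
func reduceTree(xs []int, combine func(left, right int) int) int {
	if len(xs) == 1 {
		return xs[0]
	}
	mid := len(xs) / 2
	return combine(reduceTree(xs[:mid], combine), reduceTree(xs[mid:], combine))
}

func main() {
	// 40 rows, each mapped to a count of 1.
	ones := make([]int, 40)
	for i := range ones {
		ones[i] = 1
	}
	// Buggy combiner: ignores the right-hand partial count entirely.
	fmt.Println(reduceTree(ones, func(l, r int) int { return l + 1 })) // 6
	// Correct combiner: adds both partial counts.
	fmt.Println(reduceTree(ones, func(l, r int) int { return l + r })) // 40
}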

Emit multiple values in RethinkDB map step

I have datasets that consist of arrays and single values
{
  "a": "18",
  "b": ["x", "y", "z"]
}
or arrays and arrays
{
  "a": ["g", "h", "i"],
  "b": ["x", "y", "z"]
}
and I plan to map out each combination (like "18-x", "18-y", "18-z" or "g-x", "g-y", ...) in order to count them afterwards (or do anything else). I'm used to CouchDB and its emit function: I simply emitted multiple combinations per document. How is this supposed to be done in RethinkDB?
Note: The datasets are produced by a join
I would recommend making both fields always be arrays, even if they sometimes contain only a single value.
If you do that, you can do this with concat_map:
r.table('data').map(function(row) { // assuming a table named 'data'
  return row('a').concatMap(function(a) {
    return row('b').map(function(b) {
      return a.add('-').add(b);
    });
  });
});
If you want to continue using a mix of single values and arrays, you can do that by replacing r.row('a') with r.branch(r.row('a').typeOf().eq('ARRAY'), r.row('a'), [r.row('a')]).
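
The underlying operation is just a cross product of the two arrays. For illustration, here is the same idea in plain Go (a standalone sketch, unrelated to RethinkDB's API):

package main

import "fmt"

// combinations mirrors the concatMap above: for every element of a,
// emit one "a-b" string per element of b.
func combinations(a, b []string) []string {
	var out []string
	for _, x := range a {
		for _, y := range b {
			out = append(out, x+"-"+y)
		}
	}
	return out
}

func main() {
	fmt.Println(combinations([]string{"18"}, []string{"x", "y", "z"}))
	// [18-x 18-y 18-z]
	fmt.Println(combinations([]string{"g", "h", "i"}, []string{"x", "y", "z"}))
	// [g-x g-y g-z h-x h-y h-z i-x i-y i-z]
}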
