RethinkDB: does reduce iterate all grouped data?

I thought I had it with RethinkDB :) but now I'm a bit confused.
For this query, counting grouped data:
groupedRql.count()
I'm getting the expected results (numbers):
[{"group": "a", "reduction": 41}, {"group": "b", "reduction": 39}...]
All reduction results are ~40, which is expected (and correct). But when I count using reduce like this:
groupedRql.map(function(row) {
  return row.merge({
    count: 0
  })
}).reduce(function(left, right) {
  return {count: left("count").add(1)}
})
I'm getting much lower results (~10), which make no sense:
[{"group": "a", "reduction": 10}, {"group": "b", "reduction": 9}...]
I need to use reduce, of course, for further manipulation.
Am I missing something?
I'm using v2.0.3 on the server; queries were tested directly in the Data Explorer.

The problem lies here:
return {count: left("count").add(1)}
It should be:
return {count: left("count").add(right("count"))}
The reduce runs in parallel across multiple shards and multiple CPU cores. When you do
return {count: left("count").add(1)}
you ignore whatever count has already accumulated on the right-hand side.
It's noted in this document: https://www.rethinkdb.com/docs/map-reduce/#how-gmr-queries-are-executed
it’s important to keep in mind that the reduce function is not called
on the elements of its input stream from left to right. It’s called on
either the elements of the stream in any order or on the output of
previous calls to the function.
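Putting the fix together, a minimal sketch of the corrected count (note that the map step has to emit count: 1 rather than 0, so that summing both sides totals to the group size; groupedRql stands for the grouped query above):
groupedRql.map(function(row) {
  // each element contributes exactly one to the total
  return row.merge({count: 1})
}).reduce(function(left, right) {
  // combine both sides, so no partial count is dropped
  // no matter how the server pairs up the calls
  return {count: left("count").add(right("count"))}
})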

Related

Performing aggregate operations on two JSON-like data sets

My application periodically consumes data from an API:
[
  {
    "name": "A",
    "val": 12
  },
  {
    "name": "B",
    "val": 22
  },
  {
    "name": "C",
    "val": 32
  }
]
Its task is to perform some operations on this data, for example, to subtract the "A" value from its previous sample. So we always keep the current sample for further use.
The incoming data has two constraints:
In some iterations, parts of the data may be missing, e.g. the "A" object may not appear in some payloads,
The order of objects is not guaranteed to be consistent, e.g. one payload may have A, B, C, and another may have "C", "A", "B" ...
The intuitive way to perform these operations is a linear search: loop through the current JSON and, for each object, search for its counterpart in the previous JSON; then perform the calculations and put the results in another JSON.
What is an efficient way to do this task? I would prefer to do this in Go, but the language is not important.
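Since the question notes the language is unimportant, here is a minimal sketch (in JavaScript) of the lookup-table variant of that linear search: build a name → val map from the previous sample once, so each lookup in the current sample is O(1). The function and variable names are illustrative:
function diffSamples(prev, curr) {
  // index the previous sample by name (O(n))
  var prevByName = {};
  prev.forEach(function(item) {
    prevByName[item.name] = item.val;
  });
  // walk the current sample and diff against the index (O(n) total)
  return curr
    .filter(function(item) {
      // skip names that were missing from the previous sample
      return prevByName.hasOwnProperty(item.name);
    })
    .map(function(item) {
      return {name: item.name, delta: item.val - prevByName[item.name]};
    });
}
// usage: diffSamples(previousSample, currentSample)
// order of objects in either sample no longer matters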

Dremel: null value in a repeated field

I have a structure like this (I used JSON to represent data here, but this can be an object in any form):
[
  {
    "DocID": ["A", "B"]
  },
  {}
]
Based on the Dremel spec, the repetition levels for the only data field here, "DocID" (which is repeated), are {0,1,0}, and the definition levels are {1,1,0}, since the last item is null.
Now if I have something like this:
[
  {
    "DocID": ["A", "B"]
  },
  {
    "DocID": [null]
  }
]
Then again, the repetition levels are {0,1,0}, and the definition levels are {0,1,1}.
When storing Dremel data in Parquet, we never store null values.
So we store two values, "A" and "B", in this case (the encoding doesn't matter). But when reconstructing the structure: the first RLevel is 0, so this is the start of a new object; the first DLevel is 1, so the value is not null, and we read the first value, "A" (correct). The second RLevel is 1, which means we are still in the same object and the field repeats; the DLevel is 1, so it is not null, and we read the second value, "B" (correct).
The third RLevel is 0, which means a new object. In the first example the third DLevel is 0, so the field is null; we don't need to read anything (and there is nothing left to read), and it works.
But in the second case the third DLevel is 1, so we need to read something, and there is nothing left to read.
What should we do in this case?
Just for context, I am a co-author of the fraugster/parquet-go library, and this is an issue we faced recently.
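To make the walk above concrete, here is a minimal sketch of that assembly loop (plain JavaScript; the array names, and the assumption that the maximum definition level is 1, are illustrative rather than taken from the Dremel paper):
function assemble(rLevels, dLevels, values) {
  var records = [], current = null, vi = 0;
  for (var i = 0; i < rLevels.length; i++) {
    if (rLevels[i] === 0) {             // RLevel 0: start of a new record
      current = {};
      records.push(current);
    }
    if (dLevels[i] === 1) {             // DLevel 1: a real value, consume one
      if (!current.DocID) current.DocID = [];
      current.DocID.push(values[vi++]);
    }
    // DLevel 0: the field is null, nothing to consume
  }
  return records;
}
With the first example's levels, assemble([0,1,0], [1,1,0], ["A","B"]) reconstructs both records; in the second case, the final DLevel of 1 asks for a third value that was never stored, which is exactly the problem described.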

JSONata prevent array flattening

Q: How do I prevent JSONata from "auto-flattening" arrays in an array constructor?
Given JSON data:
{
  "w": true,
  "x": ["a", "b"],
  "y": [1, 2, 3],
  "z": 9
}
the JSONata query seems to select 4 values:
[$.w, $.x, $.y, $.z]
The nested arrays at $.x and $.y are getting flattened/inlined into my outer wrapper, resulting in more than 4 values:
[ true, "a", "b", 1, 2, 3, 9 ]
The results I would like to achieve are
[ true, ["a", "b"], [1, 2, 3], 9 ]
I can achieve this by using
[$.w, [$.x], [$.y], $.z]
But this requires me to know a priori that $.x and $.y are arrays.
I would like to select 4 values and have the resulting array contain exactly 4 values, independent of the types of values that are selected.
There are clearly some things about the interactions between JSONata sequences and arrays that I can't get my head around.
In common with XPath/XQuery sequences, JSONata flattens the results of a path expression into the output array. It is possible to avoid this in your example by using the $each higher-order function to iterate over the object's key/value pairs. The following expression will get what you want without any flattening of results:
$each($, function($v) {
  $v
})
This just returns the value for each property in the object.
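For reference, a sketch of running that expression with the jsonata npm package (this assumes jsonata 1.x, where evaluate() is synchronous; in 2.x it returns a Promise):
var jsonata = require("jsonata");   // npm install jsonata

var data = {w: true, x: ["a", "b"], y: [1, 2, 3], z: 9};

// $each applies the function to every property value, so nested
// arrays come back intact instead of being flattened
var result = jsonata("$each($, function($v) { $v })").evaluate(data);
console.log(result);                // [ true, [ 'a', 'b' ], [ 1, 2, 3 ], 9 ]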
UPDATE: Extending this answer for your updated question:
I think this is related to a previous GitHub question on how to combine several independent queries into the same query. That uses an object to hold all the queries, in a similar manner to the one you arrived at. Perhaps a slightly clearer expression would be this:
{
  "1": t,
  "2": u.i,
  "3": u.j,
  "4": u.k,
  "5": u.l,
  "6": v
} ~> $each(λ($v){$v})
The λ is just a shorthand for function, if you can find it on your keyboard (F12 in the JSONata Exerciser).
I am struggling to rephrase my question in such a way as to describe the difficulties I am having with JSONata's sequence-like treatment of arrays.
I need to run several queries to extract several values from the same JSON tree. I would like to construct one JSONata query expression which extracts n data items (or runs n subqueries) and returns exactly n values in an ordered array.
This example requests 6 values, but because of array flattening the result array does not contain 6 values.
This example explicitly wraps each query in an array constructor so that the result has 6 values. However, the values which are not arrays are wrapped in an extraneous and undesirable array; in addition, one cannot determine what the original type was ...
This example shows the result that I am trying to accomplish ... I asked for 6 things and I got 6 values back. However, I must know the datatypes of the values I am fetching and explicitly wrap the arrays in an array constructor to work around the sequence flattening.
This example shows what I want. I queried 6 things and got back 6 answers without knowing the datatypes. But I have to introduce an object as a temporary container in order to work around the array flattening behavior.
I have not found any predicates that allow me to test the type of a value in a query ... which might have let me use the ?: operator to dynamically decide whether or not to wrap arrays in an array constructor. e.g. $isArray($.foo) ? [$.foo] : $.foo
Q: Is there an easier way for me to (effectively) submit 6 "path" queries and get back 6 values in an ordered array without knowing the data types of the values I am querying?
Building on the example from Acoleman, here is a way to pass in n "query" strings (that represent paths):
(['t', 'u.i', 'u.j', 'u.k', 'u.l', 'v'] {
$: $eval('$$.' & $)
}).$each(function($o) {$o})
and get back an array of n results in their original data format:
[
  12345,
  [
    "i",
    "ii",
    "iii"
  ],
  [],
  "K",
  {
    "L": "LL"
  },
  null
]
It seems that using $each is the only way to avoid any flattening...
Granted, probably not the most efficient of expressions, since each path string has to be evaluated from the root of the data structure -- but there ya go.

Emit multiple values in RethinkDB map step

I have datasets that consist of arrays and single values
{
  "a": "18",
  "b": ["x", "y", "z"]
}
or arrays and arrays
{
  "a": ["g", "h", "i"],
  "b": ["x", "y", "z"]
}
and I plan to map out each combination (like "18-x", "18-y", "18-z" or "g-x", "g-y", ...) in order to count these afterwards (or do anything else). I'm used to CouchDB and its emit function: there I simply emitted multiple combinations per document. How is this supposed to be done in RethinkDB?
Note: The datasets are produced by a join
I would recommend making both fields always be arrays, even if the arrays sometimes only have a single value.
If you do that, you can do this with concat_map:
r.table('data').concatMap(function(row) {   // 'data' stands for wherever the joined rows live
  return row('a').concatMap(function(a) {
    return row('b').map(function(b) {
      return a.add('-').add(b);
    });
  });
});
If you want to continue using a mix of single values and arrays, you can do that by replacing r.row('a') with r.branch(r.row('a').typeOf().eq('ARRAY'), r.row('a'), [r.row('a')]).
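Combining the two, a sketch of a query that normalizes both fields on the fly (again, the table name data is a placeholder):
r.table('data').concatMap(function(row) {
  // wrap single values so both fields are uniformly arrays
  var as = r.branch(row('a').typeOf().eq('ARRAY'), row('a'), [row('a')]);
  var bs = r.branch(row('b').typeOf().eq('ARRAY'), row('b'), [row('b')]);
  return as.concatMap(function(a) {
    return bs.map(function(b) {
      return a.add('-').add(b);    // "18-x", "g-y", ...
    });
  });
});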

Calculate moving averages with Ruby and simple_statistics and append to JSON

I am trying to calculate moving averages (simple and exponential), and I have come across the simple_statistics gem, which is perfect for my needs. I am trying to modify the code from this link: How to calculate simple moving average, for my purposes.
GOAL:
I have a JSON like this which lists historical prices for a single stock over a long time period:
[
  {
    "Low": 8.63,
    "Volume": 14211900,
    "Date": "2012-10-26",
    "High": 8.79,
    "Close": 8.65,
    "Adj Close": 8.65,
    "Open": 8.7
  },
To this, I would like to add moving averages for each day (simple and exponential, which the simple_statistics gem seems to handle easily) over 20 and 50 days (and other windows as required), so it would appear something like this for each day:
[
  {
    "Low": 8.63,
    "Volume": 14211900,
    "Date": "2012-10-26",
    "High": 8.79,
    "Close": 8.65,
    "Adj Close": 8.65,
    "Open": 8.7,
    "SMA20":
    "SMA50":
  },
I would prefer to use the yahoo_finance and simple_statistics gems and then append the output to the original JSON, as I have a feeling that once I gain a better understanding it will be easier for me to modify.
Right now, I'm still reading up on how I will do this (any help is appreciated). Below is my attempt to calculate a 20-day simple moving average for Microsoft (it doesn't work). This way (using HistoricalQuotes_days) seems to assume that the start date is today, which won't work for my overall goal.
require 'rubygems'
require 'yahoofinance'
require 'simple_statistics'

averages = {}
dates.each do |dates|
  quotes = YahooFinance::get_HistoricalQuotes_days ( 'MSFT' , 20 ) start, finish
  closes = quotes.collect { |quote| quote.close }
  averages = closes.mean
end
Thank you
UPDATE: I don't actually need to use the YahooFinance gem, as I already have the data in JSON. What I don't know how to do is pull from the JSON array, make the calculations using the simple_statistics gem, and then add the new data into the original JSON.
Using the gem, I see two ways to get your data. Here they are (note they both can take a block):
YahooFinance::get_HistoricalQuotes_days('MSFT', 20)
Which returns an array of YahooFinance::HistoricalQuote objects with the following methods:
[ :recno, :recno=, :symbol, :symbol=, :date, :date=,
:open, :open=, :high, :high=, :low, :low=, :close,
:close=, :adjClose, :adjClose=, :volume, :volume=,
:to_a, :date_to_Date ]
Or:
YahooFinance::get_historical_quotes_days('MSFT', 20)
which returns an array of values; from the documentation:
Getting the historical quote data as a raw array.
The elements of the array are:
[0] - Date
[1] - Open
[2] - High
[3] - Low
[4] - Close
[5] - Volume
[6] - Adjusted Close
And to take an average (simple moving average), you can easily do:
ary.reduce(:+) / ary.length
where ary holds the values to average (they need to be floats, or Ruby will do integer division). To compute the exponential moving average, just use the following formula:
(close - previous_ema) * (2 / (amount_of_days_ago + 1) ) + previous_ema
Where close is the stock's close, previous_ema is yesterday's EMA, and amount_of_days_ago is the length of the averaging window into the past, for instance 20 (days).
edit:
Oh, yeah, parsing JSON is easy: https://github.com/flori/json
I can't write a whole beginner's Ruby guide, but the basics for what you need are Hash and Array. Look up how to use Ruby hashes and arrays, and that's probably a good 30% of Ruby programming right there.
For example, to get the JSON objects into an array and then pull out just the closes, you could use Array#map like so:
stocks = JSON.parse( your_json_here )
array = stocks.map { |hash| hash["Close"] }
# => [8.65, 9.32, etc... ]
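Putting the pieces together for the UPDATE, here is a minimal Ruby sketch of the whole round trip, assuming the quotes are sorted oldest-first and live in a hypothetical prices.json; it uses only the stdlib and the formulas above (swap in simple_statistics' mean where you prefer):
require 'json'

period = 20
k = 2.0 / (period + 1)                        # EMA smoothing factor
prev_ema = nil
days = JSON.parse(File.read('prices.json'))   # hypothetical input file

days.each_with_index do |day, i|
  next if i < period - 1                      # need 20 closes before the first average
  closes = days[(i - period + 1)..i].map { |d| d['Close'].to_f }
  day['SMA20'] = closes.reduce(:+) / closes.length
  prev_ema ||= day['SMA20']                   # seed the first EMA with the SMA
  prev_ema = (day['Close'].to_f - prev_ema) * k + prev_ema
  day['EMA20'] = prev_ema
end

File.write('prices_with_averages.json', JSON.pretty_generate(days))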
Hope that gets you started, and good luck.
