How can I find the minimum value for a field each day over a timespan using RethinkDB?

I have a database of readings from weather sensors. One of the items measured is 'sky temperature'. I want to find the minimum sky temperature each day over a period of a month or two.
The first thing I tried was this:
r.db('Weather').table('TAO_SkyNet', {readMode:'outdated'})
.group(r.row('time').dayOfYear(),{index:'time'})
.min('sky')
I think that might work, except that it is a large database and the query times out after 300 seconds. Fair enough, I really don't want the data back to the beginning of time. A few weeks will do. So I tried to restrict the records examined like this:
r.db('Weather').table('TAO_SkyNet', {readMode:'outdated'})
.between(r.time(2018,3,1,'Z'), r.now())
.group(r.row('time').dayOfYear(),{index:'time'})
.min('sky')
...and I get:
e: Expected type TABLE but found TABLE_SLICE:
SELECTION ON table(TAO_SkyNet) in:
r.db("Weather").table("TAO_SkyNet", {"readMode": "outdated"}).between(r.time(2018, 3, 1, "Z"), r.now()).group(r.row("time").dayOfYear(), {"index": "time"}).min("sky")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So, I'm stuck here. How do I group on a subset of the table?

between returns a table slice, and table slices don't support the index option of group:
table.between(lowerKey, upperKey[, options]) → table_slice
By the way, between operates over indexes itself.
Either remove {index: 'time'} from your group clause (if time is the primary key of TAO_SkyNet):
r.db('Weather')
.table('TAO_SkyNet', {readMode: 'outdated'})
.between(r.time(2018, 3, 1, 'Z'), r.now())
.group(r.row('time').dayOfYear())
.min('sky')
or move the index option to the between clause (if time is a secondary index on TAO_SkyNet):
r.db('Weather')
.table('TAO_SkyNet', {readMode: 'outdated'})
.between(r.time(2018, 3, 1, 'Z'), r.now(), {index: 'time'})
.group(r.row('time').dayOfYear())
.min('sky')
Either way, it should work fine.
Test dataset:
r.db('Weather').table('TAO_SkyNet').insert([
// day 1
{time: r.time(2018, 3, 1, 0, 0, 0, 'Z'), sky: 10},
{time: r.time(2018, 3, 1, 8, 0, 0, 'Z'), sky: 4}, // min
{time: r.time(2018, 3, 1, 16, 0, 0, 'Z'), sky: 7},
// day 2
{time: r.time(2018, 3, 2, 0, 0, 0, 'Z'), sky: 2}, // min
{time: r.time(2018, 3, 2, 8, 0, 0, 'Z'), sky: 4},
{time: r.time(2018, 3, 2, 16, 0, 0, 'Z'), sky: 9},
// day 3
{time: r.time(2018, 3, 3, 0, 0, 0, 'Z'), sky: 7},
{time: r.time(2018, 3, 3, 8, 0, 0, 'Z'), sky: 7},
{time: r.time(2018, 3, 3, 16, 0, 0, 'Z'), sky: 1} // min
]);
Query result:
[{
"group": 60,
"reduction": {"sky": 4, "time": Thu Mar 01 2018 08:00:00 GMT+00:00}
},
{
"group": 61,
"reduction": {"sky": 2, "time": Fri Mar 02 2018 00:00:00 GMT+00:00}
},
{
"group": 62,
"reduction": {"sky": 1, "time": Sat Mar 03 2018 16:00:00 GMT+00:00}
}]

Related

Find Top N Most Frequent Sequence of Numbers in List of a Billion Sequences

Let's say I have the following list of lists:
x = [[1, 2, 3, 4, 5, 6, 7], # sequence 1
[6, 5, 10, 11], # sequence 2
[9, 8, 2, 3, 4, 5], # sequence 3
[12, 12, 6, 5], # sequence 4
[5, 8, 3, 4, 2], # sequence 5
[1, 5], # sequence 6
[2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6], # sequence 7
[7, 1, 7, 3, 4, 1, 2], # sequence 8
[9, 4, 12, 12, 6, 5, 1], # sequence 9
]
Essentially, for any list that contains the target number 5 (i.e., target=5) anywhere within the list, what are the top N=2 most frequently observed subsequences with length M=4?
So, the conditions are:
if target doesn't exist in the list then we ignore that list completely
if the list length is less than M then we ignore the list completely
if the list is exactly length M but target is not in the Mth position then we ignore it (but we count it if target is in the Mth position)
if the list length, L, is longer than M and target is in position i = M (or i = M+1, or i = M+2, ..., i = L), then we count the subsequence of length M where target is in the final position of the subsequence
So, using our list-of-lists example, we'd count the following subsequences:
subseqs = [[2, 3, 4, 5], # taken from sequence 1
[2, 3, 4, 5], # taken from sequence 3
[12, 12, 6, 5], # taken from sequence 4
[8, 8, 3, 5], # taken from sequence 7
[1, 4, 12, 5], # taken from sequence 7
[12, 12, 6, 5], # taken from sequence 9
]
Of course, what we want are the top N=2 subsequences by frequency. So, [2, 3, 4, 5] and [12, 12, 6, 5] are the top two most frequent sequences by count. If N=3 then all of the subsequences (subseqs) would be returned since there is a tie for third.
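For concreteness, here is a short brute-force Python sketch of the counting rules above (my own illustration; hopelessly slow at the scale described below, but it pins down the spec):
from collections import Counter

def top_subsequences(sequences, target, M, N):
    # Count every length-M window whose final element is `target`,
    # then return the N most frequent windows, keeping ties.
    counts = Counter()
    for seq in sequences:
        for i, value in enumerate(seq):
            # a window must end at `target` and span exactly M elements
            if value == target and i + 1 >= M:
                counts[tuple(seq[i + 1 - M : i + 1])] += 1
    ranked = counts.most_common()
    if len(ranked) > N:
        # keep everything tied with the N-th most frequent window
        cutoff = ranked[N - 1][1]
        ranked = [(subseq, c) for subseq, c in ranked if c >= cutoff]
    return ranked

# With the example data: top_subsequences(x, target=5, M=4, N=2)
# returns [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]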
Important
This is super simplified but, in reality, my actual list-of-sequences consists of a few billion lists of positive integers (between 1 and 10,000)
each list can be as short as 1 element or as long as 500 elements
N and M can be as small as 1 or as big as 100
My questions are:
Is there an efficient data structure that would allow for fast queries assuming that N and M will always be less than 100?
Are there known algorithms for performing this kind of analysis for various combinations of N and M? I've looked at suffix trees but I'd have to roll my own custom version to even get close to what I need.
For the same dataset, I need to repeatedly query the dataset for various values or different combinations of target, N, and M (where target <= 10,000, N <= 100 and M <= 100). How can I do this efficiently?
Extending on my comment, here is a sketch of how you could approach this using an out-of-the-box suffix array:
1) Reverse each list and concatenate them with a stop symbol (I used 0 here); reversing means every subsequence that ends with the target becomes a suffix that starts with it.
[7, 6, 5, 4, 3, 2, 1, 0, 11, 10, 5, 6, 0, 5, 4, 3, 2, 8, 9, 0, 5, 6, 12, 12, 0, 2, 4, 3, 8, 5, 0, 5, 1, 0, 6, 5, 12, 4, 1, 9, 5, 3, 8, 8, 2, 0, 2, 1, 4, 3, 7, 1, 7, 0, 1, 5, 6, 12, 12, 4, 9]
2) Build a suffix array
[53, 45, 24, 30, 12, 19, 33, 7, 32, 6, 47, 54, 51, 38, 44, 5, 46, 25, 16, 4, 15, 49, 27, 41, 37, 3, 14, 48, 26, 59, 29, 31, 40, 2, 13, 10, 20, 55, 35, 11, 1, 34, 21, 56, 52, 50, 0, 43, 28, 42, 17, 18, 39, 60, 9, 8, 23, 36, 58, 22, 57]
3) Build the LCP array. The LCP array tells you how many numbers a suffix has in common with its neighbour in the suffix array. However, you need to stop counting when you encounter a stop symbol.
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 3, 2, 2, 1, 0, 1, 1, 1, 4, 1, 2, 4, 1, 0, 1, 2, 1, 3, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 2, 0]
4) When a query comes in (target = 5, M = 4), search for the first occurrence of your target in the suffix array and scan the corresponding LCP array until the starting number of the suffixes changes. Below is the part of the LCP array that corresponds to all suffixes starting with 5.
[..., 1, 1, 1, 4, 1, 2, 4, 1, 0, ...]
This tells you that there are two sequences of length 4 that each occur twice. Glossing over some details, you can use the indexes to find the sequences and reverse them back to get your final results.
Complexity
Building the suffix array is O(n) time, where n is the total number of elements in all lists, and O(n) space
Building the LCP array is also O(n) in both time and space
Searching for a target number in the suffix array is O(log n) on average
The cost of scanning through the relevant subsequences is linear in the number of times the target occurs, which should be about n/10,000 on average given your parameters
The first two steps happen offline. Querying is technically O(n) (due to step 4) but with a small constant factor (about 1/10,000).
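For illustration, here is a minimal Python sketch of steps 1-3 (the naive sort-based construction is O(n^2 log n) and only meant to show the idea; at billions of elements you would want a linear-time construction such as the DC3/skew algorithm, and step 4 would then scan sa and lcp as described above):
def build_index(lists, stop=0):
    # Step 1: reverse each list and concatenate with stop symbols
    data = []
    for seq in lists:
        data.extend(reversed(seq))
        data.append(stop)
    # Step 2: suffix array = suffix start positions in lexicographic order
    sa = sorted(range(len(data)), key=lambda i: data[i:])
    # Step 3: lcp[k] = number of elements suffix sa[k] shares with
    # suffix sa[k-1], never counting across a stop symbol
    lcp = [0] * len(data)
    for k in range(1, len(data)):
        i, j = sa[k - 1], sa[k]
        while data[i + lcp[k]] == data[j + lcp[k]] != stop:
            lcp[k] += 1
    return data, sa, lcp

Running this on the example x reproduces the concatenated array from step 1 (plus one trailing stop symbol).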

What's happening with FromDigits?

I thought I knew how FromDigits works, but it's doing something crazy now.
n[[990;;]]
FromDigits[n[[990;;]]]
outputs:
{9, 50, 0, 50, 1, 50, 2, 50, 3, 50, 4, 50, 5, 50, 6, 50, 7, 50, 8, 50, 9}
1405060708091011121309
instead of, you know, 950050150...
What's going on?
Documentation says that
FromDigits: constructs an integer from the list of its decimal digits.
So each number in the array must be less than 10 (a decimal digit) for a simple concatenation.
Digits larger than the base are "carried": for example,
FromDigits[{7, 11, 0, 0, 0, 122}] will give 810122
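To see the carrying explicitly: 7·10^5 + 11·10^4 + 0·10^3 + 0·10^2 + 0·10^1 + 122·10^0 = 700000 + 110000 + 122 = 810122. The same positional arithmetic, applied to the list in the question, yields the puzzling 1405060708091011121309.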
For more information go to http://reference.wolfram.com/language/ref/FromDigits.html
I think "string hacking" might be what you are asking for. This:
myn = {9, 50, 0, 50, 1, 50, 2, 50, 3, 50, 4, 50, 5, 50, 6, 50, 7, 50, 8, 50, 9};
ToExpression[StringReplace[ToString[myn], ", " -> ""]][[1]]
gives you this integer
9500501502503504505506507508509
That turns your list into a string, replaces each comma-space separator with nothing, turns the resulting string back into an expression (still a one-element list because of the curly brackets), and the [[1]] extracts the integer.
A couple of other ways:
FromDigits@Flatten@IntegerDigits@
{9, 50, 0, 50, 1, 50, 2, 50, 3, 50, 4, 50, 5, 50, 6, 50, 7, 50, 8, 50, 9}
9500501502503504505506507508509
(ToString /@ # // StringJoin // ToExpression) &@
{9, 50, 0, 50, 1, 50, 2, 50, 3, 50, 4, 50, 5, 50, 6, 50, 7, 50, 8, 50, 9}
9500501502503504505506507508509

Understand disaster model in PyMC

I am starting to learn PyMC and struggle to understand the very first tutorial's example.
disasters_array = \
np.array([ 4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110, doc='Switchpoint[year]')
early_mean = Exponential('early_mean', beta=1.)
late_mean = Exponential('late_mean', beta=1.)
I don't understand why early_mean and late_mean are modeled as stochastic variables following an exponential distribution with rate = 1. My intuition is that they should be deterministic, calculated using disasters_array and the switchpoint variable, e.g.
@deterministic(plot=False)
def early_mean(s=switchpoint):
    return sum(disasters_array[:(s-1)])/(s-1)

@deterministic(plot=False)
def late_mean(s=switchpoint):
    return sum(disasters_array[s:])/s
disasters_array holds data generated by a Poisson process, under the assumptions of this model. late_mean and early_mean are the parameters associated with this process, depending on where in the time series they apply. The true values of the parameters are unknown, so they are specified as stochastic variables. Deterministic objects are only for nodes that are completely determined by the values of their parents.
Think of early_mean and late_mean stochastics as model parameters, and the Exponential as the prior distribution for these parameters. In the version of the model here, the deterministic r and likelihood D lead to posteriors on early_mean and late_mean through MCMC sampling.
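For reference, here is roughly how the rest of the tutorial model ties these pieces together (a sketch along the lines of the PyMC 2 tutorial, with the deterministic r and the observed likelihood D mentioned above):
import numpy as np
from pymc import DiscreteUniform, Exponential, Poisson, deterministic

switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110)
early_mean = Exponential('early_mean', beta=1.)
late_mean = Exponential('late_mean', beta=1.)

@deterministic(plot=False)
def r(s=switchpoint, e=early_mean, l=late_mean):
    # Poisson rate: early_mean before the switchpoint, late_mean after
    out = np.empty(len(disasters_array))
    out[:s] = e
    out[s:] = l
    return out

# The observed disaster counts tie the priors to the data
D = Poisson('D', mu=r, value=disasters_array, observed=True)

Sampling this model then yields posterior distributions for switchpoint, early_mean and late_mean rather than single point estimates.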

Element-wise maximum value for two lists

Given two lists of data in Mathematica, such as
data1 = {0, 1, 3, 4, 8, 9, 15, 6, 5, 2, 0};
data2 = {0, 1, 2, 5, 8, 7, 16, 5, 5, 2, 1};
how can I create a list giving me the element-wise maximum of the two lists, i.e. how do I obtain
data3 = {0, 1, 3, 5, 8, 9, 16, 6, 5, 2, 1};
?
data1 = {0, 1, 3, 4, 8, 9, 15, 6, 5, 2, 0};
data2 = {0, 1, 2, 5, 8, 7, 16, 5, 5, 2, 1};
Max /@ Transpose[{data1, data2}]
(* {0, 1, 3, 5, 8, 9, 16, 6, 5, 2, 1} *)
Another possible solution is to use the MapThread function:
data3 = MapThread[Max, {data1, data2}]
belisarius' solution, however, is much faster.
Simplest, though not the fastest:
Inner[Max, data1, data2, List]

Group time intervals by date (in d3.js)

For instance, there is an array of objects with start, end and duration (in hours) attributes.
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8),
duration: 8},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 5, 2),
duration: 4},
{start: new Date(2013, 2, 5, 5),
end: new Date(2013, 2, 7, 5),
duration: 48}]
I'd like to visualize them as something like the following (y: hours, x: dates):
I'm thinking about creating additional objects to fill the empty spaces between events, like this:
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8),
status: "busy"},
{start: new Date(2013, 2, 4, 8, 0, 1),
end: new Date(2013, 2, 4, 21, 59, 59),
status: "free"},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 4, 23, 59, 59),
status: "busy"},
{start: new Date(2013, 2, 5, 0),
end: new Date(2013, 2, 5, 2),
status: "busy"}]
And then map this to Stack Layout.
So my question is: what is the best way to split and group the array to make this visualization easier? Maybe there are some built-in D3.js features for this?
I would consider changing the data format to
[{start: new Date(2013, 2, 4, 0),
end: new Date(2013, 2, 4, 8)},
{start: new Date(2013, 2, 4, 22),
end: new Date(2013, 2, 5, 2)},
{start: new Date(2013, 2, 5, 5),
end: new Date(2013, 2, 7, 5)}]
Since you have the start and end date, you don't really need a duration. Alternatively you could have just the start date and a duration.
I'm not extremely familiar with the stack layout, but it might be sufficient (and easier) for this project to simply append rect elements at the right positions. I made an example here: http://tributary.io/inlet/5841372, which doesn't take into account the fact that you need to wrap events that start on one day and end on the next. It just displays all events in the same column, with the white space representing free time.
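If you do need that day-wrapping, one option is to split each event at midnight boundaries before drawing, so every piece fits inside a single day's column. A sketch of the splitting logic (in Python for brevity, since the logic is language-agnostic; split_at_midnight is a hypothetical helper name):
from datetime import datetime, timedelta

def split_at_midnight(events):
    # Split each {start, end} event into per-day pieces so each piece
    # can be drawn as one rect in its day's column.
    pieces = []
    for ev in events:
        start, end = ev['start'], ev['end']
        while start < end:
            # midnight immediately following `start`
            next_day = datetime(start.year, start.month, start.day) + timedelta(days=1)
            pieces.append({'start': start, 'end': min(end, next_day), 'status': 'busy'})
            start = next_day
    return pieces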
