RethinkDB: Can I group by fields between dates efficiently?

I'd like to group by multiple fields, between two timestamps.
I tried something like:
r.table('my_table').between(r.time(2015, 1, 1, 'Z'), r.now(), {index: "timestamp"}).group("field_a", "field_b").count()
This takes a lot of time since my table is pretty big. I started thinking about using an index in the group part of the query, but then I remembered it's impossible to use more than one index in the same ReQL query.
Can I achieve what I need efficiently?

You could create a compound index, and then efficiently compute the count for any of the groups without computing all of them:
r.table('my_table').indexCreate('compound', function(row) {
  return [row('field_a'), row('field_b'), row('timestamp')];
})

r.table('my_table').between(
  [field_a_val, field_b_val, r.time(2015, 1, 1, 'Z')],
  [field_a_val, field_b_val, r.now()],
  {index: 'compound'}
).count()

Related

Optimize query with filter and orderBy

I have a table Items with ~20k records. I need to be able to filter on multiple tags (OR semantics) while maintaining the order of the items, sorted by sticky then created-date.
For example I may fetch items tagged with tagA, tagB or tagC etc. If an item is sticky (bool), it is sorted first, then the rest of the items are sorted by created-date.
Currently my query looks like this:
Item.orderBy({ index: r.desc('priority') }) // priority = doc => [doc('sticky'), doc('created').toEpochTime()]
.filter(row => r.expr(['tagA', 'tagB','tagC']).setIntersection(row('tags')).count().ne(0))
.slice(offset, offset + limit); // 0 to 10 - pagination
But I feel this performs quite a few reads. Is there a way to optimize it further?
I've tried using a multi-index on the tags, and then doing a regular sort without an index. This seemed to do fewer reads, but was much slower.

Efficient way to get difference of two streams in RethinkDB

I am running some performance benchmarks on RethinkDB (related to a specific use-case). In my simulation there are 2 tables: contact and event. A contact has many events. The event table has 2 indices: contact_id, and a compound index on [campaign_id, node_id, event_type]. The contact table has about 500k contacts, and the event table about 1.75 million docs.
The query I am struggling with is to find all the contacts who have a 'sent' event but not an 'open' event. The following is the query I got to work:
r.table("events")
  .get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
  .set_difference(
    r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)
  .count.run(conn)
But this query uses a set difference, not a stream difference. I have also tried using the difference operator:
r.table("events")
  .get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id']
  .difference(
    r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'])
  .count.run(conn)
This query never finishes, and the weird thing is that even after aborting the query I can see (in the RethinkDB dashboard) that the reads don't stop.
What's the most efficient way of doing this kind of query?
Follow-up: find all the male contacts who have a 'sent' event but not an 'open' event. What I have now is:
r.table("contacts").get_all(r.args(
    r.table("events").get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
      .set_difference(
        r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)))
  .filter({gender: 1}).count.run(conn)
One way to make this efficient is to denormalize your data. Instead of having separate contact and event tables, just have the contact table and make each contact have an array of events. Then you can write:
r.table('contacts').indexCreate('sent_but_not_open', function(row) {
  return row('events').contains('sent').and(
    row('events').contains('open').not());
});
That will work well if the number of events per contact is smallish. If you have thousands or millions of events per contact it will break down though.
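In plain JavaScript terms, the boolean that index function computes for each contact looks like this (a sketch on an ordinary object rather than a ReQL row; you would then fetch the matching contacts with getAll(true, {index: 'sent_but_not_open'})):

```javascript
// Sketch of the predicate the 'sent_but_not_open' index evaluates,
// applied to a plain contact document with an `events` array.
function sentButNotOpen(contact) {
  return contact.events.includes('sent') && !contact.events.includes('open');
}
```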
RethinkDB doesn't offer a way to diff two streams lazily on the server. The best you could do is to change your compound index to be on [campaign_id, node_id, event_type, contact_id] instead, replace your get_all([1, 5, 'sent'], {index: 'campaign'}) with .between([1, 5, 'sent', r.minval], [1, 5, 'sent', r.maxval], {index: 'campaign'}), and then put .distinct({index: 'campaign'})['contact_id'] on the end. That will give you a stream of distinct contact IDs rather than an array, and these contact IDs will be sorted. You can then do the same for the 'open' events, and diff the two ordered streams in the client by doing a mergesort-like thing.
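That client-side, mergesort-like diff might look like this in JavaScript (a sketch assuming both inputs have already been collected into sorted arrays of distinct IDs, which is what the two ordered distinct streams yield):

```javascript
// Return the IDs present in `sent` but not in `open`, in a single merge pass.
// Both inputs must be sorted and duplicate-free.
function sortedDifference(sent, open) {
  const result = [];
  let i = 0, j = 0;
  while (i < sent.length) {
    if (j >= open.length || sent[i] < open[j]) {
      result.push(sent[i]); // sent[i] cannot appear later in the sorted `open`
      i++;
    } else if (sent[i] === open[j]) {
      i++; j++;             // present in both streams: skip it
    } else {
      j++;                  // open[j] < sent[i]: advance the `open` cursor
    }
  }
  return result;
}
```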

How can I descending sort a grouping based on the count of the reduction array in rethinkdb

Importing this dataset as a table:
https://data.cityofnewyork.us/Housing-Development/Registration-Contacts/feu5-w2e2#revert
I use the following query to perform an aggregation, and then attempt to sort in descending order based on the reduction field. My intention is to sort based on the count of that field, or to have the aggregation create a second field called count and sort the grouping results in descending order of the reduction array's length. How can this be done in RethinkDB?
query:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").ungroup().orderBy(r.desc('reduction'))
I don't quite understand what you're going for, but does this do what you want? If not, what do you want to be different in the output?
r.table("contacts")
.filter({"Type": "Agent","ContactDescription" : "CONDO"})
.hasFields("CorporationName")
.group("CorporationName")
.ungroup()
.merge(function(row){ return {count: row('reduction').count()}; })
.orderBy(r.desc('count'))
You are almost there:
r.table("contacts").filter({"Type": "Agent","ContactDescription" : "CONDO"}).hasFields("CorporationName").group("CorporationName").count().ungroup().orderBy(r.desc('reduction'))
See that .count()? That is a map-reduce operation to get the count of each group.
I haven't tested the query on your dataset. Please comment in case you have problems with it.
EDIT:
If you want to add a count field and preserve the original document, you need to use map and reduce. In your case, it should be something like:
r.table("contacts").filter({"Type": "Agent", "ContactDescription": "CONDO"})
  .hasFields("CorporationName")
  .group("CorporationName")
  .map(r.row.merge({count: 1}))
  .reduce(function(left, right) {
    return {
      count: left('count').add(right('count')),
      <YOUR_OTHER_FIELDS>: left('<YOUR_OTHER_FIELDS>'),
      ...
    };
  })
  .ungroup().orderBy(r.desc(r.row('reduction')('count')))
EDIT:
I am not sure whether this does the trick, but it is worth a try:
.reduce(function(left, right) {
  return left.merge({count: left('count').add(right('count'))});
})

olap4J - calculations on member grouping

I'm trying to write an olap4j (Mondrian) query that will group the rows by ranges.
Assume we have counts of cards per child, and the children's ages.
I want to sum the card counts by age range, so that I get counts for ages 0-5, 5-10, 10-15, and so on.
Can this be done with olap4j?
You need to define calculated members for that:
WITH MEMBER [Age].[0-4] AS Aggregate([Age].[0]:[Age].[4])
     MEMBER [Age].[5-9] AS Aggregate([Age].[5]:[Age].[9])
etc.
Alternatively, you may want to re-design your dimension table. I'm guessing you have age as a degenerate dimension in the fact table. I suggest creating a separate dimension dim_age with a structure like this:
age_id, age, age_group
0, null, null
1, 0, 0-4
2, 1, 0-4
(...)
Then it's easy to define the first level of the dimension based on age_group.
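As an illustration of the ETL logic behind that table (a JavaScript sketch, not part of Mondrian itself), the age_group column is just a 5-year bucketing of age:

```javascript
// Bucket an age into the 5-year ranges used by the dim_age table
// ("0-4", "5-9", ...); a null age gets no group, matching row age_id 0.
function ageGroup(age) {
  if (age == null) return null;
  const lo = Math.floor(age / 5) * 5;
  return lo + '-' + (lo + 4);
}
```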

MongoDB ranged pagination

It's said that using skip() for pagination in a MongoDB collection with many records is slow and not recommended.
Ranged pagination (based on an _id comparison) can be used instead:
db.items.find({_id: {$gt: ObjectId('4f4a3ba2751e88780b000000')}});
It's good for displaying prev & next buttons, but it's not easy to implement when you want to display actual page numbers (1 ... 5 6 7 ... 124), because you need to pre-calculate which _id each page starts from.
So I have two questions:
1) When should I start worrying about this? How many records are "too many", with a noticeable slowdown for skip()? 1,000? 1,000,000?
2) What is the best approach to show links with actual page numbers when using ranged pagination?
Good question!
"How many is too many?" That, of course, depends on your data size and performance requirements. Personally, I start to feel uncomfortable when I skip more than 500-1,000 records.
The actual answer depends on your requirements. Here's what modern sites do (or, at least, some of them).
First, the navbar looks like this:
1 2 3 ... 457
They get the final page number from the total record count and the page size. Let's jump to page 3. That will involve some skipping from the first record, and when the results arrive you know the id of the first record on page 3.
1 2 3 4 5 ... 457
Let's skip some more and go to page 5.
1 ... 3 4 5 6 7 ... 457
You get the idea. At each point you see the first, last, and current pages, plus two pages forward and backward from the current page.
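A small helper for building that window of page links could look like this (a sketch; the function name is made up):

```javascript
// Build the list of page links to show: first, last, and current +/- 2,
// with '...' wherever a run of pages is elided.
function pageWindow(current, last) {
  const pages = [];
  for (let p = 1; p <= last; p++) {
    if (p === 1 || p === last || Math.abs(p - current) <= 2) {
      pages.push(p);
    } else if (pages[pages.length - 1] !== '...') {
      pages.push('...'); // collapse each elided run into a single marker
    }
  }
  return pages;
}
```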
Queries
var current_id; // id of first record on current page.
// go to page current+N
db.collection.find({_id: {$gte: current_id}}).
skip(N * page_size).
limit(page_size).
sort({_id: 1});
// go to page current-N
// note that due to the nature of skipping back,
// this query will get you records in reverse order
// (last records on the page being first in the resultset)
// You should reverse them in the app.
db.collection.find({_id: {$lt: current_id}}).
skip((N-1)*page_size).
limit(page_size).
sort({_id: -1});
It's hard to give a general answer because it depends a lot on what query (or queries) you are using to construct the set of results that are being displayed. If the results can be found using only the index and are presented in index order then db.dataset.find().limit().skip() can perform well even with a large number of skips. This is likely the easiest approach to code up. But even in that case, if you can cache page numbers and tie them to index values you can make it faster for the second and third person that wants to view page 71, for example.
In a very dynamic dataset where documents will be added and removed while someone else is paging through data, such caching will become out-of-date quickly and the limit and skip method may be the only one reliable enough to give good results.
I recently encountered the same problem when trying to paginate on a field that wasn't unique, for example "FirstName". The idea of this query is to implement pagination on a non-unique field without using skip().
The main problem is being able to paginate on a field that is not unique, like "FirstName", because the following happens:
$gt: {"FirstName": "Carlos"} -> this skips all the records where the first name is "Carlos"
$gte: {"FirstName": "Carlos"} -> this always returns the same set of data
Therefore, the solution I came up with was making the $match portion of the query unique by combining the targeted search field with a secondary field (_id), turning it into a unique search.
Ascending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$gt: 'Carlos'}}]}},
{$sort: {'FirstName': 1, '_id': 1}},
{$limit: 10}
])
Descending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$lt: 'Carlos'}}]}},
{$sort: {'FirstName': -1, '_id': 1}},
{$limit: 10}
])
The $match part of this query basically behaves like an if statement:
if firstName is "Carlos", then it must also have an _id greater than the current one
if firstName is not "Carlos", then it must be greater than "Carlos"
The only problem is that you cannot navigate to a specific page number (it can probably be done with some extra code), but otherwise it solved my problem with pagination on non-unique fields, without having to use skip, which eats a lot of memory and processing power as you get towards the end of whatever dataset you are querying.
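That if-statement is a standard keyset (tuple) comparison; a JavaScript sketch of the ascending case, assuming ids compare as strings the way ObjectId hex strings do:

```javascript
// True if `doc` comes strictly after the cursor position
// (lastFirstName, lastId) in (FirstName, _id) ascending order --
// the same condition the $or/$and in the $match stage expresses.
function afterCursor(doc, lastFirstName, lastId) {
  return doc.FirstName > lastFirstName ||
         (doc.FirstName === lastFirstName && doc._id > lastId);
}
```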