I am running some performance benchmarks on RethinkDB (related to a specific use-case). In my simulation, there are 2 tables: contact and event. A contact has many events. The event table has 2 indexes: contact_id and a compound index on [campaign_id, node_id, event_type]. The contact table has about 500k contacts, and the event table has about 1.75 million docs.
The query I am struggling with is to find all the contacts who have a 'sent' event_type but not an 'open' event_type. Following is the query I got to work:
r.table("events").
get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
.set_difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)
.count.run(conn)
But this query uses a set difference, not a stream difference. I have also tried the difference operator:
r.table("events").
get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'] .difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'])
.count.run(conn)
This query never finishes, and the weird thing is that even after aborting the query I can see (in the RethinkDB dashboard) that the reads don't stop.
What's the most efficient way of doing this kind of query?
Follow-up: find all the male contacts who have a 'sent' event_type but not an 'open' event_type. What I have now is:
r.table("contacts").get_all(r.args(
r.table("events").get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
.set_difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)))
.filter({gender: 1}).count.run(conn)
One way to make this efficient is to denormalize your data. Instead of having separate contact and event tables, just have the contact table and make each contact have an array of events. Then you can write:
r.table('contacts').indexCreate('sent_but_not_open', function(row) {
  // true for contacts whose embedded events array contains 'sent' but not 'open'
  return row('events').contains('sent').and(
    row('events').contains('open').not());
});
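With an index like that in place, the lookup itself becomes a simple getAll on the boolean value. A minimal sketch with the JavaScript driver, assuming an open connection conn and the gender field from the follow-up question:

// Count contacts flagged by the boolean 'sent_but_not_open' index defined above.
r.table('contacts')
  .getAll(true, {index: 'sent_but_not_open'})
  .count()
  .run(conn, function(err, result) {
    if (err) throw err;
    console.log(result);
  });

// The follow-up (male contacts only) is just a filter on the same indexed stream.
r.table('contacts')
  .getAll(true, {index: 'sent_but_not_open'})
  .filter({gender: 1})
  .count()
  .run(conn, function(err, result) {
    if (err) throw err;
    console.log(result);
  });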
That will work well if the number of events per contact is smallish. If you have thousands or millions of events per contact it will break down though.
RethinkDB doesn't offer a way to diff two streams lazily on the server. The best you could do is to change your compound index to be on [campaign_id, node_id, event_type, contact_id] instead, replace your get_all([1, 5, 'sent'], {index: 'campaign'}) with .between([1, 5, 'sent', r.minval], [1, 5, 'sent', r.maxval], {index: 'campaign'}), and then put .distinct({index: 'campaign'})['contact_id'] on the end. That will give you a stream of distinct contact IDs rather than an array, and these contact IDs will be sorted. You can then do the same for the 'open' events, and diff the two ordered streams in the client by doing a mergesort-like thing.
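A rough, hedged sketch of that approach with the JavaScript driver: it assumes the 'campaign' index has been rebuilt as [campaign_id, node_id, event_type, contact_id], pulls contact_id out of the index value with nth(3) (the index value is an array), and materializes the two sorted ID lists before diffing them; a fully lazy version would walk the two cursors instead:

// Distinct, sorted contact IDs for one event type, read off the rebuilt index.
function sortedContactIds(eventType) {
  return r.table('events')
    .between([1, 5, eventType, r.minval], [1, 5, eventType, r.maxval], {index: 'campaign'})
    .distinct({index: 'campaign'})
    .map(function(idx) { return idx.nth(3); })  // contact_id is the 4th index component
    .coerceTo('array')
    .run(conn);
}

// Mergesort-style difference of two sorted arrays: IDs in `sent` but not in `open`.
function sortedDifference(sent, open) {
  var out = [], i = 0, j = 0;
  while (i < sent.length) {
    if (j >= open.length || sent[i] < open[j]) { out.push(sent[i]); i++; }
    else if (sent[i] > open[j]) { j++; }
    else { i++; j++; }  // ID present in both streams, skip it
  }
  return out;
}

Promise.all([sortedContactIds('sent'), sortedContactIds('open')]).then(function(res) {
  console.log(sortedDifference(res[0], res[1]).length);
});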
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination, so I need the ORDER BY. How can I improve the performance of this query?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If the only label that can live in an apartment is Tenant, then you could possibly save a Filter step by dropping the Tenant label from your query, like MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to check the profile of your query and see whether it uses an index-backed ORDER BY.
What sort of numbers are you expecting back from this query? What's the worst-case number of tenants in a given city?
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relation from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant, with no major gains. I tested with 150,000 tenants in City1; perhaps the gains are more visible as you approach millions, but I'm not sure.
I am currently trying to use Google Sheets to sort a table based on the unique values found in only two of the columns.
What I want to happen is that columns A (Brand) and B (Dimensions) are both checked for unique values, removing any duplicate data.
The problem is using these two columns to filter and show the rest of the table. I can't manage to achieve it.
Original Data
What it should look like after being culled
You can use the QUERY function:
=query(A1:D8,"select A, B, min(C), min(D) group by A, B",1)
Example spreadsheet
try:
=ARRAYFORMULA(SORT(SPLIT(TRANSPOSE(QUERY(TRANSPOSE(SORTN(
IF(A2:C<>"", {"♦"&A2:A&"♦"&B2:B, "♦"&C2:D}, ), 99^99, 2, 1, 1))
,,99^99)), "♦"), 2, 1))
I am reading the Stripe documentation and I want to fetch the first charge of a customer. Currently I am doing:
charge_list = Stripe::Charge.list(
{
customer: "cus_xxx"
},
"sk_test_xxxxxx"
)
first_charge = charge_list.data.last
I take the last element since the Stripe API returns the charge list sorted with the most recent charges appearing first. But I don't think it is a good approach. Can anyone help me with how I can fetch the first charge by a customer, or how I can sort the list in descending order of created date so that I can get the first object from the array?
It seems there is no reverse-order sorting feature in the Stripe API.
Also remember the first charge may not be in the first page of the result set, so you have to iterate using #auto_paging_each.
A quick possible solution:
charge_list = Stripe::Charge.list(
{customer: "cus_xxx", limit: 100 }, # limit 100 to reduce request
"sk_test_xxxxxx")
first_charge = nil
charge_list.auto_paging_each {|c| first_charge = c }
You may want to persist the result somewhere since it is a heavy operation.
But the cleanest solution IMO would be to store all charge records into your DB and make subsequent queries against it.
I'd like to group by multiple fields, between two timestamps.
I tried something like:
r.table('my_table').between(r.time(2015, 1, 1, 'Z'), r.now(), {index: "timestamp"}).group("field_a", "field_b").count()
This takes a lot of time since my table is pretty big. I started thinking about using an index in the group part of the query, but then I remembered that it's impossible to use more than one index in the same ReQL query.
Can I achieve what I need efficiently?
You could create a compound index, and then efficiently compute the count for any of the groups without computing all of them:
r.table('my_table').indexCreate('compound', function(row) {
return [row('field_a'), row('field_b'), row('timestamp')];
})
r.table('my_table').between(
  [field_a_val, field_b_val, r.time(2015, 1, 1, 'Z')],
  [field_a_val, field_b_val, r.now()],
  {index: 'compound'}
).count()
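If you need counts for several known (field_a, field_b) pairs, one hedged option (the pair values below are placeholders) is to issue one indexed count per pair and collect them client-side:

// Sketch: count a handful of known groups without scanning the whole table.
var pairs = [['a1', 'b1'], ['a1', 'b2'], ['a2', 'b1']];  // placeholder group keys

Promise.all(pairs.map(function(pair) {
  return r.table('my_table')
    .between(
      pair.concat([r.time(2015, 1, 1, 'Z')]),  // [field_a, field_b, start]
      pair.concat([r.now()]),                  // [field_a, field_b, end]
      {index: 'compound'})
    .count()
    .run(conn);
})).then(function(counts) {
  pairs.forEach(function(pair, i) {
    console.log(pair.join('/'), counts[i]);
  });
});

Each count is an indexed range read over just that group's rows, so the cost scales with the number of groups you ask about rather than with the size of the table.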
It's said that using skip() for pagination in a MongoDB collection with many records is slow and not recommended.
Ranged pagination (based on a greater-than _id comparison) could be used:
db.items.find({_id: {$gt: ObjectId('4f4a3ba2751e88780b000000')}});
It's good for displaying prev & next buttons, but it's not very easy to implement when you want to display actual page numbers like 1 ... 5 6 7 ... 124, because you need to pre-calculate from which _id each page starts.
So I have two questions:
1) When should I start worrying about that? When are there "too many records", with a noticeable slowdown for skip()? 1,000? 1,000,000?
2) What is the best approach to show links with actual page numbers when using ranged pagination?
Good question!
"How many is too many?" - that, of course, depends on your data size and performance requirements. I, personally, feel uncomfortable when I skip more than 500-1000 records.
The actual answer depends on your requirements. Here's what modern sites do (or, at least, some of them).
First, the navbar looks like this:
1 2 3 ... 457
You get the final page number from the total record count and the page size. Let's jump to page 3. That will involve some skipping from the first record. When the results arrive, you know the id of the first record on page 3.
1 2 3 4 5 ... 457
Let's skip some more and go to page 5.
1 ... 3 4 5 6 7 ... 457
You get the idea. At each point you see first, last and current pages, and also two pages forward and backward from the current page.
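As a hedged sketch (the function and parameter names are mine), the window of visible page labels can be computed from the current page and the total page count:

// Build the list of page labels to render, e.g. [1, '...', 3, 4, 5, 6, 7, '...', 457].
function pageWindow(current, total, radius) {
  radius = radius || 2;  // how many pages to show on each side of the current one
  var lo = Math.max(2, current - radius);
  var hi = Math.min(total - 1, current + radius);
  var pages = [1];
  if (lo > 2) pages.push('...');
  for (var p = lo; p <= hi; p++) pages.push(p);
  if (hi < total - 1) pages.push('...');
  if (total > 1) pages.push(total);
  return pages;
}

// pageWindow(1, 457) -> [1, 2, 3, '...', 457]
// pageWindow(5, 457) -> [1, '...', 3, 4, 5, 6, 7, '...', 457]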
Queries
var current_id; // id of first record on current page.

// go to page current+N
db.collection.find({_id: {$gte: current_id}})
  .skip(N * page_size)
  .limit(page_size)
  .sort({_id: 1});

// go to page current-N
// note that due to the nature of skipping back,
// this query will get you records in reverse order
// (last records on the page being first in the resultset)
// You should reverse them in the app.
db.collection.find({_id: {$lt: current_id}})
  .skip((N - 1) * page_size)
  .limit(page_size)
  .sort({_id: -1});
It's hard to give a general answer because it depends a lot on what query (or queries) you are using to construct the set of results that are being displayed. If the results can be found using only the index and are presented in index order then db.dataset.find().limit().skip() can perform well even with a large number of skips. This is likely the easiest approach to code up. But even in that case, if you can cache page numbers and tie them to index values you can make it faster for the second and third person that wants to view page 71, for example.
In a very dynamic dataset where documents will be added and removed while someone else is paging through data, such caching will become out-of-date quickly and the limit and skip method may be the only one reliable enough to give good results.
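One hedged way to do that caching (the page_starts collection and helper names here are illustrative, not from the answer): the first reader of a page pays for the skip and records the _id the page starts at, and later readers jump straight to it with a ranged query:

// Look up (or compute and cache) the _id that a given page starts at.
function firstIdOfPage(page, pageSize) {
  var cached = db.page_starts.findOne({page: page});
  if (cached) return cached.start_id;

  var doc = db.dataset.find({}, {_id: 1})
    .sort({_id: 1})
    .skip((page - 1) * pageSize)
    .limit(1)
    .next();
  db.page_starts.updateOne(
    {page: page},
    {$set: {start_id: doc._id}},
    {upsert: true});
  return doc._id;
}

// Fetch a page by ranging from its cached starting _id.
function fetchPage(page, pageSize) {
  return db.dataset.find({_id: {$gte: firstIdOfPage(page, pageSize)}})
    .sort({_id: 1})
    .limit(pageSize)
    .toArray();
}

As noted above, entries in such a cache go stale as documents are inserted or removed, so they need to be expired or recomputed periodically.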
I recently encountered the same problem when trying to paginate a request while using a field that wasn't unique, for example "FirstName". The idea of this query is to implement pagination on a non-unique field without using skip().
The main problem here is being able to query for a field that is not unique ("FirstName"), because the following will happen:
$gt: {"FirstName": "Carlos"} -> this will skip all the records where first name is "Carlos"
$gte: {"FirstName": "Carlos"} -> will always return the same set of data
Therefore, the solution I came up with was to make the $match portion of the query unique by combining the targeted search field with a secondary field (_id).
Ascending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$gt: 'Carlos'}}]}},
{$sort: {'FirstName': 1, '_id': 1}},
{$limit: 10}
])
Descending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$lt: 'Carlos'}}]}},
{$sort: {'FirstName': -1, '_id': 1}},
{$limit: 10}
])
The $match part of this query is basically behaving as an if statement:
if firstName is "Carlos" then it needs to also be greater than this id
if firstName is not equal to "Carlos" then it needs to be greater than "Carlos"
The only problem is that you cannot navigate to a specific page number (it can probably be done with some code manipulation), but other than that, it solved my problem with pagination on non-unique fields without having to use skip, which eats a lot of memory and processing power when you get toward the end of whatever dataset you are querying.