MongoDB ranged pagination - performance

It's said that using skip() for pagination in a MongoDB collection with many records is slow and not recommended.
Ranged pagination (based on an _id comparison) can be used instead:
db.items.find({_id: {$gt: ObjectId('4f4a3ba2751e88780b000000')}});
It's good for displaying Prev & Next buttons - but it's not easy to implement when you want to display actual page numbers like 1 ... 5 6 7 ... 124 - you need to pre-calculate which "_id" each page starts from.
So I have two questions:
1) When should I start worrying about this? How many records is "too many" before skip() slows down noticeably - 1,000? 1,000,000?
2) What is the best approach to show links with actual page numbers when using ranged pagination?

Good question!
"How many is too many?" - that, of course, depends on your data size and performance requirements. I, personally, feel uncomfortable when I skip more than 500-1000 records.
The actual answer depends on your requirements. Here's what modern sites do (or, at least, some of them).
First, navbar looks like this:
1 2 3 ... 457
They get the final page number from the total record count and the page size. Let's jump to page 3. That will involve some skipping from the first record. When the results arrive, you know the _id of the first record on page 3.
1 2 3 4 5 ... 457
Let's skip some more and go to page 5.
1 ... 3 4 5 6 7 ... 457
You get the idea. At each point you see first, last and current pages, and also two pages forward and backward from the current page.
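For illustration, the page list above could be computed with something like this (a small sketch; the function name is mine):
function pagesToShow(currentPage, lastPage) {
  // Always include the first, last and current page, plus two pages on each side.
  var pages = [1];
  for (var p = currentPage - 2; p <= currentPage + 2; p++) {
    if (p > 1 && p < lastPage) pages.push(p);
  }
  if (lastPage > 1) pages.push(lastPage);
  return pages; // render "..." wherever two neighbours are not consecutive
}
// pagesToShow(5, 457) -> [1, 3, 4, 5, 6, 7, 457], i.e. 1 ... 3 4 5 6 7 ... 457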
Queries
var current_id; // id of first record on current page.
// go to page current+N
db.collection.find({_id: {$gte: current_id}}).
skip(N * page_size).
limit(page_size).
sort({_id: 1});
// go to page current-N
// note that due to the nature of skipping back,
// this query will get you records in reverse order
// (last records on the page being first in the resultset)
// You should reverse them in the app.
db.collection.find({_id: {$lt: current_id}}).
skip((N-1)*page_size).
limit(page_size).
sort({_id: -1});

It's hard to give a general answer because it depends a lot on what query (or queries) you are using to construct the set of results being displayed. If the results can be found using only the index and are presented in index order, then db.dataset.find().limit().skip() can perform well even with a large number of skips. This is likely the easiest approach to code up. But even in that case, if you can cache page numbers and tie them to index values, you can make it faster for the second and third person who want to view page 71, for example.
In a very dynamic dataset where documents will be added and removed while someone else is paging through data, such caching will become out-of-date quickly and the limit and skip method may be the only one reliable enough to give good results.
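A rough sketch of that caching idea, assuming an items collection paged by _id (the cache and helper are mine, and as noted above the cache goes stale if documents are added or removed):
var pageStartCache = {}; // page number -> _id of the first document on that page

function fetchPage(pageNumber, pageSize) {
  var startId = pageStartCache[pageNumber];
  var docs;
  if (startId) {
    // Someone already visited this page: ranged query, no skipping.
    docs = db.items.find({_id: {$gte: startId}})
                   .sort({_id: 1}).limit(pageSize).toArray();
  } else {
    // First visit: fall back to skip(), then remember where the page starts.
    docs = db.items.find().sort({_id: 1})
                   .skip((pageNumber - 1) * pageSize).limit(pageSize).toArray();
  }
  if (docs.length > 0) pageStartCache[pageNumber] = docs[0]._id;
  return docs;
}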

I recently encountered the same problem when trying to paginate on a field that wasn't unique, for example "FirstName". The idea of this query is to implement pagination on a non-unique field without using skip().
The main problem is querying on a field that is not unique, such as "FirstName", because the following happens:
{"FirstName": {$gt: "Carlos"}} -> this skips all the records where the first name is "Carlos"
{"FirstName": {$gte: "Carlos"}} -> this always returns the same set of data
Therefore, the solution I came up with was to make the $match portion of the query unique by combining the targeted search field with a secondary, unique field.
Ascending order:
db.customers.aggregate([
  {$match: {$or: [
    {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]},
    {'FirstName': {$gt: 'Carlos'}}
  ]}},
  {$sort: {'FirstName': 1, '_id': 1}},
  {$limit: 10}
])
Descending order:
db.customers.aggregate([
  {$match: {$or: [
    {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]},
    {'FirstName': {$lt: 'Carlos'}}
  ]}},
  {$sort: {'FirstName': -1, '_id': 1}},
  {$limit: 10}
])
The $match part of this query is basically behaving as an if statement:
if firstName is "Carlos" then it needs to also be greater than this id
if firstName is not equal to "Carlos" then it needs to be greater than "Carlos"
The only problem is that you cannot navigate to a specific page number (it can probably be done with some extra code), but otherwise it solved my problem with pagination on non-unique fields without having to use skip(), which eats a lot of memory and processing power as you get towards the end of whatever dataset you are querying.
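Parameterised by the last document of the current page, the ascending query above can be wrapped like this (a sketch; the nextPage helper is mine):
function nextPage(lastDoc, pageSize) {
  // lastDoc is the final document of the page just displayed (ascending order).
  return db.customers.aggregate([
    {$match: {$or: [
      {$and: [{FirstName: lastDoc.FirstName}, {_id: {$gt: lastDoc._id}}]},
      {FirstName: {$gt: lastDoc.FirstName}}
    ]}},
    {$sort: {FirstName: 1, _id: 1}},
    {$limit: pageSize}
  ]).toArray();
}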

Related

How to view all records with rating 1, 3 and 5 Laravel query

I am trying to filter by rating. A record's rating can be from 1 to 5. The frontend sends a string, for example "1,3,5", which means: show all entries with a rating of 1, 3 and 5 at once.
$reviews = $this->data['sort_by'] = Review::query()
->where('rating', $data['rating'])
->get()
This works with only one value, but I need several at once. The difficulty is that it won't always be "1, 3, 5"; it can be any combination, for example "1, 4, 3" or "3,5,2".
Ideally the query would be composed before the get() method, so that all the work happens at the database level and the server does not have to filter the collection, but in any case I will be happy with any solution that makes only one request to the database. Of course, I could make a separate request for each rating and merge the results later, but that would be bad practice.
Explode the posted rating string into an array and use whereIn() instead of where():
$reviews = $this->data['sort_by'] = Review::query()
->whereIn('rating', explode(',', $data['rating']))
->get()

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination, so I need the ORDER BY. How can I improve the performance of this query?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If Tenant is the only label that can live in an apartment, then you could possibly save a Filter step by removing the Tenant label from your query, like MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to check the profile of your query and see whether it uses an index-backed ORDER BY.
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relation from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant - no major gains. I tested with 150,000 tenants in City1; perhaps the gains are more visible as you approach millions, but I'm not sure.

How to design querying multiple tags on analytics database

I would like to store custom purchase tags on each transaction; for example, if a user bought shoes, the tags are "SPORTS", "NIKE", SHOES, COLOUR_BLACK, SIZE_12, ...
These are the tags the seller is interested in querying back on to understand the sales.
My idea is: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). The codes start with the 26 letters "a-z", then continue with "aa, ab, ac ... zz" and so on. Then keep all the tags given for one transaction in a single varchar column called tag, separated by "|".
Let us assume the mapping is (at the application level):
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So, storing the above purchase transaction, the tag column will look like tag="|a|z|ay|bc|cq|". The seller can then search for the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (sort key in the Redshift db) for a LIKE that starts with %. So how do I solve this issue, since I might have 100 million records? I don't want a full table scan.
Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (cross-reference table) since I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, so my SUM() would be doubled.
I got a suggestion like the one below: put
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
in the WHERE clause once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this, since I have to combine AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOES" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bitfield approach, since I don't know whether Redshift supports it. Also, assuming my system will have a minimum of 3,500 tags and allocating one bit for each, that results in about 437 bytes per transaction, even though at most 5 tags can be given for a transaction. Any optimisation here?
Solution_1:
I have thought of adding min (SMALLINT) and max (SMALLINT) values alongside the tags column, and applying an index on those.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching SHOES (ay = 51) is:
maxTag <= 51 AND tag LIKE %|ay|%
And the query for searching SHOES (ay = 51) AND SIZE_12 (cq = 95) is:
minTag >= 51 AND maxTag <= 95 AND tag LIKE %|ay|%|cq|%
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files are loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api, for example:
aws s3api put-object-tagging --bucket <bucket> --key <key> --tagging "TagSet=[{Key=Addidas,Value=AY}]"
Capture the tags dynamically by passing the key and value as parameters.
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position and a 0 otherwise. For every row, you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; you would still need to look up a value in the middle of the string, but you know exactly which position to check, so you don't need LIKE, just SUBSTRING. For further optimization, you could convert this bit mask to integer values (unique for each sequence) and match on those, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
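For illustration, building such a mask at the application level could look roughly like this (a sketch in JavaScript; the tag positions shown are made up):
var TAG_POSITION = {SPORTS: 0, TENNIS: 1, CRICKET: 2, NIKE: 25, ADIDAS: 26}; // hypothetical positions
var TOTAL_TAGS = 3500;

function buildTagMask(tags) {
  var bits = new Array(TOTAL_TAGS).fill('0');
  tags.forEach(function (tag) {
    bits[TAG_POSITION[tag]] = '1';
  });
  return bits.join(''); // store this string in the varchar column
}
// To check a tag in SQL, read its fixed position instead of scanning with LIKE,
// e.g. SUBSTRING(tag_mask, position + 1, 1) = '1'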
UPD: It looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Lookups for orders with a particular tag, and further aggregations of those orders, should then be efficient. There is no silver bullet for optimizing this in a flat table, at least none that I know of that would not bring a lot of unnecessary complexity compared to the "relational" solution.

Optimize query with filter and orderBy

I have a table Items with ~20k records. I need to be able to filter on multiple tags (OR) while maintaining the order of the items, sorted by sticky and then created date.
For example, I may fetch items tagged with tagA, tagB or tagC. If an item is sticky (bool), it is sorted first; the rest of the items are sorted by created date.
Currently my query looks like this:
Item.orderBy({ index: r.desc('priority') }) // priority = doc => [doc('sticky'), doc('created').toEpochTime()]
.filter(row => r.expr(['tagA', 'tagB','tagC']).setIntersection(row('tags')).count().ne(0))
.slice(offset, offset + limit); // 0 to 10 - pagination
But I feel this performs quite a few reads. Is there a way to optimize it further?
I've tried using a multi-index on the tags and then doing a regular sort without an index. That seemed to do fewer reads, but was much slower.

Efficient way to get difference of two streams in RethinkDB

I am running some performance benchmarks on RethinkDB (related to a specific use case). In my simulation, there are 2 tables: contact and event. A contact has many events. The event table has 2 indexes: contact_id and a compound index on [campaign_id, node_id, event_type]. The contact table has about 500k contacts, and there are about 1.75 million docs in the event table.
The query I am struggling with is to find all the contacts who have a sent event_type but not an open event_type. The following is the query I got to work:
r.table("events").
get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
.set_difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)
.count.run(conn)
But this query uses set difference, not stream difference. I have also tried using difference operator:
r.table("events").
get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'] .difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'])
.count.run(conn)
This query never finishes, and the weird thing is that even after aborting the query I can see (in the RethinkDB dashboard) that the reads don't stop.
What's the most efficient way of doing these kinds of queries?
Follow-up: find all the male contacts who have a sent event_type but not an open event_type. What I have now is:
r.table("contacts").get_all(r.args(
r.table("events").get_all([1, 5, 'sent'], {index: 'campaign'})['contact_id'].distinct
.set_difference
(r.table("events").get_all([1, 5, 'open'], {index: 'campaign'})['contact_id'].distinct)))
.filter({gender: 1}).count.run(conn)
One way to make this efficient is to denormalize your data. Instead of having separate contact and event tables, just have the contact table and make each contact have an array of events. Then you can write:
r.table('contacts').indexCreate('sent_but_not_open', function(row) {
  return row('events').contains('sent').and(
    row('events').contains('open').not());
});
That will work well if the number of events per contact is smallish. If you have thousands or millions of events per contact it will break down though.
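Presumably you would then query that index along these lines (a sketch):
// Contacts whose 'sent_but_not_open' index value is true.
r.table('contacts')
  .getAll(true, {index: 'sent_but_not_open'})
  .count()
  .run(conn);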
RethinkDB doesn't offer a way to diff two streams lazily on the server. The best you could do is change your compound index to be on [campaign_id, node_id, event_type, contact_id] instead, replace your get_all([1, 5, 'sent'], {index: 'campaign'}) with .between([1, 5, 'sent', r.minval], [1, 5, 'sent', r.maxval], {index: 'campaign'}), and then put .distinct({index: 'campaign'})['contact_id'] on the end. That will give you a stream of distinct contact IDs rather than an array, and these contact IDs will be sorted. You can then do the same for the 'open' events and diff the two ordered streams on the client by doing a mergesort-like thing.
