How to exclude a large number of IDs from an Elasticsearch query

I'm working on an app similar to Tinder. In Elasticsearch I have a collection of about half a million users and their locations. Whenever a user opens the app to search for nearby users, I run an Elasticsearch query over that collection. The query is fairly complex: it takes into account not only the location but also how active the user is and how many photos they have.
What I struggle with is how to exclude from the query those users the current user has already swiped through. A naive way to implement this would be to maintain a nested array of user IDs as part of every user document in the index and exclude based on that. But since every user makes tens of thousands of swipes, that array could grow very large, so it's not a scalable solution.
Is there a way to exclude a large number of entities from an Elasticsearch query based on their IDs without hurting performance?

Use the terms lookup mechanism of the terms query. From the documentation:
When it’s needed to specify a terms filter with a lot of terms it can be beneficial to fetch those term values from a document in an index. A concrete example would be to filter tweets tweeted by your followers. Potentially the amount of user ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
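As a rough sketch of how the terms lookup could be used for exclusion: the snippet below builds a request body that puts a terms lookup inside a must_not clause. It assumes a bool query (Elasticsearch 2.0 or later), a hypothetical index named "swipes" holding one document per user with a "swiped_ids" array, and a hypothetical "user_id" field on the searched user documents. None of these names come from the question.

```python
def exclude_swiped_query(user_id, other_filters,
                         lookup_index="swipes", lookup_path="swiped_ids"):
    """Build a search body that excludes every ID stored in a lookup document.

    lookup_index, lookup_path and the "user_id" field are hypothetical names.
    """
    return {
        "query": {
            "bool": {
                # the existing (location, activity, ...) constraints go here
                "filter": list(other_filters),
                "must_not": [
                    {
                        "terms": {
                            # Elasticsearch fetches the ID list from the
                            # lookup document instead of the request body
                            "user_id": {
                                "index": lookup_index,
                                "id": user_id,
                                "path": lookup_path,
                            }
                        }
                    }
                ],
            }
        }
    }


body = exclude_swiped_query("42", [{"term": {"active": True}}])
```

The ID list never travels with the request, so the request stays small even when the lookup document holds many IDs. Whether that is fast enough for tens of thousands of IDs per user would still need to be measured.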

You can try adding the ids filter into a bool/must_not clause of your complex query and see how it behaves.
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            ... <--- your other "must" constraints
          ],
          "must_not": [
            {
              "ids": {
                "values": [ "id1", "id2", "id3" ] <--- your list of ids to exclude
              }
            }
          ]
        }
      }
    }
  }
}
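The filtered query used above was removed in Elasticsearch 5.0. On newer versions the same idea can be expressed with a plain bool query. A minimal sketch that builds such a body; the function name and the sample constraint are made up for illustration:

```python
def exclude_ids_query(must_clauses, excluded_ids):
    """Build a bool query that keeps the given constraints and drops some IDs."""
    return {
        "query": {
            "bool": {
                "must": list(must_clauses),
                "must_not": [
                    # the ids query matches documents by their _id
                    {"ids": {"values": list(excluded_ids)}}
                ],
            }
        }
    }


body = exclude_ids_query(
    [{"term": {"active": True}}],   # placeholder for the other constraints
    ["id1", "id2", "id3"],
)
```

Sending the list inline like this works for moderate list sizes. With very long lists the request itself becomes large, which is where the terms lookup approach may be the better fit.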

Related

How to model this multi-tenancy usecase?

We are a multi-tenant platform.
The platform has a construct called Entity.
Users can create entities to model any real-life object eg: Customers, Orders, Payment, Inventory, Cart, pretty much anything.
Each entity will have its set of attributes, for example, a customer entity can have: name, email, phone, address (can be another nested entity), etc.
The requirement is to provide query/OLAP capabilities on these entities. For example, find all customers where name = 'john'.
The requirement includes all types of queries such as DATE RANGE, CONTAINS, LIKE, NUMERIC RANGE, FULL-TEXT Queries, etc. We also need Sorting, Aggregation, Pagination features.
Current design
We use elasticsearch to store entity data.
Each tenant is assigned a separate index.
When an entity is created in a tenant, the corresponding mappings are created inside the associated index. The mappings have roughly the following form:
{
  "properties": {
    "Customer": {
      "properties": {
        "name": {
          "values": {
            "type": "text"
          }
        },
        "age": {
          "values": {
            "type": "integer"
          }
        }
      }
    },
    "Order": {
      "properties": {
        "id": {
          "values": {
            "type": "text"
          }
        },
        "eta": {
          "values": {
            "type": "integer"
          }
        }
      }
    }
    //... other entities of this tenant
  }
}
Major Problems with this design
Ever-growing mappings.
Frequent mapping updates keep the nodes busy propagating cluster state changes, leading to search/indexing latency and occasional timeouts.
Existing mappings can't be altered when needed. We have to go through the entire re-index procedure.
The current design was able to serve us for a few years until recently when the issues started popping up.
What would be a good design to model the above multi-tenancy requirement? Which database solution and schema modeling will be appropriate?
If you decide to stick with ES, you'll need to give your tenants more than one index, i.e. one index per tenant/entity instead of one index per tenant. The reasons for this are exactly the ones you mentioned, i.e. ever-growing mappings and the difficulty of updating existing mappings.
You'll end up with more indexes for sure (N tenants x M entities) and the challenge will be to properly size those indexes in terms of how many primary shards you need for each of their entities. But I think it's even more difficult now that all entities are stored in a single index per tenant, so it'll turn out to be easier to fan out your tenant index into several.
Another option is to come up with a very generic mapping, which only contains typed fields like int_field_1, int_field_2, text_field_1, text_field_2, etc, and you keep a per-tenant mapping between the generic field names and the tenant specific field names:
name -> text_field_1
age -> int_field_1
...
That way you have fewer mappings to manage and more flexibility in what kind of data you can accommodate, but it comes at the price of keeping the above field mapping up to date.
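To make the generic-mapping idea concrete, here is a small sketch of the translation layer the application would need. The tenant name "acme", the FIELD_MAP structure and the helper names are all hypothetical; in practice the mapping would probably live in a database rather than in code.

```python
# Hypothetical per-tenant lookup: tenant -> entity -> tenant field -> generic field
FIELD_MAP = {
    "acme": {
        "Customer": {"name": "text_field_1", "age": "int_field_1"},
    }
}


def to_generic(tenant, entity, doc):
    """Rename a tenant document's fields to the generic field names before indexing."""
    mapping = FIELD_MAP[tenant][entity]
    return {mapping[field]: value for field, value in doc.items()}


def term_query(tenant, entity, field, value):
    """Translate a query on a tenant field into a query on the generic field."""
    generic_field = FIELD_MAP[tenant][entity][field]
    return {"query": {"term": {generic_field: value}}}


indexed_doc = to_generic("acme", "Customer", {"name": "john", "age": 30})
query = term_query("acme", "Customer", "age", 30)
```

Both indexing and querying have to go through the same mapping, and results would need the reverse translation before being returned to the tenant.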
In any case, you need to end up with more indexes for your tenants in order to make it easier to manage their mappings and keep their size at bay. It will also allow you to scale better, because it's easier to spread several smaller indexes over your data nodes than a few very big ones, especially given the new sizing recommendation made by Elastic.

Getting aggregated results with selected facet

Not sure if this is possible, but I'm running into the current issue:
While being on the page, without any facet selected I run a query with some aggregations on my facets.
For example: on the "ladies shoes" page I run a query with "gender=ladies" and category "shoes" as filter, which gives me all the wanted results. Also there is an aggregation on "brand" which returns me all the brands. However, this also contains brands with a count of 0, since they don't match the "ladies shoes" criteria. But since no facet is selected, I can simply hide them, so the user won't see them.
So far, so good.
Now, when I run a query for "ladies shoes from Nike" (brand=nike as filter), I get the same list of aggregations, but now all the brands have a count of 0, except Nike. Now, it's hard to just hide them, since we want to offer the possibility to filter on multiple (available) brands.
What would be the best approach to this, with as few queries as possible?
When you're talking about multi-select faceting as in your example, there is a very handy feature in Elasticsearch: post_filter
The post_filter is applied to the search hits at the very end of a
search request, after aggregations have already been calculated.
All you need to do, is to move your Nike brand filter to the post_filter of the query like this:
{
  "query": {
    ...
  },
  "aggs": {
    ...
  },
  "post_filter": {
    "term": { "brand": "Nike" }
  }
}
This lets you calculate the aggregations across all brands and only afterwards filter the hits down to the selected brand.
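For a multi-select brand facet, the request body could be assembled along these lines. This is only a sketch: the "gender" and "category" field names are guesses based on the question, and it assumes a bool query with a filter clause (Elasticsearch 2.0 or later).

```python
def faceted_search(base_filters, selected_brands, facet_field="brand"):
    """Build a search body where the facet selection does not affect the facet counts."""
    body = {
        # page-level filters (e.g. ladies + shoes) restrict hits AND aggregations
        "query": {"bool": {"filter": list(base_filters)}},
        "aggs": {"brands": {"terms": {"field": facet_field}}},
    }
    if selected_brands:
        # applied after the aggregations are computed, so only the hits shrink
        body["post_filter"] = {"terms": {facet_field: list(selected_brands)}}
    return body


body = faceted_search(
    [{"term": {"gender": "ladies"}}, {"term": {"category": "shoes"}}],
    ["Nike"],
)
```

With this split the brand counts stay the same whether or not Nike is selected, so the other brands remain selectable.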

ElasticSearch content ACL Filtering performance

Following is my content model.
Document(s) are associated with user & group acls defining the principals who have access to the document.
The document itself is a bunch of metadata & a large content body (extracted from pdfs/docs etc).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the acls on the document). He/She could have access to the document owing to user acls or owing to the group the user belongs to.
Both group membership and the ACLs on a document are highly transient: a user's group membership changes quite often, and so do the ACLs on the document itself.
Approach 1
Store the acls on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since the acl can be a group).
At the time of query, append a filter to the user query which will do a bool filter to include only documents with the userid in the acl field
"filter" : {
        "query" : {
            "term": {
                "acls": "1234"
            }
        }
      }
The problem I see with this approach is that documents need to be re-indexed even though their metadata/content has not changed:
Every time a user's group membership changes
Every time the ACL on the document changes (permission changed for the document)
I am assuming that this will lead to a lot of segment creation and merging, especially since the document body (one of the fields of the document) is a pretty large text section.
Approach 2:
This is a modification on the approach 1. This approach attempts to limit the updates on the document when the updates are strictly acl related.
Instead of defining the ACLs on the metadata, this approach entails creating multiple types:
In the Document Index
  Document (with metadata & text body) as a parent
    id
    text
  userschild Document (parent id & user acls only). This document will exist for each parent
    id
    parentid
    useracls
  groupschild Document (parent id & group acls only). This document will exist for each parent with group acls
    id
    parentid
    groupacls
In the Users Index
  An entry for each user in the system with the groups he/she is associated with
  User
    id
    groups
The idea here is that updates are now localized to the different ElasticSearch entities.
In case of user acl changes only the userschild document will get updated (avoiding a potentially costly update on the parent document).
In case of the group acl changes only the groupschild document will get updated (again avoiding a potentially costly update on the parent document).
In case of user group membership changes again only the secondary index will get updated (avoiding the update on the parent document).
The query itself will look as follows.
"filter" : {
  "query" : {
    "bool": {
      "should": [
        {
          "has_child": {
            "type": "userschild",
            "query": {
              "term": {
                "users": "1234"
              }
            }
          }
        },{
          "has_child": {
            "type": "groupschild",
            "query": {
              "terms" : {
                "groups" : {
                  "index" : "users",
                  "type" : "user",
                  "id" : "1234",
                  "path" : "groups"
                }
              }
            }
          }
        }
      ]
    }
  }
}
I have doubts about its scalability owing to the nature of the query involved. It involves two has_child queries, one of which wraps a terms query that has to be built from a separate index. I am considering improving the terms lookup by using fields with doc values enabled.
Will approach 2 scale? My concerns are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?
I think perhaps this is overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationships at all).
Users Index
(sample doc)
{
  "user_id": 12345,
  "user_name": "Swami PR",
  "user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
  "doc_id": 98765,
  "doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
  "doc_acl_read_users": [12345, 12346, 12347],
  "doc_acl_write_users": [12345],
  "doc_acl_read_groups": [435, 620],
  "doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs "Swami's" user_id and group_ids available when querying for documents.
Then, when you query [Top Secret] documents as Swami PR, (to read), make sure to add:
"bool": {
  "should": [
    {
      "term": {
        "doc_acl_read_users": 12345
      }
    },
    {
      "terms": {
        "doc_acl_read_groups": [900, 901, 902]
      }
    }
  ],
  "minimum_should_match": 1
}
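On the application side, the ACL clause can be generated from the user's record and attached to whatever the user searched for. A minimal sketch using the field names from the sample documents above; the function names and the "doc_name" match query are only illustrative, and it assumes a bool query with a filter clause.

```python
def read_acl_filter(user_id, group_ids):
    """Build the clause that restricts results to documents the user may read."""
    should = [{"term": {"doc_acl_read_users": user_id}}]
    if group_ids:
        should.append({"terms": {"doc_acl_read_groups": list(group_ids)}})
    # minimum_should_match is a sibling of "should", not an element of it
    return {"bool": {"should": should, "minimum_should_match": 1}}


def secured_search(user_query, user_id, group_ids):
    """Wrap the user's query so the ACL clause filters without affecting scoring."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "filter": [read_acl_filter(user_id, group_ids)],
            }
        }
    }


body = secured_search({"match": {"doc_name": "lunch order"}}, 12345, [900, 901, 902])
```

The user_id and group IDs can come from the Users Index or from a database, as long as they are available at query time.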
I can see 2 main types of updates that can happen here:
Users or Groups updated on Document := reindex one record in Document Index
User added to/removed from Group := reindex one record in User Index
With an edge-case
User or Group deleted
okay, here you might want to batch through and reindex all documents
periodically to clean out stale user/group identifiers... but
theoretically, stale user/group identifiers won't exist in the
application anymore, so don't cause issues in the index.
I have implemented approach #2 at my company, and I'm now researching a graph DB to handle ACLs. Once we crossed 4 million documents for one of our clients, the updates and search queries became quite frequent and didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.

Elasticsearch with multiple parent/child relationship

I'm building an application with a complicated model, say Book, User and Review.
A Review contains both Book and User id.
To be able to search for Books that contain at least one review, I've set the Book as Review's parent and have routing as such. However I also need to find Users who wrote reviews that contain certain phrases.
Is it possible to have both the Book and User as Review's parent? Is there a better way to handle such situation?
Note that I'm not able (or willing) to change the way the data is modeled, because it is transferred to Elasticsearch from a persistence database.
As far as I know you can't have a document with two parents.
My suggestion, based on the "Application-side joins" chapter of Elasticsearch: The Definitive Guide:
Create a parent/child relationship Book/Review
Be sure you have a user_id property in the Review mapping which contains the id of the user who wrote that review.
I think that covers both use cases you described, as follows:
Books that contain at least one review: this can be solved with the has_child filter/query.
Users who wrote reviews that contain certain phrases: this can be solved by querying the reviews for the phrase you want and performing a cardinality aggregation on the user_id field. If you need the users' information, you have to query your database (or another Elasticsearch index) with the ids retrieved.
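A sketch of the second use case: one request against the reviews that matches the phrase and aggregates on user_id. Note that a cardinality aggregation only returns the number of distinct users, so this sketch adds a terms aggregation to get the ids themselves. The text_review field name is taken from another answer in this thread, the aggregation names are made up, and the response at the bottom is hand-written sample data, not real output.

```python
def reviewers_for_phrase(phrase, max_users=100):
    """Build a search over reviews that reports who wrote the matching ones."""
    return {
        "size": 0,  # only the aggregations are needed, not the review hits
        "query": {"match_phrase": {"text_review": phrase}},
        "aggs": {
            # approximate number of distinct reviewers
            "distinct_users": {"cardinality": {"field": "user_id"}},
            # the reviewer ids themselves, most frequent first
            "user_ids": {"terms": {"field": "user_id", "size": max_users}},
        },
    }


def extract_user_ids(response):
    """Pull the user ids out of the terms aggregation of a search response."""
    return [b["key"] for b in response["aggregations"]["user_ids"]["buckets"]]


# hand-written response fragment, for illustration only
sample = {"aggregations": {"user_ids": {"buckets": [
    {"key": 7, "doc_count": 3}, {"key": 12, "doc_count": 1}]}}}
ids = extract_user_ids(sample)
```

The extracted ids are then used for the application-side join, i.e. a second lookup in the database or in a users index.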
Edit: "give me the books that have reviews this month written by user whose name started with John"
I recommend that you collect all those advanced use cases and denormalize the data you need to support them. In this particular case it's enough to denormalize the user name into Review. In any case, the Elasticsearch people have written about managing relations on their blog and in Elasticsearch: The Definitive Guide.
Something like this (just make the Books type the parent of both the Users and Reviews types):
.../index/users/_search?pretty" -d '
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "has_parent": {
              "parent_type": "books",
              "filter": {
                "has_child": {
                  "type": "Reviews",
                  "query": {
                    "term": {
                      "text_review": "some word"
                    }
                  }
                }
              }
            }
          }
        ]
      }
    }
  }
}
'
You have two options
Elasticsearch Nested Objects
Elasticsearch parent&child
both are compared and evaluated nicely here

Exclude setting on integer field in term query

My documents contain an integer array field, storing the id of tags describing them. Given a specific tag id, I want to extract a list of top tags that occur most frequently together with the provided one.
I can solve this problem by combining a terms aggregation over the tag id field with a term filter over the same field, but the list I get back obviously always starts with the tag id I provide: all documents matching my filter have that tag, so it is the first in the list.
I thought of using the exclude setting to avoid creating the problematic bucket, but as I'm dealing with an integer field, that seems not to be possible: this query
{
  "size": 0,
  "query": {
    "term": {
      "tag_ids": "00001"
    }
  },
  "aggs": {
    "tags": {
      "terms": {
        "size": 3,
        "field": "tag_ids",
        "exclude": "00001"
      }
    }
  }
}
returns an error saying that Aggregation [tags] cannot support the include/exclude settings as it can only be applied to string values.
Is it possible to avoid getting back this bucket?
This is, as of Elasticsearch 1.4, a shortcoming of ES itself.
After the community proposed this change, the functionality has been added and will be included in Elasticsearch 1.5.0.
It's supposed to be fixed since version 1.5.0.
Look at this: https://github.com/elasticsearch/elasticsearch/pull/7727
While it is en route to being fixed, my workaround is to have the aggregation use a script instead of direct access to the field, and let that script use the value as a string.
Works well and without measurable performance loss.
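A different, version-independent workaround (not the script approach described above) is to ask for one extra bucket and drop the unwanted one on the client. A rough sketch, with made-up function names and a hand-written response fragment standing in for real output:

```python
def cooccurring_tags_query(tag_id, size=3):
    """Ask for size + 1 buckets, since the filtered tag itself takes one of them."""
    return {
        "size": 0,
        "query": {"term": {"tag_ids": tag_id}},
        "aggs": {"tags": {"terms": {"field": "tag_ids", "size": size + 1}}},
    }


def top_cooccurring(response, tag_id, size=3):
    """Drop the bucket of the tag that was filtered on and keep the top `size`."""
    buckets = response["aggregations"]["tags"]["buckets"]
    return [b for b in buckets if b["key"] != tag_id][:size]


# hand-written response fragment, for illustration only
sample = {"aggregations": {"tags": {"buckets": [
    {"key": 1, "doc_count": 40},
    {"key": 8, "doc_count": 17},
    {"key": 3, "doc_count": 9},
    {"key": 5, "doc_count": 4},
]}}}
top = top_cooccurring(sample, tag_id=1, size=3)
```

Every matching document carries the filtered tag, so its bucket is always among those returned and removing it leaves the requested number of co-occurring tags.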
