Elasticsearch Terms Query exclude large amount of users - elasticsearch

I'm working on a Tinder-like app. In order to exclude profiles the user has already swiped, I use a "must_not" query like this:
must_not : [{"terms": { "swipedusers": ["userid1", "userid1", "userid1"…]}}]
I wonder what the limits of this approach are. Is it scalable? Would it still work when the swipedusers array contains 2000 user ids? If there is a more scalable approach, I would be happy to hear it.

There is a better approach! It is called a "terms lookup", and it is something like the traditional join you could do in a relational database.
I could try to explain it here, but all the information you need is well documented on the official Elasticsearch page:
https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-terms-query.html#query-dsl-terms-lookup
The final solution uses two indices: one for the registered users and another to track each user's swipes.
Then, for each swipe, you update the document containing the current user's swipes. Here you will need to append elements to an array, which is another problem in Elasticsearch (a big problem if you are using AWS managed Elasticsearch) that can only be solved with scripting...
More info at https://www.elastic.co/guide/en/elasticsearch/guide/current/partial-updates.html#_using_scripts_to_make_partial_updates
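As a minimal sketch of that scripted partial update (the index name `swiped` and field name `swipedUserId` are assumptions taken from the lookup query below; adjust them to your own mapping), the update body could be built like this:

```python
def swipe_update_body(swiped_user_id):
    """Build the body for a scripted partial update that appends a newly
    swiped user id to the current user's `swipedUserId` array.

    The field name is an assumption from the example query; this is an
    illustration, not code from the answer itself.
    """
    return {
        "script": {
            # Painless: append only if the id is not already present,
            # so repeated swipes don't grow the array.
            "inline": (
                "if (!ctx._source.swipedUserId.contains(params.id)) "
                "{ ctx._source.swipedUserId.add(params.id) }"
            ),
            "params": {"id": swiped_user_id},
        }
    }

# The body would then be sent as, e.g.:
# POST /swiped/users/current-user-id/_update
```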
For your case, the query will look something like this:
GET /possible_matches/_search
{
  "query" : {
    "terms" : {
      "user" : {
        "index" : "swiped",
        "type" : "users",
        "id" : "current-user-id",
        "path" : "swipedUserId"
      }
    }
  }
}
Another thing you should take into account is the replication configuration for the swipes index: since each node will perform "joins" against that index, it is highly recommended to keep a full copy of it on every node. You can achieve this by creating the index with "auto_expand_replicas" set to "0-all".
PUT /swipes
{
  "settings": {
    "auto_expand_replicas": "0-all"
  }
}

Related

preserving UI in post filter aggregated faceted search

I'm moving a SQL Server product catalog over to Elasticsearch and want to preserve how the UI currently allows the user to navigate the options. I am using aggregates with a post filter but cannot get the selected option's siblings to show up in the aggregates.
An example of what I am trying to achieve is from the Elastic docs.
GET /cars/transactions/_search
{
  "size" : 0,
  "query": {
    "match": {
      "make": "ford"
    }
  },
  "post_filter": {
    "term" : {
      "color" : "green"
    }
  },
  "aggs" : {
    "all_colors": {
      "terms" : { "field" : "color" }
    }
  }
}
So, the user has clicked on the green option and the returned documents show only green Ford cars, but the aggregates list all of the colors available for Ford with their counts, which can be shown in a UI.
All of this is OK. But there are many makes of car other than Ford. If I added a 'makes' aggregate, this query would only return Ford in the aggregates list. Since I am building the navigation UI dynamically from the returned results, there would be no way to place all the other makes of car into the UI unless I queried Elasticsearch many times to build it up, which I don't want to do.
If I changed the query to a match_all and moved the original query into the post filter, I would get the full list of car makes in the aggregation, but the counts would always be the global counts from the match_all query rather than reflecting the drill-down.
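One commonly used pattern for this situation (a sketch, not an accepted answer to this question) is to keep the base query broad and wrap each facet in its own filter aggregation that applies every current selection except the one on that facet's own field, while the post filter still narrows the returned hits. The field names follow the cars example above; the helper just builds the request body:

```python
def faceted_search_body(selections, facet_fields):
    """Build a search body where each facet aggregation is filtered by
    every current selection *except* the one on its own field, so a
    facet keeps showing its sibling options.

    `selections` maps field -> selected value, e.g. {"color": "green"};
    `facet_fields` lists the fields to aggregate on.
    """
    def filter_for(exclude_field):
        clauses = [
            {"term": {f: v}}
            for f, v in selections.items()
            if f != exclude_field
        ]
        return {"bool": {"must": clauses}} if clauses else {"match_all": {}}

    aggs = {
        f"all_{field}": {
            "filter": filter_for(field),
            "aggs": {field: {"terms": {"field": field}}},
        }
        for field in facet_fields
    }
    return {
        "size": 0,
        "aggs": aggs,
        # post_filter still narrows the returned hits to the full selection
        "post_filter": filter_for(None),
    }
```

With `selections={"color": "green"}` and `facet_fields=["color", "make"]`, the color facet sees all documents (so sibling colors survive), while the make facet is filtered to green cars only.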
Is it possible to do this with Elasticsearch? I've gone through all the documents several times and tried many different query formats, but nothing has produced quite the right results so far.

ElasticSearch content ACL Filtering performance

Following is my content model.
Documents are associated with user and group ACLs defining the principals who have access to the document.
The document itself is a bunch of metadata plus a large content body (extracted from PDFs/docs etc.).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the ACLs on the document). He/she could have access to the document via user ACLs or via the groups the user belongs to.
Both group membership and the ACLs on the document are highly transient: a user's group membership changes quite often, and so do the ACLs on the document itself.
Approach 1
Store the ACLs on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since an ACL entry can be a group).
At query time, append a filter to the user query that includes only documents with the user id in the acl field:
"filter" : {
        "query" : {
            "term": {
                "acls": "1234"
            }
        }
      }
The problem I see with this approach is that documents need to get re-indexed even though the document metadata/content has not changed:
- every time a user's group membership changes
- every time the ACL on the document changes (permissions changed for the document)
I am assuming this will lead to a large number of segment creations and merges, especially since the document body (one of the fields of the document) is a pretty large text section.
Approach 2:
This is a modification of approach 1. It attempts to limit the updates on the document when the updates are strictly ACL-related.
Instead of having the ACLs defined on the metadata, this approach entails creating multiple types:
In the Document Index:
- Document (with metadata & text body) as a parent
  - id
  - text
- userschild document (parent id & user ACLs only); this document will exist for each parent
  - id
  - parentid
  - useracls
- groupschild document (parent id & group ACLs only); this document will exist for each parent with group ACLs
  - id
  - parentid
  - groupacls
In the Users Index:
- User: an entry for each user in the system with the groups he/she is associated with
  - id
  - groups
The idea here is that updates are now localized to the different Elasticsearch entities.
If a user ACL changes, only the userschild document gets updated (avoiding a potentially costly update on the parent document).
If a group ACL changes, only the groupschild document gets updated (again avoiding a potentially costly update on the parent document).
If a user's group membership changes, again only the secondary index gets updated (avoiding the update on the parent document).
The query itself will look as follows:
"filter" : {
"query" : {
"bool": {
"should": [
{
"has_child": {
"type": "userschild",
"query": {
"term": {
"users": "1234"
}
}
}
},{
"has_child": {
"type": "groupschild",
"query": {
"terms" : {
"groups" : {
"index" : "users",
"type" : "user",
"id" : "1234",
"path" : "groups"
}
}
}
}
}
]
}
}
}
I have doubts about its scalability owing to the nature of the query involved: it uses two queries, one of which (the terms lookup) has to be built from a separate index. I am considering improving the terms lookup by enabling doc values on the lookup field.
Will approach 2 scale? My main concerns are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?
I think this is perhaps overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationship at all).
Users Index
(sample doc)
{
  "user_id": 12345,
  "user_name": "Swami PR",
  "user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
  "doc_id": 98765,
  "doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
  "doc_acl_read_users": [12345, 12346, 12347],
  "doc_acl_write_users": [12345],
  "doc_acl_read_groups": [435, 620],
  "doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs Swami's user_id and group_ids available when querying for documents.
Then, when you query [Top Secret] documents as Swami PR (to read), make sure to add:
"should": [
{
"term": {
"doc_acl_read_users": 12345
}
},
{
"terms": {
"doc_acl_read_groups": [900, 901, 902]
}
},
"minimum_should_match": 1
]
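A small helper to assemble that clause from a user record (the field names follow the sample documents above; the function itself is just an illustration of the pattern, not code from the answer):

```python
def acl_read_filter(user):
    """Build the bool clause that limits results to documents the given
    user may read, either directly or via group membership.

    Field names (`doc_acl_read_users`, `doc_acl_read_groups`,
    `user_id`, `user_group_ids`) follow the sample documents above.
    """
    return {
        "bool": {
            "should": [
                {"term": {"doc_acl_read_users": user["user_id"]}},
                {"terms": {"doc_acl_read_groups": user["user_group_ids"]}},
            ],
            # At least one of the two clauses must match.
            "minimum_should_match": 1,
        }
    }
```

The clause would then be combined with the user's actual search query inside a filtered/bool query, so relevance scoring is unaffected by the ACL check.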
I can see two main types of update that can happen here:
- Users or groups updated on a document := reindex one record in the Document Index
- User added to/removed from a group := reindex one record in the Users Index
With an edge case:
- User or group deleted := here you might want to batch through and reindex all documents periodically to clean out stale user/group identifiers, but theoretically, stale user/group identifiers won't exist in the application anymore, so they don't cause issues in the index.
I have implemented approach #2 at my company, and now I'm researching a graph DB to handle ACLs. Once we crossed 4 million documents for one of our clients, the updates and search queries became quite frequent and didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.

Can I use parent-child relationships on Kibana?

On a relational DB, I have two tables connected by a foreign key in a typical one-to-many relationship. I would like to translate this schema into Elasticsearch, so I researched and found two options: nested and parent-child. My ultimate goal is to visualize this dataset in Kibana 4.
Parent-child seemed the most adequate, so I'll describe the steps I followed, based on the official ES documentation and a few examples I found on the web.
curl -XPUT http://server:port/accident_struct -d '
{
  "mappings" : {
    "event" : {
    },
    "details": {
      "_parent": {
        "type": "event"
      },
      "properties" : {
      }
    }
  }
}
';
Here I create the index accident_struct, which contains two types (corresponding to the two relational tables): event and details.
Event is the parent, so each details document has an event associated with it.
Then I upload the documents using the bulk API. For event:
{"index":{"_index":"accident_struct","_type":"event","_id":"17f14c32-53ca-4671-b959-cf47e81cf55c"}}
{values here...}
And for details:
{"index":{"_index":"accident_struct","_type":"details","_id": "1", "_parent": "039c7e18-a24f-402d-b2c8-e5d36b8ad220" }}
The event does not know anything about its children, but each child (details) needs to set its parent. In the ES documentation I see the parent being set using "parent", while in other examples I see "_parent". I wonder which is correct (although at this point, neither works for me).
The requests complete successfully and I can see that the number of documents in the index corresponds to the sum of event and details documents.
I can also query parents for children and children for parents, on ES. For example:
curl -XPOST host:port/accident_struct/details/_search?pretty -d '{
  "query" : {
    "has_parent" : {
      "type" : "event",
      "query" : {
        "match_all" : {}
      }
    }
  }
}'
After setting up the index in Kibana, I am able to list all the fields from parent and child. However, if I go to the "discover" tab, only the parent fields are listed.
If I uncheck a box that reads "hide missing fields", the fields from the child documents are shown greyed out, along with an error message (see image).
Am I doing something wrong, or is parent-child not supported in Kibana 4? And if it is not supported, what would be the best alternative to represent this type of relationship?
Per the comment in this discussion on the elastic site, P/C is, like nested objects, at least not supported in visualizations. Le sigh.

suggestion completion across multiple types in an index

Is it possible to do a suggestion completion on a type? I'm able to do it on an index.
POST /data/_suggest
{
  "data" : {
    "text" : "tr",
    "completion" : {
      "field" : "sattributes",
      "size" : 50
    }
  }
}
When I do it on a type:
POST /data/suggestion/_suggest
{
  "data" : {
    "text" : "tr",
    "completion" : {
      "field" : "sattributes",
      "size" : 50
    }
  }
}
Here, suggestion is the type.
I don't get any results. I need to run suggestions on two different types, articles and books. Do I need to create separate indexes to make this work, or is there a way in Elasticsearch to accomplish it? And if I have to search my whole index, is there a way to get 50 results for type article and 50 results for type book?
Any help is highly appreciated.
Lucene has no concept of types, so in Elasticsearch they are simply implemented as a hidden field called _type. When you search on a particular type, Elasticsearch adds a filter on that field.
The completion suggester doesn't use traditional search at all, which means it can't apply a filter on the _type field. So you have a couple of options:
- Use a different completion suggester field per type, e.g. suggestion_sattributes, othertype_sattributes
- Index your data with the _type as a prefix, e.g. type1 actual words to suggest, then when you ask for suggestions, prepend type1 to the query text
- Use separate indices
In fact, option (2) is being implemented at the moment as the new ContextSuggester, which will allow you to do this (and more) automatically.
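Option (2) can be sketched as follows: index each completion input with its type name prefixed, prepend the same prefix at query time, and strip it from the returned suggestions before display. The prefixing convention (type name plus a space) and the suggester name `type_scoped` are illustrative choices, not part of any Elasticsearch API; the field name `sattributes` comes from the question above.

```python
def prefixed_input(doc_type, text):
    """Prefix a completion input with its type so suggestions can be
    restricted to one type at query time (option 2 above)."""
    return f"{doc_type} {text}"

def suggest_body(doc_type, user_text, size=50):
    """Build a _suggest request body that asks for completions of
    `user_text` within a single type, assuming the inputs were indexed
    via prefixed_input()."""
    return {
        "type_scoped": {
            "text": prefixed_input(doc_type, user_text),
            "completion": {"field": "sattributes", "size": size},
        }
    }

def strip_prefix(doc_type, suggestion_text):
    """Remove the type prefix before showing the suggestion to users."""
    prefix = f"{doc_type} "
    if suggestion_text.startswith(prefix):
        return suggestion_text[len(prefix):]
    return suggestion_text
```

To get 50 article results and 50 book results in one round trip, you could send two such suggesters (one per type prefix) in the same _suggest request body.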

How does elasticsearch facet feature work with async search query?

I am aware that using the facet feature of Elasticsearch we can get aggregated values for specified fields based on the search query's result set.
I have an application where I am monitoring logs and using Elasticsearch to search through the log entries. The UI has a paging mechanism in place, so I use the async search feature to fetch 'n' entries at a time.
So my question is: if I modify my async search query to also fetch facet information for certain fields, will it give the aggregated values for the subset of results fetched by the async query, or for the entire search result (and not just the subset returned to the user)?
Many thanks and regards,
Komal
Facets are returned for the entire search result. You can even set size to 0 in your request, which will result in no hits being fetched while you still get all facets.
Please refer here for the detailed documentation. You can issue a match_all query to fetch facets over all documents:
{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "tag" : {
      "terms" : {
        "field" : "tag",
        "size" : 10
      }
    }
  }
}
Please post your code gist for more information.
