Filtering collapsed results in Elasticsearch - elasticsearch

I have an elasticsearch index containing documents that represent entities at a given point in time. When an entity changes state, a new document is created with a timestamp. When I need to get the current state of all entities, I can do the following:
GET https://127.0.0.1:9200/myindex/_search
{
"collapse": {
"field": "entity_id"
},
"sort" : [{
"timestamp": {
"order": "desc"
}
}]
}
However, I would like to further filter the result of the collapse. When an entity is deleted, I create a new document that includes an is_deleted flag along with the timestamp in a nested metadata field. I would like to extend the above query to filter out deleted entities entirely. Using a term filter on entity_metadata.is_deleted: true obviously does not work, because the result then just includes the last document with that entity_id from before it was marked as deleted. How can I filter my results after the collapse is done to exclude any tombstoned entities?
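To make the ordering problem concrete, here is a small in-memory sketch in plain Python (it does not call Elasticsearch). The sample documents and the helper `collapse_latest` are made up for illustration, and `is_deleted` is shown as a top-level key for brevity rather than inside the nested metadata field.

```python
from typing import Any, Dict, Iterable, List

Doc = Dict[str, Any]


def collapse_latest(docs: Iterable[Doc]) -> List[Doc]:
    """Keep only the newest document per entity_id, newest first."""
    latest: Dict[Any, Doc] = {}
    for doc in docs:
        current = latest.get(doc["entity_id"])
        if current is None or doc["timestamp"] > current["timestamp"]:
            latest[doc["entity_id"]] = doc
    return sorted(latest.values(), key=lambda d: d["timestamp"], reverse=True)


# Hypothetical history: entity "a" was created and later tombstoned, "b" is alive.
docs = [
    {"entity_id": "a", "timestamp": 1, "is_deleted": False},
    {"entity_id": "a", "timestamp": 2, "is_deleted": True},
    {"entity_id": "b", "timestamp": 3, "is_deleted": False},
]

# Filter first, then collapse: the stale pre-deletion document for "a" survives.
filtered_first = collapse_latest(d for d in docs if not d["is_deleted"])

# Collapse first, then filter: "a" is dropped, which is the desired result.
collapsed_first = [d for d in collapse_latest(docs) if not d["is_deleted"]]
```

A query-level filter behaves like `filtered_first`, which is why the tombstoned entity reappears with its older state.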

What I would suggest is that, instead of adding an is_deleted flag, you add a deleted_date field holding the date of the deletion to all documents of that entity. Then, when you view a document, its own date and its deleted_date tell you whether the entity was live or deleted at that date.
In addition, it would allow you to consider:
all documents that don't have a deleted_date field (i.e. not deleted) and
all documents that have a deleted_date before/after a given date.
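As a rough sketch of how the original collapse query might look under this suggestion, the Python helper below builds the search body. It assumes `deleted_date` is a top-level date field that has already been copied onto every document of a deleted entity; the function name `live_entities_query` and the `as_of` parameter are hypothetical.

```python
from typing import Any, Dict, List, Optional


def live_entities_query(as_of: Optional[str] = None) -> Dict[str, Any]:
    """Build a collapse query that skips entities already deleted.

    With as_of=None, any document carrying a deleted_date is excluded.
    With a date string, the query looks at the state as of that date:
    only documents written up to as_of, for entities that were either
    never deleted or deleted after as_of.
    """
    not_deleted = {"bool": {"must_not": {"exists": {"field": "deleted_date"}}}}
    filters: List[Dict[str, Any]]
    if as_of is None:
        filters = [not_deleted]
    else:
        filters = [
            {
                "bool": {
                    "should": [
                        not_deleted,
                        {"range": {"deleted_date": {"gt": as_of}}},
                    ],
                    "minimum_should_match": 1,
                }
            },
            {"range": {"timestamp": {"lte": as_of}}},
        ]
    return {
        "query": {"bool": {"filter": filters}},
        "collapse": {"field": "entity_id"},
        "sort": [{"timestamp": {"order": "desc"}}],
    }
```

Because the filter now removes every document of a deleted entity, not just the tombstone, the collapse no longer falls back to an older document.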

Related

Documents with new field added before mapping update not queryable via new field

I have an index that for one reason or another we've added fields to that don't exist in our mapping. For example:
{
"name": "Bob", // exists in mapping
"age": 12 // doesn't exist in mapping
}
After updating the mapping to add the age field, any document we add the age field to is queryable, but none of the documents that had age added before we updated the mapping are queryable.
Is there a way to tell Elastic to make those older documents queryable, not just any net-new/updated after the mapping update?
This implies that you must have dynamic: false in your mapping, i.e. whenever you send a new field, you prevent ES from creating it automatically.
Once you have updated your mapping, you can then simply call _update_by_query on your index in order to update it and have it reindex the data it contains with the new mappings.
Your queries will then work also on the "older" data.
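A minimal sketch of preparing that call from Python using only the standard library. The base URL and index name are placeholders, the helper name is hypothetical, and the request is built but deliberately not sent, so adapt it before running it against a real cluster.

```python
import json
import urllib.request


def build_update_by_query_request(base_url: str, index: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to <index>/_update_by_query.

    With no script in the body, each matching document is re-indexed from
    its existing _source, so fields added to the mapping get picked up.
    conflicts=proceed asks Elasticsearch to continue past version conflicts.
    """
    url = f"{base_url.rstrip('/')}/{index}/_update_by_query?conflicts=proceed"
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and index name, for illustration only.
request = build_update_by_query_request("http://127.0.0.1:9200", "myindex")
# To actually run it: urllib.request.urlopen(request)
```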

Change _type of a document in elasticsearch

I have two TYPES in my elasticsearch index. Both have same mapping. I am using one for active documents, while the other for archived ones.
Now I want to archive a document, i.e. change its _type from active to archived. Both are in the same index, so I cannot reindex them either.
Is there a way to do this in Elasticsearch 5.0 ?
Changing the type is tricky. You would have to remove and then index the document with the new type.
Why not have a field in your document indicating "activeness". Then you can use a bool query to filter by what you want:
{"query": {
"bool": {
"filter": [{"term": {"status": "active"}}],
"must": { /* your query object here */ }
}
}
}
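A short Python sketch of the status-field idea, under the assumption that `status` is mapped as a keyword (not analyzed) field. The helper names are hypothetical; the first body would be sent to the document's `_update` endpoint, the second to `_search`.

```python
from typing import Any, Dict


def archive_update_body() -> Dict[str, Any]:
    """Partial-update body that flips a document to archived in place."""
    return {"doc": {"status": "archived"}}


def active_only_query(user_query: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap any query so that only documents with status == active match."""
    return {
        "query": {
            "bool": {
                "must": user_query,
                # A filter clause does not affect scoring and can be cached.
                "filter": [{"term": {"status": "active"}}],
            }
        }
    }


search_body = active_only_query({"match": {"title": "invoice"}})
```

Archiving then becomes a small partial update rather than a delete followed by a re-index under another type.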
Agree with having a field which indicates the activeness of the document.
(Or)
Use two different indices for "active" and "inactive" types.
Use aliases which map to these indices.
Aliases will give you flexibility to change your indices without downtimes.
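For the two-index variant, a sketch of the body one might send to the `_aliases` endpoint. All index and alias names here are invented for illustration.

```python
from typing import Any, Dict


def alias_actions(active_index: str, archived_index: str) -> Dict[str, Any]:
    """Body for POST /_aliases: one alias per index plus one spanning both."""
    return {
        "actions": [
            {"add": {"index": active_index, "alias": "docs_active"}},
            {"add": {"index": archived_index, "alias": "docs_archived"}},
            {"add": {"indices": [active_index, archived_index], "alias": "docs_all"}},
        ]
    }


# Hypothetical physical index names behind the aliases.
body = alias_actions("docs_active_v1", "docs_archived_v1")
```

Clients query the aliases only, so the underlying indices can be rebuilt or swapped later without changing application code.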

ElasticSearch content ACL Filtering performance

Following is my content model.
Document(s) are associated with user & group acls defining the principals who have access to the document.
The document itself is a bunch of metadata & a large content body (extracted from pdfs/docs etc).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the acls on the document). He/She could have access to the document owing to user acls or owing to the group the user belongs to.
Both group membership and the ACLs on the document are highly transient: a user's group membership changes quite often, and so do the ACLs on the document itself.
Approach 1
Store the acls on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since the acl can be a group).
At the time of query, append a filter to the user query which will do a bool filter to include only documents with the userid in the acl field
"filter" : {
        "query" : {
            "term": {
                "acls": "1234"
            }
        }
      }
The problem I see with this approach is that documents need to be re-indexed even though the document metadata/content has not changed:
Every time a user's group membership changes
Every time the ACL on the document changes (permission changed for the document)
I assume this will lead to a large number of segment creations and merges, especially since the document body (one of the fields of the document) is a fairly large block of text.
Approach 2:
This is a modification on the approach 1. This approach attempts to limit the updates on the document when the updates are strictly acl related.
Instead of defining the ACLs on the document metadata, this approach creates multiple types:
In the Document Index
Document (with metadata & text body) as a parent
id
text
userschild Document (parent id & user acls only). This document will exist for each parent
id
parentid
useracls
groupschild Document (parent id & group acls only). This document will exist for each parent with group acls
id
parentid
groupacls
In the Users Index
An entry for each user in the system with the groups he/she is associated with
User
id
groups
The idea here is that updates are now localized to the different ElasticSearch entities.
In case of user acl changes only the userschild document will get updated (avoiding a potentially costly update on the parent document).
In case of the group acl changes only the groupschild document will get updated (again avoiding a potentially costly update on the parent document).
In case of user group membership changes again only the secondary index will get updated (avoiding the update on the parent document).
The query itself will look as follows.
"filter" : {
"query" : {
"bool": {
"should": [
{
"has_child": {
"type": "userschild",
"query": {
"term": {
"users": "1234"
}
}
}
},{
"has_child": {
"type": "groupschild",
"query": {
"terms" : {
"groups" : {
"index" : "users",
"type" : "user",
"id" : "1234",
"path" : "groups"
}
}
}
}
}
]
}
}
}
I have doubts about its scalability owing to the nature of the query involved. It involves two terms queries, one of which has to be built from a separate index. I am considering improving the terms lookup by using fields with doc values enabled.
Will the approach 2 scale? The concerns I have are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?
I think perhaps this is overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationships at all).
Users Index
(sample doc)
{
"user_id": 12345,
"user_name": "Swami PR",
"user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
"doc_id": 98765,
"doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
"doc_acl_read_users": [12345, 12346, 12347],
"doc_acl_write_users": [12345],
"doc_acl_read_groups": [435, 620],
"doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs "Swami's" user_id and group_ids available when querying for documents.
Then, when you query [Top Secret] documents as Swami PR (to read), make sure to add the following inside your bool query:
"should": [
{
"term": {
"doc_acl_read_users": 12345
}
},
{
"terms": {
"doc_acl_read_groups": [900, 901, 902]
}
}
],
"minimum_should_match": 1
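A sketch of how an application might build that clause from the user record, in Python. The field names come from the sample documents above; the helper names `acl_read_filter` and `can_read` are hypothetical, and `can_read` is only a client-side mirror of the filter logic that is convenient for unit tests.

```python
from typing import Any, Dict


def acl_read_filter(user: Dict[str, Any]) -> Dict[str, Any]:
    """Bool clause matching documents the user may read, directly or via a group."""
    return {
        "bool": {
            "should": [
                {"term": {"doc_acl_read_users": user["user_id"]}},
                {"terms": {"doc_acl_read_groups": user["user_group_ids"]}},
            ],
            "minimum_should_match": 1,
        }
    }


def can_read(user: Dict[str, Any], doc: Dict[str, Any]) -> bool:
    """Same rule evaluated in memory: user listed, or any shared group."""
    listed = user["user_id"] in doc["doc_acl_read_users"]
    shared_group = bool(set(user["user_group_ids"]) & set(doc["doc_acl_read_groups"]))
    return listed or shared_group


# Sample records taken from the answer above.
swami = {"user_id": 12345, "user_name": "Swami PR", "user_group_ids": [900, 901, 902]}
lunch_order = {
    "doc_id": 98765,
    "doc_acl_read_users": [12345, 12346, 12347],
    "doc_acl_read_groups": [435, 620],
}
```

The returned clause would typically go into the `filter` section of the outer bool query, so that ACL matching does not influence relevance scores.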
I can see 2 main types of updates that can happen here:
Users or Groups updated on Document := reindex one record in Document Index
User added to/removed from Group := reindex one record in User Index
With an edge case:
User or Group deleted := here you might want to batch through and reindex all documents periodically to clean out stale user/group identifiers. In theory, though, stale user/group identifiers no longer exist in the application, so they don't cause issues in the index.
I have implemented approach #2 at my company, and I am now researching a graph DB to handle ACLs. Once we crossed 4 million documents for one of our clients, updates and search queries became quite frequent and the approach didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.

Elasticsearch - Extra unmapped fields on geo-shape type index

I have some extra inner fields on a geo-shape type field. For example, "shape" is a geo-shape type field which has the regular required fields like "coordinates", "radius" etc., but it may also have other fields like "metadata" which I want elasticsearch to not parse and not store in the index. For example:
"shape": {
"coordinates": [6.77, 8.99],
"radius": 500,
"metadata": "some value"
}
Mapping schema looks like this:
"shape":{
"type":"geo_shape"
}
How can I achieve this? Using "dynamic": false in the mapping schema does not seem to work.
Setting dynamic to false in your root mapping, like you did, is the way to go: are you sure it doesn't work? Or are you saying that because the field appears in the _source of your result hits?
Actually, by default, the _source attribute will contain the exact same document that you submitted.
However, that doesn't mean the extra metadata field has been indexed and/or stored.
If you want to check this, request that field specifically in your search like this:
POST _search
{
"fields": ["shape.metadata"]
}
You should get your search hits back, but without any fields value.
If it still bothers you, disable the _source attribute in your mapping.

Exclude setting on integer field in term query

My documents contain an integer array field, storing the id of tags describing them. Given a specific tag id, I want to extract a list of top tags that occur most frequently together with the provided one.
I can solve this problem by combining a terms aggregation over the tag id field with a term filter over the same field, but the list I get back obviously always starts with the tag id I provide: all documents matching my filter have that tag, and it is thus the first in the list.
I thought of using the exclude setting to avoid creating the problematic bucket, but as I'm dealing with an integer field, that seems not to be possible: this query
{
"size": 0,
"query": {
"term": {
"tag_ids": "00001"
}
},
"aggs": {
"tags": {
"terms": {
"size": 3,
"field": "tag_ids",
"exclude": "00001"
}
}
}
}
returns an error saying that Aggregation [tags] cannot support the include/exclude settings as it can only be applied to string values.
Is it possible to avoid getting back this bucket?
This is, as of Elasticsearch 1.4, a shortcoming of ES itself.
After the community proposed this change, the functionality has been added and will be included in Elasticsearch 1.5.0.
It's supposed to be fixed since version 1.5.0.
Look at this: https://github.com/elasticsearch/elasticsearch/pull/7727
While it is en route to being fixed, my workaround is to have the aggregation use a script instead of direct access to the field, and let that script return the value as a string.
Works well and without measurable performance loss.
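Another option that sidesteps the exclude setting altogether is to ask for one extra bucket and drop the unwanted one on the client. This is a swapped-in technique, not the script workaround described above. The Python sketch below uses hypothetical helper names and made-up bucket counts.

```python
from typing import Any, Dict, List


def co_occurring_tags_request(tag_id: int, size: int = 3) -> Dict[str, Any]:
    """Search body asking for size + 1 buckets, leaving room for the queried tag."""
    return {
        "size": 0,
        "query": {"term": {"tag_ids": tag_id}},
        "aggs": {"tags": {"terms": {"field": "tag_ids", "size": size + 1}}},
    }


def drop_own_bucket(
    buckets: List[Dict[str, Any]], tag_id: int, size: int = 3
) -> List[Dict[str, Any]]:
    """Remove the bucket for tag_id and keep at most `size` of the rest."""
    return [b for b in buckets if b["key"] != tag_id][:size]


# Illustrative buckets in the shape of a terms aggregation response (made-up counts).
sample_buckets = [
    {"key": 1, "doc_count": 40},
    {"key": 7, "doc_count": 22},
    {"key": 3, "doc_count": 15},
    {"key": 9, "doc_count": 4},
]
top_tags = drop_own_bucket(sample_buckets, tag_id=1)
```

Since every matching document carries the queried tag, its bucket is always among the top results, so one extra bucket is enough to still return `size` co-occurring tags.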
