Logstash -> Elasticsearch - update denormalized data

Use case explanation
We have a relational database with data about our day-to-day operations. The goal is to allow users to search the important data with a full-text search engine. The data is normalized and thus not in the best form for full-text queries, so the idea is to denormalize a subset of the data and copy it in real time to Elasticsearch, which allows us to build a fast and accurate search application.
We already have a system in place that enables Event Sourcing of our database operations (inserts, updates, deletes). The events contain only the changed columns and the primary keys (on an update we don't get the whole row). Logstash is already notified of each event, so that part is handled.
Actual problem
Now we are getting to our problem. Since the plan is to denormalize our data, we have to make sure that updates on parent objects are propagated to the denormalized child objects in Elasticsearch. How can we configure Logstash to do this?
Example
Let's say we maintain a list of Employees in Elasticsearch. Each Employee is assigned to a Company. Since the data is denormalized (for the purpose of faster search), each Employee also carries the name and address of the Company. An update changes the name of a Company: how can we configure Logstash to update the company name in all Employees assigned to that Company?
Additional explanation
@Darth_Vader:
The problem we are facing is that we get an event saying a Company has changed, but we want to modify documents of type Employee in Elasticsearch, because they carry the company data themselves. Your answer assumes that we will get an event for every Employee, which is not the case.
Maybe this will make it clearer. We have 3 employees in Elasticsearch:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Then an update happens in the source DB.
UPDATE company SET name = 'Company NEW' WHERE cmp_id = 1;
We get an event in Logstash that says something like this:
{type:'company',cmp_id:'1',old.name:'Company A',new.name:'Company NEW'}
This should then be propagated to Elasticsearch, so that the resulting employees are:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Notice that the field company.name changed.

I suggest a similar solution to what I've posted here, i.e. to use the http output plugin in order to issue an update by query call to the Employee index. The query would need to look like this:
POST employees/_update_by_query
{
  "script": {
    "source": "ctx._source.company.name = params.name",
    "lang": "painless",
    "params": {
      "name": "Company NEW"
    }
  },
  "query": {
    "term": {
      "company.cmp_id": "1"
    }
  }
}
So your Logstash config should look like this:
input {
  ...
}
filter {
  mutate {
    add_field => {
      "[script][lang]" => "painless"
      "[script][source]" => "ctx._source.company.name = params.name"
      "[script][params][name]" => "%{new.name}"
      "[query][term][company.cmp_id]" => "%{cmp_id}"
    }
    remove_field => ["host", "@version", "@timestamp", "type", "cmp_id", "old.name", "new.name"]
  }
}
output {
  http {
    url => "http://localhost:9200/employees/_update_by_query"
    http_method => "post"
    format => "json"
  }
}
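For reference, with the fields built and pruned as above, the JSON body that the http output ends up POSTing to _update_by_query should look roughly like this for the sample company event (a sketch, assuming the %{new.name} and %{cmp_id} references resolve to the values shown earlier):
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.company.name = params.name",
    "params": {
      "name": "Company NEW"
    }
  },
  "query": {
    "term": {
      "company.cmp_id": "1"
    }
  }
}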

Related

How to model this multi-tenancy use case?

We are a multi-tenant platform.
The platform has a construct called Entity.
Users can create entities to model any real-life object, e.g. Customers, Orders, Payments, Inventory, Carts, pretty much anything.
Each entity will have its set of attributes, for example, a customer entity can have: name, email, phone, address (can be another nested entity), etc.
The requirement is to provide query/OLAP capabilities on these entities. For example, find all customers where name = 'john'.
The requirement includes all types of queries such as DATE RANGE, CONTAINS, LIKE, NUMERIC RANGE, FULL-TEXT Queries, etc. We also need Sorting, Aggregation, Pagination features.
Current design
We use elasticsearch to store entity data.
Each tenant is assigned a separate index.
When an entity is created in a tenant, the corresponding mappings are created inside the associated index. The mappings have roughly the following form:
{
  "properties": {
    "Customer": {
      "properties": {
        "name": {
          "values": {
            "type": "text"
          }
        },
        "age": {
          "values": {
            "type": "integer"
          }
        }
      }
    },
    "Order": {
      "properties": {
        "id": {
          "values": {
            "type": "text"
          }
        },
        "eta": {
          "values": {
            "type": "integer"
          }
        }
      }
    }
    //... other entities of this tenant
  }
}
Major Problems with this design
Ever-growing mappings.
Frequent mapping updates keep the nodes busy circulating cluster state updates, leading to search/indexing latencies and occasional timeouts.
Existing mappings can't be altered when required; we have to go through a full re-index procedure.
The current design served us for a few years, until recently when these issues started popping up.
What would be a good design to model the above multi-tenancy requirement? Which database solution and schema modeling will be appropriate?
If you decide to stick with ES, you'll need to give your tenants more than one index, i.e. one index per tenant/entity instead of one index per tenant. The reasons for this are exactly the ones you mentioned, i.e. ever-growing mapping and the difficulty to update existing mappings.
You'll end up with more indexes for sure (N tenants x M entities), and the challenge will be to properly size those indexes in terms of how many primary shards each entity needs. But sizing is arguably even more difficult now, with all entities stored in a single index per tenant, so it will turn out to be easier to fan each tenant index out into several.
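To make the fan-out concrete, the per-tenant/per-entity indexes could be created along these lines (a sketch only: the index names, shard counts and simplified field definitions are illustrative, not prescriptive):
PUT tenant1-customer
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age":  { "type": "integer" }
    }
  }
}

PUT tenant1-order
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "properties": {
      "id":  { "type": "keyword" },
      "eta": { "type": "integer" }
    }
  }
}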
Another option is to come up with a very generic mapping, which only contains typed fields like int_field_1, int_field_2, text_field_1, text_field_2, etc, and you keep a per-tenant mapping between the generic field names and the tenant specific field names:
name -> text_field_1
age -> int_field_1
...
That way you have fewer mappings to manage and it's more flexible in terms of what kind of data you can accommodate, but it comes at the price of keeping the above field mapping up to date.
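A minimal sketch of what such a generic mapping could look like (the index name, the exact set of typed fields and the entity_type discriminator field are all assumptions for illustration):
PUT tenant1-entities
{
  "mappings": {
    "properties": {
      "entity_type":  { "type": "keyword" },
      "text_field_1": { "type": "text" },
      "text_field_2": { "type": "text" },
      "int_field_1":  { "type": "integer" },
      "int_field_2":  { "type": "integer" },
      "date_field_1": { "type": "date" }
    }
  }
}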
In any case, you'll end up with more indexes for your tenants, which makes it easier to manage their mappings and keep their size at bay. It will also allow you to scale better, because it's easier to spread several smaller indices over your data nodes than a few very big ones, especially given the sizing recommendations made by Elastic.

How can I let ES support mixed type of a field?

I am saving logs to Elasticsearch for analysis, but I found that a particular field has mixed types, which causes errors when indexing the documents.
For example, I may save the log below to the index, where uuid is an object.
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
but from another event, the log would be:
POST /index-000001/_doc
{
"uuid": "001"
}
The second POST will fail because uuid is already mapped as an object but a concrete value is sent, so I get this error: object mapping for [uuid] tried to parse field [uuid] as object, but found a concrete value
I wonder what the best solution for that is. I can't change the logs because they come from different applications: the first log holds data from DynamoDB, while the second one holds data from the application. How can I save both types of logs into ES?
If I disable dynamic mapping, I will have to specify all fields in the index mapping, and I would not be able to search any new fields. So I do need dynamic mapping.
There will be many cases like this, so I am looking for a solution that can cover all conflicting fields.
It's perfectly possible using ingest pipelines which are run before the indexing process.
The following would be a solution for your particular use case, albeit somewhat onerous:
create a pipeline
PUT _ingest/pipeline/uuid_normalize
{
  "description" : "Makes sure uuid is a hash map",
  "processors" : [
    {
      "script": {
        "source": """
          if (ctx.uuid != null && !(ctx.uuid instanceof java.util.HashMap)) {
            ctx.uuid = ['S': ctx.uuid]; // hash map init
          }
        """
      }
    }
  ]
}
run the pipeline when ingesting a new doc
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
POST /index-000001/_doc?pipeline=uuid_normalize <------
{
"uuid": "001"
}
You could now extend this to be as generic as you like, but it is assumed that you know what you expect as input in each and every doc. In other words, unlike dynamic templates, you need to know what you want to safeguard against.
You can read more about painless script operators here.
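For instance, one way to extend the same idea to several conflicting fields at once could look like the sketch below (the field list is made up; it would have to match the fields you actually expect to arrive in both shapes):
PUT _ingest/pipeline/normalize_conflicting_fields
{
  "description": "Wraps known scalar values into their object form",
  "processors": [
    {
      "script": {
        "source": """
          // hypothetical list of fields that sometimes arrive as plain values
          def fields = ['uuid', 'order_id'];
          for (def f : fields) {
            if (ctx[f] != null && !(ctx[f] instanceof java.util.HashMap)) {
              ctx[f] = ['S': ctx[f]]; // wrap the value, same as uuid_normalize does
            }
          }
        """
      }
    }
  ]
}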
You just cannot.
You should either normalize all your fields in one way or another, or use two separate fields.
I can suggest using a field like this:
"uuid": {"key": "S", "value": "001"}
and skipping the key when it's not necessary.
But you will have to preprocess your values before ingestion.

Using numerics as type in Elasticsearch

I am going to store transaction logs in Elasticsearch. I am new to the ELK stack and not sure how I should implement this. My transaction prints log lines sequentially (upserts), and instead of logging these to a file I want to store them in Elasticsearch and later query the logs by the transactionId I have created.
Normally the URI for querying will be
/bookstore/books/_search
but in my case it must be like
/transactions/transactionId/_search
because I don't want to store the lines as an array attached to a single transaction record, but I am not sure whether it is good practice to create a new type at the beginning of every transaction. I am not even sure if this is possible.
Can you give advice about storing this transaction data in Elasticsearch?
If you want to query with a URI like /transactions/transactionId/_search, that means you are planning to create a new type every time a new transactionId comes in. Apart from this being a bad design, it's not even possible to have more than one type in an index (since version 6.x), and mapping types have since been removed entirely (deprecated in 7.x, gone in 8.x).
One work-around is to use the transactionId itself as the document ID at creation time. Then you can get the log associated with a transactionId by querying GET transactions/_doc/<transactionId> (read about the length restrictions on document ids, though). But this might cause another issue: there can be multiple log lines for the same transaction, so each log entry with the same id would simply overwrite the previous one.
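A rough sketch of that work-around (the id, field names and values are made up): every log line for the transaction is written to the same document id, which is exactly why later lines replace earlier ones.
PUT transactions/_doc/tx-0001
{
  "message": "payment authorized"
}

PUT transactions/_doc/tx-0001       <-- same id: this second line overwrites the first one
{
  "message": "payment captured"
}

GET transactions/_doc/tx-0001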
The best solution here will be to change how you query those records.
For this, you can put transactionId as one of the fields in the JSON body, along with a created timestamp at insertion time (let ES create the documents with auto-generated ids), and then query all logs associated with a transaction like:
POST transactions/_search
{
  "sort": [
    {
      "createdDate": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "transactionId.keyword": "<transaction id>"
          }
        }
      ]
    }
  }
}
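For completeness, each log line would then be indexed as its own document with an auto-generated id, carrying the fields the query above relies on (field values here are purely illustrative):
POST transactions/_doc
{
  "transactionId": "tx-0001",
  "createdDate": "2023-01-01T10:00:00Z",
  "message": "payment authorized"
}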
Hope this helps.

Elasticsearch: document relationship

I'm building an Elasticsearch autocomplete-as-you-type feature.
I'm using features like ngrams and other tools to create the needed analyzer.
Currently I'm breaking my head over indexing the following data.
Let's say I have a Payments type,
each document in this type looks like this
{
  ..elastic meta data..
  paymentId: 123453425342,
  providerAccount: {
    id: 123456,
    firstName: Alex,
    lastName: Web
  },
  consumerAccount: {
    id: 7575757,
    firstName: John,
    lastName: Doe
  },
  amount: 556,
  date: 342523454235345 (some unix timestamp)
}
So basically this document represents not only the payment itself but also the payment's relationships: the two entities related to the payment.
A payment always has its provider and consumer.
I need this data in the payment document because I want to show it in the UI.
Indexing it like this could make handling updates of a Consumer or Provider a big pain, because each time one of them changes its properties I have to update all the payments that contain this entity.
Another possible solution is to store only the ids of these consumers/providers, query the payments, and then issue two more queries for the entities to retrieve the needed fields. But I'm not sure about this, because I issue an AJAX request each time a character is entered, so performance is a concern.
I have also looked into the parent/child relationship solution, which basically fits my case, but I wasn't able to figure out whether I can also retrieve the parent (consumer/provider) fields while querying the child (payment).
What would you suggest?
Thanks!
Yes, you can retrieve the parent while querying the child using has_child.
Considering payment as the child and consumer as the parent, you can search all the consumers with:
GET /index_name/consumer/_search
{
  "query": {
    "has_child": {
      "type": "payment",
      "query": {
        // any query on payment table
      },
      "inner_hits": {}
    }
  }
}
This will fetch all the consumers based on the query on the child, i.e. payment in your case.
inner_hits is what you are looking for: it will retrieve the children as well. It was introduced in Elasticsearch 1.5.0, so your version should be 1.5.0 or greater.
You can refer to https://www.elastic.co/blog/elasticsearch-1-5-0-released.
Your problem is not really an issue. I suppose you want to freeze the data after the payment, right? So you don't need to update the account data in existing payment documents.
Further: parent/child is easy for updating, but less efficient for querying. For autocomplete, stick with your current mapping!

ElasticSearch content ACL Filtering performance

Following is my content model.
Document(s) are associated with user & group acls defining the principals who have access to the document.
The document itself is a bunch of metadata & a large content body (extracted from pdfs/docs etc).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the acls on the document). He/She could have access to the document owing to user acls or owing to the group the user belongs to.
Both group membership and the acls on the document are highly transient in nature, meaning a user's group membership changes quite often, as do the ACLs on the document itself.
Approach 1
Store the acls on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since the acl can be a group).
At the time of query, append a filter to the user query which will do a bool filter to include only documents with the userid in the acl field
"filter" : {
        "query" : {
            "term": {
                "acls": "1234"
            }
        }
      }
The problem I see with this approach is that documents need to get re-indexed even though the document metadata/content has not changed:
Every time a user's group membership changes
Every time the ACL on the document changes (permission changed for the document)
I am assuming that this will lead to a large number of segment creations and merges, especially since the document body (one of the fields of the document) is a pretty large text section.
Approach 2:
This is a modification of approach 1 that attempts to limit updates on the document when the updates are strictly acl related.
Instead of having the acls defined in the metadata, this approach entails creating multiple types:
In the Document Index
  Document (with metadata & text body) as a parent
    - id
    - text
  userschild document (parent id & user acls only); this document will exist for each parent
    - id
    - parentid
    - useracls
  groupschild document (parent id & group acls only); this document will exist for each parent with group acls
    - id
    - parentid
    - groupacls
In the Users Index
  User (an entry for each user in the system, with the groups he/she is associated with)
    - id
    - groups
The idea here is that updates are now localized to the different Elasticsearch entities.
In case of user acl changes, only the userschild document will get updated (avoiding a potentially costly update on the parent document), as sketched below.
In case of group acl changes, only the groupschild document will get updated (again avoiding a potentially costly update on the parent document).
In case of user group membership changes, again only the secondary (users) index will get updated (avoiding the update on the parent document).
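A rough sketch of such a child-only acl update, using the legacy type-based parent/child APIs this design implies (the index name, document ids and acl values are made up; the parent parameter routes the update to the parent's shard):
POST documents/userschild/42-acl/_update?parent=42
{
  "doc": {
    "useracls": ["1234", "5678"]
  }
}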
The query itself will look as follows.
"filter" : {
"query" : {
"bool": {
"should": [
{
"has_child": {
"type": "userschild",
"query": {
"term": {
"users": "1234"
}
}
}
},{
"has_child": {
"type": "groupschild",
"query": {
"terms" : {
"groups" : {
"index" : "users",
"type" : "user",
"id" : "1234",
"path" : "groups"
}
}
}
}
}
]
}
}
}
I have doubts with regard to its scalability owing to the nature of the query involved: it requires two child queries, one of which is a terms lookup that has to be built from a separate index. I am considering improving the terms lookup by using fields with doc values enabled.
Will approach 2 scale? The concerns I have are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?
I think perhaps this is overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationships at all).
Users Index
(sample doc)
{
  "user_id": 12345,
  "user_name": "Swami PR",
  "user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
  "doc_id": 98765,
  "doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
  "doc_acl_read_users": [12345, 12346, 12347],
  "doc_acl_write_users": [12345],
  "doc_acl_read_groups": [435, 620],
  "doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs "Swami's" user_id and group_ids available when querying for documents.
Then, when you query [Top Secret] documents as Swami PR (to read), make sure to add:
"should": [
{
"term": {
"doc_acl_read_users": 12345
}
},
{
"terms": {
"doc_acl_read_groups": [900, 901, 902]
}
},
"minimum_should_match": 1
]
I can see 2 main types of updates that can happen here:
Users or Groups updated on Document := reindex one record in Document Index
User added to/removed from Group := reindex one record in User Index
With an edge case:
User or Group deleted := here you might want to batch through and reindex all documents periodically to clean out stale user/group identifiers... but theoretically, stale user/group identifiers won't exist in the application anymore, so they don't cause issues in the index.
I have implemented approach #2 at my company, and I am now researching a graph DB to handle ACLs. Once we crossed 4 million documents for one of our clients, the updates and search queries became quite frequent and didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.
