Elasticsearch Entity-centric indexing with Transforms

We're working with Elasticsearch for the first time and are currently deciding on the best solution for the problem at hand.
We receive event-based logs (in JSON form) from our applications directly into an Elasticsearch index. These logs are highly interconnected (they share a common unique ID), so we need to convert/aggregate them in an entity-centric fashion.
Each event usually records a status change in the target field, and there are more statuses than just start/stop. Each document also carries more data, which could be used to build more than one entity-centric index.
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStart": { "timestamp": "2020-06-01T13:50:55.000Z" }
  }
}
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" }
  }
}
We were already able to join these documents using Python or Logstash. We basically created an index that contains documents like the following:
{
  "uniqueID": "ain123in145512kn",
  "name": "Bob",
  "target": {
    "eventStart": { "timestamp": "2020-06-01T13:50:55.000Z" },
    "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" },
    "time_dif_Start_Stop": xxxx
  }
}
We assigned every event a document ID equal to its uniqueID, so later events automatically updated the same document. A subsequent step then calculated the difference between the eventStart and eventStop timestamps.
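For illustration, the merge we do externally boils down to an upsert keyed on the uniqueID; a minimal sketch of the equivalent request sent straight to Elasticsearch (events_joined is a placeholder name for the entity-centric index):
POST events_joined/_update/ain123in145512kn
{
  "doc": {
    "name": "Bob",
    "target": {
      "eventStop": { "timestamp": "2021-06-01T13:50:55.000Z" }
    }
  },
  "doc_as_upsert": true
}
Each incoming event would issue one such call, so the entity document grows as its events arrive.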
We have certain requirements for our pipeline, so we would prefer that the data never has to leave Elasticsearch. We are therefore wondering whether this can be done with any of the tools that already exist in the ELK stack or are hosted in Elastic Cloud. We tried using Transforms, but we were only able to calculate aggregated fields in a new index. Is it possible to also merge/update all the documents into a single one with this tool, or any other? That would be ideal for us, as it runs on a schedule and we would not need any external tools to modify documents.
Any other suggestions or help would also be greatly appreciated.

A transform sounds good. I tried the following quick example:
PUT test/_doc/1
{
  "uniqueID": "one",
  "eventStart": {
    "timestamp": "2020-06-01T13:50:55.000Z"
  }
}

PUT test/_doc/2
{
  "uniqueID": "one",
  "eventStop": {
    "timestamp": "2020-06-01T13:53:55.000Z"
  }
}

PUT test/_doc/3
{
  "uniqueID": "one",
  "eventStop": {
    "timestamp": "2020-06-01T13:54:55.000Z"
  }
}

PUT test/_doc/4
{
  "uniqueID": "other",
  "eventStop": {
    "timestamp": "2020-06-01T13:54:55.000Z"
  }
}
GET test/_mapping

POST _transform/_preview
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_transformed"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "uniqueID.keyword"
        }
      }
    },
    "aggregations": {
      "event_count": {
        "value_count": {
          "field": "_id"
        }
      },
      "start": {
        "min": {
          "field": "eventStart.timestamp"
        }
      },
      "stop": {
        "max": {
          "field": "eventStop.timestamp"
        }
      },
      "duration": {
        "bucket_script": {
          "buckets_path": {
            "start": "start.value",
            "stop": "stop.value"
          },
          "script": """
            return (params.stop - params.start) / 1000; // in seconds (initially in ms)
          """
        }
      }
    }
  }
}
Which generates the following result — aggregation and calculation look correct:
[
  {
    "duration" : 240.0,
    "stop" : "2020-06-01T13:54:55.000Z",
    "event_count" : 3,
    "start" : "2020-06-01T13:50:55.000Z",
    "id" : "one"
  },
  {
    "stop" : "2020-06-01T13:54:55.000Z",
    "event_count" : 1,
    "start" : null,
    "id" : "other"
  }
]
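To run this on a schedule instead of as a one-off preview, the same configuration can be created and started as a continuous transform. This is only a sketch: it assumes the source documents carry an ingest timestamp field such as @timestamp (not present in the toy documents above), and the event_count aggregation is omitted for brevity.
PUT _transform/test_transform
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_transformed"
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "uniqueID.keyword"
        }
      }
    },
    "aggregations": {
      "start": {
        "min": {
          "field": "eventStart.timestamp"
        }
      },
      "stop": {
        "max": {
          "field": "eventStop.timestamp"
        }
      },
      "duration": {
        "bucket_script": {
          "buckets_path": {
            "start": "start.value",
            "stop": "stop.value"
          },
          "script": "(params.stop - params.start) / 1000"
        }
      }
    }
  }
}

POST _transform/test_transform/_start
The transform then checkpoints on @timestamp every minute and keeps the entity-centric documents in test_transformed up to date.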
PS: I've turned the answer into a blog post that dives a little deeper into the general topic :)

Related

Transform in Elasticsearch not updating aggregated data

I am working on a scenario where I aggregate daily data per user. The data is processed in real time and stored in Elasticsearch, and I want to use an Elasticsearch feature to aggregate the data in real time. I've read about Transforms in Elasticsearch and found that this is what we need.
The problem is that when the source index is updated, the destination index, which is supposed to hold the aggregation, is not updated. This is the case I have tested:
source_index data model:
{
  "my_datetime": "2021-06-26T08:50:59",
  "client_no": "1",
  "my_date": "2021-06-26",
  "amount": 1000
}
and the transform I defined:
PUT _transform/my_transform
{
  "source": {
    "index": "source_index"
  },
  "pivot": {
    "group_by": {
      "client_no": {
        "terms": {
          "field": "client_no"
        }
      },
      "my_date": {
        "terms": {
          "field": "my_date"
        }
      }
    },
    "aggregations": {
      "sum_amount": {
        "sum": {
          "field": "amount"
        }
      },
      "count_amount": {
        "value_count": {
          "field": "amount"
        }
      }
    }
  },
  "description": "total amount sum per client",
  "dest": {
    "index": "my_analytic"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "my_datetime",
      "delay": "10s"
    }
  }
}
Now, when I add another document or update existing documents in the source index, the destination index is not updated and does not take the new documents into account.
Also note that the Elasticsearch version I'm using is 7.13.
I also changed the date field to a timestamp (epoch format, e.g. 1624740659000), but I still have the same problem.
What am I doing wrong here?
Could it be that your "my_datetime" is further in the past than the "delay": "10s" (plus the time of "frequency": "60s")?
The docs for sync.field note:
In general, it’s a good idea to use a field that contains the ingest timestamp. If you use a different field, you might need to set the delay such that it accounts for data transmission delays.
You might just need a higher delay.
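For example, if documents routinely arrive with a my_datetime that is a minute or two in the past, the checkpoint window has to cover that lag. A sketch of an adjusted sync block (120s is only an illustration; choose a value above your worst-case ingest delay):
"sync": {
  "time": {
    "field": "my_datetime",
    "delay": "120s"
  }
}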

How to count the total number of documents that have more than one object in an Elasticsearch array field?

The structure of the documents in my index is similar to:
{
"_index": "blabla",
"_type": "_doc",
"_source": {
"uid": 5366492,
"aField": "Hildegard",
"aNestedField": [{
"prop": {
"data": "xxxxxxx"
}
},
{
"prop": {
"data": "yyyyyyyy"
}
}
]
}
}
I would like to get the total number of documents in the whole index that have more than one object in the aNestedField field. So, the document above would be counted, because it has 2.
If my index has 100 documents and the above one is the only one with more than 2 objects in that field, I would expect my query to return 1.
Is there a way of doing this?
Updated after having read the comments.
The mapping for the field is:
{
"aNestedField": {
"properties": {
"prop": {
"properties": {
"data": {
"type": "text",
"index": false
}
}
}
}
}
}
The data will not be updated often, no need to worry about it.
Since the prop.data field is not being indexed ("index": false), you'll need at least one field inside of each aNestedField object that is being indexed -- either by explicitly setting "index": true or by not setting "index": false in its mapping.
Your docs should then look something like this:
{
  "uid": 5366492,
  "aField": "Hildegard",
  "aNestedField": [
    {
      "id": 1,        <--
      "prop": {
        "data": "xxxxxxx"
      }
    },
    {
      "id": 2,        <--
      "prop": {
        "data": "yyyyyyyy"
      }
    },
    {
      "id": 3,        <--
      "prop": {
        "data": "yyyyyyyy"
      }
    }
  ]
}
id is arbitrary -- use anything that makes sense.
After that you'll be able to query for docs with more than 2 array objects using:
GET /_search
{
  "query": {
    "script": {
      "script": "doc['aNestedField.id'].size() > 2"
    }
  }
}
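Since the goal is a total count rather than the documents themselves, the same script query should also work inside the _count API. A sketch using the index name blabla from the question (use > 1 instead of > 2 if documents with two or more objects should be counted):
GET blabla/_count
{
  "query": {
    "script": {
      "script": {
        "source": "doc['aNestedField.id'].size() > 2"
      }
    }
  }
}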

Elasticsearch Mapping - Rename existing field

Is there any way I can rename an element in an existing Elasticsearch mapping without having to add a new element?
If so, what's the best way to do it in order to avoid breaking the existing mapping?
e.g. from fieldCamelcase to fieldCamelCase
{
  "myType": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "fieldCamelcase": {
        "type": "string",
        "index": "not_analyzed"
      },
      "field_test": {
        "type": "double"
      }
    }
  }
}
You could do this by creating an ingest pipeline that contains a Rename processor, in combination with the Reindex API.
PUT _ingest/pipeline/my_rename_pipeline
{
  "description": "describe pipeline",
  "processors": [
    {
      "rename": {
        "field": "fieldCamelcase",
        "target_field": "fieldCamelCase"
      }
    }
  ]
}
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "my_rename_pipeline"
  }
}
Note that you need to be running Elasticsearch 5.x in order to use ingest. If you're running < 5.x, then you'll have to go with what @Val mentioned in his comment :)
Updating a field name in ES (version > 5, where missing has been removed) using the _update_by_query API:
Example:
POST http://localhost:9200/INDEX_NAME/_update_by_query
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "NEW_FIELD_NAME"
        }
      }
    }
  },
  "script": {
    "inline": "ctx._source.NEW_FIELD_NAME = ctx._source.OLD_FIELD_NAME; ctx._source.remove(\"OLD_FIELD_NAME\");"
  }
}
First of all, you must understand how Elasticsearch and Lucene store data: in immutable segments (you can easily read up on this online).
So any solution will remove/create documents and change the mapping, or create a new index and thus a new mapping as well.
The easiest way is to use the update by query API: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-update-by-query.html
POST /XXXX/_update_by_query
{
  "query": {
    "missing": {
      "field": "fieldCamelCase"
    }
  },
  "script": {
    "inline": "ctx._source.fieldCamelCase = ctx._source.fieldCamelcase; ctx._source.remove(\"fieldCamelcase\");"
  }
}
Starting with ES 6.4 you can use "Field Aliases", which allow the functionality you're looking for with close to 0 work or resources.
Do note that aliases can only be used for searching - not for indexing new documents.
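A minimal sketch of such an alias, assuming an index called myindex (a placeholder; on pre-7 clusters the mapping type would also appear in the URL):
PUT myindex/_mapping
{
  "properties": {
    "fieldCamelCase": {
      "type": "alias",
      "path": "fieldCamelcase"
    }
  }
}
Searches, aggregations and sorts can then refer to fieldCamelCase while the data stays stored under fieldCamelcase.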

How to sort items by array size in ElasticSearch?

I have 3 million items with this structure:
{
  "id": "some_id",
  "title": "some_title",
  "photos": [
    {...},
    {...},
    ...
  ]
}
Some items may have empty photos field:
{
  "id": "some_id",
  "title": "some_title",
  "photos": []
}
I want to sort by the number of photos so that items without photos end up at the end of the list.
I have one working solution, but it's very slow on 3 million items:
GET myitems/_search
{
  "filter": {
    ...some filters...
  },
  "sort": [
    {
      "_script": {
        "script": "_source.photos.size()",
        "type": "number",
        "order": "desc"
      }
    }
  ]
}
This query takes 55 seconds to execute. How can I optimize it?
As suggested in the comments, adding a new field with the number of photos would be the way to go. There's a way to achieve this without reindexing all your data by using the update by query plugin.
Basically, after installing the plugin, you can run the following query and all your documents will get that new field. However, make sure that your indexing process also populates that new field in the new documents:
curl -XPOST 'localhost:9200/myitems/_update_by_query' -d '{
  "query" : {
    "match_all" : {}
  },
  "script" : "ctx._source.nb_photos = ctx._source.photos.size();"
}'
After this has run, you'll be able to sort your results simply with:
"sort": {"nb_photos": "desc"}
Note: for this plugin to work, one needs to have scripting enabled, it is already the case for you since you were able to use a sort script, but I'm just mentioning this for completeness' sake.
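For example, a full search request would then look roughly like this (the original filters are left out for brevity):
GET myitems/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "nb_photos": {
        "order": "desc"
      }
    }
  ]
}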
The problem was solved with the transform mapping directive. Now I have this mapping:
PUT /myitems/_mapping/lol
{
  "lol": {
    "transform": {
      "lang": "groovy",
      "script": "ctx._source['has_photos'] = ctx._source['photos'].size() > 0"
    },
    "properties": {
      ... fields ...
      "photos": { "type": "object" },
      "has_photos": { "type": "boolean" }
      ... fields ...
    }
  }
}
Now I can sort items by photos existence:
GET /test/_search
{
  "sort": [
    {
      "has_photos": {
        "order": "desc"
      }
    }
  ]
}
Unfortunately, this requires a full reindex.

Aggregation over "LastUpdated" property or _timestamp

My Elasticsearch mapping looks roughly like this:
{
  "myIndex": {
    "mappings": {
      "myType": {
        "_timestamp": {
          "enabled": true,
          "store": true
        },
        "properties": {
          "LastUpdated": {
            "type": "date",
            "format": "dateOptionalTime"
          }
          /* lots of other properties */
        }
      }
    }
  }
}
So, _timestamp is enabled, and there's also a LastUpdated property on every document. LastUpdated can have a different value than _timestamp: sometimes, documents get updated physically (e.g. updates to denormalized data), which updates _timestamp, but LastUpdated remains unchanged because the document hasn't actually been "updated" from a business perspective.
Also, there are many documents without a LastUpdated value (mostly old data).
What I'd like to do is run an aggregation which counts the number of documents per calendar day (kindly ignore the fact that the dates need to be midnight-aligned). For every document, use LastUpdated if it's there, otherwise use _timestamp.
Here's what I've tried:
{
  "aggregations": {
    "counts": {
      "terms": {
        "script": "doc.LastUpdated == empty ? doc._timestamp : doc.LastUpdated"
      }
    }
  }
}
The bucketization appears to work to some extent, but the keys in the result look weird:
buckets: [
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#7ba1f463,
    doc_count: 300544
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#5a298acb,
    doc_count: 257222
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#6e451b5e,
    doc_count: 101117
  },
  ...
]
What's the proper way to run this aggregation and get meaningful keys (i.e. timestamps) in the result?
I've tested this and made a Groovy script for you:
POST index/type/_search
{
  "aggs": {
    "counts": {
      "terms": {
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();rv=v?:ts;rv",
        "lang": "groovy"
      }
    }
  }
}
This returns the required result.
Hope this helps!! Thanks!!
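If you need the counts per calendar day rather than one bucket per distinct timestamp, the same fallback script should also work as the value source of a date_histogram instead of a terms aggregation; a sketch along the same lines:
POST index/type/_search
{
  "aggs": {
    "counts_per_day": {
      "date_histogram": {
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();rv=v?:ts;rv",
        "lang": "groovy",
        "interval": "day"
      }
    }
  }
}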
