Use Kafka Connect to update Elasticsearch field on existing document instead of creating new

I have Kafka set up and running with the Elasticsearch connector, and I am successfully indexing new documents into an ES index based on the incoming messages on a particular topic.
However, based on incoming messages on another topic, I need to append data to a field on a specific document in the same index.
Pseudo-schema below:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": []
}
^ This document is being created fine in ES based on the data in the topic mentioned above.
However, how do I then add items to the views field using messages from another topic? Like so:
article-view topic schema:
{
  "article_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "user_id": 123456,
  "timestamp": 136389734
}
Instead of simply creating a new document in an article-view index (which I don't even want to have), it should append this to the views field of the article document whose _id equals the article_id from the message.
So the end result after one message would be:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": [
    {
      "user_id": 123456,
      "timestamp": 136389734
    }
  ]
}
Using the ES API this is possible with a script, like so:
{
  "script": {
    "lang": "painless",
    "params": {
      "newItems": [
        {
          "timestamp": 136389734,
          "user_id": 123456
        }
      ]
    },
    "source": "ctx._source.views.addAll(params.newItems)"
  }
}
I can generate scripts like above dynamically in bulk, and then use the helpers.bulk function in the ES Python library to bulk update documents this way.
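For reference, the raw _bulk request that helpers.bulk ends up sending for one such update would look roughly like this (a sketch, assuming the documents live in an index called articles):
POST articles/_bulk
{ "update": { "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc" } }
{ "script": { "lang": "painless", "source": "ctx._source.views.addAll(params.newItems)", "params": { "newItems": [ { "user_id": 123456, "timestamp": 136389734 } ] } } }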
Is this possible with Kafka Connect / Elasticsearch? I haven't found any documentation on Confluent's website to explain how to do this.
It seems like a fairly standard requirement and an obvious thing people would need to do with Kafka and a sink connector like Elasticsearch.
Thanks!

Edit: Partial updates are possible with write.method=upsert (src)
The Elasticsearch connector doesn't support this. You can update documents in place, but you need to send the full document, not a delta for appending, which I think is what you're after.
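For anyone going the upsert route, a minimal sink connector config might look something like the sketch below (the connector name, topic, and connection URL are placeholders; note that each Kafka record still has to carry the full document, which simply replaces the one keyed by the record key):
{
  "name": "articles-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "articles",
    "connection.url": "http://localhost:9200",
    "key.ignore": "false",
    "write.method": "upsert"
  }
}
You would POST this JSON to the Kafka Connect REST API to create the connector.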

Related

Calculate field data size and store to other field at indexing time ElasticSearch 7.17

I am looking for a way to store the size of a field (bytes) in a new field of a document.
I.e. when a document is created with a field message that contains the value hello, I want another field message_size_bytes to be written, which in this example would have the value 5.
I am aware of the possibilities using _update_by_query and _search with script fields, but I have so much data that I do not want to calculate the sizes at query time but rather at index time.
Is there a possibility to do this using Elasticsearch 7.17 only? I do not have access to the data before it's passed to elasticsearch.
You can use an ingest pipeline with a script processor.
You can create the pipeline using the command below:
PUT _ingest/pipeline/calculate_bytes
{
  "processors": [
    {
      "script": {
        "description": "Calculate bytes of message field",
        "lang": "painless",
        "source": """
          // String.length() counts characters, which matches the byte count for ASCII-only values like "hello"
          ctx['message_size_bytes'] = ctx['message'].length();
        """
      }
    }
  ]
}
After creating the pipeline, you can use the pipeline name while indexing data, like below (the same works in Logstash, Java, or any other client as well):
POST 74906877/_doc/1?pipeline=calculate_bytes
{
"message":"hello"
}
Result:
"hits": [
{
"_index": "74906877",
"_id": "1",
"_score": 1,
"_source": {
"message": "hello",
"message_size_bytes ": 5
}
}
]
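If you can't add ?pipeline=calculate_bytes to every request (the question says the data can't be touched before it reaches Elasticsearch), one option is to attach it as the index's default pipeline, for example:
PUT 74906877/_settings
{
  "index.default_pipeline": "calculate_bytes"
}
After that, every document indexed into that index goes through the pipeline automatically.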

How to query arbitrary fields in Kibana?

We're importing logs that contain the full request/response for a given endpoint, using the escLayout C# library. The import is working fine; however, the 'structured' part of the log is stored under metadata, like so:
"metadata": {
"event": {
"controller": "main",
"method": "GetData",
"request": {
"userId": 1,
"clientTypeId": 2
},
"response": {
"marketOpen": true,
"price": 18.56
}
}
}
How do I go about querying this metadata, as the fields do not appear on the 'Lens' page?
Is it a case of creating an index of some description? There are a lot of different (and occasionally large) data sets, so this seems really impractical.
Is querying 'ad-hoc' data like this not a good use of Kibana? Should I look elsewhere, say Grafana, before I go too far down the Elastic road?
Note: We're on Elastic 8.2.0

Elastic/Opensearch: HowTo create a new document from an _ingest/pipeline

I am working with Elastic/OpenSearch and want to create a new document in a different index from an _ingest/pipeline.
I have found no help on the web...
All my documents (Filebeat) get parsed and modified at the start by a pipeline, let's say "StartPipeline".
Triggered by information in a field of the incoming document, let's say "Start", I want to store that value in a special way by creating a new document in a different long-term index, with some more information from the triggering document.
I found possibilities for doing this manually from the console (update_by_query / reindex / Painless scripts), but it has to be triggered by an incoming document...
Perhaps this is easier to understand - in my head it looks something like this:
PUT _ingest/pipeline/StartPipeline
{
  "description" : "create a document in/to a different index",
  "processors" : [
    {
      "PutNewDoc" : {
        "if": "ctx.FieldThatTriggers == 'start'",
        "index": "DestinationIndex",
        "_id": "123",
        "document": {
          "message": "",
          "script": "start",
          "server": "alpha",
          ...
        }
      }
    }
  ]
}
Does anyone have an idea?
And sorry, I am not a native speaker; I am from Germany.

Elasticsearch document count doesn't reflect higher indexing rate

When we monitor our Elasticsearch cluster health through Kibana, we see a very high indexing rate for a particular index, but the document count doesn't seem to increase proportionally. How can we explain and tackle these two observations?
sample document
{
  "_index": "finance_report_fgl_reporting_log",
  "_type": "fgl_reporting_logs",
  "_id": "1907688_POINTS_ACCOUNT_DEBIT",
  "_score": 9.445704,
  "_source": {
    "reportingLogId": {
      "journalId": 1907688,
      "postingAccountId": "POINTS_ACCOUNT",
      "postingAccountingEntry": "DEBIT"
    },
    "journalId": 1907688,
    "journalEventId": "trip_completed",
    "journalEventLogId": "15db1f2b-b9d0-4edd-96f0-c4e4f8e68150",
    "journalAccountingRuleId": "trip_completed_points_payment_rule",
    "journalReferenceId": "174558200",
    "journalGrossAmount": 154.11,
    "postingJournalId": 1907688,
    "postingAccountingRuleId": "trip_completed_points_payment_rule",
    "postingReferenceId": "174558200",
    "postingAccountId": "POINTS_ACCOUNT",
    "postingAccountingPeriod": "2019_08",
    "postingAccountingEntry": "DEBIT",
    "postingCurrencyTypeId": "POINTS",
    "postingAmount": 154.11,
    "accountId": "POINTS_ACCOUNT",
    "accountStakeholderId": "OPERATOR",
    "accountCurrencyTypeId": "POINTS",
    "accountTypeId": "CONTROLLER",
    "accountingRuleId": "trip_completed_points_payment_rule",
    "accountingRuleDescription": "Points payment",
    "eventId": "trip_completed",
    "eventReferenceParam": "body.trip.id",
    "createdDate": "2019-08-29T10:03:32.000+0530",
    "modifiedDate": "2019-08-29T10:03:32.000+0530",
    "createdBy": "ENGINE",
    "modifiedBy": "ENGINE",
    "version": "3.12.6",
    "createYear": 2019,
    "routingKey": "_2019"
  }
}
This usually happens because your indexing operations do not create new documents but update existing ones, i.e. you're sending updates to an ID that already exists.
Every few hours, a new batch of documents is created (according to the jumps in the graphs) because you're creating a new set of IDs.
Make sure to verify how you're creating your IDs as the solution is hidden in there somewhere.
You might get some info by running GET _cat/indices?v and checking the "docs.deleted" column, as an update operation is merely a "create new + delete old" operation.
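A quick way to see this in action (a sketch using the _id pattern from the sample document; adjust the /_doc/ part if your cluster still uses the custom type shown above): index the same _id twice and watch the version go up instead of the count.
PUT finance_report_fgl_reporting_log/_doc/1907688_POINTS_ACCOUNT_DEBIT
{
  "journalId": 1907688,
  "postingAccountId": "POINTS_ACCOUNT",
  "postingAccountingEntry": "DEBIT"
}
The first call responds with "result": "created"; repeating it responds with "result": "updated" and an incremented _version, so the indexing rate rises while the document count stays flat.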

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with the EFK stack for logging, as described here. I have never worked with any of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect this /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entry that contains it and not just the specific value. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
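If you want to check what the grok pattern extracts before touching any real data, you can run the pipeline through the simulate API:
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}
The response shows the document as it would be indexed, including the extracted plz field.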
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1
