Calculate field data size and store it in another field at indexing time (Elasticsearch 7.17)

I am looking for a way to store the size of a field (bytes) in a new field of a document.
I.e. when a document is created with a field message that contains the value hello, I want another field message_size_bytes written, which in this example would have the value 5.
I am aware of the possibilities using _update_by_query and _search with scripted fields, but I have so much data that I do not want to calculate the sizes at query time; I want them computed at index time.
Is there a possibility to do this using Elasticsearch 7.17 only? I do not have access to the data before it is passed to Elasticsearch.

You can use an ingest pipeline with a script processor.
You can create the pipeline using the command below:
PUT _ingest/pipeline/calculate_bytes
{
"processors": [
{
"script": {
"description": "Calculate bytes of message field",
"lang": "painless",
"source": """
// note: String.length() counts characters, not bytes; for plain ASCII like hello the two coincide
ctx['message_size_bytes'] = ctx['message'].length();
"""
}
}
]
}
After creating the pipeline, you can use the pipeline name while indexing data like below (the same works from Logstash, Java, or any other client as well):
POST 74906877/_doc/1?pipeline=calculate_bytes
{
"message":"hello"
}
Result:
"hits": [
{
"_index": "74906877",
"_id": "1",
"_score": 1,
"_source": {
"message": "hello",
"message_size_bytes ": 5
}
}
]
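Since you mentioned you cannot touch the data before it reaches Elasticsearch, you can also attach the pipeline as the index's default pipeline, so it runs for every document regardless of which client indexes it and without having to pass ?pipeline= on each request. A minimal sketch (74906877 is just the example index name used above):
PUT 74906877/_settings
{
  "index.default_pipeline": "calculate_bytes"
}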

Related

How to make _source field dynamic in elasticsearch search template?

When using a search query in Elasticsearch, we define which fields we want in the response:
"_source": ["name", "age"]
And when working with search templates, we have to set the _source fields value when inserting the search template into the ES cluster:
"_source": ["name", "age"]
But the problem with the search template is that it will always return name and age, and to get other fields we have to change the search template accordingly.
Is there any way we can pass the fields from the client so that the response only returns the fields the user asked for?
I have achieved that for just one field: if you do this
"_source": "{{field}}"
then while searching the index via the template you can do this:
POST index_name/_search/template
{
"id": template_id,
"params": {
"field": "name"
}
}
This search query returns the name field in the response, but I could not find a way to pass the fields as an array or in another format so that I can get multiple fields.
Absolutely!!
Your search template should look like this:
"_source": {{#toJson}}fields{{/toJson}}
And then you can call it like this:
POST index_name/_search/template
{
"id": template_id,
"params": {
"fields": ["name"]
}
}
What it's going to do is transform the params.fields array into JSON, so the generated query will look like this:
"_source": ["name"]

Use Kafka Connect to update Elasticsearch field on existing document instead of creating new

I have Kafka set-up running with the Elasticsearch connector and I am successfully indexing new documents into an ES index based on the incoming messages on a particular topic.
However, based on incoming messages on another topic, I need to append data to a field on a specific document in the same index.
Pseudo-schema below:
{
"_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"title": "A title",
"body": "A body",
"created_at": 164584548,
"views": []
}
^ This document is being created fine in ES based on the data in the topic mentioned above.
However, how do I then add items to the views field using messages from another topic? Like so:
article-view topic schema:
{
"article_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"user_id": 123456,
"timestamp: 136389734
}
Instead of simply creating a new document in an article-view index (which I don't even want to have), it should append this to the views field of the article document whose _id equals the article_id from the message.
so the end result after one message would be:
{
"_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"title": "A title",
"body": "A body",
"created_at": 164584548,
"views": [
{
"user_id": 123456,
"timestamp: 136389734
}
]
}
Using the ES API directly, it is possible with a script, like so:
{
"script": {
"lang": "painless",
"params": {
"newItems": [{
"timestamp": 136389734,
"user_id": 123456
}]
},
"source": "ctx._source.views.addAll(params.newItems)"
}
}
I can generate scripts like above dynamically in bulk, and then use the helpers.bulk function in the ES Python library to bulk update documents this way.
Is this possible with Kafka Connect / Elasticsearch? I haven't found any documentation on Confluent's website to explain how to do this.
It seems like a fairly standard requirement and an obvious thing people would need to do with Kafka and a sink connector like ES.
Thanks!
Edit: Partial updates are possible with write.method=upsert (src)
The Elasticsearch connector doesn't support this. You can update documents in place, but you need to send the full document, not a delta for appending, which I think is what you're after.
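Following up on that edit, a sketch of what a sink connector configuration using write.method=upsert could look like (the connector name, topic, and connection URL below are made up for illustration):
{
  "name": "es-sink-articles",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "articles",
    "connection.url": "http://localhost:9200",
    "type.name": "_doc",
    "key.ignore": "false",
    "write.method": "upsert"
  }
}
With upsert, the connector issues partial updates keyed by the Kafka record key, merging the fields of the incoming message into the existing document (or creating it if missing). It still replaces a field like views wholesale rather than appending to it, which is why the scripted-append use case above isn't covered.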

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with the EFK stack for logging, as described here. I have never worked with any of these components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). And even if this regexp matched, I would only find the log entry that contains it, not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
}
]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
"index_patterns": ["project.foo*"],
"settings": {
"index.default_pipeline": "parse-plz"
}
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
},
{
"script": {
"lang": "painless",
"source": "ctx.location = params[ctx.plz];",
"params": {
"12345": {"lat": 42.36, "lon": 7.33}
}
}
}
]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345",
"location": {
"lat": 42.36,
"lon": 7.33
}
}
So to sum it all up, it boils down to two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1
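If you want to verify what the pipeline produces before touching the Fluentd configuration or index settings, you can feed a sample log line through the Simulate Pipeline API, for example:
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}
The response shows the transformed document, so you can confirm that plz (and later location) is extracted as expected.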

Elasticsearch query to get results irrespective of spaces in search text

I am trying to fetch data from Elasticsearch by matching on a field called name. I have the following two records:
{
"_index": "sam_index",
"_type": "doc",
"_id": "key",
"_version": 1,
"_score": 2,
"_source": {
"name": "Sample Name"
}
}
and
{
"_index": "sam_index",
"_type": "doc",
"_id": "key1",
"_version": 1,
"_score": 2,
"_source": {
"name": "Sample Name"
}
}
When I try to search using text like sam, sample, Sa, etc., I am able to fetch both records using a match_phrase_prefix query. The query I tried with match_phrase_prefix is:
GET sam_index/doc/_search
{
"query": {
"match_phrase_prefix" : {
"name": "sample"
}
}
}
I am not able to fetch the records when I search with the string samplen. I need to search and get results irrespective of spaces between words. How can I achieve this in Elasticsearch?
First, you need to understand how Elasticsearch works and why it returns results in some cases and not in others.
ES works on token matching: documents you index in ES go through an analysis process, and the tokens generated by that process are stored in an inverted index, which is used for searching.
Now when you make a query, that query also generates search tokens; these are either taken as-is from the search query (in the case of a term query) or generated by the analyzer defined on the search field (in the case of a match query). Hence it's very important to understand the internals of your search query.
Also, it's very important to understand the mapping of your index; ES uses the standard analyzer by default on text fields.
You can use the Explain API to understand the internals of the query, such as which search tokens are generated by your search query, how documents matched them, and on what basis the score is calculated.
In your case, I created the name field as text with the word-joining analyzer explained in Ignore spaces in Elasticsearch, and I was able to get the document containing Sample Name when searching for samplen.
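For illustration, here is a sketch of one way such a space-ignoring analyzer can be set up; this is only one possible implementation of the idea, not necessarily the exact analyzer from the linked answer, and the mapping is shown typeless, so adjust it for your ES version:
PUT sam_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "space_insensitive": {
          "type": "custom",
          "char_filter": ["remove_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "space_insensitive"
      }
    }
  }
}
With this mapping, Sample Name is indexed as the single token samplename, and a match_phrase_prefix query for samplen analyzes to the prefix samplen, which matches that token.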
Let us know if you also want to achieve the same and if it solves your issue.

Truncate and Index String values in Elasticsearch 2.3.x

I am running ES 2.3.3. I want to index a non-analyzed String but truncate it to a certain number of characters. The ignore_above property, according to the documentation, will NOT index a field above the provided value. I don't want that. I want to take, say, a field that could potentially be 30K characters long and truncate it to 10K, but still be able to filter and sort on the 10K that is retained.
Is this possible in ES 2.3.3, or do I need to do this in Java prior to indexing the document?
I want to index a non-analyzed String but truncate it to a certain number of characters.
Technically it's possible with the Update API and the upsert option, but, depending on your exact needs, it may not be very handy.
Let's say you want to index this document:
{
"name": "foofoofoofoo",
"age": 29
}
but you need to truncate the name field so that it has only 5 characters. Using the Update API, you'd have to execute a script:
POST http://localhost:9200/insert/test/1/_update
{
"script" : "ctx._source.name = ctx._source.name.substring(0,5);",
"scripted_upsert": true,
"upsert" : {
"name": "foofoofoofoo",
"age": 29
}
}
It means that, if ES does not find the document with the given id (here id=1), it should index the document inside the upsert element and then perform the given script. So as you can see, it's rather inconvenient if you want automatically generated ids, as you have to provide the id in the URI.
Result:
GET http://localhost:9200/insert/test/1
{
"_index": "insert",
"_type": "test",
"_id": "1",
"_version": 1,
"found": true,
"_source": {
"name": "foofo",
"age": 29
}
}
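If you also need to truncate documents that are already indexed (rather than only new ones), ES 2.3 ships the Update By Query API, which can apply the same kind of script across the whole index. A rough sketch, assuming inline scripting is enabled on your 2.3 cluster (as it also must be for the upsert example above), with the same insert index as before:
POST insert/_update_by_query
{
  "script": {
    "inline": "if (ctx._source.name.length() > 5) { ctx._source.name = ctx._source.name.substring(0, 5) }"
  },
  "query": {
    "match_all": {}
  }
}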
