Remove fields by their name pattern - elasticsearch

We are currently using Logstash with Elasticsearch to log some of our application events.
Some events hold fields that are dynamically named.
We want to apply a filter that will remove or merge them before they enter Elasticsearch.
For example:
{
  "Root": {
    "EventType": "Info",
    "Timestamp": 20150713153757.758
  },
  "Event": {
    "Message": "itemsViews Created in 1 mSec",
    "Cache_11542": true,
    "Cache_10242": false,
    "Cache_55240": 124
  }
}
In this case we would like to remove all the fields starting with "Cache_" under the Event object, so the output sent to Elasticsearch will be:
{
  "Root": {
    "EventType": "Info",
    "Timestamp": 20150713153757.758
  },
  "Event": {
    "Message": "itemsViews Created in 1 mSec"
  }
}
Is there a way to define a filter in the Logstash configuration file to achieve this?
Many thanks in advance.

Looks like the Ruby filter solution that #magnus-bäck points out might be your solution. I had originally suggested the mutate filter's "remove_field" array in conjunction with its gsub option: gsub to regex-match your Cache_* fields, which could then be renamed into a variable for use in mutate. However, since you have an arbitrary number of Cache fields, I like the Ruby script better. :)
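For reference, here is a minimal sketch of such a Ruby filter (an assumption on my part, written against the newer event.get/event.remove API; on older Logstash releases the same idea would use event['Event'] instead):
filter {
  ruby {
    # Drop every field under [Event] whose name starts with "Cache_".
    code => "
      sub = event.get('[Event]')
      if sub
        sub.keys.each do |k|
          event.remove('[Event][' + k + ']') if k.to_s.start_with?('Cache_')
        end
      end
    "
  }
}
With this in place, the Cache_* fields are stripped before the event reaches the Elasticsearch output, while Message and everything under Root pass through untouched.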


Streamsets Data Collector: Replace a Field With Its Child Value

I have a data structure like this:
{
  "id": 926267,
  "updated_sequence": 2304899,
  "published_at": {
    "unix": 1589574240,
    "text": "2020-05-15 21:24:00 +0100",
    "iso_8601": "2020-05-15T20:24:00Z"
  },
  "updated_at": {
    "unix": 1589574438,
    "text": "2020-05-15 21:27:18 +0100",
    "iso_8601": "2020-05-15T20:27:18Z"
  }
}
I want to replace the updated_at field with the value of its unix child field using StreamSets Data Collector. As far as I know, this can be done using the Field Replacer, but I still haven't figured out how to write the mapping expression. How can I achieve that?
In Field Replacer, set Fields to /rec/updated_at and New value to ${record:value('/rec/updated_at/unix')} and it will replace the value.
Cheers,
Dash

Use Kafka Connect to update Elasticsearch field on existing document instead of creating new

I have a Kafka setup running with the Elasticsearch connector and I am successfully indexing new documents into an ES index based on the incoming messages on a particular topic.
However, based on incoming messages on another topic, I need to append data to a field on a specific document in the same index.
Pseudo-schema below:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": []
}
^ This document is being created fine in ES based on the data in the topic mentioned above.
However, how do I then add items to the views field using messages from another topic? Like so:
article-view topic schema:
{
  "article_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "user_id": 123456,
  "timestamp": 136389734
}
Instead of simply creating a new document in an article-view index (which I don't even want to have), it should append this to the views field of the article document whose _id equals the article_id from the message.
So the end result after one message would be:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": [
    {
      "user_id": 123456,
      "timestamp": 136389734
    }
  ]
}
Using the ES API it is possible with a script, like so:
{
  "script": {
    "lang": "painless",
    "params": {
      "newItems": [{
        "timestamp": 136389734,
        "user_id": 123456
      }]
    },
    "source": "ctx._source.views.addAll(params.newItems)"
  }
}
I can generate scripts like the one above dynamically in bulk, and then use the helpers.bulk function in the ES Python library to bulk-update documents this way.
Is this possible with Kafka Connect / Elasticsearch? I haven't found any documentation on Confluent's website that explains how to do this.
It seems like a fairly standard requirement and an obvious thing people would need to do with Kafka and a sink connector like the ES one.
Thanks!
Edit: Partial updates are possible with write.method=upsert (src)
The Elasticsearch connector doesn't support this. You can update documents in place, but you need to send the full document, not a delta for appending, which I think is what you're after.
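For reference, a minimal sketch of a sink configuration using that setting (the connector name, topic, and URL below are placeholders, not taken from the question):
{
  "name": "es-articles-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "articles",
    "connection.url": "http://localhost:9200",
    "key.ignore": "false",
    "write.method": "upsert"
  }
}
Note that upsert merges the fields of the incoming record into the document identified by the record key; it replaces whole fields rather than appending to an array, so the script-based approach above remains necessary for the views use case.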

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with the EFK stack for logging, as described here. I have never worked with any of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entries that contain it, and not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
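If you do end up setting the pipeline from Fluentd instead, a sketch of the output section could look like this (this assumes the fluent-plugin-elasticsearch output plugin in a version that supports the pipeline option; the match pattern and host are placeholders):
<match **>
  @type elasticsearch
  # host/port are placeholders for your Elasticsearch service
  host elasticsearch.example.svc
  port 9200
  logstash_format true
  # ships documents through the ingest pipeline defined above
  pipeline parse-plz
</match>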
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
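One extra detail, as an assumption on my side rather than part of the original setup: for Kibana's map visualizations to treat location as coordinates, the index mapping should declare it as a geo_point. You could fold that into the same index template (7.x syntax shown; older versions need a mapping type level):
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  },
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}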
So to sum it all up, it all boils down to only two things:
1. Create an ingest pipeline to parse documents as they get indexed
2. Create an index template for all project* indexes whose settings include the pipeline created in step 1

How do I use FreeFormTextRecordSetWriter

In my NiFi controller I want to configure the FreeFormTextRecordSetWriter, but I have no idea what I should put in the "Text" field. I'm getting the text from my source (in my case GetSolr), and just want to write this out, period.
The documentation and mailing list do not seem to tell me how this is done; any help appreciated.
EDIT: Here is the sample input + output I want to achieve (as you can see: no transformation needed, plain text, no JSON input)
EDIT: I now realize that I can't tell GetSolr to return just CSV data - I have to use JSON.
So referencing via attributes seems to be fine. What the documentation omits is that the ${flowFile} attribute should contain the complete FlowFile that is returned.
Sample input:
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "_": "1553686715465"
    }
  },
  "response": {
    "numFound": 3194,
    "start": 0,
    "docs": [
      {
        "id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}",
        "MimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "FileName": "Test.docx",
        "DateLastModified": "2019-03-27T08:05:00.103Z",
        "_version_": 1629145864291221504,
        "LAST_UPDATE": "2019-03-27T08:16:08.451Z"
      }
    ]
  }
}
Wanted output
{402EBE69-0000-CD1D-8FFF-D07756271B4E}
BTW: The documentation says this:
The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
I want to use my source's text, so I'm confused
You need to use expression language as if the record's fields are the FlowFile's attributes.
Example:
Input:
{
"t1": "test",
"t2": "ttt",
"hello": true,
"testN": 1
}
Text property in FreeFormTextRecordSetWriter:
${t1} k!${t2} ${hello}:boolean
${testN}Num
Output (using ConvertRecord):
test k!ttt true:boolean
1Num
EDIT:
Seems like what you needed was to read from Solr and write a single-column CSV. You need to use CSVRecordSetWriter for that. As for the schema,
I should tell you to consider upgrading to 1.9.1. Starting from 1.9.0, the schema can be inferred for you.
Otherwise, you can set Schema Access Strategy to Use 'Schema Text' Property
and then use the following schema in Schema Text:
{
  "name": "MyClass",
  "type": "record",
  "namespace": "com.acme.avro",
  "fields": [
    {
      "name": "id",
      "type": "string"
    }
  ]
}
This should work.
I'll edit it into my answer. If it works for you, please choose my answer :)

Specifying Field Types Indexing from Logstash to Elasticsearch

I have successfully ingested data using the XML filter plugin from Logstash to Elasticsearch; however, all the fields are of type "text".
Is there a way to manually or automatically specify the correct type?
I found the following technique good for my use case:
Logstash will filter the data and change a field from the default (text) to whatever type you want. The documentation can be found here. The example given in the documentation is:
filter {
  mutate {
    convert => { "fieldname" => "integer" }
  }
}
You add this in the /etc/logstash/conf.d/02-... file, inside the filter body. I believe the downside of this practice is that, from my understanding, it is less recommended to alter data on its way into ES.
After you do this you will probably run into this problem. If you do, and your DB is a test DB whose old data you can erase, just DELETE the index so that there is no conflict (for example, if a field was text until now and is now received as a date, old and new data would conflict). If you can't simply erase the old data, read the answer in the link above.
What you want to do is specify a mapping template.
PUT _template/template_1
{
  "index_patterns": ["te*", "bar*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
Change the settings to match your needs, such as listing under properties the fields you want mapped and the types you want them mapped to.
Setting index_patterns is especially important because it tells Elasticsearch which indexes to apply this template to. You can set an array of index patterns and use * as a wildcard where appropriate. For example, Logstash's default is to rotate indexes by date; they will look like logstash-2018.04.23, so your pattern could be logstash-* and any index that matches the pattern will receive the template.
If you want to match based on some pattern, then you can use dynamic templates.
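For instance, a sketch of a dynamic template that maps any string field ending in _id to keyword (the field pattern and template name are just illustrations; the type1 level matches the pre-7.x syntax used above):
PUT _template/template_2
{
  "index_patterns": ["logstash-*"],
  "mappings": {
    "type1": {
      "dynamic_templates": [
        {
          "ids_as_keywords": {
            "match": "*_id",
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}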
Edit: Adding a little update here: if you want Logstash to apply the template for you, here is a link to the settings you'll want to be aware of.
