Why does Elasticsearch ignore_malformed add malformed value to index? - elasticsearch

I am using Serilog in C# to create a log file, which is ingested by Filebeat and sent via Logstash to Elasticsearch. The Elasticsearch indexes conform to ECS 1.5.
The log file sometimes contains erroneous values for the field "host.ip"; for example, it can contain values like "localhost:5000". This led to rejected log posts, since a string like that cannot be converted into an IP address. This is all expected, and the issue of correcting the log file is not in the scope of this question.
I decided to add the "ignore_malformed: true" setting, on the index level. After that, the log posts are no longer rejected - I can find them in Elasticsearch. So, the setting is proven to have had effect. BUT the field "host.ip" now actually contains the malformed value "localhost:5000". I can't see how that is even possible, it is not what I expected or wanted.
From the documentation of "ignore_malformed", it would appear that values that do not match the field type are supposed to be discarded, not written into the field. I also find no added "_ignored" field.
It's as if setting ignore_malformed to true actually allows the malformed data into the index, instead of dropping it. I'm expecting/wanting the field to be empty, if the value is malformed. Is this a bug, or am I missing something?

Whatever you send in the source document will always be there; Elasticsearch will never modify it. However, now that you are specifying ignore_malformed, Elasticsearch will no longer try to index the malformed value, but it will still be visible in your source document (_source). The field is simply not indexed, so you cannot search or aggregate on that malformed value.
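As a sketch (the index name my-logs is made up; adjust to your setup, and note this is the 7.x typeless mapping form), enabling ignore_malformed on the field mapping looks like this — a malformed value then stays in _source but is not indexed:

```
curl -XPUT "localhost:9200/my-logs" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "host": {
        "properties": {
          "ip": { "type": "ip", "ignore_malformed": true }
        }
      }
    }
  }
}'
```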

Related

How do I exclude/predefine fields for Index Patterns in Kibana?

I am using ELK to monitor REST API servers. Logstash decomposes the URL into a JSON object with fields for query parameters, header params, request duration, headers.
TLDR: I want all these fields retained so when I look at a specific message, I can see all the details. But I only need a few of them to query and generate reports/visualizations in Kibana.
I've been testing for a few weeks and adding some new fields on the server side. So whenever I do, I need to rescan the index. However the auto-detection now finds 300+ fields and I'm guessing it indexes all of them.
I would like to control it to just index a set of fields as I think the more it detects, the larger the index file gets?
It was about 300MB/day for a week (100-200 fields), and then when I added a new field I needed to refresh, it went to 350 fields; 1 GB/day. After I accidentally deleted the ELK instance yesterday, I redid everything and now the indexes are like 100MB/day so far which is why I got curious.
I found these docs but am not sure which ones are relevant or how they relate/need to be put together.
Mapping, index patterns, indices, templates/filebeats/rollup policy
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
https://discuss.elastic.co/t/index-lifecycle-management-for-existing-indices/181749/3
https://www.elastic.co/guide/en/elasticsearch/reference/7.3/indices-templates.html
(One has a PUT call that sends a huge JSON body, but I'm not sure how you would enter something like that in PuTTY. Postman/JMeter maybe, but these need to be executed on the server itself, which is just an SSH session with no GUI/text window.)
To remove fields from your log (since you are using Logstash), you can use the remove_field option of the Logstash mutate filter.
Ref: Mutate filter plugin
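A minimal sketch of such a filter in the Logstash pipeline config (the field names here are hypothetical; replace them with the fields you do not need in Kibana):

```
filter {
  mutate {
    # drop fields before they reach Elasticsearch,
    # so they are neither stored nor indexed
    remove_field => [ "headers", "request_duration_raw" ]
  }
}
```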

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue with the number of docs getting deleted in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing some non-zero values in the docs deleted column. I am unable to understand where this number comes from.
I tried to find out whether updating a doc first deletes it and then re-indexes it, so that the delete count gets increased that way. However, I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct: updates are the reason you see a non-zero deleted-documents count.
In Lucene there is no such thing as an update; documents in Lucene are immutable.
So how does Elasticsearch provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled to use the update feature. When you use the update API, Elasticsearch reads the _source to get all the fields with their existing values, replaces the values of only the fields sent in the update request, marks the existing document as deleted, and indexes a new document with the updated _source.
What is the advantage of this, if it's not an actual update?
It removes the overhead from the application of always assembling the complete document, even when only a small subset of fields needs updating. Rather than sending the full document, only the fields that need an update can be sent using the update API; the rest is taken care of by Elasticsearch.
It saves some network round-trips, reduces payload size and also reduces the chances of a version conflict.
You can read more about how updates work here.
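As a sketch (index, document id and field name are made up; this is the 7.x URL form — older versions use /index/type/id/_update), a partial update that sends only the changed field looks like this. Internally it marks the old document as deleted and indexes a new one, which is exactly what drives docs.deleted up:

```
curl -XPOST "localhost:9200/my-index/_update/1" -H 'Content-Type: application/json' -d'
{
  "doc": { "status": "processed" }
}'
```

The deleted documents are physically removed later, when Lucene merges segments in the background.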

elasticsearch / kibana, search for documents where message contains '=' char

i have an issue which i suspect is quite basic but i have been stuck on this for too long and i fear i am missing something so basic that i can't see it by now.
we are using the ELK stack today for log analysis of our application logs.
logs are created by the JAVA application into JSON format, shipped using filebeat into logstash which in turn processes the input and queues it into ES.
some of the messages contain unstructured data in the message field which i currently cannot parse into separate fields, so i need to catch them in the message field. the problem is this:
the string i need to catch is "57=1". this is an indicator i need to filter documents on: i need to get documents which contain this exact string.
no matter what i try i can't get kibana to match this. it seems to always ignore the equals char and match either 57 or 1.
please advise.
thanks
Check the Elasticsearch mapping for the type of the field in question. If the field is analyzed, the '=' has most likely not been indexed, because the default analyzer splits on it. (source 1, source 2)
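The standard analyzer tokenizes "57=1" into the terms 57 and 1, dropping the '='. If the mapping has a keyword (not_analyzed in older versions) sub-field — Logstash templates typically create message.keyword or message.raw — a sketch of an exact-substring filter is a wildcard query on that sub-field (index and field names are assumptions; check your mapping first):

```
curl -XGET "localhost:9200/my-logs/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "message.keyword": { "value": "*57=1*" }
    }
  }
}'
```

Note that wildcard queries with a leading * are slow on large indexes; parsing the value into its own field at ingest time remains the better long-term fix.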

Dealing with random failure datatypes in Elasticsearch 2.X

So I'm working on a system that logs bad data sent to an API, together with what the full request was. I would love to be able to see this in Kibana.
The issue is that the datatypes can be random, so when I send them to the bad_data field it fails if the value doesn't match the original mapping.
Does anyone have a suggestion for the right way to handle this?
(ES 2.X is required due to a sub-dependency)
You could use the ignore_malformed flag in your field mappings. In that case, values in the wrong format will not be indexed, but your document will still be saved.
See the Elasticsearch documentation for more information.
If you want to be able to query such fields as the original text, you can use multi-fields (the fields mapping parameter) to additionally index the raw value as a string, which gives you fast queries on the raw text values.
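A sketch of a 2.x mapping combining both ideas (index, type and field names are made up; 2.x still uses mapping types and the string field type):

```
curl -XPUT "localhost:9200/my-index" -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "bad_data": {
          "type": "integer",
          "ignore_malformed": true,
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'
```

With this, a document whose bad_data is not an integer is still stored, and bad_data.raw remains queryable as the original text.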

Where do .raw fields come from when using Logstash with Elasticsearch output?

When using Logstash and Elasticsearch together, fields with .raw are appended for analyzed fields, so that when querying Elasticsearch with tools like Kibana, it's possible to use the field's value as-is without per-word splitting and what not.
I built a new installation of the ELK stack with the latest greatest versions of everything, and noticed my .raw fields are no longer being created as they were on older versions of the stack. There are a lot of folks posting solutions of creating templates on Elasticsearch, but I haven't been able to find much information as to why this fixes things. In an effort to better understand the broader problem, I ask this specific question:
Where do the .raw fields come from?
I had assumed that Logstash was populating Elasticsearch with strings as-analyzed and strings as-raw when it inserted documents, but considering the fact that the fix lies in Elasticsearch templates, I question whether or not my assumption is correct.
You're correct in your assumption that the .raw fields are the result of a dynamic template for string fields contained in the default index template that Logstash creates IF manage_template: true (which it is by default).
The default template that Logstash creates (as of 2.1) can be seen here. As you can see on line 26, all string fields (except the message one) have a not_analyzed .raw sub-field created.
However, the template hasn't changed in the latest Logstash versions as can be seen in the template.json change history, so either something else must be wrong with your install or you've changed your Logstash config to use your own index template (without .raw fields) instead.
If you run curl -XGET localhost:9200/_template/logstash* you should see the template that Logstash has created.
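The relevant part of that default template is a dynamic template along these lines (an abridged sketch of the Logstash 2.x template, not the exact file — run the curl command above to see the real one on your cluster):

```
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "analyzed",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}
```

Any dynamically mapped string field therefore gets both an analyzed version and a not_analyzed .raw sub-field; if manage_template is disabled or the template was overridden, the .raw fields never get created.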