Add an object value to a field to Elastic Search during ingest and drop empty valued fields all during ingest - elasticsearch

I am ingesting csv data into elasticsearch using the append processor. I already have two fields that are objects (object1 and object2) and I want to append them both into an array of a different field (mainlist). So it would come out as mainlist:[ {object1}, {object}] I have tried the set processor with the copy_from parameter and I am getting an error that I am missing the required property name "value" even though the ElasticSearch documentation clearly doesn't use the "value" property when it uses the "copy_from". {"set": {"field": "mainlist","copy_from": ["object1", "object"]}}. My syntax is even copied exactly from the documentation. Please help.
Furthermore I need to drop empty fields at the ingest level so they are not returned. I don't wish to have "fieldname: "", returned to the user. What is the best way to do that. I am new to ElasticSearch and it has not been going well.

As to dropping the empty fields at ingest level -- set up a pipeline:
PUT _ingest/pipeline/no_empty_fields
{
"description": "Removes empty-ish fields from a doc",
"processors": [
{
"script": {
"source": """
def keys_to_remove = ctx.keySet()
.stream()
.filter(field -> ctx[field] == null ||
ctx[field] == "")
.collect(Collectors.toList());
for (key in keys_to_remove) {
ctx.remove(key);
}
"""
}
}
]
}
and apply it upon indexing
POST myindex/_doc?pipeline=no_empty_fields
{
"fieldname23": 123,
"fieldname": null,
"fieldname123": ""
}
You can of course extend the conditions to ditch other fields such as "undefined", "Infinity" and others.

Related

ElasticSearch field with different types in one single index

We have a scenario in a service that accepts multiple types of data and we want to store in ElasticSearch so we can benefit from its search capabilities.
Data could be a String, Number, Object or an Array of objects as the following:
POST my-index/_doc/1
{
"additionalData": [
{
"values": {
"some-field": "some-value",
"some-other-field": "some-value"
}
}
]
}
POST my-index/_doc/1
{
"additionalData": [
{
"values": [12345, 9875]
}
]
}
POST my-index/_doc/1
{
"additionalData": [
{
"values": "Some text"
}
]
}
Is there a way to store that in elasticSearch? or better to store in other NoSQL Databases like Mongodb?
PS: we are using Es 7.x, and would like to keep using ES.
If you don't need to search on those values, it's possible with a disabled field (i.e. not indexed, not stored)
However, if you want to search on those value, it's not possible. Each field must have a specific type (object, numeric, text, etc) and then you can only store values of that type in the field.

Insert data when no match by update_by_query in elastic search

I have this command that don't match any data in elastic search and I want to insert it after that.
//localhost:9200/my_index/my_topic/_update_by_query
{
"script": {
"source": "ctx._source.NAME = params.NAME",
"lang": "painless",
"params": {
"NAME": "kevin"
}
},
"query": {
"terms": {
"_id": [
999
]
}
}
}
I try using upsert but it return errors Unknown key for a START_OBJECT in [upsert].
I don't want using update + doc_as_upsert cause I have a case that I will don't send id in my update query.
How can I insert this with update_by_query. Thank you.
If elastic search don't support. I think I will check condition if have id or not, and use indexAPI to create and update to update.
_update_by_query runs on existing documents contained in an existing index. What _update_by_query does is scroll over all documents in your index (that optionally match a query) and perform some logic on each of them via a script or an ingest pipeline.
Hence, logically, you cannot create/upsert data that doesn't already exist in the index. The Index API will always overwrite your document. Upsert only works with in conjunction with the _update endpoint, which is what you should probably do.

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster setup with EFK stack for logging, as described here. I have never worked with one of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect this /\"plz\":\"[0-9]{5}\"/ to deliver me the result, but I get 0 hits (time interval is set correctly). Even if this regexp matches, I would only find the log entry where this is contained and not just the specifc value. How do I go on here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
}
]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
"index_patterns": ["project.foo*"],
"settings": {
"index.default_pipeline": "parse-plz"
}
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
"processors": [
{
"grok": {
"field": "message",
"patterns": [
"%{POSINT:plz}"
]
}
},
{
"script": {
"lang": "painless",
"source": "ctx.location = params[ctx.plz];",
"params": {
"12345": {"lat": 42.36, "lon": 7.33}
}
}
}
]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345",
"location": {
"lat": 42.36,
"lon": 7.33
}
}
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

How to rename a nested field containing dots with elasticsearch rename processor and ingest pipeline

I have a field in elasticsearch (5.5.1) which I need to rename because the name contains a '.' and it is causing various problems. The field I want to rename is nested inside another field.
I am trying to use a Rename Processor in an Ingest Pipeline to do a Reindex as described here: https://stackoverflow.com/a/43142634/5114
Here is my pipeline simulation request (you can copy this verbatim into the Dev Tools utility in Kibana to test it):
POST _ingest/pipeline/_simulate
{
"pipeline" : {
"description": "rename nested fields to remove dot",
"processors": [
{
"rename" : {
"field" : "message.message.group1",
"target_field" : "message_group1"
}
},
{
"rename" : {
"field" : "message.message.group2",
"target_field" : "message.message_group2"
}
}
]
},
"docs":[
{
"_type": "status",
"_id": "1509533940000-m1-bfd7183bf036bd346a0bcf2540c05a70fbc4d69e",
"_version": 5,
"_score": null,
"_source": {
"message": {
"_job-id": "AV8wHJEaa4J0sFOfcZI5",
"message.group1": 0,
"message.group2": "foo"
},
"timestamp": 1509533940000
}
}
]
}
The problem is that I get an error when trying to use my pipeline:
{
"docs": [
{
"error": {
"root_cause": [
{
"type": "exception",
"reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"header": {
"processor_type": "rename"
}
}
],
"type": "exception",
"reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "field [message.message.group1] doesn't exist"
}
},
"header": {
"processor_type": "rename"
}
}
}
]
}
I think the problem is caused by the field "message.group1" being inside another field ("message"). I'm not sure how to refer to the field I want in the context of the processor. It seems that there could be ambiguity between cases of nested fields, fields containing dots and nested fields containing dots.
I'm looking for the correct way to reference these fields, or if Elasticsearch can not do what I want, confirmation that this is not possible. If Elasticsearch can do this, then it will probably go very fast, else I have to write an external script to pull the documents, transform them, and re-save them to the new index.
Ok, investigating in the Elasticsearch code, I think I know why this won't work.
First we look at the Elasticsearch Rename Processor:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RenameProcessor.java#L76-L84
Object value = document.getFieldValue(field, Object.class);
document.removeField(field);
try {
document.setFieldValue(targetField, value);
} catch (Exception e) {
// setting the value back to the original field shouldn't as we just fetched it from that field:
document.setFieldValue(field, value);
throw e;
}
What this is doing is looking for the field to rename, getting its value, then removing the field and adding a new field with the same value but with the new name.
Now we look at what happens in document.getFieldValue:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L101-L108
public <T> T getFieldValue(String path, Class<T> clazz) {
FieldPath fieldPath = new FieldPath(path);
Object context = fieldPath.initialContext;
for (String pathElement : fieldPath.pathElements) {
context = resolve(pathElement, path, context);
}
return cast(path, context, clazz);
}
Notice it uses a FieldPath object to represent the path to the field in the document.
Now look at how the FieldPath represents the path:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L688
this.pathElements = newPath.split("\\.");
This is splitting the path on any "." character, because that is the delimiter between path elements in field names.
The problem is that the source document has a field named "message.group1", so we need to be able to reference that. Just splitting the path on "." does not account for field names containing a "." in the name. We would need a syntax more like javascript for that, where we could use brackets and quotes to make the dot mean something different.
If the source documents were all transformed so that a "." in the field name would turn that field into an object before saving, then this path scheme would work. But with source documents having field names containing "." we can not reference them in certain contexts.
To solve my problem and reindex my index, I wrote a python script which pulled a batch of documents, transformed them and bulk inserted them in a new index. This is basically what the Elasticsearch reindex api does, but I did it in python instead.
More than two year later, I come across the same issue. You can manage to have your dotted-properties expanded to real nested objects with the the dot_expander processor.
Expands a field with dots into an object field. This processor allows fields with dots in the name to be accessible by other processors in the pipeline. Otherwise these fields can’t be accessed by any processor
Issue 37507 on Elasticsearch's Github pointed me in the right direction.

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception if inserting data to a field which does not fit the existing type. For example, if a field has been created as number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling then ignore_malformed setting, which means such fields are silently ignored for indexing purposes, but retained in the _source document - meaning that the invalid values cannot be searched or aggregated, but are still included in the returned document.
This is preferable behavior in our use case, but we would wish to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find document where this field does not exist, see this example
PUT foo
{
"mappings": {
"bar": {
"properties": {
"baz": {
"type": "integer",
"ignore_malformed": true
}
}
}
}
}
PUT foo/bar/1
{
"baz": "field"
}
GET foo/bar/_search
{
"query": {
"bool": {
"filter": {
"bool": {
"must_not": [
{
"exists": {
"field": "baz"
}
}
]
}
}
}
}
}
There is no dedicated mechanism though, so this search finds also documents where the field is not set intentionally
You cannot, when you search on elasticsearch, you don't search on document source but on the inverted index, which contains the analyzed data.
ignore_malformed flag is saying "always store document, analyze if possible".
You can try, create a mal-formed document, and use _termvectors API to see how the document is analyzed and stored in the inverted index, in a case of a string field, you can see an "Array" is stored as an empty string etc.. but the field will exists.
So forget the inverted index, let's use the source!
Scroll all your data until you find the anomaly, I use a small python script that search scroll, unserialize and I test field type for every documents (very long) but I can have a list of wrong document IDs.
Use a script query can be very long and crash your cluster, use with caution, maybe as a post_filter:
Here I want to retrieve the document where country_name is not a string:
{
"_source": false,
"timeout" : "30s",
"query" : {
"query_string" : {
"query" : "locale:de_ch"
}
},
"post_filter": {
"script": {
"script": "!(_source.country_name instanceof String)"
}
}
}
"_source:false" => I want only document ID
"timeout" => prevent crash
As you notice, this is a missing feature, I know logstash will tag
document that fail, so elasticsearch could implement the same thing.

Resources