Indexing metadata field in ElasticSearch - elasticsearch

I have a metadata field inside the model I'm indexing, but when I index a field inside metadata that was previously indexed as another type, I get a "no mapping" error... How can I disable the dynamic mapping of the metadata field only?
If I previously indexed this document:
{
...
"metadata": {
"key": {
"value": "test"
}
},
...
}
Then, if I index this document:
{
...
"metadata": {
"key": "test"
},
...
}
I get the "tried to parse as object, but got EOF, has a concrete value been provided to it?" error because metadata[key] is no longer an object. But this might happen when indexing metadata.
Thanks,
Pedro

Related

Update restrictions on Elasticsearch Object type field

I have to store documents with a single field contains a single Json object. this object has a variable depth and variable schema.
I config a mapping like this:
"mappings": {
"properties": {
"#timestamp": {
"type": "date"
},
"message": {
"type": "object"
}
}
}
It works fine and Elasticsearch creates and updates mapping with documents that received.
The problem is that after some updates in mapping, it rejects new documents and do not update mapping anymore. At this time I change the indices and mapping update occurred for that indies. I'm looking forward to know the right solution.
for example the first document is:
{
personalInfo:{
fistName: "tom"
}
moviesStatistics: {
count: 100
}
}
the second document that will update Elasticsearch mapping is:
{
personalInfo:{
fistName: "tom",
lastName: "hanks"
},
moviesStatistics: {
count: 100
},
education: {
title: "a title..."
}
}
Elasticsearch creates mapping with doc1 and updates it with doc2, doc3, ... until a number of documents received. After that it starts to reject every document that is not matched to the last mapping fields.
After all I found the solution in the home page of Elasticsearch https://www.elastic.co/guide/en/elasticsearch/reference/7.13//dynamic-field-mapping.html
We can use Dynamic mapping and simply use this mapping:
"mappings": {
"dynamic": "true"
}
You should also change some default restrictions that mentioned here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.13//mapping-settings-limit.html

Elasticsearch nested objects with query_string as first class attributes

I'm trying to index a nested field as a first-class attribute in my document so that I can search them using query_string without dot syntax.
For example, if I have a document like
"data": { "name": "Bob" }
instead of searching for data.name:Bob I would like to be able to search for name:Bob
The root of my issue is that we index a jsonb column that may have varying attributes. In some instances the data property may contain a data.business attribute, etc. I would like users to be able to search on these attributes without needing to "dig" into the object.
The data field does not have to be indexed as a nested type unless necessary; I was indexing it as an object previously.
I have tried to leverage the _all field as suggested in this post.
I have also tried to use include_in_parent:true and set the datatype as nested for my data field as suggested in this post.
I have also looked into the inner_hits feature to no avail.
Here's an example of my mapping for the data attribute.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"data": {
"type": "object"
}
}
}
}
}
Example document
PUT my_index/_doc/1
{
"data": {
name: "bob",
business: "None of yours"
}
}
And how my query currently looks:
GET my_index/_search
{
"query": {
"query_string": {
"query": "name:bob",
"fields": ["data.*"]
}
}
}
With the current setup I almost get my desired results. I can search on individual properties like data.name:bob and data.business:"None of yours" and get back the correct documents.
However I want to be able to get the exact same results with business:"None of yours" or name:bob.
Thanks in advance for any help!
I figured it out using dynamic templates. For anyone coming across this in the future, here is how I solved the issue:
I used path_match to match the data object (data.*).
Then using copy_to and {name} I dynamically created top-level fields on my parent object.
{
"dynamic_templates":[
{"template_1":
{"mapping":
{"copy_to":"{name}"},
"path_match":"data.*"
}
}
]
}

How to rename a nested field containing dots with elasticsearch rename processor and ingest pipeline

I have a field in elasticsearch (5.5.1) which I need to rename because the name contains a '.' and it is causing various problems. The field I want to rename is nested inside another field.
I am trying to use a Rename Processor in an Ingest Pipeline to do a Reindex as described here: https://stackoverflow.com/a/43142634/5114
Here is my pipeline simulation request (you can copy this verbatim into the Dev Tools utility in Kibana to test it):
POST _ingest/pipeline/_simulate
{
"pipeline" : {
"description": "rename nested fields to remove dot",
"processors": [
{
"rename" : {
"field" : "message.message.group1",
"target_field" : "message_group1"
}
},
{
"rename" : {
"field" : "message.message.group2",
"target_field" : "message.message_group2"
}
}
]
},
"docs":[
{
"_type": "status",
"_id": "1509533940000-m1-bfd7183bf036bd346a0bcf2540c05a70fbc4d69e",
"_version": 5,
"_score": null,
"_source": {
"message": {
"_job-id": "AV8wHJEaa4J0sFOfcZI5",
"message.group1": 0,
"message.group2": "foo"
},
"timestamp": 1509533940000
}
}
]
}
The problem is that I get an error when trying to use my pipeline:
{
"docs": [
{
"error": {
"root_cause": [
{
"type": "exception",
"reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"header": {
"processor_type": "rename"
}
}
],
"type": "exception",
"reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "java.lang.IllegalArgumentException: field [message.message.group1] doesn't exist",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "field [message.message.group1] doesn't exist"
}
},
"header": {
"processor_type": "rename"
}
}
}
]
}
I think the problem is caused by the field "message.group1" being inside another field ("message"). I'm not sure how to refer to the field I want in the context of the processor. It seems that there could be ambiguity between cases of nested fields, fields containing dots and nested fields containing dots.
I'm looking for the correct way to reference these fields, or if Elasticsearch can not do what I want, confirmation that this is not possible. If Elasticsearch can do this, then it will probably go very fast, else I have to write an external script to pull the documents, transform them, and re-save them to the new index.
Ok, investigating in the Elasticsearch code, I think I know why this won't work.
First we look at the Elasticsearch Rename Processor:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RenameProcessor.java#L76-L84
Object value = document.getFieldValue(field, Object.class);
document.removeField(field);
try {
document.setFieldValue(targetField, value);
} catch (Exception e) {
// setting the value back to the original field shouldn't as we just fetched it from that field:
document.setFieldValue(field, value);
throw e;
}
What this is doing is looking for the field to rename, getting its value, then removing the field and adding a new field with the same value but with the new name.
Now we look at what happens in document.getFieldValue:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L101-L108
public <T> T getFieldValue(String path, Class<T> clazz) {
FieldPath fieldPath = new FieldPath(path);
Object context = fieldPath.initialContext;
for (String pathElement : fieldPath.pathElements) {
context = resolve(pathElement, path, context);
}
return cast(path, context, clazz);
}
Notice it uses a FieldPath object to represent the path to the field in the document.
Now look at how the FieldPath represents the path:
https://github.com/elastic/elasticsearch/blob/9eff18374d68355f6acb58940a796268c9b6f2de/core/src/main/java/org/elasticsearch/ingest/IngestDocument.java#L688
this.pathElements = newPath.split("\\.");
This is splitting the path on any "." character, because that is the delimiter between path elements in field names.
The problem is that the source document has a field named "message.group1", so we need to be able to reference that. Just splitting the path on "." does not account for field names containing a "." in the name. We would need a syntax more like javascript for that, where we could use brackets and quotes to make the dot mean something different.
If the source documents were all transformed so that a "." in the field name would turn that field into an object before saving, then this path scheme would work. But with source documents having field names containing "." we can not reference them in certain contexts.
To solve my problem and reindex my index, I wrote a python script which pulled a batch of documents, transformed them and bulk inserted them in a new index. This is basically what the Elasticsearch reindex api does, but I did it in python instead.
More than two year later, I come across the same issue. You can manage to have your dotted-properties expanded to real nested objects with the the dot_expander processor.
Expands a field with dots into an object field. This processor allows fields with dots in the name to be accessible by other processors in the pipeline. Otherwise these fields can’t be accessed by any processor
Issue 37507 on Elasticsearch's Github pointed me in the right direction.

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception if inserting data to a field which does not fit the existing type. For example, if a field has been created as number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling then ignore_malformed setting, which means such fields are silently ignored for indexing purposes, but retained in the _source document - meaning that the invalid values cannot be searched or aggregated, but are still included in the returned document.
This is preferable behavior in our use case, but we would wish to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find document where this field does not exist, see this example
PUT foo
{
"mappings": {
"bar": {
"properties": {
"baz": {
"type": "integer",
"ignore_malformed": true
}
}
}
}
}
PUT foo/bar/1
{
"baz": "field"
}
GET foo/bar/_search
{
"query": {
"bool": {
"filter": {
"bool": {
"must_not": [
{
"exists": {
"field": "baz"
}
}
]
}
}
}
}
}
There is no dedicated mechanism though, so this search finds also documents where the field is not set intentionally
You cannot, when you search on elasticsearch, you don't search on document source but on the inverted index, which contains the analyzed data.
ignore_malformed flag is saying "always store document, analyze if possible".
You can try, create a mal-formed document, and use _termvectors API to see how the document is analyzed and stored in the inverted index, in a case of a string field, you can see an "Array" is stored as an empty string etc.. but the field will exists.
So forget the inverted index, let's use the source!
Scroll all your data until you find the anomaly, I use a small python script that search scroll, unserialize and I test field type for every documents (very long) but I can have a list of wrong document IDs.
Use a script query can be very long and crash your cluster, use with caution, maybe as a post_filter:
Here I want to retrieve the document where country_name is not a string:
{
"_source": false,
"timeout" : "30s",
"query" : {
"query_string" : {
"query" : "locale:de_ch"
}
},
"post_filter": {
"script": {
"script": "!(_source.country_name instanceof String)"
}
}
}
"_source:false" => I want only document ID
"timeout" => prevent crash
As you notice, this is a missing feature, I know logstash will tag
document that fail, so elasticsearch could implement the same thing.

Why are fields specified by type rather than index in Elasticsearch?

If multiple types in an Elasticsearch index have fields with the same name, those fields must have the same mapping that tries to create a "foobar" property as both string and long"...
For example if you try to PUT the following index mapping:
{
"mappings": {
"type_one": {
"properties": {
"foobar": {
"type": "string"
}
}
},
"type_two": {
"properties": {
"foobar": {
"type": "long"
}
}
}
}
}
...the following error will be returned
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "Failed to parse mapping [type_one]: mapper [foobar] cannot be changed from type [long] to [string]"
}
],
"type": "mapper_parsing_exception",
"reason": "Failed to parse mapping [type_one]: mapper [foobar] cannot be changed from type [long] to [string]",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "mapper [foobar] cannot be changed from type [long] to [string]"
}
},
"status": 400
}
The following is from the elasticsearch site:
Conflicts between fields in different types
Fields in the same index with the same name in two different types
must have the same mapping, as they are backed by the same field
internally. Trying to update a mapping parameter for a field which
exists in more than one type will throw an exception, unless you
specify the update_all_types parameter, in which case it will update
that parameter across all fields with the same name in the same index.
If fields with the same name must have the same mapping for all types in the index then why are field mappings specified per type? Why not specify the fields for the entire index, and then identify which fields are assigned to each type.
For example something like this:
{
"fields":{
"PropA":{
"type":"string"
},
"PropB":{
"type":"long"
},
"PropC":{
"type":"boolean"
}
},
"types":{
"foo":[
"PropA",
"PropB"
],
"foo":[
"PropA",
"PropC"
],
"foo":[
"PropA",
"PropC",
"PropC"
]
}
}
Wouldn't a mapping format like this be more succinct, and a better representation of what is actually allowed?
The reason I ask is because I'm working on creating an index template JSON file with about 80 different fields used across 15 types. Many of the fields are used across multiple if not all the types. So anytime I need to update a field I have to make sure I update it for every type where it's used.
Looks like I'm not the only one that found this confusing.
Remove support for types? #15613
Sounds like removing support for multiple types per index and specifying fields at the index level is on the roadmap for future versions.

Resources