Preserve wrong messages in Elasticsearch

I have a static mapping in Elasticsearch index. When a message doesn't match this mapping, it is discarded. Is there a way to route it to a default index for wrong messages?
To give you an example, I have some fields of integer type:
"status_code": {
"type": "integer"
},
When a message contains a number
"status_code": 123,
it's ok. But when it is
"status_code": "abc"
it fails.
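A minimal mapping that reproduces this might look like the sketch below (the index name logs is just a placeholder, and the 7.x-style mapping without types is an assumption):
PUT logs
{
  "mappings": {
    "properties": {
      "status_code": {
        "type": "integer"
      }
    }
  }
}
With such a mapping, indexing "status_code": "abc" is rejected with a mapper_parsing_exception and the whole document is lost unless it is handled on the client side.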

You can have ES do this triage pretty easily using ingest nodes/processors.
The main idea is to create an ingest pipeline with a convert processor for the status_code field, plus an on_failure handler that redirects the document to another index if the conversion fails, so you can process it later.
So create the failures ingest pipeline:
PUT _ingest/pipeline/failures
{
  "processors": [
    {
      "convert": {
        "field": "status_code",
        "type": "integer"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "_index",
        "value": "failed-{{ _index }}"
      }
    }
  ]
}
Then when you index a document, you simply specify the pipeline as a parameter. Indexing a document with a correct status code will succeed:
PUT test/doc/1?pipeline=failures
{
"status_code": 123
}
However, trying to index a document with a bad status code will actually also succeed, but your document will be indexed into the failed-test index rather than the test one:
PUT test/doc/2?pipeline=failures
{
"status_code": "abc"
}
After running these two commands, you'll see this:
GET failed-test/_search
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "failed-test",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "status_code" : "abc"
        }
      }
    ]
  }
}
To sum up, you didn't have to handle that exceptional case in your client code and could fully leverage ES ingest nodes to achieve the same task.
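One additional note: if your cluster is on 6.5 or later, you could also attach the pipeline to the index itself via the index.default_pipeline setting, so clients don't have to pass ?pipeline=failures on every request. A sketch, using the test index from above:
PUT test/_settings
{
  "index.default_pipeline": "failures"
}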

You can set the ignore_malformed mapping parameter to ignore just the field with the type mismatch instead of rejecting the whole document.
And you can try to combine it with multi-fields, which allow you to map the same value in different ways.
You will probably need something like this:
"status_code": {
"type": "integer",
"fields": {
"as_string": {
"type": "keyword"
}
}
}
This way you will have a field named status_code as an integer and the same value in a field named status_code.as_string as a keyword, but you should test to see if it really does what you want.
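If you do go the ignore_malformed route, a sketch of combining it with the multi-field could look like this (same field as above; with ignore_malformed the bad value is dropped for the status_code integer field itself, while the as_string keyword sub-field should still keep the raw value, but do verify that behaviour on your version):
"status_code": {
  "type": "integer",
  "ignore_malformed": true,
  "fields": {
    "as_string": {
      "type": "keyword"
    }
  }
}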

Use Strict mapping and you will be able to catch the exception raised by Elastic.
Below is the excerpt from Elastic docs:
By default, when a previously unseen field is found in a document, Elasticsearch will add the new field to the type mapping. This behaviour can be disabled, both at the document and at the object level, by setting the dynamic parameter to false (to ignore new fields) or to strict (to throw an exception if an unknown field is encountered).
As part of the exception handling, you can push the message to some other index where dynamic mapping is enabled.
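A sketch of what such a strict mapping could look like (index and field names are only illustrative):
PUT logs
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "status_code": {
        "type": "integer"
      }
    }
  }
}
With "dynamic": "strict", a document containing any unmapped field is rejected with a strict_dynamic_mapping_exception, which your client can catch and redirect to a more lenient index.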

Related

Why doesn't Elasticsearch auto mapping work when using numbers without quotes in JSON

When I try to POST data into the index for the first time
POST /my-test-index1/_doc/
{
  "foo": {
    "bar": [ "99.31.99.33", 862 ]
  }
}
I receive this error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "mapper [user.id] cannot be changed from type [long] to [text]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "mapper [user.id] cannot be changed from type [long] to [text]"
  },
  "status" : 400
}
But if I initially post the JSON with quoted numbers, it works, and subsequent numbers without quotes also work.
POST /my-test-index1/_doc/
{
  "foo": {
    "bar": [ "99.31.99.33", "862" ]
  }
}
{
  "_index" : "my-test-index1",
  "_type" : "_doc",
  "_id" : "OiyhdIABdQBSvDJuTJ4t",
  "_version" : 1,
  "result" : "created",
}
I know that the initial post creates the mapping, but my question is: why don't numbers without quotes in the JSON work, i.e. why isn't the right mapping created on the initial post?
In the first scenario, you are getting the exception because the array values are not of the same data type, and the index mapping is not created. You can check the official documentation:
In Elasticsearch, there is no dedicated array data type. Any field can
contain zero or more values by default, however, all values in the
array must be of the same data type.
In the second scenario, when you initially post the JSON with quoted numbers it works, because that creates the index and sets the text type for the bar field. When you then send another request with numbers without quotes, it also works, since all values of the array are treated as text and not as integers.
You can use the API below to check the index mapping:
GET my-test-index1/_mapping
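If you want unquoted numbers to be accepted from the very first document, one option (a sketch reusing the index and field names from the question) is to create the index with an explicit mapping before indexing anything, so the field type no longer depends on whatever the first document happens to contain:
PUT /my-test-index1
{
  "mappings": {
    "properties": {
      "foo": {
        "properties": {
          "bar": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
With bar mapped as keyword, Elasticsearch converts the unquoted 862 to the string "862" instead of trying to guess a numeric type for it.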

How to make Elastic Engine understand a field is not to be analyzed for an exact match?

The question is based on the previous post where the Exact Search did not work either based on Match or MatchPhrasePrefix.
Then I found a similar kind of post here where the search field is set to not_analyzed in the mapping definition (by @Russ Cam).
But I am using
package id="Elasticsearch.Net" version="7.6.0" targetFramework="net461"
package id="NEST" version="7.6.0" targetFramework="net461"
and that might be the reason the solution did not work.
Because if I pass "SOME", it matches both "SOME" and "SOME OTHER LOAN", which should not be the case (in my earlier post, for "product value").
How can I do the same using NEST 7.6.0?
Well, I'm not aware of how your current mapping looks, and I don't know NEST either, but I will explain how to make Elasticsearch understand that a field is not to be analyzed for an exact match, with an example using the Elasticsearch query DSL.
For an exact (case-sensitive) match, all you need to do is define the field type as keyword. For a field of type keyword, the data is indexed as it is, without applying any analyzer, and hence it is perfect for exact matching.
PUT test
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "keyword"
      }
    }
  }
}
Now let's index some documents:
POST test/_doc/1
{
"field1":"SOME"
}
POST test/_doc/2
{
"field1": "SOME OTHER LOAN"
}
For exact matching we can use a term query. Let's search for "SOME"; we should get only document 1.
GET test/_search
{
  "query": {
    "term": {
      "field1": "SOME"
    }
  }
}
The output we get:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "field1" : "SOME"
        }
      }
    ]
  }
}
So the crux is: make the field type keyword and use a term query.
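If you cannot re-map the field, note that Elasticsearch's default dynamic mapping for strings creates a keyword sub-field next to the analyzed text field, so an exact match can also be done with a term query on that sub-field (the index and field names below are hypothetical, assuming the default dynamic mapping was kept):
GET products/_search
{
  "query": {
    "term": {
      "product.keyword": "SOME"
    }
  }
}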

ES curl for email is not returning correct results despite knowing that it does exist

I did a query for the term "owner" and a document showed the email for an owner. I figured that to look at all Houses which have this email, I should query for the email instead of the owner.
When I do the following curl request, it doesn't return any actual cases.
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"match": {"email": {"query": "test.user@gmail.com"}}}}'
It does not return the correct information. I wanted to find an exact result, so I also tried a term query:
curl -X GET "localhost:9200/_search/?pretty" -H "Content-Type: application/json" -d'{"query": {"term": {"email": "test.user@gmail.com"}}}'
in an attempt to find an exact match. This also seems to return no document information. I am thinking that it might have something to do with the periods or maybe the @ symbol.
I have also tried match queries wrapping the email in escaped quotes and escaping the periods.
Is there something going on I am unaware of with special characters?
Elasticsearch is not schema-free; nowadays they call it "schema on write", and that's a very good name for the schema generation process. When Elasticsearch receives a new document with unknown fields, it makes an educated guess.
When you index the first document with the field "email", Elasticsearch will have a look at the value provided and create a mapping for this field.
The value "test.user@gmail.com" will then be mapped to the text mapping type.
Now, let's see how Elasticsearch will process a simple document with an email. Create a document:
POST /my_first_index/_doc
{"email": "nobody@example.com"}
Curious what the mapping looks like? Here you go:
GET /my_first_index/_mapping
It will be answered with:
{
  "my_first_index" : {
    "mappings" : {
      "properties" : {
        "email" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
You see, the "type" : "text" is indicating the mapping type "text" as assumed before. And there is also a subfield "keyword", automatically created by elastic for text type fields by default.
We have 2 options now, the easy one is to query the keyword subfield (please note the dot notation):
GET /my_first_index/_search
{"query": {"term": {"email.keyword": "nobody#example.com"}}}
Done!
The other option is to create a specific mapping for our index. In order to do so, we need a new, empty index where we define the mapping. We can do it in one shot:
PUT /my_second_index/
{
  "mappings" : {
    "properties" : {
      "email" : {
        "type" : "keyword",
        "ignore_above" : 256
      }
    }
  }
}
Now let us populate the index (here I'm putting two documents):
POST /my_second_index/_doc
{"email": "nobody@example.com"}
POST /my_second_index/_doc
{"email": "anybody@example.com"}
And now your unchanged query should work:
GET /my_second_index/_search
{"query": {"term": {"email": "anybody#example.com"}}}
Response:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_second_index",
        "_type" : "_doc",
        "_id" : "OTf3n28BpmGM8iQdGR4j",
        "_score" : 0.2876821,
        "_source" : {
          "email" : "anybody@example.com"
        }
      }
    ]
  }
}

Attempting to use Elasticsearch Bulk API when _id is equal to a specific field

I am attempting to bulk insert documents into an index. I need the _id to be equal to a specific field that I am inserting. I'm using ES v6.6.
POST productv9/_bulk
{ "index" : { "_index" : "productv9", "_id": "in_stock"}}
{ "description" : "test", "in_stock" : "2001"}
GET productv9/_search
{
  "query": {
    "match": {
      "_id": "2001"
    }
  }
}
When I run the bulk statement it runs without any error. However, when I run the search statement it gets no hits. I also have many more documents that I would like to insert in the same manner.
What I suggest is to create an ingest pipeline that sets the _id of your document based on the value of the in_stock field.
First create the pipeline:
PUT _ingest/pipeline/set_id
{
  "description" : "Sets the id of the document based on a field value",
  "processors" : [
    {
      "set" : {
        "field": "_id",
        "value": "{{in_stock}}"
      }
    }
  ]
}
Then you can reference the pipeline in your bulk call:
POST productv9/doc/_bulk?pipeline=set_id
{ "index" : {}}
{ "description" : "test", "in_stock" : "2001"}
By calling GET productv9/_doc/2001 you will get your document.
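Alternatively, since you are building the bulk payload yourself anyway, you can simply put the desired value into the _id of each action's metadata line and skip the pipeline entirely. A sketch with the document from the question (doc as the type, to match the ES 6.6 examples above):
POST productv9/doc/_bulk
{ "index" : { "_id": "2001" } }
{ "description" : "test", "in_stock" : "2001" }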

How to get documents size(in bytes) in Elasticsearch

I am new to Elasticsearch. I need to get the size of the documents in the query results.
Example:
this is a document. (19 bytes)
this is also a document. (24 bytes)
content: {"a": "this is a document", "b": "this is also a document"} (53 bytes)
When I query for the document in ES, I will get the above documents as the result. So the size of both documents is 32 bytes; I need that 32 bytes from Elasticsearch as a result.
Does your document only contain a single field? I'm not sure this is 100% of what you want, but generally you can calculate the length of fields and either store them with the document or calculate them at query time (but this is a slow operation and I would avoid it if possible).
So here's an example with a test document and the calculation for the field length:
PUT test/_doc/1
{
"content": "this is a document."
}
POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = ctx._source.content.length();
      } else {
        ctx._source.content_length = 0;
      }
    """
  }
}
GET test/_search
The query result is then:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "content" : "this is a document.",
          "content_length" : 19
        }
      }
    ]
  }
}
BTW there are 19 characters (including spaces and the dot) in that one. If you want to exclude those, you'll have to add some more logic to the script. I would also be careful with bytes, since UTF-8 might use more than one byte per character (as in höhe) and this script is really only counting characters.
Then you can easily use the length in queries and aggregations.
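For example, a range filter and an average aggregation on the new field might look like this (a sketch against the test index from above):
GET test/_search
{
  "query": {
    "range": {
      "content_length": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "avg_content_length": {
      "avg": {
        "field": "content_length"
      }
    }
  }
}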
If you want to calculate the size of all the subdocuments combined, use the following:
PUT test/_doc/2
{
  "content": {
    "a": "this is a document",
    "b": "this is also a document"
  }
}
POST test/_update_by_query
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "content_length"
          }
        }
      ]
    }
  },
  "script": {
    "source": """
      if (ctx._source.containsKey("content")) {
        ctx._source.content_length = 0;
        for (item in ctx._source.content.entrySet()) {
          ctx._source.content_length += item.getValue().length();
        }
      }
    """
  }
}
GET test/_search
Just note that content can either be of the type text or have a subdocument, but you can't mix that.
There's no way to get an Elasticsearch document's size via an API. The reason is that a document indexed into Elasticsearch takes up a different amount of space in the index depending on whether you store _all, which fields are indexed, the mapping type of those fields, doc_values, and more. Elasticsearch also uses deduplication and other methods of compaction, so the index size has no linear correlation with the original documents it contains.
One way to work around it is to calculate the document size in advance, before indexing it, and add it as another field in the doc, e.g. a doc_size field. You can then query this calculated field and run aggregations on it.
Note however that, as stated above, this does not represent the size of the index and might be completely wrong - for example, if all the docs contain a very long text field with the same value, then Elasticsearch would only store that long value once and reference it, so the index size would be much smaller.
Elasticsearch also has a _size field (provided by the mapper-size plugin), which can be enabled in the mappings.
Once enabled, it returns the size of the document's _source in bytes.
GET <index_name>/_doc/<doc_id>?stored_fields=_size
Elasticsearch official doc
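A sketch of enabling it (this assumes the mapper-size plugin is installed on every node; the index name is just a placeholder):
PUT my-index
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}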

Resources