How to understand this description of 'collapse' in the Elasticsearch documentation?

ES version: 6.4.3
First, please imagine that I have an index like this:
#### 1. Create a new index "test_1"
DELETE test_1
PUT /test_1/
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
GET /test_1/_mapping
GET /test_1/_refresh
GET /test_1/_search
#### 2. Put some docs
POST _bulk
{ "index" : { "_index" : "test_1", "_id" : "100" } }
{ "title" : ["100","101"] }
{ "index" : { "_index" : "test_1", "_id" : "101" } }
{ "title" : "100" }
#### 3. Test agg
GET /test_1/_search
{
"size": 0,
"aggs": {
"title": {
"terms": {
"field": "title.keyword",
"size": 100
}
}
}
}
It works as expected, and the results are as follows:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"title": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "100",
"doc_count": 2
},
{
"key": "101",
"doc_count": 1
}
]
}
}
}
#### 4. Test collapse
GET /test_1/_search
{
"_source": false,
"from":0,
"size": 10,
"query": {
"match_all": {
}
},
"collapse": {
"field": "title.keyword",
"inner_hits": {
"name": "latest",
"size": 1
}
}
}
The result is an error:
{
"error": {
"root_cause": [
{
"type": "illegal_state_exception",
"reason": "failed to collapse 0, the collapse field must be single valued"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "test_1",
"node": "1TlabepgQSi-5WvjVm6MuQ",
"reason": {
"type": "illegal_state_exception",
"reason": "failed to collapse 0, the collapse field must be single valued"
}
}
],
"caused_by": {
"type": "illegal_state_exception",
"reason": "failed to collapse 0, the collapse field must be single valued",
"caused_by": {
"type": "illegal_state_exception",
"reason": "failed to collapse 0, the collapse field must be single valued"
}
}
},
"status": 500
}
So my question is: why is this error reported? Is it related to this description of collapse in the Elasticsearch documentation:
The field used for collapsing must be a single valued keyword or numeric field with doc_values activated.
If the two are related, why is the error reason "failed to collapse 0", and where does this 0 come from? I sincerely appreciate any answer.

First of all, thanks for providing a reproducible example, that helps a lot!!
Then, regarding collapse: indeed, it only works on single-valued fields. In your first document, title is an array and hence multi-valued, which is not OK for collapsing.
Simply put, the 0 you see in the error message is the internal document ID, i.e. an incremental number that each document gets when it is indexed. In your case, 0 stands for the first document that was indexed. If you swap the two documents in your bulk call, you'll see 1 instead.
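For illustration, here is a minimal sketch against the same test_1 index (mirroring the bulk format used above): if document 100 is reindexed with a single-valued title, the same collapse request succeeds.
POST _bulk
{ "index" : { "_index" : "test_1", "_id" : "100" } }
{ "title" : "101" }

GET /test_1/_search
{
  "_source": false,
  "query": { "match_all": {} },
  "collapse": {
    "field": "title.keyword",
    "inner_hits": { "name": "latest", "size": 1 }
  }
}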

Related

Change field type in index without reindex

First, I had this index template
GET localhost:9200/_index_template/document
And this is output
{
"index_templates": [
{
"name": "document",
"index_template": {
"index_patterns": [
"v*-documents-*"
],
"template": {
"settings": {
"index": {
"number_of_shards": "1"
}
},
"mappings": {
"properties": {
"firstOperationAtUtc": {
"format": "epoch_millis",
"ignore_malformed": true,
"type": "date"
},
"firstOperationAtUtcDate": {
"ignore_malformed": true,
"type": "date"
}
}
},
"aliases": {
"documents-": {}
}
},
"composed_of": [],
"priority": 501,
"version": 1
}
}
]
}
And my data is indexed, for example
GET localhost:9200/v2-documents-2021-11-20/_search
{
"query": {
"bool": {
"should": [
{
"exists": {
"field": "firstOperationAtUtc"
}
}
]
}
}
}
Output is
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "v2-documents-2021-11-20",
"_type": "_doc",
"_id": "9b46d6fe78735274342d1bc539b084510000000455",
"_score": 1.0,
"_source": {
"firstOperationAtUtc": 1556868952000,
"firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
}
}
]
}
}
Next, I need to update the mapping for the field firstOperationAtUtc and remove the epoch_millis format:
PUT localhost:9200/_template/document
{
"index_patterns": [
"v*-documents-*"
],
"template": {
"settings": {
"index": {
"number_of_shards": "1"
}
},
"mappings": {
"properties": {
"firstOperationAtUtc": {
"ignore_malformed": true,
"type": "date"
},
"firstOperationAtUtcDate": {
"ignore_malformed": true,
"type": "date"
}
}
},
"aliases": {
"documents-": {}
}
},
"version": 1
}
After that, if I run the previous request, I still get the indexed data.
But now I need to update the field firstOperationAtUtc and set its data from firstOperationAtUtcDate:
POST localhost:9200/v2-documents-2021-11-20/_update_by_query
{
"script": {
"source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtc = ctx._source.firstOperationAtUtcDate }",
"lang": "painless"
},
"query": {
"match": {
"_id": "9b46d6fe78735274342d1bc539b084510000000455"
}
}
}
After that, if I run the previous request again:
GET localhost:9200/v2-documents-2021-11-20/_search
{
"query": {
"bool": {
"should": [
{
"exists": {
"field": "firstOperationAtUtc"
}
}
]
}
}
}
I get no indexed data:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
But if I search by id, I get the document back with the modified data, but my field is ignored:
GET localhost:9200/v2-documents-2021-11-20/_search
{
"query": {
"terms": {
"_id": [ "9b46d6fe78735274342d1bc539b084510000000455" ]
}
}
}
Output is
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "v2-documents-2021-11-20",
"_type": "_doc",
"_id": "9b46d6fe78735274342d1bc539b084510000000455",
"_score": 1.0,
"_ignored": [
"firstOperationAtUtc"
],
"_source": {
"firstOperationAtUtc": "2019-05-03T13:35:52.000Z",
"firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
}
}
]
}
}
How can I index this data without reindexing? I have billions of documents in the index, and reindexing could produce huge downtime in prod.
What you changed is the index template, but not your index mapping. The index template is used only when a new index that matches the name pattern is created.
What you want to do is to modify the actual mapping of your index, like this:
PUT test/_mapping
{
"properties": {
"firstOperationAtUtc": {
"ignore_malformed": true,
"type": "date"
}
}
}
However, this won't be possible and you will get the following error, which makes sense as you cannot modify an existing field mapping.
Mapper for [firstOperationAtUtc] conflicts with existing mapper:
Cannot update parameter [format] from [epoch_millis] to [strict_date_optional_time||epoch_millis]
The only reason why your update by query seemed to work is because you have "ignore_malformed": true in your mapping. Because if you remove that parameter and try to run your update by query again, you'd see the following error:
"type" : "mapper_parsing_exception",
"reason" : "failed to parse field [firstOperationAtUtc] of type [date] in document with id '2'. Preview of field's value: '2019-05-03T13:35:52.000Z'",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "failed to parse date field [2019-05-03T13:35:52.000Z] with format [epoch_millis]",
"caused_by" : {
"type" : "date_time_parse_exception",
"reason" : "date_time_parse_exception: Failed to parse with all enclosed parsers"
}
}
So, to wrap it up, you have two options:
Create a new index with the right mapping and reindex your old index into it, but that doesn't seem like an option for you.
Create a new field in your existing index mapping (e.g. firstOperationAtUtcTime) and discard the use of firstOperationAtUtc
The steps would be:
Modify the index template to add the new field
Modify the actual index mapping to add the new field
Run your update by query by modifying the script to write your new field
In short:
# 1. Modify your index template
# 2. modify your actual index mapping
PUT v2-documents-2021-11-20/_mapping
{
"properties": {
"firstOperationAtUtcTime": {
"ignore_malformed": true,
"type": "date"
}
}
}
# 3. Run update by query again
POST v2-documents-2021-11-20/_update_by_query
{
"script": {
"source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtcTime = ctx._source.firstOperationAtUtcDate; ctx._source.remove('firstOperationAtUtc')}",
"lang": "painless"
},
"query": {
"match": {
"_id": "9b46d6fe78735274342d1bc539b084510000000455"
}
}
}
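To verify the result, a quick sketch (assuming the update by query above has completed): the exists query that previously returned no hits should now match the document on the new field.
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "exists": {
      "field": "firstOperationAtUtcTime"
    }
  }
}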

Elasticsearch array of objects nested range aggregation

I'm trying to run a range aggregation on the following data set:
{
"ProductType": 1,
"ProductDefinition": "fc588f8e-14f2-4871-891f-c73a4e3d17ca",
"ParentProduct": null,
"Sku": "074617",
"VariantSku": null,
"Name": "Paraboot Avoriaz/Jannu Marron Brut Marron Brown Hiking Boot Shoes",
"AllowOrdering": true,
"Rating": null,
"ThumbnailImageUrl": "/media/1106/074617.jpg",
"PrimaryImageUrl": "/media/1106/074617.jpg",
"Categories": [
"399d7b20-18cc-46c0-b63e-79eadb9390c7"
],
"RelatedProducts": [],
"Variants": [
"84a7ff9f-edf0-4aab-87f9-ba4efd44db74",
"e2eb2c50-6abc-4fbe-8fc8-89e6644b23ef",
"a7e16ccc-c14f-42f5-afb2-9b7d9aefbc5c"
],
"PriceGroups": [
"86182755-519f-4e05-96ef-5f93a59bbaec"
],
"DisplayName": "Paraboot Avoriaz/Jannu Marron Brut Marron Brown Hiking Boot Shoes",
"ShortDescription": "",
"LongDescription": "<ul><li>Paraboot Avoriaz Mountaineering Boots</li><li>Marron Brut Marron (Brown)</li><li>Full leather inners and uppers</li><li>Norwegien Welted Commando Sole</li><li>Hand made in France</li><li>Style number : 074617</li></ul><p>As featured on Pritchards.co.uk</p>",
"UnitPrices": {
"EUR 15 pct": 343.85
},
"Taxes": {
"EUR 15 pct": 51.5775
},
"PricesInclTax": {
"EUR 15 pct": 395.4275
},
"Slug": "paraboot-avoriazjannu-marron-brut-marron-brown-hiking-boot-shoes",
"VariantsProperties": [
{
"Key": "ShoeSize",
"Value": "8"
},
{
"Key": "ShoeSize",
"Value": "10"
},
{
"Key": "ShoeSize",
"Value": "6"
}
],
"Guid": "0d4f6899-c66a-4416-8f5d-26822c3b57ae",
"Id": 178,
"ShowOnHomepage": true
}
I'm aggregating on VariantsProperties which have the following mapping
"VariantsProperties": {
"type": "nested",
"properties": {
"Key": {
"type": "keyword"
},
"Value": {
"type": "keyword"
}
}
}
Terms aggregations are working fine with following code:
{
"aggs": {
"Nest": {
"nested": {
"path": "VariantsProperties"
},
"aggs": {
"fieldIds": {
"terms": {
"field": "VariantsProperties.Key"
},
"aggs": {
"values": {
"terms": {
"field": "VariantsProperties.Value"
}
}
}
}
}
}
}
}
However, when I try to run a range aggregation to get shoes with a size between 8 and 12, such as:
{
"aggs": {
"Nest": {
"nested": {
"path": "VariantsProperties"
},
"aggs": {
"fieldIds": {
"range": {
"field": "VariantsProperties.Value",
"ranges": [ { "from": 8, "to": 12 }]
}
}
}
}
}
}
I get the following error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Field [VariantsProperties.Value] of type [keyword] is not supported for aggregation [range]"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "product-avenueproductindexdefinition-24476f82-en-us",
"node": "ejgN4XecT1SUfgrhzP8uZg",
"reason": {
"type": "illegal_argument_exception",
"reason": "Field [VariantsProperties.Value] of type [keyword] is not supported for aggregation [range]"
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Field [VariantsProperties.Value] of type [keyword] is not supported for aggregation [range]",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Field [VariantsProperties.Value] of type [keyword] is not supported for aggregation [range]"
}
}
},
"status": 400
}
Is there a way to "transform" the terms aggregation into a range aggregation without the need to change the schema? I know I could build the ranges myself by extracting the data from the terms aggregation and building the ranges out of it; however, I would prefer a solution within Elasticsearch itself.
There are two ways to solve this:
Option A: Use a script instead of a field. This option will work without having to reindex your data, but depending on your volume of data, the performance might suffer.
POST test/_search
{
"aggs": {
"Nest": {
"nested": {
"path": "VariantsProperties"
},
"aggs": {
"fieldIds": {
"range": {
"script": "Integer.parseInt(doc['VariantsProperties.Value'].value)",
"ranges": [
{
"from": 8,
"to": 12
}
]
}
}
}
}
}
}
Option B: Add an integer sub-field in your mapping.
PUT my-index/_mapping
{
"properties": {
"VariantsProperties": {
"type": "nested",
"properties": {
"Key": {
"type": "keyword"
},
"Value": {
"type": "keyword",
"fields": {
"numeric": {
"type": "integer",
"ignore_malformed": true
}
}
}
}
}
}
}
Once your mapping is modified, you can run _update_by_query on your index in order to reindex the VariantsProperties.Value data
POST my-index/_update_by_query
Finally, when this last command is done, you can run the range aggregation on the VariantsProperties.Value.numeric field.
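For completeness, here is a sketch of what that final aggregation could look like (assuming the numeric sub-field added in option B; "size": 0 just suppresses the hits):
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "Nest": {
      "nested": {
        "path": "VariantsProperties"
      },
      "aggs": {
        "fieldIds": {
          "range": {
            "field": "VariantsProperties.Value.numeric",
            "ranges": [ { "from": 8, "to": 12 } ]
          }
        }
      }
    }
  }
}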
Also note that this second option will be more performant in the long term.

ELK query to return one record for each product with the max timestamp

On Kibana, I can view logs for various products (product.name) along with the timestamp and other information. Here is one of the logs:
{
"_index": "xxx-2017.08.30",
"_type": "logs",
"_id": "xxxx",
"_version": 1,
"_score": null,
"_source": {
"v": "1.0",
"level": "INFO",
"timestamp": "2017-01-30T18:31:50.761Z",
"product": {
"name": "zzz",
"version": "2.1.0-111"
},
"context": {
...
...
}
},
"fields": {
"timestamp": [
1504117910761
]
},
"sort": [
1504117910761
]
}
There are several other logs for same product and also several logs for different products.
However, I want to write a query that returns a single record per product.name (the one with the maximum timestamp value), and does the same for every other product. That is, one log returned per product, and for each product it should be the one with the maximum timestamp.
How do I achieve this?
I tried to follow the approach listed in:
How to get latest values for each group with an Elasticsearch query?
And created a query:
{
"aggs": {
"group": {
"terms": {
"field": "product.name"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}
But, I got an error that said:
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [product.name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
Do I absolutely need to set fielddata=true for this field in this case? If no, what should I do? If yes, I am not sure how to set it. I tried doing it this way:
curl -XGET 'localhost:9200/xxx*/_search?pretty' -H 'Content-Type: application/json' -d'
{
"properties": {
"product.name": {
"type": "text",
"fielddata": true
}
},
"aggs": {
"group": {
"terms": {
"field": "product.name"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}'
But I think there is something wrong with it (syntactically?) and I get this error:
{
"error" : {
"root_cause" : [
{
"type" : "parsing_exception",
"reason" : "Unknown key for a START_OBJECT in [properties].",
"line" : 3,
"col" : 19
}
],
The reason you got the error is that you tried to aggregate on a text field (product.name); you can't do that in Elasticsearch 5.
You don't need to set fielddata to true. What you need to do is define the product.name field in the mapping as two fields: one product.name (text) and a second product.name.keyword (keyword).
Like this:
{
"product.name":
{
"type" "text",
"fields":
{
"keyword":
{
"type": "keyword",
"ignore_above": 256
}
}
}
}
Then you need to run the aggregation on product.name.keyword.
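For example, once the keyword sub-field exists (and the data has been reindexed so that it is populated), your original aggregation becomes (a sketch reusing your index pattern; "size": 0 just suppresses the hits):
curl -XGET 'localhost:9200/xxx*/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "product.name.keyword"
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "timestamp": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}'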

Elasticsearch Query - Return all documents that do not have a corresponding document

I have an index that contains documents that have a status. These are initially imported with a job and their status is set to 0.
For simplicity:
{
"_uid" : 1234
"id" : 1
"name" : "someName",
"status" : 0
}
Then another import job runs and extends these objects by iterating over each object with status=0. Each object that is extended gets the status 1.
{
"_uid" : 1234
"id" : 1
"name" : "someName",
"newProperty" : "someValue",
"status" : 1
}
(Note the unchanged _uid. It's the same object)
Now I have a third import job that takes all objects with status 1, takes their ID (the ID, not their _uid!) and creates a new object with the same ID, but a different UID:
{
"_uid" : 5678
"id" : 1
"completelyDifferentProperty" : "someValue"
"status" : 2
}
So now, for each ID, I have two objects: One with status = 1, One with status = 2.
For the last job I need to make sure that it only picks objects with status =1 that DO NOT YET have a corresponding status=2 object.
So I need a query to the effect of
"Get all objects where status == 1 for which no status == 2 object with the same ID exists".
I have a feeling aggregations might help me but I haven't gotten it figured out yet.
You can do it fairly easily with a parent/child relationship. This is sort of a special-case use of the capability, but I think it could be used to solve your problem.
To test it out, I set up an index like this, with a parent_doc type and a child_doc type (I only included the properties necessary to set up the capability; it doesn't hurt to add more to your documents):
PUT /test_index
{
"mappings": {
"parent_doc": {
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long"
},
"_uid": {
"type": "long"
},
"status": {
"type": "integer"
}
}
},
"child_doc": {
"_parent": {
"type": "parent_doc"
},
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long"
},
"_uid": {
"type": "long"
},
"status": {
"type": "long"
}
}
}
}
}
Then I added four docs: three parents, one child. There is one document with "status": 1 that doesn't have a corresponding child document.
POST /test_index/_bulk
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":1,"name":"someName","newProperty":"someValue","status":0}
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":2,"name":"someName","newProperty":"someValue","status":1}
{"index":{"_type":"child_doc","_parent":2}}
{"_uid":5678,"id":2,"completelyDifferentProperty":"someValue","status":2}
{"index":{"_type":"parent_doc"}}
{"_uid":4321,"id":3,"name":"anotherName","newProperty":"anotherValue","status":1}
We can find the document we want like this; notice we are querying only the parent_doc type, and that our conditions are that status is 1 and no child (at all) exists:
POST /test_index/parent_doc/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"term": {
"status": 1
}
},
{
"not": {
"filter": {
"has_child": {
"type": "child_doc",
"query": {
"match_all": {}
}
}
}
}
}
]
}
}
}
}
}
This returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "3",
"_score": 1,
"_source": {
"_uid": 4321,
"id": 3,
"name": "anotherName",
"newProperty": "anotherValue",
"status": 1
}
}
]
}
}
Here's all the code I used to test it:
http://sense.qbox.io/gist/d1a0267087d6e744b991de5cdec1c31d947ebc13
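Note that this example uses the old filtered query syntax. On more recent Elasticsearch versions, where the filtered query and the not filter were removed, an equivalent query would be a bool query with a must_not clause (a sketch, assuming the same parent/child setup as above):
POST /test_index/parent_doc/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": 1 } }
      ],
      "must_not": [
        {
          "has_child": {
            "type": "child_doc",
            "query": { "match_all": {} }
          }
        }
      ]
    }
  }
}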

Elasticsearch suggester filter

I am implementing a suggester filter for a search operation using the Elasticsearch API.
The problem I have encountered is that search works by prefix only; I can't search by a word in the middle of the text.
I tried the example below:
PUT /bls
{
"mappings": {
"bl": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"name_suggest": {
"type": "completion",
"context": {
"store": {
"type": "category"
},
"status": {
"type": "category"
}
}
}
}
}
}
}
and
POST /bls/bl/1
{
"name": "LG 32LN5110 32 inches LED TV",
"name_suggest": {
"input": ["sony 32LN5110 32 inches LED TV"],
"context": {
"store": [
44,
45
],
"status": "Active"
}
}
}
POST /bls/_suggest?pretty
{
"name_suggest": {
"text": "sony",
"completion": {
"field": "name_suggest",
"context": {
"store": "44",
"status": "Active"
}
}
}
}
I got results with the above query, but I can't get results with the query below:
POST /bls/_suggest?pretty
{
"name_suggest": {
"text": "LED",
"completion": {
"field": "name_suggest",
"context": {
"store": "44",
"status": "Active"
}
}
}
}
and the above query returns the following result:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"name_suggest": [{
"text": "LED",
"offset": 0,
"length": 3,
"options": []
}]
}
String fields are indexed by default, so even without specifying an analyzer, they are indexed with the default analyzer.
For your case, you must specify index: analyzed for the name_suggest property, so that an analyzer containing a whitespace tokenizer is used. That will tokenize all the words in your input field, and hence you can search anywhere across the text.
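Also note that the completion suggester itself only matches from the start of one of the stored inputs. A common workaround, shown here as a sketch (this is an alternative to the analyzer change above; the document id and input list are hypothetical), is to provide each significant token of the name as a separate input so that any word can act as a prefix:
POST /bls/bl/2
{
  "name": "LG 32LN5110 32 inches LED TV",
  "name_suggest": {
    "input": ["LG 32LN5110 32 inches LED TV", "32LN5110", "LED", "TV"],
    "context": {
      "store": [44, 45],
      "status": "Active"
    }
  }
}
With a document indexed like this, the earlier suggest request with "text": "LED" would return an option.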
