Elasticsearch: Transpose and aggregate the data


I am using Elasticsearch 6.5. After fetching the required messages, I need to transpose and aggregate them. See the example below for details.
Messages retrieved (two messages shown as an example):
{
"_index": "index_name",
"_type": "data",
"_id": "data_id",
"_score": 5.0851293,
"_source": {
"header": {
"id": "System_20190729152502239_57246_16667",
"creationTimestamp": "2019-07-29T15:25:02.239Z"
},
"messageData": {
"messageHeader": {
"date": "2019-06-03",
"mId": "1000",
"mDescription": "TEST"
},
"messageBreakDown": [
{
"category": "New",
"subCategory": "Sub",
"messageDetails": [
{
"Amount": 5.30
}
]
}
]
}
}
},
{
"_index": "index_name",
"_type": "data",
"_id": "data_id",
"_score": 5.09512,
"_source": {
"header": {
"id": "System_20190729152502239_57246_16667",
"creationTimestamp": "2019-07-29T15:25:02.239Z"
},
"messageData": {
"messageHeader": {
"date": "2019-06-03",
"mId": "1000",
"mDescription": "TEST"
},
"messageBreakDown": [
{
"category": "Old",
"subCategory": "Sub",
"messageDetails": [
{
"Amount": 4.30
}
]
}
]
}
}
}
Now I am looking for a query to run against ES that transposes the data and groups it by category and subcategory.
Both messages have the same header.id (which is the main search criterion). Within this header.id, one message is for category New and the other for category Old (messageData.messageBreakDown is an array, and each element carries the category value).
So ideally, since both messages belong to the same mId, the output would contain both the New price and the Old price.
How can I aggregate to get the desired results?
The final output only needs the desired fields, e.g. date, mId, mDescription, New price and Old price, all in one output record.
UPDATE:
Below is the mapping:
{"index_name":{"mappings":{"data":{"properties":{"header":{"properties":{"id":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"creationTimestamp":{"type":"date"}}},"messageData":{"properties":{"messageBreakDown":{"properties":{"category":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"messageDetails":{"properties":{"Amount":{"type":"float"}}},"subCategory":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"messageHeader":{"properties":{"mDescription":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"mId":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"date":{"type":"date"}}}}}}}}}}
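No aggregation query is given in this thread. As one possible approach, the pivot can be done on the client after the search returns. The sketch below groups hits by header.id and turns each category into its own price column. The function name pivot_hits and the output keys (for example New_price) are made up for illustration, summing Amount per category is an assumption about what "aggregate" should mean here, and subCategory is ignored for brevity.

```python
from collections import defaultdict


def pivot_hits(hits):
    """Group search hits by header.id and turn each category into a column.

    `hits` is the list found under response["hits"]["hits"].
    Amounts are summed per category (an assumption, not from the question).
    """
    rows = {}
    for hit in hits:
        source = hit["_source"]
        header = source["messageData"]["messageHeader"]
        key = source["header"]["id"]
        row = rows.setdefault(key, {
            "date": header["date"],
            "mId": header["mId"],
            "mDescription": header["mDescription"],
            "prices": defaultdict(float),
        })
        for breakdown in source["messageData"]["messageBreakDown"]:
            for detail in breakdown["messageDetails"]:
                row["prices"][breakdown["category"]] += detail["Amount"]

    # Flatten the per-category sums into "<category>_price" columns.
    result = []
    for row in rows.values():
        prices = row.pop("prices")
        for category, amount in prices.items():
            row[category + "_price"] = amount
        result.append(row)
    return result
```

Called with the two example hits, this would yield a single record carrying date, mId, mDescription, New_price and Old_price.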

Related

Update large set of documents without knowing _id

I would like to update a large set of documents in Elasticsearch at once.
One document looks like this:
{
"_index": "vue_storefront_magento_1_1587559712",
"_type": "product",
"_id": "1123",
"_version": 56,
"_score": 7.7135754,
"_source": {
"sku": "381735",
"score": "1"
},
"fields": {
"updated_at": [
1589880769000
]
},
"highlight": {
"type_id": [
"#kibana-highlighted-field#configurable#/kibana-highlighted-field#"
],
"sku": [
"#kibana-highlighted-field#381735#/kibana-highlighted-field#"
]
}
}
I have a JSON file that contains the data I want to update. It has no _id field, only an SKU. I want to use this JSON to build the update request to Elasticsearch.
[
{ "sku": 381735, "score": 2 },
{ "sku": 381736, "score": 3 },
{ "sku": 381737, "score": 4 }
]
I would like to update all of the score fields based on the SKU field in the _source.
Is this possible? I already looked at the update by query API but can't figure it out :-/
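This question has no answer in the thread. One possible approach with the Update By Query API is to send one request per record, selecting documents with a term query on sku and setting score through a Painless script. The sketch below only builds the request bodies. The function name build_update_requests is hypothetical, and it assumes sku is indexed so that a term query matches it exactly and that both values are stored as strings, as in the sample document.

```python
import json


def build_update_requests(records):
    """Return one _update_by_query body per record, matching on sku."""
    bodies = []
    for record in records:
        bodies.append({
            # Assumption: sku is stored as a string and matches exactly.
            "query": {"term": {"sku": str(record["sku"])}},
            "script": {
                "lang": "painless",
                "source": "ctx._source.score = params.score",
                "params": {"score": str(record["score"])},
            },
        })
    return bodies


records = json.loads(
    '[{"sku": 381735, "score": 2}, {"sku": 381736, "score": 3}]'
)
for body in build_update_requests(records):
    # Each body would be sent as: POST <index>/_update_by_query
    print(json.dumps(body))
```

Passing the values through params rather than concatenating them into the script source keeps the script text identical across requests.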

How to turn an array of object to array of string while reindexing in elasticsearch?

Let's say the source index has a document like this:
{
"name":"John Doe",
"sport":[
{
"name":"surf",
"since":"2 years"
},
{
"name":"mountainbike",
"since":"4 years"
}
]
}
How can I discard the "since" information so that, once reindexed, the document contains only the sport names? Like this:
{
"name":"John Doe",
"sport":["surf","mountainbike"]
}
Note that it would be fine if the resulting field keeps the same name, but it's not mandatory.

I don't know which version of Elasticsearch you're using, but here is a solution based on ingest pipelines, introduced with ingest nodes in ES v5.0.
1) A script processor extracts the name from each sub-object and stores it in another field (here, sports).
2) The original sport field is then removed with a remove processor.
You can use the Simulate Pipeline API to test it:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "random description",
"processors": [
{
"script": {
"lang": "painless",
"source": "ctx.sports =[]; for (def item : ctx.sport) { ctx.sports.add(item.name) }"
}
},
{
"remove": {
"field": "sport"
}
}
]
},
"docs": [
{
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sport": [
{
"name": "surf",
"since": "2 years"
},
{
"name": "mountainbike",
"since": "4 years"
}
]
}
}
]
}
which outputs the following result:
{
"docs": [
{
"doc": {
"_index": "index",
"_type": "doc",
"_id": "id",
"_source": {
"name": "John Doe",
"sports": [
"surf",
"mountainbike"
]
},
"_ingest": {
"timestamp": "2018-07-12T14:07:25.495Z"
}
}
}
]
}
There may be a better solution, as I haven't used pipelines much. Alternatively, you could do this with Logstash filters before sending the documents to your Elasticsearch cluster.
For more information about the pipelines, take a look at the reference documentation of ingest nodes.
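To apply such a pipeline while reindexing, the Reindex API accepts a pipeline name in its dest section. The sketch below builds that request body and mirrors the effect of the two processors in plain Python, so the expected result can be checked locally. The function names, the index names, and the pipeline id are placeholders, and the pipeline is assumed to have been stored beforehand under that id.

```python
def build_reindex_body(source_index, dest_index, pipeline_id):
    """Body for POST _reindex that routes documents through a pipeline."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "pipeline": pipeline_id},
    }


def flatten_sports(doc):
    """Plain-Python equivalent of the script and remove processors."""
    result = dict(doc)
    sport = result.pop("sport")  # what the remove processor does
    result["sports"] = [item["name"] for item in sport]  # what the script does
    return result


doc = {
    "name": "John Doe",
    "sport": [
        {"name": "surf", "since": "2 years"},
        {"name": "mountainbike", "since": "4 years"},
    ],
}
print(flatten_sports(doc))
print(build_reindex_body("source_index", "dest_index", "sport_names"))
```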

Elasticsearch: Top k results per keyword

We have the following document in elasticsearch.
from elasticsearch_dsl import DocType, Keyword, Text

class Query(DocType):
    text = Text(analyzer='snowball', fields={'raw': Keyword()})
    src = Keyword()
Now we want the top k results for each src. How can we achieve this?
Example: let's assume we index the following:
# src: place_order
Query(text="I want to order food", src="place_order")
Query(text="Take my order", src="place_order")
...
# src: payment
Query(text="How to pay ?", src="payment")
Query(text="Do you accept credit card ?", src="payment")
...
Now if the user writes the query "take my order please along with the credit card details" and k=1, then we should return the following two results:
[{"text": "Take my order", "src": "place_order"},
{"text": "Do you accept credit card ?", "src": "payment"}
]
Here, since k=1, we return just one result for each src.

You may try the top hits aggregation, which returns the top N matching documents for each bucket of an aggregation.
For the example in your post, the query might look like this:
POST queries/query/_search
{
"query": {
"match": {
"text": "take my order please along with the credit card details"
}
},
"aggs": {
"src types": {
"terms": {
"field": "src"
},
"aggs": {
"best hit": {
"top_hits": {
"size": 1
}
}
}
}
}
}
The match query on text restricts the set of documents for the aggregation. The "src types" aggregation groups the matched documents by src value, and "best hit" selects the single most relevant document per bucket (the size parameter can be changed according to your needs).
The result of the query would look like the following:
{
"hits": {
"total": 3,
"max_score": 1.3862944,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "VD7QVmABl04oXt2HGbGB",
"_score": 1.3862944,
"_source": {
"text": "Do you accept credit card ?",
"src": "payment"
}
},
{
"_index": "queries",
"_type": "query",
"_id": "Uj7PVmABl04oXt2HlLFI",
"_score": 0.8630463,
"_source": {
"text": "Take my order",
"src": "place_order"
}
},
{
"_index": "queries",
"_type": "query",
"_id": "UT7PVmABl04oXt2HKLFy",
"_score": 0.6931472,
"_source": {
"text": "I want to order food",
"src": "place_order"
}
}
]
},
"aggregations": {
"src types": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "place_order",
"doc_count": 2,
"best hit": {
"hits": {
"total": 2,
"max_score": 0.8630463,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "Uj7PVmABl04oXt2HlLFI",
"_score": 0.8630463,
"_source": {
"text": "Take my order",
"src": "place_order"
}
}
]
}
}
},
{
"key": "payment",
"doc_count": 1,
"best hit": {
"hits": {
"total": 1,
"max_score": 1.3862944,
"hits": [
{
"_index": "queries",
"_type": "query",
"_id": "VD7QVmABl04oXt2HGbGB",
"_score": 1.3862944,
"_source": {
"text": "Do you accept credit card ?",
"src": "payment"
}
}
]
}
}
}
]
}
}
}
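On the client side, the per-bucket results can be pulled out of a response shaped like the one above. This is a minimal sketch, assuming the response has already been parsed into a Python dict. The function name top_hits_per_bucket is hypothetical, and the aggregation names default to the ones used in the query.

```python
def top_hits_per_bucket(response, terms_agg="src types", hits_agg="best hit"):
    """Map each bucket key to the _source of its top hits."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    return {
        bucket["key"]: [
            hit["_source"] for hit in bucket[hits_agg]["hits"]["hits"]
        ]
        for bucket in buckets
    }
```

If only the per-bucket hits are needed, adding "size": 0 to the request body omits the top-level hits from the response.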
Hope that helps!

ElasticSearch : How can I boost score depending on field value?

I am trying to get rid of sorting in Elasticsearch by boosting the _score based on field values. Here is my scenario:
I have a field in my document, applicationDate, which holds the time elapsed since the epoch. I want records with a greater applicationDate (most recent) to have a higher score.
If the scores of two documents are the same, I want to order them by another field of type String. Say "status" is another field that can have the values Available, In Progress, or Closed. Documents with the same applicationDate should then get a _score based on status.
Available should score highest, In Progress lower, and Closed lowest. That way I won't have to sort the documents after getting the results.
Please give me some pointers.

You should be able to achieve this using a Function Score query.
Depending on your requirements, it could be as simple as the following.
Example:
put test/test/1
{
"applicationDate" : "2015-12-02",
"status" : "available"
}
put test/test/2
{
"applicationDate" : "2015-12-02",
"status" : "progress"
}
put test/test/3
{
"applicationDate" : "2016-03-02",
"status" : "progress"
}
post test/_search
{
"query": {
"function_score": {
"functions": [
{
"field_value_factor" : {
"field" : "applicationDate",
"factor" : 0.001
}
},
{
"filter": {
"term": {
"status": "available"
}
},
"weight": 360
},
{
"filter": {
"term": {
"status": "progress"
}
},
"weight": 180
}
],
"boost_mode": "multiply",
"score_mode": "sum"
}
}
}
Results:
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 1456877060,
"_source": {
"applicationDate": "2016-03-02",
"status": "progress"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1449014780,
"_source": {
"applicationDate": "2015-12-02",
"status": "available"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 1449014660,
"_source": {
"applicationDate": "2015-12-02",
"status": "progress"
}
}
]
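The arithmetic behind these scores can be sketched in a few lines. With score_mode "sum", the field_value_factor result (the date's epoch milliseconds times 0.001) is added to the weight of whichever status filter matches; boost_mode "multiply" then multiplies by the query score, which is taken to be 1 here because no query is specified. The helper name approx_score is hypothetical. Since _score is a single-precision float, the values Elasticsearch reports are rounded and differ slightly from the exact sums computed below.

```python
from datetime import datetime, timezone

# Weights taken from the function_score query above.
WEIGHTS = {"available": 360, "progress": 180}


def approx_score(application_date, status, factor=0.001):
    """Exact (unrounded) sum of the function scores for one document."""
    day = datetime.strptime(application_date, "%Y-%m-%d")
    epoch_millis = day.replace(tzinfo=timezone.utc).timestamp() * 1000
    return epoch_millis * factor + WEIGHTS.get(status, 0)


docs = [
    ("2015-12-02", "available"),
    ("2015-12-02", "progress"),
    ("2016-03-02", "progress"),
]
# Sort by descending score: newest date first, then by status weight.
for date, status in sorted(docs, key=lambda d: approx_score(*d), reverse=True):
    print(date, status, approx_score(date, status))
```

Because the date term is around 1.4 billion while the weights are in the hundreds, the date dominates and the status only breaks ties between equal dates, which is the ordering the question asks for.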

Have you looked at function scores?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
Specifically, look at the decay functions in the above documentation.

There is a newer field type called rank_feature that can be useful for this use case:
https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-feature.html

Specifying total size of results to return for ElasticSearch query when using inner_hits

Elasticsearch allows inner_hits to specify 'from' and 'size' parameters, as does the outer request body of a search.
As an example, assume my index contains 25 books, each having fewer than 50 chapters. The snippet below would return all chapters across all books, because a 'size' of 100 books covers all 25 books and a 'size' of 50 chapters covers all of the "fewer than 50 chapters":
"index": 'books',
"type": 'book',
"body": {
"from" : 0, "size" : 100, // outer hits, or books
"query": {
"filtered": {
"filter": {
"nested": {
"inner_hits": {
"size": 50 // inner hits, or chapters
},
"path": "chapter",
"query": { "match_all": { } },
}
}
}
},
.
.
.
Now, I'd like to implement paging in a scenario like this. My question is, how?
In this case, do I have to fetch the above maximum of 100 * 50 = 5000 documents from the search query and implement paging at the application level by displaying only the slice I am interested in? Or is there a way to specify the total number of hits to return in the search query itself, independent of the inner/outer size?
I process the response as follows, and would like this data to be paginated:
response.hits.hits.forEach(function(book) {
  var chapters = book.inner_hits.chapters.hits.hits;
  chapters.forEach(function(chapter) {
    // ... this is one displayed result ...
  });
});

I don't think this is possible with Elasticsearch and nested fields. Your reading of the results is correct: ES paginates and returns books, and it doesn't look inside the nested inner_hits. That is simply not how it works. You need to handle the pagination manually in your code.
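Manual pagination over the nested results could look roughly like the sketch below: flatten every book's inner hits into one list of chapters, then slice out the requested page. The function name paginate_chapters is hypothetical, the response is assumed to be parsed into a dict, and the inner hits key defaults to "chapters" to match the loop in the question.

```python
def paginate_chapters(response, page, page_size, inner_name="chapters"):
    """Flatten book -> chapter inner hits and return one page of chapters.

    `page` is zero-based.
    """
    chapters = [
        chapter
        for book in response["hits"]["hits"]
        for chapter in book["inner_hits"][inner_name]["hits"]["hits"]
    ]
    start = page * page_size
    return chapters[start:start + page_size]
```

This still fetches every book and chapter on each request, so it only suits result sets as small as the one described.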
There is another option, but it requires a parent/child relationship instead of nested documents.
Then you can query the children (the chapters) directly and paginate those results. You can use inner_hits to return the parent (the book itself) with each chapter.
PUT /library
{
"mappings": {
"book": {
"properties": {
"name": {
"type": "string"
}
}
},
"chapter": {
"_parent": {
"type": "book"
},
"properties": {
"title": {
"type": "string"
}
}
}
}
}
The query:
GET /library/chapter/_search
{
"size": 5,
"query": {
"has_parent": {
"type": "book",
"query": {
"match_all": {}
},
"inner_hits" : {}
}
}
}
And a sample output (trimmed, complete example here):
"hits": [
{
"_index": "library",
"_type": "chapter",
"_id": "1",
"_score": 1,
"_source": {
"title": "chap1"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
},
{
"_index": "library",
"_type": "chapter",
"_id": "2",
"_score": 1,
"_source": {
"title": "chap2"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
}
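With chapters as the top-level hits, paging reduces to the usual from and size parameters on the search body. The sketch below builds the body for a given page. The function name chapter_page_body is hypothetical, and the query part simply mirrors the has_parent query shown above.

```python
def chapter_page_body(page, page_size):
    """Search body for GET /library/chapter/_search, one page of chapters.

    `page` is zero-based.
    """
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {
            "has_parent": {
                "type": "book",
                "query": {"match_all": {}},
                "inner_hits": {},
            }
        },
    }


# Third page, five chapters per page: skips the first ten chapters.
print(chapter_page_body(2, 5))
```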

The search API accepts certain standard parameters, listed in the docs at: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference-2-0.html#api-search-2-0
According to the docs:
size Number — Number of hits to return (default: 10)
This would make your request something like:
"size": 5000,
"index": 'books',
"type": 'book',
"body": {
