How to write this Elasticsearch query? - elasticsearch

I have a review index like this:
{
"_index": "zody_review",
"_type": "review",
"_id": "5b3c6c9e68cf860e1af5f7fd",
"_score": null,
"_source": {
"user": "571899623dc63af34d67a662",
"merchant": "56f8f80119a4c1ae791cf7bf",
"point": 3,
"score": 2,
"createdAt": "2018-07-04T13:43:42.331+07:00",
"location": {
"lat": 16.07054040054832,
"lon": 108.22062939405441
},
"feedback": "Phuc vu khong tot lam "
}
},
How can I query for a list of nearby reviews, but limit the result to 5 reviews per merchant (grouping by the merchant field)?
I've been stuck here too long!
Thank you!

You need to return only the reviews that are near (say, within 100m of) a given location, then aggregate them by merchant with a terms aggregation and add a top_hits sub-aggregation. It goes like this:
{
"size": 0,
"query": {
"geo_distance": {
"distance": "100m",
"location": {
"lat": 16.07055,
"lon": 108.2207
}
}
},
"aggs": {
"by_merchant": {
"terms": {
"field": "merchant"
},
"aggs": {
"top_5": {
"top_hits": {
"_source": [
"feedback"
],
"size": 5
}
}
}
}
}
}
Simply replace the location with the one you want to search around, and adjust the distance if you need a larger or smaller radius.
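As a rough sketch of how the query above could be parameterized, the helper below builds the same search body for any center point. The function name and its options are hypothetical and not part of any client library; the field names come from the question's document. It also sets size on the terms aggregation, since terms returns only the 10 largest buckets by default.

```javascript
// Hypothetical helper: builds the search body shown above for any center point.
// Field names ("location", "merchant", "feedback") come from the question's
// document; the function name and its options are illustrative assumptions.
function buildNearbyReviewsQuery({ lat, lon, distance = "100m", perMerchant = 5, maxMerchants = 10 }) {
  return {
    size: 0, // no top-level hits; reviews come back inside the aggregation
    query: {
      geo_distance: {
        distance,
        location: { lat, lon }
      }
    },
    aggs: {
      by_merchant: {
        // terms returns only the 10 largest buckets unless "size" is set
        terms: { field: "merchant", size: maxMerchants },
        aggs: {
          top_5: {
            top_hits: { _source: ["feedback"], size: perMerchant }
          }
        }
      }
    }
  };
}

// Same center point and radius as the query above.
const body = buildNearbyReviewsQuery({ lat: 16.07055, lon: 108.2207 });
console.log(JSON.stringify(body, null, 2));
```

The resulting object can be sent as the request body of a search against the zody_review index with whichever client you use.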

Related

Elasticsearch Aggregation + Sorting on a Non-Numeric Field 5.3

I want to aggregate the data on one field and also get the aggregated data sorted by name.
My data is :
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp001_local000000000000001",
"_score": 10.0,
"_source": {
"name": [
"Person 01"
],
"groupbyid": [
"group0001"
],
"ranking": [
"2.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp002_local000000000000001",
"_score": 85146.375,
"_source": {
"name": [
"Person 02"
],
"groupbyid": [
"group0001"
],
"ranking": [
"10.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp003_local000000000000001",
"_score": 20.0,
"_source": {
"name": [
"Person 03"
],
"groupbyid": [
"group0002"
],
"ranking": [
"-1.0"
]
}
},
{
"_index": "testing-aggregation",
"_type": "employee",
"_id": "emp004_local000000000000001",
"_score": 5.0,
"_source": {
"name": [
"Person 04"
],
"groupbyid": [
"group0002"
],
"ranking": [
"2.0"
]
}
}
My query :
{
"size": 0,
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "name:emp*^1000.0"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"order": {
"top_hit_agg": "desc"
},
"size": 10
},
"aggs": {
"top_hit_agg": {
"terms": {
"field": "name"
}
}
}
}
}
}
My mapping is :
{
"name": {
"type": "text",
"fielddata": true,
"fields": {
"lower_case_sort": {
"type": "text",
"fielddata": true,
"analyzer": "case_insensitive_sort"
}
}
},
"groupbyid": {
"type": "text",
"fielddata": true,
"index": "analyzed",
"fields": {
"raw": {
"type": "keyword",
"index": "not_analyzed"
}
}
}
}
I am getting data based on the average of the relevance of grouped records. Now, what I wanted is the first club the records based on the groupid and then in each bucket sort the data based on the name field.
I want to group on one field and then, within each group's bucket, sort on another field. This is sample data.
There are other fields like created_on and updated_on. I also want to be able to sort on those fields, and to get the grouped data in alphabetical order.
I want to sort on a non-numeric (string) data type. I can already do it for numeric data types.
I can do it for the ranking field but not for the name field, which gives the error below.
Expected numeric type on field [name], but got [text];
You're asking for a few things, so I'll try to answer them in turn.
Step 1: Sorting buckets by relevance
I am getting data based on the average of the relevance of grouped records.
If this is what you're attempting to do, it's not what the aggregation you wrote is doing. Terms aggregations default to sorting the buckets by the number of documents in each bucket, descending. To sort the groups by "average relevance" (which I'll interpret as "average _score of documents in the group"), you'd need to add a sub-aggregation on the score and sort the terms aggregation by that:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
}
}
}
}
Step 2: Sorting employees by name
Now, what I wanted is the first club the records based on the groupid and then in each bucket sort the data based on the name field.
To sort the documents within each bucket, you can use a top_hits aggregation:
"aggregations": {
"most_relevant_groups": {
"terms": {
"field": "groupbyid.raw",
"order": {
"average_score": "desc"
}
},
"aggs": {
"employees": {
"top_hits": {
"size": 10, // top_hits returns 3 hits per bucket by default - change to whatever
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
Step 3: Putting it all together
Putting both of the above together, the following aggregation should suit your needs (note that I used a function_score query to simulate "relevance" based on ranking; your query can be anything, as long as it produces the relevance you need):
POST /testing-aggregation/employee/_search
{
"size": 0,
"query": {
"function_score": {
"functions": [
{
"field_value_factor": {
"field": "ranking"
}
}
]
}
},
"aggs": {
"groupbyid": {
"terms": {
"field": "groupbyid.raw",
"size": 10,
"order": {
"average_score": "desc"
}
},
"aggs": {
"average_score": {
"avg": {
"script": {
"inline": "_score",
"lang": "painless"
}
}
},
"employees": {
"top_hits": {
"size": 10,
"sort": [
{
"name.lower_case_sort": {
"order": "asc"
}
}
]
}
}
}
}
}
}

Elastic scripted aggregation of nested key/value pairs

I want to have a scripted aggregation of key/value pairs in a nested array in elastic. An example of the documents returned is as follows:
"hits": [
{
"_index": "testdan",
"_type": "year",
"_id": "AVtXirjYuoFS95t7pfkg",
"_score": 1,
"_source": {
"m_iYear": 2006,
"m_iTopicID": 11,
"m_People": [
{
"name": "Petrovic, Rade",
"value": 3.70370364
},
{
"name": "D. Kirovski",
"value": 3.70370364
}
]
}},
{
"_index": "testdan",
"_type": "year",
"_id": "AVtXirjYuoFS95t7pfkg",
"_score": 1,
"_source": {
"m_iYear": 2007,
"m_iTopicID": 11,
"m_People": [
{
"name": "Petrovic, Rade",
"value": 6.70370364
},
{
"name": "D. Kirovski",
"value": 2.70370364
}
]
}}
]
I would like to compute an average value for each person in m_People across all documents, as follows:
Petrovic, Rade = (3.70370364 + 6.70370364) / 2 = 5.20370364
D. Kirovski = (3.70370364 + 2.70370364) / 2 = 3.20370364
The divisor for the average should be the number of years in which that name appears. For instance, a given year may show only one of the names.
If this is more difficult due to not having unique IDs for people, I plan to add an ID for each person, but how would you go about scripting this so instead of returning all people, and needing to loop through at front-end, I can just have an array of people and their averages?
You may be able to achieve this sort of aggregation by utilizing Kibana Scripted Fields. See the examples section. This assumes you are using Elasticsearch 5.0 as the scripting language is Painless.
You can achieve this with a nested aggregation pretty easily. For each year, we're aggregating on the people's names and then computing the average value for each of them.
{
"size": 0,
"aggs": {
"years": {
"terms": {
"field": "m_iYear"
},
"aggs": {
"names": {
"nested": {
"path": "m_People"
},
"aggs": {
"names": {
"terms": {
"field": "m_People.name"
},
"aggs": {
"average": {
"avg": {
"field": "m_People.value"
}
}
}
}
}
}
}
}
}
}
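Note that the query above buckets by m_iYear first, so its averages are per person per year. As a way to check what the per-person figures across all years should be, here is a small client-side sketch that averages over the returned hits. The function name is hypothetical; the field names are taken from the sample documents in the question.

```javascript
// Client-side check of the per-person average, independent of any
// Elasticsearch aggregation. The function name is hypothetical; the field
// names (m_People, name, value) are taken from the sample documents above.
function averageByPerson(hits) {
  const totals = new Map(); // name -> { sum, count }
  for (const hit of hits) {
    for (const person of hit._source.m_People || []) {
      const entry = totals.get(person.name) || { sum: 0, count: 0 };
      entry.sum += person.value;
      entry.count += 1; // counts only the documents in which the name appears
      totals.set(person.name, entry);
    }
  }
  const averages = {};
  for (const [name, { sum, count }] of totals) {
    averages[name] = sum / count;
  }
  return averages;
}

// The two sample documents from the question, trimmed to the relevant fields.
const sampleHits = [
  { _source: { m_iYear: 2006, m_People: [
    { name: "Petrovic, Rade", value: 3.70370364 },
    { name: "D. Kirovski", value: 3.70370364 }
  ] } },
  { _source: { m_iYear: 2007, m_People: [
    { name: "Petrovic, Rade", value: 6.70370364 },
    { name: "D. Kirovski", value: 2.70370364 }
  ] } }
];
console.log(averageByPerson(sampleHits));
```

For the sample data this gives roughly 5.2037 for "Petrovic, Rade" and 3.2037 for "D. Kirovski". A person who is missing from one year is divided only by the number of documents in which they appear.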

Aggregate script ONLY on results of sorted query with filter, not full dataset

FYI - elasticsearch #v1.5; npm elasticsearch #4.0.2
For my specific use case, I need to find the five nearest points, around some other point, and calculate the max dist of those five results. For some reason, my query below is returning the max dist of all the filtered data, not the five nearest.
Here's my query thus far:
elasticsearchAPI = Meteor.npmRequire('elasticsearch');
esClient = new elasticsearchAPI.Client({
host: 'myHost'
});
var esQueryObject = {
"index": "ma_homes",
"size": 5,
"body": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"geo_distance": {
"LOCATION": {
"lat": 42.5125339,
"lon": -71.06748
},
"distance": "3mi",
"optimize_bbox": "memory"
}
}
}
},
"size": 5,
"sort": [{
"_geo_distance": {
"LOCATION": {
"lat": 42.5125339,
"lon": -71.06748
},
"order": "asc",
"unit": "mi",
"distance_type": "sloppy_arc"
}
}],
"fields": ["F1_V7_2_F1TOWN"],
"aggs": {
"max_dist": {
"max": {
"script": "doc['LOCATION'].arcDistanceInMiles(lat,lon)",
"params" : {
"lat" : 42.5125339,
"lon" : -71.06748
}
}
}
}
}
}
try {
esClient.search(esQueryObject, function(err, res) {
if ( err ) console.log("err: ", err);
if ( res ) {
console.log("res: ", JSON.stringify(res, null, "\t"));
};
});
}
catch(error) {
console.log("search err: ", error);
};
My problem is this returns a max_dist of 2.99, but I can clearly see from the hits that it should only be 0.02268!
Lastly, is there a better way of calculating the max distance? I don't like having to use a script.
See the results, below:
I20160729-14:46:08.447(-7)? {
I20160729-14:46:08.447(-7)? "took": 119,
I20160729-14:46:08.447(-7)? "timed_out": false,
I20160729-14:46:08.447(-7)? "_shards": {
I20160729-14:46:08.447(-7)? "total": 5,
I20160729-14:46:08.448(-7)? "successful": 5,
I20160729-14:46:08.448(-7)? "failed": 0
I20160729-14:46:08.448(-7)? },
I20160729-14:46:08.448(-7)? "hits": {
I20160729-14:46:08.448(-7)? "total": 19428,
I20160729-14:46:08.448(-7)? "max_score": null,
I20160729-14:46:08.452(-7)? "hits": [
I20160729-14:46:08.452(-7)? {
I20160729-14:46:08.452(-7)? "_index": "ma_homes",
I20160729-14:46:08.452(-7)? "_type": "home",
I20160729-14:46:08.453(-7)? "_id": "AVY1KqHN5rKRAKXZHxQf",
I20160729-14:46:08.453(-7)? "_score": null,
I20160729-14:46:08.453(-7)? "fields": {
I20160729-14:46:08.453(-7)? "F1_V7_2_F1TOWN": [
I20160729-14:46:08.453(-7)? "7WHITECIRWAKEFIELDMA"
I20160729-14:46:08.454(-7)? ]
I20160729-14:46:08.454(-7)? },
I20160729-14:46:08.454(-7)? "sort": [
I20160729-14:46:08.454(-7)? 0.013847018573431258
I20160729-14:46:08.454(-7)? ]
I20160729-14:46:08.455(-7)? },
I20160729-14:46:08.455(-7)? {
I20160729-14:46:08.455(-7)? "_index": "ma_homes",
I20160729-14:46:08.455(-7)? "_type": "home",
I20160729-14:46:08.456(-7)? "_id": "AVY1Ewoc5rKRAKXZGhMp",
I20160729-14:46:08.456(-7)? "_score": null,
I20160729-14:46:08.456(-7)? "fields": {
I20160729-14:46:08.456(-7)? "F1_V7_2_F1TOWN": [
I20160729-14:46:08.456(-7)? "8WHITECIRWAKEFIELDMA"
I20160729-14:46:08.457(-7)? ]
I20160729-14:46:08.457(-7)? },
I20160729-14:46:08.457(-7)? "sort": [
I20160729-14:46:08.458(-7)? 0.01675513175670524
I20160729-14:46:08.458(-7)? ]
I20160729-14:46:08.458(-7)? },
I20160729-14:46:08.458(-7)? {
I20160729-14:46:08.458(-7)? "_index": "ma_homes",
I20160729-14:46:08.459(-7)? "_type": "home",
I20160729-14:46:08.459(-7)? "_id": "AVY1T0cn5rKRAKXZJwC8",
I20160729-14:46:08.459(-7)? "_score": null,
I20160729-14:46:08.459(-7)? "fields": {
I20160729-14:46:08.459(-7)? "F1_V7_2_F1TOWN": [
I20160729-14:46:08.460(-7)? "10WHITECIRWAKEFIELDMA"
I20160729-14:46:08.460(-7)? ]
I20160729-14:46:08.460(-7)? },
I20160729-14:46:08.460(-7)? "sort": [
I20160729-14:46:08.461(-7)? 0.018417500448048605
I20160729-14:46:08.461(-7)? ]
I20160729-14:46:08.463(-7)? },
I20160729-14:46:08.464(-7)? {
I20160729-14:46:08.464(-7)? "_index": "ma_homes",
I20160729-14:46:08.464(-7)? "_type": "home",
I20160729-14:46:08.464(-7)? "_id": "AVY1Xb2P5rKRAKXZKhUh",
I20160729-14:46:08.464(-7)? "_score": null,
I20160729-14:46:08.465(-7)? "fields": {
I20160729-14:46:08.465(-7)? "F1_V7_2_F1TOWN": [
I20160729-14:46:08.465(-7)? "11WHITECIRWAKEFIELDMA"
I20160729-14:46:08.465(-7)? ]
I20160729-14:46:08.466(-7)? },
I20160729-14:46:08.466(-7)? "sort": [
I20160729-14:46:08.466(-7)? 0.018816876925529115
I20160729-14:46:08.467(-7)? ]
I20160729-14:46:08.467(-7)? },
I20160729-14:46:08.467(-7)? {
I20160729-14:46:08.468(-7)? "_index": "ma_homes",
I20160729-14:46:08.468(-7)? "_type": "home",
I20160729-14:46:08.468(-7)? "_id": "AVY1TNJh5rKRAKXZJnx0",
I20160729-14:46:08.468(-7)? "_score": null,
I20160729-14:46:08.469(-7)? "fields": {
I20160729-14:46:08.469(-7)? "F1_V7_2_F1TOWN": [
I20160729-14:46:08.470(-7)? "6WHITECIRWAKEFIELDMA"
I20160729-14:46:08.470(-7)? ]
I20160729-14:46:08.470(-7)? },
I20160729-14:46:08.471(-7)? "sort": [
I20160729-14:46:08.471(-7)? 0.022680252269458714
I20160729-14:46:08.471(-7)? ]
I20160729-14:46:08.471(-7)? }
I20160729-14:46:08.471(-7)? ]
I20160729-14:46:08.472(-7)? },
I20160729-14:46:08.472(-7)? "aggregations": {
I20160729-14:46:08.472(-7)? "max_dist": {
I20160729-14:46:08.472(-7)? "value": 2.999906924854209,
I20160729-14:46:08.473(-7)? "value_as_string": "2.999906924854209"
I20160729-14:46:08.473(-7)? }
I20160729-14:46:08.473(-7)? }
I20160729-14:46:08.474(-7)? }
There are two things wrong here, with the second strongly related to the first:
You're assuming that the sort order has an impact on the aggregation. It doesn't. You may want to have a look at Elasticsearch: The Definitive Guide on Scoping Aggregations.
The gist is that the full result of the query, including hits that are not returned, is part of the aggregation's scope. In your exact case, the response noted "total": 19428 documents that matched your search. You just got back the closest 5.
You're sorting in ascending order, which means from least to greatest. This means you're getting only the 5 closest distances, which is what you want, but the aggregation still computed its max over every matching document, not just those 5.
To address those points, you need to figure out how to limit the aggregation to the top 5, or just not aggregate at all, which I would suggest is the easiest thing to do here. Simply get the top 5, then grab the last hit's sort value, and you have both answers that you want.
Sorting is constrained to what's within 3 miles because of the geo_distance filter, which is good, but perhaps you can do something better depending on your needs by using a faster distance_type for the filter:
{
"size": 5,
"_source": "F1_V7_2_F1TOWN",
"query": {
"filtered": {
"filter": [
{
"geo_distance": {
"LOCATION": {
"lat": 42.5125339,
"lon": -71.06748
},
"distance": "3mi",
"distance_type": "plane"
}
}
]
}
},
"sort": [
{
"_geo_distance": {
"LOCATION": {
"lat": 42.5125339,
"lon": -71.06748
},
"order": "asc",
"unit": "mi",
"distance_type": "sloppy_arc"
}
}
]
}
Notice I don't aggregate, I use _source instead of fields (fields is meant for stored fields, not for limiting the source document output), and I switched to using plane for the filter's distance_type because it's faster for short distances away from the poles; I doubt too many homes are going to be near the poles. For sorting, I left it as sloppy_arc because it can use a slightly more refined equation after the documents have been filtered.
I only get 5 documents back, and of those 5, the last one will be the furthest away, with its distance as its sort value.
As a big side note, ES 2.2+ increased geo performance significantly.
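The "get the top 5, then grab the last value" approach can be sketched on the client side as follows. The helper name is hypothetical; it assumes the hits were sorted by _geo_distance in ascending order, so that each hit's first sort value is its distance in the requested unit.

```javascript
// Hypothetical helper: with hits sorted by _geo_distance ascending, each hit's
// first sort value is its distance, so the last returned hit carries the maximum.
function maxDistanceOfReturnedHits(response) {
  const hits = response.hits.hits;
  if (hits.length === 0) return null; // nothing matched within the filter
  return hits[hits.length - 1].sort[0];
}

// Trimmed response using the sort values from the question's output.
const response = {
  hits: {
    total: 19428,
    hits: [
      { _id: "AVY1KqHN5rKRAKXZHxQf", sort: [0.013847018573431258] },
      { _id: "AVY1Ewoc5rKRAKXZGhMp", sort: [0.01675513175670524] },
      { _id: "AVY1T0cn5rKRAKXZJwC8", sort: [0.018417500448048605] },
      { _id: "AVY1Xb2P5rKRAKXZKhUh", sort: [0.018816876925529115] },
      { _id: "AVY1TNJh5rKRAKXZJnx0", sort: [0.022680252269458714] }
    ]
  }
};
console.log(maxDistanceOfReturnedHits(response)); // 0.022680252269458714
```

This avoids the scripted max aggregation entirely, at the cost of relying on the sort order of the returned hits.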

Complex aggregations with Elastic Search

Supposing this is my elasticsearch structure:
{
"_index": "my_index",
"_type": "person",
"_id": "ID",
"_source": {
...DATA...
}
}
{
"_index": "my_index",
"_type": "result",
"_id": "ID",
"_source": {
"personID": "personID",
"date": "timestamp",
"result": "integer",
"speciality": "categoryID"
}
}
I would like to get the 10 most "influential" people based on:
number of competitions in the last 30 days
number of competitions in the last year
competition results in the last 30 days
number of different specialities in the last 30 days
I'm thinking about using _score but I don't know how to influence the score using some values aggregated from the documents of type "result" . This is what I'm trying to achieve
POST my_index/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"term": {
"_type": {
"value": "person"
}
}
}
]
}
}
},
"functions": [
{
"field_value_factor": {
"field": {
"query": {
//competitions in the last 30 days
},
"aggs": {
//count
}
},
"factor": 1
},
"weight": 0.1
}
]
}
}
Is this possible with just 1 request?
Is this a good approach?
Any tip on what to look at is appreciated

Specifying total size of results to return for ElasticSearch query when using inner_hits

ElasticSearch allows inner_hits to specify 'from' and 'size' parameters, as can the outer request body of a search.
As an example, assume my index contains 25 books, each having less than 50 chapters. The below snippet would return all chapters across all books, because a 'size' of 100 books includes all of 25 books and a 'size' of 50 chapters includes all of "less than 50 chapters":
"index": 'books',
"type": 'book',
"body": {
"from" : 0, "size" : 100, // outer hits, or books
"query": {
"filtered": {
"filter": {
"nested": {
"inner_hits": {
"size": 50 // inner hits, or chapters
},
"path": "chapter",
"query": { "match_all": { } },
}
}
}
},
.
.
.
Now, I'd like to implement paging with a scenario like this. My question is, how?
In this case, do I have to return back the above max of 100 * 50 = 5000 documents from the search query and implement paging in the application level by displaying only the slice I am interested in? Or, is there a way to specify the total number of hits to return back in the search query itself, independent of the inner/outer size?
I am looking at the "response" as follows, and so would like this data to be able to be paginated:
response.hits.hits.forEach(function(book) {
chapters = book.inner_hits.chapters.hits.hits;
chapters.forEach(function(chapter) {
// ... this is one displayed result ...
});
});
I don't think this is possible with Elasticsearch and nested fields. The way you see the results is correct: ES paginates and returns books, and it doesn't look inside the nested inner_hits. That is not how it works. You need to handle the pagination manually in your code.
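A minimal sketch of that manual pagination, assuming the response shape used in the question's forEach snippet (inner hits under the key "chapters"). The function name and the zero-based page numbering are illustrative choices, not part of the Elasticsearch client.

```javascript
// Hypothetical helper: flattens every book's inner-hit chapters into one list
// and returns a single page of it. The inner_hits key "chapters" follows the
// question's response-handling snippet; page numbering starts at 0.
function pageOfChapters(response, page, pageSize) {
  const all = [];
  for (const book of response.hits.hits) {
    for (const chapter of book.inner_hits.chapters.hits.hits) {
      all.push({ bookId: book._id, chapter });
    }
  }
  const start = page * pageSize;
  return all.slice(start, start + pageSize);
}

// Made-up response with two books, used only to exercise the helper.
const sampleResponse = {
  hits: { hits: [
    { _id: "book1", inner_hits: { chapters: { hits: { hits: [
      { _source: { title: "b1-chap1" } },
      { _source: { title: "b1-chap2" } },
      { _source: { title: "b1-chap3" } }
    ] } } } },
    { _id: "book2", inner_hits: { chapters: { hits: { hits: [
      { _source: { title: "b2-chap1" } },
      { _source: { title: "b2-chap2" } }
    ] } } } }
  ] }
};

// Second page of two chapters: the last chapter of book1, then the first of book2.
console.log(pageOfChapters(sampleResponse, 1, 2));
```

This still requires fetching all books and chapters up front, which is the cost the question was hoping to avoid.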
There is another option, but you need a parent/child relationship instead of nested.
Then you are able to query the children (that is, the chapters) and paginate those results. You can use inner_hits to return the parent (the book itself) along with each chapter.
PUT /library
{
"mappings": {
"book": {
"properties": {
"name": {
"type": "string"
}
}
},
"chapter": {
"_parent": {
"type": "book"
},
"properties": {
"title": {
"type": "string"
}
}
}
}
}
The query:
GET /library/chapter/_search
{
"size": 5,
"query": {
"has_parent": {
"type": "book",
"query": {
"match_all": {}
},
"inner_hits" : {}
}
}
}
And a sample output (trimmed):
"hits": [
{
"_index": "library",
"_type": "chapter",
"_id": "1",
"_score": 1,
"_source": {
"title": "chap1"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
},
{
"_index": "library",
"_type": "chapter",
"_id": "2",
"_score": 1,
"_source": {
"title": "chap2"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
}
The search api allows for the addition of certain standard parameters, listed in the docs at: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference-2-0.html#api-search-2-0
According to the doc:
size Number — Number of hits to return (default: 10)
Which would make your request something like:
"size": 5000,
"index": 'books',
"type": 'book',
"body": {

Resources