I want to have a scripted aggregation of key/value pairs in a nested array in Elasticsearch. An example of the documents returned is as follows:
"hits": [
{
"_index": "testdan",
"_type": "year",
"_id": "AVtXirjYuoFS95t7pfkg",
"_score": 1,
"_source": {
"m_iYear": 2006,
"m_iTopicID": 11,
"m_People": [
{
"name": "Petrovic, Rade",
"value": 3.70370364
},
{
"name": "D. Kirovski",
"value": 3.70370364
}
]
}},
{
"_index": "testdan",
"_type": "year",
"_id": "AVtXirjYuoFS95t7pfkg",
"_score": 1,
"_source": {
"m_iYear": 2007,
"m_iTopicID": 11,
"m_People": [
{
"name": "Petrovic, Rade",
"value": 6.70370364
},
{
"name": "D. Kirovski",
"value": 2.70370364
}
]
}}
]
I would like to aggregate an average value for each person in m_People across all documents, as follows:
Petrovic, Rade = (3.70370364 + 6.70370364) / 2 = 5.20370364
D. Kirovski = (3.70370364 + 2.70370364) / 2 = 3.20370364
The divisor for the average should be the number of years in which that name appears, since a name may not show up in every year.
If this is more difficult due to not having unique IDs for people, I plan to add an ID for each person. But how would you go about scripting this so that, instead of returning all people and having to loop through them on the front end, I can just get back an array of people and their averages?
You may be able to achieve this sort of aggregation by using Kibana scripted fields; see the examples section. This assumes you are using Elasticsearch 5.0 or later, since the scripting language is Painless.
You can achieve this with a nested aggregation pretty easily. For each year, we're aggregating on the people's names and then computing the average value for each of them.
{
"size": 0,
"aggs": {
"years": {
"terms": {
"field": "m_iYear"
},
"aggs": {
"names": {
"nested": {
"path": "m_People"
},
"aggs": {
"names": {
"terms": {
"field": "m_People.name"
},
"aggs": {
"average": {
"avg": {
"field": "m_People.value"
}
}
}
}
}
}
}
}
}
}
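If instead you want a single average per person across all years, which is what the arithmetic in the question describes, you can drop the year bucketing. A minimal sketch, assuming m_People is mapped as a nested type and m_People.name is a not-analyzed field:
{
  "size": 0,
  "aggs": {
    "people": {
      "nested": {
        "path": "m_People"
      },
      "aggs": {
        "names": {
          "terms": {
            "field": "m_People.name"
          },
          "aggs": {
            "average": {
              "avg": {
                "field": "m_People.value"
              }
            }
          }
        }
      }
    }
  }
}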
Related
I'm trying to create a query that returns information about how many documents don't have data for two fields (date.new and date.old). I have tried the query below, but it works as OR logic, where all documents missing either date.new or date.old are returned. Does anyone know how I can make this return only documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}
Aggregations are not the feature to use for this. You need to use the exists query wrapped within a bool/must_not query, like this:
GET index/_count
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "date.new"
          }
        },
        {
          "exists": {
            "field": "date.old"
          }
        }
      ]
    }
  }
}
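The same bool/must_not also works with the _search endpoint if you want the matching documents back rather than just a count; a minimal sketch:
GET index/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "date.new"
          }
        },
        {
          "exists": {
            "field": "date.old"
          }
        }
      ]
    }
  }
}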
hits.total.value indicates the count of the documents that match the search request. The value indicates the number of hits that match, and relation indicates whether the value is accurate (eq) or a lower bound (gte).
Index Data:
{
"data": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The search query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}
I'm trying to work with Elastic (5.6) and to find a way to retrieve the top documents per category.
I have an index with the following kind of documents:
{
"#timestamp": "2018-03-22T00:31:00.004+01:00",
"statusInfo": {
"status": "OFFLINE",
"timestamp": 1521675034892
},
"name": "myServiceName",
"id": "xxxx",
"type": "Http",
"key": "key1",
"httpStatusCode": 200
}
What I'm trying to do with these is retrieve the last document (#timestamp-based) per name (my categories), check whether its statusInfo.status is OFFLINE or UP, and fetch these results into the hits part of a response so I can feed it to a Kibana count dashboard or somewhere else (a REST-based tool I do not control and can't modify myself).
Basically, I want to know how many of my services (name) are OFFLINE (statusInfo.status) in their last update (#timestamp) for monitoring purposes.
I'm stuck at the "Get how many of my services" part.
My query so far:
GET actuator/_search
{
"size": 0,
"aggs": {
"name_agg": {
"terms": {
"field": "name.raw",
"size": 1000
},
"aggs": {
"last_document": {
"top_hits": {
"_source": ["#timestamp", "name", "statusInfo.status"],
"size": 1,
"sort": [
{
"#timestamp": {
"order": "desc"
}
}
]
}
}
}
}
},
"post_filter": {
"bool": {
"must_not": {
"term": {
"statusInfo.status.raw": "UP"
}
}
}
}
}
This provides the following response:
{
"all_the_meta":{...},
"hits": {
"total": 1234,
"max_score": 0,
"hits": []
},
"aggregations": {
"name_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "myCategory1",
"doc_count": 225,
"last_document": {
"hits": {
"total": 225,
"max_score": null,
"hits": [
{
"_index": "myIndex",
"_type": "Http",
"_id": "dummy id",
"_score": null,
"_source": {
"#timestamp": "2018-04-06T00:06:00.005+02:00",
"statusInfo": {
"status": "UP"
},
"name": "myCategory1"
},
"sort": [
1522965960005
]
}
]
}
}
},
{other_buckets...}
]
}
}
}
Removing the size makes the result contain ALL of the documents, which is not what I need; I only need each bucket's content (each one contains a single document).
Removing the post_filter does not appear to do much.
I think this would be feasible in Oracle SQL with an OVER (PARTITION BY ...) clause, followed by a condition.
Does somebody know how this could be achieved?
If I understand you correctly, you are looking for the latest doc that has a status of OFFLINE in each group (grouped by name)? In that case you can try the query below; the number of buckets (one per service) should give you the "how many are down" count (for UP you would change the term in the filter).
NOTE: this is done on the latest version, so it uses the keyword field instead of raw.
POST /index/_search
{
"size": 0,
"query":{
"bool":{
"filter":{
"term": {"statusInfo.status.keyword": "OFFLINE"}
}
}
},
"aggs":{
"services_agg":{
"terms":{
"field": "name.keyword"
},
"aggs":{
"latest_doc":{
"top_hits": {
"sort": [
{
"#timestamp":{
"order": "desc"
}
}
],
"size": 1,
"_source": ["#timestamp", "name", "statusInfo.status"]
}
}
}
}
}
}
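If all you need is the single "how many are down" number rather than each service's latest document, a cardinality aggregation over the same filter would give it directly. A sketch (note that, like the query above, this counts services having at least one OFFLINE document, not only those whose most recent document is OFFLINE):
POST /index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": { "statusInfo.status.keyword": "OFFLINE" }
      }
    }
  },
  "aggs": {
    "offline_services": {
      "cardinality": {
        "field": "name.keyword" // approximate count of distinct offline services
      }
    }
  }
}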
Supposing this is my Elasticsearch structure:
{
"_index": "my_index",
"_type": "person",
"_id": "ID",
"_source": {
...DATA...
}
}
{
"_index": "my_index",
"_type": "result",
"_id": "ID",
"_source": {
"personID": "personID"
"date": "timestamp",
"result": "integer",
"speciality": "categoryID"
}
}
I would like to get the 10 most "influential" people based on:
number of competitions in the last 30 days
number of competitions in the last year
competition results in the last 30 days
number of different specialities in the last 30 days
I'm thinking about using _score, but I don't know how to influence the score using values aggregated from the documents of type "result". This is what I'm trying to achieve:
POST my_index/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"term": {
"_type": {
"value": "person"
}
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": {
"query": {
//competitions in the last 30 days
},
"aggs": {
// count
}
},
"factor": 1
},
"weight": 0.1
}
]
}
}
}
Is this possible with just 1 request?
Is this a good approach?
Any tip on what to look at is appreciated.
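As far as I know, field_value_factor can only read a concrete document field, not the result of an aggregation, so the snippet above cannot work as written; a common workaround is to pre-compute such metrics at index time. For reference, a minimal sketch of valid function_score syntax, assuming a hypothetical pre-computed competitions_last_30d field on each person document:
POST my_index/_search
{
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "term": {
          "_type": {
            "value": "person"
          }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "competitions_last_30d", // hypothetical pre-computed counter
            "factor": 1,
            "missing": 0
          },
          "weight": 0.1
        }
      ]
    }
  }
}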
Elasticsearch allows inner_hits to specify 'from' and 'size' parameters, just as the outer request body of a search does.
As an example, assume my index contains 25 books, each having fewer than 50 chapters. The snippet below would return all chapters across all books, because a 'size' of 100 books includes all 25 books and a 'size' of 50 chapters includes all of the "fewer than 50 chapters":
"index": 'books',
"type": 'book',
"body": {
"from" : 0, "size" : 100, // outer hits, or books
"query": {
"filtered": {
"filter": {
"nested": {
"inner_hits": {
"size": 50 // inner hits, or chapters
},
"path": "chapter",
"query": { "match_all": { } },
}
}
}
},
.
.
.
Now, I'd like to implement paging with a scenario like this. My question is, how?
In this case, do I have to return the above maximum of 100 * 50 = 5000 documents from the search query and implement paging at the application level, displaying only the slice I am interested in? Or is there a way to specify the total number of hits to return in the search query itself, independent of the inner/outer size?
I am looking at the "response" as follows, and would like this data to be paginated:
response.hits.hits.forEach(function(book) {
chapters = book.inner_hits.chapters.hits.hits;
chapters.forEach(function(chapter) {
// ... this is one displayed result ...
});
});
I don't think this is possible with Elasticsearch and nested fields. The way you see the results is correct: ES paginates and returns books, and it doesn't look inside the nested inner_hits; that's simply not how it works. You need to handle the pagination manually in your code.
There is another option, but it requires a parent/child relationship instead of nested documents.
You can then query the children (meaning, the chapters) and paginate those results directly, using inner_hits to return the parent (the book itself).
PUT /library
{
"mappings": {
"book": {
"properties": {
"name": {
"type": "string"
}
}
},
"chapter": {
"_parent": {
"type": "book"
},
"properties": {
"title": {
"type": "string"
}
}
}
}
}
The query:
GET /library/chapter/_search
{
"size": 5,
"query": {
"has_parent": {
"type": "book",
"query": {
"match_all": {}
},
"inner_hits" : {}
}
}
}
And a sample output (trimmed):
"hits": [
{
"_index": "library",
"_type": "chapter",
"_id": "1",
"_score": 1,
"_source": {
"title": "chap1"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
},
{
"_index": "library",
"_type": "chapter",
"_id": "2",
"_score": 1,
"_source": {
"title": "chap2"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
}
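Since the hits are now the chapters themselves, standard from/size pagination applies to them directly; for example, a sketch of fetching the second page of five chapters:
GET /library/chapter/_search
{
  "from": 5,
  "size": 5,
  "query": {
    "has_parent": {
      "type": "book",
      "query": {
        "match_all": {}
      },
      "inner_hits": {}
    }
  }
}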
The search api allows for the addition of certain standard parameters, listed in the docs at: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference-2-0.html#api-search-2-0
According to the doc:
size Number — Number of hits to return (default: 10)
Which would make your request something like:
"size": 5000,
"index": 'books',
"type": 'book',
"body": {
I have JSON data:
"hits": [
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "AHkuN5_iRGO-R5dtaOvz6w",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:55:04.509Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "Busk7MDFQ4emtL3x5AQyZA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:58:31.440Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "4AN0zKe9SaSF1trz1IixfA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-07-02T04:53:07.010Z"
}
}
]
I'm trying to write an aggregation query that will find records in a particular "deleted_date" range.
This is my query:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
My problem is that I'm not getting the correct number of records in a particular date range. Whatever date I put in, I get some doc_count number. I'm new to Elasticsearch, and I'm not sure whether this is the right way to write a range aggregation query. Please help me solve this issue.
I think the problem is that you are confusing the "from" and "to" of the date_range aggregation with those of the range filter. The range filter includes both dates (from and to) by default, but the date_range aggregation includes the from value and excludes the to value for each range.
In your query,
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
**"to": "2014-08-02"** -- > if you want to include 2014-08-02 date then do,
"to" : "2014-08-03" (increase date by one, so 08-02 is included)
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
I ran into this as well, and I think your problem is the same.
What the OP is looking for is the InternalDateRange query. Try this instead:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02||/d", // /d rounds off to day
// from value -> 2014-08-02T00:00:00.000Z
"to": "2014-08-03||/d" // to value -> 2014-08-03T00:00:00.000Z
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
This will return the count of matching results in a single bucket named daily_team.
"buckets": [
{
"key": "2014-08-02T00:00:00.000Z-2014-08-03T00:00:00.000Z",
"from": 1470096000000, //test data value
"from_as_string": "2014-08-02T00:00:00.000Z",
"to": 1470182400000, //test data value
"to_as_string": "2014-08-03T00:00:00.000Z",
"doc_count": 0
}
]
This returns a single bucket containing the matching doc_count. By contrast, with the original ranges:
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
using the above ranges will return 2 buckets, one each for the from and to date ranges:
from -> 2014-08-02-*
to -> *-2014-08-02
as shown on the official documentation page.