Elasticsearch - how to aggregate a nested _source value

We are using Elasticsearch to fetch some data.
Please tell me how to aggregate the documents grouped by _source.eventName,
like this SQL:
select eventName, count(eventName) from events group by eventName;
Here is my aggs query and the current response data structure.
{
type: 'event',
size: 3,
aggs: {
event_group: {
terms: {
field: 'eventName'
}
}
}
}
 
"hits": {
"hits": [
{
"_type": "event",
"_source": {
"eventName": "event1",
}
},
{
"_type": "event",
"_source": {
"eventName": "event1",
}
},
{
"_type": "event",
"_source": {
"eventName": "event2",
}
}
]
}
Ideal case (this is the result I would like to get):
{
"eventName": "event1",
"count": 2
},
{
"eventName": "event2",
"count": 1
}

Elasticsearch doesn't support this kind of filtering in the query itself, but you can use the REST API filter_path parameter, which will return something like
GET .../_search?pretty&filter_path=hits.hits._source.*
"hits": {
"hits": [
{
"_source": {...}
},
{
"_source": {...}
},
{
"_source": {...}
}
]
}
See the Elasticsearch documentation on common options.

Your query is almost correct. If you only want the count of eventNames, use size: 0 there instead of 3. It will tell ES to not return hits.
The response should have an aggregations property like so:
{
"aggregations":
{
"event_group": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "event1",
"doc_count" : 10
},
{
"key" : "event2",
"doc_count" : 10
},
{
"key" : "event3",
"doc_count" : 10
}
]
}
}
}
The doc_count property there is the count you're looking for.
Note: by default, ES only returns the top 10 eventName buckets. Depending on the ES version you're using, if you want to get all unique eventNames, you need to specify a size in your terms aggregation. Read the ES docs for more info.
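If the application needs the flat eventName/count shape from the question, the buckets can be reshaped on the client side. Below is a minimal sketch in Python, assuming the response has already been parsed into a dict and the aggregation is named event_group as in the query. The helper name buckets_to_counts and the sample response are made up for illustration.

```python
def buckets_to_counts(response, agg_name="event_group"):
    """Turn terms-aggregation buckets into [{"eventName": ..., "count": ...}]."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [{"eventName": b["key"], "count": b["doc_count"]} for b in buckets]


# Hand-written sample response (an assumption, shaped like the one above).
sample_response = {
    "aggregations": {
        "event_group": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "event1", "doc_count": 2},
                {"key": "event2", "doc_count": 1},
            ],
        }
    }
}

print(buckets_to_counts(sample_response))
```

The function only reads key and doc_count, so it keeps working if the terms aggregation later gets a larger size.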

Related

Elasticsearch - Find documents missing two fields

I'm trying to create a query that returns how many documents have no data for two fields (date.new and date.old). I have tried the query below, but it works with OR logic: all documents missing either date.new or date.old are counted. Does anyone know how I can make it return only documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}
Aggregations are not the right feature for this. You need to use the exists query wrapped within a bool/must_not query, like this (the _count API takes the query under a "query" key and does not accept "size"):
GET index/_count
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "date.new"
}
},
{
"exists": {
"field": "date.old"
}
}
]
}
}
}
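To make the AND logic of must_not explicit, here is a small Python sketch. It builds the request body for any list of fields and emulates the matching locally on a few made-up documents. The helper names and the sample documents are assumptions for illustration, and has_field is only a rough stand-in for the real exists query (it ignores cases such as empty arrays).

```python
def missing_all_fields_query(fields):
    """Build a body matching documents where none of `fields` exists."""
    return {
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": f}} for f in fields]
            }
        }
    }


def has_field(doc, dotted_path):
    """Rough local stand-in for `exists`: the dotted path holds a non-null value."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return current is not None


def count_missing_all(docs, fields):
    """Count documents for which every listed field is missing (AND logic)."""
    return sum(1 for d in docs if not any(has_field(d, f) for f in fields))


# Made-up sample documents.
docs = [
    {"date": {"new": 1501, "old": 10}},  # has both fields
    {"title": "elasticsearch"},          # has neither field
    {"date": {"new": 1400}},             # has only date.new
]

print(missing_all_fields_query(["date.new", "date.old"]))
print(count_missing_all(docs, ["date.new", "date.old"]))
```

Only the second document is counted, because must_not rejects a document as soon as any one of the exists clauses matches.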
If you run the query with _search instead, hits.total.value indicates the number of documents that match the search request. value is the number of matching hits, and relation indicates whether that value is exact (eq) or a lower bound (gte).
Index Data:
{
"data": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The search query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}

ElasticSearch Range query

I have created the index by using the following mapping:
PUT test1
{
"mappings": {
"type1": {
"properties": {
"age": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 32766
}
}
}
}
}
}
}
Added following documents into index:
PUT test1/type1/1/_create
{
"age":50
}
PUT test1/type1/2/_create
{
"age":100
}
PUT test1/type1/3/_create
{
"age":150
}
PUT test1/type1/4/_create
{
"age":200
}
I have used the following range query to fetch result:
GET test1/_search
{
"query": {
"range" : {
"age" : {
"lte" : 150
}
}
}
}
It is giving me the following response :
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test1",
"_type": "type1",
"_id": "2",
"_score": 1,
"_source": {
"age": 100
}
},
{
"_index": "test1",
"_type": "type1",
"_id": "3",
"_score": 1,
"_source": {
"age": 150
}
}
]
}
}
The response above does not include the document with age 50; it only returns the documents with age 100 and 150, even though 50 is also less than 150. What is wrong here?
In my schema the age field has type text, and I don't want to change it.
Can anyone help me get a valid result?
Because the age field type is text, the range query compares values in alphabetical (lexicographic) order. So the results are correct:
"100"<"150"
"150"="150"
"50">"150"
If you are ingesting only numbers in the age field, you should change its type to a numeric type, or add another inner field with a numeric type, just as you did with the raw inner field.
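The same ordering effect can be reproduced outside Elasticsearch. The Python sketch below is only an analogy for how terms are compared, not what ES runs internally; it filters the four ages from the question first as strings and then as numbers.

```python
ages = ["50", "100", "150", "200"]

# String comparison is lexicographic, character by character: "5" > "1",
# so "50" sorts after "150" and is filtered out.
as_text = [a for a in ages if a <= "150"]

# Numeric comparison gives the result the question expects.
as_numbers = [a for a in ages if int(a) <= 150]

print(as_text)     # ['100', '150']
print(as_numbers)  # ['50', '100', '150']
```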
UPDATE: Tested on local system and it is working.
NOTE: Ideally, you would fix the mapping, but if there is no other choice and you are not the person who decides on the mapping, you can still achieve it as follows.
For ES version 6.3 onwards, try this.
GET test1/type1/_search
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": "Integer.parseInt(doc['age.raw'].value) <= 150",
"lang": "painless"
}
}
}
}
}
}
Sources to refer:
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl-script-query.html
https://discuss.elastic.co/t/painscript-script-cast-string-as-int/97034
The type of your age field in the mapping is set to text. That is the reason it uses dictionary ordering, where "50" > "150". Please use the long data type. https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Latest document for each category?

I have documents in ElasticSearch with structure like this:
{
"created_on": [timestamp],
"source_id": [a string ID for the source],
"type": [a term],
... other fields
}
Obviously, I can select these documents in Kibana, show them in "discover", produce (for example) a pie chart showing type terms, and so on.
However, the requirement I've been given is to use only the most recent document for each source_id.
The approach I've tried is to map the documents into one bucket per source_id, then for each bucket, reduce to remove all but the document with the latest created_on.
However, when I used the terms aggregator, the result only contained counts, not whole documents I could further process:
"aggs" : {
"sources" : {
"terms" : { "field" : "source_id" }
}
}
How can I make this query?
If I understood correctly what you're trying to do, one way to accomplish it is to put a top_hits aggregation under the terms aggregation. For each bucket of its parent aggregation, top_hits returns the top matching documents, sorted by whatever criteria you choose. Following your example, you could do something like
{
"aggs": {
"by_source_id": {
"terms": {
"field": "source_id"
},
"aggs": {
"most_recent": {
"top_hits": {
"sort": {
"created_on": "desc"
},
"size": 1
}
}
}
}
}
}
So you are grouping by source_id, which will create a bucket for each one, and then you'll get the top hits for each bucket according to the sorting criteria set in the top_hits agg, in this case the created_on field.
The result you should expect would be something like
....
"buckets": [
{
"key": 3,
"doc_count": 2,
"most_recent": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "so_sample02",
"_type": "items",
"_id": "2",
"_score": null,
"_source": {
"created_on": "2018-05-01 07:00:01",
"source_id": 3,
"type": "a"
},
"sort": [
1525158001000
]
}
]
}
}
},
{
"key": 5,
"doc_count": 2, .... and so on
Notice how, within each bucket, most_recent contains the corresponding hits. You can also limit the fields returned by adding source filtering to your top_hits agg, for example "_source": { "includes": ["fieldA", "fieldB"] } and so on.
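On the client side, the nested response can be reduced to one document per source_id. The following Python sketch assumes the response is already parsed into a dict and uses the aggregation names by_source_id and most_recent from the query above. The helper name latest_per_source and the trimmed sample response are made up for illustration.

```python
def latest_per_source(response, agg_name="by_source_id", hits_name="most_recent"):
    """Map each bucket key to the _source of its single top hit."""
    latest = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # defensive: skip a bucket that somehow has no hits
            latest[bucket["key"]] = hits[0]["_source"]
    return latest


# Trimmed, hand-written sample shaped like the response shown above.
sample_response = {
    "aggregations": {
        "by_source_id": {
            "buckets": [
                {
                    "key": 3,
                    "doc_count": 2,
                    "most_recent": {
                        "hits": {
                            "hits": [
                                {
                                    "_id": "2",
                                    "_source": {
                                        "created_on": "2018-05-01 07:00:01",
                                        "source_id": 3,
                                        "type": "a",
                                    },
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}

print(latest_per_source(sample_response))
```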
Hope that helps.

Filter results to remove documents with the same field value based on another field value (without aggregation)

Given the following 4 objects in an elasticsearch index:
"hits": [
{
"_id": "0:0",
"_source": {
"id": 0,
"version": 0,
"published": true
}
},
{
"_id": "0:1",
"_source": {
"id": 0,
"version": 1,
"published": false,
"latest": true
}
},
{
"_id": "1:0",
"_source": {
"id": 1,
"version": 0,
"published": true
}
},
{
"_id": "1:1",
"_source": {
"id": 1,
"version": 1,
"published": true,
"latest": true
}
}
]
I would like to find the documents using these rules:
with published:true
no duplicate id
for documents with the same id the highest version should be returned.
So for the above I'd like to get 0:0 and 1:1:
"hits": [
{
"_id": "0:0",
"_source": {
"id": 0,
"version": 0,
"published": true
}
},
{
"_id": "1:1",
"_source": {
"id": 1,
"version": 1,
"published": true,
"latest": true
}
}
]
I'm aware that I can use top_hits, but I'd like to know if this is possible without it, such that the main hits.hits array will contain these results.
I'd probably do the collapsing as follows:
{
query : {...},
aggs : {
ids: {
terms: {
field: "id"
},
aggs:{
dedup:{
top_hits:{ size:1, sort: {version : 'desc'} }
}
}
}
}
}
The reason I'm hoping to avoid using top_hits is that I'll need to update the result parser in our application. Also the size field will not work correctly if I do so.
To answer my own question based on this answer: it's not possible without using the top_hits aggregation. I think what I was trying to achieve wasn't the best use of aggregations. Instead I'm going to adjust the index model by adding latestPublished: true to the relevant documents, allowing the query to be { term: { latestPublished: true } }.
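The flag has to be computed before (re)indexing. One possible way to derive it is sketched in Python below, using the four documents from the question. The helper name mark_latest_published is hypothetical, and a real system would also need to clear the flag on the previous winner whenever a newer published version is indexed.

```python
def mark_latest_published(docs):
    """Set latestPublished=True on the highest published version of each id."""
    best = {}  # id -> document with the highest published version seen so far
    for doc in docs:
        if not doc.get("published"):
            continue
        current = best.get(doc["id"])
        if current is None or doc["version"] > current["version"]:
            best[doc["id"]] = doc
    # Compare by object identity so equal-looking dicts are not confused.
    winners = {id(d) for d in best.values()}
    return [dict(d, latestPublished=(id(d) in winners)) for d in docs]


docs = [
    {"id": 0, "version": 0, "published": True},
    {"id": 0, "version": 1, "published": False, "latest": True},
    {"id": 1, "version": 0, "published": True},
    {"id": 1, "version": 1, "published": True, "latest": True},
]

marked = mark_latest_published(docs)
print([(d["id"], d["version"]) for d in marked if d["latestPublished"]])
```

With these inputs, the flagged documents are the ones the question asks for: 0:0 and 1:1.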

Specifying total size of results to return for ElasticSearch query when using inner_hits

ElasticSearch allows inner_hits to specify 'from' and 'size' parameters, as can the outer request body of a search.
As an example, assume my index contains 25 books, each having fewer than 50 chapters. The snippet below would return all chapters across all books, because a 'size' of 100 books covers all 25 books and a 'size' of 50 chapters covers all of the "fewer than 50 chapters":
"index": 'books',
"type": 'book',
"body": {
"from" : 0, "size" : 100, // outer hits, or books
"query": {
"filtered": {
"filter": {
"nested": {
"inner_hits": {
"size": 50 // inner hits, or chapters
},
"path": "chapter",
"query": { "match_all": { } },
}
}
}
},
.
.
.
Now, I'd like to implement paging with a scenario like this. My question is, how?
In this case, do I have to return back the above max of 100 * 50 = 5000 documents from the search query and implement paging in the application level by displaying only the slice I am interested in? Or, is there a way to specify the total number of hits to return back in the search query itself, independent of the inner/outer size?
I am looking at the "response" as follows, and so would like this data to be able to be paginated:
response.hits.hits.forEach(function(book) {
// unless a name is set, nested inner_hits are keyed by the nested path ("chapter")
var chapters = book.inner_hits.chapter.hits.hits;
chapters.forEach(function(chapter) {
// ... this is one displayed result ...
});
});
I don't think this is possible with Elasticsearch and nested fields. The way you see the results is correct: ES paginates and returns books, and it doesn't look inside the nested inner_hits; that is not how it works. You need to handle the pagination manually in your code.
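Manual pagination could look roughly like the Python sketch below: flatten the (book, chapter) pairs in response order, then slice out the requested page. The helper names, the zero-based page convention and the sample response are assumptions for illustration. inner_name defaults to the nested path chapter, so pass whatever key your response actually uses.

```python
def flatten_chapters(response, inner_name="chapter"):
    """Yield (book_hit, chapter_hit) pairs in response order."""
    for book in response["hits"]["hits"]:
        for chapter in book["inner_hits"][inner_name]["hits"]["hits"]:
            yield book, chapter


def page_of_chapters(response, page, per_page, inner_name="chapter"):
    """Return one zero-based page of chapters from the flattened results."""
    flat = list(flatten_chapters(response, inner_name))
    start = page * per_page
    return flat[start:start + per_page]


def _book(book_id, chapter_ids):
    """Build a made-up book hit with the given inner chapter hits."""
    chapters = [{"_id": c} for c in chapter_ids]
    return {"_id": book_id, "inner_hits": {"chapter": {"hits": {"hits": chapters}}}}


sample_response = {
    "hits": {"hits": [_book("b1", ["c1", "c2", "c3"]), _book("b2", ["c4", "c5"])]}
}

second_page = page_of_chapters(sample_response, page=1, per_page=2)
print([(b["_id"], c["_id"]) for b, c in second_page])
```

This still fetches every book and chapter on each request, which is the cost of paginating nested hits in application code.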
There is another option, but you need a parent/child relationship instead of nested.
Then you are able to query the children (meaning, the chapters) and paginate the results (the chapters). You can use inner_hits and return back the parent (the book itself).
PUT /library
{
"mappings": {
"book": {
"properties": {
"name": {
"type": "string"
}
}
},
"chapter": {
"_parent": {
"type": "book"
},
"properties": {
"title": {
"type": "string"
}
}
}
}
}
The query:
GET /library/chapter/_search
{
"size": 5,
"query": {
"has_parent": {
"type": "book",
"query": {
"match_all": {}
},
"inner_hits" : {}
}
}
}
And a sample output (trimmed):
"hits": [
{
"_index": "library",
"_type": "chapter",
"_id": "1",
"_score": 1,
"_source": {
"title": "chap1"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
},
{
"_index": "library",
"_type": "chapter",
"_id": "2",
"_score": 1,
"_source": {
"title": "chap2"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
}
The search API accepts certain standard parameters, listed in the docs at: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference-2-0.html#api-search-2-0
According to the doc:
size Number — Number of hits to return (default: 10)
Which would make your request something like:
"size": 5000,
"index": 'books',
"type": 'book',
"body": {

Resources