Result filter and pagination in Elasticsearch

I need some help or an idea for the correct procedure.
I have already indexed a large number of documents. Now I have found out that some documents have almost the same content, for example:
{
"title": "myDocument",
"date": "2017-09-18",
"page": 1
}
{
"title": "myDocument",
"date": "2017-09-18",
"page": 2
}
The title field is mapped as text, date as date and page as integer. As you can see, the only difference is the page value.
Now I want to run a query and filter out these duplicates. Field collapsing seems like a good way to do it, but then I can't get the correct count of results, and that's important for me.
Another way would be to fetch all results first and then filter them "manually", but then I have a problem with pagination.

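Try something like this, which nests terms aggregations on title, date and page so that each distinct combination gets its own bucket: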
GET index/type/_search
{
"aggs": {
"count_by_title_date_page":{
"terms": {
"field": "title.keyword",
"size": 100
},
"aggs": {
"date": {
"terms": {
"field": "date",
"size": 100
},
"aggs": {
"page": {
"terms": {
"field": "page",
"size": 100
}
}
}
}
}
}
}
}
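If field collapsing is otherwise the right fit, one possible way to recover a total is to pair it with a cardinality aggregation, which estimates the number of distinct collapse keys. The sketch below only builds the request body as a Python dict. The helper name `build_dedup_search`, the aggregation name `distinct_titles`, the existence of a `title.keyword` sub-field, and the idea that the title alone identifies a duplicate group are all assumptions for illustration, not something stated in the question. Cardinality counts are approximate.

```python
import json


def build_dedup_search(page, page_size=10):
    """Build a search body that collapses duplicates and counts groups.

    Assumes `title.keyword` is a keyword sub-field of `title` and that
    documents sharing a title are duplicates (hypothetical for this index).
    """
    return {
        # Regular from/size pagination applies to the collapsed hits.
        "from": page * page_size,
        "size": page_size,
        "query": {"match_all": {}},
        # Keep one hit per distinct title.
        "collapse": {"field": "title.keyword"},
        "aggs": {
            # Approximate number of distinct titles, i.e. collapsed groups.
            "distinct_titles": {"cardinality": {"field": "title.keyword"}}
        },
    }


body = build_dedup_search(page=2, page_size=10)
print(json.dumps(body, indent=2))
```

The value at aggregations.distinct_titles.value in the response could then be used to work out the number of pages.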

Related

Aggregate by multiple fields and Top 10 questions and answers count in Elasticsearch

When we run the query, for some records only the question comes back and the answer for that question is not shown, even though the database has answers for all the questions. Please let me know the query to get both the questions and the answers for the top 10 records. The query we run is below.
GET logstash-sdc-questionrecords/_search?q=source:website_portal
{
"aggs": {
"genres": {
"terms": {
"field": "question.keyword",
"order": {
"_count": "desc"
},
"size": 10
},
"aggs": {
"genres": {
"terms": {
"field": "answer.keyword",
"order": {
"_count": "desc"
},
"size": 10
}
}
}
}
}
}
From what I understand, you need to show the top 10 questions together with their answers. To achieve that, use a top_hits sub-aggregation under the terms aggregation on the question field, instead of the second terms aggregation on answer.keyword that you are using now.
Your search query should look like this:
{
"aggs": {
"genres": {
"terms": {
"field": "question.keyword",
"size": 10
},
"aggs": {
"top_answer_hits": {
"top_hits": {
"size": 1,
"_source": {
"includes": [
"answer"
]
}
}
}
}
}
}
}
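As a rough illustration of how the response to this query could be read, here is a small Python sketch that pairs each question bucket with the answer from its top hit. The helper name `questions_with_answers` and the sample response are made up for the example. The nesting (aggregations, then buckets, then the top_hits result under the sub-aggregation name) follows the usual shape of a terms plus top_hits response.

```python
def questions_with_answers(response):
    """Map each question bucket to (doc_count, answer of its top hit)."""
    result = {}
    for bucket in response["aggregations"]["genres"]["buckets"]:
        hits = bucket["top_answer_hits"]["hits"]["hits"]
        # A document may lack the answer field, so fall back to None.
        answer = hits[0]["_source"].get("answer") if hits else None
        result[bucket["key"]] = (bucket["doc_count"], answer)
    return result


# Hypothetical, trimmed response used only to exercise the helper.
sample_response = {
    "aggregations": {
        "genres": {
            "buckets": [
                {
                    "key": "How do I reset my password?",
                    "doc_count": 12,
                    "top_answer_hits": {
                        "hits": {"hits": [{"_source": {"answer": "Use the reset link."}}]}
                    },
                },
                {
                    "key": "Where is my invoice?",
                    "doc_count": 7,
                    "top_answer_hits": {"hits": {"hits": [{"_source": {}}]}},
                },
            ]
        }
    }
}

pairs = questions_with_answers(sample_response)
```

A bucket whose top hit has no answer field shows up with None, which may help to spot the records where only the question comes back.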

How do I filter after an aggregation?

I am trying to filter after a top hits aggregation to find out whether the first occurrence of an error was in a given range, but I can't find a way.
I have seen something about bucket selector, but I can't get it to work.
POST log-*/_search/
{
"size": 100,
"aggs": {
"group":{
"terms": {
"field": "errorID.keyword",
"size": 100
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"@timestamp": {
"order": "asc"
}
}
]
}
}
}
}
}
}
With this top hits aggregation I get the first occurrence of a given errorID, since I have many documents with the same errorID. What I want to find out is whether that first occurrence falls within a given range of dates.
I think a valid solution would be to filter the results of the aggregation to check whether it is in the range, but I don't know how to do that.
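One possible approach, sketched here under assumptions, is to add a min sub-aggregation on the timestamp next to the top hits and let a bucket_selector keep only the buckets whose minimum falls in the range. The Python below only builds the request body. The names `first_seen`, `in_range`, `to_epoch_millis` and `build_first_seen_query` are invented for the example, and the field is assumed to be a date field called `@timestamp`. A min aggregation on a date field yields epoch milliseconds, so the range bounds are passed as milliseconds too.

```python
from datetime import datetime, timezone


def to_epoch_millis(iso_date):
    """Convert a YYYY-MM-DD date (taken as UTC midnight) to epoch millis."""
    dt = datetime.strptime(iso_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def build_first_seen_query(start, end):
    """Keep only errorIDs first seen in [start, end). Names are hypothetical."""
    return {
        "size": 0,
        "aggs": {
            "group": {
                "terms": {"field": "errorID.keyword", "size": 100},
                "aggs": {
                    # Earliest timestamp per errorID, as epoch millis.
                    "first_seen": {"min": {"field": "@timestamp"}},
                    # Drop buckets whose first occurrence is outside the range.
                    "in_range": {
                        "bucket_selector": {
                            "buckets_path": {"first": "first_seen"},
                            "script": {
                                "source": "params.first >= params.start"
                                          " && params.first < params.end",
                                "params": {
                                    "start": to_epoch_millis(start),
                                    "end": to_epoch_millis(end),
                                },
                            },
                        }
                    },
                },
            }
        },
    }


query = build_first_seen_query("2017-09-01", "2017-10-01")
```

The top_hits sub-aggregation from the question could be kept alongside `first_seen` if the first document itself is needed as well.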

How to get specific _source fields in aggregation

I am exploring Elasticsearch for use in an application that will handle large volumes of data and generate some statistical results over them. My requirement is to retrieve certain statistics for a particular field. For example, for a given field, I would like to retrieve its unique values and the document frequency of each value, along with the length of the value. The value lengths are indexed along with each document.
So far, I have experimented with Terms Aggregation, with the following query:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
}
}
}
}
The query returns all the values in the field val with the number of documents in which each value occurs. I would like the field val_len to be returned as well. Is it possible to achieve this using Elasticsearch? In other words, is it possible to include specific _source fields in buckets? I have looked through the documentation available online, but I haven't found a solution yet.
Hoping somebody could point me in the right direction. Thanks in advance!
I tried to include _source in the following ways:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"_source":["val_len"]
}
}
and
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"_source":["val_len"]
}
}
}
But I guess this isn't the right way, because both gave me parsing errors.
You need to use a sub-aggregation called top_hits, like this:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100
},
"aggs": {
"hits": {
"top_hits": {
"_source":["val_len"],
"size": 1
}
}
}
}
}
Another way of doing it is to use an avg sub-aggregation on val_len, which also lets you sort the buckets by it:
"aggs": {
"type_count": {
"terms": {
"field": "val.keyword",
"size": 100,
"order": {
"length": "desc"
}
},
"aggs": {
"length": {
"avg": {
"field": "val_len"
}
}
}
}
}

Removing duplicates and sorting (aggs + sort)

I'm trying to find the best way to have a query return a sorted set from which I then remove duplicates using aggs. This works fine, but when I add a sort on the query results, e.g.
"query": {..},
"sort": {.. "body.make": "asc" ..}
I'd like the aggs to return the results in that order as well, but they always seem to be ordered by the query score.
// Here I'm collecting all body.vin values to remove duplicates
// and then returning only the first in each result set.
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
},
I've tried to put a terms aggregation in between to see if that would sort:
// Same as above, but with an attempt to sort on body.make.
// I now realize that each bucket is a collection of duplicates,
// so this sorts within each set of duplicates rather than
// across the final results.
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"order": {
"terms": {
"field": "body.make",
"order": {
"_term": "asc"
}
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
},
But the results from the aggregation are always ordered by score.
I've also toyed with the idea of adjusting the scores based on the query sort, so that the aggregation, which orders by score, would return the proper order, but there doesn't seem to be any way of doing this with sort: {}.
If anyone has had success in sorting results while removing duplicates, or has ideas or suggestions, please let me know.
This is not an ideal solution, since it only allows sorting on one field. The best approach would be to change the scores/boosts on the sorted results.
Trying to explain it made me realize how this could be done, once I grasped the concept of buckets and how they are passed from one aggregation to the next. I would still be interested in a solution that adjusts scores based on the sort, but this works with aggregations:
// Here we first aggregate on body.make, so the first bucket might be
// {"toyota": {body.vin 123}, "toyota": {body.vin 123}...} and the
// next bucket passed into the dedup aggregation would be, say,
// {"nissan"...
"aggs": {
"sort": {
"terms": {
"size": 8,
"field": "body.make",
"order": {
"_term": "desc"
}
},
"aggs": {
"dedup": {
"terms": {
"size": 8,
"field": "body.vin"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
},

Aggregations for categories, sorted by category sequence

I have an elastic index, in which each document contains the following:
"category": {
"id": 4,
"name": "Green",
"seq": 2
}
I can use aggregations to get me the doc count for each of the categories:
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category.name"
}
}
}
}
This is fine, but the buckets are sorted by doc count. What I'd like is to have the buckets sorted by the seq value, something that's easy in SQL.
Any suggestions?
Thanks!
Take a look at ordering in the terms aggregation.
Something like this could work, but only if "name" and "seq" have the right relationship (one-to-one, or it works out in some other way):
POST /test_index/_search
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category.name",
"order" : { "seq_num" : "asc" }
},
"aggs": {
"seq_num": {
"max": {
"field": "category.seq"
}
}
}
}
}
}
Here is some code I used for testing:
http://sense.qbox.io/gist/4e551b2faec81eb0343e0e6d0cc9b10f20d7d4c1
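The one-to-one caveat above could be checked, for instance, by asking for both the min and the max of category.seq per bucket and flagging buckets where they differ. This is only a sketch: the aggregation names `seq_min` and `seq_max`, the helper names, and the sample buckets are all hypothetical.

```python
def build_seq_check_query():
    """Order categories by seq and expose min/max to detect mismatches."""
    return {
        "size": 0,
        "aggs": {
            "category": {
                "terms": {"field": "category.name", "order": {"seq_min": "asc"}},
                "aggs": {
                    "seq_min": {"min": {"field": "category.seq"}},
                    "seq_max": {"max": {"field": "category.seq"}},
                },
            }
        },
    }


def ambiguous_categories(buckets):
    """Return names of buckets whose documents carry more than one seq."""
    return [
        b["key"]
        for b in buckets
        if b["seq_min"]["value"] != b["seq_max"]["value"]
    ]


# Hypothetical buckets, shaped like the terms aggregation output.
sample_buckets = [
    {"key": "Red", "doc_count": 5, "seq_min": {"value": 1.0}, "seq_max": {"value": 1.0}},
    {"key": "Green", "doc_count": 3, "seq_min": {"value": 2.0}, "seq_max": {"value": 4.0}},
]

suspect = ambiguous_categories(sample_buckets)
```

An empty result would suggest that each name maps to a single seq, in which case ordering by either sub-aggregation gives the same bucket order.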
