Getting the terms with high document frequency - elasticsearch

How can I get the top 10 terms with highest document frequencies?
I have an analyzed field called article.
I am using ES 2.3.0.

You can simply use an aggregation:
POST /my_articles/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "term_count": {
      "terms": {
        "field": "article",
        "size": 10
      }
    }
  }
}
For each term, this returns the number of documents in which it appears. Note that it does not account for a term occurring multiple times within the same field.
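For reference, the relevant part of the response has roughly this shape (the terms and counts below are made up for illustration):

```json
{
  "aggregations": {
    "term_count": {
      "buckets": [
        { "key": "elasticsearch", "doc_count": 120 },
        { "key": "aggregation", "doc_count": 95 }
      ]
    }
  }
}
```

Each bucket's doc_count is the document frequency of that term, and buckets are ordered by doc_count by default.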

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a field filename and a date field when indicating the time of the download.
What I want to achieve is to get, for each file the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation instead. Of course, if there is a better aggregation for this, it can be used here...
So for the moment, I have something like that:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is sorted by filename, not by the number of downloads.
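For reference, the next page is requested by passing the after_key of the previous response back as after (the filename value below is illustrative):

```json
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "after": { "downloads": "some-file.pdf" },
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}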
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting based on a value field. Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try them out. Apart from this, I don't see a way to get the expected output.
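A rough sketch of what such a transform could look like, with the endpoint and field names assumed from newer Elasticsearch versions (check the documentation for your version before relying on this):

```json
PUT _transform/downloads_per_file
{
  "source": { "index": "file" },
  "dest": { "index": "downloads_summary" },
  "pivot": {
    "group_by": {
      "filename": { "terms": { "field": "filename.keyword" } }
    },
    "aggregations": {
      "download_count": { "value_count": { "field": "when" } }
    }
  }
}
```

The destination index then holds one document per filename and can be searched, sorted by download_count, and paginated like any other index.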

Why am I getting number of buckets always equal to the specified size in terms aggregations?

I am a newbie in Elasticsearch. I am using a terms aggregation to get only the unique documents based on a field of the index. I have specified the maximum number of buckets in my query; why is the bucket count always equal to that size?
{
  "aggs": {
    "name": {
      "terms": {
        "field": "fieldname",
        "size": 10000
      }
    }
  }
}
Why am I getting 10000 buckets when the number of unique values may be less than that?
10000 is the upper cap you set on the number of buckets returned. Your index most likely holds more than 10000 records. To get the actual count, use the count API or a value_count aggregation:
GET index/_count
OR
{
  "size": 0,
  "aggs": {
    "total_doc_count": {
      "value_count": {
        "field": "fieldname"
      }
    }
  }
}
To fetch more than 10000 documents in a query, you have to use the scroll API.
POST /index-name/_search?scroll=1m
{
  "size": 10000,
  "query": {
    "match_all": {}
  }
}
This opens a scroll context and returns documents in chunks of 10,000. Subsequent chunks are fetched with the scroll_id returned by the previous request:
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
If there are only 100 unique values, the terms aggregation will return only 100 buckets, not 10000.
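If the goal is only to count the distinct values of the field rather than list them, a cardinality aggregation is the usual tool (note that its count is approximate):

```json
{
  "size": 0,
  "aggs": {
    "unique_values": {
      "cardinality": {
        "field": "fieldname"
      }
    }
  }
}
```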

Limit The Number Of Results Processed By An Aggregation

I have a query with an aggregation. I want the aggregation to only operate on the top 500 hits returned by the query.
For example, let's say I have an index of comments. I want to query the top 500 matching comments and aggregate them based on the poster, so that I may answer the question: "Who are the top kitten and puppy posters?".
The query might look something like this:
POST comments/_search
{
  "query": {
    "query_string": {
      "query": "\"kittens\" OR \"puppies\"",
      "default_field": "body"
    }
  },
  "aggs": {
    "posters": {
      "terms": {
        "field": "poster"
      }
    }
  }
}
The problem with this is that, as far as I know, the aggregation will operate on ALL matching results, not just the top 500.
Things I've already tried/considered:
size at the query root only changes the number of hits returned by the query, but has no effect on the aggregation.
size inside the terms aggregation only affects the total number of buckets to return.
There used to be a limit filter in older versions that would limit the number of hits returned by a query (and therefore the number processed by the aggregation), but it was deprecated in favor of...
terminate_after, which doesn't work because the results aren't sorted by score before being returned, so I couldn't get the top 500, just a set of 500.
Does anyone know how to limit the documents processed by an aggregation to only the top results?
EDIT: I'm using ES version 6.3
I think you are looking for the sampler aggregation. You will have to wrap your posters aggregation inside the sampler aggregation.
The shard_size parameter is the number of documents per shard that will be considered for the sub-aggregation, in your case 500. (With more than one shard, the total number of sampled documents can therefore exceed 500.)
{
  "query": {
    "query_string": {
      "query": "\"kittens\" OR \"puppies\"",
      "default_field": "body"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 500
      },
      "aggs": {
        "posters": {
          "terms": {
            "field": "poster"
          }
        }
      }
    }
  }
}

instruct elasticsearch to return random results from different types

I have an index in ES with, say, 3 types A, B and C. Each type holds 1000 products. When the user makes a query with no scoring, ES returns first all results from A, then all from B and then all from C.
What I need is to present mixed results from the 3 types.
I looked into random scoring, but it's not quite what I need.
Any ideas?
Do you really need randomness, or simply 3 results from each type? Three results per type can be obtained with the top hits aggregation: first aggregate on the _type field, then apply the top_hits aggregation:
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "random_score": {
        "seed": 137677928418000
      }
    }
  },
  "aggs": {
    "all_type": {
      "terms": {
        "field": "_type"
      },
      "aggs": {
        "by_top_hit": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}
Edit: I added random scoring to get random results. I think getting an exact number of documents for each _type is difficult; a solution is probably to just fetch enough results from all _type values.

Aggregation on top 100 documents sorted by a field

I would like to do a terms aggregation on top 100 documents sorted on a field (not relevance score!).
I know how to do the aggregation:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "mydata_agg": {
      "terms": {
        "field": "title"
      }
    }
  }
}
and I know how to get top 100 documents sorted on a field:
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "units_sold": {
      "order": "desc"
    }
  },
  "size": 100
}
But how do I run the terms aggregation on those 100 sorted documents? I could use a range filter, but then I would need to determine myself the cutoff value of units_sold that yields the top 100 documents. I would prefer to do everything in one query. Is that possible?
I have searched for couple hours but was unable to find a solution.
The terms aggregation creates buckets, and the outcome of that aggregation then needs to be sorted; this can be done using bucket_sort. Read this article for more information.
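A minimal sketch of that approach, reusing the field names from the question (note that bucket_sort orders and truncates the buckets of the parent aggregation, not the underlying documents, so this is not exactly the same as aggregating over the top 100 hits):

```json
{
  "size": 0,
  "aggs": {
    "mydata_agg": {
      "terms": {
        "field": "title"
      },
      "aggs": {
        "total_units": {
          "sum": { "field": "units_sold" }
        },
        "top_buckets": {
          "bucket_sort": {
            "sort": [{ "total_units": { "order": "desc" } }],
            "size": 100
          }
        }
      }
    }
  }
}
```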
