Elasticsearch aggregation and filters - elasticsearch

Hi friends I am trying to make a search bar in my website. I have thousands of company articles. When i run this code:
GET articles/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "assistant",
"fields": ["title"]
}
}
]
}
},
"size": 0,
"aggs": {
"by_company": {
"terms": {
"field": "company.keyword",
"size": 10
}
}
}
}
The result is:
"aggregations": {
"by_company": {
"doc_count_error_upper_bound": 5,
"sum_other_doc_count": 409,
"buckets": [
{
"key": "University of Miami",
"doc_count": 6
},
{
"key": "Brigham & Women's Hospital(BWH)",
"doc_count": 4
},
So now I wanna filter articles of University of Miami so i run following query:
GET indeed_psql/job/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "assistant",
"fields": ["title"]
}
}
],
"filter": {
"term": {
"company.keyword": "University of Miami"
}
}
}
},
"size": 0,
"aggs": {
"by_company": {
"terms": {
"field": "company.keyword",
"size": 10
}
}
}
}
But now the result is:
"aggregations": {
"by_company": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "University of Miami",
"doc_count": 7
}
]
}
Why there are suddenly seven of them when in the previous aggregation were 6 ??? This also happens with other university filters. What am I doing wrong ? I am not using standard tokenizer and from filters I use english_stemmer, english_stopwords, english_keywords. Thanks for your help.

It's likely that your first query document counts are wrong. In your first response, the "doc_count_error_upper_bound" is 5, meaning that some of the terms in your returned aggregation were not present as top results in each of the underlying queried shards. The document count will always be too low rather than too high because it could have been "missed" during the process of querying a shard for the top N keys.
How many shards do you have? For instance, if there are 3 shards, and your aggregation size is 3 and your distribution of documents was something like this:
Shard 1 Shard 2 Shard 3
3 BYU 3 UMiami 3 UMiami
2 UMich 2 BWH 2 UMich
2 MGH 2 UMich 1 BWH
1 UMiami 1 MGH 1 BYU
Your resulting top 3 terms from each shard are merged into:
6 UMiami // returned
6 UMich // returned
3 BWH // returned
3 BYU
2 MGH
From which, only the top three results are returned. Almost all these keys are undercounted.
You can see in this scenario, the UMiami document in Shard 1 would not make it into consideration because it is beyond the depth of 3. But if you filter to ONLY look at UMiami, you would necessarily pull back any associated docs in each shard and end up with an accurate count.
You can play around with the shard_size parameter so that Elasticsearch looks a little deeper into each shard too get a more approximate count. But given that there are 7 total documents for this facet, it's likely there's only one occurrence of it on one of your shards so it will be hard to surface it in the top aggregations without grabbing all of the documents for that shard.
You can read more about the count approximation and error derivation here-- tldr, Elasticsearch is making a guess about the total number of documents for that facet based on top aggregations in each individual shard.

Related

Elasticsearch takes in minute for respond to aggregation query

Document count: 4 Billion
disc size : 2 TB
Primary: 5
replica: 2
master node : 3
data node: 4 * [16cpu and 64GB ram]
heap size: 30GB
mlock enable : true
It takes up to 3 minutes to respond to aggregation queries. On subsequent request, it caches and speeds things up. Is there a way to speed the aggregation on the first query?
Example aggregation query:
{
"query": {
"bool": {
"must": [],
"must_not": [],
"should": []
}
},
"size": 0,
"aggs": {
"agg_;COUNT_ROWS;5d8b0621690e727ff775d4ed": {
"terms": {
"field": "feild1.keyword",
"size": 10000,
"shard_size": 100,
"order": {
"_term": "asc"
}
},
"aggs": {
"agg_;COUNT_ROWS;5d8b0621690e727ff775d4ec": {
"terms": {
"field": "feild2.keyword",
"size": 30,
"shard_size": 100,
"order": {
"_term": "asc"
}
},
"aggs": {
"agg_HouseHold;COUNT_DISTINCT": {
"cardinality": {
"field": "feild3.keyword",
"precision_threshold": 40000
}
}
}
}
}
}
}
}
If I understand right, you are running the query on a single instance, with a total of 15 shards, 5 of which are primaries. The first terms aggregation have a size of 10,000. that is a high number that effects performance. consider moving to composite-aggregation in order to use pagination and not to try to squeeze it to a huge response.
Also, the shard_size doesn't make much sense to me, as you only query 5 shards, and asking for 10,000 results - bringing 100 results from 5 shards would yield 500 results, which is not enough. I would drop this shard_size param, or set a higher value in order for it to make sense.

Elasticsearch filter based on field similarity

For reference, I'm using Elasticsearch 6.4.0
I have a Elasticsearch query that returns a certain number of hits, and I'm trying to remove hits with text field values that are too similar. My query is:
{
"size": 10,
"collapse": {
"field": "author_id"
},
"query": {
"function_score": {
"boost_mode": "replace",
"score_mode": "avg",
"functions": [
{
//my custom query function
}
],
"query": {
"bool": {
"must_not": [
{
"term": {
"author_id": MY_ID
}
}
]
}
}
}
},
"aggs": {
"book_name_sample": {
"sampler": {
"shard_size": 10
},
"aggs": {
"frequent_words": {
"significant_text": {
"field": "book_name",
"filter_duplicate_text": true
}
}
}
}
}
}
This query uses a custom function score combined with a filter to return books a person might like (that they haven't authored). Thing is, for some people, it returns books with names that are very similar (i.e. The Life of George Washington, Good Times with George Washington, Who was George Washington), and I'd like the hits to have a more diverse set of names.
I'm using a bucket_selector to aggregate the hits based on text similarity, and the query gives me something like:
...,
"aggregations": {
"book_name_sample": {
"doc_count": 10,
"frequent_words": {
"doc_count": 10,
"bg_count": 482626,
"buckets": [
{
"key": "George",
"doc_count": 3,
"score": 17.278715785140975,
"bg_count": 9718
},
{
"key": "Washington",
"doc_count": 3,
"score": 15.312204414323656,
"bg_count": 10919
}
]
}
}
}
Is it possible to filter the returned documents based on this aggregation result within Elasticsearch? IE remove hits with book_name_sample doc_count less than X? I know I can do this in PHP or whatever language uses the hits, but I'd like to keep it within ES. I've tried using a bucket_selector aggregator like so:
"book_name_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"freqWords": "frequent_words"
},
"script": "params.freqWords < 3"
}
}
But then I get an error: org.elasticsearch.search.aggregations.bucket.sampler.InternalSampler cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation
Also, if that filter removes enough documents so that the hit count is less than the requested size, is it possible to tell ES to go fetch the next top scoring hits so that hits count is filled out?
Why not use top hits inside the aggregation to get relevant document that match the bucket? You can specify how many relevant top hits you want inside the top hits aggregation. So basically this will give you a certain number of documents for each bucket.

elasticsearch parent child extremely inefficient has_child query

I have a parent-child relationship in an ES index. The distribution in terms of the number of documents is around 20% for the parents (200M docs) and 80% children (1B docs). ES cluster has 5 nodes, each with 20GB RAM and 4 CPU cores. ES version is 1.5.2. We use 5 shards per index and 0 replication.
When I query it using the has_child, the processing is extremely slow - 170 sec. However, when I just run over the parents it takes less than a second.
This query takes far too long to return and causes timeouts within the application. I really care about the aggregations and time range filter.
I believe what is happening is that the query is running over every child first to do the filtering. In reality, I only would like it to run over the parents first and check if there is a single document and then use filter on the children.
Setup
The _parent is an action that looks like this
{
"a": "m_field",
"b": "b_field",
"c": "c_field",
"d": "d_field"
}
The _child is a timestamp when that action has occurred
{
"date": "2016-07-07T11:11:11Z"
}
These are typically stored in time series indices. Indexes are sharded by a month. An index usually takes around 70GB total size on disk. We choose to run it over an alias, which combines all or some of the most recent indices.
Query
When I query I do a query_string on the _parent document to search for the keyword and a Range filter on the child, using the has_child query.
This looks like the following.
{
"size": 0,
"aggs": {
"base_aggs": {
"cardinality": {
"field": "a"
}
}
},
"query": {
"bool": {
"must": [
{
"filtered": {
"query": {
"query_string": {
"query": "*",
"fields": [
"a",
"b",
"c",
"d",
"e"
],
"default_operator": "and",
"allow_leading_wildcard": true,
"lowercase_expanded_terms": true
}
},
"filter": {
"has_child": {
"type": "evt",
"min_children": 1,
"max_children": 1,
"filter": {
"range": {
"date": {
"lte": "2016-07-06T23:59:59.000",
"gte": "2016-06-07T00:00:00.000"
}
}
}
}
}
}
}
],
"must_not": [
{
"term": {
"b": {
"value": ""
}
}
},
{
"term": {
"b": {
"value": "__"
}
}
}
]
}
}
}
So the query should match on my query_string with the entry "*" and have children that are between the two dates provided. Because I only care about the aggregations I do not return any documents, and I only need to match on a single child document.
Question
How can I improve the speed of the query?
The performance of a has_child query or filter with the min_children
or max_children parameters is much the same as a has_child query with
scoring enabled.
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/has-child.html#min-max-children
So I guess, you would have to drop those parameters to speed up the query.

To get hits inside aggregations,in elasticsearch

I have a date field inside my data. I did a date histogram aggregation on it,with interval set as month. Now it returns,the number of documents per month,interval.
Here is the query I used:
{
"aggs": {
"dateHistogram": {
"date_histogram": {
"field": "currentDate",
"interval": "day"
}
}
}
}
Below the exact response I have received.
{
"aggregations": {
"dateHistogram": {
"buckets": [{
"key_as_string": "2015-05-06",
"key": 1430870400000,
"doc_count": 10
}, {
"key_as_string": "2015-04-06",
"key": 1430870500000,
"doc_count": 14
}]
}
}
}
From the above response it is clear that,there are 10 documents under the key "1430870400000" and 14 documents under the key "1430870500000". But despite from the document count,the individual documents are not shown. I want them to be shown in the response,so that I can take values out from it. How do I achieve this in elasticsearch?
The easy method for this is using the "top-hits" aggregation. You can find the usage of "top-hits" here
Top-hits aggregation will give you the relevant data inside the aggregation you have done and also there are options to specify from which result you want to fetch,and the size of the data you want to be taken and also sort options.
As per my understanding you want to fetch all documents and used that documents for aggregations so you should use match query with aggregation as below :
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"aggs": {
"date_wise_logs_counts": {
"date_histogram": {
"field": "currentDate",
"interval": "day"
}
}
}
}
Above return default 10 documents in hit array, use size size=BIGNUMBER to get more than 10 items. (where BIGNUMBER equals a number you believe is bigger than your dataset). But you should use scan and scroll instead of size

Retrieve document frequency for terms in query result with aggregations

For some of my queries to ElasticSearch I want three pieces of information back:
Which terms T occurred in the result document set?
How often does each element of T occur in the result document set?
How often does each element of T occur in the entire index (--> document frequency)?
The first points are easily determined using the default term facet or, nowadays, by the term aggregation method.
So my question is really about the third point.
Before ElasticSearch 1.x, i.e. before the switch to the 'aggregation' paradigm, I could use a term facet with the 'global' option set to true and a QueryFilter to get the document frequency ('global counts') of the exact terms occurring in the document set specified by the QueryFilter.
At first I thought I could do the same thing using a global aggregation, but it seems I can't. The reason is - if I understand correctly - that the original facet mechanism were centered around terms whereas the aggregation buckets are defined by the the set of documents belonging to each bucket.
I.e. specifying the global option of a term facet with a QueryFilter first determined the terms hit by the filter and then computed facet values. Since the facet was global I would receive the document counts.
With aggregations, it's different. The global aggregation can only be used as a top aggregation, causing the aggregation to ignore the current query results and compute the aggregation - e.g. a terms aggregation - on all documents in the index. So for me, that's too much, since I WANT to restrict the returned terms ('buckets') to the terms in the document result set. But if I use a filter-sub-aggregation with a terms-sub-aggregation, I would restrict the term-buckets to the filter again, thus not retrieving the document frequencies but normal facet counts. The reason is that the buckets are determined after the filter so they are "too small". But I don't want restrict bucket size, I want to restrict the buckets to the terms in the query result set.
How can I get the document frequency of those terms in a query result set using aggregations (since facets are deprecated and will be removed)?
Thanks for your time!
EDIT: Here comes an example of how I tried to achieve the desired behaviour.
I will define two aggregations:
global_agg_with_filter_and_terms
global_agg_with_terms_and_filter
Both have a global aggregation at their tops because its the only valid position for it. Then, in the first aggregation, I first filter the results to the original query and then apply a term-sub-aggregation.
In the second aggregation, I do mostly the same, only that here the filter aggregation is a sub-aggregation of the terms aggregation. Hence the similar names, only the order of aggregation differs.
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"global_agg_with_filter_and_terms": {
"global": {},
"aggs": {
"filter_agg": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
},
"aggs": {
"terms_agg": {
"terms": {
"field": "facets"
}
}
}
}
}
},
"global_agg_with_terms_and_filter": {
"global": {},
"aggs": {
"document_frequency": {
"terms": {
"field": "facets"
},
"aggs": {
"term_count": {
"filter": {
"query": {
"query_string": {
"query": "text: my query string"
}
}
}
}
}
}
}
}
}
}
Response:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 221,
"max_score": 0.9839197,
"hits": <omitted>
},
"aggregations": {
"global_agg_with_filter_and_terms": {
"doc_count": 1978,
"filter_agg": {
"doc_count": 221,
"terms_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 155
},
{
"key": "fid6",
"doc_count": 40
},
{
"key": "fid9",
"doc_count": 10
},
{
"key": "fid5",
"doc_count": 9
},
{
"key": "fid13",
"doc_count": 5
},
{
"key": "fid7",
"doc_count": 2
}
]
}
}
},
"global_agg_with_terms_and_filter": {
"doc_count": 1978,
"document_frequency": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fid8",
"doc_count": 1050,
"term_count": {
"doc_count": 155
}
},
{
"key": "fid6",
"doc_count": 668,
"term_count": {
"doc_count": 40
}
},
{
"key": "fid9",
"doc_count": 67,
"term_count": {
"doc_count": 10
}
},
{
"key": "fid5",
"doc_count": 65,
"term_count": {
"doc_count": 9
}
},
{
"key": "fid7",
"doc_count": 63,
"term_count": {
"doc_count": 2
}
},
{
"key": "fid13",
"doc_count": 55,
"term_count": {
"doc_count": 5
}
},
{
"key": "fid10",
"doc_count": 11,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid11",
"doc_count": 9,
"term_count": {
"doc_count": 0
}
},
{
"key": "fid12",
"doc_count": 5,
"term_count": {
"doc_count": 0
}
}
]
}
}
}
}
At first, please have a look at the first two returned term-buckets of both aggregations, with keys fid8 and fid6. We can easily see that those terms have been appearing in the result set 155 and 40 times, respectively. Now please look at the second aggregation, global_agg_with_terms_and_filter. The terms-aggregation is within the scope of the global aggregation, so here we can actually see the document frequencies, 1050 and 668, respectively. So this part looks good. The issue arises when you scan the list of term buckets further down, to the buckets with the keys fid10 to fid12. While we receive their document frequency, we can also see that their term_count is 0. This is due to the fact that those terms did not occur in our query, that we also used for the filter-sub-aggregation. So the problem is that for ALL terms (global scope!) their document frequency and their facet count with regards to the actual query result is returned. But I need this to be made exactly for the terms that occurred in the query result, i.e. for those exact terms returned by the first aggregation global_agg_with_filter_and_terms.
Perhaps there is a possibity to define some kind of filter that removes all buckets where their sub-filter-aggregation term_count has a zero doc_count?
Hello and sorry if the answer is late.
You should have a look at the Significant Terms aggregation as, like the terms aggregation, it returns one bucket for each term occuring in the results set with the number of occurences available through doc_count, but you also get the number of occurrences in a background set through bg_count. This means it only creates buckets for terms appearing in documents of your query results set.
The default background set comprises all documents in the query scope, but can be filtered down to any subset you want using background_filter.
You can use a scripted bucket scoring function to rank the buckets the way you want by combining several metrics:
_subset_freq: number of documents the term appears in the results set,
_superset_freq: number of documents the term appears in the background set,
_subset_size: number of documents in the results set,
_superset_size: number of documents in the background set.
Request:
{
"query": {
"query_string": {
"query": "text: my query string"
}
},
"aggs": {
"terms": {
"significant_terms": {
"script": "_subset_freq",
"size": 100
}
}
}
}

Resources