How to find top terms with occurrences in Elasticsearch

I have a fairly big dataset in Elasticsearch: 1 index, about 120 million records of one type. I am processing a large number of paragraphs on a given set of topics. The number of topics is limited, and each topic is associated with a unique ID. Each paragraph has a couple of sentences identified by the sentence_id (unique across all topics). Each sentence has a number of words, and each word can occur multiple times. So each of my documents looks like the following:
{
"sentence_id": 1200,
"topic_id": 2,
"value": "ground",
"occurrences": 20
}
Now, I want to run a query which answers this:
"Find the top words for a given topic ID sorted by their occurrences."
So for each word in a topic, I have to sum up its occurrences across all the sentences, sort the words by that total, and return them.
I am not able to achieve this. I tried writing a terms aggregation query, but it does not sum the occurrences; it merely returns the count of records for each word.
{
"query": {
"term": {
"topic_id": {
"value": 3117
}
}
},
"aggs": {
"total_occurrences": {
"terms": {
"field": "occurrences",
"size": 1000
}
}
}
}
Can someone help me out?

I think you first need to aggregate on the unique value and then sum its occurrences. Assuming your occurrences field is numeric, your query should look something like this:
{
"query": {
"term": {
"topic_id": {
"value": 3117
}
}
},
"aggs": {
"total_occurrences": {
"terms": {
"field": "value",
"size": 1000,
"order": {
"sum_occurrences": "desc" <--- to sort by top words
}
},
"aggs": {
"sum_occurrences": {
"sum": {
"field": "occurrences"
}
}
}
}
},
"size": 0
}
Hope this helps!
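To make the behaviour of the query above easier to check, here is a minimal Python sketch. `top_words_query` just builds the same request body as a dict, and `top_words_local` is a pure-Python equivalent of "filter by topic, group by word, sum occurrences, sort descending" that can be run on a handful of sample documents. Both function names and the sample data are made up for illustration; they are not part of any Elasticsearch client API.

```python
from collections import defaultdict


def top_words_query(topic_id, size=1000):
    """Build the search body: filter by topic, bucket by word, order by summed occurrences."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"term": {"topic_id": {"value": topic_id}}},
        "aggs": {
            "total_occurrences": {
                "terms": {
                    "field": "value",
                    "size": size,
                    "order": {"sum_occurrences": "desc"},
                },
                "aggs": {"sum_occurrences": {"sum": {"field": "occurrences"}}},
            }
        },
    }


def top_words_local(docs, topic_id, size=1000):
    """Pure-Python equivalent of the aggregation, for checking results on a small sample."""
    totals = defaultdict(int)
    for doc in docs:
        if doc["topic_id"] == topic_id:
            totals[doc["value"]] += doc["occurrences"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:size]


# Hypothetical sample documents in the shape shown in the question.
sample = [
    {"sentence_id": 1, "topic_id": 2, "value": "ground", "occurrences": 20},
    {"sentence_id": 2, "topic_id": 2, "value": "ground", "occurrences": 5},
    {"sentence_id": 2, "topic_id": 2, "value": "sky", "occurrences": 30},
    {"sentence_id": 3, "topic_id": 9, "value": "sky", "occurrences": 99},
]
print(top_words_local(sample, topic_id=2))  # [('sky', 30), ('ground', 25)]
```

The dict returned by `top_words_query` can be passed as the request body to whichever client you use; the local function is only meant for sanity-checking a small sample against the buckets Elasticsearch returns.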

Related

How to get total number of aggregation buckets in Elasticsearch?

I use Elasticsearch terms aggregation to see how many documents have a certain value in their "foo" field like this:
{
...
"aggregations": {
"metastore": {
"terms": {
"field": "foo",
"size": 50
}
}
}
}
and I get the response:
"aggregations": {
"foo": {
"buckets": [
{
"key_as_string": "2018-10-01T00:00:00.000Z",
"key": 1538352000000,
"doc_count": 935
},
{
"key_as_string": "2018-11-01T00:00:00.000Z",
"key": 1541030400000,
"doc_count": 15839
},
...
/* 48 more values */
]
}
}
But I'm limiting the number of different values to 50. If there are more different values in this field, they won't be returned in the response, and that's fine, because I don't need all of them, but I would like to know how many of them there are. So, how could I get the total number of different values? It would be fantastic if the answer provided a full example query, thanks.
You can add a cardinality aggregation, which will give you the number of unique terms for the field. This will be equal to the number of buckets the terms aggregation would produce.
{
...
"aggregations": {
"metastore": {
"terms": {
"field": "foo",
"size": 50
}
},
"uniquefoo": {
"cardinality": {
"field": "foo"
}
}
}
}
NOTE: Please keep in mind that the cardinality aggregation might in some cases return an approximate count. To learn more, see the cardinality aggregation documentation.
The cardinality aggregation is there to help. Just note, however, that the number that is returned is an approximation and might not reflect the exact number of buckets you'd get if you were to request them all. However, the accuracy is pretty good on low cardinality fields.
{
...
"aggregations": {
"unique_count": {
"cardinality": {
"field": "foo"
}
},
"metastore": {
"terms": {
"field": "foo",
"size": 50
}
}
}
}
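As a rough illustration of how the two aggregations fit together on the client side, the following Python sketch reads both results from a response dict and estimates how many distinct values were cut off by the size limit. The function name `summarize_terms` and the sample response are hypothetical; the aggregation names match the second query above, and the count is only as accurate as the cardinality estimate.

```python
def summarize_terms(response, terms_name="metastore", cardinality_name="unique_count"):
    """Compare the number of returned buckets with the (approximate) distinct count."""
    aggs = response["aggregations"]
    returned = len(aggs[terms_name]["buckets"])
    approx_total = aggs[cardinality_name]["value"]
    return {
        "returned": returned,
        "approx_total": approx_total,
        # Clamp at zero because the cardinality value is an estimate.
        "approx_missing": max(approx_total - returned, 0),
    }


# Made-up response fragment, trimmed to two buckets for brevity.
sample_response = {
    "aggregations": {
        "unique_count": {"value": 73},
        "metastore": {
            "buckets": [
                {"key": 1538352000000, "doc_count": 935},
                {"key": 1541030400000, "doc_count": 15839},
            ]
        },
    }
}
print(summarize_terms(sample_response))
# {'returned': 2, 'approx_total': 73, 'approx_missing': 71}
```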

How to get documents that differ by the value of a field

I'm using ElasticSearch 6.3.
Scenario: tens of thousands of documents have a "123" field, with the value "blabla" in most of them. A few have "blabla blo" in that field. These appear last in the query results if I set size: 10000 (with the default size, they don't appear at all). But what I really want is just the two unique records: one with "123": "blabla" and one with "123": "blabla blo".
I'm using a wildcard query and getting all 10000 documents. I only need those two.
I'm going to populate an HTML select tag with those records, ideally with only the two of them!
Query body:
{
"query": {
"wildcard":{
"324" : {
"value":"*b*"
}
}
},
"size": 10000,
"_source": ["324"]
}
How should I do this? The concept would be similar to finding records whose value in that field isn't fully duplicated, I suppose.
Thank you
That's what aggs are for!
GET index_name/_search
{
"query": {
"wildcard": {
"324": {
"value": "*b*"
}
}
},
"size": 0,
"aggs": {
"324_uniques": {
"terms": {
"field": "324",
"size": 10
}
}
}
}
The field could be 324 or 324.keyword, depending on your mapping.
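Since the goal in the question is to fill an HTML select tag, here is a small Python sketch of the client-side step: pull the distinct bucket keys out of the terms aggregation response and render them as option elements. The helper names and the sample response are made up for illustration; the aggregation name matches the query above.

```python
from html import escape


def distinct_values(response, agg_name="324_uniques"):
    """Return the distinct field values, one per terms bucket."""
    return [bucket["key"] for bucket in response["aggregations"][agg_name]["buckets"]]


def as_options(values):
    """Render the values as HTML <option> elements, escaping special characters."""
    return "\n".join(f"<option>{escape(str(v))}</option>" for v in values)


# Hypothetical response fragment for the scenario described in the question.
sample_response = {
    "aggregations": {
        "324_uniques": {
            "buckets": [
                {"key": "blabla", "doc_count": 9990},
                {"key": "blabla blo", "doc_count": 10},
            ]
        }
    }
}
print(as_options(distinct_values(sample_response)))
```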

Unexpected results when using min sub-aggregation in Elasticsearch

My documents include the fields name and date_year, and my goal is to find the most recently added names (e.g. the ten last added names with their first year of appearance and the total number of documents). I therefore have a terms aggregation on name, which is ordered by a min sub-aggregation on date_year:
{
"aggs": {
"group_by_name": {
"terms": {
"field": "name",
"order": {
"start_year": "desc"
}
},
"aggs": {
"start_year": {
"min": {
"field": "date_year"
}
}
}
}
}
}
This returns unexpected results when I do not set size under terms. For example, the first bucket has doc_count 1 and start_year 2015, while I'm sure that there are tens of documents with this name and that the earliest date_year is 1870. When I add a large enough size, the results are accurate. For example:
{
"aggs": {
"group_by_name": {
"terms": {
"field": "name",
"size": 10000, <------ large enough value
"order": {
"start_year": "desc"
}
},
"aggs": {
"start_year": {
"min": {
"field": "date_year"
}
}
}
}
}
}
Can anyone explain to me what is causing this, and how I can limit the number of buckets returned? What I need would look something like this in SQL:
select name, min(year), count(*) from documents group by name order by min(year) desc limit 10
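One plausible cause, assuming the index has more than one shard, is that each shard only sends back its own top buckets (ranked by its local min of date_year), and the coordinating node can only merge what it receives. The pure-Python simulation below is a sketch of that idea and not Elasticsearch code; the two "shards", the names, and the functions are all invented for illustration.

```python
def shard_top_buckets(docs, shard_size):
    """Per-shard step: group by name, keep only the top buckets by local min year (desc)."""
    stats = {}
    for name, year in docs:
        count, min_year = stats.get(name, (0, year))
        stats[name] = (count + 1, min(min_year, year))
    ranked = sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)
    return dict(ranked[:shard_size])


def merge(shard_results, size):
    """Coordinating step: combine only the buckets the shards actually returned."""
    merged = {}
    for result in shard_results:
        for name, (count, min_year) in result.items():
            prev_count, prev_min = merged.get(name, (0, min_year))
            merged[name] = (prev_count + count, min(prev_min, min_year))
    ranked = sorted(merged.items(), key=lambda kv: kv[1][1], reverse=True)
    return ranked[:size]


# (name, date_year) pairs; "alice" first appears in 1870, but only on shard_b.
shard_a = [("alice", 2015), ("bob", 1990)]
shard_b = [("alice", 1870), ("alice", 1900), ("carol", 2010), ("dave", 2005)]

truncated = merge([shard_top_buckets(s, shard_size=2) for s in (shard_a, shard_b)], size=1)
exact = merge([shard_top_buckets(s, shard_size=100) for s in (shard_a, shard_b)], size=1)
print(truncated)  # [('alice', (1, 2015))]
print(exact)      # [('carol', (1, 2010))]
```

With the small per-shard limit, shard_b never reports "alice" because her local minimum ranks too low there, so the merged bucket shows one document and year 2015, which resembles the symptom described in the question. With a large enough limit every shard reports every name and the merged result is exact.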

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation takes the top N hits per shard from the core query and runs the aggregations over those. Then, in the max aggregation that defines the bucket order, I use exactly the same script as the one I use to pick the top hit within each bucket. Now the buckets and the top hits run over the same top-N set of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately, I still need to run the script twice: once to order the buckets and once to pick the top hit within each bucket. You could use the rescore for the top-hits ordering instead, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.
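Because the approach depends on the bucket ordering and the top-hit sort using exactly the same script, one way to keep them in sync is to build the request body from a single script string. The Python sketch below does only that; `distinct_by_score_query` is a hypothetical helper name, and the body mirrors the query above without claiming to be the only way to structure it.

```python
def distinct_by_score_query(score_script, shard_size=10000, field="identical_id"):
    """Build the sampler query so that ordering and top-hit selection share one script."""
    script = {"source": score_script}
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "sample": {
                # Only the top `shard_size` hits per shard feed the sub-aggregations.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "distinct": {
                        "terms": {"field": field, "order": {"top_score": "desc"}},
                        "aggs": {
                            "best_unique_result": {
                                "top_hits": {
                                    "size": 1,
                                    "sort": [
                                        {
                                            "_script": {
                                                "type": "number",
                                                "script": dict(script),
                                                "order": "desc",
                                            }
                                        }
                                    ],
                                }
                            },
                            # Same script again, so buckets are ordered by the same score.
                            "top_score": {"max": {"script": dict(script)}},
                        },
                    }
                },
            }
        },
    }


body = distinct_by_score_query("doc['topic_score'].value")
```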

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. The data is also sampled at different rates, and samples are not guaranteed to share common timestamps within the same sub-second window, so fusing them all into one large document is not a trivial task either.
The Goal
One of our common use cases, which I'm trying to solve entirely in Elasticsearch, is to return an aggregation result from one index based on the time windows returned by a query against another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top hits sub-aggregation, but this quickly breaks down when I have tens or hundreds of thousands of occurrences of "the condition". Further, this is just one "case"; I have hundreds of sets of similar situations that I'd like to get the overall min/max from.
The comparisons are basically amongst what I would consider to be sibling level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute-force building the date ranges outside of Elasticsearch from the results of one query and feeding hundreds of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is going to be what I want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but I'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of room for improvement and optimization, and if I discover a better solution (likely through a scripted aggregation) I'll come back and update this answer.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
def time_edges(index, must_terms=None, should_terms=None, filter_terms=None, data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
    index: the Elasticsearch index to search
    must_terms, should_terms, filter_terms: lists of dictionaries of form { "term": { "<termname>": <value>}}
    """
    # Avoid mutable default arguments; treat a missing list as empty.
    must_terms = must_terms or []
    should_terms = should_terms or []
    filter_terms = filter_terms or []
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    # Falling edge of the last occurrence in each flight.
                    "last": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    # Rising edge of the first occurrence in each flight.
                    "first": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    # One bucket per unique timestamp, keyed by the timestamp value.
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['#timestamp'].value",
                                "lang": "painless",
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "#timestamp"}
                            },
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            # Keep only buckets whose gap to the previous bucket exceeds the window.
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    # `es` is assumed to be an Elasticsearch client instance created elsewhere.
    return es.search(index=index, body=query)
Breaking things down
Filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms is the required value to be able to get all the results for "the condition" stored in "Index 2".
For example, to limit results to the last 10 days and to documents where condition is 10 or 12, we add the following must_terms:
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what is required for serial_diff aggregation to work, so we have to do a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition by using the bucket_selector aggregation. This will throw out buckets whose difference is smaller than our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
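The edge detection described above can also be sketched on the client side in plain Python, which may help when checking the aggregation output on a small sample. The function below is not the post-processing used in this answer (that code is not shown); it is a minimal illustration of the same idea, assuming numeric timestamps and a gap threshold equivalent to data_sample_accuracy_window. The name `condition_ranges` and the sample values are invented.

```python
def condition_ranges(timestamps, window=200):
    """Merge sample timestamps into (start, end) ranges, splitting on gaps larger than window."""
    ranges = []
    for ts in sorted(timestamps):
        if ranges and ts - ranges[-1][1] <= window:
            # Still inside the current range: move its falling edge forward.
            ranges[-1][1] = ts
        else:
            # Gap larger than the window: a new rising edge starts here.
            ranges.append([ts, ts])
    return [tuple(r) for r in ranges]


print(condition_ranges([0, 100, 200, 1000, 1100, 5000]))
# [(0, 200), (1000, 1100), (5000, 5000)]
```

Each tuple is one contiguous period during which the condition was observed; the resulting ranges could then be turned into range filters for a query against the other index.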
