Compute percentile with collapsing by user - elasticsearch

Let's say I have an index where I store a million tweets (the original objects). I want to get the 90th percentile of users based on their number of followers.
I know there is a "percentiles" aggregation for this, but my problem is that Elasticsearch uses all documents, so users who tweet a lot add noise to my calculation.
I want to isolate the unique users first and then compute the 90th percentile.
The other constraint is that I want to do this in only one or two requests, to keep the response time under 500 ms.
I have tried a lot of things and was able to do this with a "scripted_metric" aggregation, but once my dataset exceeds 100k tweets, performance degrades critically.
Any advice?
Additional info:
My index stores original tweets & retweets collected from user search queries.
The index is mapped with a dynamic template mapping (no problem with this).
The index contains approximately 100M documents.
Unfortunately, the "top_hits" aggregation doesn't accept sub-aggregations.
The request I'm trying to achieve is:
{
  "collapse": {
    "field": "user.id"   <--- I want this effect on aggregation
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "metadatas.clientId": {
              "value": projectId
            }
          }
        },
        {
          "match": {
            "metadatas.blacklisted": false
          }
        }
      ],
      "filter": [
        {
          "range": {
            "publishedAt": {
              "gte": "now-90d/d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "twitter": {
      "percentiles": {
        "field": "user.followers_count",
        "percents": [95]
      }
    }
  },
  "size": 0
}

Finally, I figured out a workaround.
In the percentiles aggregation, I can use a script. I use the params variable to keep track of the user IDs already seen: the first time a user is encountered the script returns their followers count, otherwise it returns the preceding _score.
Without a complete explanation of how the computation works internally, I cannot fine-tune the behavior of my script, but the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}
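For comparison, another way to express "unique users first, then the percentile" is a terms aggregation on user.id with a max sub-aggregation on the followers count, fed into a percentiles_bucket pipeline aggregation. This is only a sketch: terms buckets are capped by the size parameter, so it may not cover every user in a very large index.
"aggs": {
  "by_user": {
    "terms": { "field": "user.id", "size": 10000 },
    "aggs": {
      "followers": { "max": { "field": "user.followers_count" } }
    }
  },
  "followers_p90": {
    "percentiles_bucket": {
      "buckets_path": "by_user>followers",
      "percents": [90]
    }
  }
}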

Related

How to sum the size of documents within a time interval?

I'm attempting to estimate the total size of the documents in an index within a time interval, using the query below:
GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "total_size": {
      "sum": {
        "field": "doc['_source'].bytes"
      }
    }
  }
}
This returns documents, but the value of the aggregation is 0:
"aggregations" : {
"total_size" : {
"value" : 0.0
}
}
How can I sum the size of documents within a time interval?
The best way to achieve what you want is to add another field that contains the real source size at indexing time.
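As a rough sketch of that approach (the source_size field name and the pipeline name are assumptions, and stringifying ctx only approximates the real source size), an ingest pipeline with a script processor could compute it when documents are indexed:
PUT _ingest/pipeline/compute-source-size
{
  "processors": [
    {
      "script": {
        "source": """
          // rough estimate only: ctx also contains metadata keys such as _index and _id
          ctx.source_size = ctx.toString().length();
        """
      }
    }
  ]
}
Documents indexed through this pipeline (e.g. with ?pipeline=compute-source-size) would then carry a numeric source_size field that a plain sum aggregation can use.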
However, if you want to run it once to see what it looks like, you can leverage runtime fields to compute this at search time; just know that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap, yielding this:
GET /events/_search
{
  "runtime_mappings": {
    "source.size": {
      "type": "double",
      "script": """
        def size = params._source.toString().length() * 8;
        emit(size);
      """
    }
  },
  "query": {
    "bool": {
      "must": [
        { "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "size": {
      "sum": {
        "field": "source.size"
      }
    }
  }
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.
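As a minimal sketch (assuming the mapper-size plugin is installed; the index and field names are the ones from the question), you would enable _size in the mapping and then sum it directly:
PUT events
{
  "mappings": {
    "_size": {
      "enabled": true
    }
  }
}

GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        { "range": { "ts": { "gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z" } } }
      ]
    }
  },
  "aggs": {
    "total_size": {
      "sum": { "field": "_size" }
    }
  }
}
Note that _size is computed at indexing time, so only documents indexed after it is enabled will have a value.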

ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use the approximate method of _knn_search on more than one vector. Below I've attached my current code, which searches on a single vector. As far as I've read, _knn_search does not support searching on nested fields.
Alternatively, I could use a multi-index search: one index per vector, one search each, and sum all the results together. However, I need to store all these vectors together in one index, because I also need to filter on some other fields besides the vectors in the kNN search.
So the question is: is there a workaround that lets me perform _knn_search on more than one vector?
search_vector = np.zeros(512).tolist()
es_query = {
    "knn": {
        "field": "feature_vector_1.vector",
        "query_vector": search_vector,
        "k": 100,
        "num_candidates": 1000
    },
    "filter": [
        {
            "range": {
                "feature_vector_1.match_prc": {
                    "gt": 10
                }
            }
        }
    ],
    "_source": {
        "excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
    }
}
The last working query that I've ended up with is:
es_query = {
    "knn": {
        "field": "feature_vector_1.vector",
        "query_vector": search_vector,
        "k": 1000,
        "num_candidates": 1000
    },
    "filter": [
        {
            "function_score": {
                "query": {
                    "match_all": {}
                },
                "script_score": {
                    "script": {
                        "source": """
                            double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
                            return 100 * (1 + value) / 2;
                        """,
                        "params": {
                            "queryVector": search_vector
                        }
                    }
                }
            }
        }
    ],
    "_source": {
        "excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
    }
}
However, it is not true AkNN on two vectors, but it is still a working option if the performance of such a query satisfies your expectations.
The query below seems to be working for me for combining kNN searches, taking the average of multiple cosine similarity scores. Note that this is a little different from the original request, since it performs a brute-force search, but you can still filter the results up front by replacing the match_all part.
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
        "params": {
          "vector1": [
            1.3012068271636963,
            ...
            0.23468133807182312
          ],
          "vector2": [
            -0.49404603242874146,
            ...
            -0.15835021436214447
          ]
        }
      }
    }
  }
}
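For instance, here is a sketch of the same query with the match_all replaced by a filter (reusing the feature_vector_1.match_prc range from the question; the exact filter clause is just an assumption):
GET my-index/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            { "range": { "feature_vector_1.match_prc": { "gt": 10 } } }
          ]
        }
      },
      "script": {
        "source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
        "params": {
          "vector1": [ ... ],
          "vector2": [ ... ]
        }
      }
    }
  }
}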

"Filter then Aggregation" or just "Filter Aggregation"?

I have been working with ES recently and I found that I can achieve almost the same result with both of the queries below, but I have no clear idea of the DIFFERENCE between the two.
"Filter then Aggregation"
POST kibana_sample_data_flights/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "DestCountry": "CA"
        }
      }
    }
  },
  "aggs": {
    "ca_weathers": {
      "terms": { "field": "DestWeather" }
    }
  }
}
"Filter Aggregation"
POST kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "ca": {
      "filter": {
        "term": {
          "DestCountry": "CA"
        }
      },
      "aggs": {
        "_weathers": {
          "terms": { "field": "DestWeather" }
        }
      }
    }
  }
}
My Questions
Why are there two similar features? I'm sure I'm missing something, but what's the difference between them?
(please do ignore the result format, it's not the question I am asking ;p)
Which is better if I want to filter out the unrelated/unmatched documents and run the aggregation over a lot of documents?
When you use it in "query", you're creating a context on ALL the docs in your index. In this case, it acts like a normal filter, as in: SELECT * FROM index WHERE (my_filter_condition1 AND my_filter_condition2 OR my_filter_condition3...).
When you use it in "aggs", you're creating a context on ALL the docs that might (or might not) have been previously filtered. Let's say you have a structure like:
#OPTION A
{
  "aggs": {
    "t_shirts": {
      "filter": { "term": { "type": "t-shirt" } }
    }
  }
}
Without a "query", is exactly the same as having
#OPTION B
{
  "query": {
    "bool": {
      "filter": { "term": { "type": "t-shirt" } }
    }
  }
}
BUT the results will be returned in different fields.
In option A, the results will be returned in the aggregations field.
In option B, the results will be returned in the hits field.
I would recommend always applying your filters in the query part, so you can work with subsequent aggregations on the already-filtered docs. Also, aggregations cost more in terms of performance than queries.
Hope this is helpful! :D
Both filters, used in isolation, are equivalent. If you load no results (hits), there is no difference. But you can combine listing and aggregations: you can query or filter your docs for the listing, and calculate aggregations on a bucket further limited by the aggs filter, like this:
POST kibana_sample_data_flights/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": {
        "term": {
          ... some other filter
        }
      }
    }
  },
  "aggs": {
    "ca_filter": {
      "filter": {
        "term": {
          "DestCountry": "CA"
        }
      },
      "aggs": {
        "ca_weathers": {
          "terms": { "field": "DestWeather" }
        }
      }
    }
  }
}
But more likely you will need it the other way around, i.e. compute aggregations on all docs to display summary information, while you display docs from a specific query. In that case you need to combine aggregations with post_filter, for example:
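A minimal sketch of that pattern (the aggregation name all_weathers is just illustrative): the aggregation is computed over all documents matched by the query (here, all flights), while post_filter only restricts the hits that are returned.
POST kibana_sample_data_flights/_search
{
  "size": 100,
  "aggs": {
    "all_weathers": {
      "terms": { "field": "DestWeather" }
    }
  },
  "post_filter": {
    "term": { "DestCountry": "CA" }
  }
}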
Answer from @Val's comment, quoted here for reference:
In option A, the aggregation will be run on ALL documents. In option B, the documents are first filtered and the aggregation will be run only on the selected documents. Say you have 10M documents and the filter selects only 100; it's pretty evident that option B will always be faster.

Elasticsearch partial update based on Aggregation result

I want to partially update all documents based on an aggregation result.
Here is my object:
{
  "name": "name",
  "identificationHash": "aslkdakldjka",
  "isDupe": false,
  ...
}
My goal is to set isDupe to true for all documents whose "identificationHash" appears at least twice.
Currently what I'm doing is:
I get all the documents where "isDupe" is false, with a terms aggregation on "identificationHash" and min_doc_count set to 2.
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "isDupe": {
              "value": false,
              "boost": 1
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "identificationHashCount": {
      "terms": {
        "field": "identificationHash",
        "size": 10000,
        "min_doc_count": 2
      }
    }
  }
}
With the aggregation result, I do a bulk update with a script that sets ctx._source.isDupe = true for all documents whose identificationHash matches the aggregation result.
I repeat steps 1 and 2 until the aggregation query returns no more results.
My question is: is there a better solution to this problem? Can I do the same thing with a single query, without looping over batches of 1000 identification hashes?
There's no solution that I know of that allows you to do this in one shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called transforms, which is essentially a feature that leverages aggregations and builds a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup to run your update by query and set the isDupe boolean for all documents having a matching hash.
So, first, we want to create a transform that will build a new index whose documents contain all the duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice and for which isDupe is false. We also group by week, so for each week there will be one document containing all the duplicate hashes for that week.
PUT _transform/dup-transform
{
  "source": {
    "index": "test-index",
    "query": {
      "term": {
        "isDupe": "false"
      }
    }
  },
  "dest": {
    "index": "test-dups",
    "pipeline": "set-id"
  },
  "pivot": {
    "group_by": {
      "week": {
        "date_histogram": {
          "field": "lastModifiedDate",
          "calendar_interval": "week"
        }
      }
    },
    "aggregations": {
      "dups": {
        "scripted_metric": {
          "init_script": """
            state.week = -1;
            state.hashes = [:];
          """,
          "map_script": """
            // gather all hashes from each shard and count them
            def hash = doc['identificationHash.keyword'].value;
            // set week
            state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
            // initialize hashes
            if (!state.hashes.containsKey(hash)) {
              state.hashes[hash] = 0;
            }
            // increment hash
            state.hashes[hash] += 1;
          """,
          "combine_script": "return state",
          "reduce_script": """
            def hashes = [:];
            def week = -1;
            // group the hash counts from each shard and add them up
            for (state in states) {
              if (state == null) return null;
              week = state.week;
              for (hash in state.hashes.keySet()) {
                if (!hashes.containsKey(hash)) {
                  hashes[hash] = 0;
                }
                hashes[hash] += state.hashes[hash];
              }
            }
            // only return the hashes occurring at least twice
            return [
              'week': week,
              'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
                .collect(Collectors.toList())
            ]
          """
        }
      }
    }
  }
}
Before running the transform, we need to create the set-id ingest pipeline (referenced in the dest section of the transform) that defines the ID of the target document that is going to contain the hashes, so that we can reference it in the terms query for updating documents:
PUT _ingest/pipeline/set-id
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{dups.week}}"
      }
    }
  ]
}
We're now ready to start the transform to generate the list of hashes to update, and it's as simple as running this:
POST _transform/dup-transform/_start
When it has run, the destination index test-dups will contain one document per week, which looks like this:
{
  "_index" : "test-dups",
  "_type" : "_doc",
  "_id" : "44",
  "_score" : 1.0,
  "_source" : {
    "week" : "2021-11-01T00:00:00.000Z",
    "dups" : {
      "week" : "44",
      "hashes" : [
        "12345"
      ]
    }
  }
},
Finally, we can run the update by query as follows (add as many terms queries as there are weekly documents in the target index):
POST test/_update_by_query
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "44",
              "path": "dups.hashes"
            }
          }
        },
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "45",
              "path": "dups.hashes"
            }
          }
        }
      ]
    }
  },
  "script": {
    "source": "ctx._source.isDupe = true;"
  }
}
That's it in two simple steps!! Try it out and let me know.
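If you want to verify that the transform has completed before running the update by query, one option is to check its stats:
GET _transform/dup-transform/_stats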

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time-series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. The data is also sampled at different rates and is not guaranteed to have common timestamps within the same sub-second window, so fusing it all into one large document is not a trivial task either.
The Goal
One of our common use cases, which I'm trying to see if I can solve entirely in Elasticsearch, is to return an aggregation result over one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough numbers of signal transitions in the "condition" data, I can just use a date histogram and some combination of a top_hits sub-aggregation, but this quickly breaks down when I have 10,000s or 100,000s of occurrences of "the condition". Furthermore, this is just one case; I have hundreds of sets of similar situations from which I'd like to get the overall min/max.
The comparisons are basically among what I would consider to be sibling-level documents or indices, so there doesn't seem to be any obvious parent/child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution, instead of brute-force building the date ranges outside of Elasticsearch with the results of one query and feeding hundreds of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is what I want, but no definitive solution is jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but i'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization and if I discover such a solution (likely through a scripted aggregation) i'll come back and update my solution.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes an Elasticsearch client configured for your cluster; adjust as needed


def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
    index: the Elasticsearch index to search
    must_terms/should_terms/filter_terms: lists of query clauses of the form { "term": { "<termname>": <value> } }
    """
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    "last": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "first": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['#timestamp'].value",
                                "lang": "painless",
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "#timestamp"}
                            },
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return es.search(index=index, body=query)
Breaking things down
First, filter the results by 'Index 2':
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms contains the required clauses for getting all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and to when the condition has the value 10 or 12, we add the following must_terms:
must_terms = [
    {
        "range": {
            "#timestamp": {
                "gte": "now-10d",
                "lte": "now"
            }
        }
    },
    {
        "terms": {"condition": [10, 12]}
    }
]
This returns a reduced set of documents that we can then pass into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their ID and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on the timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using an inline script allows us to use the timestamp value as the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but we need a value. This is what the serial_diff aggregation requires in order to work, so we have to add a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition using the bucket_selector aggregation. This throws out buckets whose gap to the previous bucket is smaller than our data_sample_accuracy_window; that value depends on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for determining how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal, so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
