Optimize ES query with too many terms elements

We are processing a dataset of billions of records; currently all of the data is stored in Elasticsearch, and all of the queries and aggregations are performed with Elasticsearch.
The simplified query body is shown below. We put the device ids into terms clauses and concatenate those clauses with should to get around the 1024-value limit on terms; the total number of values across all the terms clauses can reach 100,000, and the query has become very slow.
{
"_source": {
"excludes": [
"raw_msg"
]
},
"query": {
"filter": {
"bool": {
"must": [
{
"range": {
"create_ms": {
"gte": 1664985600000,
"lte": 1665071999999
}
}
}
],
"should": [
{
"terms": {
"device_id": [
"1328871",
"1328899",
"1328898",
"1328934",
"1328919",
"1328976",
"1328977",
"1328879",
"1328910",
"1328902",
... # more values; since terms does not support more than 1024 values, we concatenate all of them with should
]
}
},
{
"terms": {
"device_id": [
"1428871",
"1428899",
"1428898",
"1428934",
"1428919",
"1428976",
"1428977",
"1428879",
"1428910",
"1428902",
...
]
}
},
... # concatenate more terms clauses until all of the 100,000 values are included
],
"minimum_should_match": 1
}
},
"aggs": {
"create_ms": {
"date_histogram": {
"field": "create_ms",
"interval": "hour",
}
}
},
"size": 0}
My question is: is there a way to optimize this case? Or is there a better way to do this kind of search?
Real-time or near-real-time is a must; another engine is acceptable.
simplified schema of the data:
"id" : {
"type" : "long"
},
"content" : {
"type" : "text"
},
"device_id" : {
"type" : "keyword"
},
"create_ms" : {
"type" : "date"
},
... # more fields

You can use the terms query with a terms lookup to specify a much larger list of values.
Store your ids in a dedicated document with a known id, such as 'device_ids'.
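For example, the lookup document could be created like this (a sketch; "your-index-name" and "field-name" simply mirror the placeholders used in the query below, not real names):
PUT your-index-name/_doc/device_ids
{
  "field-name": ["1328871", "1328899", "1328898", "..."]
}
Then reference that document from your query: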
"should": [
{
"terms": {
"device_id": {
"index": "your-index-name",
"id": "device_ids",
"path": "field-name"
}
}
}
]
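One caveat worth verifying against your version's documentation: the number of values a terms query can operate on, including values fetched through a lookup, is capped by the index.max_terms_count setting (65,536 by default in recent versions), so 100,000 ids would likely require raising it:
PUT your-index-name/_settings
{
  "index.max_terms_count": 100000
}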

Related

Composite aggregation query with bucket_sort does not work properly

I have an index to store financial transactions:
{
"mappings": {
"_doc": {
"properties": {
"amount": {
"type": "long"
},
"currencyCode": {
"type": "keyword"
},
"merchantId": {
"type": "keyword"
},
"merchantName": {
"type": "text"
},
"partnerId": {
"type": "keyword"
},
"transactionDate": {
"type": "date"
},
"userId": {
"type": "keyword"
}
}
}
}
}
Here's my query:
GET /transactions/_search
{
"aggs": {
"date_merchant": {
"aggs": {
"amount": {
"sum": {
"field": "amount"
}
},
"amount_sort": {
"bucket_sort": {
"sort": [
{
"amount": {
"order": "desc"
}
}
]
}
},
"top_hit": {
"top_hits": {
"_source": {
"includes": [
"merchantName",
"currencyCode"
]
},
"size": 1
}
}
},
"composite": {
"size": 1,
"sources": [
{
"date": {
"date_histogram": {
"calendar_interval": "day",
"field": "transactionDate"
}
}
},
{
"merchant": {
"terms": {
"field": "merchantId"
}
}
}
]
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"userId": "AAA"
}
},
{
"term": {
"partnerId": "BBB"
}
},
{
"range": {
"transactionDate": {
"gte": "2022-07-01"
}
}
},
{
"term": {
"currencyCode": "EUR"
}
}
]
}
},
"size": 0
}
Please note the "size": 1 in the composite aggregation.
If I change it to 3 (based on my data)... I get different results!
That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (if it's just one, that means it's not going to be sorted at all!)
How can I sort on ALL the buckets instead?
EDIT
Based on Benjamin's answer, I changed my query to use regular aggregations instead of a composite one, with a large bucket size for merchant IDs (the terms default is 10, while the date histogram has no such limit), roughly as sketched below.
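A sketch of the reworked query (the merchant size of 10000 is an arbitrary upper bound assumed here, not a value from the original post; pick one safely above your merchant cardinality):
GET /transactions/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "userId": "AAA" } },
        { "term": { "partnerId": "BBB" } },
        { "range": { "transactionDate": { "gte": "2022-07-01" } } },
        { "term": { "currencyCode": "EUR" } }
      ]
    }
  },
  "aggs": {
    "date": {
      "date_histogram": { "field": "transactionDate", "calendar_interval": "day" },
      "aggs": {
        "merchant": {
          "terms": { "field": "merchantId", "size": 10000 },
          "aggs": {
            "amount": { "sum": { "field": "amount" } },
            "amount_sort": { "bucket_sort": { "sort": [ { "amount": { "order": "desc" } } ] } },
            "top_hit": { "top_hits": { "_source": { "includes": [ "merchantName", "currencyCode" ] }, "size": 1 } }
          }
        }
      }
    }
  }
}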
Composite agg design
The composite aggregation is designed to iterate all buckets in the most efficient way possible.
How can I sort on ALL the buckets instead?
To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.
So, how do you actually sort over all buckets?
Aggregate over all buckets in a single call, setting your size to the largest number of buckets your query can produce.
The number of buckets will be at most the cardinality of merchantId times the number of days in the date histogram.
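If the merchant cardinality isn't known up front, a cardinality aggregation can estimate it (approximate by design; the field name comes from the mapping above):
GET /transactions/_search
{
  "size": 0,
  "aggs": {
    "merchant_count": { "cardinality": { "field": "merchantId" } }
  }
}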
Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.
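Pagination works by feeding the after_key of each response into the next request's after parameter (a minimal sketch; the after values shown are illustrative, always copy the previous response's after_key verbatim):
GET /transactions/_search
{
  "size": 0,
  "aggs": {
    "date_merchant": {
      "composite": {
        "size": 1000,
        "sources": [
          { "date": { "date_histogram": { "field": "transactionDate", "calendar_interval": "day" } } },
          { "merchant": { "terms": { "field": "merchantId" } } }
        ],
        "after": { "date": 1656633600000, "merchant": "some-merchant-id" }
      }
    }
  }
}
Repeat until the response no longer contains an after_key, then sort the collected buckets client side.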

Translate MySQL aggregation query to ElasticSearch

I have a comments table that has grown considerably over the past year, and I'm moving it to Elasticsearch.
The problem is that I need to adapt a query I currently have in MySQL, which returns the total number of comments for each day of the last 7 days for a given post.
Here's the MySQL query that I have now:
SELECT count(*) AS number, DATE(created_at) AS date
FROM `comments`
WHERE `post_id` = ?
GROUP BY `date`
ORDER BY `date` DESC
LIMIT 7
My index looks like this:
{
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "keyword"
},
"post_id": {
"type": "integer"
},
"subject": {
"analyzer": "custom_html_strip",
"type": "text"
},
"body": {
"analyzer": "custom_html_strip",
"type": "text"
},
"created_at": {
"format": "yyyy-MM-dd HH:mm:ss",
"type": "date"
}
}
}
}
}
Is it possible to reproduce that query in Elasticsearch? If so, what would it look like?
My Elasticsearch knowledge is kind of limited. I know that it offers aggregations, but I don't really know how to put it all together.
Use the following query to get the comments for a given "post_id" over the last 7 days.
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": "the_post_id"
}
}
},
/*** include this clause only if you want the most recent 7 days ***/
{
"range": {
"created_at": {
"gte": "now-7d/d",
"lt": "now/d"
}
}
}
]
}
},
"aggs": {
"posts_per_day": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day",
"order" : {"_key" : "desc"}
}
}
}
}
From the aggregation, pick up the first 7 buckets in your client application.
Elasticsearch supports SQL queries as well (though they are limited), so with a little change you can use something like this:
GET _sql
{
"query": """
SELECT count(*) AS number, created_at AS date
FROM comments
WHERE post_id = 123
GROUP BY date
ORDER BY date DESC
LIMIT 7
"""
}
You can see the corresponding native query using the _sql/translate endpoint, which takes the same request body.
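GET _sql/translate
{
  "query": """
    SELECT count(*) AS number, created_at AS date
    FROM comments
    WHERE post_id = 123
    GROUP BY date
    ORDER BY date DESC
    LIMIT 7
  """
}
For the query above it will return: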
{
"size" : 0,
"query" : {
"term" : {
"post_id" : {
"value" : 123,
"boost" : 1.0
}
}
},
"_source" : false,
"stored_fields" : "_none_",
"aggregations" : {
"groupby" : {
"composite" : {
"size" : 7,
"sources" : [
{
"31239" : {
"terms" : {
"field" : "created_at",
"missing_bucket" : true,
"order" : "desc"
}
}
}
]
}
}
}
}
That being said, some of the constructs in the translated query are not needed, so the following native query is a better fit:
{
"query": {
"term": {
"post_id": {
"value": 123
}
}
},
"aggs": {
"unq_dates": {
"terms": {
"field": "created_at",
"size": 7,
"order": {
"_term": "desc"
}
}
}
}
}
First, a recommendation: if you aggregate on this time field regularly, storing the date already rounded to 'yyyy-MM-dd' format will improve performance and reduce resource consumption.
{
"query":
{
"filter":
[
{
"term":
{
"post_id":
{
"value": 1
}
}
}
]
}
},
"_source": false,
"aggs":
{
"created_at":
{
"auto_date_histogram":
{
"field": "created_at",
"buckets": 7,
"format": "yyyy-MM-dd"
}
}
}
}

Aggregation performance issue in Elasticsearch with hotel availability data

I am building a small app to find hotel room availability like booking.com using Elasticsearch 6.8.0.
Basically, I have a document per day and room that specifies whether the room is available and its rate for that day. I need to run a query with these requirements:
Input:
The days of the desired staying.
The max amount of money I am willing to spend.
The page of the results I want to see.
The number of results per page.
Output:
List of cheapest offer per hotel that fulfill the requirements, ordered in ASC order.
Documents schema:
{
"mappings": {
"_doc": {
"properties": {
"room_id": {
"type": "keyword"
},
"available": {
"type": "boolean"
},
"rate": {
"type": "float"
},
"hotel_id": {
"type": "keyword"
},
"day": {
"type": "date",
"format": "yyyyMMdd"
}
}
}
}
}
I have an index per month, and at the moment I only search within the same month.
I came up with this query:
GET /hotels_201910/_search?filter_path=aggregations.hotel.buckets.min_price.value,aggregations.hotel.buckets.key
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"aggs": {
"hotel": {
"terms": {
"field": "hotel_id",
"min_doc_count": 1,
"size" : 1000000
},
"aggs": {
"room": {
"terms": {
"field": "room_id",
"min_doc_count": 10,
"size" : 1000000
},
"aggs": {
"sum_price": {
"sum": {
"field": "rate"
}
},
"max_price": {
"bucket_selector": {
"buckets_path": {
"price": "sum_price"
},
"script": "params.price <= 600"
}
}
}
},
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
},
"sort_by_min_price" : {
"bucket_sort" :{
"sort": [{"min_price" : { "order" : "asc" }}],
"from" : 0,
"size" : 20
}
}
}
}
}
}
And it works, but it has several issues.
It is too slow. With 100K daily rooms it takes about 500 ms to return on my computer, with no other queries running, so in a live system it would be very bad.
I need to set "size" to a big number in the terms aggregations, otherwise not all hotels and rooms are considered.
Is there a way to improve the performance of this aggregation? I have tried splitting the index into multiple shards, but it did not help.
I am almost sure that the approach is wrong, and that is why it is slow. Any recommendation on how to achieve a faster query response time in this case?
Before getting to the answer: I didn't understand why you are using the condition/aggregation below.
"min_price": {
"min_bucket": {
"buckets_path": "room>sum_price"
}
}
Can you give me more clarification on why you need this?
Now, to answer your main question:
Why do you want to aggregate by room_id as well as hotel_id? You can fetch all the rooms matched by your search and then group them by hotel_id on the application side.
The query below will get you all docs grouped by room_id, with sum metrics. You can use the same script filter for the > 600 condition.
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"range": {
"day": { "gte" : "20191001", "lte" : "20191010" }
}
},
{
"term": {
"available": true
}
}
]
}
},
"by_room_id": {
"composite" : {
"size": 100,
"sources" : [
{
"room_id": {
"terms" : {
"field": "room_id"
}
}
}
]
},
"aggregations": {
"price_on_required_dates": {
"sum": { "field": "rate" }
},
"include_source": {
"top_hits": {
"size": 1,
"_source": true
}
},
"price_bucket_sort": {
"bucket_sort": {
"sort": [
{"price_on_required_dates": {"order": "desc"}}
]
}
}
}
}
}
}
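If you would rather do the hotel grouping server side after all, a composite aggregation also accepts multiple sources, so hotel_id could be added as a leading source (a variant sketch, not part of the original suggestion):
"composite": {
  "size": 100,
  "sources": [
    { "hotel_id": { "terms": { "field": "hotel_id" } } },
    { "room_id": { "terms": { "field": "room_id" } } }
  ]
}
Keep in mind that every (hotel, room) combination then becomes its own bucket, so the cheapest-room-per-hotel step still has to happen client side.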
Also, to improve search performance, see:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

Elasticsearch query returns wrong results

I'm trying to do a query for server logs. The search is returning results but there are a couple of issues.
1) I'm specifying the server name, yet I'm getting results back for other servers in the same domain.
2) Even though I'm specifying that the query should return results from the past hour, they're coming back from two hours before, i.e. if I perform the search at 1pm, the results come back from 12pm. The search returns the correct results if I specify sorting by timestamp, but that seems to make the results take longer to appear, so I would rather not do that unless I have to.
Any help you can give is greatly appreciated.
Here's my query (with edited log name and server name):
var searchParams = {
index: 'logs*',
"body": {
"from" : 0, "size": 50,
"sort": [
{
"timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"query": {
"bool": {
"must": [
{
"match" : {"gl2_source_input" : "579f7b6696d78a4f6cbfa745"},
"match" : {"source" : "server01.fakedomain.com"},
"match" : {"EventID" : "5145"}
},
{
"range": {
"timestamp": {
"gte": "now-1h",
"lte": "now/m",
"time_zone": "-05:00"
}
}
}
],
"must_not": []
}
},
}
}
A couple of things here:
First, in your must clause the three match entries are keys of one and the same JSON object; duplicate keys overwrite each other, so only the last one (EventID) is actually applied. That alone explains why other servers show up.
If you want to match a keyword exactly, use a term query on a keyword type field.
Unless you're interested in your queries being scored, use a filter clause instead of the must clause.
So your query can look something like this (assuming that your filter fields are keyword type fields):
var searchParams = {
index: 'logs*',
"body": {
"from" : 0, "size": 50,
"sort": [
{
"timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"query": {
"bool": {
"filter": [
{ "term" : {"gl2_source_input" : "579f7b6696d78a4f6cbfa745"} },
{ "term" : {"source" : "server01.fakedomain.com"} },
{ "term" : {"EventID" : "5145"} },
{
"range": {
"timestamp": {
"gte": "now-1h",
"lte": "now/m",
"time_zone": "-05:00"
}
}
}
]
}
},
}
}
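If those fields were dynamically mapped instead (by default, Elasticsearch indexes JSON strings as text with a keyword sub-field), the term clauses would need to target the sub-fields, e.g. (an assumption to check against your actual mapping):
{ "term" : { "source.keyword" : "server01.fakedomain.com" } }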

Elasticsearch aggregation by arrays of String

I have an Elasticsearch index where I store telephony transactions (SMS, MMS, calls, etc.) with their associated costs.
The key of these documents is the MSISDN (MSISDN = phone number). In my app, I know that there are groups of users. Each user can have one or more MSISDNs.
Here is the mapping for this kind of document:
"mappings" : {
"cdr" : {
"properties" : {
"callDatetime" : {
"type" : "long"
},
"callSource" : {
"type" : "string"
},
"callType" : {
"type" : "string"
},
"callZone" : {
"type" : "string"
},
"calledNumber" : {
"type" : "string"
},
"companyKey" : {
"type" : "string"
},
"consumption" : {
"properties" : {
"data" : {
"type" : "long"
},
"voice" : {
"type" : "long"
}
}
},
"cost" : {
"type" : "double"
},
"country" : {
"type" : "string"
},
"included" : {
"type" : "boolean"
},
"msisdn" : {
"type" : "string"
},
"network" : {
"type" : "string"
}
}
}
}
My goal and issue:
My goal is to write a query that retrieves cost by callType by group. But groups are not represented in Elasticsearch, only in my PostgreSQL database.
So I will write a method that retrieves all the MSISDNs for every existing group, producing something like a list of String arrays containing every MSISDN within each group.
Let's say I have something like:
"msisdn_by_group" : [
{
"group1" : ["01111111111", "02222222222", "033333333333", "044444444444"]
},
{
"group2" : ["05555555555","06666666666"]
}
]
Now, I will use this to generate an Elasticsearch query. I want an aggregation that sums the cost for all those terms in a separate bucket per group, and then splits each bucket again by callType (to build a stacked bar chart).
I've tried several things but didn't manage to make it work (histogram, buckets, terms and sum were mainly the keywords I was playing with).
If somebody here can help me with the order and the keywords I can use to achieve this, it would be great :) Thanks
EDIT:
Here is my last try:
QUERY:
{
"aggs" : {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
I got the expected result, but it's missing the "group" split, as I don't know how to pass the MSISDN arrays as criteria:
RESULT:
"aggregations": {
"cost_histogram": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "data",
"doc_count": 5925,
"cost_histogram_sum": {
"value": 0
}
},
{
"key": "sms_mms",
"doc_count": 5804,
"cost_histogram_sum": {
"value": 91.76999999999995
}
},
{
"key": "voice",
"doc_count": 5299,
"cost_histogram_sum": {
"value": 194.1196
}
},
{
"key": "sms_mms_plus",
"doc_count": 35,
"cost_histogram_sum": {
"value": 7.2976
}
}
]
}
}
OK, I found out how to do this with one query, but it's a damn long query because it repeats for every group; I have no choice. I'm using the "filter" aggregation.
Here is a working example based on the array I wrote in my question above:
POST localhost:9200/cdr/_search?size=0
{
"query": {
"term" : {
"companyKey" : 1
}
},
"aggs" : {
"group_1_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "01111111111"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "02222222222"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "03333333333"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "04444444444"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
},
"group_2_split_cost": {
"filter": {
"bool": {
"should": [{
"bool": {
"must": {
"match": {
"msisdn": "05555555555"
}
}
}
},{
"bool": {
"must": {
"match": {
"msisdn": "06666666666"
}
}
}
}]
}
},
"aggs": {
"cost_histogram": {
"terms": {
"field": "callType"
},
"aggs": {
"cost_histogram_sum" : {
"sum": {
"field": "cost"
}
}
}
}
}
}
}
}
Thanks to the newer versions of Elasticsearch we can now nest very deep aggregations, but it's still a bit of a shame that we can't pass arrays of values to an "OR" operator or something like that. It would shrink these queries, which are admittedly a bit special and niche, like mine.
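For what it's worth, a terms query does accept an array of values and behaves as an OR over them, so each group's filter can most likely be collapsed into a single clause (a sketch based on the group1 numbers above, worth testing against your version):
"group_1_split_cost": {
  "filter": {
    "terms": {
      "msisdn": ["01111111111", "02222222222", "03333333333", "04444444444"]
    }
  },
  "aggs": {
    "cost_histogram": {
      "terms": { "field": "callType" },
      "aggs": {
        "cost_histogram_sum": { "sum": { "field": "cost" } }
      }
    }
  }
}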
