CosmosDB Distinct query - slow performance

We are using Cosmos DB and I'm running a DISTINCT query as follows:
Select Distinct c.SomeType, c.SomeName
From c
Where c.ptkey = 'WHATEVERPTKEY'
And c.SomeCategory = 'WhateverCategory'
...where ptkey is the field that holds the partition key. The above works but takes around 1-1.5 minutes to complete (I'm assuming because some/many of the documents are very large). I've tried filtering on the partition's unique key (id), using a GROUP BY, and playing with ORDER BY (restrictions apply when you combine the two, and only one field is allowed in the ORDER BY unless you have a composite index), but not much changes.
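For context, this is roughly how the query could be issued from application code; a minimal sketch using the azure-cosmos Python SDK (an assumption on my side, with placeholder connection details, and the GROUP BY variant shown only to illustrate what was tried):

from azure.cosmos import CosmosClient

# Placeholders, not real account details.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<collection>")

# DISTINCT form, scoped to a single logical partition so the query is not fanned out.
distinct_query = (
    "SELECT DISTINCT c.SomeType, c.SomeName FROM c "
    "WHERE c.ptkey = 'WHATEVERPTKEY' AND c.SomeCategory = 'WhateverCategory'"
)

# GROUP BY variant (returns the same unique pairs).
group_by_query = (
    "SELECT c.SomeType, c.SomeName FROM c "
    "WHERE c.ptkey = 'WHATEVERPTKEY' AND c.SomeCategory = 'WhateverCategory' "
    "GROUP BY c.SomeType, c.SomeName"
)

results = list(container.query_items(query=distinct_query, partition_key="WHATEVERPTKEY"))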
The one thing that does make a big difference is creating an indexing policy as follows:
"compositeIndexes": [
[
{
"path": "/ptkey"
},
{
"path": "/SomeCategory"
},
{
"path": "/SomeType"
},
{
"path": "/SomeName"
}
]
]
...however, my question is twofold: how do I limit this composite index definition so that it only applies to the specific partition key the above query targets ('WHATEVERPTKEY', as we have around a dozen partition key values within our database/collection)? And secondly, is there any alternative/better option (other than re-modelling our data)?
Note: my query stats when running the query in the Azure Cosmos DB Data Explorer, without the composite index, are as follows:
Query Statistics
Request Charge: 746.96 RUs
Showing Results: 1 - 12
Retrieved document count: 0
Retrieved document size: 0 bytes
Output document count: 0
Output document size: 1789 bytes
Index hit document count: 0
Index lookup time: 1.19 ms
Document load time: 366.4899 ms
Query engine execution time: 29.0101 ms
System function execution time: 0.76 ms
User defined function execution time: 0 ms
Document write time: 0.02 ms
UPDATE
The full indexing policy of the collection is below:
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        { "path": "/*" }
    ],
    "excludedPaths": [
        { "path": "/\"_etag\"/?" }
    ],
    "compositeIndexes": [
        [
            { "path": "/ptkey", "order": "ascending" },
            { "path": "/SomeCategory", "order": "ascending" },
            { "path": "/SomeType", "order": "ascending" },
            { "path": "/SomeName", "order": "ascending" }
        ]
    ]
}

The query metrics in the portal can sometimes fail to show accurate data when there are lots of pages of results, but they do often work.
With DISTINCT, the cost can depend on how many results you're dealing with. If you're expecting just a few results, the impact on cost is low; if you're into the thousands, it can get very expensive. Work is happening to make that less expensive, but it's a ways out before it will be released.
Can you try again with just this as your index policy?
{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        { "path": "/*" }
    ],
    "excludedPaths": [
        { "path": "/\"_etag\"/?" }
    ],
    "compositeIndexes": [
        [
            { "path": "/ptkey", "order": "ascending" },
            { "path": "/SomeCategory", "order": "ascending" }
        ]
    ]
}
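For what it's worth, the reduced policy could also be applied programmatically instead of through the portal; a hedged sketch with the azure-cosmos Python SDK, where the database/container names and partition key path are placeholders:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.get_database_client("<database>")

reduced_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/\"_etag\"/?"}],
    "compositeIndexes": [
        [
            {"path": "/ptkey", "order": "ascending"},
            {"path": "/SomeCategory", "order": "ascending"},
        ]
    ],
}

# Replace the container definition in place; the partition key must stay the same
# and the composite index is rebuilt in the background.
database.replace_container(
    "<collection>",
    partition_key=PartitionKey(path="/ptkey"),
    indexing_policy=reduced_policy,
)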

Related

elasticsearch - how to combine results from two indexes

I have CDR log entries in Elasticsearch in the format below. While creating this document, I won't have info about the delivery_status field.
{
    msgId: "384573847",
    msgText: "Message text to be delivered",
    submit_status: true,
    ...
    delivery_status: // comes later
}
Later, when the delivery status becomes available, I can update this record.
But I have seen that update queries bring down the rate of ingestion. With pure inserts using bulk operations, I can reach up to 3000 or more transactions/sec, but if I mix in updates, the ingestion rate becomes very slow and crawls at 100 or fewer txns/sec.
So, I am thinking that I could create another index like below, where I store the delivery status along with msgId:
{
    msgId: 384573847,
    delivery_status: 0
}
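A hedged sketch of how those status records could be bulk-inserted into the second index with the official Python client, so the hot ingest path never issues updates (endpoint and index names are illustrative):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Delivery reports arriving later, e.g. parsed from the CDR feed.
statuses = [{"msgId": "384573847", "delivery_status": 0}]

# Plain inserts into a separate index instead of updates against the original docs.
actions = (
    {"_index": "cdr_delivery_status", "_source": doc}
    for doc in statuses
)
helpers.bulk(es, actions)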
With this approach, I end up with 2 indices (similar to master-detail tables in an RDBMS). Is there a way to query the records by joining these indices? I have heard of aliases, but could not fully understand the concept and whether it can be applied to my use case.
Thanks to anyone helping me out with suggestions.
As you mentioned, you can index the documents in separate indices and use the collapse functionality of Elasticsearch to retrieve both documents.
Let's say you have indexed documents in index2 and index3 and both share a common msgId; then you can use the query below:
POST index2,index3/_search
{
    "query": {
        "match_all": {}
    },
    "collapse": {
        "field": "msgId",
        "inner_hits": {
            "name": "most_recent",
            "size": 5
        }
    }
}
But again, you need to consider query performance with a large data set. You can do some benchmarking to evaluate query performance and decide whether handling this at index time or at query time works better for you.
Regarding aliases: in the query above we are providing index2,index3 (comma-separated) as the index name, but if you use an alias you can use a single unified name to query both indices.
You can add both indices to a single alias using the command below:
POST _aliases
{
    "actions": [
        {
            "add": {
                "index": "index3",
                "alias": "order"
            }
        },
        {
            "add": {
                "index": "index2",
                "alias": "order"
            }
        }
    ]
}
Now you can use the query below with the alias name instead of the index names:
POST order/_search
{
    "query": {
        "match_all": {}
    },
    "collapse": {
        "field": "msgId",
        "inner_hits": {
            "name": "most_recent",
            "size": 5
        }
    }
}
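For completeness, a rough equivalent of the two steps above with the official Python client (hedged; the endpoint is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Point one alias at both indices...
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": "index3", "alias": "order"}},
        {"add": {"index": "index2", "alias": "order"}},
    ]
})

# ...then collapse on msgId across the alias, pulling the related docs via inner_hits.
resp = es.search(index="order", body={
    "query": {"match_all": {}},
    "collapse": {
        "field": "msgId",
        "inner_hits": {"name": "most_recent", "size": 5},
    },
})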

ElasticSearch search performance

I'm working on an application that is similar to a shopping cart, where we store products and their metadata (JSON) and we expect fast search results. (Expected search results should contain documents having the search string anywhere in the product JSON doc.)
We have chosen Elasticsearch (the AWS service) to store the complete product JSONs; we thought it would help us get faster search results.
But when I test my search endpoint, it takes 2+ seconds for a single request, and it keeps increasing up to 30 seconds if I make 100 parallel requests using JMeter. (These query times are from the application logs, not from the JMeter responses.)
Below are the sample product JSON I'm storing in Elasticsearch and a sample search string.
I believe we are using ES in the wrong way; please help us implement it the right way.
Product JSON:
{
    "dealerId": "D320",
    "modified": 1562827907,
    "store": "S1000",
    "productId": "12345689",
    "Items": [
        {
            "Manufacturer": "ABC",
            "CODE": "V22222",
            "category": "Electronics",
            "itemKey": "b40a0e332190ec470",
            "created": 1562828756,
            "createdBy": "admin",
            "metadata": {
                "mfdDate": 1552828756,
                "expiry": 1572828756,
                "description": "any description goes here.. ",
                "dealerName": "KrishnaKanth Sing, Bhopal"
            }
        }
    ]
}
Search String:
krishna
UPDATE:
We receive daily stock with multiple products (separate JSONs with different productIds) and we are storing them in date-wise indices (e.g. products_20190715).
While searching, we search across the products_* indices.
We are using the JestClient library to communicate with ES from our Spring Boot application.
Sample Search query:
{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "must": [
                            {
                                "simple_query_string": {
                                    "query": "krishna*",
                                    "flags": -1,
                                    "default_operator": "or",
                                    "lenient": true,
                                    "analyze_wildcard": false,
                                    "all_fields": true,
                                    "boost": 1
                                }
                            }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                    }
                }
            ],
            "filter": [
                {
                    "bool": {
                        "must": [
                            {
                                "bool": {
                                    "should": [
                                        {
                                            "match_phrase": {
                                                "category": {
                                                    "query": "Electronics",
                                                    "slop": 0,
                                                    "boost": 1
                                                }
                                            }
                                        },
                                        {
                                            "match_phrase": {
                                                "category": {
                                                    "query": "Furniture",
                                                    "slop": 0,
                                                    "boost": 1
                                                }
                                            }
                                        },
                                        {
                                            "match_phrase": {
                                                "category": {
                                                    "query": "Sports",
                                                    "slop": 0,
                                                    "boost": 1
                                                }
                                            }
                                        }
                                    ],
                                    "disable_coord": false,
                                    "adjust_pure_negative": true,
                                    "boost": 1
                                }
                            }
                        ],
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                    }
                },
                {
                    "bool": {
                        "disable_coord": false,
                        "adjust_pure_negative": true,
                        "boost": 1
                    }
                }
            ],
            "disable_coord": false,
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "sort": [
        {
            "modified": {
                "order": "desc"
            }
        }
    ]
}
There are several issues with your Elasticsearch setup and query.
First, storing each day's products in a different index is a design choice whose reasoning I'm not aware of, but if it's a small list of products it doesn't make sense and can cause performance issues: the products end up stored across many smaller shards, which increases your search time compared to searching them in a single shard. Obviously, if the data is too large, having a single shard will also hurt performance, but that analysis you need to do yourself and design your system accordingly, and we can help you with that.
Now let's come to your query. The second issue is that you are using a wildcard query, which is slow anyway; please read this post, where the founder of Elasticsearch himself commented :-) and where a solution is also provided: use n-gram tokens instead of a wildcard query, which we also use in production to search for partial terms.
The third issue is that you are using "all_fields": true in your search query, which includes all the fields in your index during the search; this is quite a costly thing to do, and you should include only the relevant fields in your search.
I am sure that even if you don't make the first (design) change but incorporate the other 2 changes in your query, it will still improve your query performance a lot.
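For reference, a hedged sketch of what an edge_ngram-based setup could look like with the official Python client; the index name, field name, and gram sizes are assumptions, not taken from the question:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Index text as edge n-grams so a prefix like "krishna" matches "KrishnaKanth"
# with a plain match query instead of a wildcard.
es.indices.create(index="products_ngram", body={
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tok": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 15,
                    "token_chars": ["letter", "digit"]
                }
            },
            "analyzer": {
                "edge_analyzer": {
                    "type": "custom",
                    "tokenizer": "edge_tok",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "_doc": {  # mapping type for a 6.x cluster; drop this level on 7.x+
            "properties": {
                "dealerName": {
                    "type": "text",
                    "analyzer": "edge_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }
    }
})

# Query only the relevant field, with no wildcard and no all_fields.
resp = es.search(index="products_ngram", body={
    "query": {"match": {"dealerName": "krishna"}}
})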
Happy debugging and learning.
Use the JSON Extractor post-processor and fetch the pattern of data you need to input as the search string.
Give the JSON expression and a match number: 0 to pick a match at random, 1 for the first match, 2 for the second, and so on. This way, you have made the search string dynamic.
This will replicate the real scenario since each user will not be searching for the same string.
When you put more sequential/concurrent users on the server, it is normal for the response time of each request to increase gradually. What you need to be concerned about are failures from the server and the average time taken for the requests in the summary report.
In general, as a standard, requests should not take more than 10 seconds to respond (this depends on the company and the type of product). Please note that the default timeout of JMeter is around 21 seconds; if a request takes longer than this, it automatically fails (if "Delay thread creation until needed" is disabled in the thread group). But you can assert the expected value in the advanced tab of each request in JMeter.

Elasticsearch query speed-up with a repeatedly used terms query filter

I need to find the number of co-occurrences between one single tag and a fixed set of tags as a whole. I have 10,000 different single tags, and there are 10k tags inside the fixed set of tags. I loop through all single tags in the context of the fixed set of tags, with a fixed time range. I have a total of 1 billion documents inside the index, across 20 shards.
Here is the Elasticsearch query (Elasticsearch 6.6.0):
es.search(index=index, size=0, body={
    "query": {
        "bool": {
            "filter": [
                {"range": {
                    "created_time": {
                        "gte": fixed_start_time,
                        "lte": fixed_end_time,
                        "format": "yyyy-MM-dd-HH"
                    }
                }},
                {"term": {"tags": dynamic_single_tag}},
                {"terms": {"tags": {
                    "index": "fixed_set_tags_list",
                    "id": 2,
                    "type": "twitter",
                    "path": "tag_list"
                }}}
            ]
        }
    },
    "aggs": {
        "by_month": {
            "date_histogram": {
                "field": "created_time",
                "interval": "month",
                "min_doc_count": 0,
                "extended_bounds": {
                    "min": two_month_start_time,
                    "max": start_month_start_time
                }
            }
        }
    }
})
My question: is there any way to cache, inside Elasticsearch, the fixed 10k-tag terms query and the time-range filter so that the query time is reduced? My query above takes 1.5s for one single tag.
What you are seeing is normal behavior for Elasticsearch aggregations (actually, pretty good performance given that you have 1 billion documents).
There are a couple of options you may consider: using a batch of filter aggregations, re-indexing with a subset of documents, and downloading the data out of Elasticsearch and computing the co-occurrences offline.
But it is probably worth trying to send those 10K queries and see if Elasticsearch's built-in caching kicks in.
Let me explain each of these options in a bit more detail.
Using filter aggregation
First, let's outline what we are doing in the original ES query:
filter documents with created_time in a certain time window;
filter documents containing the desired tag dynamic_single_tag;
also filter documents that have at least one tag from the list fixed_set_tags_list;
count how many such documents there are per month in a certain time period.
Performance is a problem because we've got 10K tags to make such queries for.
What we can do here is move the filter on dynamic_single_tag from the query into the aggregations:
POST myindex/_doc/_search
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                { "terms": { ... } }
            ]
        }
    },
    "aggs": {
        "by tag C": {
            "filter": {
                "term": {
                    "tags": "C"    <== here's the filter
                }
            },
            "aggs": {
                "by month": {
                    "date_histogram": {
                        "field": "created_time",
                        "interval": "month",
                        "min_doc_count": 0,
                        "extended_bounds": {
                            "min": "2019-01-01",
                            "max": "2019-02-01"
                        }
                    }
                }
            }
        }
    }
}
The result will look something like this:
"aggregations" : {
"by tag C" : {
"doc_count" : 2,
"by month" : {
"buckets" : [
{
"key_as_string" : "2019-01-01T00:00:00.000Z",
"key" : 1546300800000,
"doc_count" : 2
},
{
"key_as_string" : "2019-02-01T00:00:00.000Z",
"key" : 1548979200000,
"doc_count" : 0
}
]
}
}
Now, if you are wondering how this helps performance, here is the trick: add more such filter aggregations, one for each tag: "by tag D", "by tag E", etc.
The improvement will come from doing "batch" requests, combining many initial requests into one. It might not be practical to put all 10K of them in one query, but even batches of 100 tags per query can be a game changer.
(Side note: roughly the same behavior can be achieved via terms aggregation with include filter parameter.)
This method of course requires getting your hands dirty and writing a somewhat more complex query, but it comes in handy if one needs to run such queries at random times with zero preparation.
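A hedged sketch of how such a batched body could be assembled with the Python client used in the question (the helper name, batch contents, and bounds are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def build_batched_body(tag_batch, fixed_filters, hist_min, hist_max):
    """One request body carrying a filter aggregation per tag in the batch."""
    aggs = {}
    for tag in tag_batch:
        aggs["by tag " + tag] = {
            "filter": {"term": {"tags": tag}},
            "aggs": {
                "by month": {
                    "date_histogram": {
                        "field": "created_time",
                        "interval": "month",
                        "min_doc_count": 0,
                        "extended_bounds": {"min": hist_min, "max": hist_max},
                    }
                }
            },
        }
    # fixed_filters = the range filter plus the terms lookup on fixed_set_tags_list
    return {"size": 0, "query": {"bool": {"filter": fixed_filters}}, "aggs": aggs}

# e.g. 100 tags per request instead of 100 separate requests
resp = es.search(index="myindex", body=build_batched_body(
    ["A", "B", "C"], fixed_filters=[], hist_min="2019-01-01", hist_max="2019-02-01"))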
Re-indexing the documents
The idea behind the second method is to reduce the set of documents beforehand, via the reindex API. The reindex request might look like this:
POST _reindex
{
    "source": {
        "index": "myindex",
        "type": "_doc",
        "query": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            "created_time": {
                                "gte": "fixed_start_time",
                                "lte": "fixed_end_time",
                                "format": "yyyy-MM-dd-HH"
                            }
                        }
                    },
                    {
                        "terms": {
                            "tags": {
                                "index": "fixed_set_tags_list",
                                "id": 2,
                                "type": "twitter",
                                "path": "tag_list"
                            }
                        }
                    }
                ]
            }
        }
    },
    "dest": {
        "index": "myindex_reduced"
    }
}
This will create a new index, myindex_reduced, containing only the documents that satisfy the first 2 filter clauses.
At this point, the original query can be done without those 2 clauses.
The speed-up in this case comes from limiting the number of documents; the smaller it is, the bigger the gain. So if fixed_set_tags_list leaves you with a small portion of the 1 billion documents, this is an option you can definitely try.
Downloading data and processing outside Elasticsearch
To be honest, this use case looks more like a job for pandas. If data analytics is your goal, I would suggest using the scroll API to extract the data to disk and then process it with an arbitrary script.
In Python it could be as simple as using the scan() helper of the elasticsearch library.
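A minimal sketch of that, assuming the official elasticsearch Python client and illustrative index/field values:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Stream every matching document; scan() drives the scroll API under the hood.
hits = helpers.scan(es, index="myindex", query={
    "_source": ["tags", "created_time"],
    "query": {"bool": {"filter": [
        {"range": {"created_time": {
            "gte": "2019-01-01-00", "lte": "2019-03-01-00", "format": "yyyy-MM-dd-HH"}}}
    ]}}
})

# Toy offline tally; in practice this could feed a pandas DataFrame.
counts = {}
for hit in hits:
    for tag in hit["_source"].get("tags", []):
        counts[tag] = counts.get(tag, 0) + 1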
Why not try the brute-force approach?
Elasticsearch will already try to help you with your query via the request cache. It is applied only to pure-aggregation queries (size: 0), so it should work in your case.
But it will not, because the content of the query is always different (the whole JSON of the query is used as the cache key, and we have a new tag in every query). A different level of caching will start to come into play.
Elasticsearch heavily relies on the filesystem cache, which means that under the hood the more frequently accessed blocks of the filesystem get cached (practically, loaded into RAM). For the end user this means that "warming up" comes slowly and with a volume of similar requests.
In your case, aggregations and filtering occur on 2 fields: created_time and tags. This means that after doing maybe 10 or 100 requests with different tags, the response time will drop from 1.5s to something more bearable.
To demonstrate my point: in a Vegeta load test from my study of Elasticsearch performance, the same query with heavy aggregations was sent at a fixed RPS. Initially each request was taking ~10s, and after 100 requests it dropped to a brilliant 200ms.
I would definitely suggest trying this "brute force" approach: if it works, great; if it does not, it costs you nothing.
Hope that helps!

Automatically merge / rollup data in elastic search

Is there an easy way to create a new index from aggregated results of another index (and maybe merge them)?
I have a large index with products that are similar. They have a product ID to identify which products belong together, but they have a different URL/price and a different title (which I want to preserve somehow in the merge so I can still search it).
So if I enter 8 product lines, I would love to have them all roll up into 1 product with a nested array of the similar products' data.
I tried the rollup API with the job below, but I couldn't get it going the way I wanted, and I'm getting the feeling that it is only for historical/log data. All my data has the same timestamp, since I update all of it every morning.
PUT _xpack/rollup/job/product
{
    "index_pattern": "products",
    "rollup_index": "products_rollup",
    "cron": "*/30 * * * * ?",
    "page_size": 1000,
    "groups": {
        "date_histogram": {
            "field": "timestamp",
            "interval": "7d"
        },
        "terms": {
            "fields": [
                "product_id"
            ]
        }
    },
    "metrics": [
        {
            "field": "total_price",
            "metrics": [
                "min",
                "max",
                "sum"
            ]
        }
    ]
}
Thanks!
For now the rollup API is mainly intended to roll up numerical data over time, not to merge documents. In your case I would merge the documents at the application level and produce one document with the "subdocuments" in a nested object.
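A hedged sketch of what that application-level merge could look like with the official Python client (index and field names are assumptions based on the rollup job above):

from collections import defaultdict
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Group the raw product lines by product_id...
groups = defaultdict(list)
for hit in helpers.scan(es, index="products", query={"query": {"match_all": {}}}):
    src = hit["_source"]
    groups[src["product_id"]].append(src)

# ...and write one merged document per product, keeping the variants nested
# (the titles are also copied up so they stay easy to search).
actions = (
    {
        "_index": "products_merged",
        "_id": product_id,
        "_source": {
            "product_id": product_id,
            "titles": [v.get("title") for v in variants],
            "variants": variants,  # mapped as a nested field in products_merged
        },
    }
    for product_id, variants in groups.items()
)
helpers.bulk(es, actions)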

Which is the most effective way to get all the results of aggregation

I have the following query:
GET my-index-*/my-type/_search
{
    "size": 0,
    "aggregations": {
        "my_agg": {
            "terms": {
                "script": "code"
            },
            "aggs": {
                "dates": {
                    "date_range": {
                        "field": "created_time",
                        "ranges": [
                            {
                                "from": "2017-12-09T00:00:00.000",
                                "to": "2017-12-09T16:00:00.000"
                            },
                            {
                                "from": "2017-12-10T00:00:00.000",
                                "to": "2017-12-10T16:00:00.000"
                            }
                        ]
                    }
                },
                "total_count": {
                    "sum_bucket": {
                        "buckets_path": "dates._count"
                    }
                },
                "bucket_filter": {
                    "bucket_selector": {
                        "buckets_path": {
                            "totalCount": "total_count"
                        },
                        "script": "params.totalCount == 0"
                    }
                }
            }
        }
    }
}
The result of this query is a bunch of buckets. What I need is the list of keys of my buckets. The problem is that the aggregation result size is 10 by default; after getting those 10, my bucket_filter filters them by total count, and I get only some of those 10. I need all the results, which means I need to specify "size" = n, where n is the distinct count of code values, so that I don't lose any data. I have billions of documents, so in my case n is about 30,000. When I tried executing the query, an "Out of memory" error occurred on the cluster, so I guess that's not the best idea. Is there a good way to get all the results for my query?
Unfortunately, this is not recommended for high-cardinality fields with 30K unique values. The reason is the memory cost and the large amount of data that needs to be collected from the shards, as you've discovered. It might work, but then you need more memory...
A more efficient solution is to use the scroll API and specify, via fields in your search request, the values you want to retrieve, and then either store these values in memory in your client or stream them.
Update: since ES 6.5 this has been possible with Composite aggregations, see https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
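A hedged sketch of paging through all the bucket keys with a composite aggregation via the official Python client (the terms script mirrors the original query; everything else is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

keys, after_key = [], None
while True:
    composite = {"size": 1000, "sources": [{"code": {"terms": {"script": "code"}}}]}
    if after_key:
        composite["after"] = after_key  # resume where the previous page stopped
    resp = es.search(index="my-index-*", body={
        "size": 0,
        "aggs": {"codes": {"composite": composite}}
    })
    agg = resp["aggregations"]["codes"]
    keys.extend(bucket["key"]["code"] for bucket in agg["buckets"])
    after_key = agg.get("after_key")
    if not after_key:
        break  # all distinct keys collected, 1000 at a time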
