We have a lot of documents in each index (~10,000,000), but each document is very small and contains almost only integer values.
We need to SUM all the numerical fields.
First step: we ask for all the available fields with a mapping request.
Example:
GET INDEX/TYPE/_mapping
Second step: we build the search request with the fields from the mapping.
Example:
GET INDEX/TYPE/_search
{
  // SOME FILTERS TO REDUCE THE NUMBER OF DOCUMENTS
  "size": 0,
  "aggs": {
    "FIELD 1": {
      "sum": { "field": "FIELD 1" }
    },
    "FIELD 2": {
      "sum": { "field": "FIELD 2" }
    },
    // ...
    "FIELD N": {
      "sum": { "field": "FIELD N" }
    }
  }
}
Our problem is that the execution time of this second request grows linearly with the number of fields N.
That's not acceptable, as these are only sums, so we tried to build our own aggregation with a scripted metric (Groovy).
Example with only 2 fields:
// ...
"aggs": {
  "test": {
    "scripted_metric": {
      "init_script": "_agg['t'] = []",
      "map_script": "_agg.t.add(['FIELD 1': doc['FIELD 1'].value, 'FIELD 2': doc['FIELD 2'].value])",
      "combine_script": "res = [:]; res['FIELD 1'] = 0; res['FIELD 2'] = 0; for (t in _agg.t) { res['FIELD 1'] += t['FIELD 1']; res['FIELD 2'] += t['FIELD 2']; }; return res",
      "reduce_script": "res = [:]; res['FIELD 1'] = 0; res['FIELD 2'] = 0; for (t in _aggs) { res['FIELD 1'] += t['FIELD 1']; res['FIELD 2'] += t['FIELD 2']; }; return res"
    }
  }
}
// ...
But it appears that the more assignments we add to the script, the longer it takes to execute, so this doesn't solve our problem.
There are not many examples out there.
Do you have ideas to improve this script's performance?
Or other ideas?
How could we calculate N sums in sub-linear time? Does any such system exist?
10 million documents isn't actually that many. How long are your queries taking, how many shards do you have, and is the CPU maxed at 100%? (I was going to ask these in a comment but don't have 50 reputation yet.)
If you are interested in the total sum of all fields you could pre-calculate document-level sums when you are indexing the document and then at query time just take the sum of these values.
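That pre-calculation step can be sketched like this (a minimal sketch; the field names and the `fields_total` field are hypothetical, to be replaced with the fields from your mapping):

```python
# Hypothetical numeric field names; replace with the fields from your mapping
NUMERIC_FIELDS = ["field_1", "field_2", "field_3"]

def add_precomputed_total(doc):
    # Store the per-document sum alongside the document before indexing,
    # so a single sum aggregation on "fields_total" replaces N sum aggs
    enriched = dict(doc)
    enriched["fields_total"] = sum(doc.get(f, 0) for f in NUMERIC_FIELDS)
    return enriched
```

At query time you then run one sum aggregation on `fields_total` instead of N separate ones.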
You could also try storing the fields as doc_values and see if it helps. You would have less memory pressure and garbage collection, although the docs mention a possible 10-25% performance hit.
I am building an integration between Shopify and our ERP via the Admin API using GraphQL. All is working well except when I try to get the exact prices for an order.
According to the documentation, discountedTotalSet should be 'The total line price after discounts are applied', but I am finding it returns the full price; see the examples below.
Can anyone give me guidance on how to get the API to return the same prices that are on the order? I need this to match exactly, line by line. This is the query I am using for the order:
{
  node(id: "gid://shopify/Order/4866288156908") {
    id
    ... on Order {
      name
      lineItems(first: 50) {
        edges {
          node {
            id
            quantity
            sku
            discountedTotalSet {
              shopMoney {
                currencyCode
                amount
              }
            }
          }
        }
      }
    }
  }
}
And this is the result. Note that amount says 599.0, but that is not correct; see the screenshot of the same order from the UI.
{
  "data": {
    "node": {
      "id": "gid://shopify/Order/4866288156908",
      "name": "AK-1003",
      "lineItems": {
        "edges": [
          {
            "node": {
              "id": "gid://shopify/LineItem/12356850286828",
              "quantity": 1,
              "sku": "AK-A1081",
              "discountedTotalSet": {
                "shopMoney": {
                  "currencyCode": "AUD",
                  "amount": "599.0"
                }
              }
            }
          }
        ]
      }
    }
  }
}
(Shopify UI screenshot)
discountedTotalSet gives you the amount after discounts applied to that particular line. In your example you're applying a discount to the whole order, and there is no field in the lineItem object that will give you the expected value for that line.
So you have to distribute the whole discount across the single lines.
I had the exact same problem and had to implement this solution in Python. I hope it helps:
from decimal import Decimal

def split_discounts(money, n):
    # Split `money` into n near-equal parts, spreading the leftover cents
    quotient = Decimal(round(((money * Decimal(100)) // n) / Decimal(100), 2))
    remainder = int(money * 100 % n)
    q1 = Decimal(round(quotient + Decimal("0.01"), 2))  # quotient + 0.01
    return [q1] * remainder + [quotient] * (n - remainder)  # array of discounted amounts

def retrieve_shop_money(obj):
    # Retrieve the inner shopMoney amount, defaulting to 0
    if obj and obj.get('shopMoney') and obj['shopMoney'].get('amount'):
        return Decimal(obj['shopMoney']['amount'])
    return Decimal(0)

def get_line_price(order_node):
    discount = retrieve_shop_money(order_node["cartDiscountAmountSet"])
    non_free_lines = len([1 for item in order_node["lineItems"]["edges"]
                          if retrieve_shop_money(item["node"]["discountedTotalSet"]) > 0])
    if non_free_lines > 0:
        discounts = split_discounts(discount, non_free_lines)
    else:
        discounts = []  # edge case (whole order free) that you might not need to consider
    prices = []
    discounted = 0
    for item in order_node["lineItems"]["edges"]:
        gross = retrieve_shop_money(item["node"]["originalTotalSet"])  # THIS IS THE VALUE WITHOUT DISCOUNTS
        net = retrieve_shop_money(item["node"]["discountedTotalSet"])
        if net > 0:  # excluding free gifts
            net = net - discounts[discounted]  # THIS IS THE VALUE YOU'RE LOOKING FOR
            discounted = discounted + 1
        prices.append(net)
    return prices
So first I check whether the whole order was free. This was an edge case that was giving me some issues; in that case I just know that 0 is the answer I want.
Otherwise, with the method split_discounts, I calculate each single discount to be applied to the lines. The discounts can differ, because if you split a $1 discount over 3 items it becomes [0.33, 0.33, 0.34], so the result is an array. Then I just loop through the lines and apply the discount where discountedTotalSet is > 0.
Thinking about it, you might also want to make sure the discount is not greater than the value of the line, but that is an edge case I never encountered; it depends on the kind of discounts you have.
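For reference, the cent-distribution inside split_discounts can also be written with divmod. This is a standalone sketch, assuming amounts are Decimals with at most two decimal places:

```python
from decimal import Decimal

def split_cents(money, n):
    # Split `money` into n parts that differ by at most one cent
    cents = int(money * 100)        # work in integer cents to avoid rounding drift
    base, extra = divmod(cents, n)  # `extra` lines receive one additional cent
    return [Decimal(base + 1) / 100] * extra + [Decimal(base) / 100] * (n - extra)

# A $1.00 discount over 3 lines
split_cents(Decimal("1"), 3)  # → [Decimal('0.34'), Decimal('0.33'), Decimal('0.33')]
```

The parts always add back up to the original amount, which is the property you need for the line totals to match the order total.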
I'm fairly new to Elasticsearch (though with a fair bit of SQL experience) and am currently struggling to put a proper query together. I have 2 boolean fields, isPlayer and isEvil, that each entry is either true or false on. Based on that, I want to split my dataset into 4 groups:
isPlayer: true, isEvil: true
isPlayer: true, isEvil: false
isPlayer: false, isEvil: true
isPlayer: false, isEvil: false
These groups I want to sort randomly within themselves, then concatenate them into one long list that I can paginate. I'd like to do that inside the query, as that seems like the "correct" way, since I'd do it similarly in SQL. In that list the groups are to be kept in order: first all entries of group 1 in a random order, then all entries of group 2 in a random order, then all entries of group 3, etc. It is necessary that the randomness of the sorting is reproducible given the same inputs, so if the sorting is based on random_score I'd ideally be using a seed for the randomness.
I can build a single query, but how do I combine 4?
As approaches I've found MultiSearch and the Disjunction Max Query so far. MultiSearch doesn't seem to support pagination. Regarding the Disjunction Max Query, it might be that I'm missing the forest for the trees, but I'm struggling to have the subqueries randomly sorted only within themselves before appending them to one another.
Here is how I write a single query for now, without the Disjunction Max Query, in case it helps:
{
  "query": {
    "bool": {
      "should": [
        { "term": { "isPlayer": true } },
        { "term": { "isEvil": true } }
      ]
    }
  }
}
The solution to this problem is not building 4 separate groups, but ensuring they all have distinct score ranges and sorting by score. This can be achieved by scoring the hits not through some matching criteria, but through a script_score query, which allows you to write code yourself that returns a custom score (the default language is called "painless", but I've seen examples in Groovy as well).
The logic is fairly simple:
If isPlayer = true, add 2 points to the score
If isEvil = true, add 4 points to the score
Either way, add a random number between 0 and 1 to the score at the end
This creates the 4 groups I wanted with distinct score-ranges:
isPlayer = true, isEvil = true --> Score-range: 6-7
isPlayer = false, isEvil = true --> Score-range: 4-5
isPlayer = true, isEvil = false --> Score-range: 2-3
isPlayer = false, isEvil = false --> Score-range: 0-1
The query would look like this:
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": """
          double score = 0;
          if (doc['isPlayer'].value) {
            score += 2;
          }
          if (doc['isEvil'].value) {
            score += 4;
          }
          int partialSeed = 1;
          score += randomScore(partialSeed, 'id');
          return score;
        """
      }
    }
  }
}
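The banding logic itself can be sanity-checked outside Elasticsearch. This is plain Python mimicking the Painless script, not the actual query:

```python
import random

def group_score(is_player, is_evil, rng):
    # Mirrors the script: +2 for players, +4 for evil, plus a random tie-breaker in [0, 1)
    score = 0.0
    if is_player:
        score += 2
    if is_evil:
        score += 4
    return score + rng.random()

rng = random.Random(1)  # fixed seed, like partialSeed, for reproducible ordering
# Each combination lands in its own unit-wide band, so sorting by score keeps the groups apart
assert 6 <= group_score(True, True, rng) < 7
assert 4 <= group_score(False, True, rng) < 5
assert 2 <= group_score(True, False, rng) < 3
assert 0 <= group_score(False, False, rng) < 1
```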
I have a large data set, around 25 million records.
I am using searchAfter with PointInTime to walk through the data.
My question: is there a way to skip records over the limit of 10,000
(index.max_result_window)
and start picking records, for example from 100,000 up to 105,000?
Right now I am sending multiple requests to Elasticsearch until I reach the desired point, but that is not efficient and consumes a lot of time.
Here is how I did it:
I calculated how many pages I needed for the pagination.
Then the user sends a request with a page number, e.g. number 3. Only when I reach the desired page do I set the source to true.
This is the best I managed to do to improve the performance and reduce the response size for the pages that aren't required.
int numberOfPages = Pagination.GetTotalPages(totalCount, _size);
var pitResponse = await _esClient.OpenPointInTimeAsync(content._index, p => p.KeepAlive("2m"));
if (pitResponse.IsValid)
{
IEnumerable<object> lastHit = null;
for (int round = 0; round < numberOfPages; round++)
{
bool fetchSource = round == requiredPage;
var response = await _esClient.SearchAsync<ProductionDataItem>(s => s
.Index(content._index)
.Size(10000)
.Source(fetchSource)
.Query(query)
.PointInTime(pitResponse.Id)
.Sort(srt => {
if (content.Sort == 1) { srt.Ascending(sortBy); }
else { srt.Descending(sortBy); }
return srt; })
.SearchAfter(lastHit)
);
if (fetchSource)
{
itemsList.AddRange(response.Documents.ToList());
break;
}
lastHit = response.Hits.Last().Sorts;
}
}
//Closing PIT
await _esClient.ClosePointInTimeAsync(p => p.Id(pitResponse.Id));
Check here: Elasticsearch Pagination Techniques
I think the best way to do it is how I did it:
keep paging via Point in Time and only load the result when the desired page is reached, by using .Source(bool).
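The skip logic boils down to this toy sketch (plain Python, no Elasticsearch client; `pages` stands in for successive search_after responses):

```python
def fetch_page(pages, required_page):
    # Walk the pages in order, as search_after requires, but only
    # materialize the documents of the page that was actually requested
    cursor = None
    for round_no, page in enumerate(pages):
        if round_no == required_page:
            return list(page)  # only here do we pay the cost of loading _source
        cursor = page[-1]      # sort values of the last hit feed the next request
    return []                  # requested page is beyond the data
```

Every intermediate round still hits the cluster, but with `_source` disabled it only transfers sort values, which is what keeps the skipping cheap.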
Here I am trying to get search results for multiple terms. Say fulltext = "Lee jeans"; then regexResult = {"lee", "jeans"}.
Code :
IProviderSearchContext searchContext = index.CreateSearchContext();
IQueryable<SearchItem> scQuery = searchContext.GetQueryable<SearchItem>();
var predicate = PredicateBuilder.True<SearchItem>();
//checking if the fulltext includes terms within " "
var regexResult = SearchRegexHelper.getSearchRegexResult(fulltext);
regexResult.Remove(" ");
foreach (string term in regexResult)
{
predicate = predicate.Or(p => p.TextContent.Contains(term));
}
scQuery = scQuery.Where(predicate);
IEnumerable<SearchHit<SearchItem>> results = scQuery.GetResults().Hits;
results=sortResult(results);
Sorting is based on sitecore fields:
switch (query.Sort)
{
case SearchQuerySort.Date:
results = results.OrderBy(x => GetValue(x.Document, FieldNames.StartDate));
break;
case SearchQuerySort.Alphabetically:
results = results.OrderBy(x => GetValue(x.Document, FieldNames.Profile));
break;
case SearchQuerySort.Default:
default:
results = results.OrderByDescending(x => GetValue(x.Document, FieldNames.Updated));
break;
}
Now, what I need is to have the results for "lee" first, sorted, and then the results for "jeans", sorted. The final search result should be the concatenation of the sorted sets: first the items for "lee", then those for "jeans".
Thus we would have to get the results for "lee" first and then the results for "jeans".
Is there a way to get results term by term?
You can use Query-Time Boosting to give the terms more relevance and therefore affect the ranking:
Sitecore 7: Six Types of Search Boosting
Lucene Boost With LINQ in Sitecore 7 ContentSearch
You want to give the first term the highest boost, and then gradually reduce for each additional term:
var regexResult = SearchRegexHelper.getSearchRegexResult(fulltext);
regexResult.Remove(" ");
float boost = regexResult.Count();
foreach (string term in regexResult)
{
predicate = predicate.Or(p => p.TextContent.Contains(term).Boost(boost--));
}
EDIT:
Boosting and sorting in the same query is not possible; or at least, the sorting will undo the relevance-based ordering that the boosting produced.
An alternative way would be to search multiple times and concatenate the results into a single list. Not as efficient, since you are essentially making multiple searches:
IProviderSearchContext searchContext = index.CreateSearchContext();
var items = new List<SearchResultItem>();
var regexResult = SearchRegexHelper.getSearchRegexResult(fulltext);
regexResult.Remove(" ");
foreach (string term in regexResult)
{
var results = searchContext.GetQueryable<SearchResultItem>()
.Where(p => p.Content.Contains(term));
SortSearchResults(results); //results passed in by reference, no need to return object to set it back to itself
items.AddRange(results);
}
NOTE: The above does not take into account duplicates between the result sets.
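If duplicates matter, an order-preserving de-duplication pass over the concatenated list is cheap. A sketch, assuming each hit exposes some unique key (e.g. an item ID):

```python
def dedupe_keep_order(hits, key):
    # Keep only the first occurrence of each key, i.e. the hit that
    # matched the highest-priority term in the concatenated list
    seen = set()
    result = []
    for hit in hits:
        k = key(hit)
        if k not in seen:
            seen.add(k)
            result.append(hit)
    return result
```

Run it once over the combined list after the loop; earlier (higher-boost) terms win ties.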
I have an index with the following data:
{
"_index":"businesses",
"_type":"business",
"_id":"1",
"_version":1,
"found":true,
"_source":{
"business":{
"account_level_id":"2",
"business_city":"Abington",
"business_country":"United States of America"
}
}
}
When I query the index, I want to sort by account_level_id (a digit between 1 and 5). The problem is, I don't want to sort in ASC or DESC order, but in this order: 4, 3, 5, 2, 1. This was caused by bad practice a couple of years ago, when the account levels maxed out at level 4 but a lower-level account was then added with the value 5. Is there a way to tell ES that I want the results returned in that specific order?
You could write a script-based sort, something like (not tested):
doc['account_level_id'].value == "5" ? 3 : doc['account_level_id'].value == "4" ? 5 : doc['account_level_id'].value == "3" ? 4 : doc['account_level_id'].value == "2" ? 2 : 1;
Or if possible you could create another field sort_level that maps account_level_id to sensible values that you can sort on.
{
"_index":"businesses",
"_type":"business",
"_id":"1",
"_version":1,
"found":true,
"_source":{
"business":{
"account_level_id":"4",
"business_city":"Abington",
"business_country":"United States of America",
"sort_level": 5
}
}
}
If you can sort in DESC, you can create a function that maps the values and sort on the mapped values.
DESC sorts them like (5 4 3 2 1); by mapping 4 to 5, 3 to 4, and 5 to 3, the order becomes (4 3 5 2 1).
int map_to(int x){
    switch(x){
        case 3: return 4;
        case 4: return 5;
        case 5: return 3;
        default: return x; // 1 and 2 keep their own value
    }
}
and use it in your sorting: when the sort algorithm has to compare x vs y, it should compare map_to(x) vs map_to(y). This makes 4 come before 3 and 5, as you want.
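A quick check of the mapping in Python: sorting descending on the mapped value yields the required order 4, 3, 5, 2, 1.

```python
def map_to(x):
    # 4 must come first, then 3, then 5; 1 and 2 keep their own values
    return {3: 4, 4: 5, 5: 3}.get(x, x)

sorted([1, 2, 3, 4, 5], key=map_to, reverse=True)  # → [4, 3, 5, 2, 1]
```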