I'm working on an application similar to a shopping cart, where we store each product and its metadata (JSON) and expect fast search results. (Search results should contain documents having the search string anywhere in the product JSON doc.)
We chose Elasticsearch (the AWS service) to store the complete product JSONs, thinking it would give us faster search results.
But when I test my search endpoint, a single request takes 2s+, and the latency keeps increasing, up to 30s, when I make 100 parallel requests using JMeter. (These query times are from the application logs, not from the JMeter responses.)
Here are the sample product JSON and the sample search string I'm storing in Elasticsearch.
I believe we are using ES the wrong way; please help us implement it the right way.
Product JSON:
{
"dealerId": "D320",
"modified": 1562827907,
"store": "S1000",
"productId": "12345689",
"Items": [
{
"Manufacturer": "ABC",
"CODE": "V22222",
"category": "Electronics",
"itemKey": "b40a0e332190ec470",
"created": 1562828756,
"createdBy": "admin",
"metadata": {
"mfdDate": 1552828756,
"expiry": 1572828756,
"description": "any description goes here.. ",
"dealerName": "KrishnaKanth Sing, Bhopal"
}
}
]
}
Search String:
krishna
UPDATE:
We receive daily stock with multiple products (separate JSONs with different productIds) and we store them in date-wise indices (e.g. products_20190715).
When searching, we search across the products_* indices.
We are using the JestClient library to communicate with ES from our Spring Boot application.
Sample Search query:
{
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"simple_query_string": {
"query": "krishna*",
"flags": -1,
"default_operator": "or",
"lenient": true,
"analyze_wildcard": false,
"all_fields": true,
"boost": 1
}
}
],
"disable_coord": false,
"adjust_pure_negative": true,
"boost": 1
}
}
],
"filter": [
{
"bool": {
"must": [
{
"bool": {
"should": [
{
"match_phrase": {
"category": {
"query": "Electronics",
"slop": 0,
"boost": 1
}
}
},
{
"match_phrase": {
"category": {
"query": "Furniture",
"slop": 0,
"boost": 1
}
}
},
{
"match_phrase": {
"category": {
"query": "Sports",
"slop": 0,
"boost": 1
}
}
}
],
"disable_coord": false,
"adjust_pure_negative": true,
"boost": 1
}
}
],
"disable_coord": false,
"adjust_pure_negative": true,
"boost": 1
}
},
{
"bool": {
"disable_coord": false,
"adjust_pure_negative": true,
"boost": 1
}
}
],
"disable_coord": false,
"adjust_pure_negative": true,
"boost": 1
}
},
"sort": [
{
"modified": {
"order": "desc"
}
}
]
}
There are several issues with your Elasticsearch query.
Storing each day's products in a different index is your design choice, and I don't know the reasoning behind it. But if each day holds only a small list of products, it doesn't make sense and can cause performance issues: the products end up spread across many small shards, which increases search time compared to searching a single shard. Obviously, if the data is too large, a single shard will also hurt performance, but that analysis you need to do yourself and design your system accordingly, and we can help you with that.
Now let's come to your query. First, you are using a wildcard query, which is slow in any case; please read this post, where the founder of Elasticsearch himself commented :-) and where a solution is also provided: use n-gram tokens instead of a wildcard query, which we also use in production to search for partial terms.
The third issue with your query is that you are using "all_fields": true, which includes every field in your index in the search. That is quite costly, and you should include only the relevant fields in your search.
I am sure that even if you don't make the first change (the design change) but incorporate the other two changes into your query, it will still improve your query performance a lot.
Happy debugging and learning.
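To make the n-gram suggestion concrete: an edge_ngram tokenizer indexes every prefix of each term, so a plain match query on such a subfield can replace the slow krishna* wildcard. A minimal Python sketch of the tokens such a tokenizer would emit (the min/max gram sizes here are assumptions, not your actual mapping):

```python
def edge_ngrams(term: str, min_gram: int = 2, max_gram: int = 10) -> list[str]:
    """Mimic an edge_ngram tokenizer for a single lowercased term."""
    term = term.lower()
    upper = min(len(term), max_gram)
    return [term[:i] for i in range(min_gram, upper + 1)]

# Indexing "Krishna" stores these prefix tokens, so a plain term query for
# "krish" matches with no wildcard expansion at search time:
print(edge_ngrams("Krishna"))  # ['kr', 'kri', 'kris', 'krish', 'krishn', 'krishna']
```

The trade-off is a larger index (more tokens per term) in exchange for cheap prefix matching at query time.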
Use the JSON Extractor post-processor to fetch the pattern of data you need to use as the search string.
Give a JSON expression and a match number: 0 picks a match at random, 1 takes the first match, 2 the second, and so on. This makes the search string dynamic.
This replicates the real scenario, since each user will not be searching for the same string.
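For example, against the product JSON from the first question, a JSON Extractor expression like $.Items[0].metadata.dealerName (a hypothetical choice — use whichever field you actually search on) would capture a value to feed back as the search string. In plain Python, the extraction amounts to:

```python
import json

# Sample response body returned by the search endpoint (taken from the question)
body = """
{
  "Items": [
    {"metadata": {"dealerName": "KrishnaKanth Sing, Bhopal"}}
  ]
}
"""

doc = json.loads(body)
# JSON Extractor with the expression $.Items[0].metadata.dealerName captures:
dealer = doc["Items"][0]["metadata"]["dealerName"]
# Derive a single lowercase search term from it for the next request:
term = dealer.split()[0].lower()
print(term)  # krishnakanth
```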
When you put more sequential/concurrent users on the server, it is normal that the response time of each request increases gradually. What you need to watch is the number of failures from the server and the average request time in the summary report.
In general, as a standard, requests should not take more than 10 seconds to respond (this depends on the company and the type of product). Please note that the default timeout of JMeter is around 21 seconds; if a request goes beyond this, it automatically fails (if "Delay thread creation until needed" is disabled in the thread group). But you can also assert an expected response time in the Advanced tab of each request in JMeter.
Related
We have a query of the form:
{
"query": {
"bool": {
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
},
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"boost": 1
}
}
}
If we remove the third filter (the query_string one), performance improves dramatically (typically going from around 2000 ms to 20 ms) for different variants of the above query.
The thing is, the first two filters (on userId and the date range) will always narrow the result down to only a handful of search hits (say, 50 or so).
So, if it were possible to hint that to Elasticsearch, or otherwise affect the query plan, it could solve our issue.
In old (1.x) versions of ES, it seems this was affected by the order of the filters. From Elasticsearch: Order of filters for best performance:
"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A."
But newer versions are smarter - https://www.elastic.co/blog/elasticsearch-query-execution-order:
Q: Does the order in which I put my queries/filters in the query DSL matter?
A: No, because they will be automatically reordered anyway based on their respective costs and match costs.
But is it still possible to reach the desired outcome here by modifying the ES search request somehow?
Your query should look like the one below, so that the filters run first and select only those ~50 or so documents; your costly query_string (costly because of the leading wildcard) will then run only on those 50 docs.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "*MyQuery*",
"fields": [
"aField^1.0",
"anotherField^1.0",
"thirdField^1.0"
],
"boost": 1
}
}
],
"filter": [
{
"term": {
"userId": {
"value": "a_user_id",
"boost": 1
}
}
},
{
"range": {
"date": {
"from": 1648598400000,
"to": 1648684799999,
"boost": 1
}
}
}
],
"boost": 1
}
}
}
Assume I have a compound bool query with various "must" and "should" statements, each of which may include different leaf queries, including "multi_match" and "match_phrase" queries, as below.
How can I get the scores of the individual queries packed into this single query?
I know one way could be to break it down into multiple queries, execute each, and then aggregate the results at the code level (not the query level). However, I suppose that is less efficient; plus, I lose the sorting/pagination/... features of Elasticsearch.
I think the Explain API is also not useful for me, since it provides very low-level details of the scoring (inefficient and hard to parse), while I just need the score of each specific leaf query (which I have also already named).
If I'm wrong on any terminology (e.g. compound, leaf), please correct me. The big picture is how to obtain the individual scores of each sub-query inside a bool query.
PS: I came across Different score functions in bool query. However, it does not return the scores. If I wrap my queries in "function_score", I want the scoring to stay default but obtain the individual scores in the query response.
Please see the snippet below:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "...",
"fields": [
"field1^3",
"field2^5"
],
"_name": "must1_mm",
"boost": 3
}
}
],
"should": [
{
"multi_match": {
"query": "...",
"fields": [
"field3^2",
"field4^5"
],
"_name": "should1_mm",
"boost": 2
}
},
{
"match_phrase": {
"field5": {
"_name": "phrase1",
"boost": 1.5,
"query": "..."
}
}
},
{
"match_phrase": {
"field6": {
"_name": "phrase2",
"boost": 1,
"query": "..."
}
}
}
]
}
}
}
We have an Elasticsearch index containing a catalog of products that we want to search by title and description.
We want the search to satisfy the following constraints:
We are searching title and description for occurrences (matches in title should be twice as important as matches in description).
We want a very light fuzzy search (but still accurate results).
Results that do not match the search term should not be filtered out, but only shown later (so matching results should be on top and worse results at the bottom).
category_id should filter products out (no results from other categories should be shown).
The created_at attribute should also be weighted very highly in sorting.
Products should lose score the "older" they get. (This is very important, because they lose importance with every day.)
I have tried to create a query like that, but the results are far from accurate, sometimes including completely unrelated stuff. I think that's because of the wildcard query.
Also, I think there must be a more elegant solution for the "created_at" scoring. Right?
I am using Elasticsearch 6.2.
This is my current code. I would be happy to see a more elegant solution for this:
{
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"min_score": 0.3,
"size": 12,
"from": 0,
"query": {
"bool": {
"filter": {
"terms": {
"category_id": [
"212",
"213"
]
}
},
"should": [
{
"match": {
"title_completion": {
"query": "Development",
"boost": 20
}
}
},
{
"wildcard": {
"title": {
"value": "*Development*",
"boost": 1
}
}
},
{
"wildcard": {
"title_completion": {
"value": "*Development*",
"boost": 10
}
}
},
{
"match": {
"title": {
"query": "Development",
"operator": "and",
"fuzziness": 1
}
}
},
{
"range": {
"created_at": {
"gte": 1563264817998,
"boost": 11
}
}
},
{
"range": {
"created_at": {
"gte": 1563264040398,
"boost": 4
}
}
},
{
"range": {
"created_at": {
"gte": 1563256264398,
"boost": 1
}
}
}
]
}
}
}
First of all, building a request that returns relevant results is usually a difficult task, and it can't be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid irrelevant results.
We are searching title and description for occurences (matches in title should be twice as important as description)
You can use boost, as you did in your query, to give matches on title more importance than matches on description.
We want it to have a very light fuzzy search result (but still accurate results)
You should use the AUTO value for the fuzziness parameter to get different amounts of fuzziness depending on the length of the term. E.g., by default, terms with fewer than 3 letters (the most common terms, where a single letter change can produce a different word) allow no changes, terms of 3 to 5 letters allow one change, and terms with more than 5 letters allow two changes. You can tune this behavior depending on your tests.
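As an illustration, a match clause using the default AUTO fuzziness (a sketch; the field name and the query term are assumptions):

```json
{
  "match": {
    "title": {
      "query": "develpment",
      "fuzziness": "AUTO"
    }
  }
}
```

The 10-letter term "develpment" falls within the two-edit allowance of AUTO, so it would still match documents containing "development".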
Not matching results to the searchterm should not be filtered out, but only shown later (so matching results should be on top and worse results should be at the bottom)
Use a should clause in the bool statement. Clauses in a should statement do not filter documents (unless specified otherwise); the queries in a should clause are only used to increase the score.
category_id should filter products out (so no results of other categories should be shown)
Use a must or filter clause in the bool statement to ensure that all documents satisfy a constraint. If you don't want these subqueries to contribute to the score (I believe that's your case), use filter instead of must, because filter clauses can have their results cached. Your query is OK for this requirement.
The created_at attribute should be valued very high in sorting as well. products should lose score the "older" they get. (This is very important, because they lose importance with every day)
You should use a function_score query with a decay function. If decay functions are not clear to you, you can skip the equations in the documentation and jump to the figure, which is self-explanatory. The following query is an example using a gauss decay function.
{
"function_score": {
// Name of the decay function
"gauss": {
// Field to use
"created_at": {
"origin": "now", // "now" is the default so you can omit this field
"offset": "1d", // Values with less than 1 day will not be impacted
"scale": "10d", // Duration for which the scores will be scaled using a gauss function
"decay" : 0.01 // Score for values further than scale
}
}
}
}
Hints for writing queries
Avoid wildcard queries: if you use *, they are not efficient and consume a lot of memory. If you want to search within part of a term (finding "penthouse" when the user searches for "house"), you should create a subfield using an ngram tokenizer and write a standard match query against that subfield.
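A sketch of what such an ngram subfield could look like at index-creation time (the analyzer name, gram sizes, and field layout are illustrative assumptions; in 6.x the properties would sit under your mapping type):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tok": { "type": "ngram", "min_gram": 3, "max_gram": 4 }
      },
      "analyzer": {
        "ngram_analyzer": { "tokenizer": "ngram_tok", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_completion": {
        "type": "text",
        "fields": {
          "ngram": { "type": "text", "analyzer": "ngram_analyzer" }
        }
      }
    }
  }
}
```

A match query on title_completion.ngram can then find "house" inside "penthouse" without any wildcard.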
Avoid setting a minimum score: the score is a relative value; a low or a high score does not by itself tell you whether a document is relevant. You can read this article on the subject.
Be careful with fuzzy queries: fuzziness can generate a lot of noise and confuse users. In general, I would recommend raising the default AUTO thresholds for fuzziness and accepting that some queries with misspellings will not return good results. It is usually easier for a user to spot a misspelling in his input than to understand why he got completely unrelated results.
Example query
This is just an example that you will need to adapt with your data.
{
  "size": 12,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "terms": {
              "category_id": <CATEGORY_IDS>
            }
          },
          "should": [
            {
              "match": {
                "title": {
                  "query": <QUERY>,
                  "fuzziness": "AUTO:4:12",
                  "boost": 3
                }
              }
            },
            {
              "match": {
                "title_completion": {
                  "query": <QUERY>,
                  "boost": 1
                }
              }
            },
            {
              "match": {
                // title_completion subfield with ngram tokenizer
                "title_completion.ngram": {
                  "query": <QUERY>,
                  // Lower boost because it matches only partially
                  "boost": 0.5
                }
              }
            }
          ]
        }
      },
      // Name of the decay function
      "gauss": {
        // Field to use
        "created_at": {
          "origin": "now", // "now" is the default, so you can omit this field
          "offset": "1d",  // Values less than 1 day old are not impacted
          "scale": "10d",  // Duration over which the score decays following a gauss curve
          "decay": 0.01    // Score for values further away than scale
        }
      }
    }
  }
}
Summary: I'm trying to understand why two queries that seem very similar in complexity are vastly different in execution speed.
I'm using Elasticsearch 6.4, and I have a name field that I would like to run phonetic queries on.
As an example, I profiled a phonetic query for the search term "Mario" and found that Lucene in the background executes this as a SynonymQuery:
"type": "SynonymQuery",
"description": "Synonym(person.firstName.phonetic:mYrio person.firstName.phonetic:mari person.firstName.phonetic:mario person.firstName.phonetic:mori person.firstName.phonetic:morio)",
and it takes around 200 ms to do so on an index with ~15 million records.
Since it converted my single search term into 5 synonyms, I thought: "Well, what if I search for the same 5 terms without phonetics? Will it be similarly slow?" Or, in other words: "Is it not the phonetic part that makes it slow, but the fact that it has to search for several synonyms?"
But it turns out that if I query the field without phonetics for "mario mYrio mari mori morio", it results in a BooleanQuery (with one term query per synonym as children):
"type": "BooleanQuery",
"description": "person.firstName:mario person.firstName:mYrio person.firstName:mari person.firstName:mori person.firstName:morio",
that takes only 1/10th of the time. Please note: I know and understand that those two queries give different results. I'm not trying to simulate a phonetic search with the second query; I just wanted to see if it would be slow as well, because it seemed to be a query of similar complexity.
For someone like me, who only recently started using Elasticsearch, those two queries look very similar in complexity (search for 5 terms with an OR operator), and I can't understand why one is so much slower than the other.
Any insight would be much appreciated!
Thanks in advance!
regards
Mario
P.S.: I realised it will probably help if I include the two queries I used in this example:
first query (phonetic):
{
"profile": true,
"size": 1,
"timeout": "10s",
"query": {
"bool": {
"should": [
{
"match": {
"person.firstName.phonetic": {
"query": "mario",
"operator": "OR",
"prefix_length": 0,
"max_expansions": 50,
"fuzzy_transpositions": true,
"lenient": false,
"zero_terms_query": "NONE",
"auto_generate_synonyms_phrase_query": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
second query (non-phonetic):
{
"profile": true,
"size": 1,
"timeout": "10s",
"query": {
"bool": {
"should": [
{
"match": {
"person.firstName": {
"query": "mario myrio mari mori morio",
"operator": "OR",
"fuzziness": "0",
"prefix_length": 3,
"max_expansions": 50,
"fuzzy_transpositions": true,
"lenient": false,
"zero_terms_query": "NONE",
"auto_generate_synonyms_phrase_query": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
I would say it's pretty clear what the difference between those two is: the rewrite process, i.e., expanding the term mario to the synonyms that exist. This process basically requires going through the SynonymGraphFilter, which, I believe, reads the synonym data from disk, and that makes things slower.
In the case of the boolean query, the match goes through a different analyzer chain (which, I believe, is just the same as the phonetic one, but without synonyms).
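One way to check what each chain actually emits is the _analyze API; a sketch, assuming a hypothetical index name and the field from the question:

```json
POST /people/_analyze
{
  "field": "person.firstName.phonetic",
  "text": "Mario"
}
```

Comparing the tokens returned for person.firstName.phonetic with those for person.firstName shows whether the expansion to mYrio, mari, etc. happens at analysis time.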
I have this ES query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "test",
"fields": [
"name^-1.0",
"id^-1.0",
"address.city^-1.0",
"address.street^-1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1.0,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
Currently what happens is: when I search for a person with the name john, I get a bunch of results where the id, address.city, or address.street contains john, which is fine. But I want name to be more important; also, if I have two people in ES, john and someone with two names like george john, I would want the plain john to come up first.
Can I do that? :)
To make any field more important than the others, you can set its boost to a higher value. So fieldA^4 and fieldB^1 implies that fieldA is 4 times more important than fieldB. Therefore, give a higher boost value to the name field to make it more important for scoring.
For the second point: a document whose name field value is john will score higher than a document whose name field value is george john (assuming the other fields hold the same data in both documents). The reason you are getting the second doc (george john) higher in the results is that you have boosted all the fields with negative values.
So, to cater to both of your points:
give a higher boost to name
make the boost of all fields a positive value.
So the query should look as below:
{
//"explain": true,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "john",
"fields": [
"name^4.0",
"id^1.0",
"address.city^1.0",
"address.street^1.0"
],
"type": "phrase_prefix",
"lenient": "true"
}
}
],
"boost": 1,
"minimum_should_match": "1"
}
},
"from": 0,
"size": 20
}
To understand more about how Elasticsearch calculates the score of a matching document, you can add "explain": true to your query. The result will then include the detailed steps taken to calculate the score.