Lucene vs Elasticsearch query syntax

I can see that Elasticsearch supports both Lucene syntax and its own query language.
You can use either and get the same kind of results.
Example (there may be other ways to write this, but it shows what I mean):
Both of these queries produce the same result; the first uses Lucene syntax and the second uses the Elasticsearch query syntax.
GET /index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "field101:Denmark"
          }
        }
      ]
    }
  }
}
GET /index/_search
{
  "query": {
    "match": {
      "field101": {
        "query": "Denmark"
      }
    }
  }
}
I was wondering whether choosing one approach over the other has any implications (such as performance or certain optimizations). Or is the Elasticsearch query syntax simply translated into a Lucene query somewhere, since Elasticsearch runs Lucene as its underlying search engine?

I was wondering whether choosing one approach over the other has any implications (such as performance or certain optimizations).
The Elasticsearch query DSL is converted into a Lucene query under the hood. You can set "profile": true in the request to see the resulting Lucene query and how long each part of it takes.
I would say there are no significant performance implications, and you should always use the DSL, because in many cases Elasticsearch will apply optimizations for you. Also, query_string expects well-formed Lucene syntax, so malformed input causes syntax errors (try "Denmark AND" as the query_string query).
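As a rough illustration of the syntax-error point, the sketch below builds both request bodies as plain Python dicts (no client library; the helper names match_body and query_string_body are made up for this example). In the match body the user text is passed as data and is only analyzed, while in the query_string body the same text is handed to the Lucene query parser, so a dangling operator such as "Denmark AND" can be rejected.

```python
import json


def match_body(field, text):
    """Query DSL form: the user text is a plain value and is only analyzed."""
    return {"query": {"match": {field: {"query": text}}}}


def query_string_body(field, text):
    """Lucene syntax form: the whole string goes through the query parser."""
    return {"query": {"query_string": {"query": f"{field}:{text}"}}}


user_input = "Denmark AND"  # trailing operator, not valid Lucene syntax

# Both bodies can be sent to GET /index/_search; only the second one
# depends on user_input being well-formed Lucene syntax.
print(json.dumps(match_body("field101", user_input), indent=2))
print(json.dumps(query_string_body("field101", user_input), indent=2))
```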
Or is the Elasticsearch query syntax simply translated into a Lucene query somewhere, since Elasticsearch runs Lucene as its underlying search engine?
Yes. You can try it yourself:
GET test_lucene/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "field101:Denmark"
          }
        }
      ]
    }
  }
}
will produce:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"profile": {
"shards": [
{
"id": "[KGaFbXIKTVOjPDR0GrI4Dw][test_lucene][0]",
"searches": [
{
"query": [
{
"type": "TermQuery",
"description": "field101:denmark",
"time_in_nanos": 3143,
"breakdown": {
"set_min_competitive_score_count": 0,
"match_count": 0,
"shallow_advance_count": 0,
"set_min_competitive_score": 0,
"next_doc": 0,
"match": 0,
"next_doc_count": 0,
"score_count": 0,
"compute_max_score_count": 0,
"compute_max_score": 0,
"advance": 0,
"advance_count": 0,
"score": 0,
"build_scorer_count": 0,
"create_weight": 3143,
"shallow_advance": 0,
"create_weight_count": 1,
"build_scorer": 0
}
}
],
"rewrite_time": 2531,
"collector": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time_in_nanos": 1115
}
]
}
],
"aggregations": []
}
]
}
}
And
GET /test_lucene/_search
{
  "profile": true,
  "query": {
    "match": {
      "field101": {
        "query": "Denmark"
      }
    }
  }
}
will produce the same Lucene query:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"profile": {
"shards": [
{
"id": "[KGaFbXIKTVOjPDR0GrI4Dw][test_lucene][0]",
"searches": [
{
"query": [
{
"type": "TermQuery",
"description": "field101:denmark",
"time_in_nanos": 3775,
"breakdown": {
"set_min_competitive_score_count": 0,
"match_count": 0,
"shallow_advance_count": 0,
"set_min_competitive_score": 0,
"next_doc": 0,
"match": 0,
"next_doc_count": 0,
"score_count": 0,
"compute_max_score_count": 0,
"compute_max_score": 0,
"advance": 0,
"advance_count": 0,
"score": 0,
"build_scorer_count": 0,
"create_weight": 3775,
"shallow_advance": 0,
"create_weight_count": 1,
"build_scorer": 0
}
}
],
"rewrite_time": 3483,
"collector": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time_in_nanos": 1780
}
]
}
],
"aggregations": []
}
]
}
}
As you can see, the times are in nanoseconds, not even milliseconds, which suggests the conversion is fast.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
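To compare the two profile outputs above without reading them line by line, something like the following could be used. It is a minimal sketch in plain Python that only walks the response structure shown above (profile, shards, searches, query); the function names are hypothetical, and the sample dicts are trimmed copies of the two responses.

```python
def summarize_profile(response):
    """Collect (type, description, time_in_nanos) for each top-level profiled query."""
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for query in search.get("query", []):
                rows.append(
                    (query["type"], query["description"], query["time_in_nanos"])
                )
    return rows


def same_lucene_query(response_a, response_b):
    """True if both responses were executed as the same Lucene queries (timings ignored)."""
    def strip(rows):
        return [(qtype, desc) for qtype, desc, _ in rows]
    return strip(summarize_profile(response_a)) == strip(summarize_profile(response_b))


def sample(time_in_nanos):
    # Trimmed copy of the profile responses shown above.
    return {"profile": {"shards": [{"searches": [{"query": [
        {"type": "TermQuery", "description": "field101:denmark",
         "time_in_nanos": time_in_nanos}
    ]}]}]}}


query_string_response = sample(3143)
match_response = sample(3775)

print(summarize_profile(query_string_response))
print(same_lucene_query(query_string_response, match_response))
```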

Related

Elasticsearch multiple indices suggestions

I have the following problem. This is my implementation of a "did you mean" query. If I use only one index, the results fit perfectly. If I use multiple indices, I won't get any results.
Does this query only work for single indices?
GET index1/_search
{
"suggest": {
"text": "exmple",
"multi_phrase": {
"phrase": {
"field": "all",
"size": 5,
"gram_size": 3,
"collate": {
"query": {
"source": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"multi_match": {
"query": "{{suggestion}}",
"type": "cross_fields",
"fields": [
"name",
"name2"
],
"operator": "AND",
"lenient": true
}
}
}
}
},
"params": {
"field_name": "all"
}
}
}
}
}
}
If I run this query against a single index, everything works fine. If I use multiple indices, the results are empty.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"multi_phrase": [
{
"text": "example",
"offset": 0,
"length": 9,
"options": []
}
]
}
}
I found the solution on my own. I have to use the confidence parameter.
The confidence level defines a factor applied to the input phrases
score which is used as a threshold for other suggest candidates. Only
candidates that score higher than the threshold will be included in
the result. For instance a confidence level of 1.0 will only return
suggestions that score higher than the input phrase. If set to 0.0 the
top N candidates are returned. The default is 1.0.
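As an illustration only, the sketch below adds a confidence value to the phrase suggester from the question, using plain Python dicts. The helper name with_confidence is made up, the request is a trimmed copy of the one above (the collate section is omitted), and 0.0 is just the example value from the quoted documentation; the value that works for a given set of indices has to be found by experiment.

```python
import copy


def with_confidence(suggest_request, suggester_name, confidence):
    """Return a copy of the request with `confidence` set on the named phrase suggester."""
    body = copy.deepcopy(suggest_request)
    body["suggest"][suggester_name]["phrase"]["confidence"] = confidence
    return body


# Trimmed version of the request from the question (collate omitted).
request = {
    "suggest": {
        "text": "exmple",
        "multi_phrase": {
            "phrase": {"field": "all", "size": 5, "gram_size": 3}
        },
    }
}

# 0.0 returns the top N candidates regardless of the input phrase's score.
relaxed = with_confidence(request, "multi_phrase", 0.0)
print(relaxed["suggest"]["multi_phrase"]["phrase"])
```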

ElasticSearch errors in deleting records by query

I am trying to delete a large number of documents in ES via delete_by_query,
but I am seeing the following errors.
Query
POST indexName/typeName/_delete_by_query
{
"size": 100000,
"query": {
"bool": {
"must": [
{
"range": {
"CREATED_TIME": {
"gte": 0,
"lte": 1507316563000
}
}
}
]
}
}
}
Result
{
"took": 50489,
"timed_out": false,
"total": 100000,
"deleted": 0,
"batches": 1,
"version_conflicts": 1000,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1,
"throttled_until_millis": 0,
"failures": [
{
"index": "indexName",
"type": "typeName",
"id": "HVBLdzwnImXdVbq",
"cause": {
"type": "version_conflict_engine_exception",
"reason": "[typeName][HVBLdzwnImXdVbq]: version conflict, current version [2] is different than the one provided [1]",
"index_uuid": "YPJcVQZqQKqnuhbC9R7qHA",
"shard": "1",
"index": "indexName"
},
"status": 409
},....
Please read this article.
You have two ways of handling this issue: tell Elasticsearch to ignore version conflicts either in the URL or in the request body:
If you’d like to count version conflicts rather than cause them to abort then set conflicts=proceed on the url or "conflicts": "proceed" in the request body.
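A minimal sketch of the request-body variant, again with plain Python dicts: the helper name proceed_on_conflicts is hypothetical, and the query is copied from the question. The resulting body would be posted to the same _delete_by_query endpoint as before.

```python
import copy
import json


def proceed_on_conflicts(body):
    """Return a copy of a delete_by_query body that counts version conflicts instead of aborting."""
    updated = copy.deepcopy(body)
    updated["conflicts"] = "proceed"
    return updated


# Query from the question, unchanged.
original = {
    "query": {
        "bool": {
            "must": [
                {"range": {"CREATED_TIME": {"gte": 0, "lte": 1507316563000}}}
            ]
        }
    }
}

body = proceed_on_conflicts(original)
print(json.dumps(body, indent=2))
```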

Boosting elastic aggregation result

I have an Elasticsearch index for products. Each product has a Brand attribute, and I "have to" create an aggregation that returns the Brands of the products.
My Sample Query:
GET /products/product/_search
{
"size": 0,
"aggs": {
"myFancyFilter": {
"filter": {
"match_all": {}
},
"aggs": {
"inner": {
"terms": {
"field": "Brand",
"size": 3
}
}
}
}
},
"query": {
"match_all": {}
}
}
And the result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 236952,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 236952,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 139267,
"buckets": [
{
"key": "Brand1",
"doc_count": 3144
},
{
"key": "Brand2",
"doc_count": 1759
},
{
"key": "Brand3",
"doc_count": 1737
}
]
}
}
}
}
It works perfectly for me. Elasticsearch sorts the buckets by doc_count; however, I would like to manipulate the bucket order in the result. For example, assume that I have Brand5 and I want to move it up to position #2, so that the result comes back in the order Brand1, Brand5, Brand3.
If this were a query rather than an aggregation, I could use function_score, but here I have no idea how to do it. Any clues?
What you are looking for is a way to define your own sort order and have it applied to the aggregation in Elasticsearch. I've been able to come up with a solution by renaming the aggregation terms in the following manner:
Brand1 to a_Brand1
Brand5 to b_Brand5
Brand3 to c_Brand3
And then apply sorting on the terms so that sorting happens lexicographically.
Of course this may not be the exact or the best solution but I felt this can help.
Below is the query that I've used. Please note that my field name is brand and it is a multifield and I'm using the field brand.keyword.
POST testdataindex/_search
{
"size":0,
"query":{
"match_all":{
}
},
"aggs":{
"myFancyFilter":{
"filter":{
"match_all":{
}
},
"aggs":{
"inner":{
"terms":{
"script":{
"lang":"painless",
"inline":"if(params.newNames.containsKey(doc['brand.keyword'].value)) { return params.newNames[doc['brand.keyword'].value];} return null;",
"params":{
"newNames":{
"Brand1":"a_Brand1",
"Brand5":"b_Brand5",
"Brand3":"c_Brand3"
}
}
},
"order":{
"_term":"asc"
}
}
}
}
}
}
}
I've created sample data with the brand names Brand1, Brand3 and Brand5; the results appear as shown below. Note the change in the term names.
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 8,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a_Brand1",
"doc_count": 2
},
{
"key": "b_Brand5",
"doc_count": 4
},
{
"key": "c_Brand3",
"doc_count": 2
}
]
}
}
}
}
Hope it helps!
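A different approach from the renaming script above, shown here only as a sketch, is to leave the aggregation untouched and reorder the returned buckets on the client. The helper pin_bucket is hypothetical, the Brand5 count is invented for the example, and the terms aggregation's size would need to be large enough for Brand5 to be returned at all.

```python
def pin_bucket(buckets, key, position):
    """Return a new list with the bucket named `key` moved to the 0-based `position`."""
    pinned = [b for b in buckets if b["key"] == key]
    if not pinned:
        # Nothing to move: the bucket was not returned by the aggregation.
        return list(buckets)
    rest = [b for b in buckets if b["key"] != key]
    return rest[:position] + pinned + rest[position:]


# Buckets as returned by the terms aggregation, sorted by doc_count.
buckets = [
    {"key": "Brand1", "doc_count": 3144},
    {"key": "Brand2", "doc_count": 1759},
    {"key": "Brand3", "doc_count": 1737},
    {"key": "Brand5", "doc_count": 900},  # hypothetical count for illustration
]

reordered = pin_bucket(buckets, "Brand5", 1)
print([b["key"] for b in reordered])
```

The other buckets keep their doc_count order, and the original keys are preserved, which the renaming approach does not do.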

Execution plan of es search?

I want to use Elasticsearch 5's Profile API to find the search execution plan.
Here are my two different queries:
1:
"query": {
"bool": {
"should": [
{
"term": {
"field1": "f1"
}
},
{
"range": {
"field3": {
"gt": "1"
}
}
}
]
}
}
2:
"query": {
"bool": {
"filter": [
{
"term": {
"field1": "f1"
}
},
{
"range": {
"field3": {
"gt": "1"
}
}
}
]
}
}
The only difference between the two queries is that query 1 uses should and query 2 uses filter.
But the children profile results of two queries are the same.
Such as:
"children": [
{
"type": "BooleanQuery",
"description": "field1:f1 field3:[2 TO 9223372036854775807]",
"time": "0.06531500000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 65314,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "field1:f1",
"time": "0.05287500000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 52874,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
}
},
{
"type": "",
"description": "field3:[2 TO 9223372036854775807]",
"time": "0.001556000000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 1555,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
}
}
]
}]
Does Elasticsearch create multiple threads, have each thread execute one filter item, and then merge the results at the end?
My understanding, however, is that ES uses a filter pipeline and executes the filter items one by one, as elasticsearch-order-of-filters-for-best-performance says:
In case 1, the execution will be slower because all documents from the past month will need to go through filter A first, which is not cached.
In case 2, you first filter out all the documents without the type XYZ, which is fast because filter B is cached. Then the documents that made it through filter B can go through filter A. So even though filter A is not cached, the execution will still be faster since there are far fewer documents left in the filter pipeline.
How does ES execute multiple filter items?

ElasticSearch - Average aggregation/sort over multivalued non-unique numeric fields

I am trying to sort by the average of a multivalued field called 'rating_average'. In the example below, the values for this field are [1, 2, 2]. I expect the average to be (1+2+2)/3 = 1.66666667, but in reality I get 1.5.
After a few tests and after analyzing the extended stats, I discovered that this happens because the average is calculated over the unique values only. The statistical operators are applied to the set [1, 2] instead of [1, 2, 2]. I also confirmed this by adding an aggregations section to my query to double-check that the average calculated for the sort block is identical to the one in the stats aggregation.
An example document is the following:
{
  "_source": {
    "content_uri": "http://data.semint.co.uk/resource/testContent1",
    "rating_average": [
      "1",
      "2",
      "2"
    ],
    "forDesk": "http://data.semint.co.uk/resource/kMFMJd1rtKD"
  }
}
The query I'm performing is the following:
{
"from": 0,
"size": 20,
"aggs": {
"rating_stats": {
"extended_stats": {
"field": "rating_average"
}
}
},
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"terms": {
"mediaType": [
"http://data.semint.co.uk/resource/testMediaType3"
],
"execution": "and"
}
}
]
}
}
}
},
"fields": [ "content_uri", "rating_average"],
"sort": [
{
"rating_average": {
"order": "desc",
"mode": "avg"
}
}
]
}
And these are the results I get from executing the query against the document above.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "travel_content6",
"_type": "semantic-index",
"_id": "http://data.semint.co.uk/resource/testContent1",
"_score": null,
"fields": {
"content_uri": [
"http://data.semint.co.uk/resource/testContent1"
],
"rating_average": [1, 2, 2]
},
"sort": [
1.5
]
}
]
},
"aggregations": {
"rating_stats": {
"count": 2,
"min": 1,
"max": 2,
"avg": 1.5,
"sum": 3,
"sum_of_squares": 5,
"variance": 0.25,
"std_deviation": 0.5,
"std_deviation_bounds": {
"upper": 2.5,
"lower": 0.5
}
}
}
}
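The numbers in the response above can be reproduced with a few lines of arithmetic. This sketch (plain Python; extended_stats is a hypothetical helper, not the Elasticsearch implementation) computes the same statistics over the full list and over its unique values. The figures reported by Elasticsearch (count 2, sum 3, sum_of_squares 5, avg 1.5, variance 0.25) match the unique-values case, which is consistent with the duplicate 2 being dropped before the statistics are computed.

```python
def extended_stats(values):
    """Compute count, sum, sum of squares, average and population variance."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    variance = sum_of_squares / count - avg * avg
    return {
        "count": count,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "avg": avg,
        "variance": variance,
    }


values = [1, 2, 2]              # what the document contains
unique = sorted(set(values))    # [1, 2], duplicates removed

all_stats = extended_stats(values)
unique_stats = extended_stats(unique)

print("all values:   ", all_stats)
print("unique values:", unique_stats)
```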

Resources