I want to use Elasticsearch 5's profile API to see the search execution plan.
Here are my two different queries:
1:
"query": {
"bool": {
"should": [
{
"term": {
"field1": "f1"
}
},
{
"range": {
"field3": {
"gt": "1"
}
}
}
]
}
}
2:
"query": {
"bool": {
"filter": [
{
"term": {
"field1": "f1"
}
},
{
"range": {
"field3": {
"gt": "1"
}
}
}
]
}
}
The only difference between the two queries is that query 1 uses should and query 2 uses filter.
But the children profile results of the two queries are the same, for example:
"children": [
{
"type": "BooleanQuery",
"description": "field1:f1 field3:[2 TO 9223372036854775807]",
"time": "0.06531500000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 65314,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "field1:f1",
"time": "0.05287500000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 52874,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
}
},
{
"type": "",
"description": "field3:[2 TO 9223372036854775807]",
"time": "0.001556000000ms",
"breakdown": {
"score": 0,
"build_scorer_count": 0,
"match_count": 0,
"create_weight": 1555,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 0,
"build_scorer": 0,
"advance": 0,
"advance_count": 0
}
}
]
}]
Does Elasticsearch create multiple threads, with each thread executing one filter item, and then merge the results at the end?
But in my mind, I thought ES would use a filter pipeline and execute the filter items one by one. For example, elasticsearch-order-of-filters-for-best-performance says:
In case 1, the execution will be slower because all documents from the past month will need to go through filter A first, which is not cached.
In case 2, you first filter out all the documents without the type XYZ, which is fast because filter B is cached. Then the documents that made it through filter B can go through filter A. So even though filter A is not cached, the execution will still be faster since there are far fewer documents left in the filter pipeline.
How does ES execute multiple filter items?
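For reference, the two cases from that article can be sketched roughly like this. The field names timestamp and type are illustrative (they are not from the article verbatim), and note that modern ES/Lucene versions may reorder clauses internally regardless of how you write them.

Case 1 — the uncached date filter (A) listed before the cached term filter (B):

"query": {
  "bool": {
    "filter": [
      { "range": { "timestamp": { "gte": "now-1M" } } },
      { "term": { "type": "XYZ" } }
    ]
  }
}

Case 2 — the cached term filter (B) first:

"query": {
  "bool": {
    "filter": [
      { "term": { "type": "XYZ" } },
      { "range": { "timestamp": { "gte": "now-1M" } } }
    ]
  }
}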
Related
I can see that Elasticsearch supports both Lucene syntax and its own query language.
You can use both and get the same kinds of results.
Example (it might be done differently, but this shows what I mean):
Both of these queries produce the same result, one using Lucene syntax and the other the Elastic query DSL.
GET /index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "field101:Denmark"
          }
        }
      ]
    }
  }
}
GET /index/_search
{
  "query": {
    "match": {
      "field101": {
        "query": "Denmark"
      }
    }
  }
}
I was wondering: are there any implications when choosing one approach over the other (like performance or certain optimizations)? Or is the Elastic query syntax just translated into a Lucene query somewhere, since Elastic runs Lucene as its underlying search engine?
I was wondering: are there any implications when choosing one approach over the other (like performance or certain optimizations)?
The Elasticsearch DSL is converted into a Lucene query under the hood; you can set "profile": true in the query to see how that works and exactly how much time the conversion takes.
I would say there are no important performance implications, and you should always use the DSL, because in many cases Elasticsearch will do optimizations for you. Also, query_string expects well-formed Lucene queries, so you can get syntax errors (try sending "Denmark AND" as a query_string).
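For example, a minimal sketch of that failure mode (the index name is illustrative):

GET /index/_search
{
  "query": {
    "query_string": {
      "query": "Denmark AND"
    }
  }
}

This fails to parse because the Lucene expression is incomplete, whereas the equivalent match query simply treats the text as terms to analyze and never raises a syntax error.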
Or is the Elastic query syntax just translated into a Lucene query somewhere, since Elastic runs Lucene as its underlying search engine?
Yes. You can try it yourself:
GET test_lucene/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "field101:Denmark"
          }
        }
      ]
    }
  }
}
will produce:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "profile": {
    "shards": [
      {
        "id": "[KGaFbXIKTVOjPDR0GrI4Dw][test_lucene][0]",
        "searches": [
          {
            "query": [
              {
                "type": "TermQuery",
                "description": "field101:denmark",
                "time_in_nanos": 3143,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 0,
                  "match": 0,
                  "next_doc_count": 0,
                  "score_count": 0,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 0,
                  "advance_count": 0,
                  "score": 0,
                  "build_scorer_count": 0,
                  "create_weight": 3143,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 0
                }
              }
            ],
            "rewrite_time": 2531,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 1115
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}
And
GET /test_lucene/_search
{
  "profile": true,
  "query": {
    "match": {
      "field101": {
        "query": "Denmark"
      }
    }
  }
}
will produce the same result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "profile": {
    "shards": [
      {
        "id": "[KGaFbXIKTVOjPDR0GrI4Dw][test_lucene][0]",
        "searches": [
          {
            "query": [
              {
                "type": "TermQuery",
                "description": "field101:denmark",
                "time_in_nanos": 3775,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 0,
                  "match": 0,
                  "next_doc_count": 0,
                  "score_count": 0,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 0,
                  "advance_count": 0,
                  "score": 0,
                  "build_scorer_count": 0,
                  "create_weight": 3775,
                  "shallow_advance": 0,
                  "create_weight_count": 1,
                  "build_scorer": 0
                }
              }
            ],
            "rewrite_time": 3483,
            "collector": [
              {
                "name": "SimpleTopScoreDocCollector",
                "reason": "search_top_hits",
                "time_in_nanos": 1780
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}
As you can see, the times are in nanoseconds, not even milliseconds, which shows that the conversion is fast.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
I have documents that look like:
[
  {
    "price": 10,
    "market.id": 1,
    "product.id": 1
  },
  {
    "price": 2,
    "market.id": 3,
    "product.id": 1
  },
  {
    "price": 5,
    "market.id": 3,
    "product.id": 2
  }
]
In order to count the number of markets in a given average price interval, I made this ES query:
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "term": {
          "price": 0
        }
      }
    }
  },
  "aggs": {
    "price_ranges": {
      "histogram": {
        "field": "price",
        "interval": 0.5,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 0,
          "max": 45
        }
      },
      "aggs": {
        "market_count": {
          "cardinality": {
            "field": "market.id"
          }
        }
      }
    }
  }
}
The problem here is that it buckets on the raw price of each document, not on the average price per market.
I don't know if it's possible to do this directly with an ES query.
I am trying to delete a large number of documents in ES via _delete_by_query, but I am seeing the following errors.
Query
POST indexName/typeName/_delete_by_query
{
  "size": 100000,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "CREATED_TIME": {
              "gte": 0,
              "lte": 1507316563000
            }
          }
        }
      ]
    }
  }
}
Result
{
  "took": 50489,
  "timed_out": false,
  "total": 100000,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 1000,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "indexName",
      "type": "typeName",
      "id": "HVBLdzwnImXdVbq",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[typeName][HVBLdzwnImXdVbq]: version conflict, current version [2] is different than the one provided [1]",
        "index_uuid": "YPJcVQZqQKqnuhbC9R7qHA",
        "shard": "1",
        "index": "indexName"
      },
      "status": 409
    }, ....
Please read this article.
You have two ways of handling this issue: set a URL parameter to ignore version conflicts, or set the request body to ignore version conflicts:
If you’d like to count version conflicts rather than cause them to abort then set conflicts=proceed on the url or "conflicts": "proceed" in the request body.
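Applied to the query above, that would look like either of the following (same range query, with the conflicts option added):

POST indexName/typeName/_delete_by_query?conflicts=proceed
{
  "query": {
    "range": {
      "CREATED_TIME": {
        "gte": 0,
        "lte": 1507316563000
      }
    }
  }
}

or, equivalently, in the request body:

POST indexName/typeName/_delete_by_query
{
  "conflicts": "proceed",
  "query": {
    "range": {
      "CREATED_TIME": {
        "gte": 0,
        "lte": 1507316563000
      }
    }
  }
}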
I am new to Elasticsearch. I want to count documents based on an id, but I want to pass an array of ids, like "myId": [1,2,3,4,5].
For every id I want a count.
Current input
GET /probedb_v1/probe/_count
{
  "query": {
    "match_phrase": {
      "myId": 1
    }
  }
}
Current output
{ "count": 6929,
"_shards":{ "total": 1,
"successful": 1,
"failed": 0
}
}
What input would give me my required output?
Required Output
{ "count": [6929,5222,65241,5241,6521],
"_shards":{ "total": 1,
"successful": 1,
"failed": 0
}
}
I also need code for the Elasticsearch Java API.
You can do it like this:
GET /probedb_v1/probe/_search
{
  "size": 0,
  "query": {
    "terms": {
      "myId": [123, 44]
    }
  },
  "aggs": {
    "NAME": {
      "terms": {
        "field": "myId",
        "size": 50
      }
    }
  }
}
This will give you this output:
"aggregations": {
"NAME": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123,
"doc_count": 3
},
{
"key": 44,
"doc_count": 2
}
]
}
}
Contrary to the current ES documentation http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-mlt-field-query.html, stop_words is an unsupported field for more_like_this_field queries in my ES installation.
version: 0.90.5
build_hash: c8714e8e0620b62638f660f6144831792b9dedee
build_timestamp: 2013-09-17T12:50:20Z
build_snapshot: false
lucene_version: 4.4
Can anyone confirm this?
This is what I am sending to the server:
{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {
            "boost": 1
          }
        },
        {
          "more_like_this_field": {
            "FieldA": {
              "like_text": "House",
              "boost": 1,
              "min_doc_freq": 0,
              "min_word_len": 0,
              "min_term_freq": 0
            }
          }
        }
      ],
      "should": [
        {
          "more_like_this_field": {
            "Equipped": {
              "like_text": "pool garage",
              "boost": 0,
              "min_doc_freq": 0,
              "min_word_len": 0,
              "min_term_freq": 0,
              "stop_words": "garden"
            }
          }
        },
        {
          "more_like_this_field": {
            "Neighbourhood": {
              "like_text": "school",
              "boost": 5,
              "min_doc_freq": 0,
              "min_word_len": 0,
              "min_term_freq": 0
            }
          }
        }
      ],
      "minimum_number_should_match": 2
    }
  }
}
and this is what I get back:
QueryParsingException[[data] [mlt_field] query does not support [stop_words]];
The same happens with the more_like_this query; the reason is that stop_words must be an array:
{
  "more_like_this_field": {
    "Equipped": {
      "like_text": "pool garage",
      "boost": 0,
      "min_doc_freq": 0,
      "min_word_len": 0,
      "min_term_freq": 0,
      "stop_words": ["garden"]
    }
  }
}