I want to know the time taken by a count query in Elasticsearch, just like the search API response, which contains took (the time taken).
My Query looks like -
curl -XGET "http://localhost:9200/index1/type1/_count"
And result for that query -
{
"count": 136,
"_shards": {
"total": 15,
"successful": 15,
"failed": 0
}
}
Is there any way to get the time taken for a count query, just like with the search API?
Document for count API - Count API
At the time of writing this answer it is still not supported by Elastic; I have raised a feature request and will most likely work on adding support for it myself.
A trick that can help with that is to use _search with:
size set to zero (so no results will be returned);
track_total_hits set to true (so it will count all hits, not only the ones in the result window); and
filter_path equal to took,hits.total.value.
For example, I executed the following query in a cluster of mine...
GET viagens-*/_search?filter_path=took,hits.total.value
{
"size": 0,
"track_total_hits": true,
"query": {
"match_all": {}
}
}
...and got this result:
{
"took": 2,
"hits": {
"total": {
"value": 2589552
}
}
}
It does not profile the Count API itself, unfortunately, but it produces a similar result. It can be a very useful alternative in some situations!
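If you call this from code, the request can be assembled programmatically. A minimal Python sketch (the build_count_request helper is a hypothetical name; only the endpoint parameters come from the example above):

```python
import json

# Build the "_search as a timed count" request described above:
# size 0 suppresses documents, track_total_hits counts every match,
# and filter_path trims the reply down to took + hits.total.value.
def build_count_request(index):
    path = f"/{index}/_search?filter_path=took,hits.total.value"
    body = json.dumps({
        "size": 0,
        "track_total_hits": True,
        "query": {"match_all": {}},
    })
    return path, body
```

Send the returned body to the returned path with any HTTP client; the reply then contains only took and hits.total.value.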
Related
I'm having an issue with a Kibana dashboard, which shows multiple Courier Fetch: xxx of 345 shards failed. warning messages every time I reload it.
Okay, I'm asking for data spanning the last 15 minutes, and I have an index per day. There is no way today's index contains 345 shards. So why does my query span so many shards?
Things I have checked:
Number of indices and of shards per index:
I checked this using the _cat/indices endpoint: after filtering out indices I didn't create myself (such as Kibana's indices, basically everything that starts with a dot), I have 69 indices, each containing 5 shards (adding up to a total of 345 shards). That's what I was expecting.
This basically means that my search is executed on all of my indices.
I'm not writing new data to old indices:
Here is a query for last hour's records on today's index1:
GET 20181027_logs/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1543326215000,
"lte": 1543329815000,
"format": "epoch_millis"
}
}
}
]
}
}
}
Answer (truncated):
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1557,
Same query without restricting the index:
GET *_logs/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": 1543326215000,
"lte": 1543329815000,
"format": "epoch_millis"
}
}
}
]
}
}
}
Answer (truncated):
{
"took": 24,
"timed_out": false,
"_shards": {
"total": 345,
"successful": 345,
"failed": 0
},
"hits": {
"total": 1557,
We can see that the second query returns exactly the same results as the first one, but searches through every index.
My timestamp field is indexed:
By default, every field in Elasticsearch is indexed, but I still double-checked it:
GET 20181027_logs/_mapping
{
"20181027_logs": {
"mappings": {
"logs": {
"properties": {
[…]
"timestamp": {
"type": "date"
}
[…]
While a non-indexed field would give2:
"timestamp": {
"type": "date",
"index": false
}
Remaining leads
At this point, I have really no idea what could be the issue.
Just as a side note: the timestamp field is not the insertion date of the event, but the date at which the event actually happened. Regardless of this timestamp, the events are inserted in the latest index.
This means that every index can have events corresponding to past dates, but no future dates.
In this precise case, I don't see how this could matter : since we're only querying for the last 15 minutes, the data can only be in the last index no matter what happens.
Elasticsearch and Kibana version: 5.4.3
Thanks for reading this far, and any help would be greatly appreciated !
1: There's a mistake in index naming, causing an offset between the index name and the actual corresponding date, but it should not matter here.
2: This was checked on another Elasticsearch cluster of the same version, with some fields explicitly opted out of indexing.
TL;DR
I finally solved the issue simply by reducing the number of shards.
Full disclosure
When using the dev tools in Kibana, I could find many errors on the _msearch endpoint:
{
"shard": 2,
"index": "20180909_logs",
"node": "FCv8yvbyRhC9EPGLcT_k2w",
"reason": {
"type": "es_rejected_execution_exception",
"reason": "rejected execution of org.elasticsearch.transport.TransportService$7#754fe283 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#16a14433[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 16646]]"
}
},
Which basically proves that I'm flooding my ES server with too many parallel requests on too many shards.
From what I could understand, it's apparently normal for Kibana to query every single index of my index pattern, even if some of them don't contain any fresh data (ES is supposed to query them anyway, and conclude in almost no time that they don't contain any matching data, since the timestamp field is indexed).
From there, I had a few options :
1: Reduce the data retention
2: Reduce the number of parallel requests I am doing
3: Add nodes to my cluster
4: Restructure my data to use fewer shards
5: Increase the size of the search queue
1 and 2 are not an option in my case.
5 would probably work, but it is apparently highly recommended against (from what I could understand, in most cases this error is only the symptom of deeper issues that should be fixed instead).
This is a 160GB single-node cluster, with (now) more than 350 shards. This makes an extremely low average size per shard, so I decided to first try number 4 : Reindex my data to use fewer shards.
How I did it
Use a single shard per index:
I created the following index template:
PUT _template/logs
{
"template": "*_logs",
"settings": {
"number_of_shards": 1
}
}
Now, all my future indices will have a single shard.
I still need to reindex or merge the existing indices, but this has to be done with the next point anyway.
Switch to monthly indices (instead of daily)
I modified the code that inserts data into ES to use a month-based index name (such as 201901_monthly_logs), and then reindexed every old index to the corresponding one in the new pattern:
POST _reindex
{
"source": {
"index": "20181024_logs"
},
"dest": {
"index": "201810_monthly_logs"
}
}
Enjoy !
This being done, I was down to 7 indices (and 7 shards as well).
All that was left was changing the index pattern from *_logs to *_monthly_logs in my Kibana visualisations.
I haven't had any issue since. I'll just wait a bit longer, then delete my old indices.
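For many daily indices, the renaming and the _reindex bodies can be generated in a loop. A sketch (the helper names are mine; the index-name patterns are the ones used above):

```python
# Map a daily index name (YYYYMMDD_logs) to its monthly target
# (YYYYMM_monthly_logs), and build the matching _reindex request body.
def monthly_index(daily_index):
    date_part = daily_index.split("_")[0]   # e.g. "20181024"
    return date_part[:6] + "_monthly_logs"  # e.g. "201810_monthly_logs"

def reindex_body(daily_index):
    return {
        "source": {"index": daily_index},
        "dest": {"index": monthly_index(daily_index)},
    }
```

POSTing each generated body to _reindex reproduces the manual request shown above, once per old index.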
I'm trying to return multiple "buckets" of results from Elasticsearch in one HTTP request.
I'm using the _msearch API.
I'm using the following query:
POST /_msearch
{"index" : "[INDEXNAME]", "type":"post"}
{"query" : {"match" : {"post_type":"team-member"}}, "from" : 0, "size" : 10}
{"index" : "[INDEXNAME]", "type": "post"}
{"query" : {"match" : {"post_type": "article"}}, "from" : 0, "size" : 10}
The query executes without error, but the results only return one object, where it seems it should be two (one for the 10 team members, and one for the 10 articles):
{
"responses": [
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 4,
"successful": 4,
"failed": 0
},
"hits": {
"total": 191,
"max_score": 3.825032,
"hits": [
{...}
]
}
}, // second query should be here, no?
]
}
Is my query construction wrong, or am I misunderstanding how this should work?
The format of a _msearch request must follow the bulk API format. It must look something like this:
header\n
body\n
header\n
body\n
The header part includes which index / indices to search on, optional (mapping) types to search on, the search_type, preference, and routing. The body includes the typical search body request (including the query, aggregations, from, size, and so on).
NOTE: the final line of data must end with a newline character \n.
Make sure your query follows this format. In your code example, depending on the environment, your query may or may not work because you've added two newlines after POST /_msearch; you should only add one. If the responses array only has one result, then in your case the last query is somehow being discarded; again, check its format.
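When the payload is built in code rather than pasted into a console, serializing it as NDJSON avoids the newline pitfalls entirely. A sketch (msearch_payload and my_index are illustrative names standing in for your own):

```python
import json

# Serialize (header, body) pairs into the NDJSON format _msearch expects:
# one header line and one body line per search, plus the mandatory
# trailing newline at the very end.
def msearch_payload(searches):
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"

payload = msearch_payload([
    ({"index": "my_index"},
     {"query": {"match": {"post_type": "team-member"}}, "from": 0, "size": 10}),
    ({"index": "my_index"},
     {"query": {"match": {"post_type": "article"}}, "from": 0, "size": 10}),
])
```

Sending this string as the request body guarantees exactly one newline between lines and one at the end, so both searches are parsed.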
I don't see any problem actually, but you should check "Bulk API", it's similar.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
In my Spring-Data-Elasticsearch application, I am trying to use SearchQuery to search through Elasticsearch, according to some given QueryBuilder and FilterBuilder.
However, Elasticsearch docs talk about SearchResponse, which to me, seems to do the same work as SearchQuery.
I don't understand the difference between SearchQuery and SearchResponse.
Can someone please point out the difference?
If you pass the query object to an elasticsearch client and execute the query you get a response back.
The response type is dependent on the query type.
executed SearchQuery object -> SearchResponse object
executed IndexQuery object -> IndexResponse object
and so on...
In the code snippet at your link, the SearchQuery object is built with the prepareSearch method. Afterwards it gets executed by the client.
SearchResponse response =
// Query creation part
client.prepareSearch("index1", "index2")
.setTypes("type1", "type2")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(QueryBuilders.termQuery("multi", "test"))
.setPostFilter(FilterBuilders.rangeFilter("age").from(12).to(18))
.setFrom(0).setSize(60).setExplain(true)
//query execution part
.execute()
.actionGet();
The search query is the query you send to Elastic, the search response is Elasticsearch's response to that query.
For example, this could be your query:
POST /your_index/_search
{
"query": {
"term": {
"available": {
"value": true
}
}
}
And a possible query response from ES:
{
"took": 99,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 58188,
"max_score": 0.99998283,
"hits": [
...
]
}
}
There is way to get the top n terms result. For example:
{
"aggs": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 5
}
}
}
}
Is there any way to set the offset for the terms result?
If you mean something like ignoring the first m results and returning the next n, then no, it is not possible. A workaround would be to set size to m + n and do client-side processing to ignore the first m results.
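The workaround can be sketched like this (the helper and the response shape are illustrative; m and n map to offset and page_size):

```python
# Client-side pagination over a terms aggregation: request size = m + n
# buckets from Elasticsearch, then slice off the first m locally.
def paginate_terms(response, agg_name, offset, page_size):
    buckets = response["aggregations"][agg_name]["buckets"]
    return buckets[offset:offset + page_size]

# Example: skip the first 2 buckets, take the next 2.
sample = {"aggregations": {"apiSalesRepUser": {"buckets": [
    {"key": "a", "doc_count": 9},
    {"key": "b", "doc_count": 7},
    {"key": "c", "doc_count": 5},
    {"key": "d", "doc_count": 3},
]}}}
page = paginate_terms(sample, "apiSalesRepUser", 2, 2)
```

Note that the ordering across "pages" is only stable because all m + n buckets come from a single request.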
A little late, but (at least) since Elastic 5.2.0 you can use partitioning in the terms aggregation to paginate results.
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
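For reference, a partitioned request might look like the sketch below (the partition numbers are illustrative; the field is the one from the question). Each request returns only the terms hashed into the given partition, so iterating partition from 0 to num_partitions - 1 walks the whole term set:

```python
# Build a terms aggregation that returns only one partition of the terms
# (supported since Elasticsearch 5.2 via include.partition).
def partitioned_terms_agg(partition, num_partitions):
    return {
        "aggs": {
            "apiSalesRepUser": {
                "terms": {
                    "field": "userName",
                    "include": {"partition": partition,
                                "num_partitions": num_partitions},
                    "size": 5,
                }
            }
        }
    }
```

Unlike a true offset, partitions split terms by hash rather than by rank, but they keep each response (and ES's memory use) small.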
Maybe this helps a bit:
"aggregations": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 9999 ---> add here a bigger size
},
"aggregations": {
"limitBucket": {
"bucket_sort": {
"sort": [],
"from": 10,
"size": 20,
"gap_policy": "SKIP"
}
}
}
}
}
I am not sure what value to put in the terms size; I would suggest a reasonable one. It limits the initial aggregation, and then the second limitBucket aggregation (a bucket_sort sub-aggregation of the terms aggregation) trims the buckets again. The terms aggregation will probably still load all of its buckets into memory, which is why it depends on your scenario whether it's reasonable not to fetch all results (i.e. if you have tens of thousands). E.g. you are doing a Google-like search where you don't need to jump to page 1000.
Compared to the alternative of getting the data on the client side, this might save you some data transfer from ES, but as I said, weigh this carefully, as it loads a lot of data into ES memory and you might run into memory issues in Elasticsearch.
My use case is as follows:
Execute a search against Products and boost the score by its salesRank relative to the other documents in the results. The top 10% sellers should be boosted by a factor of 1.5 and the top 25-10% should be boosted by a factor of 1.25. The percentiles are calculated on the results of the query, not the entire data set. This feature is being used for on-the-fly instant results as the user types, so single-character queries would still return results.
So for example, if I search for "Widget" and get back 100 results, the top 10 sellers returned will get boosted by 1.5 and the top 10-25 will get boosted by 1.25.
I immediately thought of using the percentiles aggregation feature to calculate the 75th and 90th percentiles of the result set.
POST /catalog/product/_search?_source_include=name,salesRank
{
"query": {
"match_phrase_prefix": {
"name": "N"
}
},
"aggs": {
"sales_rank_percentiles": {
"percentiles": {
"field" : "salesRank",
"percents" : [75, 90]
}
}
}
}
This gets me the following:
{
"hits": {
"total": 142,
"max_score": 1.6653868,
"hits": [
{
"_score": 1.6653868,
"_source": {
"name": "nylon",
"salesRank": 46
}
},
{
"_score": 1.6643861,
"_source": {
"name": "neon",
"salesRank": 358
}
},
..... <SNIP> .....
]
},
"aggregations": {
"sales_rank_percentiles": {
"values": {
"75.0": 83.25,
"90.0": 304
}
}
}
}
So great, that gives me the results and the percentiles. But I would like to boost "neon" above "nylon" because it's a top 10% seller in the results (note: in our system, the salesRank value is descending in precedence, higher value = more sales). The text relevancy is very low since only one character was supplied, so sales rank should have a big effect.
It seems that a function score query could be used here, but all of the examples in the documentation use doc[] to access values from the document. There aren't any that use other information from the top level of the response, e.g. "aggs": {}. I would basically like to boost a document if its sales rank falls within the 100-90th or 89th-75th percentiles, by 1.5 and 1.25 respectively.
Is this something Elasticsearch supports or am I going to have to roll my own with a custom script or plugin? Or try a different approach entirely? My preference would be to pre-calculate percentiles, index them, and do a constant score boost, but stakeholder prefers the run-time calculation.
I'm using Elasticsearch 1.2.0.
What if you keep sellers as parent documents and periodically update their stars (and some boosting factor), say via some worker? Then you could match products using a has_parent query, and use a combination of score mode and a custom score query to surface top products from top sellers.