Elasticsearch 7.7: how can I increase the count of results for the whole index?

I understand that there's a hard limit in Elasticsearch of 10k results per query. What I want to know is whether there's any way to search within this 10k limit but at the same time show the count of all results for this particular query.
So let's suppose there are 1M results matching a certain query; the count should show 1M instead of the 10k max.
Thank you.

Yes, you can.
You need to add the attribute below to your search query:
{
  "track_total_hits": true
}
It will return the total count along with the default results.
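For example, a minimal search sketch with track_total_hits set (index name and size are placeholders):
GET /index/_search
{
  "track_total_hits": true,
  "size": 100,
  "query": {
    "match_all": {}
  }
}
With this flag, hits.total.value in the response holds the exact number of matching documents (hits.total.relation becomes "eq") instead of being capped at 10,000.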

Elasticsearch also supports a /_count API that returns the count of all hits matching a query:
GET /index/_count
{
  // your search query here
  "query": {
    "match_all": {}
  }
}
You can add "from" and "size" to page through specific hits of the response.
Example
GET index/_search
{
  "from": 0,
  "size": 100,
  "query": {
    "match_all": {}
  }
}
In the search response returned by Elasticsearch, the field response['hits']['total']['value'] also holds the count of hits, but by default it is capped at 10,000 unless track_total_hits is set.
NOTE: the /_count API doesn't support "from" and "size"; it only gives you the total count.
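For reference, a /_count response looks roughly like this (the count and shard values are illustrative):
{
  "count": 1000000,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  }
}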
For more details, see the Elasticsearch Count API documentation.

Related

Elasticsearch: How to search only N docs in a large-scale index?

I have a large number of docs (about 100M) stored in a single index. When grouping by a single field, the query may eat up all the CPU on the ES server (most of the time, fewer than 100 results are returned).
Is it possible to limit the query scope (i.e., only search 1M docs) for a single query?
Use query pagination to limit the search scope:
GET /_search
{
  "from": 0,
  "size": 1000000,
  "query": {
    "match": {
      "city": "New york"
    }
  }
}
More information in the documentation.
To add to the other answer, you can also look at the sampler aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/search-aggregations-bucket-sampler-aggregation.html
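A minimal sketch of that idea, with placeholder index and field names; the sampler limits how many top-scoring docs per shard feed the nested aggregation:
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 1000000
      },
      "aggs": {
        "grouped": {
          "terms": {
            "field": "my_field"
          }
        }
      }
    }
  }
}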

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = unique_values // page_size
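With unique_values = 21 and page_size = 10, this gives partitions_number = 3, which is the num_partitions used below.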
Then I am making this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I am expecting to retrieve 10 results, but when I run the query I get only 7.
What am I missing? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the values into equal chunks.
In your case, num_partitions is 3, so 21 / 3 == 7 terms per partition.
Partitions are meant for paging through large numbers of terms, on the order of thousands.
You may be able to leverage the shard_size parameter. My suggestion is to read this part of the manual and experiment with the shard_size param.
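A sketch of the query above with an explicit shard_size (the value here is a starting point to tune, not a recommendation):
POST my_index/_search?pretty
{
  "size": 0,
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "shard_size": 100,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}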
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.
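A minimal composite aggregation sketch for comparison, reusing the field name from the question; each subsequent page is fetched by passing the previous response's after_key as "after":
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "paginate": {
      "composite": {
        "size": 10,
        "sources": [
          {
            "by_field": {
              "terms": {
                "field": "field_to_paginate"
              }
            }
          }
        ]
      }
    }
  }
}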

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id, and I would like to find all documents for specific ids (around 100,000 product_ids to look up, out of 100 million in the index in total).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to chunk the ids and just use terms queries, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A filter query works much, much faster than chunks with a plain terms query.
But a really big filter can slow down getting the result a lot.
In my case, using a filter query with chunks of 10,000 ids is 10 times faster than using a filter query with all 100,000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to take into account is that the filter query result is stored in the cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
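A sketch of the terms lookup mechanism quoted above; the lookup index, document id, and path are hypothetical, and the id list lives in a document of that lookup index:
GET my_index/_search
{
  "query": {
    "terms": {
      "product_id": {
        "index": "product_ids_lookup",
        "id": "my_id_list",
        "path": "ids"
      }
    }
  }
}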
you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in batches:
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
    }
  }
}
If your document id is sequential, or in some other numeric form you can easily order by, and you have such a field available, you can do a "range" query:
GET _search
{
  "query": {
    "range": {
      "document_id_that_is_a_number": {
        "gte": 0,    // bump this on each query by the "lte" step factor
        "lte": 10000 // find a good number here
      }
    }
  }
}

Complex ElasticSearch Query

I have documents with (id, value, modified_date). I need to get all the documents for ids which have a specific value as of the last modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, it looks like I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. It would've been trivial in SQL, but with Elasticsearch I am at a loss. And then I would need to write this in Python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" with reverse sorting by date, 3 is an analog of SQL's HAVING clause - the Bucket Selector Aggregation (?), 4 is _source filtering, and 5 is a terms lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
This shows an example on how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
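A minimal sketch of that latest-per-group pattern, assuming the field names from the question (id, modified_date) and a keyword-style mapping for id:
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "by_id": {
      "terms": {
        "field": "id",
        "size": 10000
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              { "modified_date": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
Filtering those latest records by the specific value would then still need to happen, e.g. on the client side, as the answer below also notes.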
This will return the average price bucketed in one-day intervals:
GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
I wrote it so it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.

Elastic Search Distinct values

I want to know how to get the distinct values of a field in Elasticsearch. I read an article here that shows how to do it with facets, but I read that facets are deprecated:
http://elasticsearch-users.115913.n3.nabble.com/Getting-Distinct-Values-td3830953.html
Is there any other way to do it? If not, could someone explain how? It's a bit hard to understand solutions like this: Elastic Search - display all distinct values of an array
Use aggregations:
GET /my_index/my_type/_search?search_type=count
{
  "aggs": {
    "my_fields": {
      "terms": {
        "field": "name",
        "size": 1000
      }
    }
  }
}
You can use the Cardinality metric
Although the counts returned aren't guaranteed to be 100% accurate, they almost always are for low-cardinality terms, and the precision is configurable via the precision_threshold param.
http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
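A minimal cardinality sketch, reusing the "name" field from the previous answer; the precision_threshold value is illustrative:
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_name_count": {
      "cardinality": {
        "field": "name",
        "precision_threshold": 100
      }
    }
  }
}
Note this returns the approximate number of distinct values, not the values themselves; for the values, use the terms aggregation above.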
