elastic range query giving only top 10 records - elasticsearch

I am using an Elasticsearch range query to fetch records between two dates using Python, but I am only getting 10 records.
Below is the query
{"query": {"range": {"date": {"gte":"2022-01-01 01:00:00", "lte":"2022-10-10 01:00:00"}}}}
Sample Output:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 8,
"successful": 8,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": 1.0,
"hits": [{"_source": {}}]
}
The "hits" list consists of only 10 records. When I checked in my database there are more than 10 records.
Can anyone tell me how to modify the query to get all the records for the mentioned date ranges?

You need to use the size param, as by default Elasticsearch returns only 10 results.
{
"size" : 100, // if you want to fetch return 100 results.
"query": {
"range": {
"date": {
"gte": "2022-01-01 01:00:00",
"lte": "2022-10-10 01:00:00"
}
}
}
}
Refer to the Elasticsearch documentation for more info.
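A minimal Python sketch of the fix, building the request body as a plain dict; the index name and date bounds are placeholders from the question, and the client call is shown only in a comment:

```python
def build_range_query(gte, lte, size=100):
    # "size" raises the hit count above Elasticsearch's default of 10.
    return {
        "size": size,
        "query": {"range": {"date": {"gte": gte, "lte": lte}}},
    }

body = build_range_query("2022-01-01 01:00:00", "2022-10-10 01:00:00")
# With the official Python client this would be sent as:
#   es.search(index="my-index", body=body)
print(body["size"])
```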
Update-1: Since OP wants to know the exact count of records matching the search query (refer to the comments section), one can use &track_total_hits=true in the URL (hint: this can cause performance issues), and then in the search response, under hits, you will get the exact number of records matching your search, as shown.
POST /_search?track_total_hits=true
"hits": {
"total": {
"value": 24981859, // note count with relation equal.
"relation": "eq"
},
"max_score": null,
"hits": []
}
Update-2: OP mentioned in the comments that they are getting only 10K records in one query. As mentioned in the chat, this is restricted by Elasticsearch for performance reasons, but if you still want to change it, it can be done by changing the below setting of an index.
index.max_result_window: 20000 // your desired count
However as suggested in documentation
index.max_result_window The maximum value of from + size for searches
to this index. Defaults to 10000. Search requests take heap memory and
time proportional to from + size and this limits that memory. See
Scroll or Search After for a more efficient alternative to raising
this.
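A sketch of the search_after alternative the documentation points to: page through results with a deterministic sort instead of raising max_result_window. The sort on "date" plus "_id" as a tiebreaker is an assumption, not from the question:

```python
def next_page_body(last_sort_values, page_size=1000):
    # Deep paging via search_after: pass the sort values of the last hit
    # of the previous page instead of an ever-growing "from" offset.
    body = {
        "size": page_size,
        "query": {"range": {"date": {"gte": "2022-01-01 01:00:00",
                                     "lte": "2022-10-10 01:00:00"}}},
        "sort": [{"date": "asc"}, {"_id": "asc"}],  # assumed tiebreaker
    }
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

first = next_page_body(None)
# sort values would come from the last hit of the previous response:
second = next_page_body(["2022-03-05 10:00:00", "abc123"])
```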

Get All records From Elastic Search without size

My query is something like this:
{ "from": 0, "size": 100,"track_total_hits": true, "query": {"bool": {"filter": [{
"bool": {
"must_not": {
"exists": {
"field": "deleted_at"
}
}
}}]}}, "sort": [{ "added_at" : {"order" : "desc"}}]}
Now if I don't specify size it gives only 10 records, and I don't know how many records there are. So what is a possible way to retrieve all the data at once, or at least get the count?
hits.total.value will give you the value of the total number of documents, matching the search query.
track_total_hits defaults to 10,000. If the number of matching documents is more than 10,000, then the relation will change to gte instead of eq.
Refer to this official documentation, to know more about hits.total
"hits": {
"total": {
"value": 11, // note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
Elasticsearch returns results according to the size option. If you want to fetch all the data, you should use pagination with the scroll API.
You can follow this link
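The exact count can be read from the response like this (a minimal sketch; it assumes track_total_hits was set to true in the request, otherwise the value is only a lower bound):

```python
def exact_total(response):
    # hits.total carries both the count and whether it is exact ("eq")
    # or a lower bound ("gte").
    total = response["hits"]["total"]
    if total["relation"] != "eq":
        raise ValueError("count is a lower bound; set track_total_hits=true")
    return total["value"]

resp = {"hits": {"total": {"value": 11, "relation": "eq"}, "hits": []}}
print(exact_total(resp))  # 11
```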
What version of ES are you using? With ES 7.x, the count of matching documents is present in the response under hits -> total -> value.

Significant terms buckets always empty

I have a collection of posts with their tags imported into Elasticsearch. The indexes are:
language - type: keyword
tags (array) - type: keyword
created_at - type: date
Single document looks like that:
{ "language": "en", "tags": ["foo", "bar"], created_at: "..." }
I'm trying to get the significant terms query on my data set using:
GET _search
{
"aggregations": {
"significant_tags": {
"significant_terms": {
"field": "tags"
}
}
}
}
But the results bucket are always empty:
{
"took": 22,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"skipped": 0,
"failed": 0
},
"aggregations": {
"significant_tags": {
"doc_count": 2945,
"bg_count": 2945,
"buckets": []
}
}
}
I can confirm the data is properly imported, as I'm able to run any other aggregation on this dataset and it works fine. Just the significant terms don't want to cooperate. Any ideas on what I am possibly doing wrong here?
Elasticsearch 6.2.4
Significant terms needs a foreground query or aggregation for it to calculate the difference in term frequencies and produce statistically significant results. So you will need to add an initial query before your aggregation. See the doc for details: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
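A sketch of a request body with a foreground query added, built as a Python dict; the term filter on "language" is an assumed example of a foreground set using the fields from the question:

```python
# The query defines the foreground set; significant_terms compares term
# frequencies in it against the whole index (the background set).
body = {
    "query": {"term": {"language": "en"}},  # assumed foreground filter
    "aggregations": {
        "significant_tags": {
            "significant_terms": {"field": "tags"}
        }
    },
}
print(sorted(body))
```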

min_doc_count term aggregation returns result with sum_other_doc_count > 0 but no buckets

Running this search in kibana:
GET myindex/mytype/_search
{"size": 0,
"aggs": {
"duplicateCount": {
"terms": {
"field": "myfield",
"min_doc_count": 2
}
}
}
}
Gives:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 12,
"successful": 12,
"failed": 0
},
"hits": {
"total": 46117,
"max_score": 0,
"hits": []
},
"aggregations": {
"duplicateCount": {
"doc_count_error_upper_bound": 12,
"sum_other_doc_count": 45817,
"buckets": []
}
}
}
I'm not sure how to interpret this result. Per the term aggregation documentation sum_other_doc_count means:
when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response
Since there are no buckets in the response, it seems odd that there are apparently buckets that were not included. Does sum_other_doc_count include the buckets that were excluded by min_doc_count, and therefore the result can be interpreted as meaning there were no documents with duplicates for myfield?
If the latter is the case, assuming there were buckets returned, is it possible to get a sum_other_doc_count that does not include the buckets excluded by min_doc_count, or a total count of the buckets?
Update:
It seems that I can get some of the information I want via a cardinality aggregation: total records minus the field's cardinality gives the approximate number of documents with a duplicate value for myfield.

Scan And Scroll Query In ElasticSearch

I am using ES version 1.5. I used the scroll query and it's working fine, returning the scroll id:
GET /ecommerce_parts/_search?search_type=scan&scroll=3m
{
"query": { "match_all": {}},
"size": 1000
}
{
"scrollid": "c2Nhbjs1OzE4Okk4ckVkSld0UXdDU212UU1kNWZBU1E7MTc6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxNjpJOHJFZEpXdFF3Q1NtdlFNZDVmQVNROzIwOkk4ckVkSld0UXdDU212UU1kNWZBU1E7MTk6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxO3RvdGFsX2hpdHM6MTg0NjEwOw==",
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 184610,
"max_score": 0,
"hits": []
}
}
Now when I pass the scroll id to retrieve the set of documents, it's throwing an error:
GET /_search/scroll?scroll=1m
c2Nhbjs1OzE4Okk4ckVkSld0UXdDU212UU1kNWZBU1E7MTc6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxNjpJOHJFZEpXdFF3Q1NtdlFNZDVmQVNROzIwOkk4ckVkSld0UXdDU212UU1kNWZBU1E7MTk6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxO3RvdGFsX2hpdHM6MTg0NjEwOw==
{
"error": "ElasticsearchIllegalArgumentException[Malformed scrollId []]",
"status": 400
}
GET
/_search/scroll?scroll=3mc2Nhbjs1OzIzOkk4ckVkSld0UXdDU212UU1kNWZBU1E7MjE6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsyNTpJOHJFZEpXdFF3Q1NtdlFNZDVmQVNROzI0Okk4ckVkSld0UXdDU212UU1kNWZBU1E7MjI6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxO3RvdGFsX2hpdHM6MTg0NjEwOw==
Both of the above queries give me an error. How do I retrieve the documents based on the scroll id? If I'm wrong, please suggest a query.
For subsequent requests, you're supposed to send the scroll ID in a JSON payload like this:
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs1OzE4Okk4ckVkSld0UXdDU212UU1kNWZBU1E7MTc6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxNjpJOHJFZEpXdFF3Q1NtdlFNZDVmQVNROzIwOkk4ckVkSld0UXdDU212UU1kNWZBU1E7MTk6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxO3RvdGFsX2hpdHM6MTg0NjEwOw=="
}
In earlier versions of ES you could do it like this, too:
POST /_search/scroll?scroll=1m&scroll_id=c2Nhbjs1OzE4Okk4ckVkSld0UXdDU212UU1kNWZBU1E7MTc6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxNjpJOHJFZEpXdFF3Q1NtdlFNZDVmQVNROzIwOkk4ckVkSld0UXdDU212UU1kNWZBU1E7MTk6SThyRWRKV3RRd0NTbXZRTWQ1ZkFTUTsxO3RvdGFsX2hpdHM6MTg0NjEwOw==
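The full scroll loop can be sketched in Python. The fake pages below stand in for what POST /_search/scroll would return; against a real cluster you would issue HTTP requests instead:

```python
def scroll_all(first_page, fetch_next):
    # Keep passing the most recent scroll_id back in the request body
    # until a page comes back with no hits.
    scroll_id = first_page["_scroll_id"]
    hits = list(first_page["hits"]["hits"])
    while True:
        page = fetch_next({"scroll": "1m", "scroll_id": scroll_id})
        scroll_id = page["_scroll_id"]  # may change between pages
        batch = page["hits"]["hits"]
        if not batch:
            break
        hits.extend(batch)
    return hits

# Simulated responses: scan-type searches return no hits on the first call.
first = {"_scroll_id": "s1", "hits": {"hits": []}}
pages = iter([
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s3", "hits": {"hits": []}},
])
result = scroll_all(first, lambda body: next(pages))
print(len(result))  # 2
```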

Elastic Search Limiting the records that are aggregated

I am running an elastic search query with aggregation, which I intend to limit to say 100 records. The problem is that even when I apply the "size" filter, there is no effect on the aggregation.
GET /index_name/index_type/_search
{
"size":0,
"query":{
"match_all": {}
},
"aggregations":{
"courier_code" : {
"terms" : {
"field" : "city"
}
}
}}
The result set is
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"hits": {
"total": 10867,
"max_score": 0,
"hits": []
},
"aggregations": {
"city": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Mumbai",
"doc_count": 2706
},
{
"key": "London",
"doc_count": 2700
},
{
"key": "Patna",
"doc_count": 1800
},
{
"key": "New York",
"doc_count": 1800
},
{
"key": "Melbourne",
"doc_count": 900
}
]
}
}
}
As you can see, there is no effect on limiting the records on which the aggregation is performed. Is there a filter for, say, the top 100 records in Elasticsearch?
Search operations in Elasticsearch are performed in two phases: query and fetch. During the first phase, Elasticsearch obtains results from all shards, sorts them, and determines which records should be returned. These records are then retrieved during the second phase. The size parameter controls the number of records that are returned to you in the response. Aggregations are executed during the first phase, before Elasticsearch actually knows which records will need to be retrieved, and they are always executed on all records matching the search. So it's not possible to limit them by the total number of results. If you want to limit the scope of aggregation execution, you need to restrict the search query itself instead of changing the retrieval parameters. For example, if you add a filter to your search query that only includes records from the last year, aggregations will be executed only on those records.
It's also possible to limit the number of records that are analyzed on each shard using the terminate_after parameter; however, you will have no control over which records are included and which are excluded from the results, so this option is most likely not what you want.
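Both options above can be sketched as a request body. The range filter on "order_date" is a hypothetical example of narrowing the document set before aggregating, not a field from the question:

```python
# Narrow the aggregated set with a query filter; terminate_after caps the
# documents examined per shard (but with no control over which ones).
body = {
    "size": 0,
    "terminate_after": 100,          # per-shard cutoff, unordered
    "query": {"range": {"order_date": {"gte": "now-1y"}}},  # assumed field
    "aggregations": {
        "courier_code": {"terms": {"field": "city"}}
    },
}
print(sorted(body))
```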
