Elasticsearch: How to search only N docs in a large-scale index?

I have a large number of docs (about 100M) stored in a single index. When I group by a single field, the query can eat up all the CPU on the ES server (most of the time, fewer than 100 results are returned).
Is it possible to limit the query scope (i.e., search only 1M docs) for a single query?

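Use query pagination with "from" and "size" to limit the results. Note that these cap the number of hits returned rather than the number of documents scanned, and "from" + "size" may not exceed index.max_result_window (10,000 by default), so a size this large requires raising that setting: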
GET /_search
{
  "from": 0,
  "size": 1000000,
  "query": {
    "match": {
      "city": "New york"
    }
  }
}
More information is available in the documentation.

To add to the other answer, you can also look at the sampler aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/7.15/search-aggregations-bucket-sampler-aggregation.html
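As a rough sketch of the sampler approach, the helper below builds a request body that wraps a terms aggregation in a "sampler" aggregation, so that only a sample of the top-scoring documents on each shard feeds the grouping. The helper name sampled_group_by, the aggregation names, the field name "city", and the sample sizes are illustrative assumptions, not values from the question.

```python
import json


def sampled_group_by(field, shard_size=200, query=None, top_n=100):
    """Build a search body that groups on `field` over a per-shard sample.

    `shard_size` caps how many documents each shard passes to the
    sub-aggregation, so the whole sample is at most shard_size * shards.
    """
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "query": query if query is not None else {"match_all": {}},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "groups": {"terms": {"field": field, "size": top_n}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical field name; send the printed JSON to GET /<index>/_search.
    print(json.dumps(sampled_group_by("city", shard_size=10000), indent=2))
```

Because the sample is taken per shard, a budget of roughly 1M documents would have to be divided by the number of shards, and the resulting counts are approximate rather than exact.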

Related

ElasticSearch 7.7 how can I increase the count of results of whole index

I understand that there is a hardcoded limit in Elasticsearch of 10k results per query. What I want to know is whether there is any way to search within this 10k limit while still showing the count of all results for the query.
So, suppose 1M results match a certain query: the count should show 1M instead of the maximum of 10k.
Thank you.
Yes, you can.
Add the following attribute to your search query:
{
  "track_total_hits": true
}
It will return the total count along with the default page of results.
Elasticsearch also supports a /_count API that returns the count of all hits for a query:
GET /index/_count
{
  // your search query here
  "query": {
    "match_all": {}
  }
}
You can add "from" and "size" to page through specific hits in the response.
Example:
GET index/_search
{
  "from": 0,
  "size": 100,
  "query": {
    "match_all": {}
  }
}
In the search response returned by Elasticsearch, the field response['hits']['total']['value'] also holds the hit count, but it has a limitation: by default it is capped at 10,000 (reported with "relation": "gte") unless track_total_hits is set to true.
NOTE: The /_count API doesn't support "from" and "size"; it only returns the total count.
For more details, see the Elasticsearch Count API documentation.
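The pieces above can be combined in a small sketch: one helper builds a paged search body with "track_total_hits" enabled, and another reads the total back from the response. The function names and the 10,000 constant (taken to be the default index.max_result_window) are assumptions for illustration; the "value"/"relation" shape is the one a 7.x search response uses.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # assumed default index.max_result_window


def paged_search_body(query, page, page_size, window=DEFAULT_MAX_RESULT_WINDOW):
    """Body for page `page` (0-based) that also asks for an exact total."""
    start = page * page_size
    if start + page_size > window:
        # Elasticsearch would reject this request, so fail early instead.
        raise ValueError("from + size exceeds the result window")
    return {
        "from": start,
        "size": page_size,
        "track_total_hits": True,  # count every match, not just the first 10k
        "query": query,
    }


def read_total(response):
    """Return (count, is_exact) from a 7.x-style search response."""
    total = response["hits"]["total"]
    # "eq" means the count is exact; "gte" means it is only a lower bound.
    return total["value"], total["relation"] == "eq"


if __name__ == "__main__":
    print(paged_search_body({"match_all": {}}, page=2, page_size=100))
    sample = {"hits": {"total": {"value": 1000000, "relation": "eq"}}}
    print(read_total(sample))
```

The hits themselves stay within the 10k window, while the total reflects every matching document.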

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of unique values for the field with a "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page, so I compute the number of partitions:
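unique_values = 21  # result of the cardinality aggregation
page_size = 10      # items per page requested by the user
# Round up so a partial last page still gets its own partition
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = unique_values // page_size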
Then I run this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I am expecting to retrieve 10 results. But if I run the query I have only 7 results.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the terms into roughly equal chunks.
In your case num_partitions is 3, so each partition holds about 21 / 3 = 7 terms.
Partitions are meant for working through large numbers of terms, on the order of thousands.
You may be able to make use of the shard_size parameter. My suggestion is to read that part of the manual and experiment with shard_size.
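The arithmetic behind the 7 results can be checked with a short sketch. It assumes terms are spread roughly evenly across partitions, so the per-partition figure is an estimate rather than a guarantee; both function names are hypothetical.

```python
import math


def expected_terms_per_partition(cardinality, num_partitions):
    """Estimate how many distinct terms land in one partition."""
    return math.ceil(cardinality / num_partitions)


def partitions_for_page_size(cardinality, page_size):
    """Partition count that keeps each partition near `page_size` terms."""
    return math.ceil(cardinality / page_size)


if __name__ == "__main__":
    # Values from the question: 21 unique terms, 10 items per page.
    n = partitions_for_page_size(21, 10)
    print(n, expected_terms_per_partition(21, n))
```

For the question's numbers this gives 3 partitions of about 7 terms each, which matches the observed result: the partition count decides how the terms are divided, and "size" only caps what a single partition may return.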
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.
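A minimal sketch of paging with the composite aggregation follows, assuming a single terms source on the field from the question. The helper names and the aggregation name "pages" are hypothetical; "sources", "after", and "after_key" are the composite aggregation's own parameters.

```python
def composite_page_body(field, page_size, after_key=None):
    """Build a request body for one page of composite buckets."""
    composite = {
        "size": page_size,
        "sources": [{"term": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"pages": {"composite": composite}}}


def next_after_key(response):
    """Return the key to resume from, or None when none was returned."""
    return response["aggregations"]["pages"].get("after_key")


if __name__ == "__main__":
    first = composite_page_body("field_to_paginate", 10)
    print(first)
    # A response shaped like Elasticsearch's, trimmed to the relevant part.
    fake = {"aggregations": {"pages": {"after_key": {"term": "host-10"},
                                       "buckets": []}}}
    print(composite_page_body("field_to_paginate", 10, next_after_key(fake)))
```

Buckets come back ordered by key rather than by doc_count, which is the limitation the question points out.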

Fastest way to tell if a term exists in the index or not

What is the fastest query that can tell whether a term exists in the index? I am not looking for scoring or anything, just a quick true/false response from Elasticsearch saying whether it has a document that contains this term.
You can use the _count API.
Example:
GET /twitter/_count?q=user:kimchy
More information:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html
Alternatively, you can run a search with the size set to 0:
GET /twitter/user/_search
{
  "size": 0,
  "query": {
    "match": {
      "username": "xyz"
    }
  }
}
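As a hedged sketch of the size-0 approach, the helpers below build the request body and turn the response into a boolean. The "terminate_after": 1 setting is an addition that is not in the answers above: it asks each shard to stop collecting after the first match, which may make the check cheaper. The function names are hypothetical.

```python
def exists_search_body(field, value):
    """Body for an existence check: no hits returned, stop at first match."""
    return {
        "size": 0,             # we only need the total, not the documents
        "terminate_after": 1,  # assumption: early exit per shard is acceptable
        "query": {"match": {field: value}},
    }


def term_exists(response):
    """True if the search response reports at least one matching document."""
    total = response["hits"]["total"]
    # 7.x returns {"value": ..., "relation": ...}; older versions a plain int.
    if isinstance(total, dict):
        total = total["value"]
    return total > 0


if __name__ == "__main__":
    print(exists_search_body("username", "xyz"))
    print(term_exists({"hits": {"total": {"value": 1, "relation": "eq"}}}))
```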

In Elasticsearch, is there a limit to the number of items in a terms query?

In the ES docs it lists this sample query:
{
  "terms": {
    "tags": [
      "blue",
      "pill"
    ],
    "minimum_should_match": 1
  }
}
Is there a limit (or a practical limit) on the number of items I could put in the list of possible strings to search for? Could I have a hundred items here?
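Yes, you can put thousands of items there (I've tested it). Just follow the syntax and you will be fine. Note that recent Elasticsearch versions cap a terms query at index.max_terms_count terms, which defaults to 65,536.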
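If the list of values does grow very large, one option is to split it into several terms queries and run them separately. The sketch below assumes a per-query cap of 65,536 terms (which I believe is the default of index.max_terms_count in recent versions); the function name and that default are assumptions, not something the answer above states.

```python
def chunked_terms_queries(field, values, max_terms=65_536):
    """Split `values` into terms queries of at most `max_terms` items each."""
    if max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    values = list(values)
    return [
        {"terms": {field: values[i:i + max_terms]}}
        for i in range(0, len(values), max_terms)
    ]


if __name__ == "__main__":
    # Small numbers so the chunking is easy to see.
    for q in chunked_terms_queries("tags", ["blue", "pill", "red"], max_terms=2):
        print(q)
```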

Constant Score Query elasticsearch boosting

My understanding of the Constant Score Query in Elasticsearch is that the boost factor is assigned as the score for every matching document. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However, when I send this query:
"query": {
  "constant_score": {
    "filter": {
      "term": {
        "source": "BBC"
      }
    },
    "boost": 3
  }
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong; I have also tried using query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors such as TF and IDF. It is useful when you don't care how well a document matched, only whether it matched, and you still want a score assigned, unlike with a filter.
If you want the score to be literally 3 for all documents matching a particular query, then you should use a function score query, something like:
"query": {
  "function_score": {
    "functions": [
      {
        "filter": { "term": { "source": "BBC" } },
        "weight": 3
      }
    ]
  }
  ...
}
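To make the fragment above concrete, here is a sketch that builds a complete request body around it. The inner "query" clause and "boost_mode": "replace" are my additions, intended to restrict results to the matching documents and to let the weight replace the query score; the function name is hypothetical, and checking the result with explain is still advisable.

```python
def fixed_score_query(field, value, score):
    """Build a function_score body meant to give every match `score`."""
    return {
        "query": {
            "function_score": {
                # Assumption: restrict results to documents matching the term.
                "query": {"term": {field: value}},
                "functions": [
                    {"filter": {"term": {field: value}}, "weight": score}
                ],
                # Assumption: use the function score instead of multiplying.
                "boost_mode": "replace",
            }
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(fixed_score_query("source", "BBC", 3), indent=2))
```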
