How to run term statistics efficiently in Elasticsearch?

I use the following code to find term frequency for a document.
POST myindex/mydoc/1/_termvectors?fields=fields.bodyText&pretty=true
{
  "term_statistics": true,
  "filter": {
    "max_doc_freq": 300,
    "min_doc_freq": 50
  }
}
In my index there are 1 million documents. How can I run these statistics more efficiently for each document?
By "efficiently" I mean, for example: the word "the" in doc 1 can also appear in doc 2, so when I run the statistics for doc 2 there should be no need to calculate "the" again (assuming my index is not updated between documents).
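If the main cost is one HTTP round trip per document, one option worth trying is the Multi termvectors API (_mtermvectors), which accepts a list of document ids plus shared parameters in a single request. Below is a minimal Python sketch under that assumption; the cluster address, batch size, and id list are placeholders, and the type-based addressing simply mirrors the question's URL:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def termvectors_in_batches(doc_ids, batch_size=100):
    """Fetch term statistics for many documents with one _mtermvectors call per batch."""
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        response = es.mtermvectors(
            index="myindex",
            doc_type="mydoc",          # matches the question's mapping type
            body={
                "ids": batch,
                "parameters": {
                    "fields": ["fields.bodyText"],
                    "term_statistics": True,
                    "filter": {"max_doc_freq": 300, "min_doc_freq": 50},
                },
            },
        )
        yield from response["docs"]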

Related

Avoid ranking all matching documents in elasticsearch search query

I have an Elasticsearch index with multiple millions of documents. I am running the following search query.
POST testIndex/_search?size=200
{
  "query": {
    "query_string": {
      "query": "(title:QA Manager OR title:QA Lead) AND (skills:JIRA OR skills:Software Development OR skills:Test Case)"
    }
  }
}
Even though we have passed size=200, it seems Elasticsearch ranks all the matching documents and then returns the top 200 with the highest scores.
Is there a way to limit the ranking, i.e. rank at most 1000 matching documents?
Elasticsearch will consider all of your data for search and ranking; that is how it works. What it basically does is execute your query in two phases: query and fetch.
In the query phase, it runs your query on all shards, gets document ids and scores from each shard, and returns them to the requesting node. So in your scenario, with size set to 200, it will get 200 document ids from each shard and return them to the requesting node.
On the requesting node, all the document ids and scores are merged, sorted by score, and the top documents are selected based on the size parameter.
In the fetch phase, the actual documents selected in the query phase are retrieved by id from the individual shards where they reside, and the results are returned to the client.
If you don't need a score for part of your query, you can move that part into the filter clause of a bool query.
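For example, if the skills clause only needs to match and does not have to contribute to the score, it can go into the filter clause. A sketch reusing the query text from the question (which clause should keep scoring depends on the use case; the cluster address is a placeholder):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# The title clause still contributes to the score; the skills clause only filters
# (filter clauses are not scored and can be cached by Elasticsearch).
body = {
    "query": {
        "bool": {
            "must": [
                {"query_string": {"query": "title:QA Manager OR title:QA Lead"}}
            ],
            "filter": [
                {"query_string": {"query": "skills:JIRA OR skills:Software Development OR skills:Test Case"}}
            ]
        }
    }
}

response = es.search(index="testIndex", size=200, body=body)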

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary_id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other docs.
So on average, out of every 10 docs I will have 8 unique primary_ids, and 1 primary_id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I got buckets in the response to my initial search request but not in the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternate solutions, and tried the one in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your results into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to handle data where the number of distinct values of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be SELECT DISTINCT(primary_id) FROM ...
But in Elasticsearch, distinct values can only be obtained via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: it can combine multiple data sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch n records, then pass the returned after_key to fetch the next n records (see the sketch after the example below).
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
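To walk over all buckets, the after_key returned in each response is passed back as after in the next request. A rough Python sketch of that loop, reusing the index and source names from the example above (the cluster address, page size, and stopping condition are my assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def all_composite_buckets(index="index22", page_size=1000):
    """Yield every composite bucket by following after_key page by page."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"TradeRef": {"terms": {"field": "id.keyword"}}}],
        }
        if after is not None:
            composite["after"] = after
        response = es.search(
            index=index,
            size=0,
            body={"aggs": {"pagination": {"composite": composite}}},
        )
        agg = response["aggregations"]["pagination"]
        buckets = agg["buckets"]
        if not buckets:
            break
        yield from buckets
        # after_key points at the last bucket of this page; reuse it for the next page.
        after = agg.get("after_key", buckets[-1]["key"])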
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
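If you go that route, the setting can be raised per index, for example as below (the index name and the value of 50,000 are purely illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Raise the window on one index; test memory usage before settling on a value.
es.indices.put_settings(
    index="my-index",  # hypothetical index name
    body={"index": {"max_result_window": 50000}},
)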
A better option is to use the scroll API and perform the distinct operation on the client side.
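A rough sketch of that approach with the Python client's scan helper, which wraps the scroll API (the cluster address, index name, and field name are placeholders):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

unique_primary_ids = set()
# helpers.scan wraps the scroll API and streams every matching document.
for hit in helpers.scan(
    es,
    index="my-index",  # hypothetical index name
    query={"_source": ["primary_id"], "query": {"match_all": {}}},
):
    unique_primary_ids.add(hit["_source"]["primary_id"])

print(len(unique_primary_ids), "distinct primary_id values")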

Really huge query or optimizing an elasticsearch update

I'm working on document visualization for binary classification of a large number of documents (around 150,000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top 20 topics in positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions are not stored in Elasticsearch, so I cannot ask for the top 20 topics of a certain category. First I thought about creating a query in Elastic to ask for the aggregations, passing a match on the document ids.
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document ids to select, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute for running the classification), which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use the bulk API, which lets you batch your requests and hit Elasticsearch only once for a lot of operations.
Try:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")

list_ids = ["1", "2", "3"]
query_list = []

for id in list_ids:
    # One bulk action per document: a partial update of the prediction field.
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)

helpers.bulk(client=es, actions=query_list)
Please have a read of the elasticsearch-py bulk helpers documentation for details.
Regarding querying the list of ids: to get a faster response you shouldn't match on the id_str value, as you did in the question, but use the _id field. That lets you use a multi-get query, a bulk query for the get operation, which the Python library exposes as mget. Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})

How to get random documents from Elasticsearch indexes with 50 million documents each

I'd like to sample 2000 random documents from approximately 60 ES indexes holding about 50 million documents each, for a total of about 3 billion documents overall. I've tried doing the following on the Kibana Dev Tools page:
GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "field_a": "some phrase"
        }
      },
      "random_score": {}
    }
  }
}
But this query never returns. Upon refreshing the Dev Tools page, I get a page that tells me that the ES cluster status is red (doesn't seem to be a coincidence - I've tried several times). Other queries (counts, simple match_all queries) without the random function work fine. I've read that function score queries tend to be slow, but using a random function score is the only method I've been able to find for getting random documents from ES. I'm wondering if there might be any other, faster way that I can sample random documents from multiple large ES indexes.
EDIT: I would like to do this random sampling entirely using built-in ES functionality, if possible - I do not want to write any code to e.g. implement reservoir sampling on my end. I also tried running my query with a much smaller size - 10 documents - and I got the same result as for 2000.

Short queries return not enough results

Hey, I have a field in Elasticsearch that is analyzed with the alphanumeric_analyzer. Then I index data into that field that looks like this:
Test-00001
Test-00002
to
Test-01000
If I execute the following query, I get 250 results consistently. But they aren't necessarily Test-00001 to Test-00250.
{
  "query": {
    "match": {
      "filename_Analyzed": {
        "type": "phrase_prefix",
        "query": "0"
      }
    }
  }
}
I was expecting to get 1000 results, but I only get 250. Are my expectations correct, or is the search incorrect?
EDIT 1:
Gist for the mapping:
https://gist.github.com/goalie7960/8ffd1536269a901f18bc
EDIT 2:
If I double the number of shards, the number of results also doubles. So 5 shards = 250 results, 10 shards = 500 results, etc.
EDIT 3:
Here's a gist for the analyzer I am using. But I can also reproduce with the standard analyzer.
https://gist.github.com/goalie7960/b0bbbddf1cee29b4b5ed
Turns out the prefix query (phrase_prefix) was exceeding the max expansions limit in Elasticsearch. A not-so-simple solution was to switch to ngram analysis, which fixed the problem. Yay.
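For reference, a minimal index definition along those lines, using an edge_ngram token filter so that prefixes are indexed as real terms and a plain match query can replace phrase_prefix. The analyzer name, gram sizes, and index name are illustrative, not the asker's actual settings, and the mapping uses current typeless syntax:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Index prefixes as real terms so a plain match query can replace phrase_prefix.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "prefix_ngrams": {"type": "edge_ngram", "min_gram": 1, "max_gram": 10}
            },
            "analyzer": {
                "filename_prefix": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "prefix_ngrams"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "filename_Analyzed": {
                "type": "text",
                "analyzer": "filename_prefix",
                "search_analyzer": "standard",
            }
        }
    },
}

es.indices.create(index="filenames", body=index_body)  # hypothetical index name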
