ElasticSearch: Metric aggregation and doc values / field-data

How does ES internally implement metric aggregations?
Suppose documents in the index have the following structure:
{
  "category": "A",
  "measure": 20
}
For the query below, which runs a terms aggregation on category and calculates sum(measure), would the 'measure' field values be extracted from the documents (i.e. from _source) and summed, or would the values be taken from the doc values / field data of the 'measure' field?
Query:
{
  "size": 0,
  "aggs": {
    "cat_aggs": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "sumAgg": {
          "sum": { "field": "measure" }
        }
      }
    }
  }
}

From the official documentation on metrics aggregations (emphasis added):
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
If you're using a newer ES 2.x version, then doc_values have become the norm over field data.
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space
So to answer your question clearly: metrics aggregations are computed from either field data or doc values that were stored at indexing time, i.e. they are not computed by parsing _source at query time, unless you're doing it from a script which accesses the _source directly.
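For illustration, a minimal mapping sketch (7.x-style syntax; the index name my-index is hypothetical) that keeps doc values on category for the terms aggregation but disables them on measure to save disk space; note that the sum aggregation above would then stop working on measure:
PUT my-index
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "measure": { "type": "integer", "doc_values": false }
    }
  }
}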

Related

Avoid ranking all matching documents in elasticsearch search query

I have an Elasticsearch index with multiple millions of documents, and I am running the following search query.
POST testIndex/_search?size=200
{
"query": {
"query_string": {
"query": "(title:QA Manager OR title:QA Lead) AND (skills:JIRA OR skills:Software Development OR skills:Test Case)"
}
}
}
Even though we have passed size=200, it seems Elasticsearch still ranks all the matching documents and returns the top 200 with the highest scores.
Is there a way to limit the ranking, i.e. rank at most 1000 matching documents only?
ES considers all of your data for search and ranking; that is how Elasticsearch works. It executes your query in two phases: query and fetch.
In the query phase, it executes your query on all shards and gets the document IDs and scores from each shard, which are returned to the requesting node. So in your scenario, as size is set to 200, it will get 200 document IDs from each shard and return them to the requesting node.
On the requesting node, all the document IDs and scores are merged and sorted by score, and the top documents are selected based on the size param.
In the fetch phase, the actual docs are retrieved from the individual shards where they reside, based on the IDs selected in the query phase, and the results are returned to the client.
If you don't want to calculate a score for some part of your query, you can move that part into the filter clause of a bool query.
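For example, a sketch of the query from the question with the skills criteria moved into the filter clause, assuming they only need to match and should not influence relevance (filter clauses are not scored and can be cached); the multi-word terms are quoted here as phrases:
POST testIndex/_search?size=200
{
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "title:\"QA Manager\" OR title:\"QA Lead\"" } }
      ],
      "filter": [
        { "query_string": { "query": "skills:JIRA OR skills:\"Software Development\" OR skills:\"Test Case\"" } }
      ]
    }
  }
}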

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other docs.
So on average, out of every 10 docs, 8 will have unique primary ids, and 1 primary id will be shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in the response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to handle data where the variance of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be select DISTINCT(primary_id) from .....
But in Elasticsearch, distinct values can only be processed via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: can combine multiple sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You can fetch "n" records, then pass the returned after_key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
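The response includes an after_key object for the pagination aggregation; to fetch the next page, you pass it back in the after parameter, roughly like this (the key value shown is a placeholder):
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          { "TradeRef": { "terms": { "field": "id.keyword" } } }
        ],
        "after": { "TradeRef": "<last TradeRef from the previous after_key>" }
      }
    }
  }
}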
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10k by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test it to see how much your hardware can support.
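For example (the value 50000 is arbitrary; validate it against your heap first):
PUT index22/_settings
{
  "index.max_result_window": 50000
}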
A better option is to use the scroll API and perform the distinct step on the client side.
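A rough sketch of that approach (the page size and the 1m keep-alive are illustrative): open a scroll that returns only primary_id, then keep pulling pages with the returned _scroll_id and de-duplicate the ids client-side:
POST index22/_search?scroll=1m
{
  "size": 1000,
  "_source": ["primary_id"],
  "query": { "match_all": {} }
}

POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}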

What is the difference between source filtering, stored fields, and doc values in elasticsearch?

I've read the docs for source filtering, stored fields, and doc values.
In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field
The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
All fields which support doc values have them enabled by default.
Example 1
I have documents with title (short string), and content (>1MB). I want to search for matching titles, and return the title.
With source filtering
GET /_search
{ "_source": "title", ... }
With stored fields
GET /_search
{ "_source": false, "stored_fields": ["title"], ... }
With doc values
GET /_search
{ "_source": false, "stored_fields": "_none_", "docvalue_fields": ["title"], ... }
Okay, so
Will the source-filtered request read the full _source (title and content) from disk, then apply the filter and return only the title, or will Elasticsearch read only the title from disk?
Will the source-filtered request use doc values?
Do stored fields store the analyzed tokens or the original value?
Are stored fields or doc values more or less efficient than _source?
Will the source-filtered request read the full _source (title and content) from disk, then apply the filter and return only the title, or will Elasticsearch read only the title from disk?
The document you send to Elasticsearch for indexing is stored (by default) in a field called _source. This means that if your document contains a large amount of data (like the content field in your case), the full content is stored in the _source field. When using source filtering, the whole source document must first be retrieved from the _source field, and only then is the title field extracted and returned. You're wasting space, because nothing really happens with the content field: you're searching on title and returning only the title value.
In your case, you'd be better off not storing the _source document at all and only storing the title field (but that has some disadvantages, too, so read this before you do), basically like this:
PUT index
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}
Will the source-filtered request use doc values?
Doc values are enabled by default on all fields, except analyzed text fields. But if you use _source filtering, doc values are not used: as explained above, the _source field is retrieved and the fields you specified are filtered out of it.
Do stored fields store the analyzed tokens or the original value?
Stored fields store the exact value as present in the _source document
Are stored fields or doc values more or less efficient than _source?
doc_values are a different beast; they're more of an optimization that stores the values of non-analyzed fields in a way that makes it easy to sort, filter and aggregate on those values.
Stored fields (disabled by default) are also an optimization, for when you don't want to store the full source but only a few important fields (as explained above).
The _source field itself is a stored field that contains the whole document.

Really huge query or optimizing an elasticsearch update

I'm working on document visualization for binary classification of a large number of documents (around 150,000). The challenge is how to present general visual information to end users, so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top 20 topics in positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions are not registered in Elasticsearch, so I cannot ask for the top 20 topics in a certain category. First I thought about creating a query in Elastic to ask for the aggregations while passing a match on the document IDs.
As I have the IDs of the positive/negative documents, I could write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really big number of document IDs to indicate, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 IDs like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the number of documents is really huge, it takes about half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use the bulk API, which lets you batch your requests and execute a lot of operations with a single call to Elasticsearch.
Try:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
list_ids = ["1", "2", "3"]

# Build one update action per document id
query_list = []
for id in list_ids:
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)

# Send all updates in a single bulk request
helpers.bulk(client=es, actions=query_list)
Please have a read here
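If a single bulk call is still too slow, the same helpers module also provides a streaming, multi-threaded variant; a minimal sketch reusing the query_list built above:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
# parallel_bulk returns a lazy generator of (ok, item) tuples,
# so it must be consumed for the requests to actually be sent
for ok, item in helpers.parallel_bulk(es, actions=query_list, thread_count=4):
    if not ok:
        print(item)  # report actions that failed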
Regarding querying the list of IDs: to get a faster response you shouldn't match on the id_str value, as you have done in the question, but use the _id field. That permits you to use the multi-get (mget) query, a bulk query for the get operation, also available in the Python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})

How are keyword and numeric data types stored in Elasticsearch? Are they stored in the inverted index?

PUT sana/_mapping/learn
{
  "properties": {
    "name": { "type": "text" },
    "age": { "type": "integer" }
  }
}
POST sana/learn
{ "name": "rosy", "age": 23 }
Quoting the Elasticsearch doc:
Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in the unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.
Keyword and numeric data types are also indexed and stored in the inverted index so that these fields are searchable, but if you want, you can disable that by setting index to false on the field in your index mapping. Also, on these fields (keyword, numeric) doc_values are enabled by default, for sorting, aggregations, etc., but doc_values are not enabled on analyzed string (text) fields.
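For instance, a mapping sketch (hypothetical index name sana2, 7.x-style syntax) where age is not searchable because indexing is disabled, but can still be sorted and aggregated on thanks to doc values:
PUT sana2
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "age": { "type": "integer", "index": false }
    }
  }
}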
Hope I answered your question and let me know if you have any doubt.
