Elastic Index. Enrich document based on aggregated value of a field from the same index - elasticsearch

Is it possible to enrich documents in an index based on data from the same index? For example, if the source index has 10000 documents, I need to calculate an aggregated sum for each group of those documents and then use that sum to enrich the same index.
Let me try to explain. My case can be simplified as follows.
My elastic index A has documents with 3 fields:
timestamp1 identity_id hours_spent
...
timestamp2 identity_id hours_spent
Every hour I need to check the index and update documents with an sku field. If timestamp1 is within [date1:date2] and the total hours_spent for that identity_id is less than a_limit, I need to enrich the document with the additional field sku=A; otherwise with sku=B.
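One possible approach, sketched under assumptions: ask Elasticsearch for the per-identity sums with a terms aggregation and a sum sub-aggregation, decide the sku for each identity in client code, then write it back with an update-by-query. The Python below only builds the request bodies (for `POST A/_search` and `POST A/_update_by_query`) and does not talk to a cluster. The field name `timestamp`, the helper names, and the reading that the sum is taken over documents inside the date window are assumptions, not something stated in the question.

```python
def build_sum_body(date1, date2, max_identities=10000):
    """Body for POST <index>/_search: total hours_spent per identity_id in the window."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"range": {"timestamp": {"gte": date1, "lte": date2}}},
        "aggs": {
            "per_identity": {
                # terms returns only the top `size` buckets (10 by default)
                "terms": {"field": "identity_id", "size": max_identities},
                "aggs": {"total_hours": {"sum": {"field": "hours_spent"}}},
            }
        },
    }


def split_by_limit(buckets, a_limit):
    """Group identity_ids by the sku they should receive."""
    result = {"A": [], "B": []}
    for bucket in buckets:
        sku = "A" if bucket["total_hours"]["value"] < a_limit else "B"
        result[sku].append(bucket["key"])
    return result


def build_update_body(identity_ids, sku, date1, date2):
    """Body for POST <index>/_update_by_query: set sku on the matching documents."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"identity_id": identity_ids}},
                    {"range": {"timestamp": {"gte": date1, "lte": date2}}},
                ]
            }
        },
        "script": {
            "lang": "painless",
            "source": "ctx._source.sku = params.sku",
            "params": {"sku": sku},
        },
    }
```

The `buckets` argument of `split_by_limit` is meant to be the `aggregations.per_identity.buckets` list from the search response. For a very large number of identities, paging through a composite aggregation is probably a better fit than one large terms request.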

Related

Elasticsearch - Does forcing _id pile documents up on only one shard?

Let's say I have 2 documents with the following ids:
id0001
id0002
Since I am forcing the IDs of the documents, how does Elasticsearch place them in shards? Will Elasticsearch put all of them in the same shard? In other words, how does Elasticsearch compute which shard a document goes to?
Each document is routed to a specific shard depending on its _routing value, which defaults to its ID hash.
routing = _routing != null ? hash(_routing) : hash(_id)
routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor
So shard_num will be a direct function of either a specific _routing value or the hash of the document's _id value.
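In your sample, id0001 and id0002 are hashed independently, so forcing the IDs does not pile them onto one shard. Provided your index has more than one primary shard, they can land on different shards, although two particular IDs may still happen to hash to the same one.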
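The routing formula above can be tried out with a small Python sketch. Note the assumptions: `zlib.crc32` is only a stand-in hash (Elasticsearch uses Murmur3, so the shard numbers printed here will not match a real cluster), `shard_for` is a hypothetical helper name, and the shard counts 5 and 640 are illustrative values.

```python
import zlib


def shard_for(doc_id, num_primary_shards, num_routing_shards, routing=None):
    """Apply the routing formula to pick a shard number for a document."""
    if num_routing_shards % num_primary_shards != 0:
        raise ValueError("num_routing_shards must be a multiple of num_primary_shards")
    # A custom _routing value takes precedence over the document's _id.
    key = routing if routing is not None else doc_id
    # Stand-in hash: Elasticsearch uses Murmur3, not CRC32.
    hashed = zlib.crc32(key.encode("utf-8"))
    routing_factor = num_routing_shards // num_primary_shards
    return (hashed % num_routing_shards) // routing_factor


for doc_id in ("id0001", "id0002"):
    print(doc_id, "->", shard_for(doc_id, num_primary_shards=5, num_routing_shards=640))
```

Passing the same `routing` value for two different IDs sends both to the same shard, which is the one case where documents are deliberately co-located.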

Avoid ranking all matching documents in elasticsearch search query

I have an Elasticsearch index with many millions of documents. I am running the following search query.
POST testIndex/_search?size=200
{
  "query": {
    "query_string": {
      "query": "(title:QA Manager OR title:QA Lead) AND (skills:JIRA OR skills:Software Development OR skills:Test Case)"
    }
  }
}
Even though we pass the limit size=200, Elasticsearch seems to rank all the matching documents and return the 200 with the highest scores.
Is there a way to limit ranking, i.e. to rank at most 1000 matching documents?
Elasticsearch considers all of your matching data for search and ranking; that is how it works. It executes your query in two phases: query, then fetch.
In the query phase, it executes your query on all shards, collects document IDs and scores from each shard, and returns them to the requesting node. In your scenario, with size set to 200, each shard returns its top 200 document IDs to the requesting node.
On the requesting node, all the document IDs and scores are merged and sorted by score, and the top documents are selected according to the size parameter.
In the fetch phase, the documents selected in the query phase are retrieved by ID from the shards where they reside, and the results are returned to the client.
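The two phases described above can be illustrated with a toy Python simulation. This is only a sketch of the merge logic, not Elasticsearch's implementation; the function names and the sample scores are made up for illustration.

```python
import heapq


def query_phase(shard_docs, size):
    """Each shard returns only its own top `size` (score, doc_id) pairs."""
    return heapq.nlargest(size, shard_docs)


def coordinate(shards, size):
    """Merge the per-shard candidates and keep the global top `size` doc ids."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(query_phase(shard_docs, size))
    top = heapq.nlargest(size, candidates)
    # The fetch phase would now load the documents for just these ids.
    return [doc_id for _score, doc_id in top]


shards = [
    [(1.2, "a"), (0.4, "b"), (2.5, "c")],
    [(3.1, "d"), (0.9, "e")],
]
print(coordinate(shards, size=2))  # ['d', 'c']
```

Every matching document on a shard is still scored inside `query_phase`; the size parameter only bounds how many candidates travel to the requesting node.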
If you don't want to calculate a score for some part of your query, you can move that part into the filter clause of a bool query.
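As a minimal sketch of that suggestion, the Python below wraps a clause in the filter section of a bool query. The helper name `as_filter` is hypothetical, and the `match` clause on skills is just one piece of the original query picked for illustration.

```python
def as_filter(query_clause):
    """Wrap a query clause in bool.filter so it selects documents without scoring them."""
    return {"query": {"bool": {"filter": [query_clause]}}}


# Request body for POST testIndex/_search?size=200
body = as_filter({"match": {"skills": "JIRA"}})
print(body)
```

Clauses in filter context still have to match every candidate document; what is skipped is the score computation, so the hits come back without a relevance ranking.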

Update the value of a field in index based on its value in another index

There's an index_A that contains say about 10K docs. It has many fields like field_1, field_2, ...field_n and one of the fields is product_name.
Then there's another index_B that contains about 10 docs only and is a master catalogue sort of index. It has 2 fields: product_name and product_description.
e.g.
{
  "product_name" : "EES",
  "product_desc" : "Elastic Enterprise Search"
}
{
  "product_name" : "EO",
  "product_desc" : "Elastic Observability"
}
index_A contains many fields, one of which is product_name. index_A does not have the field product_desc.
I want to insert a product_desc field into each document in index_A, taking the value from the index_B document whose product_name matches.
i.e. something like set index_A.product_desc = index_B.product_desc where index_A.product_name = index_B.product_name
How can I achieve that?
Elasticsearch cannot do joins like that.
The best approach would be to do this during indexing, using something like an ingest pipeline, Logstash, or some other piece of code that pulls the description into the product document.
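As a rough illustration of the "some other piece of code" option, the Python below adds the description to each document on the client side before it is indexed. It is a sketch only: `build_lookup` and `enrich` are hypothetical helper names, and it assumes the catalogue from index_B is small enough to hold in memory (about 10 documents in the question).

```python
def build_lookup(catalog_docs):
    """Map product_name -> product_desc from the index_B documents."""
    return {doc["product_name"]: doc["product_desc"] for doc in catalog_docs}


def enrich(doc, lookup):
    """Return a copy of an index_A document with product_desc added when a match exists."""
    enriched = dict(doc)
    desc = lookup.get(doc.get("product_name"))
    if desc is not None:
        enriched["product_desc"] = desc
    return enriched


catalog = [
    {"product_name": "EES", "product_desc": "Elastic Enterprise Search"},
    {"product_name": "EO", "product_desc": "Elastic Observability"},
]
lookup = build_lookup(catalog)
print(enrich({"field_1": 1, "product_name": "EO"}, lookup))
```

Documents whose product_name has no entry in the catalogue are passed through unchanged.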

elasticsearch aggregation bucket

In Kibana (or potentially even elasticsearch), is there a way to sort documents into buckets based on a field, and then compute statistics on the generated buckets themselves? Here is a simplified example of my problem:
I have logs with structure:
{
user_id: [string],
post_id: [string]
}
Each log signals that the user with ID user_id has viewed the post with ID post_id. I would like to:
Bucket the logs by matching user_id
Count the number of logs per bucket
Compute the 75th percentile of these bucket-specific counts
Is this possible in Kibana?
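On the Elasticsearch side, the three steps appear to map onto a terms aggregation followed by a sibling `percentiles_bucket` pipeline aggregation. The Python below only builds a candidate `_search` request body; the aggregation names, the helper name, and the assumption that user_id is a keyword field are mine, and the sketch does not cover how to set this up in Kibana.

```python
def build_view_percentile_body(percent=75.0, max_users=10000):
    """Body for a _search request: percentile of per-user log counts."""
    return {
        "size": 0,  # no hits needed, only aggregations
        "aggs": {
            # Steps 1 and 2: one bucket per user_id; each bucket's doc_count is its log count.
            "by_user": {"terms": {"field": "user_id", "size": max_users}},
            # Step 3: percentile over the doc counts of the buckets above.
            "views_percentile": {
                "percentiles_bucket": {
                    "buckets_path": "by_user>_count",
                    "percents": [percent],
                }
            },
        },
    }


print(build_view_percentile_body())
```

The pipeline aggregation only sees the buckets that the terms aggregation returns, so `max_users` has to be at least the number of distinct users for the percentile to cover all of them.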

ElasticSearch: Metric aggregation and doc values / field-data

How does ES internally implement metric aggregations?
Suppose documents in the index have below structure:
{
category: A,
measure: 20
}
For the query below, which does a terms aggregation on category and calculates sum(measure), would the 'measure' field values
be extracted from the document (i.e. _source) and summed, or
be taken from the doc values / field data of the 'measure' field?
Query:
{
  "size": 0,
  "aggs": {
    "cat_aggs": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "sumAgg": {
          "sum": {"field": "measure"}
        }
      }
    }
  }
}
From the official documentation on metrics aggregations (emphasis added):
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
If you're using a newer ES 2.x version, then doc_values have become the norm over field data.
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space
So to answer your question clearly: metrics aggregations are computed from either field data or doc values that were stored at indexing time, not by parsing the source at query time, unless you're doing it from a script that accesses the _source directly.
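A toy Python model may help show why a per-field column suits this kind of aggregation. It is only a conceptual sketch and not Lucene's actual doc values format; the column contents and the function name are invented for illustration.

```python
from collections import defaultdict

# Toy columnar layout: position i in each column belongs to document i.
category_column = ["A", "B", "A", "A"]
measure_column = [20, 5, 7, 3]


def terms_sum(key_column, value_column):
    """Sum value_column per distinct key, reading only the two columns."""
    totals = defaultdict(int)
    for key, value in zip(key_column, value_column):
        totals[key] += value
    return dict(totals)


print(terms_sum(category_column, measure_column))  # {'A': 30, 'B': 5}
```

The aggregation walks two compact columns and never touches the original JSON documents, which is the point made in the answer above.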
