Let document match multiple buckets of date histogram - elasticsearch

I have an index that has a mapping which is similar to
{
"id": {
"type": "long"
},
"start": {
"type": "date"
},
"end": {
"type": "date"
}
}
I want to create a date histogram in which each document falls into every bucket whose interval lies between "start" and "end".
E.g. if one document has "start" = 12/01/2018 and "end" = 04/25/2019, my date-histogram interval is weeks, and the range is now-1y until now, I want the document to fall into every bucket from the week of 12/01/2018 until the week of 04/25/2019. So with just this one document the result should be 52 buckets, where the buckets from April to December have doc_count 0 and the buckets from December to April have doc_count 1.
As I see it, date-histogram only lets me match a document to exactly one bucket based on a single field, either "start" or "end".
What I have tried so far:
Dynamically generate a query with 52 filters which checks if a document falls into this "bucket"
Try to make use of painless scripts in each query
Both solutions were extremely slow. I am working with around 200k documents, and such queries took around 10 seconds.
EDIT: Here is a sample query that is generated dynamically. As can be seen, one filter is created per week. This query takes about 10 seconds, which is way too long.
%{
aggs: %{
count_chart: %{
aggs: %{
last_seen_over_time: %{
filters: %{
filters: %{
"2018-09-24T00:00:00Z" => %{
bool: %{
must: [
%{range: %{start: %{lte: "2018-09-24T00:00:00Z"}}},
%{range: %{end: %{gte: "2018-09-17T00:00:00Z"}}}
]
}
},
"2018-12-24T00:00:00Z" => %{
bool: %{
must: [
%{range: %{start: %{lte: "2018-12-24T00:00:00Z"}}},
%{range: %{end: %{gte: "2018-12-17T00:00:00Z"}}}
]
}
},
"2019-04-01T00:00:00Z" => %{
bool: %{
must: [
%{range: %{start: %{lte: "2019-04-01T00:00:00Z"}}},
%{range: %{end: %{gte: "2019-03-25T00:00:00Z"}}}
]
}
}, ...
}
}
}
},
size: 0
}
And a sample response:
%{
"_shards" => %{"failed" => 0, "skipped" => 0, "successful" => 5, "total" => 5},
"aggregations" => %{
"count_chart" => %{
"doc_count" => 944542,
"last_seen_over_time" => %{
"buckets" => %{
"2018-09-24T00:00:00Z" => %{"doc_count" => 52212},
"2018-12-24T00:00:00Z" => %{"doc_count" => 138509},
"2019-04-01T00:00:00Z" => %{"doc_count" => 119634},
...
}
}
}
},
"hits" => %{"hits" => [], "max_score" => 0.0, "total" => 14161812},
"timed_out" => false,
"took" => 2505
}
I hope this question is understandable. If not I will explain it more in detail.

How about running two date_histogram aggregations and calculating the difference per week?
I'm assuming you only need the overall counts, given the size: 0 in your query.
let start = await client.search({
index: 'dates',
size: 0,
body: {
"aggs" : {
"start": {
"date_histogram": {
"field": "start",
"interval": "week"
},
}
}
}
});
let end = await client.search({
index: 'dates',
size: 0,
body: {
"aggs" : {
"end": {
"date_histogram": {
"field": "end",
"interval": "week"
},
}
}
}
});
let buckets = {};
let start_buckets = start.aggregations.start.buckets;
let end_buckets = end.aggregations.end.buckets;
let started = 0;
let ended = 0;
for (let i = 0; i < start_buckets.length; i++) {
started += start_buckets[i].doc_count;
buckets[start_buckets[i].key_as_string] = started - ended;
ended += end_buckets[i].doc_count;
}
This test took less than 2 seconds on my local on similar scale to yours.
You can run both aggregations simultaneously to save more time.
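The difference calculation itself can be factored into a small helper. This is a sketch only: it assumes both histograms return aligned weekly buckets (i.e. start_buckets[i] and end_buckets[i] describe the same week); in practice you may need to zero-fill gaps, e.g. with extended_bounds on both aggregations.

```javascript
// Sketch: compute "active per week" counts from two aligned weekly histograms.
// A document counts as active from the week its range starts through the week
// it ends (ends are subtracted starting the following week, as in the answer).
function activePerWeek(start_buckets, end_buckets) {
  let started = 0; // running total of documents whose range has begun
  let ended = 0;   // running total of documents whose range has ended
  const buckets = {};
  for (let i = 0; i < start_buckets.length; i++) {
    started += start_buckets[i].doc_count;
    buckets[start_buckets[i].key_as_string] = started - ended; // still "open"
    ended += end_buckets[i].doc_count;
  }
  return buckets;
}
```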

Related

Writing a NEST query to sort aggregation buckets by score

What I want is to create an aggregation bucket for each unitId (which is a field in my document). I want each bucket to be ordered by the max score in that bucket. I have written the following query which does what I want:
"aggs": {
"UnitAggregationBucket": {
"terms": {
"field": "unitId",
"size": 10,
"order": {
"max_score": "desc"
}
},
"aggs": {
"max_score": {
"max": {
"script": "_score"
}
}
}
}
I am using script to find the max score per bucket, in a sub-aggregation. I don't know how to write the above query using NEST?
Update:
This is the answer I got from the Elastic Community:
With 6.x, this would be something like:
var client = new ElasticClient();
var searchResponse = client.Search<object>(s => s
.Aggregations(a => a
.Terms("UnitAggregationBucket", t => t
.Field("unitId")
.Size(10)
.Order(o => o
.Descending("maximum_score")
)
.Aggregations(aa => aa
.Max("maximum_score", m => m
.Script("_score")
)
)
)
)
);
var termsAgg = searchResponse.Aggregations.Terms("UnitAggregationBucket");
foreach(var bucket in termsAgg.Buckets)
{
// do something with buckets
var maxScore = bucket.Max("maximum_score").Value;
}
Note that you can't use max_score for the name of the aggregation as
it's a reserved keyword name in the client, which the client uses in
its heuristics based aggregation response JSON deserialization method.
Original Answer
I managed to write the following NEST Query:
var unitAggregations = new TermsAggregation("UnitAggregationBucket")
{
Size = 10,
Field = Field<MyDocument>(p => p.UnitId),
Order = new List<TermsOrder>
{
new TermsOrder()
{
Key = "max_score_in_bucket",
Order = SortOrder.Descending
}
},
Aggregations = new MaxAggregation("max_score_in_bucket", string.Empty)
{
Script = new InlineScript("_score")
}
};
Which produces the following JSON:
"aggs": {
"UnitAggregationBucket": {
"aggs": {
"max_score_in_bucket": {
"max": {
"script": {
"source": "_score"
}
}
}
},
"terms": {
"field": "unitId",
"order": [
{
"max_score_in_bucket": "desc"
}
],
"size": 10
}
}
}
It's not the exact JSON that I wanted, but it does what I want.
Note: max_score is a reserved keyword in the NEST client, so I had to use a different name: max_score_in_bucket

Elastic Search get top grouped sums with additional filters (Elasticsearch version5.3)

This is my Mapping :
{
"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
},
"mappings" :{
"cpt_logs_mapping" : {
"properties" : {
"channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
"playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
"country_code" : {"type":"text","store":"yes","index":"analyzed"},
"playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
"channel_name" : {"type":"text","store":"yes","index":"analyzed"},
"device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
}
}
}
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel' AND
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my QueryDSL looks like this
{
"size": 0,
"aggs": {
"ch_agg": {
"terms": {
"field": "channel_id",
"size": 30 ,
"order": {
"sum_agg": "desc"
}
},
"aggs": {
"sum_agg": {
"sum": {
"field": "playtime_in_sec"
}
}
}
}
}
}
QUESTION 1
Although the QueryDSL I have written does return the top 30 channel_ids by playtime, I am confused about how to add the other filters to the search, i.e. country_code, device_report_tag & playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should also return the corresponding channel_name of each group.
NOTE: Performance over here is a top priority as this is supposed to be running behind a graph generator querying millions or even more docs.
TEST DATA
hits: [
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 1453,
playtime_in_sec: 35,
device_report_tag: "mydev",
channel_report_tag: "Sony Six",
country_code: "SE",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 145,
playtime_in_sec: 25,
device_report_tag: "mydev",
channel_report_tag: "Star Movies",
country_code: "US",
#timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 12,
playtime_in_sec: 15,
device_report_tag: "mydev",
channel_report_tag: "HBO",
country_code: "PK",
#timestamp: "2017-08-12",
}
}
]
QUESTION1:
Are you looking to add a filter/query to the example above? If so, you can simply add a "query" node to the query document:
{
"size": 0,
"query":{
"bool":{
"must":[
{"terms": { "country_code": ["pk","us","se"] } },
{"range": { "#timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
]
}
},
"aggs": {
"ch_agg": {
"terms": {
"field": "ChID",
"size": 30
},
"aggs":{
"ch_report_tag_agg": {
"terms":{
"field" :"channel_report_tag.keyword"
},
"aggs":{
"sum_agg":{
"sum":{
"field":"playtime_in_sec"
}
}
}
}
}
}
}
}
You can use all normal Elasticsearch queries/filters to pre-filter your search before you start aggregating. (Regarding performance: Elasticsearch applies any filters/queries before starting to aggregate, so any filtering you can do here helps a lot.)
Question2:
Off the top of my head, I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output in the order you want to drill down. (you can nest aggs within aggs quite deeply without issues and get the bonus of count on each level)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
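The second suggestion might look roughly like this (a sketch only, not a tested query; the field names ChID, playtime_in_sec, and channel_report_tag are taken from the test data above):

```json
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": { "field": "ChID", "size": 30, "order": { "sum_agg": "desc" } },
      "aggs": {
        "sum_agg": { "sum": { "field": "playtime_in_sec" } },
        "top_channel": {
          "top_hits": {
            "size": 1,
            "_source": { "include": ["channel_report_tag"] }
          }
        }
      }
    }
  }
}
```

Each ChID bucket then carries one sample hit from which you can read the channel name, alongside the summed playtime.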
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.

Return unique results in elasticsearch

I have a use case in which I have data like
{
name: "John",
parentid: "1234",
filter: {a: '1', b: '3', c: '4'}
},
{
name: "Tim",
parentid: "2222",
filter: {a: '2', b: '1', c: '4'}
},
{
name: "Mary",
parentid: "1234",
filter: {a: '1', b: '3', c: '5'}
},
{
name: "Tom",
parentid: "2222",
filter: {a: '1', b: '3', c: '1'}
}
expected results:
bucket:[{
key: "2222",
hits: [{
name: "Tom" ...
},
{
name: "Tim" ...
}]
},
{
key: "1234",
hits: [{
name: "John" ...
},
{
name: "Mary" ...
}]
}]
I want to return unique documents by parentid. I could use the top_hits aggregation, but I don't know how to paginate the buckets. Since the parentids are more likely to differ than to repeat, the bucket array will be large, and I want to show all of the buckets, paginated.
There is no direct way of doing this, but you can follow these steps to get the desired result.
Step 1. You should know all parentids. You can obtain them with a simple terms aggregation (read more here) on the field parentid; you will get only the list of parentids, not the documents matching them. In the end you will have a smaller array than you are currently expecting.
{
"aggs": {
"parentids": {
"terms": {
"field": "parentid",
"size": 0
}
}
}
}
size: 0 is required to return all results. Read more here.
OR
If you already know the list of all parentids, you can move directly to step 2.
Step 2. Fetch the related documents by filtering on parentid; here you can apply pagination.
{
"from": 0,
"size": 20,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"parentid": "2222"
}
}
}
}
}
from and size are used for pagination, so you can loop through each parentid in the list and fetch all related documents.
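The step-2 request for one parentid and one page can be sketched as a small query builder (a hypothetical helper, using the legacy filtered-query syntax shown above, which applies to ES versions before 2.0):

```javascript
// Sketch: build the step-2 query body for one parentid and one page.
// page is zero-based; pageSize maps to the "size" parameter.
function buildParentPageQuery(parentid, page, pageSize) {
  return {
    from: page * pageSize, // offset of the first hit on this page
    size: pageSize,
    query: {
      filtered: {
        query: { match_all: {} },
        filter: { term: { parentid: parentid } }
      }
    }
  };
}
```

Looping over the parentids from step 1 and over pages until a request returns fewer than pageSize hits yields all documents, grouped and paginated.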
If you are just looking for all names grouped by parentid, you can use the query below:
{
"query": {
"match_all": {}
},"aggs": {
"parent": {
"terms": {
"field": "parentid",
"size": 0
},"aggs": {
"NAME": {
"terms": {
"field": "name",
"size": 0
}
}
}
}
},"size": 0
}
If you want the entire document grouped by parentid, it will be a two-step process as explained by Sumit above, and you can use pagination there.
Aggregation doesn't give you access to all documents/document-ids in the agg result, so this will have to be a 2 step process.

elasticsearch aggregation PHP

I am trying to get the unique values, specifically the unique names, from my Elasticsearch database.
So I am aggregating like so:
$paramss = [
'index' => 'myIndex',
'type' => 'myType',
'ignore_unavailable' => true,
'ignore' => [404, 500]
];
$paramss['body'] = <<<JSON
{
"size": 0,
"aggs" : {
"langs" : {
"terms" : { "field" : "name" }
}
}}
JSON;
$results = $client->search($paramss);
print_r(json_encode($results));
I get a result like this:
{
took: 3,
timed_out: false,
_shards: {
total: 5,
successful: 5,
failed: 0
},
hits: {
total: 1852,
max_score: 0,
hits: [
]
},
aggregations: {
langs: {
buckets: [
{
key: "aaaa.se",
doc_count: 430
},
{
key: "bbbb.se",
doc_count: 358
},
{
key: "cccc.se",
doc_count: 49
},
{
key: "eeee.com",
doc_count: 46
}
]
}
}
}
But the problem is that I am not getting all the unique values; I am getting only 10, which is the default size for an Elasticsearch terms aggregation.
So how can I change the query size?
I tried this:
$paramss = [
'index' => 'myIndex',
'type' => 'myType',
'size' => 1000,
'ignore_unavailable' => true,
'ignore' => [404, 500]
];
which returns some weird documents.
Does anyone know the solution to this problem?
How can I get all the unique names from my Elasticsearch database? Can someone help me fix this?
You are doing everything right except the size.
The "size": 0 should go inside the terms aggregation, after the targeted field's name:
$client = new Elasticsearch\Client($params);
$query['body'] = '{
"aggs" : {
"all_sources" : {
"terms" : {
"field" : "source",
"order" : { "_term" : "asc" },
"size": 0
}
}
}
}';
You need to put size parameter inside terms:
{
"aggs" : {
"langs" : {
"terms" : {
"field" : "name",
"size": 0
}
}
}}
Link to documentation where you can find more info:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

Using aggregation functions in Elasticsearch queries

I'm using Elasticsearch 0.90.10 and I want to perform a search using a query with aggregation functions like sum(), avg(), min().
Suppose my data is something like this:
[
{
"name" : "Alice",
"grades" : [40, 50, 60, 70]
},
{
"name" : "Bob",
"grades" : [10, 20, 30, 40]
},
{
"name" : "Charlie",
"grades" : [70, 80, 90, 100]
}
]
Let's say I need to fetch students with an average grade greater than 75 (i.e. avg(grades) >= 75). How can I write such a query in ES using the DSL, filters, or scripting?
Thanks in advance.
The new ES 1.0.0.RC1 that is out might have better ways to do this with aggregations, BUT here is a simple (and very verbose) script filter that works:
POST /test_one/grades/_search
{
"query" : {
"match_all": {}
},
"filter" : {
"script" : {
"script" : " sum=0; foreach( grade : doc['grades'].values) { sum = sum + grade }; avg = sum/doc['grades'].values.length; avg > 25; "
}
}
}
Data I tested with:
POST /test_one/grades
{
"name": "chicken",
"grades": [35,55,65]
}
POST /test_one/grades
{
"name": "pork",
"grades": [15,35,45]
}
POST /test_one/grades
{
"name": "kale",
"grades": [5,10,20]
}
