Aggregation Query in Discover not returning expected result (Kibana / Elasticsearch)

I have set up a Kibana / Elasticsearch instance to analyze some data that I'm scraping.
I am analyzing news articles from different websites, and I want to use a query / filter that shows me each article only once, by running a cardinality aggregation on the field "article_id".
To do so, I set up a Lens visualization, added it to a dashboard, and copied the request from the visualization via the "Inspect" option. I then tried to use that request as a filter in the Discover tab ("Edit as Query DSL"). The only thing the query seems to affect is the time range. When I run the query in the Dev Tools section, it works just fine.
My request looks like this:
{
"aggs": {
"696b506b-2d7f-4bfc-9fab-704ca6e95d5c": {
"terms": {
"field": "article_title.keyword",
"order": {
"acbaafc6-829d-4c65-9b6b-cbca538c938e": "desc"
},
"size": 100
},
"aggs": {
"acbaafc6-829d-4c65-9b6b-cbca538c938e": {
"cardinality": {
"field": "article_id.keyword"
}
}
}
}
},
"size": 0,
"fields": [
{
"field": "run_date",
"format": "date_time"
},
{
"field": "scrape_date",
"format": "date_time"
}
],
"script_fields": {},
"stored_fields": [
"*"
],
"runtime_mappings": {},
"_source": {
"excludes": []
},
"query": {
"bool": {
"must": [],
"filter": [
{
"match_all": {}
},
{
"match_all": {}
},
{
"range": {
"run_date": {
"gte": "2021-04-02T23:49:43.440Z",
"lte": "2021-04-17T23:49:43.440Z",
"format": "strict_date_optional_time"
}
}
}
],
"should": [],
"must_not": []
}
}
}
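A likely explanation (not from the original post): the "Edit as Query DSL" option on a Discover filter only accepts a query clause, so the aggs, size, and fields sections of the request above are silently dropped; that would match only the time range taking effect. If the goal is to see each article once, one hedged alternative is field collapsing on article_id.keyword, which returns the top hit per distinct value — a minimal sketch to run in Dev Tools (the match_all query is a placeholder):

```json
{
  "query": { "match_all": {} },
  "collapse": { "field": "article_id.keyword" }
}
```

Unlike an aggregation, collapse returns regular hits, so the result is browsable the way Discover results are.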
Any help is greatly appreciated, as this has been driving me insane for the last few hours...

Related

ElasticCloud - alert on disk usage using metricbeats

I'm struggling to understand how to define an alert for my hosts' disk usage in Elastic Cloud.
The Agent is installed on my different hosts with the "system" integration; I'm pretty sure this uses Metricbeat under the hood.
I can see the visualization here:
However, the disk usage uses a couple of fields to compute its percentage:
system.fsstat.total_size.total
system.fsstat.total_size.used
When I inspect that part of the dashboard, I end up with this:
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gte": "2022-05-12T08:47:46.895Z",
"lte": "2022-05-12T08:57:46.895Z",
"format": "strict_date_optional_time"
}
}
},
{
"bool": {
"must": [],
"filter": [
{
"bool": {
"should": [
{
"match_phrase": {
"data_stream.dataset": "system.fsstat"
}
}
],
"minimum_should_match": 1
}
}
],
"should": [],
"must_not": []
}
}
],
"filter": [],
"should": [],
"must_not": []
}
},
"aggs": {
"timeseries": {
"auto_date_histogram": {
"field": "@timestamp",
"buckets": 1
},
"aggs": {
"4e4dee91-4d1d-11e7-b5f2-2b7c1895bf32": {
"filter": {
"exists": {
"field": "system.fsstat.total_size.used"
}
},
"aggs": {
"docs": {
"top_hits": {
"size": 1,
"fields": [
"system.fsstat.total_size.used"
],
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
]
}
}
}
},
"57c96ee0-4d54-11e7-b5f2-2b7c1895bf32": {
"filter": {
"exists": {
"field": "system.fsstat.total_size.total"
}
},
"aggs": {
"docs": {
"top_hits": {
"size": 1,
"fields": [
"system.fsstat.total_size.total"
],
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
]
}
}
}
}
},
"meta": {
"timeField": "@timestamp",
"panelId": "4e4dc780-4d1d-11e7-b5f2-2b7c1895bf32",
"seriesId": "4e4dee90-4d1d-11e7-b5f2-2b7c1895bf32",
"intervalString": "600000ms",
"indexPatternString": "metrics-*",
"normalized": true
}
}
},
"runtime_mappings": {}
}
I want to create a threshold alert for when the disk of any of my hosts reaches, let's say, 90%.
A threshold alert only takes one value, so I'm not able to create this alert directly.
Should I create a new field somewhere in the Metricbeat index, or should I use a custom query alert?
I'm quite new to Elastic Cloud. I found a couple of solutions using Python scripts etc., but that seems a bit overkill for what I'm trying to achieve.
Hopefully someone will have a simple solution.
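One way a custom query alert could be driven is by computing the ratio in the query itself: bucket per host, take the latest used/total values, derive the percentage with a bucket_script, and keep only hosts over the threshold with a bucket_selector. A hedged sketch — the host.name grouping field, the use of max as the "latest value" stand-in, and the 90% threshold are all assumptions, not from the original dashboard:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "data_stream.dataset": "system.fsstat" } },
        { "range": { "@timestamp": { "gte": "now-10m" } } }
      ]
    }
  },
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name", "size": 100 },
      "aggs": {
        "used": { "max": { "field": "system.fsstat.total_size.used" } },
        "total": { "max": { "field": "system.fsstat.total_size.total" } },
        "pct": {
          "bucket_script": {
            "buckets_path": { "u": "used", "t": "total" },
            "script": "params.u / params.t * 100"
          }
        },
        "over_threshold": {
          "bucket_selector": {
            "buckets_path": { "u": "used", "t": "total" },
            "script": "params.u / params.t >= 0.9"
          }
        }
      }
    }
  }
}
```

Any host bucket surviving the bucket_selector is over 90%, which an Elasticsearch query rule can then alert on.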

How to convert ElasticSearch query to ES7

We are having a tremendous amount of trouble converting an old ElasticSearch query to a newer version of ElasticSearch. The original query for ES 1.8 is:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"and": [
{
"terms": {
"organization_id": [
"fred"
]
}
}
]
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 0,
"field": "status"
}
},
"tags": {
"terms": {
"size": 0,
"field": "tags"
}
}
}
}
and we are trying to convert it to ES version 7. Does anyone know how to do that?
The Elasticsearch docs for the Filtered query in 6.8 (the latest version of the docs I can find that still has that page) state that you should move the query and filter to the must and filter parameters of a bool query.
Also, the terms aggregation no longer supports setting size to 0 to get Integer.MAX_VALUE. If you really want all the terms, you need to set the max value (2147483647) explicitly. However, the documentation for size recommends using the composite aggregation instead and paginating.
Below is the closest query I could make to the original that works with Elasticsearch 7.
{
"query": {
"bool": {
"must": {
"query_string": {
"query": "*",
"default_operator": "AND"
}
},
"filter": {
"terms": {
"organization_id": [
"fred"
]
}
}
}
},
"size": 50,
"sort": {
"updated": "desc"
},
"aggs": {
"status": {
"terms": {
"size": 2147483647,
"field": "status"
}
},
"tags": {
"terms": {
"size": 2147483647,
"field": "tags"
}
}
}
}
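For completeness, the composite aggregation the documentation recommends pages through all terms instead of requesting 2147483647 buckets at once — a minimal sketch for the status field (the page size of 1000 is arbitrary):

```json
{
  "size": 50,
  "aggs": {
    "status": {
      "composite": {
        "size": 1000,
        "sources": [
          { "status": { "terms": { "field": "status" } } }
        ]
      }
    }
  }
}
```

Each response includes an after_key; passing it back as "after" inside the composite block fetches the next page, until a page comes back with fewer buckets than requested.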

ES query ignoring time range filter

I have mimicked how Kibana does a query search and have come up with the query below. Basically, I'm looking for the last 6 days of data (including days where there is no data, since I need to feed it to a graph). But the returned buckets give me more than just those days. I would like to understand where I'm going wrong with this.
{
"version": true,
"size": 0,
"sort": [
{
"@timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"_source": {
"excludes": []
},
"aggs": {
"target_traffic": {
"date_histogram": {
"field": "@timestamp",
"interval": "1d",
"time_zone": "Asia/Kolkata",
"min_doc_count": 0,
"extended_bounds": {
"min": "now-6d/d",
"max": "now"
}
},
"aggs": {
"days_filter": {
"filter": {
"range": {
"@timestamp": {
"gt": "now-6d",
"lte": "now"
}
}
},
"aggs": {
"in_bytes": {
"sum": {
"field": "netflow.in_bytes"
}
},
"out_bytes": {
"sum": {
"field": "netflow.out_bytes"
}
}
}
}
}
}
},
"stored_fields": [
"*"
],
"script_fields": {},
"docvalue_fields": [
"@timestamp",
"netflow.first_switched",
"netflow.last_switched"
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "( flow.src_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) OR flow.dst_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) ) AND flow.traffic_locality: \"private\"",
"analyze_wildcard": true,
"default_field": "*"
}
}
]
}
}
}
If you put the range filter inside your aggregation section without any date range in your query, your aggregations will run over all your data, and metrics will be bucketed by day over all of it.
The range query on @timestamp should be moved inside the query section so that aggregations are computed only on the data you want, i.e. the last 6 days.
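Concretely, based on the query above, the query section would become (a sketch, keeping the original query_string intact):

```json
"query": {
  "bool": {
    "must": [
      {
        "query_string": {
          "query": "( flow.src_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) OR flow.dst_addr: ( \"10.5.5.1\" OR \"10.5.5.2\" ) ) AND flow.traffic_locality: \"private\"",
          "analyze_wildcard": true,
          "default_field": "*"
        }
      }
    ],
    "filter": [
      {
        "range": {
          "@timestamp": {
            "gt": "now-6d",
            "lte": "now"
          }
        }
      }
    ]
  }
}
```

The days_filter sub-aggregation then becomes redundant and can be dropped, since the date_histogram only ever sees documents from the last 6 days.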

Division of two aggregation metrics in Kibana

I want to divide two aggregation metrics in Kibana. I am getting the counts of two values, and I want to divide one by the other.
Is there any way to do it?
Kibana is generating this Elasticsearch request:
{
"size": 0,
"_source": {
"excludes": []
},
"aggs": {
"1": {
"sum_bucket": {
"buckets_path": "1-bucket>_count"
}
},
"2": {
"cardinality": {
"field": "ms.keyword"
}
},
"1-bucket": {
"terms": {
"field": "ms.keyword",
"size": 10000,
"order": {
"_count": "desc"
}
}
}
},
"stored_fields": [
"*"
],
"script_fields": {
"indiviualCount": {
"script": {
"inline": "(doc['campaign'].empty) ? 0 : ((1.0/doc['campaign'].value) * 100)",
"lang": "painless"
}
}
},
"docvalue_fields": [
"edrTimestamp",
"timestamp"
],
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": [],
"should": [],
"must_not": []
}
}
}
Is there any way we can achieve this in Kibana? I was thinking of using a scripted field, but I don't know where it would be written. Someone recommended using a pipeline aggregation, but I was not able to achieve it.
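For reference, the pipeline-aggregation route can be sketched as follows. A bucket_script only works inside a multi-bucket aggregation, so a single-bucket filters wrapper is used to give it a parent; the wrapper name, the use of value_count for the total, and the ratio name are all illustrative, not from the original request:

```json
{
  "size": 0,
  "aggs": {
    "all_docs": {
      "filters": { "filters": { "all": { "match_all": {} } } },
      "aggs": {
        "total_count": { "value_count": { "field": "ms.keyword" } },
        "unique_count": { "cardinality": { "field": "ms.keyword" } },
        "ratio": {
          "bucket_script": {
            "buckets_path": { "total": "total_count", "unique": "unique_count" },
            "script": "params.total / params.unique"
          }
        }
      }
    }
  }
}
```

The "ratio" value appears inside the single "all" bucket; this runs as a raw query, though surfacing the result in a classic Kibana visualization is a separate problem, which is what the workaround below addresses.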
I was searching for an answer for a while, and my workaround is to use the vertical bar chart.
You need to add 2 metrics, each set to "sum" for your fields (you can manage your fields under the "Management" side tab -> "scripted fields" to add or subtract fields for the correctness of the division).
Switch to the Metrics & Axes tab, set both metrics to "stacked" mode, and set your y-axis to percentage mode.

Terrible has_child query performance

The following query has terrible performance.
100% sure it is the has_child. The query without it runs in under 300ms; with it, it takes 9 seconds.
Is there some better way to use the has_child query? It seems like I could query the parents, then the children by id, and then join client side to do the has-child check faster than the ES database engine is doing it...
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"query": {
"term": {
"stage": "es"
}
}
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
Cluster info:
CPU and memory usage is low. It is an AWS ES Service cluster (v1.5.2) with many small documents, and since the version AWS is running is old, doc values aren't on by default. Not sure if that is helping or hurting.
Since "stage" is not analyzed (based on your comment), and you are therefore not interested in scoring the documents that match on that field, you might see slight performance gains by using the has_child filter instead of the has_child query, and a term filter instead of a term query.
In the documentation for has_child, you'll notice:
The has_child filter also accepts a filter instead of a query:
The main performance benefits of using a filter come from the fact that Elasticsearch can skip the scoring phase of the query. Also, filters can be cached which should improve the performance of future searches that use the same filters. Queries, on the other hand, cannot be cached.
Try this instead:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"source": "IntegrationTest-2016-03-01T23:31:15.023Z"
}
},
{
"range": {
"eventTimestamp": {
"from": "2016-03-01T20:28:15.028Z",
"to": "2016-03-01T23:33:15.028Z"
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "s3"
}
}
}
},
{
"has_child": {
"type": "status",
"filter": {
"term": {
"stage": "es"
}
}
}
}
]
}
}
}
},
"aggs": {
"digests": {
"terms": {
"field": "digest",
"size": 0
}
}
},
"size": 0
}
I bit the bullet and just performed the parent:child join in my application. Instead of waiting 7 seconds for the has_child query, I fire off two consecutive term queries and do some post processing: 200ms.
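The exact queries weren't shown, but the two-query approach presumably looks something like this hedged sketch: collect the parent ids of matching children via a terms aggregation on the built-in _parent field (aggregatable in ES 1.x), once per stage, then intersect the two id sets client side and fetch the parents with an ids query. Field values are taken from the question; everything else is a guess:

```json
{
  "size": 0,
  "query": { "term": { "stage": "s3" } },
  "aggs": {
    "parent_ids": { "terms": { "field": "_parent", "size": 0 } }
  }
}
```

Repeat with "stage": "es", intersect the two bucket key lists in the application, and finish with { "query": { "ids": { "values": [ ... ] } } } against the parent type.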
