Watcher in Elasticsearch for high CPU usage

I want to set up a watcher that sends an email if CPU usage over the last X minutes is above N%.
Elasticsearch receives data from a remote server through Metricbeat every minute. Using that data, I want to notify the administrator of high CPU usage on the remote server.
I have set up the email action and finished the part that alerts on high memory usage, but the problem is CPU usage: the host has a 4-core processor and I don't know how to write the aggs and the condition. I tried some code from GitHub, but I couldn't adapt it to work with Metricbeat.

This worked for me. It sends an email whenever a host reports 5 or more hits (> 95% CPU) within a minute:
{
"trigger": {
"schedule": {
"interval": "1m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"metricbeat-*"
],
"types": [],
"body": {
"query": {
"bool": {
"filter": [
{
"range": {
"#timestamp": {
"gte": "now-{{ctx.metadata.window_period}}"
}
}
},
{
"range": {
"system.process.cpu.total.pct": {
"gte": "{{ctx.metadata.threshold}}"
}
}
}
]
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gte": 5
}
}
},
"actions": {
"email_me": {
"throttle_period_in_millis": 300000,
"email": {
"profile": "standard",
"attachments": {
"datalles.json": {
"data": {
"format": "json"
}
}
},
"from": "xxxx#gmail.com",
"to": [
"yyyy#gmail.com"
],
"subject": "🚩 CPU overhead",
"body": {
"html": "The following hosts are running over {{ctx.metadata.threshold}}% CPU: <br><br>{{#ctx.payload.hits.hits}} <b>{{_source.beat.hostname}}</b> ({{_source.system.process.cpu.total.pct}}%) <br> {{/ctx.payload.hits.hits}}"
}
}
}
},
"metadata": {
"window_period": "1m",
"threshold": 0.95
}
}
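
A note on the multi-core concern from the question: system.process.cpu.total.pct is not normalized by core count, so on a 4-core host it can read as high as 4.0 (400%). Metricbeat also reports system.process.cpu.total.norm.pct, which is normalized to the 0-1 range regardless of cores, so a sketch of the same range filter against the normalized field would be:

"range": {
  "system.process.cpu.total.norm.pct": {
    "gte": "{{ctx.metadata.threshold}}"
  }
}

With that field, the threshold of 0.95 in the metadata block means 95% of total CPU capacity on any host, whatever its core count.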

Related

ElasticSearch Watcher simulate fires the action, otherwise it's stuck

I have a Slack action configured. All aspects appear to be set up correctly. If I go to my watch's simulate section and choose execute (not ignoring the conditions), it executes fine and the message appears in Slack, correctly templated. If I save the config and let the watcher run on its own, it doesn't send. If I use the email action instead, it sends the email. If I use both, it sends neither.
{
"trigger": {
"schedule": {
"interval": "1m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"elastic"
],
"rest_total_hits_as_int": true,
"body": {
"query": {
"bool": {
"must": {
"match": {
"level": "ERROR"
}
},
"filter": {
"range": {
"#timestamp": {
"gte": "now-1500m"
}
}
}
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gte": 1
}
}
},
"actions": {
"notify-slack": {
"throttle_period_in_millis": 5000,
"slack": {
"account": "monitoring",
"proxy": {
"host": "proxy.example.com"
"port": 3128
},
"message": {
"from": "watcher",
"to": [
"#elk-cluster-alerts"
],
"text": "Elk Error Alerts",
"icon": ":chuck:",
"attachments": [
{
"color": "danger",
"title": "Elk Error Alerts",
"text": "Roundhouse kick!"
}
]
}
}
}
}
}
UPDATE:
Not a fix, but the configuration works when I use a webhook instead of the slack config
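
For reference, a minimal sketch of that webhook-based workaround, posting the same alert to a Slack incoming webhook. The path below is a placeholder for your own hook URL, and the webhook request should also accept the same proxy block if your cluster needs one:

"actions": {
  "notify-slack": {
    "throttle_period_in_millis": 5000,
    "webhook": {
      "scheme": "https",
      "host": "hooks.slack.com",
      "port": 443,
      "method": "post",
      "path": "/services/T00000000/B00000000/XXXXXXXXXXXXXXXX",
      "headers": {
        "Content-Type": "application/json"
      },
      "body": "{\"text\": \"Elk Error Alerts: {{ctx.payload.hits.total}} errors found\"}"
    }
  }
}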

Range on @timestamp is not giving results in Kibana Watcher

I am using the watcher JSON below.
{
"trigger": {
"schedule": {
"interval": "2m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"<log-abc.upr-dev-{now/d}>"
],
"types": [],
"body": {
"size": 20,
"query": {
"bool": {
"must": [
{
"match": {
"trailer_message": "SUCCESS"
}
},
{
"range": {
"#timestamp": {
"gte": "now-50m"
}
}
}
]
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gt": 0
}
}
},
"actions": {
"notify-pagerduty": {
"webhook": {
"scheme": "https",
"host": "********",
"port": 443,
"method": "post",
"path": "******",
"params": {},
"headers": {
"Content-Type": "application/json"
},
"body": "{\r\n \"payload\": {\r\n \"summary\": \"{{ctx.payload.hits.total}} success \",\r\n \"source\": \"TEST TEST\",\r\n \"severity\": \"error\"\r\n },\r\n \"routing_key\": \"*******************\",\r\n \"event\": \"function\",\r\n \"client\": \"Watcher\"\r\n}"
}
}
}
}
My logs have SUCCESS values, but after adding the range I am not getting any results.
If I remove the range, it returns results from that day's logs.
I want to use the range as well, but it is not working.
Please let me know where the problem is.
I was able to solve this problem:
@timestamp was not a field present in my logs.
The documents used a different field, sessiontime. Once I pointed my watcher at sessiontime, it started to work.
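
With that change, the range clause would look roughly like this (sessiontime being the field actually present in those documents):

"range": {
  "sessiontime": {
    "gte": "now-50m"
  }
}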

Getting "expected [END_OBJECT] but found [FIELD_NAME]" in Kibana

I am working on Kibana 6.x and using SentiNL to generate email alerts. Below is my query, which should send a mail if my application logs "CREDENTIALS ARE NOT DEFINED FOR PULL EVENT SOURCES", with a threshold of 1. When I play my watcher I get the error below.
Error: Watchers: play watcher : execute watcher : execute advanced watcher : get elasticsearch payload : search : [parsing_exception] [match] malformed query, expected [END_OBJECT] but found [FIELD_NAME], with { line=1 & col=80 }
Query:
"input": {
"search": {
"request": {
"index": [
"filebeat-2019.03.21"
],
"body": {
"query": {
"match": {
"msg": "CREDENTIALS ARE NOT DEFINED FOR PULL EVENT SOURCES"
},
"minimum_number_should_match": 1,
"bool": {
"filter": {
"range": {
"#timestamp": {
"gte": "now-15m/m",
"lte": "now/m",
"format": "epoch_millis"
}
}
}
}
},
"size": 0,
"aggs": {
"dateAgg": {
"date_histogram": {
"field": "#timestamp",
"time_zone": "Europe/Amsterdam",
"interval": "1m",
"min_doc_count": 1
}
}
}
}
}
}
}
I have also used "minimum_number_should_match" to implement the threshold value. Is that correct?
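
The parsing_exception comes from the extra keys placed next to match inside query: a query object can hold exactly one clause, so match, minimum_number_should_match and bool cannot sit side by side at that level. Also, minimum_number_should_match is a bool-query option for should clauses, not a way to set an alert threshold; the hit count is better checked in the watcher's condition. A minimal restructuring of the original query, as a sketch:

"query": {
  "bool": {
    "must": {
      "match": {
        "msg": "CREDENTIALS ARE NOT DEFINED FOR PULL EVENT SOURCES"
      }
    },
    "filter": {
      "range": {
        "@timestamp": {
          "gte": "now-15m/m",
          "lte": "now/m"
        }
      }
    }
  }
}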
I found the solution (note that I have not added the threshold value here):
{
"actions": {
"email_html_alarm_2daee075-0f24-408e-a362-59172b5e3a1d": {
"name": "email html alarm",
"throttle_period": "1m",
"email_html": {
"stateless": false,
"subject": "Error v1.9 conditon",
"priority": "high",
"html": "<p>{{payload.hits.hits}} test hits Hi {{watcher.username}}</p>\n<p>There are {{payload.hits.total}} results found by the watcher <i>{{watcher.title}}</i>.</p>\n\n<div style=\"color:grey;\">\n <hr />\n <p>This watcher sends alerts based on the following criteria:</p>\n <ul><li>{{watcher.wizard.chart_query_params.queryType}} of {{watcher.wizard.chart_query_params.over.type}} over the last {{watcher.wizard.chart_query_params.last.n}} {{watcher.wizard.chart_query_params.last.unit}} {{watcher.wizard.chart_query_params.threshold.direction}} {{watcher.wizard.chart_query_params.threshold.n}} in index {{watcher.wizard.chart_query_params.index}}</li></ul>\n</div>",
"to": "abc#qwe.com",
"from": "abc#qwe.com"
}
}
},
"input": {
"search": {
"request": {
"index": [
"file-2019.04.03"
],
"body": {
"query": {
"bool": {
"must": {
"query_string": {
"query": "CREDENTIALS ARE NOT FOUND",
"analyze_wildcard": true,
"default_field": "*"
}
},
"filter": [{
"range": {
"#timestamp": {
"gte": "now-1d",
"lte": "now/m",
"format": "epoch_millis"
}
}
}]
}
}
}
}
}
},
"condition": {
"script": {
"script": "payload.hits.total > 0"
}
},
"trigger": {
"schedule": {
"later": "every 2 minutes"
}
},
"disable": true,
"report": false,
"title": "watcher_title",
"save_payload": false,
"spy": false,
"impersonate": false
}
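
Since the threshold value is not included above: in Sentinl the hit-count threshold lives in the condition script, so a sketch of the same condition with the threshold of 1 mentioned in the question would be:

"condition": {
  "script": {
    "script": "payload.hits.total >= 1"
  }
}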

Set up watcher for alerting high CPU usage by some process

I'm trying to create a watcher alert that triggers when some process on a node uses over 95% CPU (a normalized value of 0.95) for the last hour.
Here is an example of my config:
{
"trigger": {
"schedule": {
"interval": "10m"
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"indices": [
"metricbeat*"
],
"types": [],
"body": {
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"system.process.cpu.total.norm.pct": {
"gte": 0.95
}
}
},
{
"range": {
"system.process.cpu.start_time": {
"gte": "now-1h"
}
}
},
{
"match": {
"environment": "test"
}
}
]
}
}
}
}
}
},
"condition": {
"compare": {
"ctx.payload.hits.total": {
"gt": 0
}
}
},
"actions": {
"send-to-slack": {
"throttle_period_in_millis": 1800000,
"webhook": {
"scheme": "https",
"host": "hooks.slack.com",
"port": 443,
"method": "post",
"path": "{{ctx.metadata.onovozhylov-test}}",
"params": {},
"headers": {
"Content-Type": "application/json"
},
"body": "{ \"text\": \" ==========\nTest parameters:\n\tthrottle_period_in_millis: 60000\n\tInterval: 1m\n\tcpu.total.norm.pct: 0.5\n\tcpu.start_time: now-1m\n\nThe watcher:*{{ctx.watch_id}}* in env:*{{ctx.metadata.env}}* found that the process *{{ctx.system.process.name}}* has been utilizing CPU over 95% for the past 1 hr on node:\n{{#ctx.payload.nodes}}\t{{.}}\n\n{{/ctx.payload.nodes}}\n\nThe runbook entry is here: *{{ctx.metadata.runbook}}* \"}"
}
}
},
"metadata": {
"onovozhylov-test": "/services/T0U0CFMT4/BBK1A2AAH/MlHAF2QuPjGZV95dvO11111111",
"env": "{{ grains.get('environment') }}",
"runbook": "http://mytest.com"
}
}
This watcher doesn't work when I filter on the metric system.process.cpu.start_time. Perhaps it is not the right metric to use... Unfortunately, I don't have enough experience with Watcher to solve this issue on my own.
The other issue is that I don't know how to add system.process.name to the message body.
Thanks in advance for any help!
Use the @timestamp field instead of system.process.cpu.start_time to select all metricbeat-* documents from the last 10 minutes:
"range": {
"timestamp": {
"gte": "now-10m",
"lte": "now"
}
}
To include system.process.name in your message body, look at {{ctx.payload}} and use the appropriate notation to refer to the process name. For example, in one of our watcher configs we use {{_source.appname}} to refer to the application name.
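
Building on that: the first watcher in this thread loops over hits with Mustache sections, so a sketch of a webhook body that names the offending process might look like the one below. Note that the watcher above sets "size": 0, so no hits are returned; raise or remove the size so _source is available to the template.

"body": "{ \"text\": \"Processes over 95% CPU in the last hour: {{#ctx.payload.hits.hits}}{{_source.beat.hostname}} - {{_source.system.process.name}} ({{_source.system.process.cpu.total.norm.pct}}) {{/ctx.payload.hits.hits}}\" }"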

How to filter on a date range for Sentinl?

So we've started implementing Sentinl to send alerts. I have managed to get a count of errors sent whenever it exceeds a specified threshold.
What I'm really struggling with is filtering for the last day!
Could someone please point me in the right direction?
Here is the script:
{
"actions": {
"Email Action": {
"throttle_period": "0h0m0s",
"email": {
"to": "juan#company.co.za",
"from": "elk#company.co.za",
"subject": "ELK - ERRORS caused by CreditDecisionServiceAPI.",
"body": "{{payload.hits.total}} ERRORS caused by CreditDecisionServiceAPI. Threshold is 100."
}
},
"Slack Action": {
"throttle_period": "0h0m0s",
"slack": {
"channel": "#alerts",
"message": "{{payload.hits.total}} ERRORS caused by CreditDecisionServiceAPI. Threshold is 100.",
"stateless": false
}
}
},
"input": {
"search": {
"request": {
"search_type": "query_then_fetch",
"index": [
"*"
],
"types": [],
"body": {
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"appName": "CreditDecisionServiceAPI"
}
},
{
"match": {
"level": "ERROR"
}
},
{
"range": {
"timestamp": {
"from": "now-1d"
}
}
}
]
}
}
}
}
}
},
"condition": {
"script": {
"script": "payload.hits.total > 100"
}
},
"transform": {},
"trigger": {
"schedule": {
"later": "every 15 minutes"
}
},
"disable": true,
"report": false,
"title": "watcher_CreditDecisionServiceAPI_Errors"
}
So to be clear, this is the part that's being ignored by the query:
{
"range": {
"timestamp": {
"from": "now-1d"
}
}
}
You need to change it and wrap the range clause in a filter, like this:
"filter": [
{
"range": {
"timestamp": {
"gte": "now-1d"
}
}
}
]
So we've FINALLY solved the problem!
Elasticsearch has changed its DSL multiple times, so please note that you need to check which version you're using for the correct solution. We're on version 6.2.3.
The query below finally worked:
"query": {
"bool": {
"must": [
{
"match": {
"appName": "CreditDecisionServiceAPI"
}
},
{
"match": {
"level": "ERROR"
}
},
{
"range": {
"#timestamp": {
"gte": "now-1d"
}
}
}
]
}
}
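
One more version note, since the answer above warns that the DSL keeps changing: from Elasticsearch 7.x onward the search response reports hits.total as an object rather than a plain number, so a condition script written as payload.hits.total > 100 would need to compare the nested value instead, roughly:

"condition": {
  "script": {
    "script": "payload.hits.total.value > 100"
  }
}

Alternatively, the request can ask for the old integer form with "rest_total_hits_as_int": true, as the Slack watcher earlier on this page does.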
