I'm setting up monitoring for a process using New Relic. The process itself is an AWS Lambda that finishes running in around 15 seconds. Any time this process fails, I want an alert to be triggered and an email to be sent to me per the policy I've configured.
For testing purposes I'm causing the lambda to fail in a QA environment multiple times in a row to see what gets picked up by New Relic, although in production the failure would only occur a couple of times (fewer than 3) per week, potentially a few days apart.
Here is the chart that depicts all of the failures, the NRQL query, and the thresholds. As we can see, the summed errors are well above the threshold but for some reason the alert email is not being dispatched. Any ideas?
Try increasing your evaluation offset under Condition Settings -> Advanced Settings -> Evaluation offset.
New Relic polls for Lambda metrics every 5 minutes, so if your offset is lower than this you may find that the alert doesn't fire.
In practice I've found this quite unreliable, and I'd suggest setting a fairly high offset initially to test the alert - maybe 20 or 30 minutes.
As I see it, the red highlighted area is the timeframe where the alert condition is being violated. The alert should have been triggered, so check your notification channel and try sending a test notification.
I am new to jBPM. I am working on jBPM version 6.2.0. I want to perform the following tasks.
Send a reminder email to a user/group.
Remind the user again after 1 business day if the task is not yet complete. Continue to send a reminder every day until the task is done.
Also, what happens if the JBoss/Tomcat server restarts after sending one reminder email? Will the later emails still be scheduled?
I am able to add Deadlines (Escalation - Notification), but it runs once and sends only 1 email. I need to keep reminding the user on a daily (or hourly) basis to complete the task.
I tried looking in the jBPM 6 user guide, but it is not clear about boundary timer events and intermediate catch timer events. And when I use either of them, it only runs once.
Any help is much appreciated.
Here is an example of something that I did recently for sending periodic emails.
This should loop until a user finally completes the task. You might have trouble with the one business day rule since I do not know if the ISO 8601 spec is flexible enough to know about weekends/holidays/business days. You could add that logic into your service task for sending the email.
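For reference, ISO 8601 repeating intervals look roughly like this (assuming your timer's Time Cycle field accepts the R-prefixed form, which I believe jBPM 6 does - worth double-checking):

R/P1D - repeat indefinitely, once per day
R5/PT1H - repeat 5 times, once per hour

The spec only understands fixed calendar durations, not business days, which is why the weekend/holiday check would have to live in the service task that sends the email.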
Be aware that this loop will continue forever until the task is complete. You might want to consider adding some additional timeout. You could add a loop count so that after X iterations the process will be cancelled. Some of my processes have a rule that if the process is not complete in Y days, the process should be cancelled. I accomplished that by having a process variable CancelDate and setting a Timer Event definition to Date/Time with the value #{CancelDate}.
So I've been using Boto in Python to try and configure autoscaling based on CPUUtilization, more or less exactly as specified in this example:
http://boto.readthedocs.org/en/latest/autoscale_tut.html
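The alarm part of the script follows the tutorial's pattern - roughly this (a paraphrased boto sketch; the group name and policy ARN are placeholders):

import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

# Connect to CloudWatch in the same region as the autoscaling group
cloudwatch = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# Alarm on average CPU across the autoscaling group
scale_up_alarm = MetricAlarm(
    name='scale_up_on_cpu',
    namespace='AWS/EC2',
    metric='CPUUtilization',
    statistic='Average',
    comparison='>',
    threshold=70,
    period=300,
    evaluation_periods=2,
    alarm_actions=['<scale-up-policy-arn>'],  # ARN of the scaling policy
    dimensions={'AutoScalingGroupName': 'my-scaling-group'})
cloudwatch.create_alarm(scale_up_alarm)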
However both alarms in CloudWatch just report:
State Details: State changed to 'INSUFFICIENT_DATA' at 2012/11/12 16:30 UTC. Reason: Unchecked: Initial alarm creation
Auto scaling is working fine but the alarms aren't picking up any CPUUtilization data at all. Any ideas for things I can try?
Edit: The instance itself reports CPU utilisation data, just not when I try to create an alarm in CloudWatch, programmatically in Python or in the interface. Detailed monitoring is also enabled just in case...
Thanks!
The official answer from AWS goes like this:
Hi, there is an inherent delay in transitioning into INSUFFICIENT_DATA state (only) as alarms wait for a period of time to compensate for metric generation latency. For an alarm with a 60 second period, the delay before transition into I_D state will be between 5 and 10 minutes. John.
Apparently this is a temporary state and will likely resolve itself.
I am not sure what's going on in the backend, but if you compare the alarm history you will see that AWS removes the 'unit' column if you just modify the alarm without any change, as at7000ft said. So remove the unit column from your script.
Make sure that the alarm's Namespace is 'AWS/EC2'.
I know this is a long time after the original question, but in case others find this via Google: I had the same problem, and it turned out I had set the alarm's Namespace improperly.
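If you want to check this programmatically rather than in the console, a boto sketch like the following (the alarm name is a placeholder) prints the namespace and dimensions each alarm was actually created with:

import boto.ec2.cloudwatch

cloudwatch = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# Print the namespace/dimensions the alarm was actually created with
for alarm in cloudwatch.describe_alarms(alarm_names=['scale_up_on_cpu']):
    print("%s ns=%s dims=%s" % (alarm.name, alarm.namespace, alarm.dimensions))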
You need to publish data with the same unit that was used to create the alarm. If you didn't specify one, it will be the <None> unit.
The unit can be specified in aws cloudwatch put-metric-data and aws cloudwatch put-metric-alarm with --unit <value>
Unit <value> can be:
Seconds
Bytes
Bits
Percent
Count
Bytes/Second (bytes per second)
Bits/Second (bits per second)
Count/Second (counts per second)
None (default when no unit is specified)
Units are also case-sensitive, so be careful about that in your scripts.
For CPUUtilization, you can use Percent.
After the first data set is sent to your alarm (it can take up to 5 minutes for an instance without detailed monitoring), the alarm will switch to the OK or ALARM state instead of the INSUFFICIENT_DATA one.
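The same idea in boto terms (a sketch with a made-up custom metric, QueueDepth in Custom/MyApp): the unit on the published data has to match the unit on the alarm.

import boto.ec2.cloudwatch
from boto.ec2.cloudwatch import MetricAlarm

cloudwatch = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# Publish a data point with an explicit unit...
cloudwatch.put_metric_data(namespace='Custom/MyApp', name='QueueDepth',
                           value=42, unit='Count')

# ...and create the alarm with the SAME unit; with a mismatched unit the
# alarm never matches any data and stays in INSUFFICIENT_DATA
alarm = MetricAlarm(name='queue_depth_high', namespace='Custom/MyApp',
                    metric='QueueDepth', statistic='Average',
                    comparison='>', threshold=100, period=300,
                    evaluation_periods=1, unit='Count')
cloudwatch.create_alarm(alarm)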
I am having the same INSUFFICIENT_DATA alarm state show up in CloudWatch for an RDS CPUUtilization > 60 alarm created with CloudFormation. ("Reason: Unchecked: Initial alarm creation" shows up under details.) This is a very crude fix, but I found that by selecting the alarm, clicking the Modify button, and then the Save button (without changing anything), the alarm goes to the OK state and everything is fine.
I had this problem. Make sure the metric name you use to create the alarm matches the actual metric name.
You can list your metrics with:
aws cloudwatch list-metrics --namespace=<NAMESPACE, e.g. System/Linux, etc>
Find the metric and the MetricName. Make sure your alarm is configured for that metric.
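If you are already in Python, the rough boto equivalent (AWS/EC2 here is just an example namespace) is:

import boto.ec2.cloudwatch

cloudwatch = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# List what CloudWatch actually has in the namespace and compare the
# metric names/dimensions against what the alarm is configured for
for metric in cloudwatch.list_metrics(namespace='AWS/EC2'):
    print("%s %s" % (metric.name, metric.dimensions))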
As far as I know, the default metric resolution is 5 minutes (it can be lowered to 1 minute with detailed monitoring, which costs extra), so if your alarm's measurement period is lower than that, it'll remain permanently in an INSUFFICIENT_DATA state. In my case, I had a 1 minute measurement period on CPU utilization, and changing it to 5 minutes fixed the state issue.
I had a similar problem: my alarm was constantly in INSUFFICIENT_DATA status although I could see the metric in the GUI.
It turned out that this happened because I specified the wrong Unit for the metric when I created the alarm. No error was reported back, but it never became GREEN.
It is better to avoid specifying a unit if you are not sure, and AWS will do the correct match in the background.
There is a directory /var/tmp/aws-mon/ that contains a couple of files. One is instance-id. The instance I was on was created from an AMI, and this file retained the old instance id. I just edited it and made sure /var/tmp/aws-mon/placement/availability-zone was also correct. The alarms changed to OK almost instantly.
Also ran into this problem, but for a different reason: I passed the ES cluster ARN instead of the domain name in my CloudFormation template. It was pretty frustrating.
Example: Let's say I have a workflow which sends an email 2 days before the warranty end date.
This workflow is triggered on the "Created" event of an entity.
Step 1: wait condition - process timeout < (warrantyenddate - 2)
After the wait: send email.
So when the record is created, the workflow is started. But what happens when the user goes back and updates the warranty end date?
Does the workflow check the updated warranty end date, or does it still use the end date entered when it was triggered (i.e. the initial on-create value)?
My understanding is that the workflow uses the data in the system at the time of execution.
The important thing to note here is that a workflow can be executed many times, and at each of these times the data in the system can be different. CRM caches the state of the workflow, but not the data. Process Architecture for Microsoft Dynamics CRM 2011 describes this.
So, each time the process timeout condition is checked it will use the current value of warrantyEndDate. If the value is changed, next time the condition is checked the new value will be used.
In any case, as @BenPatterson1 suggests, you are probably best off just testing to be sure.
After trying this myself: if the value of the field included in the condition changes, the workflow engine wakes up from its sleep (waiting) state and checks the condition again.
If it meets the condition, it continues to the next step; otherwise it continues to wait.
Is it possible to change the resolution time calculation to start not with the issue creation time, but rather with the time when an issue was transferred into a certain state?
The use case is as follows - we use a kanban-ish development method, where we create most issues/features/stories in a backlog upfront; this kills the usefulness of the resolution time gadget. In our case, the lead/resolution time should rather be calculated from the time an issue has been pulled into the selected issues.
As this calculation is the basis for multiple gadgets, maybe it could be changed per gadget in order to avoid unforeseen issues with other gadgets?
There is a service level management tool, SLAdiator (http://sladiator.com), which calculates resolution/reaction times based on the duration a ticket has spent in a certain status (or statuses). You can view these tickets online as well as get reports.