aws status check failed alarm is not taking the action - amazon-ec2

I have an instance that keeps failing its system status check, and each time I have to reboot it to get it running again.
I saw that there is an option to create a status check alarm, so I created one.
I did receive the notification by email through SNS as configured, but the instance was not rebooted, so I had to go into the EC2 dashboard and reboot it manually.
Is there a setting I have configured incorrectly, or does anyone have other ideas for rebooting the instance automatically when a status check fails?
Thanks in advance for any suggestions.

You can only use the recover action on a system status check failure, and the reboot action only works on instance status check failures. These are distinctly different failures with different causes.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
I would set up a separate alarm for each condition: one for instance status check failures that triggers a reboot, and one for system status check failures that triggers a recover.
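The two alarms described above could be sketched with boto3 roughly as follows. This is only an illustration under assumptions: the alarm names, the 60-second period, the two evaluation periods, and the instance ID and region passed in are placeholders that do not come from the original post. The metric names StatusCheckFailed_System and StatusCheckFailed_Instance and the arn:aws:automate action ARNs are the standard CloudWatch/EC2 ones.

```python
def build_status_check_alarms(instance_id, region):
    """Return put_metric_alarm keyword arguments for the two alarms.

    Period, evaluation periods and alarm names are illustrative assumptions.
    """
    def alarm(name, metric, action):
        return {
            "AlarmName": f"{instance_id}-{name}",
            "Namespace": "AWS/EC2",
            "MetricName": metric,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Maximum",
            "Period": 60,
            "EvaluationPeriods": 2,
            "Threshold": 1.0,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # Built-in EC2 action for CloudWatch alarms: recover or reboot.
            "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
        }

    return [
        # System status check failure -> recover the instance.
        alarm("system-check-recover", "StatusCheckFailed_System", "recover"),
        # Instance status check failure -> reboot the instance.
        alarm("instance-check-reboot", "StatusCheckFailed_Instance", "reboot"),
    ]


def create_alarms(instance_id, region):
    """Create both alarms; needs boto3 installed and AWS credentials."""
    import boto3  # third-party AWS SDK, imported lazily

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for params in build_status_check_alarms(instance_id, region):
        cloudwatch.put_metric_alarm(**params)
```

An SNS topic ARN can be appended to each AlarmActions list if the email notification should be kept alongside the automatic action.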

Related

Couldn’t acquire exclusive lock on DB at ‘/eventstore/db’

I’m trying to install eventstore on Ubuntu 20.04, but every time I run eventstored --what-if (as root, as a regular user, or with sudo) I get the following error message: Couldn't acquire exclusive lock on DB at '/eventstore/db'.
I tried many things:
I made sure the eventstore user and group own the folder.
I reinstalled eventstore.
I rebooted the server.
I stopped the process with systemctl stop eventstore and started it again.
I also tried starting the service first (as root, with sudo, or as a regular user) before running eventstored --what-if.
I can’t figure out why I keep getting this message, as if several instances of eventstore were launched at the same time.
EDIT :
Here is my config file (/etc/eventstore/eventstore.conf)
# Paths
Db: /eventstore/db
Index: /eventstore/index
Log: /eventstore/logs
# Certificates configuration
CertificateFile: /etc/eventstore/certs/cert.crt
CertificatePrivateKeyFile: /etc/eventstore/certs/privkey.key
TrustedRootCertificatesPath: /etc/ssl/certs
CertificateReservedNodeCommonName: "*.mathob-jehanno.com"
# Network configuration
IntIp: 37.187.2.103
ExtIp: 37.187.2.103
IntHostAdvertiseAs: mathob-jehanno.com
ExtHostAdvertiseAs: mathob-jehanno.com
HttpPort: 2113
IntTcpPort: 1112
EnableExternalTcp: false
EnableAtomPubOverHTTP: false
# Projections configuration
RunProjections: None
This happened to me previously. I was running v20 without supplying the necessary settings; for example, the certificates were missing. The server crashed because of this, but the last message shown was Couldn't acquire exclusive lock on DB at '/eventstore/db'. Look closely: that message may be only a warning, with the real reason for the crash reported earlier in the output, in the stack trace of the original error.
OK, so first of all, the comments helped a lot:
This error message follows another one that gives more detail about what the problem is.
One thing to know is that eventstored --what-if is supposed to be run while the service is not running, so you need to stop the service first (systemctl stop eventstore).
I then changed the db, index, and logs paths to match the default values, which spared me some permission errors.
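The order of operations described here (stop the service, then run the what-if check and read all of its output) could be scripted along these lines. It is a small sketch, not a tested procedure: it uses only the two commands quoted in this thread, the helper name run_what_if is made up for illustration, and it assumes the script runs with enough privileges to stop the service.

```python
import subprocess

# The two commands quoted in the thread.
STOP_SERVICE = ["systemctl", "stop", "eventstore"]
WHAT_IF = ["eventstored", "--what-if"]


def run_what_if(run=subprocess.run):
    """Stop the service, run the what-if check, and return its full output.

    `run` is injectable so the sequence can be exercised without eventstore.
    """
    # The service must not be running, otherwise it holds the DB lock.
    run(STOP_SERVICE, check=True)
    result = run(WHAT_IF, capture_output=True, text=True)
    # Return everything so the earlier error is visible, not just the
    # final "Couldn't acquire exclusive lock" line.
    return result.stdout + result.stderr
```

Reading the returned text from the top, rather than only its last line, is the point: the lock message tends to be the final symptom, not the cause.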

CodeDeploy allowTraffic Fails but my code is still deployed on instances

I am using CodeDeploy, and when I run a deployment it gets stuck in the In Progress state. After researching the problem further, I found that it fails at the AllowTraffic step; it just says the script failed. I have looked through the logs but there are no errors. The AWS documentation suggests it may be a health check problem, but both of my instances are healthy in my target group.
The weird thing is that the code gets deployed despite the failed status.
Can someone help?
Thanks a bunch
Did you enable the Elastic Load Balancer? If so, check the health check settings on your ELB. If the deployment fails on AllowTraffic, it means it is not getting a successful response from your load balancer. For example, if you are doing redirects on your ELB, the status code will be 301, and you should add that code to your ELB health check.
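For a target group behind an Application Load Balancer, accepting 301 in the health check could look roughly like the boto3 sketch below. It assumes an ALB target group (where the health check matcher is a list of HTTP codes); the target group ARN and region are placeholders you would supply, and the choice of "200,301" is just the example from this answer.

```python
def build_health_check_update(target_group_arn, success_codes=("200", "301")):
    """Return modify_target_group keyword arguments that accept the given codes."""
    return {
        "TargetGroupArn": target_group_arn,
        # ALB health checks take a comma-separated list of accepted codes.
        "Matcher": {"HttpCode": ",".join(success_codes)},
    }


def apply_health_check_update(target_group_arn, region):
    """Apply the update; needs boto3 installed and AWS credentials."""
    import boto3  # third-party AWS SDK, imported lazily

    elbv2 = boto3.client("elbv2", region_name=region)
    elbv2.modify_target_group(**build_health_check_update(target_group_arn))
```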
If the health check is fine, you can also try changing the deployment configuration in the application's deployment settings to CodeDeployDefault.OneAtATime. For me, CodeDeployDefault.AllAtOnce was failing with the same error.
If the AllowTraffic stage isn't passing, there are usually two possible causes:
Either the target group in your Application Load Balancer is unhealthy,
or the target group setting in your deployment configuration does not point to the target group referenced in your load balancer's rules.

Why does a Windows service sometimes not start on an AWS EC2 micro instance?

I have created a Windows service executable from Python code. It starts when I start it manually on AWS EC2 instances, and it sometimes starts automatically when the instance boots. But sometimes the service is not started on the instance. Why does this happen only some of the time? For your information, I also increased the service start timeout to 700000 in the registry (regedit), but the service still does not always start automatically. Why is that happening, and is there a solution?
If the service is set to start automatically at boot but it isn't, there should be a record describing the failure in the "System" area in the Event Viewer. Check those logs.
Also, try setting the service's "Startup type" to "Automatic (Delayed Start)". Doing so will delay service startup by a couple of minutes, which may be enough to fix the problem if it is a "race condition" as the system starts.
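The "Automatic (Delayed Start)" setting can also be applied from a script instead of the Services console. The sketch below builds the corresponding sc.exe command; the service name MyPythonService is a hypothetical placeholder, and the command has to be run from an elevated prompt on the Windows instance.

```python
import subprocess


def delayed_start_command(service_name):
    """Build the sc.exe call that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects "start=" and its value as separate tokens.
    return ["sc.exe", "config", service_name, "start=", "delayed-auto"]


def set_delayed_start(service_name):
    """Run the command; only works on Windows with administrator rights."""
    subprocess.run(delayed_start_command(service_name), check=True)
```

For example, set_delayed_start("MyPythonService") would be run once after the service is installed.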

OpenNMS alert when a folder is not empty?

I'm trying to create an OpenNMS alert when a certain folder ISN'T empty but can't seem to find a way of doing it. Any ideas?
I assume you have a service which goes down if your folder is empty. See the short video. Notifications are turned off by default. Once they are enabled, every service-down event triggers a notification by default. You can be more granular by filtering on nodes and services. The default setting sends a mail to the admin user; you set the mail address in the admin user's settings. To configure access to your mail server, edit javamail-configuration.properties. I'm just trying to figure out where exactly you are stuck.
One approach could be to check the directory for the empty condition with an agent on your host system, e.g. Net-SNMP, and expose the status. You can then create a service that uses the SNMP Monitor to poll the status of the exposed OID, and create a mail notification for that particular service.
Yes, this can be done. I have performed similar tasks using simple perl and bash scripts on Linux.
OpenNMS allows you to create polling configurations based on scripts. Your script is expected to output "0" or "1", with 0 representing "OK" and 1 representing "Not OK".
You could use the GeneralPurposePoller:
https://wiki.opennms.org/wiki/GeneralPurposePoller
However, it seems that you should instead use the SystemExecuteMonitor:
https://wiki.opennms.org/wiki/SystemExecuteMonitor
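A check script for this case can be very small. The sketch below is written in Python rather than Perl or Bash and follows the output convention described in this answer ("0" for OK, "1" for Not OK), treating an empty folder as OK. How the script is wired into the poller configuration is not shown, and the folder path is whatever you pass as the first argument.

```python
import os
import sys


def folder_status(path):
    """Return "0" (OK) when the folder is empty, "1" (Not OK) otherwise."""
    with os.scandir(path) as entries:
        # any() stops at the first entry, so large folders stay cheap.
        return "1" if any(entries) else "0"


# Print the status only when called with exactly one argument (the folder).
if __name__ == "__main__" and len(sys.argv) == 2:
    print(folder_status(sys.argv[1]))
```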

How to run only failed sessions in a workflow

In a workflow, some sessions are connected in parallel and some in sequence. Suppose some of these sessions, both parallel and sequential, fail. How do I restart the workflow so that only the failed sessions run? How can I design this in Informatica?
Turn on 'suspend on error' for the workflow.
Turn on 'restart on recovery' for each session in the workflow.
Now, if any session fails, the workflow is suspended until you fix the problem and hit Recover on the workflow in the monitor. Doing so restarts only the failed sessions.
A large publishing client asked us to implement something similar to what you asked. We created a database table to keep track of successful sessions within a workflow. Each session has a mapping at the end that adds an entry to the database recording whether it passed or failed. When we run in recovery mode, we query the database at the beginning of each session to find out whether that session needs to run.
We also provided a web interface to this table where business users can manually choose which sessions to run or skip based on their needs.
The recovery option works only if "workflow recovery" is turned on in the repository. If it isn't, you can check the "fail workflow if task fails" option at the individual session level and create conditions on the links between the tasks in the workflow. The disadvantage of this method is that your workflow will appear as failed and won't execute the next sessions until the failed ones are fixed.
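The tracking-table idea can be illustrated outside Informatica with a few lines of Python and SQLite. This is only a sketch of the logic, not the implementation described above: the table name session_status, its columns, and the function names are invented for the example, and in a real workflow the insert and the lookup would be done by mappings or pre-session steps.

```python
import sqlite3


def open_tracking_db(path=":memory:"):
    """Open the tracking database and create the status table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_status ("
        " workflow TEXT, session TEXT, status TEXT,"
        " PRIMARY KEY (workflow, session))"
    )
    return conn


def record_result(conn, workflow, session, passed):
    """What the mapping at the end of each session would write."""
    conn.execute(
        "INSERT OR REPLACE INTO session_status VALUES (?, ?, ?)",
        (workflow, session, "passed" if passed else "failed"),
    )
    conn.commit()


def should_run(conn, workflow, session):
    """Recovery-mode check: run unless the session already passed."""
    row = conn.execute(
        "SELECT status FROM session_status WHERE workflow = ? AND session = ?",
        (workflow, session),
    ).fetchone()
    return row is None or row[0] != "passed"
```

A manual override, like the web interface mentioned above, would amount to editing the status column for the sessions a user wants to rerun or skip.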
thanks.
