I need help identifying the unhealthy host in a cluster of six EC2 instances under a Target Group.
Right now I have a CloudWatch alarm that triggers whenever there is an unhealthy host, but it doesn't show which host is unhealthy. I am vaguely aware that I can use an SNS topic to trigger a Lambda function, but I need help from the community here.
Is there a way I can run a command on that particular unhealthy instance, or at least reboot it via EC2 actions?
I couldn't extract the instance ID from the CloudWatch ELB metrics to experiment with.
Maybe a separate SNS topic to run a command via Systems Manager?
Please give me a detailed answer as I am still learning AWS. :)
You should have a look at this AWS blog post, which covers this exact use case.
Identifying unhealthy targets of Elastic Load Balancer
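In case it helps while you read that post: the ELBv2 API exposes a DescribeTargetHealth call that returns the health state of every target in a target group, so a Lambda function subscribed to your SNS topic can look up exactly which instance is unhealthy. A minimal sketch in Python with boto3 (the target group ARN is a placeholder you would supply, e.g. via an environment variable):

import os
import boto3

# Assumption: the target group ARN is passed in via an environment variable.
TARGET_GROUP_ARN = os.environ["TARGET_GROUP_ARN"]

def handler(event, context):
    elbv2 = boto3.client("elbv2")
    # DescribeTargetHealth returns the health state of every registered target.
    response = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    unhealthy = [
        d["Target"]["Id"]
        for d in response["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] == "unhealthy"
    ]
    print("Unhealthy instance IDs:", unhealthy)
    return unhealthy

From there you could pass the instance IDs to ec2.reboot_instances or to an SSM SendCommand call, as you suggested.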
I have a Kubernetes cluster with 5 replica sets running a Spring Boot server, and I would like to subscribe each pod to an Amazon SNS topic individually. Is that possible? How can I do that?
Thanks!
If you are running a Spring server, you can create an endpoint on each pod and register those endpoints as subscribers to your SNS topic. You would need one subscription per pod (you should be able to get pod address information by running kubectl describe svc ${service_name}) in order for SNS to properly fan out messages to all of your pods. AWS outlines this process here
Edit: it is worth noting that the above process is not very robust, since the pod address list may not be static. It may be better to subscribe your service to the SNS topic, since the service abstracts away the changing pod IP list, and implement pod-to-pod communication, similar to what is outlined in this article
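For illustration, creating one subscription per pod with boto3 might look like the sketch below; the topic ARN and pod endpoint are placeholders, and note that SNS first sends each HTTP endpoint a SubscriptionConfirmation message that your Spring endpoint must confirm before deliveries begin:

import boto3

sns = boto3.client("sns")

# Placeholder values: your topic ARN and an externally reachable pod endpoint.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-topic"
POD_ENDPOINT = "http://pod-address:8080/sns/handler"

# Each subscription receives a copy of every published message,
# which is how SNS fans out to multiple subscribers.
sns.subscribe(TopicArn=TOPIC_ARN, Protocol="http", Endpoint=POD_ENDPOINT)

Keep in mind the endpoint must be reachable from the public internet for SNS to deliver to it.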
I have multiple EC2 instances in an Auto Scaling group. They all run the same Java application. In the application, I want to trigger a piece of functionality every month, so I have a function that uses Spring's scheduling support and runs monthly. But that function runs on every single EC2 instance in the Auto Scaling group, while it must run only once. How should I approach this issue? I am thinking of using services like Amazon SQS, but they would have the same problem.
To be more specific about what I have tried: in one attempt, the function puts a record with a key unique to this month into a database shared among all the EC2 instances. If the record for this month is already there, the put request is ignored. But now the problem transfers to the reading part: I have a function that reads the database and does the job, and that function is still run by every single EC2 instance.
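For illustration, that kind of claim-the-month write can be made atomic with a conditional put, so exactly one instance wins and only the winner runs the job (a sketch assuming DynamoDB; the table and key names are placeholders):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def claim_monthly_job(month_key):
    # Returns True only for the single instance whose write succeeds.
    try:
        dynamodb.put_item(
            TableName="monthly-jobs",  # placeholder table name
            Item={"month_key": {"S": month_key}},
            # Fails if another instance has already claimed this month.
            ConditionExpression="attribute_not_exists(month_key)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

With this, the one instance that gets True both writes the record and does the job, so the reading part no longer needs to run everywhere.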
Interesting! You could put a configuration on one of the servers to trigger a monthly activity, but individual instances in an Auto Scaling group should be treated as identical, fragile systems that could be replaced during a month. So, there would be no guarantee that this specific server would be around in one month.
I would suggest you take a step back and look at the monthly event as something that is triggered external to the servers.
I'm going to assume that the cluster of servers is running a web application and there is a Load Balancer in front of the instances that distributes traffic amongst the instances. If so, "something" should send a request to the Load Balancer, and this would be forwarded to one of the instances for processing, just like any normal request.
This particular request would go to a URL used specifically to trigger the monthly processing.
This leaves the question of what is the "something" that sends this particular request. For that, there are many options. A simple one would be:
Configure Amazon CloudWatch Events to trigger a Lambda function based on a schedule
The AWS Lambda function would send the HTTP request to the Load Balancer
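A minimal sketch of that Lambda function in Python (the load balancer DNS name and trigger path are placeholders):

import urllib.request

# Placeholder: your load balancer's DNS name and the monthly-trigger path.
TRIGGER_URL = "http://my-lb-1234567890.us-east-1.elb.amazonaws.com/jobs/monthly"

def handler(event, context):
    # The load balancer forwards this to exactly one instance,
    # so the monthly job runs once regardless of the group's size.
    with urllib.request.urlopen(TRIGGER_URL, timeout=30) as resp:
        return {"status": resp.status}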
I am new to RabbitMQ and I am evaluating it for my next project. Is it possible to use AWS Auto Scaling with RabbitMQ? How would the multiple instances coordinate messages across their queues? I see that RabbitMQ has clustering capabilities, but it does not appear to fit an autoscaling model. I did find this post:
How to set up autoscaling RabbitMQ Cluster AWS
It fixed the scale-up issues but did not address what to do when instances scale down. The issue with scaling down is the potential for messages to still be in the queues when an instance is removed. Clustering is fine, but I would like to leverage autoscaling whenever possible.
How do I use Consul to make sure only one service is performing a task?
I've followed the examples at http://www.consul.io/ but I am not 100% sure which way to go. Should I use KV? Should I use services? Or should I register a service as a health check and have it be callable by the cluster at a given interval?
For example, imagine there are several data centers, and within every data center there are many services running. Every one of these services can send emails; each has to check whether there are any emails to be sent and, if there are, send them. However, I don't want the same email to be sent more than once.
How would I make sure all emails are sent and none is sent more than once?
I could do this using other technologies, but I am trying to implement this using Consul.
This is exactly the use case for Consul Distributed Locks.
For example, let's say you have three servers in different AWS availability zones for failover. Each one is launched with:
consul lock -verbose lock-name ./run_server.sh
The Consul agent will only run the ./run_server.sh command on whichever server acquires the lock first. If ./run_server.sh fails on the server holding the lock, the Consul agent will release the lock, and another node that acquires it first will execute ./run_server.sh. This way you get failover with only one server running at a time. If you registered your Consul health checks properly, you'll be able to see that the server on the first node failed; you can repair it and restart the consul lock ... command on that node, and it will block until it can acquire the lock.
Currently, distributed locking can only happen within a single Consul datacenter. But since it is up to you to decide which Consul servers make up a datacenter, you should be able to solve your issue. If you want locking across federated Consul datacenters, you'll have to wait, since it's a roadmap item.
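The same acquire-a-lock pattern is also available through Consul's session and KV APIs if you would rather hold the lock from inside your own code than via the consul lock wrapper. A sketch using the python-consul client (the key name and TTL are arbitrary choices):

import consul

c = consul.Consul()  # talks to the local Consul agent

# A session ties the lock to this process; if the process dies or stops
# renewing the session, Consul invalidates it and the lock is released.
session_id = c.session.create(name="email-sender", ttl=15)

# kv.put with acquire= succeeds for exactly one holder at a time.
if c.kv.put("service/email-sender/leader", "me", acquire=session_id):
    print("Lock acquired - this node sends the emails")
else:
    print("Another node holds the lock - standing by")

A real service would also renew the session periodically (c.session.renew) and release the lock on shutdown.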
First point:
The question is how to use Consul to solve a specific problem. However, Consul cannot solve that specific problem because of intrinsic limitations in the nature of a gossip protocol.
When one datacenter cannot talk to another you cannot safely determine if the problem is the network or the affected datacenter.
The usual solution is to define what happens when one DC cannot talk to another one. For example, if we have 3 datacenters (DC1, DC2, and DC3) we can determine that whenever one DC cannot talk to the other 2 DCs then it will stop updating the database.
If DC1 cannot talk to DC2 and DC3 then DC1 will stop updating the database, and the system will assume DC2 and DC3 are still online.
Let's imagine that DC2 and DC3 are still online and they can talk to each other, then we have quorum to continue running the system.
When DC1 comes online again it will play catch up with the database.
Where can Consul help here? It can communicate between DCs and check if they are online... but so can ICMP.
Take a look at the comments. Did this answer your question? Not really. But I don't think the question has an answer.
Second point: the question is "How to use Consul in leader election?" It would have been better to ask how Consul elects a new leader, or "Given the documentation on Consul.io, can you give me an example of how to determine the leader using Consul?"
If that is what you really want, then the question was already answered: How does a Consul agent know it is the leader of a cluster?
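For completeness, any agent will tell you the current Raft leader with a single HTTP call to its status endpoint, e.g. from Python against the default local agent address:

import json
import urllib.request

# /v1/status/leader returns the Raft leader's "host:port" as a JSON string.
with urllib.request.urlopen("http://127.0.0.1:8500/v1/status/leader") as resp:
    leader = json.loads(resp.read())

print("Current Consul leader:", leader)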
I have RabbitMQ installed and working well on an EC2 CentOS 6 instance, with an assortment of queues and topics. I decided to migrate this working instance to another, new EC2 server instance with the same OS and initial setup, just smaller.
I created an AMI (Amazon Machine Image) from the existing installation, and then used this AMI to create a new server instance. RabbitMQ came up just fine, as did all the topics, users, virtual hosts, queues, etc.
However, the queues all came back with 0 messages in them, although messages did exist in the queues before creating the server image.
Questions:
Did I miss something in my migration?
Where are messages explicitly 'stored' while they're in RabbitMQ queues?
I believe the messages were sent as 'persistent', but I'm not 100% sure about that. I am aware of replication of RabbitMQ instances, but I figured this method of server recreation would be simpler/quicker.
#robthewolf's comments got me searching some more, but with a slightly different slant (around whether one could explicitly save off queue messages in a backing database/key-value store).
That led me to this old, but seemingly still-relevant, blog post that clearly describes RabbitMQ's current persistence methods for all cases (persistent publishing, durable queues, etc.):
http://www.rabbitmq.com/blog/2011/01/20/rabbitmq-backing-stores-databases-and-disks/
If the messages were persistent, you can check this SO question: RabbitMQ uses Mnesia storage, which is tied to the IP address of the machine it is running on, so a few tweaks from the answers there can resolve the issue.
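For reference, making messages survive a broker restart requires both a durable queue and persistent messages. A sketch with the pika client in Python (the queue name is arbitrary):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# durable=True makes the queue definition survive a broker restart.
channel.queue_declare(queue="work", durable=True)

# delivery_mode=2 marks the message itself as persistent, so RabbitMQ
# writes it to disk instead of keeping it only in memory.
channel.basic_publish(
    exchange="",
    routing_key="work",
    body=b"hello",
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()

Even persistent messages live in the node's Mnesia/message-store directory, which is why the node-identity issue above still matters after an AMI copy.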