Why is my CloudWatch alarm not being applied to the EC2 instances? - aws-lambda

I have Python code in a Lambda function to apply a CloudWatch alarm to EC2 instances.
The CloudWatch alarm is to reboot them if they are unresponsive for 10 minutes. This alarm is easy to create on a per-instance basis, but that's a lot of manual work; we have many servers.
I've set up a CloudWatch rule that triggers my Lambda function when an EC2 instance enters the "running" state, either after a reboot or after a new EC2 instance is launched and reaches "running".
I have tried specifying a specific server in my code, and that works. What I want, however, is code that applies the alarm to servers as they are rebooted, so as to cover them all as maintenance windows come around and they all get rebooted.
from collections import defaultdict
import boto3

ec2_sns = 'SNS-Topic:'
ec2_rec = "arn:aws:automate:eu-central-1:ec2:recover"

def lambda_handler(event, context):
    ec2 = boto3.resource('ec2')
    cw = boto3.client('cloudwatch')
    ec2info = defaultdict()
    running_instances = ec2.instances.filter(
        Filters=[{'Name': 'tag-key', 'Values': ['cloudwatch']}])
    for instance in running_instances:
        for tag in instance.tags:
            if 'Name' in tag['Key']:
                name = tag['Value']
                ec2info[instance.id] = {'Name': name, 'InstanceId': instance.instance_id}
    attributes = ['Name', 'InstanceId']
    for instance_id, instance in ec2info.items():
        instanceid = instance["InstanceId"]
        nameinsta = instance["Name"]
        print(instanceid, nameinsta)
        # Create StatusCheckFailed alarms
        cw.put_metric_alarm(
            AlarmName=('InstanceId') + "_System_Unresponsive_(Created by Lambda)",
            AlarmDescription='System_unresponsive for 10 minutes',
            ActionsEnabled=True,
            OKActions=[
                'No data',
            ],
            AlarmActions=[
                'arn:aws:lambda:eu-central-1:788677770941:function:System_unresponsive:reboot',
            ],
            InsufficientDataActions=[
                'Insuficient data',
            ],
            MetricName='StatusCheckFailed',
            Namespace='AWS/EC2',
            Statistic='Average',
            Dimensions=[{'Name': "InstanceId", 'Value': instanceid}],
            Period=300,
            Unit='Seconds',
            EvaluationPeriods=2,
            DatapointsToAlarm=2,
            Threshold=1,
            ComparisonOperator='LessThanOrEqualToThreshold')
I expect the code to apply the specified CloudWatch alarm to servers as they are rebooted, but it doesn't.
When you test it, all you get is "null" as a result.

You can use CloudTrail to get insight into the API calls AWS makes to start the instances, and catch just those specific events with CloudWatch Events.
Once you catch the right events and send them to a Lambda, the function will receive the instance ID in the event information. You can use that information to create or update the alarms for just the instance contained in the event. You can use print(json.dumps(event)) inside the function to inspect the event contents in CloudWatch Logs.
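For illustration, here is a minimal sketch of that approach, assuming the function is triggered by the EC2 instance state-change event for the "running" state. The recover action ARN is the one from the question; the metric settings are placeholders to tune:

import boto3

cw = boto3.client('cloudwatch')

# Action ARN taken from the question -- substitute your own reboot/recover action.
ALARM_ACTION = 'arn:aws:automate:eu-central-1:ec2:recover'

def lambda_handler(event, context):
    # EC2 instance state-change events carry the instance ID directly.
    instance_id = event['detail']['instance-id']
    cw.put_metric_alarm(
        AlarmName=instance_id + '_System_Unresponsive_(Created by Lambda)',
        AlarmDescription='System unresponsive for 10 minutes',
        ActionsEnabled=True,
        AlarmActions=[ALARM_ACTION],
        MetricName='StatusCheckFailed',
        Namespace='AWS/EC2',
        Statistic='Maximum',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        Period=300,
        EvaluationPeriods=2,
        DatapointsToAlarm=2,
        Threshold=1,
        # Alarm when the status check reports failures (value >= 1).
        ComparisonOperator='GreaterThanOrEqualToThreshold',
    )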

Related

Turn on All ec2(+ future created ec2) 'termination protection' Using Lambda

I'm trying to turn on 'termination protection' for all EC2 instances.
(Termination protection doesn't work for spot instances, so I want to add a skip condition to avoid raising an error on spot instances.)
I saw the code below; however, it doesn't work.
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client('ec2')
    ec2_regions = [region['RegionName'] for region in client.describe_regions()['Regions']]
    for region in ec2_regions:
        client = boto3.client('ec2', region_name=region)
        conn = boto3.resource('ec2', region_name=region)
        instances = conn.instances.filter()
        for instance in instances:
            if instance.state["Name"] == "running":
                # print(instance.id, instance.instance_type, region)
                terminate_protection = client.describe_instance_attribute(
                    InstanceId=instance.id, Attribute='disableApiTermination')
                protection_value = terminate_protection['DisableApiTermination']['Value']
                if protection_value == False:
                    client.modify_instance_attribute(
                        InstanceId=instance.id, Attribute="disableApiTermination", Value="True")
Summary:
I want to turn on 'termination protection' for every EC2 instance that is running (and is not a spot instance).
The region should be ap-northeast-2.
Could you help me fix this code so it runs appropriately?
If you want to skip spot instances, all you need to do is figure out which instances are spot instances.
You can check each instance's spot_instance_request_id (also returned by the describe_instances API): if it is set, the instance is a spot instance; if it is empty, it is not.
import boto3

# Scope to ap-northeast-2, per the question
ec2 = boto3.resource('ec2', region_name='ap-northeast-2')
# Add filters of your own choice
instances = ec2.instances.filter(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])

for instance in instances:
    if instance.spot_instance_request_id:
        continue  # spot instance: termination protection is not supported, skip it
    # On-demand instance: turn on termination protection
    instance.modify_attribute(DisableApiTermination={'Value': True})
You can refer to a similar question here: https://stackoverflow.com/a/45604396/13126651
Docs for describe_instances

Unable to Launch EC2 Instances Asynchronously via Terraform

I want to launch two instances via Terraform. The first one will generate some certificate files and push them to an S3 bucket; the second instance will pull those certificates from that bucket. Both operations are handled by user data. The problem is that the pull commands (AWS CLI) in the second instance's user data are not working (they work when I run them from a shell). I think the issue is that Terraform launches both instances in parallel, so the second instance comes up before the first has pushed the certificates to S3.
I also tried to handle this by adding "depends_on" to my code, but it did not work. I am looking for a way to stagger the launches, e.g. launch the second instance 30 seconds after the first one. Here I am pasting the relevant part of the code.
data "template_file" "first_executor" {
template = file("some_path/first_executor.sh")
}
resource "aws_instance" "first_instance" {
ami = data.aws_ami.amazon-linux-2.id
instance_type = "t2.micro"
user_data = data.template_file.first_executor.rendered
network_interface {
device_index = 0
network_interface_id = aws_network_interface.first_instance-network-interface.id
}
}
###
data "template_file" "second_executor" {
template = file("some_path/second_executor.sh")
}
resource "aws_instance" "second_instance" {
depends_on = [aws_instance.first_instance]
ami = data.aws_ami.amazon-linux-2.id
instance_type = "t2.micro"
user_data = data.template_file.second_executor.rendered
network_interface {
device_index = 0
network_interface_id = aws_network_interface.second-network-interface.id
}
}
The answer is no. "depends_on" in Terraform means it waits for the dependency resource to be created, so your second EC2 instance is created as soon as the first one exists.
Terraform will not wait until your first instance's user data has finished executing.
I would suggest keeping depends_on and then, in your second instance's user data script, adding a loop that polls S3 and repeats until the certificates are found, as sketched below.
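As an illustration, here is a minimal sketch of such a loop in Python with boto3 (the bucket and key names are hypothetical placeholders); the second instance's user data would run something like this before pulling the certificates:

import time
import boto3

s3 = boto3.client('s3')

# Hypothetical names -- substitute your actual bucket and certificate key.
BUCKET = 'my-cert-bucket'
KEY = 'certs/server.crt'

def wait_for_certificates(timeout_seconds=600, poll_interval=15):
    """Poll S3 until the certificate object exists or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            s3.head_object(Bucket=BUCKET, Key=KEY)
            return True  # object found; safe to pull the certificates
        except s3.exceptions.ClientError:
            time.sleep(poll_interval)
    return False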

Create CloudWatch alarm that sets an instance to standby via SNS/Lambda

What I am looking to do is set an instance to standby mode when it hits an alarm state. I already have an alarm set up to detect when my instance hits 90% CPU for a while. The alarm currently sends a Slack and text message via SNS, which calls a Lambda function. What I would like to add is putting the instance into standby mode. The instances are in an Auto Scaling group.
I found that you can perform this through the CLI using the command:
aws autoscaling enter-standby --instance-ids i-66b4f7d5be234234234 --auto-scaling-group-name my-asg --should-decrement-desired-capacity
You can also do this with boto3:
response = client.enter_standby(
    InstanceIds=[
        'string',
    ],
    AutoScalingGroupName='string',
    ShouldDecrementDesiredCapacity=True|False
)
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
Is there a better/easier way before I start?
I already have the InstanceId passed to the Lambda in the event, so I would have to add the ASG name to the event as well.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID? Then I do not have to pass it in with the event.
Thanks!
Your question has a couple of sub-parts, so I'll try to answer them in order:
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
You don't need to, you could overload your existing function. I could see a valid argument for either separate functions (separation of concerns) or one function (since "reacting to CPU hitting 90%" is basically "one thing").
Is there a better/easier way before I start?
I don't know of any other way you could do it, other than Cloudwatch -> SNS -> Lambda.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID?
Yes, see this question for an example. It's up to you whether doing the lookup in the Lambda or passing an additional parameter is the cleaner option.
For anyone interested, here is what I came up with for the Lambda function (in Python):
# Puts the instance into standby mode, which takes it off the load balancer;
# a replacement unit is spun up to take its place.

import json
import boto3

ec2_client = boto3.client('ec2')
asg_client = boto3.client('autoscaling')

def lambda_handler(event, context):
    # Get the instance id from the event JSON
    msg = event['Records'][0]['Sns']['Message']
    msg_json = json.loads(msg)
    id = msg_json['Trigger']['Dimensions'][0]['value']
    print("Instance id is " + str(id))

    # Capture all the info about the instance so we can extract the ASG name later
    response = ec2_client.describe_instances(
        Filters=[
            {
                'Name': 'instance-id',
                'Values': [str(id)]
            },
        ],
    )

    # Get the ASG name from the instance's tags
    # autoscaling_name = response['Reservations'][0]['Instances'][0]['Tags'][1]['Value']
    tags = response['Reservations'][0]['Instances'][0]['Tags']
    autoscaling_name = next(t["Value"] for t in tags if t["Key"] == "aws:autoscaling:groupName")
    print("Autoscaling name is - " + str(autoscaling_name))

    # Put the instance in standby
    response = asg_client.enter_standby(
        InstanceIds=[
            str(id),
        ],
        AutoScalingGroupName=str(autoscaling_name),
        ShouldDecrementDesiredCapacity=False
    )

getting throttle exception while using aws describe_log_streams

Below is my boto3 code snippet for a Lambda. My requirement is to read the entire CloudWatch logs and, based on certain criteria, push them to S3.
I have used the snippet below to read the CloudWatch logs from each stream. This works absolutely fine for smaller amounts of data; however, for massive logs inside each LogStream it throws:
Throttle exception - (reached max retries: 4)
The default/maximum value of the limit parameter is 50.
I tried a few other values, but to no avail. Please check and let me know if there is any alternative for this.
while v_nextToken is not None:
    cnt += 1
    loglist += '\n' + "No of iterations inside describe_log_streams 2nd stage - Iteration Cnt" + str(cnt)
    # Note: max value of limit is 50, and it defaults to 50
    # desc_response = client.describe_log_streams(logGroupName=vlog_groups, orderBy='LastEventTime', nextToken=v_nextToken, descending=True, limit=50)
    try:
        desc_response = client.describe_log_streams(
            logGroupName=vlog_groups, orderBy='LastEventTime',
            nextToken=v_nextToken, descending=True, limit=50)
    except Exception as e:
        print("Throttling error" + str(e))
You can use a CloudWatch Logs subscription filter for Lambda, so the Lambda is triggered directly from the log stream. You can also consider subscribing a Kinesis stream instead, which has some advantages.
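For illustration, a minimal sketch of creating such a subscription filter with boto3; the log group name and Lambda ARN are placeholders, and the Lambda must separately grant CloudWatch Logs permission to invoke it:

import boto3

logs = boto3.client('logs')

# Placeholder names/ARNs -- substitute your own.
logs.put_subscription_filter(
    logGroupName='/my/log/group',
    filterName='forward-to-lambda',
    filterPattern='',  # an empty pattern forwards every log event
    destinationArn='arn:aws:lambda:eu-central-1:123456789012:function:my-func',
)

With a subscription, log events are pushed to the function as they arrive, so there is no describe_log_streams polling left to throttle.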

EC2 - Connect to running instance by using the API

I created an EC2 instance via the provided interface, and I am using the AWS API to connect to the existing running instance, but when I run the following code I get "You have 0 Amazon EC2 instance(s) running.":
DescribeAvailabilityZonesResult availabilityZonesResult = ec2.describeAvailabilityZones();
System.out.println("You have access to " + availabilityZonesResult.getAvailabilityZones().size() +
        " Availability Zones.");

DescribeInstancesResult describeInstancesRequest = ec2.describeInstances();
List<Reservation> reservations = describeInstancesRequest.getReservations();
Set<Instance> instances = new HashSet<Instance>();

for (Reservation reservation : reservations) {
    instances.addAll(reservation.getInstances());
}

System.out.println("You have " + instances.size() + " Amazon EC2 instance(s) running.");
Do you have any ideas about what might be the problem?
If you have double-checked that your instances are actually up and running, then they are most likely not in the "us-east-1" region (the default one the AWS SDK assumes).
So set your AmazonEC2Client instance to point to the correct endpoint and everything should be fine, e.g. for Europe (Ireland):
ec2.setEndpoint("ec2.eu-west-1.amazonaws.com");
More details, as well as links to where you can find the endpoint strings, are in this SO answer.
