ECS provisions multiple servers but never runs the task - amazon-ec2

I have an ECS cluster where the capacity provider is an auto-scaling group of ec2 servers with a Target Tracking scaling policy and Managed Scaling turned on.
The min capacity of the group is 0 and the max is 100. The instance type it uses is c5.12xlarge.
I have a task that uses 4 vCPUs and 4 GiB of memory. When I run a single instance of that task on the cluster, ECS very slowly scales the group out to more than one server (usually 2 to begin with, then eventually a third - I've tried multiple times), but never actually runs the task; it sits in PROVISIONING for ages and ages until I get annoyed and stop it.
Here is a redacted copy of my task definition:
{
    "family": "my-task",
    "taskRoleArn": "arn:aws:iam::999999999999:role/My-IAM-Role",
    "executionRoleArn": "arn:aws:iam::999999999999:role/ecsTaskExecutionRole",
    "cpu": "4 vCPU",
    "memory": 4096,
    "containerDefinitions": [
        {
            "name": "my-task",
            "image": "999999999999.dkr.ecr.us-east-1.amazonaws.com/my-container:latest",
            "essential": true,
            "portMappings": [
                {
                    "containerPort": 12012,
                    "hostPort": 12012,
                    "protocol": "tcp"
                }
            ],
            "mountPoints": [
                {
                    "sourceVolume": "myEfsVolume",
                    "containerPath": "/mnt/efs",
                    "readOnly": false
                }
            ]
        }
    ],
    "volumes": [
        {
            "name": "myEfsVolume",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-1234567",
                "transitEncryption": "ENABLED",
                "authorizationConfig": {
                    "iam": "ENABLED"
                }
            }
        }
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "tags": [
        ...
    ]
}
My questions are:
Why, if I'm running a single task that would easily fit on one instance, is it scaling the group to at least 2 servers?
Why does it never just deploy and run my task?
Where can I look to see what the hell is going on with it (logs, etc)?

So it turns out that, even if you set an ASG as the capacity provider for an ECS cluster, if you haven't set up the User Data in that ASG's launch configuration (or launch template) with something like the following:
#!/bin/bash
echo ECS_CLUSTER=my-cluster-name >> /etc/ecs/ecs.config;echo ECS_BACKEND_HOST= >> /etc/ecs/ecs.config;
then it will never make a single instance available to your cluster. ECS will respond by continuing to increase the desired capacity of the ASG.
Personally I feel this is something ECS should take care of for you automatically. Maybe there's a good reason it doesn't.
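As for where to look: besides the ECS console's cluster view, one quick check is whether any container instances have actually registered with the cluster. Below is a minimal sketch using boto3 (the cluster name is the hypothetical one from the User Data above); an empty list while the ASG keeps growing means the instances are booting but the ECS agent never joins the cluster, which is exactly the missing-User-Data symptom. The agent's own log (/var/log/ecs/ecs-agent.log on the ECS-optimized AMI) is also worth a look.
import boto3

ecs = boto3.client('ecs')

# Container instances whose ECS agent has registered with the cluster.
arns = ecs.list_container_instances(cluster='my-cluster-name')['containerInstanceArns']
print('Registered container instances:', arns)

if arns:
    details = ecs.describe_container_instances(
        cluster='my-cluster-name',
        containerInstances=arns)
    for ci in details['containerInstances']:
        # agentConnected False (or no instances at all) explains the endless PROVISIONING.
        print(ci['ec2InstanceId'], ci['status'], ci['agentConnected'])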

Related

How do I change root volume size of AWS Batch at runtime

I have an application which makes requests to AWS to start Batch jobs. The jobs vary, so the resource requirements change for each job.
It is clear how to change CPUs and memory; however, I cannot figure out how to specify root volume size, or whether it is even possible.
Here is an example of the code I am running:
import boto3
client = boto3.client('batch')
JOB_QUEUE = "job-queue"
JOB_DEFINITION="job-definition"
container_overrides = {
    'vcpus': 1,
    'memory': 1024,
    'command': ['echo', 'Hello World'],
    # 'volume_size': 50  # this is not valid
    'environment': [  # this just creates env variables
        {
            'name': 'volume_size',
            'value': '50'
        }
    ]
}
response = client.submit_job(
    jobName="volume-size-test",
    jobQueue=JOB_QUEUE,
    jobDefinition=JOB_DEFINITION,
    containerOverrides=container_overrides)
My question is similar to this one; however, I am specifically asking whether this is possible at runtime. I can change the launch template, but that doesn't solve the issue of specifying required resources when making the request - unless the solution is to create multiple launch templates and select one at run time, though that seems unnecessarily complicated.
You can use Amazon Elastic File System for this. EFS volumes can be mounted into the containers created for your job definition, and EFS doesn't require you to provide a specific volume size because it automatically grows and shrinks with usage.
You need to specify an Amazon EFS file system in your job definition through the efsVolumeConfiguration property:
{
    "containerProperties": [
        {
            "image": "amazonlinux:2",
            "command": [
                "ls",
                "-la",
                "/mount/efs"
            ],
            "mountPoints": [
                {
                    "sourceVolume": "myEfsVolume",
                    "containerPath": "/mount/efs",
                    "readOnly": true
                }
            ],
            "volumes": [
                {
                    "name": "myEfsVolume",
                    "efsVolumeConfiguration": {
                        "fileSystemId": "fs-12345678",
                        "rootDirectory": "/path/to/my/data",
                        "transitEncryption": "ENABLED",
                        "transitEncryptionPort": integer,
                        "authorizationConfig": {
                            "accessPointId": "fsap-1234567890abcdef1",
                            "iam": "ENABLED"
                        }
                    }
                }
            ]
        }
    ]
}
Reference: https://docs.aws.amazon.com/batch/latest/userguide/efs-volumes.html
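If you are submitting jobs from boto3 as in the question, note that the EFS volume belongs to the job definition rather than to submit_job, so the submit_job call stays unchanged. A minimal sketch of registering such a job definition (the definition name is hypothetical, the file system ID is the placeholder from the example above):
import boto3

batch = boto3.client('batch')

# Register a job definition whose container mounts an EFS volume.
response = batch.register_job_definition(
    jobDefinitionName='job-definition-efs',   # hypothetical name
    type='container',
    containerProperties={
        'image': 'amazonlinux:2',
        'command': ['ls', '-la', '/mount/efs'],
        'resourceRequirements': [
            {'type': 'VCPU', 'value': '1'},
            {'type': 'MEMORY', 'value': '1024'},
        ],
        'mountPoints': [
            {'sourceVolume': 'myEfsVolume', 'containerPath': '/mount/efs', 'readOnly': True},
        ],
        'volumes': [
            {
                'name': 'myEfsVolume',
                'efsVolumeConfiguration': {
                    'fileSystemId': 'fs-12345678',   # placeholder
                    'transitEncryption': 'ENABLED',
                },
            },
        ],
    },
)
print(response['jobDefinitionArn'])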

How to create Azure Databricks Job cluster to save some costing compared to Standard cluster?

I have a few pipeline jobs on Azure Databricks that run ETL solutions using standard or high-concurrency clusters.
I've noticed on the Azure pricing page that a job cluster is a cheaper option that should do the same thing: https://azure.microsoft.com/en-gb/pricing/calculator/
All Purpose - Standard_DS3_v2: 0.75 DBU × £0.292 per DBU per hour = £0.22
Job Cluster - Standard_DS3_v2: 0.75 DBU × £0.109 per DBU per hour = £0.08
I have configured a job cluster by creating a new job and selecting a new job cluster, as per the tutorial below: https://docs.databricks.com/jobs.html#create-a-job
The job was successful and ran for a couple of days. However, the cost did not really go down. Have I missed anything?
Cluster Config
{
    "autoscale": {
        "min_workers": 2,
        "max_workers": 24
    },
    "cluster_name": "",
    "spark_version": "9.1.x-scala2.12",
    "spark_conf": {
        "spark.databricks.delta.preview.enabled": "true",
        "spark.scheduler.mode": "FAIR",
        "spark.sql.sources.partitionOverwriteMode": "dynamic",
        "spark.databricks.service.server.enabled": "true",
        "spark.databricks.repl.allowedLanguages": "sql,python,r",
        "avro.mapred.ignore.inputs.without.extension": "true",
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.service.port": "8787"
    },
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "ON_DEMAND_AZURE",
        "spot_bid_max_price": -1
    },
    "node_type_id": "Standard_DS3_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "enable_elastic_disk": true,
    "cluster_source": "JOB",
    "init_scripts": []
}
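For reference, creating a job with a job cluster boils down to supplying new_cluster inside the job specification instead of pointing at an existing all-purpose cluster. A rough sketch against the Jobs API 2.1 create endpoint (the workspace URL, token, job name, and notebook path below are all hypothetical):
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # hypothetical
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                           # hypothetical

job_spec = {
    "name": "etl-pipeline-job",                                      # hypothetical
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/main"},   # hypothetical
            # Job cluster: created when the run starts, terminated when it ends,
            # and billed at the Jobs Compute DBU rate rather than All-Purpose.
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "autoscale": {"min_workers": 2, "max_workers": 24},
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}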

Schedule AWS-Lambda with Java and CloudWatch Triggers

I am new to AWS and AWS Lambda. I have to create a Lambda function that runs a cron job every 10 minutes. I am planning to add a CloudWatch trigger to invoke it every 10 minutes, but without any event. I looked it up on the internet and found that some event needs to be there to get it running.
I need some clarity and leads on the 2 points below:
Can I schedule a job using AWS Lambda, with CloudWatch triggering it every 10 minutes, without any events?
How does one make it interact with MySQL databases hosted on AWS?
I have my application built on Spring Boot running on multiple instances with a shared database (single source of truth). I have implemented everything stated above using Spring's built-in scheduler and proper synchronisation at the DB level using locks, but because of the distributed nature of the instances, I have been advised to do the same using Lambdas.
You need to pass a ScheduledEvent object to the handleRequest() method of your Lambda:
handleRequest(ScheduledEvent event, Context context)
Configure a cron rule that runs every 10 minutes in your CloudWatch template (if using CloudFormation). This will make sure your Lambda is triggered every 10 minutes.
Make sure to add the below-mentioned dependency to your pom:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-events</artifactId>
    <version>2.2.5</version>
</dependency>
Method 2:
You can specify something like this in your CloudFormation template. It does not require any argument to be passed to your handler(), in case you do not need any event-related information, and it will automatically trigger your Lambda as per your cron schedule.
"ScheduledRule": {
"Type": "AWS::Events::Rule",
"Properties": {
"Description": "ScheduledRule",
"ScheduleExpression": {
"Fn::Join": [
"",
[
"cron(",
{
"Ref": "ScheduleCronExpression"
},
")"
]
]
},
"State": "ENABLED",
"Targets": [
{
"Arn": {
"Fn::GetAtt": [
"LAMBDANAME",
"Arn"
]
},
"Id": "TargetFunctionV1"
}
]
}
},
"PermissionForEventsToInvokeLambdaFunction": {
"Type": "AWS::Lambda::Permission",
"Properties": {
"FunctionName": {
"Ref": "NAME"
},
"Action": "lambda:InvokeFunction",
"Principal": "events.amazonaws.com",
"SourceArn": {
"Fn::GetAtt": [
"ScheduledRule",
"Arn"
]
}
}
}
}
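The same rule, permission, and target can also be wired up outside CloudFormation. A sketch using boto3 (the function name and ARN below are hypothetical), in case that is easier to experiment with:
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

FUNCTION_NAME = 'my-scheduled-function'   # hypothetical
FUNCTION_ARN = 'arn:aws:lambda:us-east-1:999999999999:function:my-scheduled-function'

# Rule that fires every 10 minutes (rate() is an alternative to cron()).
rule = events.put_rule(
    Name='every-10-minutes',
    ScheduleExpression='rate(10 minutes)',
    State='ENABLED',
)

# Allow CloudWatch Events to invoke the function (the Permission resource above).
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='events-invoke-every-10-minutes',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)

# Point the rule at the function (the Targets section above).
events.put_targets(
    Rule='every-10-minutes',
    Targets=[{'Id': 'TargetFunctionV1', 'Arn': FUNCTION_ARN}],
)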
If you want to run a cron job, a CloudWatch Events rule is the only option.
If you don't want to use CloudWatch Events then go ahead with an EC2 instance, but EC2 will cost you more than the CloudWatch Events rule.
Note: setting up a CloudWatch Events rule is just like defining a cron job in crontab on any Linux system, nothing much. On a Linux server you define the raw crontab entry yourself, whereas here it's just UI based.

Jelastic API environment create trigger data

The Jelastic API environment.Trigger.AddTrigger takes "data" as a parameter, but I cannot find what all the possible variables are that I can use. The Jelastic API docs just say "data : string , information about trigger". Is this "data" documented somewhere else?
There are some JPS JavaScript/Java examples that I have found that point me in the right direction, but more information would be nice to have:
https://github.com/jelastic-jps/magento-cluster/blob/master/scripts/addTriggers.js
https://docs.cloudscripting.com/0.99/examples/horizontal-scaling/
https://github.com/jelastic-jps/basic-examples/blob/master/automatic-horizontal-scaling/scripts/enableAutoScaling.js
The environment.Trigger.AddTrigger method requires a set of parameters:
name - name of a notification trigger
nodeGroup - target node group (you can apply the trigger to any node group within the chosen environment)
period - load period for nodes
condition - rules for monitoring resources
    type - comparison sign, the available values are GREATER and LESS
    value - percentage of a resource that is monitored
    resourceType - types of resources that are monitored by a trigger, namely CPU, Memory (RAM), Network, Disk I/O, and Disk IOPS
    valueType - measurement value. Here, PERCENTAGES is the only possible measurement value. The available range is from 0 up to 100.
actions - object to describe a trigger action
    type - trigger action, the available values are NOTIFY, ADD_NODE, and REMOVE_NODE
    customData:
        notify - alert notification sent to a user via email
The following code shows how a new trigger can be created:
{
    "type": "update",
    "name": "AddTrigger",
    "onInstall": {
        "environment.trigger.AddTrigger": {
            "data": {
                "name": "new alert",
                "nodeGroup": "sqldb",
                "period": "10",
                "condition": {
                    "type": "GREATER",
                    "value": "55",
                    "resourceType": "MEM",
                    "valueType": "PERCENTAGES"
                },
                "actions": [
                    {
                        "type": "NOTIFY",
                        "customData": {
                            "notify": false
                        }
                    }
                ]
            }
        }
    }
}
More information about events and other CloudScripting language features can be found here.

How to Register Durable External Service in Consul

I am registering an external service in Consul through the Catalog API (http://127.0.0.1:8500/v1/catalog/register) with a payload as follows:
{
    "Datacenter": "dc1",
    "Node": "pedram",
    "Address": "www.google.com",
    "Service": {
        "ID": "google",
        "Service": "google",
        "Address": "www.google.com",
        "Port": 80
    },
    "Check": {
        "Node": "pedram",
        "CheckID": "service:google",
        "Status": "passing",
        "ServiceID": "google",
        "script": "curl www.google.com > /dev/null 2>&1",
        "interval": "10s"
    }
}
The external service registers successfully and I see it in the list of registered services, but after a while it disappears. It seems it gets unregistered automatically.
I am running Consul in -dev mode.
What's the problem?
I found that I should register external services on a separate node. My application's local services are registered on a node named
"Node": "pedram"
and when I register external services on this node, they get removed automatically.
But when I register my external services on a new node, all the new external services are registered durably and are ready to be used just like all the other local services.
My new payload is as follows:
{
    "Datacenter": "dc1",
    "Node": "newNode",
    "Address": "www.google.com",
    "Service": {
        "ID": "google",
        "Service": "google",
        "Address": "www.google.com",
        "Port": 80
    },
    "Check": {
        "Node": "newNode",
        "CheckID": "service:google",
        "Status": "passing",
        "ServiceID": "google"
    }
}
This is expected behavior. From the Consul Anti-Entropy docs:
If any services or checks exist in the catalog that the agent is not aware of, they will be automatically removed to make the catalog reflect the proper set of services and health information for that agent. Consul treats the state of the agent as authoritative; if there are any differences between the agent and catalog view, the agent-local view will always be used.
In your setup, the agent on the host 'pedram' isn't aware of the registered service, so the anti-entropy strategy removes it.
You shouldn't be using -dev mode, except for testing/playing around. For your health check, I'd recommend not using a script check like "script": "curl www.google.com > /dev/null 2>&1".
Instead I'd recommend using an HTTP health check:
"http": "https://www.google.com",
More about health checks is available here: https://www.consul.io/docs/agent/checks.html
Also, you should probably move to HTTPS (on port 443) if you can.
It also might help to save this as a .json file and let Consul read it as part of its startup, as I'm guessing you want this to be a long-running external service. You can do that with a command like:
/usr/local/bin/consul agent -config-dir=/etc/consul/consul.d
and every .json file in /etc/consul/consul.d/ will be read as part of its config. If you change the files, running consul reload will pick up the changes.
I'd make those changes (not running in dev mode, etc.) and see if the problem still exists. I'm guessing it won't.
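To illustrate the config-file suggestion, here is a sketch (written in Python to match the rest of this page) that generates such a service definition with an HTTP check; the file path and service values are just examples:
import json

# Service definition with an HTTP health check instead of a script check.
service_definition = {
    "service": {
        "name": "google",
        "address": "www.google.com",
        "port": 443,
        "checks": [
            {
                "http": "https://www.google.com",
                "interval": "10s"
            }
        ]
    }
}

# Drop the file into the directory that consul agent -config-dir points at,
# then run consul reload to pick it up.
with open("/etc/consul/consul.d/google.json", "w") as f:
    json.dump(service_definition, f, indent=2)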
