How to create an Azure Databricks job cluster to save costs compared to a standard (all-purpose) cluster? - azure-databricks

I have a few pipeline jobs on Azure Databricks that run ETL solutions using standard or high-concurrency clusters.
I've noticed on the Azure pricing calculator that a job cluster is a cheaper option that should do the same thing: https://azure.microsoft.com/en-gb/pricing/calculator/
All Purpose - Standard_DS3_v2: 0.75 DBU × £0.292 per DBU per hour = £0.22
Job Cluster - Standard_DS3_v2: 0.75 DBU × £0.109 per DBU per hour = £0.08
I have configured a job cluster by creating a new job and selecting a new job cluster, as per the tutorial below: https://docs.databricks.com/jobs.html#create-a-job
The job was successful and ran for a couple of days. However, the cost did not really go down. Have I missed anything?
Cluster Config
{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 24
  },
  "cluster_name": "",
  "spark_version": "9.1.x-scala2.12",
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true",
    "spark.scheduler.mode": "FAIR",
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
    "spark.databricks.service.server.enabled": "true",
    "spark.databricks.repl.allowedLanguages": "sql,python,r",
    "avro.mapred.ignore.inputs.without.extension": "true",
    "spark.databricks.cluster.profile": "serverless",
    "spark.databricks.service.port": "8787"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "ON_DEMAND_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_DS3_v2",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "enable_elastic_disk": true,
  "cluster_source": "JOB",
  "init_scripts": []
}
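For reference, a minimal sketch of what a Jobs API 2.1 create payload could look like with a cluster like the one above declared as the job's own new_cluster (the job name, notebook path and schedule below are made-up placeholders, and the cluster block is abbreviated):
{
  "name": "nightly-etl",
  "tasks": [
    {
      "task_key": "etl",
      "notebook_task": {
        "notebook_path": "/Repos/etl/main"
      },
      "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {
          "min_workers": 2,
          "max_workers": 24
        }
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
A cluster created this way exists only for the duration of the run and is billed at the Job Compute DBU rate, whereas a job pointed at an existing all-purpose cluster continues to be billed at the All Purpose rate.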

Related

Apache Pinot server component consumes an unexpected amount of memory

Problem Description:
Docker is used to deploy Apache Pinot on production servers (VMs).
Pinot's official documentation has been followed for this purpose.
What has been done
The Pinot servers consume more memory than expected given the amount of data and the replication factor we have.
The following things have been tried:
Defining Xms and Xmx flags for the JVM in the JAVA_OPTS environment variable
Setting up monitoring on the machines in order to gain observability
Removing the indices (like the inverted index) from the table definition
System Specification:
We have 3 servers, 2 controllers and 2 brokers with the following specifications:
24-core CPU
64 GB of memory
738 GB of SSD disk
Sample docker-compose file from one of the servers:
version: '3.7'
services:
  pinot-server:
    image: apachepinot/pinot:0.11.0
    command: "StartServer -clusterName bigdata-pinot-ansible -zkAddress 172.16.24.14:2181,172.16.24.15:2181 -configFileName /server.conf"
    restart: unless-stopped
    hostname: server1
    container_name: server1
    ports:
      - "8096-8099:8096-8099"
      - "9000:9000"
      - "8008:8008"
    environment:
      JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx20G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent/jmx_prometheus_javaagent-0.12.0.jar=8008:/opt/pinot/etc/jmx_prometheus_javaagent/configs/pinot.yml"
    volumes:
      - ./server.conf:/server.conf
      - ./data/server_data/segment:/var/pinot/server/data/segment
      - ./data/server_data/index:/var/pinot/server/data/index
table config:
{
  "tableName": "<table-name>",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "<schema-name>",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "60",
    "replication": "3",
    "timeColumnName": "date",
    "allowNullTimeValue": false,
    "replicasPerPartition": "3",
    "segmentPushType": "APPEND",
    "completionConfig": {
      "completionMode": "DOWNLOAD"
    }
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant",
    "tagOverrideConfig": {
      "realtimeCompleted": "DefaultTenant_OFFLINE"
    }
  },
  "tableIndexConfig": {
    "noDictionaryColumns": [
      <some-fileds>
    ],
    "rangeIndexColumns": [
      <some-fileds>
    ],
    "rangeIndexVersion": 1,
    "autoGeneratedInvertedIndex": false,
    "createInvertedIndexDuringSegmentGeneration": false,
    "sortedColumn": [
      "date",
      "id"
    ],
    "bloomFilterColumns": [],
    "loadMode": "MMAP",
    "onHeapDictionaryColumns": [],
    "varLengthDictionaryColumns": [],
    "enableDefaultStarTree": false,
    "enableDynamicStarTreeCreation": false,
    "aggregateMetrics": false,
    "nullHandlingEnabled": false
  },
  "metadata": {},
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "query": {},
  "fieldConfigList": [],
  "upsertConfig": {
    "mode": "FULL",
    "hashFunction": "NONE"
  },
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "<topic-name>",
          "stream.kafka.broker.list": "<kafka-brokers-list>",
          "stream.kafka.consumer.type": "lowlevel",
          "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
          "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
          "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
          "stream.kafka.decoder.prop.format": "JSON",
          "realtime.segment.flush.threshold.rows": "0",
          "realtime.segment.flush.threshold.time": "1h",
          "realtime.segment.flush.segment.size": "300M"
        }
      ]
    }
  },
  "isDimTable": false
}
server.conf file:
pinot.server.netty.port=8098
pinot.server.adminapi.port=8097
pinot.server.instance.dataDir=/var/pinot/server/data/index
pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
pinot.set.instance.id.to.hostname=true
After ingesting data from the real-time stream (Kafka in our case), memory usage grows until the containers run into an OOMKilled error.
We have no clue about what is happening on the server; could someone help us find the root cause of this problem?
P.S. 1: To follow the complete process of how Pinot is deployed, you can see this repository on GitHub.
P.S. 2: It is known that the size of data in Pinot can be calculated using the following formula:
Data size = data size per day * retention period (days) * replication factor
For example, if we have data with a retention of 2d (two days), approximately 2 gigabytes of data per day, and a replication factor of 3, the data size is about 2 * 2 * 3 = 12 gigabytes.
As described in the question, the problem lies in how the table is configured, not in Apache Pinot itself. Apache Pinot keeps the keys for the upsert operation on heap. In order to scale the performance, the number of Kafka partitions has to be increased.
Based on the documentation, the default upsert mode is NONE.
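A minimal sketch of what that change could look like in the table config above, assuming full upsert semantics are not actually required for this table (only the relevant fragment is shown, everything else stays unchanged; omitting the upsertConfig block entirely has the same effect, since NONE is the default):
{
  "tableName": "<table-name>",
  "tableType": "REALTIME",
  "upsertConfig": {
    "mode": "NONE"
  }
}
If full upsert is genuinely needed, the primary keys stay on heap, and, as noted above, scaling that means increasing the number of Kafka partitions.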

Elasticsearch : Snapshots not executed automatically (cron schedule issue)

I've changed the schedule of my managed snapshot policy. By default, the cron runs every 30 minutes, and I wanted just one snapshot per day. Since then, the snapshots are no longer executed automatically.
I've put the previous configuration back, but it didn't fix the problem. :/
There's no recent failure (the last failure is Feb 03, 2022 10:30 AM).
BUT! I'm able to take a snapshot manually. It's only the cron that is not working.
My deployment is at version 7.17.1. This is the JSON config returned by GET https://ES.HOST/_slm/policy/ :
{
  "cloud-snapshot-policy": {
    "version": 10,
    "modified_date_millis": 1648205967792,
    "policy": {
      "name": "<cloud-snapshot-{now/d}>",
      "schedule": "0 */30 * * * ?",
      "repository": "found-snapshots",
      "config": {
        "partial": true
      },
      "retention": {
        "expire_after": "259200s",
        "min_count": 10,
        "max_count": 100
      }
    },
    "last_success": {
      "snapshot_name": "cloud-snapshot-2022.03.28-7bsk_rtkt7klcmre3yghtq",
      "start_time": 1648449376402,
      "time": 1648451047195
    },
    "last_failure": {
      "snapshot_name": "cloud-snapshot-2022.02.03-yyl4pcwfqvynla9v7noqgg",
      "time": 1643880630250
    },
    "next_execution_millis": 1648458000000,
    "stats": {
      "policy": "cloud-snapshot-policy",
      "snapshots_taken": 23346,
      "snapshots_failed": 230,
      "snapshots_deleted": 23214,
      "snapshot_deletion_failures": 119
    }
  }
}
My deployment is managed on Elastic Cloud, using Azure.
I took a snapshot manually 2 hours ago and, as you can see, a snapshot should have been executed 3 minutes ago but doesn't appear in my found-snapshots repository (even in the logs, there's nothing...).
So it's "just" the cron schedule that needs to be fixed... Any suggestions?
Thanks!
Well...
POST https://ES.HOST/_slm/stop
then...
POST https://ES.HOST/_slm/start
and the cron schedule works again!
Thanks to this resource: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management-api.html
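For completeness, a sketch of the policy body one might send with PUT https://ES.HOST/_slm/policy/cloud-snapshot-policy to keep the desired once-a-day schedule before restarting SLM (the 01:30 time is an arbitrary example; the other fields are copied from the policy shown above):
{
  "name": "<cloud-snapshot-{now/d}>",
  "schedule": "0 30 1 * * ?",
  "repository": "found-snapshots",
  "config": {
    "partial": true
  },
  "retention": {
    "expire_after": "259200s",
    "min_count": 10,
    "max_count": 100
  }
}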

ECS provisions multiple servers but never runs the task

I have an ECS cluster where the capacity provider is an auto-scaling group of EC2 instances with a Target Tracking scaling policy and Managed Scaling turned on.
The min capacity of the cluster is 0, the max is 100. The instance type it's using is c5.12xlarge.
I have a task that uses 4 vCPUs and 4 GiB of memory. When I run a single instance of that task on that cluster, ECS very slowly auto-scales the group to more than 1 server (usually 2 to begin with, then eventually adds a third - I've tried multiple times), but never actually runs the task; it stays in a PROVISIONING state for ages until I get annoyed and stop the task.
Here is a redacted copy of my task description:
{
  "family": "my-task",
  "taskRoleArn": "arn:aws:iam::999999999999:role/My-IAM-Role",
  "executionRoleArn": "arn:aws:iam::999999999999:role/ecsTaskExecutionRole",
  "cpu": "4 vCPU",
  "memory": 4096,
  "containerDefinitions": [
    {
      "name": "my-task",
      "image": "999999999999.dkr.ecr.us-east-1.amazonaws.com/my-container:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 12012,
          "hostPort": 12012,
          "protocol": "tcp"
        }
      ],
      "mountPoints": [
        {
          "sourceVolume": "myEfsVolume",
          "containerPath": "/mnt/efs",
          "readOnly": false
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "myEfsVolume",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-1234567",
        "transitEncryption": "ENABLED",
        "authorizationConfig": {
          "iam": "ENABLED"
        }
      }
    }
  ],
  "requiresCompatibilities": [
    "EC2"
  ],
  "tags": [
    ...
  ]
}
My questions are:
Why, if I'm running a single task that would easily run on one instance, is it scaling the group to at least 2 servers?
Why does it never just deploy and run my task?
Where can I look to see what the hell is going on with it (logs, etc)?
So it turns out that, even if you set an ASG as the capacity provider for an ECS cluster, if you haven't set up the User Data in the launch configuration for that ASG with something like the following:
#!/bin/bash
echo ECS_CLUSTER=my-cluster-name >> /etc/ecs/ecs.config;echo ECS_BACKEND_HOST= >> /etc/ecs/ecs.config;
then it will never make a single instance available to your cluster. ECS will respond by continuing to increase the desired capacity of the ASG.
Personally, I feel this is something ECS should make sure happens behind the scenes. Maybe there's a good reason why it doesn't.
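For anyone defining the ASG in CloudFormation, a rough sketch of how that user data could be wired into a launch template; the resource and parameter names (EcsLaunchTemplate, EcsOptimizedAmiId, EcsCluster) are made up for illustration, and the AMI is assumed to be an ECS-optimized one:
"EcsLaunchTemplate": {
  "Type": "AWS::EC2::LaunchTemplate",
  "Properties": {
    "LaunchTemplateData": {
      "ImageId": {
        "Ref": "EcsOptimizedAmiId"
      },
      "InstanceType": "c5.12xlarge",
      "UserData": {
        "Fn::Base64": {
          "Fn::Join": [
            "",
            [
              "#!/bin/bash\n",
              "echo ECS_CLUSTER=",
              {
                "Ref": "EcsCluster"
              },
              " >> /etc/ecs/ecs.config\n"
            ]
          ]
        }
      }
    }
  }
}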

How to rotate ELK logs?

I have indexes of around 250 GB on each of 3 hosts, i.e. 750 GB of data in the ELK cluster.
How can I rotate the ELK logs so that three months of data are kept in the ELK cluster and older logs are pushed to some other place?
You could create your index using the "indexname-%{+YYYY.MM}" naming format. This will create a distinct index every month.
You could then filter these indexes, based on timestamp, using a plugin like Curator.
Curator can help you set up a cron job to purge those older indexes or back them up to some S3 repository.
Reference - Backup or Restore using curator
Moreover, you could even restore these backed-up indexes whenever needed, directly from the S3 repo, for historical analysis.
The answer by dexter_ is correct but, as it is old, a better answer would be:
Version 7.x of the Elastic Stack provides index lifecycle management (ILM) policies, which can be easily managed with the Kibana GUI and are native to the Elastic Stack.
PS: you still have to name the indices like "indexname-%{+YYYY.MM}", as suggested by dexter_.
elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html
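For illustration, a sketch of an ILM policy matching the three-month requirement from the question; the rollover conditions are placeholder values to adjust, and the policy name used when creating it (in the Kibana ILM UI or via PUT _ilm/policy/<policy-name>) is up to you:
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "30d",
            "max_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Note that the delete phase only removes old indexes; shipping older data to "some other place", as the question asks, would still need a snapshot repository or a separate archiving step.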
It took me a while to figure out the exact syntax and rules, so I'll post the final policy I used to remove old indexes (it's based on the example from https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-opensearch-service-successor-to-amazon-elasticsearch-service/):
{
  "policy": {
    "description": "Removes old indexes",
    "default_state": "active",
    "states": [
      {
        "name": "active",
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "14d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": [
        "mylogs-*"
      ]
    }
  }
}
It will automatically apply the policy to any new mylogs-* indexes, but you'll need to apply it manually to existing ones (under "Index Management" -> "Indices").

Schedule AWS-Lambda with Java and CloudWatch Triggers

I am new to AWS and AWS Lambda. I have to create a Lambda function to run a cron job every 10 minutes. I am planning to add a CloudWatch trigger to trigger it every 10 minutes, but without any event. I looked it up on the internet and found that some event needs to be there to get it running.
I need some clarity and leads on the 2 points below:
Can I schedule a job using AWS Lambda, with CloudWatch triggering it every 10 minutes, without any events?
How does one make it interact with MySQL databases hosted on AWS?
I have my application built on Spring Boot, running on multiple instances with a shared database (single source of truth). I have devised everything stated above using Spring's in-built scheduler and proper synchronisation at the DB level using locks, but because of the distributed nature of the instances, I have been advised to do the same using Lambdas.
You need to pass a ScheduledEvent object to the handleRequest() of the Lambda:
handleRequest(ScheduledEvent event, Context context)
Configure a cron job that runs every 10 minutes in your CloudWatch template (if using CloudFormation). This will make sure your Lambda is triggered every 10 minutes.
Make sure to add the dependency below to your pom:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-events</artifactId>
    <version>2.2.5</version>
</dependency>
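For context, the ScheduledEvent the handler receives is deserialized from a payload roughly like the following (the id, account, time and rule ARN below are placeholder values):
{
  "version": "0",
  "id": "53dc4d37-cffa-4f76-80c9-8b7d4a4d2eaa",
  "detail-type": "Scheduled Event",
  "source": "aws.events",
  "account": "999999999999",
  "time": "2022-03-28T10:30:00Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:events:us-east-1:999999999999:rule/my-every-10-minutes-rule"
  ],
  "detail": {}
}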
Method 2:
You can specify something like this in your CloudFormation template. This will not require any argument to be passed to your handler(), in case you do not need any event-related information. This will automatically trigger your Lambda as per your cron job.
"ScheduledRule": {
"Type": "AWS::Events::Rule",
"Properties": {
"Description": "ScheduledRule",
"ScheduleExpression": {
"Fn::Join": [
"",
[
"cron(",
{
"Ref": "ScheduleCronExpression"
},
")"
]
]
},
"State": "ENABLED",
"Targets": [
{
"Arn": {
"Fn::GetAtt": [
"LAMBDANAME",
"Arn"
]
},
"Id": "TargetFunctionV1"
}
]
}
},
"PermissionForEventsToInvokeLambdaFunction": {
"Type": "AWS::Lambda::Permission",
"Properties": {
"FunctionName": {
"Ref": "NAME"
},
"Action": "lambda:InvokeFunction",
"Principal": "events.amazonaws.com",
"SourceArn": {
"Fn::GetAtt": [
"ScheduledRule",
"Arn"
]
}
}
}
}
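Since the template above references a ScheduleCronExpression parameter, a possible declaration for a 10-minute schedule could look like the following (the expression uses the six-field AWS cron syntax, minutes through year):
"Parameters": {
  "ScheduleCronExpression": {
    "Type": "String",
    "Default": "0/10 * * * ? *"
  }
}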
If you want to run a cron job, a CloudWatch Events rule is the only option.
If you don't want to use CloudWatch Events, then go ahead with an EC2 instance. But EC2 will cost you more than the CloudWatch Events rule.
Note: setting up a CloudWatch Events rule is just like defining a cron job in crontab on any Linux system, nothing more. On a Linux server you define everything in raw form, whereas here it is just UI-based.
