Databricks notebook integrated mlflow artifact location and retention - azure-databricks

Currently, a notebook run creates an experiment ID by default, but the artifact location points to something under dbfs:/databricks/mlflow/{experiment id}. Is there a way to change this for the default experiment creation? We would like to manage the storage outside Databricks.
What is the default TTL for experiment runs and metrics? Is it configurable, and if so, how?

You can use mlflow_set_experiment('<PATH>') to specify where you want your runs and all of their contents to be logged. See the docs here.
If you are working on Databricks and want to log to a particular blob storage, you can mount the blob storage to Databricks File System (DBFS) and point MLflow to it when you set the experiment.
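As a minimal sketch of the mount-and-point approach, assuming the Python MLflow client is used and the blob container is already mounted: the mount point /mnt/mlflow-artifacts and the experiment names below are hypothetical. mlflow.create_experiment accepts an artifact_location argument; as far as I know the location is fixed when the experiment is created, so this applies to new experiments rather than the default one.

```python
def dbfs_artifact_location(mount_point: str, experiment_name: str) -> str:
    """Build a dbfs:/ URI under an existing DBFS mount for one experiment."""
    mount = mount_point.strip("/")
    # Flatten workspace-style names such as /Users/me/exp into one folder name.
    name = experiment_name.strip("/").replace("/", "_")
    return f"dbfs:/{mount}/{name}"


def create_experiment_at(experiment_path: str, artifact_location: str) -> str:
    """Create a new experiment whose runs log artifacts to artifact_location."""
    import mlflow  # imported lazily; requires mlflow to be installed

    return mlflow.create_experiment(
        experiment_path, artifact_location=artifact_location
    )


if __name__ == "__main__":
    # Hypothetical mount point and experiment name, for illustration only.
    print(dbfs_artifact_location("/mnt/mlflow-artifacts", "churn-model"))
```

After creating the experiment this way, calling mlflow.set_experiment with the same experiment path should make subsequent runs in the notebook log to it.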
If you are talking about running it in Databricks and directly logging the results locally, I don't think you can do that. However, you can use GitHub and MLflow Projects to develop on Databricks and then run locally, or vice versa.

Related

How to use Azure Spot instances on Databricks

Spot instances make it possible to use spare cloud capacity at a lower price; however, if cloud demand increases, your resources will be deallocated. This is very useful for non-critical workloads where you can afford to lose some of the work done.
Databricks can run spot instances on AWS, but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?
Yes, it is possible, but not through the Databricks UI. To use Azure spot instances on Databricks you need to use the Databricks CLI.
Note
With the CLI tool it is possible to administer (create, edit, delete) clusters and instance pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install the Databricks CLI using pip install databricks-cli and configure your credentials with databricks configure --token. For more information, visit the Databricks documentation.
Run the command databricks clusters list to find the ID of the cluster you want to modify:
$ databricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. The first column is the cluster ID, the second is the name of the cluster, and the last is the state.
The command databricks clusters get outputs the cluster config in JSON format. Let's write it to a file so we can modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster like name, instance type, owner... In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
    "first_on_demand": 1,
    "availability": "ON_DEMAND_AZURE",
    "spot_bid_max_price": -1.0
},
...
We need to change availability to SPOT_WITH_FALLBACK_AZURE and set spot_bid_max_price to our bid price. Edit the file with your favorite tool. The result should be something like:
...
"azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": 0.4566
},
...
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
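If you prefer to script the edit rather than change the file by hand, the step can be sketched in Python using only the standard json module. The function names are hypothetical, and the field names and values are the ones shown in the azure_attributes section above; this is an illustration, not part of the Databricks CLI.

```python
import json


def use_spot_workers(cluster_config: dict, max_price: float = -1.0) -> dict:
    """Return a copy of the cluster config with spot-with-fallback workers."""
    updated = dict(cluster_config)
    # Copy the nested section so the caller's dict is not mutated.
    azure = dict(updated.get("azure_attributes", {}))
    azure["availability"] = "SPOT_WITH_FALLBACK_AZURE"
    azure["spot_bid_max_price"] = max_price
    updated["azure_attributes"] = azure
    return updated


def rewrite_config_file(path: str, max_price: float) -> None:
    """Load the JSON written by `databricks clusters get` and update it in place."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(use_spot_workers(config, max_price), f, indent=2)


if __name__ == "__main__":
    rewrite_config_file("/tmp/my_cluster.json", 0.4566)
```

The rewritten file can then be passed to databricks clusters edit as before.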
Now, every time you start the cluster, the workers will be spot instances. To confirm this, go to the configuration tab of a worker VM allocated in the resource group managed by Databricks. You will see that Azure Spot is active with the configured price.
Databricks on AWS has more options for the availability field, such as SPOT. However, until the documentation is released, we'll need to wait or configure by trial and error.

How to update a laravel project in aws Elastic beanstalk, while keeping the same storage

I want to update my Laravel project in AWS Elastic Beanstalk, but the storage folder on the Elastic Beanstalk instance now differs from the one in my project, and I want to keep it. I don't know how, because my project contains a storage folder that is empty, and if I deploy it I'll lose all the files.
How can I update the code but keep the storage?
Your application should be designed to be stateless. The reason is that your EB instances always run in an Auto Scaling group.
This means that they can be terminated and replaced at any time, without your knowledge or involvement. There are many scenarios in which that may happen. Examples include Availability Zone rebalancing, migration to new physical hardware, scale-in and scale-out activities, and instance health degradation.
Consequently, you are always at risk of losing your storage, whether you like it or not.
Therefore your application should be designed as stateless, which means that it does not store any data on the instance. This is usually achieved by storing the data in external storage such as EFS:
How can I mount an Amazon EFS volume to an instance in my Elastic Beanstalk environment?
But if you still want to keep your design, you can always use .ebextensions scripts to help you replace the storage folder. Specifically, in commands you would make a copy of your storage folder to a safe location at the start of the new deployment. Then in container_commands you would copy the files back to your new application folder, just before the new deployment completes.
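The copy-out and copy-back steps could be sketched as a small Python helper that the deployment hooks invoke. This is only an illustration under assumptions: the function names are hypothetical, the application and backup directories are passed in because they depend on the platform version, and it needs Python 3.8 or later for dirs_exist_ok.

```python
import shutil
from pathlib import Path


def backup_storage(app_dir: str, backup_dir: str) -> None:
    """Copy <app_dir>/storage to backup_dir, replacing any previous backup."""
    source = Path(app_dir) / "storage"
    target = Path(backup_dir)
    if not source.is_dir():
        return  # first deployment: nothing to preserve
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(source, target)


def restore_storage(backup_dir: str, new_app_dir: str) -> None:
    """Copy the backup into <new_app_dir>/storage, overwriting same-named files."""
    source = Path(backup_dir)
    if not source.is_dir():
        return  # no backup was taken
    shutil.copytree(source, Path(new_app_dir) / "storage", dirs_exist_ok=True)
```

Note that this only preserves files across a deployment on the same instance. It does not protect against the instance being replaced, which is why external storage remains the safer design.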

hazelcast-jet deployment and data ingestion

I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model that can process metadata periodically published by each node (CPU usage, memory usage, I/O, etc.). My system only cares about the latest data. It is also OK with missing a couple of data points when the processing model is down. Thus, I picked hazelcast-jet, which is an in-memory processing model with great performance. I have a couple of questions regarding the model:
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
How to ingest data from thousands of sources? The sources push data instead of being pulled.
How to config client so that it knows where to submit the tasks?
It would be super useful if there were a comprehensive example that I could learn from.
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
Download and unzip the Hazelcast Jet distribution on each machine:
$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1
Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:
$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar
Edit bin/common.sh to add the module to the classpath. Towards the end of the file there is this line:
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
You can duplicate this line and replace -jet-3.1 with -aws-2.4.
Edit config/hazelcast.xml to enable the AWS cluster discovery. The details are here. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment.
Start the cluster with jet-start.sh.
How to config client so that it knows where to submit the tasks?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");
However, depending on your AWS setup, these addresses may not be stable, so you can configure the client to discover them as well. This is explained here.
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast Map, and use a mapJournal source to get the update events from it.

How do I run a cron inside a kubernetes pod/container which has a running spring-boot application?

I have a spring-boot application running on a container. One of the APIs is a file upload API and every time a file is uploaded it has to be scanned for viruses. We have uvscan to scan the uploaded file. I'm looking at adding uvscan to the base image but the virus definitions need to be updated on a daily basis. I've created a script to update the virus definitions. The simplest way currently is to run a cron inside the container which invokes the script. Is there any other alternative to do this? Can the uvscan utility be isolated from the app pod and invoked from the application?
There are many ways to solve the problem. I hope I can help you find what suits you best.
From my perspective, it would be pretty convenient to have a CronJob that builds and pushes the new docker image with uvscan and the updated virus definition database on a daily basis.
In your file processing sequence you can create a scan Job using the Kubernetes API, and give it access to a shared volume containing the file you need to scan.
The scan Job will use the :latest image, so when a new image appears in the registry it will download that image and create the pod from it.
The downside is that building images daily consumes some amount of disk space, so you may need a process for removing old images from the registry and from the Docker cache on each node of the Kubernetes cluster.
Alternatively, you can put the AV database on a shared volume, or use Mount Propagation, and update it independently of the pods. This should be possible if uvscan opens the AV database in read-only mode.
On the other hand, it usually takes time to load virus definitions into memory, so it might be better to run the virus scanner as a Deployment rather than as a Job, with a daily restart after a new image is pushed to the registry.
At my place of work, we also run our dockerized services on EC2 instances. If you only need to update the definitions once a day, I would recommend using an AWS Lambda function. It's relatively affordable and you don't need to worry about the overhead of a scheduler, etc. If you need help setting up the Lambda, I can provide more context. Still, this is only another option within the AWS ecosystem.
So basically I simply added a cron to the application running inside the container to update the virus definitions.

Modify datastax cassandra ami startup script

I am exploring the possibility of modifying https://github.com/riptano/ComboAMI to support Ec2MultiRegionSnitch.
In that:
Add option --snitch Ec2MultiRegionSnitch -> modify cassandra.yaml to write snitch as multi region
Add option --broadcast_address_as_public_ip yes -> modify cassandra.yaml to write broadcast_address: public_ip
Add option --seeds 100.222.111.222, so as the newly created instances can join an existing cassandra, e.g. 100.222.111.222.
I tested the settings and they worked.
The restrictions
I can't copy the datastax ami to be my own ami.
I can't snapshot an existing datastax cassandra instance into an AMI, such that I modify the script locally to get it launched.
The question:
How can I modify the script and test it out?
Should I instead use an AutoScalingGroup with a LaunchConfiguration that points to this AMI, then use sed to modify cassandra.yaml and restart the cassandra service? It is not obvious to me how to run a script after the AWS launch configuration has finished launching the instance, especially since I can't get the AWS::Instances::GetAtt PublicIP for the broadcast address. Ideally, the changes should be made while the script constructs cassandra.yaml, not afterwards.
Thanks!
That's correct, the AMI has to be rebuilt on a clean image under your account. We have instructions here on how to do so:
https://github.com/riptano/ComboAMI/blob/2.5/presetup/setup.md
As far as the AutoScalingGroup question goes, I'm not sure how beneficial that would be. If you create your own image from your own repo, feel free to create a pull request and I'll look over the changes and merge them into the official AMI.
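For testing the three proposed options outside the AMI script, the cassandra.yaml rewrite can be sketched in Python. This is not ComboAMI code: the function name is hypothetical, it patches the standard endpoint_snitch, broadcast_address and seeds settings line by line rather than parsing YAML, and the public IP is passed in (on EC2 it could be read from the instance metadata endpoint at http://169.254.169.254/latest/meta-data/public-ipv4).

```python
import re


def patch_cassandra_yaml(text: str, snitch: str,
                         broadcast_address: str, seeds: str) -> str:
    """Rewrite snitch, broadcast address and seeds in cassandra.yaml text."""
    out = []
    for line in text.splitlines():
        # Strip a leading comment marker so commented-out defaults are matched too.
        stripped = line.lstrip("# ").rstrip()
        if stripped.startswith("endpoint_snitch:"):
            line = f"endpoint_snitch: {snitch}"
        elif stripped.startswith("broadcast_address:"):
            line = f"broadcast_address: {broadcast_address}"
        elif re.match(r"\s*-\s*seeds:", line):
            # Keep the original indentation of the seed_provider entry.
            indent = line[: line.index("-")]
            line = f'{indent}- seeds: "{seeds}"'
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "endpoint_snitch: SimpleSnitch\n"
        "# broadcast_address: 1.2.3.4\n"
        '          - seeds: "127.0.0.1"\n'
    )
    print(patch_cassandra_yaml(sample, "Ec2MultiRegionSnitch",
                               "54.1.2.3", "100.222.111.222"))
```

Running the function against a copy of the generated cassandra.yaml and diffing the result is a cheap way to check the option handling before rebuilding the image.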
