Kedro deployment to databricks - kedro

Maybe I misunderstand the purpose of packaging, but it doesn't seem too helpful in creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the kedro project reproducible.
I understand that I can use docker or airflow plugins for deployment, but what about deploying to databricks? Do you have any advice here?
I was thinking about making a wheel that could be installed on the cluster but I would need to package the conf first. Another option is to just sync a git workspace to the cluster and run kedro via a notebook.
Any thoughts on a best practice?

If you are not using docker and just want to deploy kedro directly on a databricks cluster, this is how we have been doing it.
CI/CD pipeline builds using kedro package. Creates a wheel file.
Upload dist and conf to dbfs or AzureBlob file copy (if using Azure Databricks)
This will upload everything to databricks on every git push
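The upload step above could be sketched roughly as follows. This is only an illustration, not the pipeline we actually run: the function name stage_build and its arguments are made up, and it assumes the target folder is reachable as an ordinary filesystem path (for example through the /dbfs mount when run on the cluster). From an external CI runner you would use whatever copy tool your platform provides instead.

```python
import shutil
from pathlib import Path


def stage_build(dist_dir, conf_dir, target_root, branch):
    """Copy the packaged wheel(s) and the conf folder into a per-branch folder.

    All names here are hypothetical; adjust the paths to your own project.
    """
    dest = Path(target_root) / branch
    # Start from a clean folder so files from an earlier push do not linger.
    if dest.exists():
        shutil.rmtree(dest)
    (dest / "dist").mkdir(parents=True)
    # Copy every wheel produced by the packaging step.
    for wheel in sorted(Path(dist_dir).glob("*.whl")):
        shutil.copy2(wheel, dest / "dist" / wheel.name)
    # Copy the whole configuration tree next to the wheel.
    shutil.copytree(Path(conf_dir), dest / "conf")
    return dest
```

With target_root set to something like /dbfs/project_name/build/cicd, the result matches the project_path used in the init script.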
Then you can have a notebook with the following:
You can have an init script in databricks something like:
# cargoai is the Kedro project package in this example
from cargoai import run
from cargoai.pipeline import create_pipeline
# dbutils is provided by the Databricks notebook runtime
branch = dbutils.widgets.get("branch")
conf = run.get_config(
    project_path=f"/dbfs/project_name/build/cicd/{branch}"
)
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()
Here conf, catalog, and pipeline will be available
Call this init script when you want to run a branch or a master branch in production like: %run "/Projects/InitialSetup/load_pipeline" $branch="master"
For development and testing, you can run specific nodes: pipeline = pipeline.only_nodes_with_tags(*tags)
Then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog)
In production, this notebook can be scheduled by databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run this.

I found the best option was to use another tool for packaging, deploying, and running the job. Using mlflow with kedro seems like a good fit. I do almost everything in Kedro but use MLflow for the packaging and job execution (see https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5). The MLproject file looks like this:
name: My Project
conda_env: conda.yaml
entry_points:
  main:
    command: "kedro install && kedro run"
Then running it with:
mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
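If you trigger this from a script rather than typing it by hand, the command line can be assembled programmatically. The sketch below only builds the argument list using the same flags as the command above; the function name mlflow_run_command is made up for illustration, and actually launching it (for example with subprocess.run) is left to the caller.

```python
from typing import Dict, List


def mlflow_run_command(project_dir: str, cluster_spec: str,
                       experiment: str, params: Dict[str, str]) -> List[str]:
    """Build the argument list for an `mlflow run` on the databricks backend."""
    cmd = ["mlflow", "run", "-b", "databricks", "-c", cluster_spec]
    # Each project parameter becomes its own -P key=value pair.
    for key, value in params.items():
        cmd += ["-P", f"{key}={value}"]
    cmd += ["--experiment-name", experiment, project_dir]
    return cmd
```

Passing a list to subprocess.run avoids shell quoting problems if a parameter value contains spaces.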

So there is a section of the documentation that deals with Databricks:
https://kedro.readthedocs.io/en/latest/04_user_guide/12_working_with_databricks.html
The easiest way to get started will probably be to sync with git and run via a Databricks notebook. However, as mentioned, there are other ways using the ".whl" and referencing the "conf" folder.

Related

Windows: How to migrate project databases from Docker Hyper-V backend to WSL2 backend

I have projects on Windows, but when I upgraded Docker to work with WSL 2 I had to run ddev commands from the WSL console, and the db containers have empty databases.
One way to migrate the dbs is to dump from the old container and then import into the new container. But is there a way to do this automatically for all projects, or at least project by project?
Start the project in the Hyper-V Docker environment with ddev start. Once the project is running, there are two ways to move the database: taking a snapshot, or exporting it in SQL format, which is more portable (in case you want to set up the project somewhere other than ddev).
To take a snapshot, use the ddev snapshot command; it creates a db snapshot under the .ddev/db_snapshots folder. Copy it from there into the same location, .ddev/db_snapshots, in the WSL2 project dir. After that, run ddev restore-snapshot [snapshot name]. For more, see the docs: https://ddev.readthedocs.io/en/latest/users/cli-usage/#snapshotting-and-restoring-a-database
The other method is to use ddev export-db from the old project dir and then ddev import-db in the new project dir under WSL2. Export command docs: https://ddev.readthedocs.io/en/latest/users/cli-usage/#exporting-a-database Import command docs: https://ddev.readthedocs.io/en/latest/users/cli-usage/#importing-a-database
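For the "all projects at once" part of the question, the snapshot copy step could be scripted. The sketch below is an assumption-laden illustration: it supposes all old projects live under one folder, that the matching WSL2 projects already exist under another folder with the same names, and that snapshots were already taken with ddev snapshot. The function name copy_snapshots is made up.

```python
import shutil
from pathlib import Path
from typing import List


def copy_snapshots(old_root, new_root) -> List[str]:
    """Copy .ddev/db_snapshots from each old project to its WSL2 counterpart.

    Returns the names of the projects whose snapshots were copied.
    """
    copied = []
    # A folder counts as a ddev project if it contains .ddev/config.yaml.
    for config in sorted(Path(old_root).glob("*/.ddev/config.yaml")):
        project = config.parent.parent
        src = project / ".ddev" / "db_snapshots"
        dst = Path(new_root) / project.name / ".ddev" / "db_snapshots"
        # Skip projects without snapshots or without a WSL2 counterpart.
        if not src.is_dir() or not dst.parent.is_dir():
            continue
        shutil.copytree(src, dst, dirs_exist_ok=True)  # needs Python 3.8+
        copied.append(project.name)
    return copied
```

After the copy you would still run ddev restore-snapshot [snapshot name] in each new project dir.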

How to deploy web application to AWS instance from GitLab repository

Right now, I deploy my (Spring Boot) application to an EC2 instance like this:
Build JAR file on local machine
Deploy/Upload JAR via scp command (Ubuntu) from my local machine
I would like to automate that process, but:
without using Jenkins + Rundeck CI/CD tools
without using AWS CodeDeploy service since that does not support GitLab
Question: Is it possible to perform these 2 simple steps (which are now done manually: building and deploying via scp) with GitLab CI/CD tools, and if so, can you present simple steps to do it?
Thanks!
You need to create a .gitlab-ci.yml file in your repository with CI jobs defined to do the two tasks you've defined.
Here's an example to get you started.
stages:
  - build
  - deploy
build:
  stage: build
  image: gradle:jdk
  script:
    - gradle build
  artifacts:
    paths:
      - my_app.jar
deploy:
  stage: deploy
  image: ubuntu:latest
  script:
    - apt-get update
    - apt-get -y install openssh-client
    - scp my_app.jar target.server:/my_app.jar
In this example, the build job runs a gradle container and uses gradle to build the app. GitLab CI artifacts are used to capture the built jar (my_app.jar), which will be passed on to the deploy job.
The deploy job runs an ubuntu container, installs openssh-client (for scp), then executes scp to copy my_app.jar (passed from the build job) to the target server.
You have to fill in the actual details of building and copying your app. For secrets like SSH keys, set project level CI/CD variables that will be passed in to your CI jobs.
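As one way to handle the SSH key, a small helper can write the key from a CI/CD variable to a file that scp can use with -i. This is a sketch under assumptions: the variable name SSH_PRIVATE_KEY and the function name write_key_from_env are hypothetical, and the job image would need Python available.

```python
import os
from pathlib import Path


def write_key_from_env(var_name: str, dest) -> Path:
    """Write a private key held in an environment variable to dest (mode 0600)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"CI/CD variable {var_name} is not set")
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the file readable by the owner only; ssh rejects looser permissions.
    fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Key files are expected to end with a newline.
        handle.write(key if key.endswith("\n") else key + "\n")
    # Enforce the mode even if the file already existed with other permissions.
    os.chmod(dest, 0o600)
    return dest
```

The deploy job could call this before the scp step, for example write_key_from_env("SSH_PRIVATE_KEY", "deploy_key"), and then pass -i deploy_key to scp.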
Create a shell file with the following contents.
#!/bin/bash
# Copy JAR file to EC2 via SCP with PEM in home directory (usually /home/ec2-user)
scp -i user_key.pem file.txt ec2-user@my.ec2.id.amazonaws.com:/home/ec2-user
# SSH to the EC2 instance
ssh -T -i "bastion_keypair.pem" ec2-user@y.ec2.id.amazonaws.com /bin/bash <<-'END2'
# The following commands will be executed automatically by bash.
# Consider this a remote shell script.
killall java
java -jar ~/myJar.jar server ~/config.yml &>/dev/null &
echo 'done'
# Once completed, the shell will exit.
END2
In 2020, this should be easier with GitLab 13.0 (May 2020), using an older feature Auto DevOps (introduced in GitLab 11.0, June 2018)
Auto DevOps provides pre-defined CI/CD configuration allowing you to automatically detect, build, test, deploy, and monitor your applications.
Leveraging CI/CD best practices and tools, Auto DevOps aims to simplify the setup and execution of a mature and modern software development lifecycle.
But now (May 2020):
Auto Deploy to ECS
Until now, there hasn’t been a simple way to deploy to Amazon Web Services. As a result, Gitlab users had to spend a lot of time figuring out their own configuration.
In Gitlab 13.0, Auto DevOps has been extended to support deployment to AWS!
Gitlab users who are deploying to AWS Elastic Container Service (ECS) can now take advantage of Auto DevOps, even if they are not using Kubernetes. Auto DevOps simplifies and accelerates delivery and cloud deployment with a complete delivery pipeline out of the box. Simply commit code and Gitlab does the rest! With the elimination of the complexities, teams can focus on the innovative aspects of software creation!
In order to enable this workflow, users need to:
define AWS typed environment variables: ‘AWS_ACCESS_KEY_ID’ ‘AWS_ACCOUNT_ID’ and ‘AWS_REGION’, and
enable Auto DevOps.
Then, your ECS deployment will be automatically built for you with a complete, automatic, delivery pipeline.
See documentation and issue

Build docker image without docker installed

Is it somehow possible to build images without having docker installed? On the maven build of my project I'd like to produce a docker image, but I don't want to force others to install docker on their machines.
I can think of some virtual box image with docker installed, but it is kind of a heavy solution. Is there some way to build the image with some maven plugin only, some Go code or an already prepared virtual box image for exactly this purpose?
It boils down to the question of how to use docker without forcing users to install anything, either just for the build or even for running docker images.
UPDATE
There are some, not really up to date, maven plugins for virtual machine provisioning with vagrant or with vbox. I have found an article about building docker images without docker, using Bazel.
So far I see two options: either I can somehow build the images only, or run some VM with a docker daemon inside (which can be used not only for builds, but even for integration tests).
We can create a Docker image without Docker being installed.
Jib Maven and Gradle Plugins
Google has an open source tool called Jib that is relatively new, but quite interesting for a number of reasons. Probably the most interesting thing is that you don’t need docker to run it - it builds the image using the same standard output as you get from docker build but doesn’t use docker unless you ask it to - so it works in environments where docker is not installed (not uncommon in build servers). You also don’t need a Dockerfile (it would be ignored anyway), or anything in your pom.xml to get an image built in Maven (Gradle would require you to at least install the plugin in build.gradle).
Another interesting feature of Jib is that it is opinionated about layers, and it optimizes them in a slightly different way than the multi-layer Dockerfile created above. Just like in the fat jar, Jib separates local application resources from dependencies, but it goes a step further and also puts snapshot dependencies into a separate layer, since they are more likely to change. There are configuration options for customizing the layout further.
Please refer to this link: https://cloud.google.com/blog/products/gcp/introducing-jib-build-java-docker-images-better
For an example with Spring Boot, refer to https://spring.io/blog/2018/11/08/spring-boot-in-a-container
Have a look at the following tools:
Fabric8-maven-plugin - http://maven.fabric8.io/ - good maven integration, uses a remote docker (openshift) cluster for the builds.
Buildah - https://github.com/containers/buildah - builds without a docker daemon but does have other pre-requisites.
Fabric8-maven-plugin
The fabric8-maven-plugin brings your Java applications on to Kubernetes and OpenShift. It provides a tight integration into Maven and benefits from the build configuration already provided. This plugin focuses on two tasks: building Docker images and creating Kubernetes and OpenShift resource descriptors.
fabric8-maven-plugin seems particularly appropriate if you have a Kubernetes / OpenShift cluster available. It uses the OpenShift APIs to build and optionally deploy an image directly to your cluster.
I was able to build and deploy their zero-config spring-boot example extremely quickly. No Dockerfile is necessary: just write your application code and it takes care of all the boilerplate.
Assuming you have the basic setup to connect to OpenShift from your desktop already, it will package up the project .jar in a container and start it on Openshift. The minimum maven configuration is to add the plugin to your pom.xml build/plugins section:
<plugin>
<groupId>io.fabric8</groupId>
<artifactId>fabric8-maven-plugin</artifactId>
<version>3.5.41</version>
</plugin>
then build+deploy using
$ mvn fabric8:deploy
If you require more control and prefer to manage your own Dockerfile, it can handle this too; this is shown in samples/secret-config.
Buildah
Buildah is a tool that facilitates building Open Container Initiative (OCI) container images. The package provides a command line tool that can be used to:
create a working container, either from scratch or using an image as a starting point
create an image, either from a working container or via the instructions in a Dockerfile
images can be built in either the OCI image format or the traditional upstream docker image format
mount a working container's root filesystem for manipulation
unmount a working container's root filesystem
use the updated contents of a container's root filesystem as a filesystem layer to create a new image
delete a working container or an image
rename a local container
I don't want to force others to install docker on their machines.
If by "without Docker installed" you mean without having to install Docker locally on every machine running the build, you can leverage the Docker Engine API, which allows you to call a Docker daemon on a remote host.
The Docker Engine API is a RESTful API accessed by an HTTP client such as wget or curl, or the HTTP library which is part of most modern programming languages.
For example, the Fabric8 Docker Maven Plugin does just that using the DOCKER_HOST parameter. You'll need a recent Docker version and you'll have to configure at least one Docker daemon properly so it can securely accept remote requests (there are a lot of resources on this subject, such as the official doc, here or here). From then on, your Docker build can be done remotely without having to install Docker locally.
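To make the "RESTful API over HTTP" point concrete, here is a minimal sketch of turning a DOCKER_HOST value into a URL you could request with any HTTP client. The function name engine_api_url is made up; /_ping and /version are Engine API endpoints, and 2375/2376 are the conventional plain and TLS ports. A real TLS setup also needs client certificates, which this sketch does not handle.

```python
from urllib.parse import urlsplit


def engine_api_url(docker_host: str, path: str = "/_ping", tls: bool = False) -> str:
    """Translate a tcp:// DOCKER_HOST value into an HTTP(S) URL for the Engine API."""
    parts = urlsplit(docker_host)
    if parts.scheme != "tcp" or not parts.hostname:
        # unix:// sockets and other schemes are out of scope for this sketch.
        raise ValueError(f"expected a tcp://host[:port] value, got {docker_host!r}")
    scheme = "https" if tls else "http"
    # Fall back to the conventional daemon ports when none is given.
    port = parts.port or (2376 if tls else 2375)
    return f"{scheme}://{parts.hostname}:{port}{path}"
```

For instance, engine_api_url(os.environ["DOCKER_HOST"], "/version") gives a URL that can be fetched with curl or urllib to check that the remote daemon is reachable before starting a build.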
Google has released Kaniko for this purpose. It should be run as a container, whether in Kubernetes, Docker or gVisor.
I was running into the same problems and did not find any solution, so I developed odagrun, a runner for GitLab with an integrated registry API that can update DockerHub, Microbadger, etc.
It is open source and has an MIT license.
It is ideal for creating a docker image on the fly, without the need for a docker daemon, a root account, or any image at all (image: scratch will do). It is currently still in development, but I use it every day.
Requirements
project repository on Gitlab
an openshift cluster (an openshift-online-starter will do for most medium/small
Extract showing how the docker image for this project was created:
# create and push image to ImageStream:
build_rootfs:
  image: centos
  stage: build-image
  dependencies:
    - build
  before_script:
    - mkdir -pv rootfs
    - cp -v output/oc-* rootfs/
    - mkdir -pv rootfs/etc/pki/tls/certs
    - mkdir -pv rootfs/bin-runner
    - cp -v /etc/pki/tls/certs/ca-bundle.crt rootfs/etc/pki/tls/certs/ca-bundle.crt
    - chmod -Rv 777 rootfs
  tags:
    - oc-runner-shared
  script:
    - registry_push --rootfs --name=test-$CI_PIPELINE_ID --ISR --config

Simple docker deployment tactics

Hey guys, so I've spent the past few days really digging into Docker and I've learned a ton. I'm getting to the point where I'd like to deploy to a DigitalOcean droplet, but I'm starting to wonder about the strategy of building/deploying an image.
I have a perfect Dev setup where I've created a file volume tied to my app.
docker run -d -p 80:3000 --name pug_web -v $DIR/app:/Development test_web
I'd hate to have to run the app in production out of the /Development folder, where I'm actually building the app. This is a nodejs/express app and I'd love to concat/minify/etc. into a local dist folder and add that build folder to a new dist-ready image.
I guess what I'm asking is: A) can I have different Dockerfiles, one for Dev and one for Dist? If not, B) can I have if statements in my Dockerfiles that would do something like "if ENV == 'dist' add /dist", etc.?
I'm struggling to figure out how to move this from a local Dev environment to a tightened-up, production-ready image without any conditionals.
I do both.
My Dockerfile checks out the code for the application from Git. During development I mount a volume over the top of this folder with the version of the code I'm working on. When I'm ready to deploy to production, I just check into Git and re-build the image.
I also have a script that is executed from the ENTRYPOINT command. The script looks at the environment variable "ENV" and if it is set to "DEV" it will start my development server with debugging turned on, otherwise it will launch the production version of the server.
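The ENV switch could look something like the sketch below. My actual script is not shown here, so treat this as an illustration only: the commands (node with the inspector enabled versus a plain node start) and the file name server.js are placeholders, and it assumes Python is available in the image. A shell script would work just as well.

```python
import os
from typing import List, Mapping


def select_command(env: Mapping[str, str]) -> List[str]:
    """Pick the server command based on the ENV environment variable."""
    if env.get("ENV", "").upper() == "DEV":
        # Hypothetical development start: enable the Node.js inspector.
        return ["node", "--inspect=0.0.0.0:9229", "server.js"]
    # Anything else (including an unset ENV) gets the production start.
    return ["node", "server.js"]


def main() -> None:
    cmd = select_command(os.environ)
    # Replace this process with the server so it receives container signals directly.
    os.execvp(cmd[0], cmd)
```

The real entrypoint script would call main() at the bottom, and the container is then started with -e ENV=DEV for development or without it for production.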
Alternatively, you can avoid using Docker in development, and instead have a Dockerfile at the root of your repo. You can then use your CI server to build the image (in our case Jenkins, but Dockerhub also offers automated build repositories that can do that for you if you're a small team or don't have access to a dedicated build server).
Then you can just pull the image and run it on your production box.

Source code changes in kubernetes, SaltStack & Docker in production and locally

This is an abstract question and I hope that I am able to describe it clearly.
Basically: what is the workflow for distributing source code to Kubernetes running in production? As you don't run Docker with -v in production, how do you update running pods?
In production:
Do you use SaltStack to update each container in each pod?
Or
Do you rebuild Docker images and restart every pod?
Locally:
With Vagrant you can share a local folder for source code. With Docker you can use -v, but if you have Kubernetes running locally, how would you mirror production as closely as possible?
If you use Vagrant with boot2docker, how can you combine this with Docker -v?
The short answer is that you shouldn't "distribute source code"; you should rather "build and deploy". In terms of Docker and Kubernetes, you would build the container image, upload it to the registry, and then perform a rolling update with Kubernetes.
It would probably help to take a look at the specific example script, but the gist is in the usage summary in the current Kubernetes CLI:
kubecfg [OPTIONS] [-u <time>] [-image <image>] rollingupdate <controller>
If you intend to try things out in development and are looking for instant code updates, I'm not sure Kubernetes helps much there. It's been designed for production systems, and shadow deploys are not the kind of thing one does sanely.
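To illustrate what "rolling update" means here, the toy simulation below replaces pods one at a time and reports the state after each step. It is a conceptual sketch only, not how Kubernetes implements it: the function name rolling_update is made up, and a real update also waits between steps and checks that each new pod is healthy.

```python
from typing import Iterator, List


def rolling_update(pods: List[str], new_image: str) -> Iterator[List[str]]:
    """Yield the image running in each pod after every single replacement."""
    current = list(pods)
    for index in range(len(current)):
        # Swap exactly one pod per step so the rest keep serving traffic.
        current[index] = new_image
        yield list(current)
```

Iterating over rolling_update(["app:v1"] * 3, "app:v2") shows the old and new images running side by side until the last pod has been replaced.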

Resources