Spring-XD: Deployment of modules to certain containers

Three questions regarding deployment of modules to Spring XD containers:
For certain sources and sinks it's necessary to say to which container a module should be deployed. Let's say we have a lot of containers on different machines, and we want to establish a stream reading a log file from one machine. The source module of type tail has to be deployed to the container running on the machine with the log file. How can you do that?
You may want to restrict the execution of modules to a group of containers. Let's say we have some powerful machines for our batch processing with containers on it, and we have other machines where our container runs parallel to some other processes only for ingesting data (log files etc.). Is that possible?
If we have a custom module, is it possible to add the module xml and the jars just to certain containers, so that those modules are just executed there? Or is it necessary that we have the same module definitions on all containers?
Thanks!

You bring up excellent points. We have been doing some design work around these issues, in particular #1 and #2, and will have some functionality here in our next milestone release in about one month's time.
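For reference, the container-targeting support that eventually shipped in Spring XD uses a deployment manifest with SpEL criteria evaluated against container attributes; the stream name, group name, and property below are illustrative only:

```
xd:> stream deploy --name logStream --properties "module.tail.criteria=groups.contains('log-machines')"
```

Containers are started with matching attributes (e.g. a groups option), and the criteria expression restricts which containers the module may be deployed to.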
In terms of #3, the model for resolving the jars loaded in the containers requires the local file system or a shared file system to resolve the classpath. This has also come up in our prototypes of running Spring XD on the Cloud Foundry PaaS, and we want to provide a more dynamic, at-runtime ability to locate and load new modules. No estimate on when that will be addressed.
Thanks for the questions!
Cheers,
Mark

Related

Any performance difference between loading an ML module (XQuery/JavaScript) from a physical disk location and loading it from a modules database inside ML?

By default, an ML HTTP server will use a modules database inside ML.
(All ML training materials seem to refer to that type of configuration.)
Any changes to the XQuery programs need to be uploaded into the modules database first. That can be accomplished with the mlLoadModules or mlReloadModules ml-gradle tasks.
CI/CD does not access the ML cluster directly; everything goes through ml-gradle from a machine dedicated to code deployment to the different ML environments (dev/UAT/prod, etc.).
However, it is also possible to configure the ML app server to load the XQuery programs from a physical disk location.
With that configuration, the programs do not need to be reloaded into the ML modules database.
The changed programs have to be on the ML server itself, so CI/CD needs to access the ML cluster directly. One advantage of this approach is that developers can easily see whether their changes have actually been deployed, since all changes sit as readable text files on disk.
Questions:
Which way is better? Why?
Any ML query performance difference between these two approaches?
For the physical-file approach, does it mean that CI/CD will need to deploy the program changes to all the ML hosts in the cluster? (I guess this is not a concern if the HTTP server loads XQuery programs from a modules database inside ML; the cluster will automatically sync the code among the different hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment simpler: you only have to load a module once into the modules database, rather than putting the file on every single host in the cluster.
With a modules database, if you add nodes to the cluster, you don't have to deploy the modules to them as well. You can also take advantage of high availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might prefer the filesystem for simple development on a single node, where changes saved to the filesystem are available without redeploying. However, the ml-gradle mlWatch task can auto-deploy modules as they are modified on the filesystem, achieving effectively the same thing with a modules database.
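The ml-gradle tasks mentioned above are run from the project root, for example:

```shell
./gradlew mlLoadModules    # load new or changed modules into the modules database
./gradlew mlWatch          # watch the filesystem and auto-deploy modules as they change
```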

hazelcast-jet deployment and data ingestion

I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model which can process the metadata periodically published by each node (CPU usage, memory usage, I/O, etc.). My system only cares about the latest data, and it is OK to miss a couple of data points when the processing model is down. So I picked Hazelcast Jet, an in-memory processing engine with great performance. Here I have a couple of questions regarding the model:
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
How to ingest data from thousands of sources? The sources push data instead of being pulled.
How to configure the client so that it knows where to submit the tasks?
It would be super useful if there is a comprehensive example where I can learn from.
What is the best way to deploy hazelcast-jet to multiple ec2 instances?
Download and unzip the Hazelcast Jet distribution on each machine:
$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1
Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:
$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar
Edit bin/common.sh to add the module to the classpath. Towards the end of the file is a line
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
You can duplicate this line and replace -jet-3.1 with -aws-2.4.
Edit config/hazelcast.xml to enable AWS cluster discovery; the details are in the hazelcast-aws documentation. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best-practices guide for AWS deployment.
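The join section of config/hazelcast.xml ends up looking roughly like this; the role, region, and security-group names are placeholders:

```xml
<!-- Sketch only: replace the role/region/group values with your own -->
<network>
  <join>
    <multicast enabled="false"/>
    <tcp-ip enabled="false"/>
    <aws enabled="true">
      <iam-role>jet-cluster-role</iam-role>
      <region>us-east-1</region>
      <security-group-name>jet-sg</security-group-name>
    </aws>
  </join>
</network>
```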
Start the cluster with jet-start.sh.
How to configure the client so that it knows where to submit the tasks?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");
However, depending on your AWS setup, these addresses may not be stable, so you can configure the client to discover them as well; this is covered in the Hazelcast client documentation.
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast IMap and use a mapJournal source to get the update events from it.
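A minimal Jet 3.1 pipeline along those lines might look like the following sketch. It assumes the event journal is enabled for the "metrics" map in the member configuration, and the map name and logger sink are placeholders:

```java
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.JournalInitialPosition;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class MetricsJob {
    public static void main(String[] args) {
        // Stream update events from the "metrics" IMap; START_FROM_CURRENT
        // skips history, matching the "only latest data matters" requirement.
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.mapJournal("metrics", JournalInitialPosition.START_FROM_CURRENT))
         .withoutTimestamps()
         .drainTo(Sinks.logger());   // replace with a real sink

        // Connect to the cluster (see the client configuration above).
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getGroupConfig().setName("jet");
        JetInstance jet = Jet.newJetClient(clientConfig);
        jet.newJob(p).join();
    }
}
```

The nodes then push their metrics into the "metrics" map via a Hazelcast client, and the journal source delivers each update to the pipeline.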

How do I run a cron inside a kubernetes pod/container which has a running spring-boot application?

I have a spring-boot application running in a container. One of the APIs is a file-upload API, and every time a file is uploaded, it has to be scanned for viruses. We have uvscan to scan the uploaded file. I'm looking at adding uvscan to the base image, but the virus definitions need to be updated on a daily basis. I've created a script to update the virus definitions. The simplest way currently is to run a cron inside the container which invokes the script. Is there an alternative to this? Can the uvscan utility be isolated from the app pod and invoked from the application?
There are many ways to solve this problem; I hope I can help you find what suits you best.
From my perspective, it would be pretty convenient to have a CronJob that builds and pushes a new Docker image with uvscan and the updated virus-definition database on a daily basis.
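A skeleton of such a CronJob might look like this; the image name, schedule, and script path are all assumptions:

```yaml
# Sketch only: runs the (hypothetical) daily rebuild/update script
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: av-definitions-update
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: update
            image: registry.example.com/uvscan:latest
            command: ["/opt/uvscan/update-definitions.sh"]
```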
In your file-processing sequence, you can create a scan Job using the Kubernetes API and give it access to a shared volume containing the file you need to scan.
The scan Job will use the :latest image, so when a new image appears in the registry, it will be pulled and a pod created from it.
The downside is that building images daily consumes disk space, so you may need a process for removing old images from the registry and from the Docker cache on each node of the Kubernetes cluster.
Alternatively, you can put the AV database on a shared volume, or use Mount Propagation, and update it independently of the pods. If uvscan opens the AV database in read-only mode, this should be possible.
On the other hand, it usually takes time to load the virus definitions into memory, so it might be better to run the virus scan as a Deployment rather than as a Job, with a daily restart after a new image is pushed to the registry.
At my place of work, we also run our dockerized services on EC2 instances. If you only need to update the definitions once a day, I would recommend an AWS Lambda function. It's relatively affordable and you don't need to worry about the overhead of a scheduler, etc. If you need help setting up the Lambda, I could always provide more context. Nevertheless, I'm only offering another solution for you in the AWS realm of things.
So basically I simply added a cron to the application running inside the container to update the virus definitions.
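The in-container cron can be as simple as one crontab entry; the script path, schedule, and log location below are assumptions:

```
# Update the virus definitions daily at 02:00 (paths are placeholders)
0 2 * * * /opt/uvscan/update-definitions.sh >> /var/log/uvscan-update.log 2>&1
```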

Jenkins for multiple deployment on multiple server

I am new to Jenkins and know how to create Jobs and add servers for JAR deployment.
I need to create a deployment job in Jenkins which takes a JAR file and deploys it to 50-100 servers.
These servers are grouped into 6 categories; a different process runs on each server, but the same JAR is used.
Please suggest the best approach for creating a job to do this.
As of now there are only a few servers (6-7); I have added each one to Jenkins and use command execution over SSH to run the processes. But that approach won't scale to 50 servers.
Jenkins is a great tool for managing builds and dependencies, but it is not a great tool for Configuration Management. If you're deploying to more than 2 targets (and especially if different targets have different configurations), I would highly recommend investing the time to learn a configuration management tool.
I can personally recommend Puppet and Ansible. In particular, Ansible works over an SSH connection to the target (which it sounds like you have) and requires only a base Python install.
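As a sketch of what that looks like in Ansible, each of the six server categories could become an inventory group; the group name, paths, and service name below are all assumptions:

```yaml
# deploy.yml - copy the same JAR everywhere, then restart the service
- hosts: app_servers
  become: true
  tasks:
    - name: Copy the application JAR
      copy:
        src: build/libs/app.jar
        dest: /opt/app/app.jar
    - name: Restart the application service
      service:
        name: app
        state: restarted
```

Run it with `ansible-playbook -i inventory deploy.yml`, optionally restricting to one category with `--limit`.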

Celery, Resque, or custom solution for processing jobs on machines in my cloud?

My company has thousands of server instances running application code - some instances run databases, others are serving web apps, still others run APIs or Hadoop jobs. All servers run Linux.
In this cloud, developers typically want to do one of two things to an instance:
Upgrade the version of the application running on that instance. Typically this involves a) tagging the code in the relevant subversion repository, b) building an RPM from that tag, and c) installing that RPM on the relevant application server. Note that this operation would touch four instances: the SVN server, the build host (where the build occurs), the YUM host (where the RPM is stored), and the instance running the application.
Today, a rollout of a new application version might be to 500 instances.
Run an arbitrary script on the instance. The script can be written in any language, provided the interpreter exists on that instance. E.g., the UI developer wants to run his "check_memory.php" script, which does x, y, z, on the 10 UI instances and then restarts the webserver if certain conditions are met.
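The upgrade flow in the first item could be scripted roughly like this; the repository URL, package name, and host are placeholders, and publishing the RPM to the YUM host is elided:

```
svn copy "$REPO/trunk" "$REPO/tags/app-1.2.3" -m "Tag release 1.2.3"  # tag on the SVN server
rpmbuild -bb app.spec                                                 # build the RPM on the build host
ssh apphost01 'sudo yum install -y app'                               # install on the application instance
```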
What tools should I look at to help build this system? I've seen Celery, Resque, and delayed_job, but they seem built for churning through a large number of tasks. This system is under much less load: maybe a thousand upgrade jobs might run on a big day, plus a couple hundred executions of arbitrary scripts. Also, they don't support tasks written in arbitrary languages.
How should the central "job processor" communicate with the instances? SSH, message queues (which one?), something else?
Thank you for your help.
NOTE: this cloud is proprietary, so EC2 tools are not an option.
I can think of two approaches:
Set up password-less SSH on the servers, keep a file that contains the list of all machines in the cluster, and run your scripts directly over SSH, for example: ssh user@foo.com "ls -la". This is the same approach used by Hadoop's cluster startup and shutdown scripts. If you want to assign tasks dynamically, you can pick nodes at random.
Use something like Torque or Sun Grid Engine to manage your cluster.
The package installation can be wrapped inside a script, so you just need to solve the second problem, and use that solution to solve the first one :)
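The SSH fan-out in the first approach can be sketched like this; the host list and the echo stand-in are assumptions, and you would set RUNNER=ssh against real machines:

```shell
# Fan a command out to every host in parallel, then wait for all of them.
HOSTS="node1 node2 node3"     # normally read from the cluster host file
RUNNER=${RUNNER:-echo}        # echo is a dry-run stand-in; use RUNNER=ssh for real
for host in $HOSTS; do
  "$RUNNER" "$host" "ls -la" &   # one background connection per host
done
wait                             # block until every remote command finishes
```

The same loop can feed a script to each host with `ssh "$host" 'bash -s' < script.sh` instead of a single command.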
