Remote kernel not found while scheduling a notebook in Vertex AI - google-cloud-vertex-ai

I am new to the cloud and its components. I am trying to schedule a notebook in Vertex AI with a remote Dataproc kernel. When the scheduler invokes the notebook I get the error jupyter_client.kernelspec.NoSuchKernel: No such kernel named remote-54f5982ba157bXXXXXXXXXX-python3. When I run it manually it works fine. Any idea why this is happening?

If you are using the Executor, you need to set the jobType to DATAPROC and define the dataprocParameters accordingly.
https://cloud.google.com/vertex-ai/docs/workbench/reference/rest/v1/ExecutionTemplate#JobType
When you use a Dataproc kernel, you are accessing it remotely via a "mixer service" that combines both local kernels and remote Dataproc kernels. When you execute the notebook via the Executor, the job runs directly on Dataproc, so you need to specify that.
Currently the Executor extension does not support this workflow.
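For reference, a minimal sketch of the ExecutionTemplate body you would pass to the Notebooks API executions.create call, written as a Python dict; the bucket, project, region and cluster names are placeholders, and the field names follow the ExecutionTemplate reference linked above:
execution_template = {
    "inputNotebookFile": "gs://my-bucket/notebooks/my-notebook.ipynb",
    "outputNotebookFolder": "gs://my-bucket/notebook-output",
    "jobType": "DATAPROC",
    "dataprocParameters": {
        # Fully qualified name of the Dataproc cluster that hosts the remote kernel.
        "cluster": "projects/my-project/regions/us-central1/clusters/my-cluster",
    },
}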

Related

How to use AWS sbatch (SLURM) inside docker on an EC2 instance?

I am trying to get OpenFOAM to run on an AWS EC2 cluster using AWS parallelCluster.
One possibility is to compile OpenFOAM. Another is to use a docker container. I am trying to get the second option to work.
However, I am running into trouble understanding how I should orchestrate the various operations. Basically, what I need is:
copy an OpenFOAM case from S3 to FSx file system on the master node
run the docker container containing OpenFOAM
Perform OpenFOAM operations, some of them using the cluster (running the computation in parallel being the most important one)
I want to put all of this into scripts to make it reproducible. But I am wondering how should I structure the scripts together to have SLURM handle the parallel side of things.
My problem at the moment is that the master node shell knows commands such as sbatch, but when I launch Docker to access the OpenFOAM commands, it "forgets" the sbatch commands.
How could I easily export all SLURM-related commands (sbatch, ...) to Docker? Is this the correct way to handle the problem?
Thanks for the support
For the first option there is a workshop that walks you through it:
cfd-on-pcluster.
For the second option, I created a container workshop that uses HPC container runtimes: containers-on-pcluster.
I incorporated a section about GROMACS but I am happy to add OpenFOAM as well. I am using Spack to create the container images. While I only documented single-node runs, we can certainly add multi-node runs.
Running Docker via sbatch is not going to get you very far, because Docker is not a user-land runtime. For more info: FOSDEM21 Talk about Containers in HPC
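As a rough sketch of what that pattern can look like (image path, case path and node counts are all hypothetical): submit the job with sbatch from the head node, where the SLURM commands live, and run the solver inside an HPC container runtime such as Singularity/Apptainer rather than Docker.
import subprocess
from textwrap import dedent

# Hypothetical batch script: sbatch runs on the head node, the container
# runtime runs inside the job instead of Docker.
job_script = dedent("""\
    #!/bin/bash
    #SBATCH --job-name=openfoam-case
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=36
    cd /fsx/cases/my-openfoam-case
    srun singularity exec /fsx/images/openfoam.sif simpleFoam -parallel
""")

with open("run_case.sbatch", "w") as f:
    f.write(job_script)

# Submit from the head node shell, where sbatch is available.
subprocess.run(["sbatch", "run_case.sbatch"], check=True)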
Cheers
Christian (full disclosure: AWS Developer Advocate HPC/Batch)

Using the dask labextension to connect to a remote cluster

I'm interested in running a Dask cluster on EMR and interacting with it from inside of a Jupyter Lab notebook running on a separate EC2 instance (e.g. an EC2 instance not within the cluster and not managed by EMR).
The Dask documentation points to dask-labextension as the tool of choice for this use case. dask-labextension relies on a YAML config file (and/or some environment vars) to understand how to talk to the cluster. However, as far as I can tell, this configuration can only be set to point to a local Dask cluster. In other words, you must be in a Jupyter Lab notebook running on an instance within the cluster (presumably on the master instance?) in order to use this extension.
Is my read correct? Is it not currently possible to use dask-labextension with an external Dask cluster?
Dask Labextension can talk to any Dask cluster that is visible from where your web client is running. If you can connect to a dashboard in a web browser then you can copy that same address to the Dask-Labextension search bar and it will connect.
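On the notebook side this is just an ordinary distributed Client pointed at the remote scheduler; a minimal sketch, where the scheduler address is a placeholder and 8786/8787 are the default scheduler/dashboard ports:
from dask.distributed import Client

# Hypothetical address of the scheduler running on the EMR cluster.
client = Client("tcp://10.0.0.12:8786")

# The dashboard link (port 8787 by default) is what you paste into the
# dask-labextension search bar.
print(client.dashboard_link)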

Shell scripts scheduler

Basically, I need to run a set of custom shell scripts on EC2 instances to provision some software. Is there any workflow manager like Oozie or Airflow with API access that can schedule this? I am asking for alternatives to Oozie and Airflow because those are Hadoop-environment schedulers and my environment is not. I can ensure SSH access from the source machine that will run the workflow manager to the EC2 instance where I want to install the software. Are there any such open-source workflow schedulers?
I would recommend using Cadence Workflow for your use case. There are multiple provisioning solutions built on top of it, for example the Banzai Cloud Pipeline Platform.
See the presentation that goes over the Cadence programming model.
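Whichever orchestrator you choose, the body of each workflow activity/task usually reduces to running a script on the EC2 instance over SSH; a minimal sketch with paramiko, where the host, key and script path are hypothetical:
import paramiko

def run_provisioning_script(host, script_path):
    # In Cadence this would live inside an activity; credentials are placeholders.
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="ec2-user", key_filename="/path/to/key.pem")
    _, stdout, stderr = ssh.exec_command("bash " + script_path)
    exit_code = stdout.channel.recv_exit_status()  # wait for the script to finish
    output, errors = stdout.read(), stderr.read()
    ssh.close()
    if exit_code != 0:
        raise RuntimeError("provisioning failed: %s" % errors)
    return output

run_provisioning_script("ec2-12-34-56-78.compute-1.amazonaws.com", "/opt/provision.sh")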

Mesosphere cluster with nodes of different operating systems

I want to set up a Mesosphere cluster (Mesos, DC/OS, Marathon) for running different jobs. The nodes these jobs run on depend on the nature of the job. For example, a job with C# code will run on a Windows node; a job with pure C++ will run on Ubuntu or FreeBSD, and so on. Each of these can again be a cluster, i.e. I want to have, let's say, 2 Windows nodes and 4 Ubuntu nodes. So I would like to know:
Can this be achieved in a single deployment? Or do I need to set up different clusters for each environment I want, one for Windows, one for Ubuntu, etc.?
Regardless of whether it is a single hybrid cluster or multiple environments, does Mesos provide granularity in what the nodes send back? I.e. I don't want to see only high-level status like job failed or running; my jobs write stats to a file on the system and I want to relay this back to the "main UI" or the layer that is managing all of this.
Can this be achieved in a single deployment?
If you want to use DC/OS, it currently officially supports only CentOS/RHEL; for Ubuntu you need at least 16.04, which uses systemd rather than Ubuntu's old upstart. But as far as I know, Windows is not supported in DC/OS.
So for your scenario, you would have to use Mesos directly instead of DC/OS; then, within one cluster, you can run different Mesos agents on Ubuntu and Windows. You can add a role or attribute when an agent registers with the Mesos master, so the framework can distinguish agents and dispatch each job to the proper one.
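For example, if each agent is started with an attribute such as --attributes=os:windows or --attributes=os:ubuntu, your framework's scheduler can inspect offers and only launch on matching agents. A rough sketch of such a method inside the Scheduler class, using the old Python bindings (the "os" attribute name is an assumption):
def resourceOffers(self, driver, offers):
    for offer in offers:
        # Collect the text attributes the agent registered with.
        attrs = dict((a.name, a.text.value) for a in offer.attributes)
        if attrs.get("os") != "windows":
            # Not the kind of node this job needs; give the offer back.
            driver.declineOffer(offer.id)
            continue
        # ...build a TaskInfo for the C# job here and call
        # driver.launchTasks(offer.id, [task]).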
BTW, for Windows you need at least Mesos 1.3.0, which supports Windows, and you have to build it on Windows yourself using Microsoft Visual Studio.
Does Mesos provide granularity in what the nodes send back?
Yes, but you cannot use the default command executor; you need to write your own executor.
In your executor, you can set the value you want to send back:
# Inside your custom executor (e.g. in its launchTask callback); mesos_pb2
# comes from the Mesos Python bindings, task and driver are provided by the
# executor API.
update = mesos_pb2.TaskStatus()
update.task_id.value = task.task_id.value
update.state = mesos_pb2.TASK_RUNNING
update.data = 'data with a \0 byte'  # arbitrary payload, e.g. your stats
driver.sendStatusUpdate(update)
In your framework, you can receive it as follows:
def statusUpdate(self, driver, update):
    # update.data carries whatever payload the executor attached above.
    slave_id, executor_id = self.taskData[update.task_id.value]
Here is an example I found on GitHub which may help you with how to send your own data back.

Have To Manually Start Hadoop Cluster on GCloud

I have been using a Hadoop cluster, created using Google's script, for a few months.
Every time I boot the machines I have to manually start Hadoop using:
sudo su hadoop
cd /home/hadoop/hadoop-install/sbin
./start-all.sh
Besides scripting, how can I resolve this?
Or is this just the way it is by default?
(The first boot after cluster creation always starts Hadoop automatically; why not every boot?)
You have to configure this using init.d.
The document provides more details and a sample script for Datameer. You need to follow similar steps. The script should be smart enough to check that all the nodes in the cluster are up before invoking the start script over SSH.
While different third-party scripts and "getting started" solutions like Cloud Launcher have varying degrees of support for automatic restart of Hadoop on boot, the officially supported tools are bdutil as a do-it-yourself deployment tool, and Google Cloud Dataproc as a managed service, both of which are already configured with init.d and/or systemd to automatically start Hadoop on boot.
More detailed instructions on using bdutil here.
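With the Dataproc route you never manage the Hadoop daemons yourself; a minimal sketch of creating a cluster with the google-cloud-dataproc Python client, where the project, region and machine types are placeholders:
from google.cloud import dataproc_v1

region = "us-central1"

# The client must point at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "%s-dataproc.googleapis.com:443" % region}
)

cluster = {
    "project_id": "my-project",
    "cluster_name": "my-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# Hadoop/YARN come up automatically on boot; no start-all.sh needed.
operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)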
