How can I set up RStudio and sparklyr on an auto-scaling cluster managed by Slurm? - rstudio

I have an AWS HPC auto-scaling cluster managed by Slurm, and I can submit jobs using sbatch. However, I want to use sparklyr on this cluster so that Slurm increases the cluster size based on the workload of the sparklyr code in the R script. Is this possible?

Hi Amir, is there a reason you are using Slurm here? sparklyr has better integration with Apache Spark, and it would be advisable to run it over a Spark cluster. You can follow this blog post for the steps to set this up with Amazon EMR, a managed service for running Spark clusters on AWS - https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
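For reference, once RStudio Server is installed on the EMR master node as the blog post describes, the sparklyr side is just a connection to YARN. A minimal sketch (the spark_home path and the small mtcars demo are assumptions for illustration, not taken from the blog post):

library(sparklyr)
library(dplyr)

# Connect to the EMR cluster through YARN
# (EMR typically installs Spark under /usr/lib/spark; adjust if yours differs.
#  Older sparklyr versions use master = "yarn-client".)
sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark")

# Copy a small local data frame into Spark and run a quick summary to verify the connection
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

spark_disconnect(sc)

EMR can also auto-scale the cluster based on YARN workload metrics, which gives you the grow-with-the-workload behaviour you were hoping to get from Slurm.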

Related

Dask on Hadoop Kubernetes

I've installed Hadoop via a Helm chart on my microk8s Kubernetes cluster.
I would like to know how to create a Dask cluster across my different machines on this Hadoop cluster. I tried following the tutorials on the Dask website, but I keep getting errors because it is looking for a local YARN/Hadoop installation. How do I point Dask at the Hadoop running on Kubernetes so I can create the cluster?
If you want to launch Dask on YARN, we recommend using https://yarn.dask.org
However, if you are already using Kubernetes, you might consider https://kubernetes.dask.org, which is more commonly used today.

Multi-node Hadoop in Kubernetes

I have already installed minikube, the single-node Kubernetes cluster. I just want some help on how to deploy a multi-node Hadoop cluster inside this Kubernetes node. I need a starting point, please!
For clarification, do you want Hadoop to leverage Kubernetes components to run jobs, or do you just want it to run as a Kubernetes pod?
Unfortunately I could not find an example of Hadoop built as a Kubernetes scheduler. You can probably still run it similarly to the Spark example.
Update: Spark now ships with better integration for Kubernetes. Information can be found here.

Making a Spark environment for a cluster

I made a Spark application that analyzes file data. Since the input file data could be big, it's not enough to run my application on a single machine. With one more physical machine, how should I architect this?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing file data)?
Spark supports several cluster managers: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work with your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts, which launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master:
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. On each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation.
Hope I have managed to help! :)
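Following on from the steps above, and since the top of this page is about sparklyr: connecting to such a standalone cluster from R is just a matter of pointing sparklyr at the spark://HOST:PORT URL printed by the master. A rough sketch (the paths are placeholders, assuming Spark and sparklyr are installed on the machine running R):

library(sparklyr)

sc <- spark_connect(
  master     = "spark://HOST:PORT",   # the URL printed by start-master.sh
  spark_home = "/opt/spark"           # placeholder: wherever Spark is unpacked
)

# ... run your Spark jobs through sc ...

spark_disconnect(sc)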

Is the Spark Standalone scheduler or the YARN scheduler better for a Cloudera 5.4 Hadoop cluster?

With regard to being able to run machine learning jobs with Spark, which is the better choice: the YARN scheduler or the Spark Standalone scheduler?
There is no difference when it comes to running the actual Spark job.
YARN/Mesos helps you schedule resources if you have different Spark applications and/or other components running in your cluster (which support YARN/Mesos, of course).
The Spark Standalone cluster cannot manage resources across applications. That is, if you start a Spark application and it uses all the resources, a second application will not find any resources left. That means you have to handle this yourself (e.g. by adapting the Spark config accordingly, as in the sketch below).
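As an illustration of "adapting the Spark config accordingly": in standalone mode you can cap what each application is allowed to take, e.g. with spark.cores.max and spark.executor.memory. A sketch from R with sparklyr (the property names are standard Spark settings; the values and the master URL are placeholders):

library(sparklyr)

config <- spark_config()
config$spark.cores.max       <- 4      # cap the total cores this application may claim
config$spark.executor.memory <- "2g"   # memory per executor

sc <- spark_connect(master = "spark://HOST:PORT", config = config)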

Running Mahout using Hadoop on Amazon's EMR/EC2

I want to migrate my current local Hadoop cluster to Amazon. In this Hadoop cluster I am using services like Mahout, HBase and Hive. I now have two options on Amazon: either go for pure EC2 instances or an Elastic MapReduce cluster. I would like some suggestions on which is the better option for moving a cluster with these kinds of requirements.
I always suggest people go for EMR, as it is managed. It will be a bit more costly than using pure EC2, but the headache and time you would spend configuring the clusters and then managing them can be saved by running a managed service like EMR.
Mahout can easily be run as a custom JAR.
A Hive cluster can also be launched within minutes.
Similarly for HBase, Amazon has recently added the ability to create an HBase cluster on EMR.
See other views here.
