Does Cloudera run on a cluster of machines with Hadoop?

I have a question about using Cloudera to run Hadoop MapReduce.
Does Cloudera support running an application over a cluster of machines?
What does Cloudera actually offer when I use it with a virtual machine to run a Hadoop MapReduce application?
If Cloudera doesn't support running over a cluster of machines, how can I run a Hadoop application on a cluster?
I'm really confused about this :(
Thanks in advance.

Related

YARN: Specify which application runs on which NodeManager

I have a Hadoop YARN cluster with one ResourceManager and 6 NodeManagers. I want to run both Flink and Spark applications on the cluster, so I have two main questions about YARN:
In the case of Spark, should I install and configure Spark on the ResourceManager and on each NodeManager? When I submit a Spark application to YARN, does a Spark cluster (master and slaves) need to be running in addition to the YARN ResourceManager and NodeManagers?
Can I configure YARN so that Flink runs only on certain NodeManagers?
Thanks
For the first question, that depends on whether you're using a packaged Hadoop distribution (such as Cloudera CDH or Hortonworks HDP) or not. The distros will likely take care of this. If you're not using a distribution, you need to decide whether you want to run Spark on YARN or Spark standalone.
For the second question, you can target specific NodeManagers if you are using the Capacity Scheduler with the node labels feature enabled, which requires Hadoop 2.6 or higher.
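As a rough illustration of the node labels setup mentioned above, the sketch below renders the kind of properties involved as Hadoop-style XML. The property names come from the YARN node labels feature; the "flink" queue, the "flink" label, the store path and the helper name render_hadoop_conf are assumptions made up for this example, and the scheduler part is only a fragment (queue capacities and other settings are left out).

```python
from xml.sax.saxutils import escape


def render_hadoop_conf(props):
    """Render a dict of properties as a Hadoop-style <configuration> XML string."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# yarn-site.xml: switch node labels on. The store path is a placeholder.
yarn_site = {
    "yarn.node-labels.enabled": "true",
    "yarn.node-labels.fs-store.root-dir": "hdfs://namenode.example.com:8020/yarn/node-labels",
}

# capacity-scheduler.xml fragment: a hypothetical "flink" queue that may use
# a hypothetical "flink" label and uses it by default.
capacity_scheduler = {
    "yarn.scheduler.capacity.root.queues": "default,flink",
    "yarn.scheduler.capacity.root.flink.accessible-node-labels": "flink",
    "yarn.scheduler.capacity.root.flink.accessible-node-labels.flink.capacity": "100",
    "yarn.scheduler.capacity.root.flink.default-node-label-expression": "flink",
}

print(render_hadoop_conf(yarn_site))
print(render_hadoop_conf(capacity_scheduler))
```

The label itself still has to be created and attached to the chosen hosts, which is normally done with yarn rmadmin -addToClusterNodeLabels and yarn rmadmin -replaceLabelsOnNode. Applications submitted to that queue would then land on the labelled NodeManagers.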

How to set up Spark on multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster. I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slave nodes as well?
You don't have to. You're done. You have already done more than was needed to submit Spark applications to Hadoop YARN (which I assume is your cluster manager).
Spark is a library for distributed computations on massive datasets and as such it belongs solely to your Spark applications (not any cluster you may use).
Time to spark-submit Spark applications!
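A minimal sketch of what such a submission could look like, written as a small Python helper that only assembles the spark-submit command line. The flags --master, --deploy-mode and --conf are standard spark-submit options; the helper name spark_submit_command, the application file my_app.py and the input path are hypothetical.

```python
import shlex


def spark_submit_command(app, app_args=(), deploy_mode="cluster", conf=None):
    """Build a spark-submit argument list that targets YARN."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    # Sort the settings so the generated command is stable between runs.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd.extend(app_args)
    return cmd


# Hypothetical application and input path, for illustration only.
cmd = spark_submit_command(
    "my_app.py",
    ["hdfs:///data/input"],
    conf={"spark.executor.memory": "2g"},
)
print(shlex.join(cmd))
```

The machine you submit from needs HADOOP_CONF_DIR or YARN_CONF_DIR pointing at the cluster's client configuration so that spark-submit can find the ResourceManager. Nothing Spark-specific has to be running on the worker nodes beforehand.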

Installation of Oozie on a separate machine from Hadoop

Very new to Oozie, hence please excuse me if I sound like a newbie.
I have a Hadoop cluster which is up and running. I want to install Oozie on a separate machine from Hadoop. Is this possible? I ask because every installation guide I have seen says to install Hadoop on the same machine, so I am not sure whether it is technically possible to have Hadoop and Oozie on separate machines.
Thanks in advance
The Oozie server serves clients' requests. It's a web application that uses an embedded Tomcat, so it can be installed on any machine from which Hadoop is reachable; it's not tied to the Hadoop machines themselves. You can specify Hadoop's nameNode and jobTracker in the workflow properties so Oozie will know where to send its jobs.
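To make the workflow properties concrete, here is a small sketch that writes out a job.properties body with the nameNode and jobTracker entries described above. The host names, ports and application path are placeholders, the helper name render_job_properties is made up for this example, and a real workflow will usually need more properties than these three.

```python
def render_job_properties(name_node, job_tracker, app_path):
    """Return the text of a minimal Oozie job.properties file."""
    props = {
        "nameNode": name_node,
        "jobTracker": job_tracker,
        # Location of the workflow application on HDFS.
        "oozie.wf.application.path": app_path,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Placeholder addresses: point these at your own cluster.
text = render_job_properties(
    name_node="hdfs://namenode.example.com:8020",
    job_tracker="jobtracker.example.com:8021",
    app_path="${nameNode}/user/${user.name}/my-workflow",
)
print(text)
```

The file would then be passed to the Oozie client with something like oozie job -oozie http://oozie-host.example.com:11000/oozie -config job.properties -run, where the Oozie URL is again a placeholder.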

Is the Spark standalone scheduler or the YARN scheduler better for a Cloudera 5.4 Hadoop cluster?

For running machine learning jobs with Spark, which is the better choice: the YARN scheduler or the Spark standalone scheduler?
There is no difference when it comes to running the actual Spark job.
YARN/Mesos helps you schedule resources if you have several Spark applications and/or other components running in your cluster (components which support YARN/Mesos, of course).
The Spark standalone cluster manager does very little resource management between applications. By default, a Spark application takes all the available cores, so a second application will find no resources left. That means you have to divide the resources yourself (e.g. by adapting the Spark config accordingly).
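As a sketch of "adapting the Spark config" on a standalone cluster, the snippet below splits a fixed number of cores evenly between the applications you expect to run at once and returns the settings each one could use. spark.cores.max and spark.executor.memory are real Spark properties; the even split, the helper name standalone_conf and the numbers are assumptions chosen for illustration.

```python
def standalone_conf(total_cores, num_apps, executor_memory):
    """Return per-application Spark settings for an even split of cores."""
    if num_apps <= 0:
        raise ValueError("num_apps must be positive")
    share = total_cores // num_apps
    if share == 0:
        raise ValueError("not enough cores for that many applications")
    return {
        # Upper bound on the cores one application may take from the cluster.
        "spark.cores.max": str(share),
        "spark.executor.memory": executor_memory,
    }


# Hypothetical cluster: 16 cores shared by 2 concurrent applications.
conf = standalone_conf(total_cores=16, num_apps=2, executor_memory="4g")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Each application would pass these values at submission time. With YARN or Mesos this manual bookkeeping is handled by the cluster manager's own queues and limits.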

How to administer Hadoop Cluster

I have a running 4-node Hadoop cluster and I am looking for a way to administer it remotely,
for example
administering the cluster from my laptop for
executing MapReduce tasks
disabling or enabling data nodes
Is there any way to do that remotely?
If you're using the Cloudera distribution, the Cloudera Manager webapp would let you do that.
Other distributions may have similar control apps. That would give you per-node control.
For executing MR tasks, you would normally submit the job from an external node anyway, pointing it to the correct JobTracker and NameNode. So I'm not sure what else you're asking for there.
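One way to point a job at a remote cluster, sketched here as a Python helper that only builds the command line, is to use Hadoop's generic options -fs and -jt. This assumes the job's main class parses generic options (for example by running through ToolRunner); the jar, class name, hosts and ports are placeholders, and hadoop_jar_command is a made-up helper name.

```python
import shlex


def hadoop_jar_command(jar, main_class, name_node, job_tracker, job_args=()):
    """Build a 'hadoop jar' argument list aimed at a remote cluster."""
    return [
        "hadoop", "jar", jar, main_class,
        "-fs", name_node,    # NameNode URI to use as the default file system
        "-jt", job_tracker,  # JobTracker address as host:port
        *job_args,
    ]


# Placeholder jar, class and addresses, for illustration only.
cmd = hadoop_jar_command(
    "wordcount.jar",
    "com.example.WordCount",
    "hdfs://namenode.example.com:8020",
    "jobtracker.example.com:8021",
    ["/input", "/output"],
)
print(shlex.join(cmd))
```

The more common setup is to copy the cluster's client configuration files to the laptop, in which case the two options can be dropped.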
