Does Mahout need to be installed on the Hadoop master node? - hadoop

That's a dumb question, but somebody has to ask it.
I've tried running Mahout locally, which worked. Now I want the work to be performed by a remote cluster, not my local machine.
So, should I deploy the Mahout code on the Hadoop machines, or can I still have Mahout on my local machine interface remotely with Hadoop?

No, you don't install Hadoop programs on the Hadoop workers yourself. That would be a nightmare to maintain. Hadoop does it for you when you provide it the JAR file with all the code via hadoop jar.
What runs on your local machine, when you run Mahout or anything else Hadoop-based, is a client program that uses Hadoop code to send work to a cluster. That cluster might be local or remote -- it makes no difference to how you run the client, just what the client talks to.
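As an illustration, a minimal sketch of how that looks from the client side (the conf directory, driver class and input/output paths are placeholders, not taken from the question): point the client at the remote cluster's configuration and submit the job JAR.
# Use a local copy of the remote cluster's *-site.xml files (core-site.xml, mapred-site.xml / yarn-site.xml)
export HADOOP_CONF_DIR=/path/to/remote-cluster-conf
# Hadoop ships the JAR, and the code in it, to the cluster for you
hadoop jar mahout-examples-job.jar org.apache.mahout.SomeDriver -i input -o output
The same command works whether the configuration points at a local pseudo-distributed setup or a remote cluster.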

Related

Jenkins as JobServer on Hadoop EdgeNode

I'm not sure whether anyone can help me, but I'll give it a try.
I'm running Jenkins on an OpenShift cluster to use it for deployment and as a job server for running ETL jobs. These jobs transfer data from flat files to databases and from database to database.
Now I should expand the system to transfer data to a Hadoop cluster using MapR.
What I would like to know is how I can use a new Jenkins slave as a job server on an edge node of the Hadoop cluster using MapR. Do I need Jenkins on the edge node, or am I able to use MapR from my existing Jenkins job server?
Maybe someone is able to help me or has some information/links on how to solve this.
Thx to all....
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then Jenkins would be able to run sh("hadoop jar app.jar") in a pipeline step, for example.
If you're using OpenShift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else.
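As a rough sketch of that edge-node setup on the Jenkins slave (the host name and paths are assumptions, and the exact client package depends on whether you install the Apache Hadoop or the MapR client):
# Copy the cluster configuration from an existing edge node
scp edge-node:/etc/hadoop/conf/*.xml /etc/hadoop/conf/
# Sanity check that the slave can reach the cluster
hadoop fs -ls /
# Then a Jenkins job can simply run the application
hadoop jar app.jar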

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input data could be big, running the application on a single machine isn't enough. With one more physical machine, how should I architect this?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing the file data)?
Spark supports several cluster modes: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work with your cluster's own file system (without needing HDFS).
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
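Once the master and at least one worker are up, an application can be submitted against the cluster; a minimal sketch (the application class and JAR name are placeholders):
./bin/spark-submit --master spark://HOST:PORT --class com.example.MyApp my-app.jar /path/to/input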
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Installation of Oozie on a separate machine from Hadoop

Very new to Oozie, hence please excuse me if I sound like a newbie.
I have a Hadoop cluster which is up and running. I want to install Oozie, and I want it on a separate machine from Hadoop. Is this possible? The reason for asking is that every installation guide I have seen asks to install Hadoop on the same machine, hence I am not sure if it's technically possible to have Hadoop on a separate machine from Oozie.
Thanks in advance
The Oozie server serves client requests; it's a web application which uses embedded Tomcat. It can be installed on any machine from which Hadoop is reachable; it's not tied to Hadoop itself. You can specify Hadoop's nameNode and jobTracker in the workflow properties, so Oozie will know where to send its jobs.
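For illustration, a minimal sketch of such properties (host names and ports are placeholders that depend on your cluster; on YARN the jobTracker value is the ResourceManager address):
# job.properties used when submitting a workflow to the Oozie server
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
oozie.wf.application.path=${nameNode}/user/me/my-workflow
The Oozie client can then submit from anywhere with something like oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run.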

Hortonworks Sandboxes in a cluster

I'm new to the Hadoop ecosystem and I'm trying to understand how a cluster works. Until now, I've been using the Hortonworks distribution to test everything in single-node mode. Now I'm wondering whether it's possible to connect two VMs (running on one physical PC) so that one will be the NameNode and the other a DataNode (I'm not sure if they should be separated). I found a similar tutorial for Cloudera, so I guess it's possible in theory.
If it's not even a good idea to run two Hadoop VMs on one PC, then what is the most painless way to configure and run it on two separate PCs?
Maybe this post will be useful: "Setting up a Hadoop cluster"
http://gbif.blogspot.ru/2011/01/setting-up-hadoop-cluster-part-1-manual.html
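In essence, wiring the second VM in as a DataNode comes down to pointing both machines at the same NameNode address and starting the right daemon on each. A rough sketch, with the host name as a placeholder and assuming a Hadoop 2.x install (a full setup also needs hdfs-site.xml, networking and the slaves file):
# On both VMs: set fs.defaultFS in core-site.xml to hdfs://namenode-vm:8020
# On the NameNode VM
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
# On the DataNode VM
sbin/hadoop-daemon.sh start datanode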

Moving files to Hadoop HDFS using SFTP

I have a VPC subnet which has multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS file system using SFTP.
Does Hadoop have any APIs that can achieve this goal?
PS : I've installed Hadoop using Cloudera CDH4 distribution.
This is a requirement which is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it. It comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI Kettle.
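If a full workflow tool is more than you need, the straightforward approach is a two-step copy, sketched below (assuming SSH access from the HDFS machine to the source machine; the host and paths are placeholders):
# Pull the files from the source machine over SSH (scp shown; sftp works the same way)
scp user@source-host:/data/files/* /tmp/staging/
# Push them into HDFS with the standard client
hdfs dfs -put /tmp/staging/* /user/me/incoming/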
