Jenkins as JobServer on Hadoop EdgeNode

I'm not sure whether anyone can help me, but I'll give it a try.
I'm running Jenkins on an OpenShift cluster and use it for deployment and as a job server for running ETL jobs. These jobs transfer data from flat files to databases and from database to database.
Now I should expand the system to transfer data to a Hadoop cluster running MapR.
What I would like to know is how I can use a new Jenkins slave as a job server on an edge node of the MapR Hadoop cluster. Do I need Jenkins on the edge node itself, or can I use MapR from my existing Jenkins job server?
Maybe someone is able to help me or has some information/links on how to solve this.
Thanks to all.

"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then, Jenkins would be able to run sh("hadoop jar app.jar"), for example
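A rough sketch of that agent setup, assuming a plain Apache Hadoop client tarball (MapR ships its own client packages, which would replace the download step); the version, paths and host name below are illustrative:

# Install a Hadoop client on the Jenkins slave (MapR client packages would go here instead)
curl -LO https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar -xzf hadoop-2.7.7.tar.gz -C /opt

# Copy the client configuration from an existing edge node (host and paths are examples)
scp edge-node:/etc/hadoop/conf/*-site.xml /opt/hadoop-2.7.7/etc/hadoop/

# Make the client visible to Jenkins jobs
export HADOOP_HOME=/opt/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin

# A Jenkins job can then talk to the cluster, e.g.
hadoop fs -ls /
hadoop jar app.jar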
If you're using OpenShift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else.

Related

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input data could be large, it's not enough to run my application standalone. With one more physical machine, how should I architect it?
I'm considering Mesos as the cluster manager but am pretty new to HDFS. Is there any way to do it without HDFS (for sharing file data)?
Spark supports a couple of cluster modes: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that a worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI shows more info about the cluster and its workers.
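Once the workers show up in the UI, a minimal example of submitting an application to the standalone cluster (the main class, jar path and resource settings here are placeholders):

./bin/spark-submit \
  --master spark://HOST:PORT \
  --class com.example.MyApp \
  --executor-memory 2G \
  --total-executor-cores 4 \
  /path/to/my-app.jar /path/to/input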
There are many more parameters to play with. For more info, please refer to this documentation.
Hope I have managed to help! :)

Apache Kylin installation without Sandbox

I was wondering if there are any resources regarding Apache Kylin installation without any sandbox (like Cloudera or Hortonworks) support. I have managed to do the following:
Install Hadoop 2.6
Install Hive
Install HBase
Then I used the binary from the Kylin site and have so far been able to run it. The problem starts when I try to build a cube: the MapReduce job gets stuck at step 2. I suspect it is still assuming sandbox mode and not submitting the job to Hadoop at all (there is no entry in the Hadoop JobTracker).
So I need a solution for two things:
1. A possible configuration of Kylin in a pure Hadoop setup (no sandbox)
2. Some way to enable the Kylin setup to submit jobs to Hadoop.
There is no such sandbox or non-sandbox configuration in Kylin. Just make sure the machine where Kylin runs has Hadoop set up correctly and you should be fine.
Under the hood, kylin.sh uses hbase classpath and hive -e set | grep 'env:CLASSPATH' to detect the Hadoop settings. Double check that these commands work as expected if you are not sure which cluster Kylin connects to.
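A quick sanity check is to run the same commands kylin.sh relies on, on the machine where Kylin is installed:

# Should print a classpath containing your cluster's HBase and Hadoop jars
hbase classpath

# Should print the CLASSPATH environment variable as Hive sees it
hive -e set | grep 'env:CLASSPATH'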
If Kylin has problems submitting MR jobs, check two places. First, the Hadoop ResourceManager: see whether the job has really been submitted or not; sometimes it's just running slowly. Second, check kylin.log for any exceptions. Post the log to the Kylin dev mailing list and someone will be able to help.
You can install hadoop-2.6, hive-0.14, and hbase-0.98.8-hadoop2 with the built-in ZooKeeper or an external zookeeper-3.5.
Then you can run kylin-v1.1-release on it.
If you still face issues, paste the log here.

Cascading for Impatient TFIDF example freezing

I'm trying to work with Cascading to create and execute complex data processing workflows on a local Hadoop cluster.
I wish to create a TF-IDF vector so I can apply machine learning algorithms such as Naive Bayes to it using the Apache Spark framework.
The problem is that after I create the jar and launch it using the following commands, the program freezes. Here is the log file.
You can find the sources here. The related source code is in part6.
Thanks!
I have found the problem. The nodes of the cluster were unhealthy, but the log doesn't show that, and Cascading freezes because its task stays UNASSIGNED.
So to solve the problem you have to fix the nodes' health; in my case I just had to correct the hadoop-yarn containers directory and its local NameNode directory.
You might run into other errors, so I suggest checking your Hadoop log files and the admin web UI for the Hadoop nodes.
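For reference, a couple of standard YARN commands I'd use to spot unhealthy nodes (the node id is a placeholder; the same health report also appears in the ResourceManager web UI, usually on port 8088):

# List all NodeManagers with their state (RUNNING, UNHEALTHY, ...)
yarn node -list -all

# Show details, including the health report, for one node
yarn node -status <node-id>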

Hadoop installation on Amazon cloud

I am new to Hadoop and would like to go into the Hadoop administration line, so I studied the basics of Hadoop, installed it successfully in pseudo-distributed mode, and ran some basic examples. Now I need to improve further, so I want to learn Hadoop installation and configuration on a real cluster and decided to go for an Amazon micro instance. Can anyone please tell me how to install and configure Hadoop on the Amazon cloud?
Thanks in advance.
I have tried this personally, and you will not really be able to use Hadoop on a single micro instance due to memory restrictions. IMHO you should at least try a medium instance to run Hadoop, or better yet use their elastic-mapreduce API, which is a modified version of Hadoop. You can run a 3-node cluster for around $0.25 an hour. If you really want to learn big data, this is the way I went.
You should check out their documentation here
http://aws.amazon.com/documentation/elasticmapreduce/
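For example, a hedged sketch of launching a small cluster with the AWS CLI (the release label, instance type, key name and node count below are placeholders; check current pricing for your region):

aws emr create-cluster \
  --name "hadoop-learning-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop \
  --instance-type m4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles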

Does Mahout need to be installed on the Hadoop's master node?

That's a dumb question, but somebody has to ask it.
I've tried running Mahout locally, which worked. Now I want the work to be performed by a remote cluster, not my local machine.
So, should I deploy the Mahout code on the Hadoop machines, or can I keep Mahout on my local machine and have it interface remotely with Hadoop?
No, you don't install Hadoop programs on the Hadoop workers yourself. That would be a nightmare to maintain. Hadoop does it for you when you provide the JAR file with all the code via hadoop jar.
What runs on your local machine, when you run Mahout or anything else Hadoop-based, is a client program that uses Hadoop code to send info to a cluster to start work. That cluster might be local, or remote -- makes no difference to how you run the client, just what the client talks to.
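As a hedged illustration of that client setup, running a Mahout example job from the local machine against a remote cluster; the config directory, jar version and example class are illustrative:

# Point the local Hadoop client at the remote cluster's configuration files
export HADOOP_CONF_DIR=/path/to/remote-cluster-conf

# Submit the self-contained Mahout job jar; Hadoop ships the code to the workers
hadoop jar mahout-examples-0.9-job.jar \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job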
