Configuring AWS cluster using automation script - hadoop

We are looking for the possibility of an automation script which we can give how many master and data nodes we need and it would configure a cluster. Probably giving the credentials in a properties file.
Currently our approach is to login to the console and configure the Hadoop cluster. It would be great if there could be an automated way around it.

I've seen this done very nicely using Foreman, Chef, and Ambari Blueprints. Foreman was used to provision the VMs, Chef scripts were used to install Ambari, configure the Ambari blueprint, and to create the cluster using the Blueprint.

Related

Shell scripts scheduler

Basically, I need to run a set of custom shell scripts on ec2 instances to provision some software. Is there any workflow manager like oozie or airflow with api access to schedule the same. I am asking for alternatives like oozie and airflow, as those are that of hadoop environment schedulers and my environment is not. I can ensure that there can be ssh access from the source machine that will run the workflow manager and the ec2 instance where want to install the software. Is there any such open source workflow schedulers?
I would recommend using Cadence Workflow for your use case. There are multiple provisioning solutions built on top of it. For example Banzai Cloud Pipeline Platform
See the presentation that goes over Cadence programming model.

Jenkins as JobServer on Hadoop EdgeNode

I´m not sure that someone can help me but I´ll take a try.
I´m running Jenkins on an Openshift-Cluster to use it for Deployment and as a jobserver for running ETL-Jobs. These jobs are transferring data from flatfiles to databases and from db to db.
Now, I should expand the system to transfer data to a hadoop cluster using MapR.
What I would like to know is, how can I use a new Jenkins-Slave as a jobserver on an EdgeNode from the hadoop-cluster using MapR. Do I need the Jenkins on the EdgeNode or am I able to use MapR from my existing Jenkins-Jobserver?
Mabye, someone is able to help me or has some informations/links how to solve it.
Thx to all....
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then, Jenkins would be able to run sh("hadoop jar app.jar"), for example
If you're using Openshift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else

Fabric scripts to install Hadoop on a cluster of machines?

I am beginning to install hadoop on a cluster. I have ssh access to these machines and I have already installed fabric on them. I was wondering if someone has already written a fabfile to install and deploy hadoop to a cluster easily.
I found this project [0]; but this is written for deploying over AWS instances. I was looking for something where I can just fill in the IPs of my machines and then execute a set of fab commands to bring up the cluster.
[0] http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#ec2-deployment-with-fabric-script
I'm AlexJF, the author of the scripts you linked.
The scripts you reference can also be used outside EC2. You just need to configure, as you requested, the list of hosts and configurations on the top of the fabfile.py. Be sure to set EC2 = False (which just happens to be the default).
You'll then have several useful commands available to you.

Have To Manually Start Hadoop Cluster on GCloud

I have been using a Hadoop cluster, created using Google's script, for a few months.
Every time I boot the machines I have to manually start Hadoop using:
sudo su hadoop
cd /home/hadoop/hadoop-install/sbin
./start-all.sh
Besides scripting, how can I resolve this?
Or is this just the way it is by default?
(The first boot after cluster creation always starts Hadoop automatically, why not always?)
You have to configure using init.d.
Document provide more details and sample script for datameer. You need to follow similar steps. Script should be smart enough to check all the nodes in the cluster are up before invoking this script using ssh.
While different third-party scripts and "getting started" solutions like Cloud Launcher have varying degrees of support for automatic restart of Hadoop on boot, the officially supported tools are bdutil as a do-it-yourself deployment tool, and Google Cloud Dataproc as a managed service, both of which are already configured with init.d and/or systemd to automatically start Hadoop on boot.
More detailed instructions on using bdutil here.

How to dockerize individual apps inside Hadoop in multi tenant environment?

I have seen HortonWorks put the full Hadoop inside a docker that allows to install Hadoop in different environments. But how about the individual apps inside Hadoop that run on YARN? Especially in a multi-tenant environment, this would be useful.
Appreciate any thoughts on how to achieve this.

Resources