How To Sync the contents to all nodes - hadoop

I was following this tutorial trying to install and configure Spark on my cluster..
My cluster (5 nodes) is hosted on AWS and installed from Cloudera Manager.
It is mentioned in the tutorial that "Sync the contents of /etc/spark/conf to all nodes." after the modification of the configuration file.
I am really wondering what is the easies way to make that happen. I read a post that has a similar question like mine HERE. Based on my understanding, for the configuration files of hadoop, hdfs ...etc. which are monitored by zookeeper or cloudera manager. That might be the case to use CM deploy or zookeeper to make it happen.
However, Spark's configuration file is totally outside zookeeper's scope. How can I "sync" to other nodes..
Many thanks!

Why don't you mount the same EBS via NFS to /etc/spark/conf or one of it's parents so the files are automatically synced?

Related

Jenkins as JobServer on Hadoop EdgeNode

I´m not sure that someone can help me but I´ll take a try.
I´m running Jenkins on an Openshift-Cluster to use it for Deployment and as a jobserver for running ETL-Jobs. These jobs are transferring data from flatfiles to databases and from db to db.
Now, I should expand the system to transfer data to a hadoop cluster using MapR.
What I would like to know is, how can I use a new Jenkins-Slave as a jobserver on an EdgeNode from the hadoop-cluster using MapR. Do I need the Jenkins on the EdgeNode or am I able to use MapR from my existing Jenkins-Jobserver?
Mabye, someone is able to help me or has some informations/links how to solve it.
Thx to all....
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then, Jenkins would be able to run sh("hadoop jar app.jar"), for example
If you're using Openshift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded hadoop distribution from Apache and got it up and running quite fast; download the hadoop tar ball, untar it at a location and some configuration setting. The thing here is that I am able to see the various configuration files like: yarn-site.xml, hdfs-site.xml etc; and I know the hadoop home location.
Next, I installed hadoop (HDP) Using Ambari.
Here comes the confusion part. It seems Ambarin installs the hdp in /usr/hdp; however the directory structure in plain vanilla hadoop vs Ambari is totally different. I am not able to locate the configuration files e.g. yarn-site.xml etc.
So can anyone help me demystify this?
All the configuration changes must be done via the Ambari UI. There is no use for the configuration files since Ambari persists the configurations in Ambari Database.
If you still need them, they are under /etc/hadoop/conf/.
It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in Ambari UI and not directly on disk?
Every time a service is restarted and it has a stale configuration the ambari-agent is responsible for writing the latest configuration to disk. They are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk they would get overwritten by the aforementioned process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
Basically the benefit of using Ambari UI in Cluster Hadoop deployment. It will give you central management point.
For example:
10 pcs Hadoop cluster setup.
Plain vanilla Hadoop:
If you change any configuration you must be changed in 10 pcs
Ambari UI :
Due to configuration store in db. you just change in management portal all changes effect reflected on all node by single point change.

Make spark environment for cluster

I made a spark application that analyze file data. Since input file data size could be big, It's not enough to run my application as standalone. With one more physical machine, how should I make architecture for it?
I'm considering using mesos for cluster manager but pretty noobie at hdfs. Is there any way to make it without hdfs (for sharing file data)?
Spark maintain couple cluster modes. Yarn, Mesos and Standalone. You may start with the Standalone mode which means you work on your cluster file-system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Locating Cloudera Manager HDFS config files

I've installed a cluster via Cloudera Manager, and now I need to launch the cluster manually.
I've been using the following command:
$ sudo -u hdfs hadoop namenode / datanode / jobtracker
But then the dfs.name.dir is set up /tmp. I can't seem to find where cloudera manager has the HDFS config files. The ones in /usr/lib/hadoop-02*/conf seem to be minimal. They're missing the dfs.name.dir which is what I'm looking for particularly. I'm on an RHLE 6 system, by the way. Being lazy, I though I could just copy over cloudera manager's HDFS config files, so I don't have to manually create them, the copy them over to 6 nodes :)
Thanks
I was facing same problem.
I was changing configuration parameters from cloudera manager ui but was clueless where my changes were getting updated on local file system.
I ran grep command and found out that in my case configuration were stored at /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE directory.
So David is right, whenever we change configs from ui and restart service, it creates new config. settings in /var/run/cloudera-scm-agent/process/ directory.
Using CentOS 6.5, the Cloudera Manager special files do not show up in a SEARCH FILES result because their permissions are set to hide from all but the 'hdfs' user. In addition, there are multiple versions of hdfs-site.xml on the local drive some of which have partial amounts of real settings. The actual settings file is in the DATANODE folder not the NAMENODE folder as evidenced by the lack of dfs.datanode.data.dir values in the latter.
Cloudera manager deploying config file each time you start cluster, each time in different directory. Directories are named after process id or something like this.
The configuration is passed explicitly to each deamon as parameter. So if you will look into command line of each hadoop deamons you can see where is configuration sitting (or just grep over disk for hdfs-site.xml. Names of config files are the same as usual.
I was in the same boat and found this answer:
To allow Hadoop client users to work with the HDFS, MapReduce, YARN
and HBase services you created, Cloudera Manager generates client
configuration files that contain the relevant configuration files with
the settings from your services. These files are deployed
automatically by Cloudera Manager based on the services you have
installed, when you add a service, or when you add a Gateway role on a
host.
You can download and distribute these client configuration files
manually to the users of a service, if necessary.
The Client Configuration URLs command on the cluster Actions menu
opens a pop-up that displays links to the client configuration zip
files created for the services installed in your cluster. You can
download these zip files by clicking the link.
See Deploying Client Configuration Files for more information on this
topic.
On our system I got there via http://your_server:7180/cmf/services/status and clicked the Actions popup under the Add Cluster button. Hope that helps.

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require an alteration in hadoops parameter settings (specifically the mapred-site.xml config file), is it possible to alter this file, and if so, how do I gain access to it? Is hadoop already installed on amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier then to create own cluster manually.
But once the jobflow is finished by default it shuts the cluster down, leaving you with outputs on S3. If what you need is simply to do some crunching, this may be the way to go.
In case you need HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop) you may actually need own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering Hadoop configuration on nodes it will start is possible using EC2 Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly you are trying start it? What exactly AMIs are you using?

Resources