Plain vanilla Hadoop installation vs Hadoop installation using Ambari - hadoop

I recently downloaded the Hadoop distribution from Apache and got it up and running quite quickly: download the Hadoop tarball, untar it at a location, and adjust a few configuration settings. With this setup I can see the various configuration files, such as yarn-site.xml and hdfs-site.xml, and I know the Hadoop home location.
Next, I installed Hadoop (HDP) using Ambari.
Here comes the confusing part. It seems Ambari installs HDP in /usr/hdp; however, the directory structure of the Ambari installation is totally different from that of plain vanilla Hadoop. I am not able to locate the configuration files, e.g. yarn-site.xml.
So can anyone help me demystify this?

All configuration changes must be made via the Ambari UI. There is no need to edit the configuration files, since Ambari persists the configuration in the Ambari database.
If you still need them, they are under /etc/hadoop/conf/.

It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in Ambari UI and not directly on disk?
Every time a service that has a stale configuration is restarted, the ambari-agent is responsible for writing the latest configuration to disk. The files are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk, they would get overwritten by this process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
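To check what the daemons will actually pick up, the files on disk can be inspected read-only. The following is a minimal sketch, assuming the standard Hadoop *-site.xml layout (a <configuration> element containing <property> entries with <name> and <value>). The function names are illustrative and not part of Ambari or Hadoop.

```python
import xml.etree.ElementTree as ET


def parse_site_xml(text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is treated as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def read_site_xml(path):
    """Read a *-site.xml file from disk and return its properties as a dict."""
    with open(path, encoding="utf-8") as handle:
        return parse_site_xml(handle.read())
```

For example, read_site_xml("/etc/hadoop/conf/yarn-site.xml") would list the YARN properties last written to that path. Use this only for reading; edits belong in the Ambari UI for the reason given above.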

The main benefit of using the Ambari UI in a clustered Hadoop deployment is that it gives you a central management point.
For example, take a 10-machine Hadoop cluster:
Plain vanilla Hadoop:
If you change any configuration, you must make the change on all 10 machines.
Ambari UI:
Because the configuration is stored in a database, you make the change once in the management portal and it is reflected on all nodes.

Related

Pentaho v8.1 + Hadoop v2.7.4 : problem connecting to Hadoop from Pentaho PDI

I am having difficulties trying to get Pentaho PDI to access Hadoop.
I did some research and found that Pentaho uses adapters called shims. I see these as connectors to Hadoop, much as JDBC drivers are for database connectivity in the Java world.
It seems that the new version of PDI (v8.1) has 4 shims installed by default, and they all seem to be for specific distributions from big data companies such as HortonWorks, MapR, and Cloudera.
When I did further research on Pentaho PDI Big Data, I found that earlier versions had support for "vanilla" installations of Apache Hadoop.
I just downloaded Apache Hadoop from the open source site and installed it on Windows.
So my installation of Hadoop would be considered a "vanilla" Hadoop installation.
But when I tried things out in PDI using the HortonWorks shim and tested the connection, it reported that it succeeded in connecting to Hadoop BUT could not find the default directory and the root directory.
I have screen shots of the errors below:
So one can see that the errors come from access to directories, it seems:
1) User Home Directory Access
2) Root Directory Access
I am using the HortonWorks shim, and I know that it has some default directories (I have used the HortonWorks Hadoop virtual machine before).
My questions are:
(1) If I use the HortonWorks shim to connect to my "vanilla" Hadoop installation, do I need to tweak some configuration file to set some default directories?
(2) If I cannot use the HortonWorks shim, how do I install a "vanilla" Hadoop shim?
I also found this related post from 2013 here on Stack Overflow:
Unable to connect to HDFS using PDI step
I am not sure how relevant it is.
I hope someone who has experience with this can help out.
I forgot to add this additional information:
The contents of my core-site.xml file for Hadoop are:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
So that covers it.
Often, failing to reach a directory is related to the user.
When using Hadoop with Pentaho, the user who runs Pentaho needs to be the same user that exists on the Hadoop side.
For example, if you have a user called jluciano on Hadoop, make sure a user with the same name exists on the system and run the Pentaho process as that user, so that the directory accesses work :).
Try that and let me know how it goes.
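To see which home directory a given local user would map to, the fs.defaultFS value from the question can be combined with the user name. This is a minimal sketch, assuming the common HDFS convention that home directories live under /user/<username>; the /user prefix is an assumption about this cluster, and the function names are illustrative.

```python
import getpass
import posixpath
from urllib.parse import urlparse


def hdfs_home_uri(default_fs, username, prefix="/user"):
    """Build the HDFS home directory URI for a user from an fs.defaultFS value."""
    parsed = urlparse(default_fs)
    if parsed.scheme != "hdfs" or not parsed.netloc:
        raise ValueError(f"not an hdfs:// URI: {default_fs!r}")
    # HDFS paths always use forward slashes, so use posixpath even on Windows.
    return f"hdfs://{parsed.netloc}{posixpath.join(prefix, username)}"


def current_local_user():
    """Return the name of the OS user running this process (and thus PDI, if launched the same way)."""
    return getpass.getuser()
```

Calling hdfs_home_uri("hdfs://localhost:9000", current_local_user()) gives the path to look for in HDFS. If that directory does not exist or belongs to someone else, that would be consistent with the "User Home Directory Access" failure.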

Any open-source software to manage a big-data cluster including hadoop/hive/spark?

I am looking for an open-source system to manage my big-data cluster, which is composed of 50+ machines running components like hadoop, hdfs, hive, spark, oozie, hbase, zookeeper, and kylin.
I want to manage them from a web system. By "manage" I mean:
1) I can restart a component with only one click. For example, when I click the "restart" button, zookeeper is restarted on one machine after another.
2) I can deploy a component with only one click. For example, to deploy a new zookeeper, I prepare a compiled zookeeper on one machine, click "deploy", and it is deployed to all machines automatically.
3) I can upgrade a component with only one click. For example, to update a zookeeper cluster, I put the updated zookeeper on one machine, click "update", and the updated zookeeper replaces the old version on all the other machines.
All in all, what I want is a management system for my big-data cluster that can restart, deploy, upgrade, view logs, modify configuration and so on, or at least do some of these.
I have considered Ambari, but it can only be used to deploy a whole system from scratch, and my big-data cluster has already been running for a year.
Any suggestions?
Ambari is what you want. It's the only open source solution for managing Hadoop stacks that meets your listed requirements. You are correct that it doesn't work with already provisioned clusters. This is because, to achieve such tight integration with all those services, it must know how they were provisioned, where everything is, and what configurations exist for each. The only way Ambari can know that is if it was used to provision those services.
Investing the time to recreate your cluster with Ambari may feel painful, but in the long run it will pay off, because upgrading and managing services becomes so much easier going forward.
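The "one machine after another" restart described in the question can be sketched as a simple loop. This only illustrates the idea and is not how Ambari implements it; the restart and is_healthy callables are hypothetical hooks that you would have to supply (for example, wrapping SSH commands).

```python
def rolling_restart(hosts, restart, is_healthy):
    """Restart hosts one at a time, stopping at the first host that does not come back.

    Returns (restarted_hosts, failed_host); failed_host is None if every host
    came back healthy.
    """
    restarted = []
    for host in hosts:
        restart(host)
        if not is_healthy(host):
            # Stop here so the remaining hosts keep serving.
            return restarted, host
        restarted.append(host)
    return restarted, None
```

Stopping at the first unhealthy host is a deliberate choice: for a quorum-based service such as zookeeper, continuing after a failed restart could take down too many members at once.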

HBase region servers going down when trying to configure Apache Phoenix

I'm using CDH 5.3.1 with HBase 0.98.6-cdh5.3.1 and trying to configure Apache Phoenix 4.4.0.
As per the documentation provided in Apache Phoenix Installation, I:
Copied the phoenix-4.4.0-HBase-0.98-server.jar file into the lib directory (/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/hbase/lib) of both the master and the region servers.
Restarted the HBase service from Cloudera Manager.
When I check the HBase instances, I see that the region servers are down, and I don't see any problem in the log files.
I even tried copying all the jars from the Phoenix folder and am still facing the same issue.
I have also tried to configure Phoenix 4.3.0 and 4.1.0, but still no luck.
Can someone point out what else I need to configure or do to resolve this issue?
I was able to configure Apache Phoenix using parcels. The following are the steps to install Phoenix using Cloudera Manager:
In Cloudera Manager, go to Hosts, then Parcels.
Select Edit Settings.
Click the + sign next to an existing Remote Parcel Repository URL, and add the following URL: http://archive.cloudera.com/cloudera-labs/phoenix/parcels/1.0/. Click Save Changes.
Select Hosts, then Parcels.
In the list of Parcel Names, CLABS_PHOENIX is now available. Select it and choose Download.
The first cluster is selected by default. To choose a different cluster for distribution, select it. Find CLABS_PHOENIX in the list, and click Distribute.
If you plan to use secondary indexing, add the following to the hbase-site.xml advanced configuration snippet. Go to the HBase service, click Configuration, and choose HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml. Paste in the following XML, then save the changes.
<property>
<name>hbase.regionserver.wal.codec</name>
<value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>
Whether you edited the HBase configuration or not, restart the HBase service. Click Actions > Restart.
For detailed installation steps and other details, refer to this link.
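Because the safety valve takes pasted XML, a typo in the property name or class name is easy to miss. As a minimal sketch, the snippet can be generated rather than typed by hand; property_xml is an illustrative helper, not part of Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def property_xml(name, value):
    """Return a single Hadoop-style <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The secondary-indexing property from the steps above.
WAL_CODEC_SNIPPET = property_xml(
    "hbase.regionserver.wal.codec",
    "org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec",
)
```

Printing WAL_CODEC_SNIPPET yields the property on one line, ready to paste into the hbase-site.xml safety valve field.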
I don't think Phoenix 4.4.0 is compatible with the CDH version you are running. This discussion on the mailing list will help you: http://search-hadoop.com/m/9UY0h2n4MOg1IX6OR1

How To Sync the contents to all nodes

I was following this tutorial to install and configure Spark on my cluster.
My cluster (5 nodes) is hosted on AWS and was installed from Cloudera Manager.
The tutorial says to "Sync the contents of /etc/spark/conf to all nodes." after modifying the configuration file.
I am really wondering what the easiest way to make that happen is. I read a post with a question similar to mine HERE. Based on my understanding, the configuration files of hadoop, hdfs, etc. are monitored by zookeeper or Cloudera Manager, so in that case CM deploy or zookeeper can be used to make it happen.
However, Spark's configuration file is totally outside zookeeper's scope. How can I "sync" it to the other nodes?
Many thanks!
Why don't you mount the same EBS volume via NFS at /etc/spark/conf or one of its parents, so the files are automatically synced?
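If a shared mount is not an option, another common approach is to push the directory from one node to the others with rsync over SSH. The sketch below only builds the commands and does not run them. It assumes passwordless SSH to each node and write permission on the target directory, and the host names are placeholders.

```python
import shlex


def rsync_commands(hosts, conf_dir="/etc/spark/conf"):
    """Build one rsync command per host that mirrors conf_dir to the same path remotely."""
    # A trailing slash makes rsync copy the directory's contents, not the directory itself.
    src = conf_dir.rstrip("/") + "/"
    return [["rsync", "-az", "--delete", src, f"{host}:{src}"] for host in hosts]


def as_shell_lines(commands):
    """Render the commands as shell-quoted lines for review before running them."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]
```

For a 5-node cluster you would call rsync_commands(["node2", "node3", "node4", "node5"]) on the node where the file was edited, review the output of as_shell_lines, and then run each command (for instance with subprocess.run(cmd, check=True)). Note that --delete removes remote files that no longer exist locally.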

Locating Cloudera Manager HDFS config files

I've installed a cluster via Cloudera Manager, and now I need to launch the cluster manually.
I've been using the following command:
$ sudo -u hdfs hadoop namenode / datanode / jobtracker
But then dfs.name.dir is set to /tmp. I can't seem to find where Cloudera Manager keeps the HDFS config files. The ones in /usr/lib/hadoop-02*/conf seem to be minimal. They're missing dfs.name.dir, which is what I'm looking for in particular. I'm on a RHEL 6 system, by the way. Being lazy, I thought I could just copy over Cloudera Manager's HDFS config files, so I don't have to create them manually and then copy them over to 6 nodes :)
Thanks
I was facing the same problem.
I was changing configuration parameters from the Cloudera Manager UI but was clueless about where my changes were being written on the local file system.
I ran a grep command and found that, in my case, the configuration was stored in the /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE directory.
So David is right: whenever we change configs from the UI and restart the service, it creates new config settings in the /var/run/cloudera-scm-agent/process/ directory.
Using CentOS 6.5, the Cloudera Manager special files do not show up in a SEARCH FILES result because their permissions are set to hide them from all but the 'hdfs' user. In addition, there are multiple versions of hdfs-site.xml on the local drive, some of which contain only part of the real settings. The actual settings file is in the DATANODE folder, not the NAMENODE folder, as evidenced by the lack of dfs.datanode.data.dir values in the latter.
Cloudera Manager deploys the config files each time you start the cluster, each time into a different directory. The directories are named after the process ID or something like it.
The configuration is passed explicitly to each daemon as a parameter. So if you look at the command line of each Hadoop daemon, you can see where its configuration lives (or just grep the disk for hdfs-site.xml). The names of the config files are the same as usual.
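Since a new directory appears on each start, the one you usually want is the most recent. Here is a minimal sketch for finding it, assuming the /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE layout reported in the answer above. It picks the newest directory by modification time rather than interpreting the prefix, and it typically needs to run as a user allowed to read that path.

```python
import glob
import os


def latest_process_dir(base="/var/run/cloudera-scm-agent/process", role="hdfs-NAMENODE"):
    """Return the most recently modified '<something>-<role>' directory under base, or None."""
    pattern = os.path.join(base, f"*-{role}")
    candidates = [path for path in glob.glob(pattern) if os.path.isdir(path)]
    # Newest by modification time; None when no directory matches.
    return max(candidates, key=os.path.getmtime, default=None)
```

The hdfs-site.xml inside the returned directory should be the one the running daemon was started with. Passing role="hdfs-DATANODE" looks up the datanode's copy instead.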
I was in the same boat and found this answer:
To allow Hadoop client users to work with the HDFS, MapReduce, YARN
and HBase services you created, Cloudera Manager generates client
configuration files that contain the relevant configuration files with
the settings from your services. These files are deployed
automatically by Cloudera Manager based on the services you have
installed, when you add a service, or when you add a Gateway role on a
host.
You can download and distribute these client configuration files
manually to the users of a service, if necessary.
The Client Configuration URLs command on the cluster Actions menu
opens a pop-up that displays links to the client configuration zip
files created for the services installed in your cluster. You can
download these zip files by clicking the link.
See Deploying Client Configuration Files for more information on this
topic.
On our system I got there via http://your_server:7180/cmf/services/status and clicked the Actions popup under the Add Cluster button. Hope that helps.

Resources