hbase.master.port overridden programmatically? - hadoop

I installed HBase from the Cloudera 5.3.3 distribution, and when I run HBase everything seems to be working fine...
When I try to assign hbase.master.port via /etc/hbase/conf/hbase-site.xml, it is not picked up from there.
I see this on the master node's info page at http://MASTERNODE:60010/conf:
<property>
<name>hbase.master.port</name>
<value>0</value>
<source>programatically</source>
</property>
HBase distribution: 0.98.6-cdh5.3.3
What does this 'programatically' source mean, and how can I disable or override it?

Answering my own question :(
As I just figured out, HBase standalone mode does not take hbase.master.port into account:
https://github.com/cloudera/hbase/blob/cdh4.5.0-release/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java#L141
standalone mode:
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_hbase_standalone_start.html
The only way to assign a port is to set up at least a pseudo-distributed mode;
see this:
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_hbase_pseudo_configure.html

This means it is being set in some app/code. Are you using Cloudera Manager? If so, you will need to set it in Cloudera Manager. If you are not using Cloudera Manager, then you will need to modify hbase-site.xml for the HBase cluster and restart the HBase cluster.

Since version 1.4.2 there is the hbase.localcluster.assign.random.ports option, which controls whether the local cluster overrides the configured ports with random ones.
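For illustration only (a sketch based on the property names mentioned above, not taken from the thread; the port value 16000 is just an example), a local hbase-site.xml that pins the master port might then look like this:
<property>
<name>hbase.localcluster.assign.random.ports</name>
<value>false</value>
</property>
<property>
<name>hbase.master.port</name>
<value>16000</value>
</property>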

Related

Hadoop 2.x -- how to configure secondary namenode?

I have an old Hadoop install that I'm looking to update to Hadoop 2. In the
old setup, I have a $HADOOP_HOME/conf/masters file that specifies the
secondary namenode.
Looking through the Hadoop 2 documentation I can't find any mention of a
"masters" file, or how to setup a secondary namenode.
Any help in the right direction would be appreciated.
The slaves and masters files in the conf folder are only used by some scripts in the bin folder, such as start-mapred.sh, start-dfs.sh, and start-all.sh.
These scripts are a mere convenience so that you can run them from a single node to ssh into each master / slave node and start the desired hadoop service daemons.
You only need these files on the name node machine if you intend to launch your cluster from this single node (using password-less ssh).
Alternatively, you can also start a Hadoop daemon manually on a machine via
bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker]
To run the secondary namenode, use the above script on the designated machine, passing the 'secondarynamenode' value to it.
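For example (run from the Hadoop installation directory on the machine that should host the secondary namenode):
bin/hadoop-daemon.sh start secondarynamenode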
See #pwnz0r's 2nd comment on the answer to "How separate hadoop secondary namenode from primary namenode?"
To reiterate here:
In hdfs-site.xml:
<property>
<name>dfs.secondary.http.address</name>
<value>$secondarynamenode.full.hostname:50090</value>
<description>SecondaryNameNodeHostname</description>
</property>
I am using Hadoop 2.6 and had to use
<property>
<name>dfs.secondary.http.address</name>
<value>secondarynamenode.hostname:50090</value>
</property>
For further details, refer to https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
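As a side note (not part of the original answer): the hdfs-default.xml linked above lists this setting under its current name, dfs.namenode.secondary.http-address, with dfs.secondary.http.address kept as a deprecated alias, so an equivalent Hadoop 2.x snippet would be:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>secondarynamenode.hostname:50090</value>
</property>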
Update the hdfs-site.xml file by adding the following property:
cd $HADOOP_HOME/etc/hadoop
sudo vi hdfs-site.xml
Then paste these lines inside the configuration tag:
<property>
<name>dfs.secondary.http.address</name>
<value>hostname:50090</value>
</property>

Cloudera Manager - dfs.datanode.du.reserved not working

I have set the dfs.datanode.du.reserved property to 10 GB using Cloudera Manager. But when I check the MapReduce job.xml file, I find that dfs.datanode.du.reserved is still set to 0. How do I verify whether the property is set?
PS: I am using Cloudera Standard 4.7.2 with CDH 4.4.0
This flag is set in hdfs-site.xml and not in mapred-site.xml.
You will not be able to see this flag in the client configurations (/etc/hadoop/conf/hdfs-site.xml) without tweaking configuration.
It is only set in the datanode configuration that is regenerated by Cloudera Manager. This configuration can be found in /var/run/cloudera-scm-agent/process/XXXXXX-hdfs-DATANODE/hdfs-site.xml, where XXXXXX is an incremented number of some kind (used by Cloudera Manager).
From within Cloudera Manager you can see this flag on the DataNode: click Processes, then Configuration Files/Environment - Show, and there you will find the hdfs-site.xml for that datanode.
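For reference (an illustrative sketch, not taken from the thread): in that generated hdfs-site.xml the value is expressed in bytes per volume, so a 10 GB reservation would appear roughly as:
<property>
<name>dfs.datanode.du.reserved</name>
<value>10737418240</value>
</property>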

Why do we need to format HDFS every time we restart the machine?

I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the paths where Hadoop stores its data (by default Hadoop stores data in the /tmp folder).
The hdfs-site.xml file looks like this:
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node is not starting by checking the logs and by using the jps command.
Then I:
Stopped the cluster using the stop-all.sh script.
Formatted HDFS using the hadoop namenode -format command.
Started the cluster using the start-all.sh script.
Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and try to start the cluster.
Has anyone encountered a similar problem?
Why is this happening, and
how can we solve this problem?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change; it depends on your config, but the namenode dir is mandatory and might also be sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
For those who use Hadoop 2.0 or above, the config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your hadoop installation.
Open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which is loaded by the Java classloader first.
Add dfs.namenode.name.dir property and set a new namenode dir (default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for dfs.datanode.data.dir property (default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can also add this property:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
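After pointing these directories away from /tmp, you typically have to create them and re-format the namenode one last time (a sketch reusing the example paths above; formatting erases existing HDFS metadata, so only do it on a fresh or disposable setup):
mkdir -p /Users/samuel/Documents/hadoop_data/name /Users/samuel/Documents/hadoop_data/data /Users/samuel/Documents/hadoop_data/namesecondary
hdfs namenode -format
start-dfs.sh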

Bigtop HBase tables disappeared after PC restart

I installed Bigtop 0.7.0 on Ubuntu 12.04 and started the master server without any problem with:
sudo hbase master start
I was able to connect with hbase shell and create a table.
After I restarted the PC, I saw that the table was not there anymore.
I read that the problem is that it stores tables in /tmp, which is cleared after restart, so I tried to change the hbase-site.xml configuration to set another folder.
The default hbase-site.xml was:
<configuration/>
(No properties defined)
When I wrote to hbase-site.xml and then tried to start the HBase master again, I received a ZooKeeper client exception about not being able to connect to the server.
Can you please give me some advice on how to configure this correctly, or whether there is maybe some other problem that I'm not aware of?
EDIT (from the comments):
My hbase-site.xml is:
<configuration>
<!--property>
<name>hbase.rootdir</name>
<value>file://app/hadoop/tmp/hbase</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<property-->
</configuration>
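For comparison only (an illustrative sketch using the same property names, not an answer from the thread): with the comment markers removed, the missing </property> added, and an absolute file URI (three slashes), that block would read:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///app/hadoop/tmp/hbase</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>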

Best place for json Serde JAR in CDH Hadoop for use with Hive/Hue/MapReduce

I'm using Hive/Hue/MapReduce with a JSON SerDe. To get this working I have copied json_serde.jar to several lib directories on every cluster node:
/opt/cloudera/parcels/CDH/lib/hive/lib
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib
/opt/cloudera/parcels/CDH/lib/hadoop/lib
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib
...
On every CDH update of the cluster I have to do that again.
Is there a more elegant way, where the distribution of the SerDe in the cluster would be automatic and resistant to updates?
If you are using HiveServer2 (the default in Cloudera 5.0+), the following configuration will work across your entire cluster without having to copy the jar to each node.
Add it to your hive-site.xml config file or, if you're using Cloudera Manager, to the "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" config box:
<property>
<name>hive.aux.jars.path</name>
<value>/user/hive/aux_jars/hive-serdes-1.0-snapshot.jar</value>
</property>
Then create the directory in your HDFS filesystem (/user/hive/aux_jars) and place the jar file in it. If you are running Hue you can do this part via the web UI; just click on File Browser at the top right.
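From the command line, that step might look like this (a sketch; the jar name matches the value used in the snippet above):
hdfs dfs -mkdir -p /user/hive/aux_jars
hdfs dfs -put hive-serdes-1.0-snapshot.jar /user/hive/aux_jars/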
It depends on the version of Hue and whether you are using Beeswax or HiveServer2:
Beeswax: there is a workaround with HIVE_AUX_JARS_PATH: https://issues.cloudera.org/browse/HUE-1127
HiveServer2 supports the hive.aux.jars.path property in hive-site.xml. HiveServer2 does not support a .hiverc, and Hue is looking at providing an equivalent at some point: https://issues.cloudera.org/browse/HUE-1066
