Configure Multiple OpenTSDB to use single HBase backend - hadoop

I have to configure multiple OpenTSDB instances (TSDs) to write data into a single HBase. Of all the TSDs, one would run local to HBase and the others would be remote. I tried running a TSD with the extra argument --zkquorum=xxx:xxx:xxx:xxx:YY; it was able to connect but threw an exception. Can someone please tell me everything I have to configure to run multiple TSDs against a single HBase?

If you followed the instructions at http://opentsdb.net/setup-hbase.html to set up a single-node cluster, you need to remove the properties hbase.zookeeper.dns.interface, hbase.regionserver.dns.interface and hbase.master.dns.interface so that HBase and ZooKeeper don't bind to localhost.
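As a rough sketch of that check, the helper below lists any of the three properties still present in an hbase-site.xml. The function name, the conf/hbase-site.xml path and the zk.example.com host are assumptions for illustration, not part of the original setup guide.

```shell
#!/bin/sh
# List any *.dns.interface properties still present in an hbase-site.xml.
# The file path is passed in; conf/hbase-site.xml below is an assumed location.
find_dns_interface_props() {
  grep -oE 'hbase\.(zookeeper|regionserver|master)\.dns\.interface' "$1" | sort -u
}

# Usage on the HBase host (prints nothing once all three are removed):
#   find_dns_interface_props conf/hbase-site.xml
#
# Each remote TSD would then point at the ZooKeeper host, e.g. (hypothetical host):
#   ./build/tsdb tsd --port=4242 --staticroot=build/staticroot \
#       --cachedir=/tmp/tsd --zkquorum=zk.example.com:2181
```

HBase has to be restarted after editing the file for the change to take effect.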

Related

Jenkins as JobServer on Hadoop EdgeNode

I'm not sure anyone can help me, but I'll give it a try.
I'm running Jenkins on an OpenShift cluster to use it for deployment and as a job server for running ETL jobs. These jobs transfer data from flat files to databases and from database to database.
Now I have to extend the system to transfer data to a Hadoop cluster using MapR.
What I would like to know is: how can I use a new Jenkins slave as a job server on an edge node of the Hadoop cluster using MapR? Do I need Jenkins on the edge node, or can I use MapR from my existing Jenkins job server?
Maybe someone can help me or has some information or links on how to solve this.
Thanks to all.
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then Jenkins would be able to run sh("hadoop jar app.jar"), for example.
If you're using OpenShift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else.
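A minimal sketch of what such a Jenkins shell step might look like, assuming the Hadoop client is already installed on the agent. The function names are hypothetical, and app.jar is just the placeholder from the example above.

```shell
#!/bin/sh
# Return success if the named client binary is on PATH.
have_client() {
  command -v "$1" >/dev/null 2>&1
}

# Submit a jar through the locally installed Hadoop client.
# Fails early with a readable message if the agent is not set up as an edge node.
submit_job() {
  have_client hadoop || {
    echo "hadoop client not installed on this agent" >&2
    return 1
  }
  hadoop jar "$@"
}

# Usage inside a Jenkins shell step:
#   submit_job app.jar
```

The early check only makes a misconfigured agent fail with a clear message; the cluster connection itself still comes from the copied XML configuration files.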

How to configure HUE to be connected to remote Hive server?

I'm trying to use HUE Beeswax to connect to my company's Hive database. Firstly, is it possible to connect HUE installed on my Mac to a remote Hive server? If so, how am I supposed to find the address of the Hive server, which is running on our private server? The only thing I can do is type 'hive' and run some SQL queries in the Hive shell. I already installed HUE but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, Hive CLI is deprecated. Beeline is preferred. To use Beeline and Hue, you need a HiveServer2 running.
To find the address of the HiveServer2, if you have one, you need to find the hive-site.xml file on the Hadoop cluster and export it. This information is also available in Ambari or Cloudera Manager (but if you're using a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want. The default port is 10000.
When you set up Hue, find the hue.ini file, edit the section that starts with [beeswax], and fill in the necessary values. Personally, I find that section fairly straightforward.
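To make the connection details concrete, here is a small sketch that builds the JDBC URL Beeline expects from a host and port. The helper name and hive.example.com are placeholders; the port falls back to the 10000 default mentioned above.

```shell
#!/bin/sh
# Build a HiveServer2 JDBC URL; the port defaults to 10000.
hs2_jdbc_url() {
  host="$1"
  port="${2:-10000}"
  printf 'jdbc:hive2://%s:%s/default\n' "$host" "$port"
}

# Usage with Beeline (hypothetical host):
#   beeline -u "$(hs2_jdbc_url hive.example.com)"
#
# The same host and port go into the [beeswax] section of hue.ini:
#   hive_server_host=hive.example.com
#   hive_server_port=10000
```

If Beeline can connect with this URL from your Mac, Hue should be able to reach the same HiveServer2 with the same host and port.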
You can read the Hue GitHub repository to find the requirements for running it on a Mac.

how users should work with ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster is supposed to use it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP like spark-shell has.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server the service is actually running on.
Second, make sure that you comply with the security mechanisms of your cluster. For HDFS, this can determine which users you are allowed to use and which directories you can access with them. If you do not use security mechanisms such as Kerberos or Ranger, you should be able to run your stated tasks directly from the command line.
Third, execute your tasks via the command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
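On the question of specifying the server: the HDFS client normally takes the NameNode address from the fs.defaultFS property in its core-site.xml, and a fully qualified hdfs:// URI can be passed instead. The sketch below reads that property from a config file; the function name and the /etc/hadoop/conf path are assumptions, and the parse is deliberately simple.

```shell
#!/bin/sh
# Print the fs.defaultFS value from a core-site.xml (simple line-based parse,
# assuming <name> and <value> each sit on a single line).
default_fs() {
  awk '/<name>fs\.defaultFS<\/name>/ { found = 1 }
       found && /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); print; exit }' "$1"
}

# Usage (assumed config location):
#   hdfs dfs -ls "$(default_fs /etc/hadoop/conf/core-site.xml)/tmp"
#
# On a machine with the client installed, the supported way to read the value is:
#   hdfs getconf -confKey fs.defaultFS
```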
Take a look at Ambari Views, especially the Files View, which allows browsing HDFS.

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input file can be large, running my application standalone on one machine is not enough. With one more physical machine, how should I design the architecture?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do it without HDFS (for sharing file data)?
Spark supports several cluster managers: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work on your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more information about the cluster and its workers.
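The steps above can be sketched as follows. The helper name and master.example.com are placeholders; 7077 is the default port of a standalone master, so use whatever URL your master actually prints.

```shell
#!/bin/sh
# Build the spark://HOST:PORT URL of a standalone master (default port 7077).
spark_master_url() {
  printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# On the master machine:
#   ./sbin/start-master.sh
# On each worker machine (newer Spark releases name this script start-worker.sh):
#   ./sbin/start-slave.sh "$(spark_master_url master.example.com)"
# Submitting an application to the cluster (app.jar is a placeholder):
#   ./bin/spark-submit --master "$(spark_master_url master.example.com)" app.jar
```

Without HDFS, an input file read from a local path has to be reachable at the same path on every worker, for example through a shared network mount.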
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Apache Hive Installation on pseudo distributed or multi node cluster environment

I have installed Hadoop in a multi-node environment on my PC as follows:
1: 4 VirtualBox instances running Ubuntu (14.04)
2: 1 master node, 2 slave nodes, and the remaining VM instance works as a client
Note: all 4 VMs are running on my PC itself
I was able to complete the Apache Hadoop 2.6 setup successfully on the above-mentioned environment. Now I want to install Hive in order to do some data summarization, querying and analysis.
But I am not sure how to proceed. I have a few questions, listed below:
Q1: Do I need to install/set up Apache Hive (0.14) on all nodes (master/name node and slave/data nodes), or only on the master node?
Q2: Which mode should be used for the metastore: local mode or remote mode?
Q3: If I want to use MySQL for the Hive metastore, should I install it on the master/name node itself, or do I need to use a separate client machine for this?
Could someone also share the steps to follow to configure the metastore in a multi-node/pseudo-distributed environment?
BR,
San
You need to install the required Hive services (HiveServer2, Metastore, WebHCat) only once. In your lab scenario, you would probably put them on the master. The client can then run Beeline (the HiveServer2 client).
If you configure the Metastore as Local, Hive will use a local Derby database. Again, for your lab setup, this is probably just what you need/want.
In a production scenario, you would:
- set up a dedicated server for supporting services that should not fight for resources with the namenode process(es)
- use a dedicated database server for your Metastore database, which will be remote.
