Pig accessing HBase using Spring Data Hadoop - spring

Has anyone got experience of using Spring Data Hadoop to run a Pig script that connects to HBase using Elephant Bird's HBaseLoader?
I'm new to all of the above, but need to take some existing Pig scripts that were executed via a shell script and instead wrap them up in a self-contained Java application. Currently the scripts are run from a specific server that has Hadoop, HBase and Pig installed, and config for all of the above in /etc/. Pig has the HBase config on its classpath, so I'm guessing this is how it knows how to connect to HBase.
I want to have all configuration in Spring. Is this possible if I need Pig to connect to HBase? How do I configure HBase such that the Pig script and the Elephant Bird library will know how to connect to it?

Related

How HBase add its dependency jars and use HADOOP_CLASSPATH

48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add hbase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce.
I understand what the first two approaches mean, but I don't know how to configure the recommended one.
Could anyone tell me how to configure it in the recommended way?
As the docs show, prior to running hadoop jar, you can export HADOOP_CLASSPATH=$(hbase classpath) and you can use hadoop jar ... -libjars [...]
The true recommended way would be to bundle your HBase dependencies as an uber JAR in your MapReduce application.
The only caveat is that you need to ensure that your project uses the same/compatible hbase-mapreduce client versions as the server.
That way, you don't need any extra configuration, except maybe specifying the hbase-site.xml
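As a rough sketch of the HADOOP_CLASSPATH and -libjars options mentioned above, assuming the hbase and hadoop commands are on the PATH. The jar name and driver class are hypothetical placeholders. The sketch uses `hbase mapredcp` (the reduced classpath HBase prints for MapReduce jobs) for -libjars instead of the full `hbase classpath`, since -libjars expects a comma-separated list of jar files. Note that -libjars is only honoured if the driver parses generic options, for example via ToolRunner.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: substitute your own job jar and driver class.
JOB_JAR="my-hbase-job.jar"
JOB_CLASS="com.example.MyHBaseJob"

# Turn a colon-separated classpath into the comma-separated list -libjars expects.
to_libjars() {
  printf '%s' "$1" | tr ':' ','
}

if command -v hbase >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
  # Client side: let HBase report its own classpath, as in the answer above.
  export HADOOP_CLASSPATH="$(hbase classpath)"
  # Task side: ship the jars that MapReduce tasks need along with the job.
  hadoop jar "$JOB_JAR" "$JOB_CLASS" -libjars "$(to_libjars "$(hbase mapredcp)")"
else
  echo "hbase/hadoop not found on PATH; nothing submitted" >&2
fi
```

With an uber JAR, the -libjars argument becomes unnecessary because the HBase classes travel inside the job jar itself.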

Prometheus Integration with Hadoop (Ozone Cluster)

I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues when running the Ozone cluster concurrently with Hadoop. It throws a class not found exception for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or Storage Container Manager.
I also found that the Ozone 1.0 release is pretty recent, and it is mentioned that it is tested with Hadoop 3.1. I have a running Hadoop cluster of version 3.3.0. Now I wonder whether the version is the problem.
The tarball for Ozone also has the Hadoop config files, but I want to configure Ozone with my existing Hadoop cluster.
Please let me know what the right approach is here. If this cannot be done, then please also let me know what a good way is to monitor and extract metrics for Apache Hadoop in production.

Apache Hive on Apache Spark

Has anyone worked on this configuration: Apache Hive on Apache Spark?
What is the latest version compatibility for this configuration?
I want to implement this in my production systems. Kindly help with the compatibility matrix for Apache Hadoop, Apache Hive, Apache Spark and Apache Zeppelin.
You have to use hive2 (0.11+) and Spark 2.2.0, and in hive-site.xml you have to set Spark as the execution engine so you can run your queries on top of Spark.
In hive2 there are other options as well, such as Tez and LLAP. For more information, check the document Hive on Spark: Getting Started.
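A minimal sketch of selecting the engine for a single invocation, assuming the hive CLI is on the PATH and Spark is already set up for Hive. The property `hive.execution.engine` is the same one that would be set in hive-site.xml; the table name in the query is a hypothetical placeholder.

```shell
#!/usr/bin/env bash
# Per-invocation override of the execution engine; the same property can be
# set permanently in hive-site.xml.
ENGINE_OPT="hive.execution.engine=spark"

# Hypothetical table name, used only for illustration.
QUERY="SELECT COUNT(*) FROM my_table"

if command -v hive >/dev/null 2>&1; then
  hive --hiveconf "$ENGINE_OPT" -e "$QUERY"
else
  echo "hive CLI not found on PATH; query not run" >&2
fi
```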
Follow the Apache Hive installation tutorial, and then just copy the hive-site.xml to $APACHE_HOME/conf.
Hive is moving to rely only on the Tez execution engine. Please build all new workloads on MapReduce or Tez.

Hadoop integration testing

What is the best way to perform integration tests in the Hadoop ecosystem?
Currently, I use Hadoop, HBase and Oozie, and I was wondering what would be the best approach to test the integration. I don't want a mock of Oozie or HBase; I want 'light-weight' instances of those so I could, for example, write to HBase from a web service without the need to inject a mock. Similarly, I don't want a mock Oozie client, but a light-weight Oozie running on some port.
Would it be a good approach to set up a pseudo-distributed cluster on a single machine and install HBase and Oozie additionally, or is there a better way?

Cassandra and Hadoop

I am new to Cassandra and Hadoop. I am trying to read Cassandra data on an hourly basis and dump it into HDFS. Cassandra and Hadoop are on different clusters. Any pointers on clients/APIs I could use to do this are much appreciated.
I recommend Java because Hadoop and Cassandra are both Java based. Astyanax is a good Java Cassandra API.
I've used org.apache.hadoop to write to HDFS using Java but there might be something better out there.
