I have a cluster of hadoop cdh4.5 managed using cloudera manager. I have a custom log4j.properties file with specific configurations. I have added this log4j.properties through cloudera manager to the cluster (for each process too ie., namenode,datanode,jobtracker,tasktracker). But it was not taken by the hadoop. Has anyone faced this problem earlier?
Before using cdh4, I used hadoop-0.20.2 and just having this log4j.properties in hadoop configuration was enough to pick it. So is this issue anyway related to cloudera manager?
Related
I'm working on a project that relies on Hadoop but MRv1 architecture (Hadoop-1.1.2). I tried oozie scheduler for creating workflows(mapred) but gave up eventually, cause it is a nightmare to configure and I couldn't get it to work. I was wondering if I should try these other workflow Schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements ?
48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml_to _$HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add hbase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce
I know what the above two ways mean, but I don't know how to configure the recommended way
Could anyone tell me how to configure it in the recommended way?
As the docs show, prior to running hadoop jar, you can export HADOOP_CLASSPATH=$(hbase classpath) and you can use hadoop jar ... -libjars [...]
The true recommended way would be to bundle your HBase dependencies as an Uber JAR in your mapreduce application
The only caveat is that you need to ensure that your project uses the same/compatible hbase-mapreduce client versions as the server.
That way, you don't need any extra configuration, except maybe specifying the hbase-site.xml
I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to setup Apache Ozone cluster. However, I am finding issues in running the ozone cluster concurrently with Hadoop. It throws a class not found exception for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the ozone manager or storage container manager.
I also found that ozone 1.0 release is pretty recent and it is mentioned that it is tested with Hadoop 3.1. I have a running Hadoop cluster of version of 3.3.0. Now, I doubt if the version is a problem.
The tar ball for Ozone also has the Hadoop config files, but I wanted to configure ozone with my existing Hadoop cluster. I want to configure the ozone with my existing hadoop cluster.
Please let me know what should be the right approach here. If this can not be done, then please also let me know what is good way to monitor and extract metrics for Apache Hadoop in production.
Any insights on how can we use thin jar to submit spark applications?
The scenario is such that if some specific dependency is not present in the classpath of the project or is specific to some distribution cloudera or hortonworks it throws an exception if the appropriate version of jars are not used.
How can we avoid such scenarios?
The only thin jar you can make is one that doesn't compile the Spark core libraries into the JAR. For example Spark SQL and Spark Streaming don't need included, but unless Spark was compiled with Hive support during installation, you'll still need that one.
You'll need to contact your Hadoop cluster administrators to know what version of Spark is available, how it was built, and what libraries are available in $SPARK_HOME out of the box.
In my experience, I've never ran into a specific dependency to HDP or CDH as I've ran a Spark 2.3 job submitted to YARN fine, while neither vendor officially supports that version. The only thing you need is to match the Spark version with your code, not necessarily Hadoop/YARN/Hive versions. Kafka, Cassandra, other connectors are all extra anyway and they can't be in a thin jar
Does anyone has worked on this configuration: Apache Hive on Apache Spark?
What is the latest version compatibility for this configuration?
I want to implement this in my production systems. Kindly help with the compatibility matrix for Apache Hadoop, Apache Hive, Apache Spark and Apache Zeppelin.
You have to use hive2 (0.11+) and SPARK 2.2.0 and in hive-site.xml. And you have to set Spark as executor engine so you can easily run your queries on top of Spark.
In hive2 there are some options like Tez, llap etc. For more information kindly check the document Hive on Spark: Getting Started.
follow the tutorial
apache hive installation
and then just copy the hive-site.xml to $APACHE_HOME/conf
Hive is moving to rely only on the Tez execution engine. Please build all new workloads on MapReduce or Tez.