I am trying to follow the Apache documentation to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues running the Ozone cluster alongside Hadoop: a ClassNotFoundException for "org.apache.hadoop.ozone.HddsDatanodeService" is thrown whenever I try to start the Ozone Manager or Storage Container Manager.
I also found that the Ozone 1.0 release is fairly recent and is documented as having been tested with Hadoop 3.1. My running Hadoop cluster is version 3.3.0, so I suspect the version mismatch may be the problem.
The Ozone tarball also ships its own Hadoop config files, but I want to configure Ozone against my existing Hadoop cluster.
Please let me know what the right approach is here. If this cannot be done, please also let me know what a good way is to monitor and extract metrics from Apache Hadoop in production.
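On the metrics side, Hadoop 3.3.0 itself can reportedly expose daemon metrics in Prometheus format (HADOOP-16398), independent of Ozone. A minimal core-site.xml sketch follows; the property name is an assumption and should be verified against the core-default.xml of your exact build.

    <!-- core-site.xml: enable Hadoop's built-in Prometheus endpoint
         (assumes Hadoop 3.3.0+, where HADOOP-16398 is included) -->
    <property>
      <name>hadoop.prometheus.endpoint.enabled</name>
      <value>true</value>
    </property>

With this enabled, each daemon's HTTP server should serve metrics at /prom (for example http://<namenode-host>:9870/prom), which Prometheus can scrape directly.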
Related
I'm working on a project that relies on Hadoop, but on the MRv1 architecture (Hadoop 1.1.2). I tried the Oozie scheduler for creating (mapred) workflows but eventually gave up, because it is a nightmare to configure and I couldn't get it to work. I was wondering if I should try other workflow schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements?
Does anyone know which container technology (Docker, LXC, ...) is used in Apache Hadoop, especially in HDFS and MapReduce?
I know it uses container technology, but I cannot find which one specifically.
Out of the box, none. What YARN calls "containers" are, by default, bare-metal JVM instances that are scheduled onto the YARN NodeManagers.
YARN can be configured to use Docker or runC.
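As a rough illustration of the Docker option, the NodeManagers have to run the LinuxContainerExecutor and allow the docker runtime. A minimal yarn-site.xml sketch, assuming Hadoop 3.x (a complete setup also needs container-executor.cfg changes and a Docker daemon on every node):

    <!-- yarn-site.xml: allow the Docker runtime alongside the default one -->
    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
      <value>default,docker</value>
    </property>

Individual applications then opt in per container, for example by setting YARN_CONTAINER_RUNTIME_TYPE=docker and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=<image> in the container environment.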
Any insights on how we can use a thin JAR to submit Spark applications?
The scenario is that if some dependency is not present on the project's classpath, or is specific to a distribution such as Cloudera or Hortonworks, an exception is thrown when the appropriate versions of the JARs are not used.
How can we avoid such scenarios?
The only thin JAR you can make is one that doesn't compile the Spark core libraries into the JAR. For example, Spark SQL and Spark Streaming don't need to be included, but unless Spark was compiled with Hive support during installation, you'll still need to bundle that one.
You'll need to contact your Hadoop cluster administrators to find out which version of Spark is available, how it was built, and which libraries are available in $SPARK_HOME out of the box.
In my experience, I've never run into a dependency specific to HDP or CDH: I've submitted a Spark 2.3 job to YARN just fine even though neither vendor officially supports that version. The only thing you need to match is the Spark version against your code, not necessarily the Hadoop/YARN/Hive versions. Kafka, Cassandra, and other connectors are all extra anyway, and they can't go in a thin JAR.
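To make that concrete, here is a minimal build.sbt sketch (artefact names and versions are illustrative only): Spark's own modules are marked "provided" so they come from the cluster's installation at runtime, while connectors are bundled.

    // build.sbt -- hypothetical thin-jar setup; names/versions are illustrative
    name := "my-spark-job"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // supplied by $SPARK_HOME on the cluster, so excluded from the assembled JAR
      "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided",
      // connectors are not on the cluster by default, so they must be bundled
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
    )

Packaging with sbt-assembly (or Maven with <scope>provided</scope> and the shade plugin) then produces a JAR containing only your code and the connectors; spark-submit supplies the rest from the cluster.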
I am trying to set up a multi-node Storm cluster. I have found several third-party step-by-step guides on this. They all list Java, Python, ZeroMQ 2.1.7 and JZMQ as the requirements for the Nimbus and Supervisor/Slave nodes. But on the official Apache Storm website, the only requirements for the Nimbus and Supervisor nodes are Java 6 and Python 2.6.6 (https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html).
Does anyone know whether ZeroMQ and JZMQ are required for a Storm cluster configuration? And is there an advantage to having these two pieces of software installed?
From Storm 0.9.0 onwards, 0MQ should no longer be needed; you can use Netty instead, but it needs to be configured. Please see http://storm.apache.org/2013/12/08/storm090-released.html for a quick config setup.
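For reference, a storm.yaml sketch of the Netty transport settings from that release announcement (the numbers are suggested starting points and should be tuned for your cluster):

    # storm.yaml -- switch the messaging layer from 0MQ to Netty (Storm 0.9.x)
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    storm.messaging.netty.server_worker_threads: 1
    storm.messaging.netty.client_worker_threads: 1
    storm.messaging.netty.buffer_size: 5242880
    storm.messaging.netty.max_retries: 100
    storm.messaging.netty.max_wait_ms: 1000
    storm.messaging.netty.min_wait_ms: 100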
I have a Hadoop CDH 4.5 cluster managed using Cloudera Manager. I have a custom log4j.properties file with specific configurations. I added this log4j.properties to the cluster through Cloudera Manager (for each process too, i.e., namenode, datanode, jobtracker, tasktracker), but it was not picked up by Hadoop. Has anyone faced this problem before?
Before using CDH4, I used hadoop-0.20.2, and simply placing this log4j.properties in the Hadoop configuration directory was enough for it to be picked up. So is this issue in any way related to Cloudera Manager?