I am trying to set up a multi-cluster Storm system. I have found several third-party step-by-step guides on this. They all list Java, Python, ZeroMQ 2.1.7, and JZMQ as requirements for the Nimbus and Supervisor/slave nodes. But on the official Apache Storm website, the only requirements listed for the Nimbus and Supervisor nodes are Java 6 and Python 2.6.6 (https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html).
Does anyone know whether ZeroMQ and JZMQ are required to configure a Storm cluster? And is there any advantage to having these two installed?
From Storm 0.9.0 onwards, 0MQ should no longer be needed: you can use Netty instead, but it has to be configured explicitly. See http://storm.apache.org/2013/12/08/storm090-released.html for a quick configuration example.
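As a rough sketch of what that configuration looks like, the settings below go into storm.yaml on the Nimbus and Supervisor nodes. The keys and the transport class name are the ones used by the 0.9.x line (the `backtype.storm` package); the numeric values are only starting points and are an assumption about what suits your workload, so tune them rather than treating them as recommended.

```yaml
# storm.yaml (Storm 0.9.x): switch the messaging layer from 0MQ to Netty
storm.messaging.transport: "backtype.storm.messaging.netty.Context"

# Illustrative tuning values; adjust for your cluster
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880   # bytes
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
```

With the transport set this way, ZeroMQ and JZMQ should not need to be installed on the nodes.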
I'm working on a project that relies on Hadoop with the MRv1 architecture (Hadoop 1.1.2). I tried the Oozie scheduler for creating MapReduce workflows but eventually gave up, because it is a nightmare to configure and I couldn't get it to work. I was wondering whether I should try other workflow schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements?
I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am having trouble running the Ozone cluster alongside Hadoop. It throws a class-not-found exception for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or the Storage Container Manager.
I also found that the Ozone 1.0 release is fairly recent, and it is stated to have been tested with Hadoop 3.1. My running Hadoop cluster is version 3.3.0, so I suspect the version difference may be the problem.
The Ozone tarball also ships its own Hadoop config files, but I want to configure Ozone with my existing Hadoop cluster.
Please let me know what the right approach is here. If this cannot be done, please also let me know a good way to monitor and extract metrics for Apache Hadoop in production.
I was trying to find any major differences between Storm 1.1 and Storm 2.0.
Is there any difference in setting up a cluster for either version?
(I read on the official website about the new Java-based implementation, but has anyone seen any difference between these two versions?)
In addition to reading the changelog at https://www.apache.org/dist/storm/apache-storm-2.0.0/RELEASE_NOTES.html, you can look at https://issues.apache.org/jira/browse/STORM-2306?focusedCommentId=16291947&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16291947 for some performance numbers. You can also run your own benchmarks of course.
Any insights on how we can use a thin jar to submit Spark applications?
The scenario is this: if a specific dependency is not present on the project's classpath, or is specific to a distribution such as Cloudera or Hortonworks, an exception is thrown when the appropriate versions of the jars are not used.
How can we avoid such scenarios?
The only thin jar you can make is one that doesn't bundle the Spark core libraries into the JAR. For example, Spark SQL and Spark Streaming don't need to be included, but unless Spark was compiled with Hive support during installation, you'll still need to include the Hive support library yourself.
You'll need to contact your Hadoop cluster administrators to know what version of Spark is available, how it was built, and what libraries are available in $SPARK_HOME out of the box.
In my experience, I've never run into a dependency specific to HDP or CDH: I've run a Spark 2.3 job submitted to YARN without problems, even though neither vendor officially supports that version. The only thing you need is to match the Spark version with your code, not necessarily the Hadoop/YARN/Hive versions. Kafka, Cassandra, and other connectors are all extras anyway, and they can't be part of a thin jar.
We are using Cloudera CDH 4.5.0 for HBase, and Storm 0.9.3 uses hbase-client. Unfortunately, it seems Cloudera did not provide an hbase-client Maven artifact, and I cannot figure out how to satisfy the dependency on org.apache.hadoop.hbase.security.UserProvider. According to the Maven search site, it can be provided by either hbase-client or hbase-common. Can someone tell me whether there is a comparable version of either of these that I can use with CDH 4.5.0?
Are you using CDH 4.x or CDH 5.x? The hbase-client/hbase-common jars exist only in CDH 5 (HBase 0.96+). The CDH 4 release has a single big hbase jar containing everything. Also, UserProvider doesn't seem to be present in 4.5.0, but it is present from 4.6.x onwards.
hbase-client depends on hbase-common, so in general you need both if you want to use the client.
(If you are looking only for the UserProvider class, it is in hbase-common.)