I'm working on a project that relies on Hadoop, but on the MRv1 architecture (Hadoop 1.1.2). I tried the Oozie scheduler for creating MapReduce workflows but eventually gave up, because it was a nightmare to configure and I couldn't get it to work. I was wondering whether I should try other workflow schedulers such as Azkaban or Apache Airflow. Would they be compatible with my requirements?
Does anyone know which container technology (Docker, LXC, ...) is used in Apache Hadoop, specifically in HDFS and MapReduce?
I know it uses container technology, but I cannot find which one specifically.
Out of the box, none. What YARN calls "containers" are, by default, bare-metal JVM instances scheduled by the YARN NodeManagers.
YARN can be configured to use Docker or runC.
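If you do want Docker containers, the switch happens in yarn-site.xml. A minimal sketch follows; the property names are taken from the "Launching Applications Using Docker Containers" documentation for recent Hadoop 3.x releases, so verify them against your version, and note that container-executor.cfg needs matching changes as well:

    <!-- yarn-site.xml: sketch of enabling the Docker runtime on NodeManagers.
         Verify property names against your Hadoop release; container-executor.cfg
         must also be configured for the LinuxContainerExecutor. -->
    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
      <value>default,docker</value>
    </property>
    <property>
      <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
      <value>host,bridge</value>
    </property>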
I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues getting the Ozone cluster to run alongside Hadoop: it throws a ClassNotFoundException for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or the Storage Container Manager.
I also found that the Ozone 1.0 release is fairly recent and is documented as tested against Hadoop 3.1, whereas my running Hadoop cluster is version 3.3.0, so I wonder whether the version mismatch is the problem.
The Ozone tarball also ships its own Hadoop config files, but I want to configure Ozone against my existing Hadoop cluster.
Please let me know what the right approach would be here. If this cannot be done, please also let me know a good way to monitor and extract metrics from Apache Hadoop in production.
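For completeness, one approach I am also considering is the built-in Prometheus metrics sink that recent Hadoop 3.x releases ship, which does not involve Ozone at all. If I read the metrics documentation correctly, it is switched on with a single property (please verify the name against the 3.3.0 docs):

    <!-- core-site.xml: sketch of enabling the built-in Prometheus metrics endpoint
         (property name as I recall it from the Hadoop 3.x docs; please verify).
         Metrics are then scrapeable from the daemons' web UIs under /prom. -->
    <property>
      <name>hadoop.prometheus.endpoint.enabled</name>
      <value>true</value>
    </property>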
I want to run a MapReduce application using Hadoop 2.6.5 (on my own native cluster), and I want to change some things in YARN, so I have been looking at writing my own YARN application (https://hadoop.apache.org/docs/r2.6.5/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html).
But it seems that if I run YARN this way (with my own YARN client), I can't use the MapReduce paradigm (map and reduce functions) with a Job class.
Is there any option to write my own ApplicationMaster and still use the MapReduce paradigm with the simple Job class?
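For clarity, this is the plain Job-class submission path (roughly sketched below) that I would like to keep while customizing the YARN side; MyMapper and MyReducer stand in for my own classes:

    // Minimal sketch of the standard MapReduce submission via the Job class
    // (the paradigm I want to keep); class names are the stock Hadoop 2.6 API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyJobDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my-mr-job");
        job.setJarByClass(MyJobDriver.class);
        job.setMapperClass(MyMapper.class);      // placeholder for my Mapper implementation
        job.setReducerClass(MyReducer.class);    // placeholder for my Reducer implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }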
These are some useful examples I have found regarding writing YARN applications:
https://github.com/noitcudni/yarn
https://github.com/hortonworks/simple-yarn-app
https://github.com/blrunner/yarn-beginners-examples/blob/master/src/main/java/com/wikibooks/hadoop/yarn/examples/MyClient.java
*Using Spring or Twill will result in the same problem.
Any insights on how we can use a thin JAR to submit Spark applications?
The scenario is that if some specific dependency is not present in the project's classpath, or is specific to a particular distribution (Cloudera or Hortonworks), an exception is thrown unless the appropriate versions of the JARs are used.
How can we avoid such scenarios?
The only thin JAR you can make is one that doesn't compile the Spark core libraries into the JAR. For example, Spark SQL and Spark Streaming don't need to be included, but unless Spark was compiled with Hive support during installation, you'll still need to bundle that one.
You'll need to contact your Hadoop cluster administrators to find out what version of Spark is available, how it was built, and what libraries are available in $SPARK_HOME out of the box.
In my experience, I've never run into a dependency that was specific to HDP or CDH; I've run a Spark 2.3 job submitted to YARN just fine even though neither vendor officially supported that version. The only thing you need to match is the Spark version against your code, not necessarily the Hadoop/YARN/Hive versions. Kafka, Cassandra, and other connectors are all extra anyway, and they can't go in a thin JAR.
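In build terms, a thin JAR just means marking the Spark artifacts the cluster already provides as provided-scope so they stay out of your assembly. A rough Maven sketch (the artifact and version below are placeholders; match them to the Spark build on your cluster):

    <!-- pom.xml sketch: keep Spark out of the packaged JAR via provided scope.
         Version is a placeholder; match it to the cluster's Spark. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- Connectors such as Kafka or Cassandra are not on the cluster by default,
         so they must be bundled or pulled in with --packages at submit time. -->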
We are thinking about integrating Apache Spark into our calculation process, where we originally wanted to use Apache Oozie with standard MR or map-only (MO) jobs.
After some research, several questions remain:
Is it possible to orchestrate an Apache Spark process using Apache Oozie? If yes, how?
Is Oozie still necessary, or could Spark handle the orchestration by itself? (Unification seems to be one of Spark's main goals.)
Please consider the following scenarios when answering:
executing a workflow every 4 hours
executing a workflow whenever specific data becomes available
triggering a workflow and configuring it with parameters
Thanks in advance for your answers.
Spark is supported as an action type in Oozie 4.2 and later; see the Spark action docs. The scenarios you mention are standard Oozie features: coordinators cover time-based scheduling and data-availability triggers, and workflows are parameterized through job properties.
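For illustration, a Spark action in a workflow looks roughly like the sketch below; the class name and paths are placeholders, and you should check which spark-action schema version ships with your Oozie:

    <!-- workflow.xml sketch: a single Spark action (names and paths are placeholders). -->
    <action name="spark-node">
      <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>my-spark-step</name>
        <class>com.example.MySparkJob</class>
        <jar>${nameNode}/apps/my-app/lib/my-spark-job.jar</jar>
        <arg>${inputDir}</arg>
        <arg>${outputDir}</arg>
      </spark>
      <ok to="end"/>
      <error to="fail"/>
    </action>

The 4-hour schedule and the data-availability trigger would then be expressed in an Oozie coordinator wrapped around this workflow, and parameters are passed through the usual ${...} job properties.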