I am reading in a book (Professional Hadoop Solutions) that JVM reuse can be enabled by setting the job configuration property mapreduce.job.jvm.numtasks. My question is: do we need to set this in the Driver class?
I tried looking for this configuration on the mapreduce.Job object, and I can't find it. Has this property been moved or renamed in the version of Hadoop I am using, or am I just not looking in the right place? I am using Hadoop version 1.0.3.
I also tried looking for the older property mapred.job.reuse.jvm.num.tasks, and I couldn't find that either.
Thanks!
Your source is referring to the newer configuration names introduced with Hadoop 2.x (YARN). With the shift to YARN, a lot of configuration names were revised. The changes are documented on the official site for the related Hadoop release (in this case version 2.4.0, the release adopted by Amazon's Elastic MapReduce).
It explicitly mentions that the old configuration name mapred.job.reuse.jvm.num.tasks has been replaced by the new name mapreduce.job.jvm.numtasks.
Furthermore, the documentation for the MapReduce default configuration says this about mapreduce.job.jvm.numtasks:
How many tasks to run per jvm. If set to -1, there is no limit.
The default configuration for Hadoop 1.2.1 (whose configuration API is compatible with 1.0.3) can be found on GrepCode, for example.
Regarding your question of where to set this property: it can either be set
for the whole cluster in ${HADOOP_CONF_DIR}/mapred-site.xml,
or you specify it in the configuration of your Job (or JobContext), as long as it is not declared final within your cluster:
job.getConfiguration().set("mapred.job.reuse.jvm.num.tasks", "-1");
You could define it in mapred-site.xml:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
Use it when you have many short-running tasks, where the cost of starting a new JVM is significant compared to the task itself.
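For completeness, here is a minimal driver sketch of the per-job variant. It assumes the old Hadoop 1.x property name and uses the built-in identity Mapper/Reducer as placeholders; the class name and input/output paths are illustrative only:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -1 means: reuse each task JVM for an unlimited number of tasks of this job
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");

        Job job = new Job(conf, "jvm-reuse-example"); // Hadoop 1.x style constructor
        job.setJarByClass(JvmReuseDriver.class);
        job.setMapperClass(Mapper.class);             // identity mapper (placeholder)
        job.setReducerClass(Reducer.class);           // identity reducer (placeholder)
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}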
48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add the HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce.
I know what the above two ways mean, but I don't know how to configure the recommended way.
Could anyone tell me how to configure it in the recommended way?
As the docs show, prior to running hadoop jar, you can export HADOOP_CLASSPATH=$(hbase classpath) and you can use hadoop jar ... -libjars [...]
The true recommended way would be to bundle your HBase dependencies as an uber JAR inside your MapReduce application.
The only caveat is that you need to ensure that your project uses the same (or compatible) hbase-mapreduce client versions as the server.
That way, you don't need any extra configuration, except maybe specifying the hbase-site.xml.
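To make the "let HBase add its dependency jars" idea concrete, here is a hedged driver sketch using TableMapReduceUtil; the table name and class names are placeholders, and you should verify the call signatures against the HBase client version you actually run:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseClasspathExampleDriver {

    // Trivial mapper that just re-emits each row; only here to make the job complete.
    static class PassThroughMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath, so the job knows where the cluster is.
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-classpath-example");
        job.setJarByClass(HBaseClasspathExampleDriver.class);

        // initTableMapperJob wires up the mapper and, by default, also calls
        // TableMapReduceUtil.addDependencyJars(job), which ships the HBase jars
        // with the job instead of requiring them in $HADOOP_HOME/lib.
        TableMapReduceUtil.initTableMapperJob(
                "my_table",                     // placeholder table name
                new Scan(),
                PassThroughMapper.class,
                ImmutableBytesWritable.class,
                Result.class,
                job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The key point is that the HBase jars travel with the job itself, so nothing has to be copied into the Hadoop installation or its classpath permanently.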
I am trying to follow the Apache documentation in order to integrate Prometheus with Apache Hadoop. One of the preliminary steps is to set up an Apache Ozone cluster. However, I am running into issues when running the Ozone cluster alongside Hadoop: it throws a ClassNotFoundException for "org.apache.hadoop.ozone.HddsDatanodeService" whenever I try to start the Ozone Manager or the Storage Container Manager.
I also found that the Ozone 1.0 release is fairly recent and is documented as tested against Hadoop 3.1, while my running Hadoop cluster is version 3.3.0. Now I suspect the version mismatch might be the problem.
The Ozone tarball also ships its own Hadoop config files, but I want to configure Ozone against my existing Hadoop cluster.
Please let me know what the right approach is here. If this cannot be done, please also let me know what a good way is to monitor and extract metrics from Apache Hadoop in production.
I was trying to run a Spark (1.6.0) application that used the com.databricks.spark.csv jar to load a CSV file in yarn-client mode from Eclipse. It was throwing a CSVRelatio$annonfunc$func not found exception. That was resolved by setting the
spark.hadoop.yarn.application.classpath
property in SparkConf.
My question is that the spark.hadoop.yarn.application.classpath property is not listed in any of the official Spark documents. So where can I find all such properties? I know it is a silly question, but there are many beginners who refer to the official documentation (https://spark.apache.org/docs/1.6.0/configuration.html) and are not at all aware of these properties.
They are not listed because they are not Spark properties. The spark.hadoop. prefix is only there so that Spark recognizes these entries and copies them (with the prefix stripped) into org.apache.hadoop.conf.Configuration.
Where should you look for documentation? Check the Hadoop documentation for the corresponding component; for example, for YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
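A small sketch of the mechanism described above (the classpath value is purely illustrative, and it relies on Spark stripping the spark.hadoop. prefix before copying the key into the Hadoop Configuration):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHadoopPrefixExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("spark-hadoop-prefix-example")
                .setMaster("yarn-client")   // Spark 1.6 style master for yarn client mode
                // Everything after "spark.hadoop." is copied into the Hadoop Configuration
                // under the key "yarn.application.classpath" (value here is illustrative).
                .set("spark.hadoop.yarn.application.classpath",
                     "/etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop-hdfs/*,"
                     + "/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-yarn/*");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // The stripped key is now visible on the Hadoop side:
        System.out.println(sc.hadoopConfiguration().get("yarn.application.classpath"));
        sc.stop();
    }
}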
You should also note that Spark has its own classpath-related properties, including:
spark.jars
spark.packages
spark.driver.extraClassPath / spark.executor.extraClassPath
....
So, I would like to create an Apache Spark integration in my Spring application by following this guide provided by Spring (http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html). Now I have a few questions, as it seems that Spark 2.0.1 does not include the spark-assembly jar.
What are my options for proceeding with this, as it seems that the integration is dependent on that jar?
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Is there a way to get the jar with Spark 2.0.1?
Yes, you are right: Spark 2.0.1 does not ship an uber jar the way 1.6.x and below did (e.g. spark-1.6.2-bin-hadoop2.6\lib\spark-assembly-1.6.2-hadoop2.6.0.jar).
Spark 2.0.0+ (see spark-release-2-0-0.html) no longer requires a fat assembly uber jar. However, when you compare the contents of spark-assembly-1.6.2-hadoop2.6.0 with the jar files in spark-2.0.0-bin-hadoop2.7\jars\, you can see almost the same content, with the same classes, packages, etc.
If I am able to find the old jar, would I be able to use it with Spark 2.0.1?
Personally, I don't think so. There might potentially be some problems with backward compatibility, and it is odd to rely on something that was removed in the latest version.
You are right that SparkYarnTasklet needs the assembly jar, because there is some validation in afterPropertiesSet:
@Override
public void afterPropertiesSet() throws Exception {
    Assert.hasText(sparkAssemblyJar, "sparkAssemblyJar property was not set. " +
            "You must specify the path for the spark-assembly jar file. " +
            "It can either be a local file or stored in HDFS using an 'hdfs://' prefix.");
    // ... rest of the method omitted ...
}
But this sparkAssemblyJar is only used in sparkConf.set("spark.yarn.jar", sparkAssemblyJar);
So when you use SparkYarnTasklet without an assembly jar, the program will probably fail on that validation (you can try to extend SparkYarnTasklet and override afterPropertiesSet without the validation).
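A hedged sketch of that workaround is below; the package name is an assumption based on the spring-data-hadoop batch/Spark support, so verify it (and whether afterPropertiesSet does anything else in your version) before relying on this:
// Package name assumed from the spring-data-hadoop batch/Spark support; adjust to your artifact.
import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

// Skips only the sparkAssemblyJar assertion; everything else is inherited unchanged.
public class NoAssemblyCheckSparkYarnTasklet extends SparkYarnTasklet {

    @Override
    public void afterPropertiesSet() throws Exception {
        // Intentionally do not call super.afterPropertiesSet(): the parent's check
        // only asserts that sparkAssemblyJar is set, which Spark 2.0.x no longer needs.
        // If your version of the parent does more in this method, replicate that here.
    }
}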
And this is what the documentation says about making the Spark runtime jars available on YARN:
To make Spark runtime jars accessible from YARN side, you can specify
spark.yarn.archive or spark.yarn.jars. For details please refer to
Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is
specified, Spark will create a zip file with all jars under
$SPARK_HOME/jars and upload it to the distributed cache.
So take a look at the properties spark.yarn.jars and spark.yarn.archive.
Now compare spark.yarn.jar in 1.6.x with spark.yarn.jars in 2.0.0+.
spark.yarn.jar in 1.6.2:
The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to hdfs:///some/path.
spark.yarn.jars in 2.0.1:
List of libraries containing Spark code to distribute to YARN
containers. By default, Spark on YARN will use Spark jars installed
locally, but the Spark jars can also be in a world-readable location
on HDFS. This allows YARN to cache it on nodes so that it doesn't need
to be distributed each time an application runs. To point to jars on
HDFS, for example, set this configuration to hdfs:///some/path. Globs
are allowed.
But this seems to require listing the jars one by one.
In 2.0.0+ there is also spark.yarn.archive, which can be used instead of spark.yarn.jars and provides a way to avoid passing jars one by one: create an archive with all the jars in its root directory.
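For example, a hedged sketch of pointing a job at such an archive (the HDFS path is a placeholder for wherever you upload a zip of everything under $SPARK_HOME/jars):
import org.apache.spark.SparkConf;

public class YarnArchiveConfSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("yarn-archive-sketch")
                // Archive built once from everything under $SPARK_HOME/jars and uploaded
                // to HDFS, so YARN can cache it instead of re-uploading jars per application.
                .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip");
        // ... pass this conf to whatever submits the application (e.g. the tasklet above) ...
    }
}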
I think spring-hadoop will reflect the 2.0.0+ changes in a few weeks, but as a "quick fix" I would probably try to override SparkYarnTasklet and adapt it for 2.0.1 - from what I saw, specifically the execute and afterPropertiesSet methods.
I'm in the process of adding support for Unicode normalization in ES with the help of the ICU analysis plugin. Installing this in a dedicated cluster is relatively easy, but I also need this plugin to be available during testing, where we use a JVM-local node. Since it's a JVM-local node, I can't simply run the commands explained in the plugin documentation. How can I get my plugin to work for this local node?
After digging through the source code of Elasticsearch I figured out the answer, and it is stupidly simple: Just make sure the plugins are in your classpath and ES will pick them up automatically. In my case adding the plugin to my pom.xml was enough.
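For reference, the pom.xml addition might look roughly like this; the artifact coordinates are my assumption for the 1.x-era plugin, and the version has to be matched to your Elasticsearch release:
<!-- Assumed coordinates for the ICU analysis plugin; pick the version that the
     plugin's compatibility table lists for your Elasticsearch version. -->
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-analysis-icu</artifactId>
  <version>${elasticsearch.icu.version}</version> <!-- define to match your ES release -->
  <scope>test</scope> <!-- only needed for the local test node -->
</dependency>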