The yarn-default.xml for Apache Hadoop 3.0.0 shows the default value of yarn.nodemanager.log-dirs as ${yarn.log.dir}/userlogs.
Where is yarn.log.dir defined? Does it have a default value?
I do not find it in any of the default configurations (core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml).
I do not find it mentioned in any of the environment scripts (hadoop-env.sh, httpfs-env.sh, kms-env.sh, mapred-env.sh, yarn-env.sh).
Equally baffling to me is that when I grep the code for "yarn.nodemanager.log-dirs", the only places it shows up are yarn-default.xml and markdown files, not any Java code anywhere. So how does setting yarn.nodemanager.log-dirs do anything?
yarn.log.dir is a Java system property, set via a -D flag.
In yarn-env.sh, you should see YARN_LOG_DIR:
# default log directory and file
if [ "$YARN_LOG_DIR" = "" ]; then
  YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
  ...
fi
YARN_OPTS="$YARN_OPTS -Dyarn.log.dir=$YARN_LOG_DIR"
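If you want the logs somewhere else, overriding the environment variable is enough; the scripts translate it into the system property for you. A minimal sketch, assuming a standard tarball install (the /var/log path below is only an example, not something from the question):

# etc/hadoop/yarn-env.sh -- choose the log directory yourself
export YARN_LOG_DIR=/var/log/hadoop-yarn

# The startup scripts pass this on as -Dyarn.log.dir=/var/log/hadoop-yarn,
# so ${yarn.log.dir}/userlogs in yarn-default.xml resolves to
# /var/log/hadoop-yarn/userlogs.

# Confirm what the running NodeManager actually received:
ps -ef | grep NodeManager | tr ' ' '\n' | grep yarn.log.dir

As for why grepping for "yarn.nodemanager.log-dirs" finds nothing in the Java sources: the code refers to the key through constants in YarnConfiguration that are assembled from a prefix (roughly NM_PREFIX + "log-dirs"), so the full string only appears in yarn-default.xml and the documentation.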
I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool with this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so writing into it should definitely be allowed. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I have tried so far (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir scheme from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid scheme error
Thank you.
It states 'cannot write parent of file', and the parent in your case is /. Take a look at the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
  Path parent = file.getParent();
  if (parent == null || !fs.exists(parent)
      || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
    throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
  }
}
The file printed in the message is, in your case, hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can test the permissions with the hadoop fs command; for example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
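If the touchz fails, a rough way to narrow it down is to walk the path and look at who owns each component. The paths and user below are placeholders for whatever your job actually uses:

# see owner and permissions of the root and the intended parent
hadoop fs -ls /
hadoop fs -ls /dir

# check which user the tool runs as
whoami

# if the parent really is not writable for that user, create/relax it
# with a privileged account (e.g. the hdfs superuser)
sudo -u hdfs hadoop fs -mkdir -p /dir
sudo -u hdfs hadoop fs -chmod 777 /dir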
I solved that problem after days of working on it.
The problem was with the --output-dir hdfs:///hostname/dir/ argument.
First of all, there should not be 3 slashes at the beginning, as I had kept putting in my attempts to make this work, but only 2 (as in any valid HDFS URI). I had actually added the third slash because otherwise the tool throws an invalid-scheme exception; you can easily see in the code that the scheme check is done before the verifyCanWriteParent check.
I had tried to get the hostname by simply running the hostname command on the CentOS machine I was running the tool on, and this was the main issue. When I analyzed the /etc/hosts file, I saw that there were 2 hostnames for the same local IP. I took the second one and it worked. (I also appended the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs)
This error is very confusing because everywhere you look for the namenode hostname, you see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that lets you diagnose the problem and follow a logical path to solving it.
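A quick way to double-check what HDFS itself considers the correct namenode authority, instead of trusting the hostname command (the config path below assumes a typical CDH layout):

# ask the client configuration for the default filesystem URI
hdfs getconf -confKey fs.defaultFS

# or read it straight from core-site.xml
grep -A1 fs.defaultFS /etc/hadoop/conf/core-site.xml

Whatever host and port that prints is what belongs after hdfs:// in --output-dir.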
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository; the master branch is not up to date.
If you just want to run this tool to play with the morphline (using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which requires additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs; it works with a regular user.
The output of a Hive query that uses UDFs ends with these 2 warnings. How do I suppress them? Please note that the 2 warnings come right after the query output, as part of the output.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
hadoop version
Hadoop 2.6.0-cdh5.4.0
hive --version
Hive 1.1.0-cdh5.4.0
If you use beeline instead of the Hive CLI, the warnings go away. Not the best solution, but I'm planning to post to the CDH user group asking the same question to see if it's a bug that can be fixed.
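A minimal beeline invocation would look something like this; the host, port, and file name are placeholders for your own HiveServer2 endpoint and query:

beeline -u jdbc:hive2://hiveserver2-host:10000 --silent=true -f query.hql > output.txt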
This error occurs due to the addition of an assembly jar which contains classes from jcl-over-slf4j.jar (which is causing the stdout messages) and slf4j-log4j12.jar.
You can try a couple of things to begin with:
Try removing the assembly jar, if you are using one.
Look at the following link: https://issues.apache.org/jira/browse/HIVE-12179
This suggests that Hive can be made to add the spark-assembly jar only when HIVE_ADD_SPARK_ASSEMBLY="true".
https://community.hortonworks.com/questions/34311/warning-message-in-hive-output-after-upgrading-to.html :
Although there is a workaround if you want to avoid any other changes: manually remove the 2 lines from the end of the output files using a shell script.
I have tried setting HIVE_ADD_SPARK_ASSEMBLY=false, but it didn't work.
Finally, I found a related question on the Cloudera community. See: https://community.cloudera.com/t5/Support-Questions/Warning-message-in-Hive-output-after-upgrading-to-hive/td-p/157141
You could try the following command; it works for me!
hive -S -d ns=$hiveDB -d tab=$t -d dunsCol=$c1 -d phase="$ph1" -d error=$c2 -d ts=$eColumnArray -d reporting_window=$rDate -f $dir'select_count.hsql' | grep -v "^WARN" > $gOutPut 2> /dev/null
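Stripped of the question-specific variables, the pattern is just to run Hive in silent mode, send stderr to /dev/null, and filter the warning lines out of stdout (file names here are placeholders):

hive -S -f query.hql 2>/dev/null | grep -v "^WARN" > query_output.txt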
What are the priorities of the following 3 options for setting the number of reducers? In other words, if all three are set, which one will be taken into account?
Option1:
setNumReduceTasks(2) within the application code
Option2:
-D mapreduce.job.reduces=2 as command line argument
Option3:
through $HADOOP_CONF_DIR/mapred-site.xml file
<property>
  <name>mapreduce.job.reduces</name>
  <value>2</value>
</property>
According to Hadoop: The Definitive Guide:
The -D option is used to set the configuration property with key color to the value yellow. Options specified with -D take priority over properties from the configuration files. This is very useful because you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.
You have them ranked in priority order: option 1 will override option 2, and option 2 will override option 3. In other words, option 1 will be the one used by your job in this scenario.
First Priority: configuration parameters passed on the command line (while submitting the MR application)
Second Priority: configuration parameters set in the application code
Third Priority: default parameters read from the configuration files, such as core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, and log4j.properties
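A quick way to see the layering for yourself is to run the same job with and without the flag and check mapreduce.job.reduces in the submitted job's configuration (jar name, driver class, and paths below are placeholders):

# mapred-site.xml supplies the client-side default (option 3):
#   mapreduce.job.reduces = 2

# override it for a single run on the command line (option 2)
hadoop jar my-job.jar com.example.MyDriver -D mapreduce.job.reduces=4 /in /out

# if the driver also calls job.setNumReduceTasks(8) after the generic
# options are parsed (option 1), that value is what the job ends up with,
# because the last write to the Configuration wins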
I've tried changing the memory limits using the -Xmx flag in catalina.sh, as I've done in the past for Linux installs, but when I access psi-probe (previously Lambda Probe) it claims I have a limit of 1.78 GB. I've tried setting the max limit to 4096m and 6144m, with no effect. The machine I'm running it on has adequate memory to support these configurations, but the limit is still reported as 1.78 GB.
I have a particularly heavy application that keels over with a heap space error at approx 1.6 GB.
Any suggestions as to why this config is being ignored or where it might be overwritten?
EDIT:
Contents of setenv.sh are:
#!/bin/sh
#
# setenv.sh
#
# You may edit this script to set defaults for such variables as JAVA_HOME.
#
# For Apple Java, the $JAVA_HOME is not well respected by the JNI launching code
# in jsvc. On Apple Java systems, you are better off setting JAVA_JVM_VERSION
# to the proper java name, such as 1.4, 1.5, or CurrentJDK, and let JAVA_HOME
# be calculated from that.
#
# First source the conf/setenv.local file to allow user to configure environment
# in an even more minimal fashion.
if [ -r "$CATALINA_HOME/conf/setenv.local" ]; then
  . "$CATALINA_HOME/conf/setenv.local"
fi

# Attempt to set JAVA_HOME if it's not already set
if [ -z "$JAVA_HOME" ]; then
  # Set JAVA_JVM_VERSION and JAVA_HOME for Darwin
  if [ `uname -s` = "Darwin" ]; then
    # Look for a java version specified by JAVA_JVM_VERSION, falling back to current version
    # Set JAVA_HOME to reflect the version
    for jversion in $JAVA_JVM_VERSION CurrentJDK ; do
      jhome="/System/Library/Frameworks/JavaVM.framework/Versions/${jversion}/Home"
      if [ -z "$JAVA_HOME" -a -d "${jhome}" ]; then
        # Get the actual version that any symlink points to, since
        # jni doesn't like JAVA_JVM_VERSION set to CurrentJDK
        saved=`pwd`
        cd "/System/Library/Frameworks/JavaVM.framework/Versions/${jversion}"
        actualvers=$(basename $(pwd -P))
        cd $saved
        export JAVA_JVM_VERSION=${actualvers}
        export JAVA_HOME=${jhome}
      fi
    done
  fi
fi
setenv.local:
#!/bin/sh
#
# setenv.local
#
# This script, if present, is executed by tomcatctl through setenv.sh
# in order to set up any environment prior to execution of tomcat.
#
# For Apple Java, JAVA_JVM_VERSION may be used to specify a particular
# java version to run. It should be something like 1.4, 1.5, or CurrentJDK.
#export JAVA_JVM_VERSION=1.5
catalina.sh (partial, the file is quite long):
#JAVA_OPTS="-Xmx4096m -Xms4096m -XX:PermSize=6144m -XX:MaxPermSize=6144m $JAVA_OPTS"
CATALINA_OPTS="-Xmx4096m -Xms4096m -XX:PermSize=6144m -XX:MaxPermSize=6144m $CATALINA_OPTS"
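One thing worth checking is whether the flags in CATALINA_OPTS are actually reaching the JVM; a couple of hedged ways to verify, assuming a standard Tomcat layout and a JDK on the PATH:

# show the full command line of the running Tomcat JVM
ps aux | grep '[o]rg.apache.catalina.startup.Bootstrap'

# or, with a JDK installed, list JVM arguments of running Java processes
jps -lvm

If -Xmx4096m does not show up there, the options are being overridden somewhere, or the script being edited is not the one actually used to start Tomcat.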
I want to pass multiple configuration parameters to my Hadoop job through GenericOptionsParser.
With "-D abc=xyz" I can pass one argument and retrieve it from the Configuration object, but I am not able to pass multiple arguments.
Is it possible to pass multiple arguments? If yes, how?
I passed the parameters as -D color=yellow -D number=10
and had the following code in the run() method:
String color = getConf().get("color");
System.out.println("color = " + color);
String number = getConf().get("number");
System.out.println("number = " + number);
The following was the output in the console:
color = yellow
number = 10
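For completeness, the full submission would look roughly like this (jar name, driver class, and input/output paths are placeholders):

hadoop jar my-job.jar com.example.MyDriver -D color=yellow -D number=10 /input /output

Note that the -D values only end up in getConf() if the driver implements Tool and is launched through ToolRunner.run(), which is what gives GenericOptionsParser a chance to apply them to the Configuration.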
I recently ran in to this issue after upgrading from Hadoop 1.2.1 to Hadoop 2.4.1. The problem is that Hadoop's dependency on commons-cli 1.2 was being omitted due to a conflict with commons-cli 1.1 that was pulled in from Cassandra 2.0.5.
After a quick look through the source it looks like commons-cli options that have an uninitialized number of values (what Hadoop's GenericOptionsParser does) default to a limit of 1 in version 1.1 and no limit in 1.2.
I hope that helps!
I tested passing multiple parameters and I used the -D flag multiple times:
$HADOOP_HOME/bin/hadoop jar /path/to/my.jar -D mapred.heartbeats.in.second=80 -D mapred.map.max.attempts=2 ...
Doing this changed the values to what I specified in the Job's configuration.