Hadoop command line -D options not working - hadoop

I am trying to pass a variable (not a property) using the -D command line option in Hadoop, like -Dmapred.mapper.mystring=somexyz. I am able to set a conf property in the Driver program and read it back in the mapper.
So I could use that to pass my string as an additional parameter and set it in the Driver, but I want to see whether the -D option can be used to do the same.
My command is:
$HADOOP_HOME/bin/hadoop jar /home/hduser/Hadoop_learning_path/toolgrep.jar /home/hduser/hadoopData/inputdir/ /home/hduser/hadoopData/grepoutput -Dmapred.mapper.mystring=somexyz
Driver program
String s_ptrn=conf.get("mapred.mapper.regex");
System.out.println("debug: in Tool Class mapred.mapper.regex "+s_ptrn + "\n");
Gives NULL
BUT this works
conf.set("DUMMYVAL","100000000000000000000000000000000000000"); in driver is read properly in mapper by get method.
My question is: if all of the Internet is saying I can use the -D option, then why can't I? Is it that this cannot be used for arbitrary arguments, only for properties, which we can read by putting them in a file that I then read in the driver program and use?
Something like
Configuration conf = new Configuration();
conf.addResource("~/conf.xml");
in the driver program, and this is the only way?

As Thomas wrote, you are missing the space. You are also passing the variable mapred.mapper.mystring in your CLI, but in the code you are trying to get mapred.mapper.regex. If you want to use the -D parameter, you should be using the Tool interface. More about it is here - Hadoop: Implementing the Tool interface for MapReduce driver.
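For illustration, a minimal Tool-based driver could look like the sketch below (the class name GrepDriver is assumed here, not taken from the question); ToolRunner puts the -D values into the Configuration before run() is called:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GrepDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // -D mapred.mapper.mystring=somexyz is already visible here
        String myString = conf.get("mapred.mapper.mystring");
        System.out.println("mapred.mapper.mystring = " + myString);
        // ... set up the Job with args[0] as input and args[1] as output, then submit it
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new GrepDriver(), args));
    }
}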
Or you can parse your CLI arguments like this:
@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    String yourVariable = null;
    int i = 0;
    while (i < otherArgs.length) {
        if (otherArgs[i].equals("-x")) {
            // Save your CLI argument
            yourVariable = otherArgs[++i];
        }
        i++;
    }
    // then save yourVariable into conf for use in the map phase
    return 0;
}
Then your command can look like this:
$HADOOP_HOME/bin/hadoop jar /home/hduser/Hadoop_learning_path/toolgrep.jar /home/hduser/hadoopData/inputdir/ /home/hduser/hadoopData/grepoutput -x yourVariable
Hope it helps

To use the -D option with the hadoop jar command correctly, the following syntax should be used:
hadoop jar {hadoop-jar-file-path} {job-main-class} {generic options, e.g. -D property=value} {input-directory} {output-directory}
Hence the -D option should be placed after the job's main class name, i.e. at the third position. When we issue the hadoop jar command, the hadoop script invokes the RunJar class's main(). This main() parses the first argument to put the job JAR file on the classpath and uses the second argument to invoke the job class's main().
Once the job class's main() is called, control is handed to GenericOptionsParser, which first parses the generic command line arguments (if any), sets them in the job's Configuration object, and then the remaining arguments (i.e. the input and output paths) are passed on to the job class's run().
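Applied to the command from the question, that means moving the -D option (with the space the answer above mentions) in front of the input and output paths; the jar's manifest is assumed to supply the main class, as in the original command:
$HADOOP_HOME/bin/hadoop jar /home/hduser/Hadoop_learning_path/toolgrep.jar -D mapred.mapper.mystring=somexyz /home/hduser/hadoopData/inputdir/ /home/hduser/hadoopData/grepoutput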

Related

Getting a date in a properties file in string format

I want to get a date from a key dynamically, which will be used by a Java program to perform some tasks.
I have to get the values from the property file into Java; I cannot do it the other way around.
So basically the value for this key, job.date=2022-03-23, I can get through date -d tomorrow "+%Y-%m-%d". But this works fine only when job.date is accessed from a shell script; it gives a parsing error when accessed from the Java class.
So I am looking for a snippet that Java understands, or a way to override the value while executing the Java class from the jar.
You should use (if this is the property and value you want)
java -Djob.date=$(date -d tomorrow +"%Y-%m-%d") ...
The above gave me an exception: error loading main class.
It worked with the following snippet:
sed -i "s/\(testauto\.as\.of\.date=\).*\$/\1${tomorrow}/" abc.properties
I added this line in the shell script that was calling the Java class.
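If the value is passed in with -D as suggested above, reading it on the Java side could look like this sketch (the class name JobDateDemo is illustrative; only the job.date key comes from the question):
import java.time.LocalDate;

public class JobDateDemo {
    public static void main(String[] args) {
        // value supplied on the command line, e.g. -Djob.date=2022-03-24
        String raw = System.getProperty("job.date");
        LocalDate jobDate = LocalDate.parse(raw); // expects the ISO format yyyy-MM-dd
        System.out.println("Running for date: " + jobDate);
    }
}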

How to determine Pig Execution Mode programmatically

How can we determine whether Pig is running in Local Mode or MapReduce Mode? Are there any specific commands to find it out?
Why do you need this?
pig -x local or pig -x mapreduce is the command-line option for the two modes.
And programmatically we do:
PigServer pigServer = new PigServer("local");
PigServer pigServer = new PigServer("mapreduce");
I think we can log it.
There may be better practices.
If you're invoking Pig programmatically, but for some reason don't know the run mode that was chosen when Pig was started, you could call getPigContext().getExecType() on the PigServer instance.
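For example (a sketch, assuming you already hold a PigServer instance named pigServer; the relevant classes are org.apache.pig.PigServer and org.apache.pig.ExecType):
ExecType execType = pigServer.getPigContext().getExecType();
boolean isLocal = (execType == ExecType.LOCAL);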
If you need to know the run mode within a Pig script, you could access the client-side command line arguments and parse the run mode from within a UDF as follows:
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.impl.util.UDFContext;

// inside the UDF: read the client-side command line and look for "-x local"
UDFContext context = UDFContext.getUDFContext();
Properties props = context.getClientSystemProps();
String commandArgs = props.getProperty("pig.cmd.args");
Pattern pattern = Pattern.compile("-x\\s+local");
Matcher matcher = pattern.matcher(commandArgs);
boolean isLocal = matcher.find();

How to disable hadoop combiner?

In the wordcount example, the combiner is explicitly set with
job.setCombinerClass(IntSumReducer.class);
I would like to disable the combiner so that the output of mapper is not processed by the combiner. Is there a way to do that using MR config files (i.e. without modifying and recompiling the wordcount code)?
Thanks
Suppose this is your command line
hadoop jar your_hadoop_job.jar your_mr_driver \
command_line_arg1 command_line_arg2 command_line_arg3 \
-libjars all_your_dependency_jars
Here the following parameters
command_line_arg1
command_line_arg2
command_line_arg3
will be passed on to your main method as args[0], args[1] and args[2] respectively. Assuming args[0] and args[1] are used for identifying the input and output folders, you can use args[2] to pass a boolean-style flag (like '1', 'true' or 'yes') to indicate whether you want to use the combiner, and set the combiner class accordingly. Example below (by default it won't set the combiner class):
if (args.length > 2 && ("yes".equalsIgnoreCase(args[2]) || "true".equalsIgnoreCase(args[2]) || "1".equals(args[2]))) {
    job.setCombinerClass(IntSumReducer.class);
}
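For example, with the command line from above, hadoop jar your_hadoop_job.jar your_mr_driver input_dir output_dir yes (where input_dir and output_dir stand in for your real paths) would enable the combiner, while passing anything else as the third argument, or leaving it out, keeps the combiner disabled.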

How do you create a Bash function that accepts files of a specific type as arguments?

So far, I know that you have to create a function in order to pass arguments.
However, how do you denote the type of the argument?
For instance, if you want to compile a Java class file and then run the resulting Java file (without having to type the file name twice to distinguish between the extensions each time), how do you let the function know that the names belong to files of different types?
Let's say this is our function:
compileAndRun()
{
javac $1
java $2 # basically, we want to make this take the same argument
# (because the names of the *.class and *.java files are the same)
}
So, instead of typing:
compileAndRun test.class test.java
We wanna just type this:
compileAndRun test
Any help along with any extraneous information you wanna throw in would be much appreciated.
Just use $1 twice. It is safer to connect the two commands with &&, so java is not run if the compilation is not successful.
function compile_n_run () {
    javac "$1".java && java "$1"
}
Arguments to bash functions don't really have types; they are just strings, and it's up to you to use them appropriately. In this case, it's fairly simple to write a function which takes a Java source file, compiles it, and runs the resulting output.
compile_n_run () {
    source=$1
    class_name="${source%.java}"   # e.g. test.java -> test
    javac "$source" && java "$class_name"
}
$ compile_n_run test.java
I chose to require the full Java source name because it's a little friendlier with auto-completion; you don't have to remove the .java from the command-line, rather you let the function do that for you. (And otherwise, this answer would be identical to choroba's).

Where can I see the print string when I add the code "System.out.println("test string");" in the NameNode.java file?

In the NameNode.java file
I tried to add test code that prints a string in the main() function, as below:
System.out.println("test string");
Where can I see the print string?
*The code compiles successfully, and the newly generated file (hadoop-core-1.0.4.jar) was used to replace the old one on each node.
*All daemons have been restarted, but the print string does not appear on the terminal.
If you've restarted your name node service, these sysouts will probably go to the name node log file (which can be in a variety of locations depending on your Hadoop distro / install). The hadoop-daemon.sh file defines the file as follows:
$HADOOP_LOG_DIR/hadoop-$HADOOP_IDENT_STRING-$command-$HOSTNAME.out
So you'll find it in HADOOP_LOG_DIR, under the name hadoop-$HADOOP_IDENT_STRING-namenode-$HOSTNAME.out - where the other variables will be replaced depending on the runtime user and hostname of your namenode service.
I would suggest you use the predefined logger, rather than System.err / System.out:
LOG.info("log message");
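For reference, a rough sketch of how that looks inside NameNode.java in Hadoop 1.x, which already declares a commons-logging logger (the exact declaration may differ in your source tree):
// already present near the top of NameNode.java, roughly:
// public static final Log LOG = LogFactory.getLog(NameNode.class.getName());

// inside main(), instead of System.out.println("test string"):
LOG.info("test string"); // ends up in the namenode logs under HADOOP_LOG_DIR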
