Can we set multiple generic arguments with the -D option in GenericOptionsParser? - hadoop

I want to pass multiple configuration parameters to my Hadoop job through GenericOptionsParser.
With "-D abc=xyz" I can pass one argument and able to retrieve the same from the configuration object but I am not able to pass the multiple argument.
Is it possible to pass multiple argument?If yes how?

I passed the parameters as -D color=yellow -D number=10
and had the following code in the run() method:
String color = getConf().get("color");
System.out.println("color = " + color);
String number = getConf().get("number");
System.out.println("number = " + number);
The following was the output in the console:
color = yellow
number = 10
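For completeness, here is a minimal sketch of a driver matching the snippet above (the class name MultiArgJob is made up for illustration). ToolRunner invokes GenericOptionsParser for you, so every -D key=value ends up in the same Configuration before run() is called:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiArgJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Both -D color=yellow and -D number=10 land in the same Configuration.
        System.out.println("color = " + getConf().get("color"));
        System.out.println("number = " + getConf().get("number"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, ...) before
        // passing the remaining arguments to run().
        System.exit(ToolRunner.run(new Configuration(), new MultiArgJob(), args));
    }
}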

I recently ran in to this issue after upgrading from Hadoop 1.2.1 to Hadoop 2.4.1. The problem is that Hadoop's dependency on commons-cli 1.2 was being omitted due to a conflict with commons-cli 1.1 that was pulled in from Cassandra 2.0.5.
After a quick look through the source it looks like commons-cli options that have an uninitialized number of values (what Hadoop's GenericOptionsParser does) default to a limit of 1 in version 1.1 and no limit in 1.2.
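If you want to confirm which commons-cli version actually ended up on your job's classpath, a small sketch like this (the class name CliVersionCheck is made up) prints the jar that the Options class was loaded from:
import org.apache.commons.cli.Options;

public class CliVersionCheck {
    public static void main(String[] args) {
        // Prints the location of the jar that provided commons-cli,
        // e.g. .../commons-cli-1.2.jar if the right version won the conflict.
        System.out.println(Options.class.getProtectionDomain()
                .getCodeSource().getLocation());
    }
}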
I hope that helps!

I tested passing multiple parameters by using the -D flag multiple times:
$HADOOP_HOME/bin/hadoop jar /path/to/my.jar -D mapred.heartbeats.in.second=80 -D mapred.map.max.attempts=2 ...
Doing this changed the values to what I specified in the Job's configuration.

Related

Consistent Hive and Impala Hash?

I am looking for a consistent way to hash something in both the Hive Query Language and the Impala Query Language, where the hashing function produces the same value regardless of whether it is run in Hive or in Impala. To clarify, I want something like some_hive_hash_thing(A) = some_other_impala_hash_thing(A).
For Hive, I know there is hash() which uses MD5 (or any of the commands here).
For Impala, I know there is fnv_hash() which uses the FNV algorithm. I know that Hive and Impala have their own hashing functions, but they are completely different from one another.
Ideally, I am looking for a way to do fnv_hash in Hive, or a way to do MD5 in Impala. Does anyone have any suggestions?
This is a late answer, but let's keep it here for someone else who may find it helpful.
"A way to do MD5 in Impala": yes, there is one. In recent releases you can use Hive's built-in UDFs from Impala (I'm using CDH 5.12 and it's working well with Impala 2.9 and Hive 1.1).
You can find the list of built-in functions here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Or you can simply run
SHOW FUNCTIONS;
in your Hive console:
beeline -u jdbc:hive2://localhost:10000
So let's walk through adding the MD5 function from Hive to Impala.
DESCRIBE FUNCTION md5;
This confirms the function exists and shows the input and return types; here we see that md5(string) takes a string argument and returns a string.
Next we need to find the hive-exec jar that contains our MD5 class, using the jar command:
/opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5
The jar command is usually under /bin in your Java installation, if it's not already configured in your environment variables.
You can find the hive-exec-X-X.jar file in ../lib/hive/lib/; if you can't find it, just use the locate command.
The output is something like:
/opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5
org/apache/hadoop/hive/ql/udf/UDFMd5.class
Save that path for later, but replace the '/' with '.' and remove the '.class',
like this:
org.apache.hadoop.hive.ql.udf.UDFMd5
Copy the jar file to a directory accessible from HDFS; you may rename it for convenience (I'm going to name it hive-exec.jar).
cp /lib/hive/lib/hive-exec.jar /opt/examples/
chown -R hdfs /opt/examples/
Then create a place to put your jars in HDFS:
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/hive_jars
Copy your jar file to HDFS using:
sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/hive-exec.jar /user/hive/warehouse/hive_jars/
Now you just have to go to impala-shell, connect to a database, and create your function using the HDFS path to the jar and, as the symbol, the class path we converted earlier.
Impala-shell>use udfs;
create function to_md5(string) returns string location '/user/hive/warehouse/hive_jars/hive-exec.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFMd5';
There you go; you can now use it like any Impala function:
select to_md5('test');
+----------------------------------+
| udfs.to_md5('test')              |
+----------------------------------+
| 098f6bcd4621d373cade4e832627b4f6 |
+----------------------------------+
show functions;
Query: show functions
+-------------+----------------------+-------------+---------------+
| return type | signature            | binary type | is persistent |
+-------------+----------------------+-------------+---------------+
| STRING      | to_md5(STRING)       | JAVA        | false         |
+-------------+----------------------+-------------+---------------+
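As a quick sanity check that the registered function really matches Hive's md5(), the value above is just the standard MD5 hex digest, which you can reproduce locally. A small sketch using commons-codec (typically already on the Hadoop/Hive classpath; the class name MD5Check is made up):
import org.apache.commons.codec.digest.DigestUtils;

public class MD5Check {
    public static void main(String[] args) {
        // Prints 098f6bcd4621d373cade4e832627b4f6, the same digest returned
        // by Hive's md5('test') and the to_md5('test') UDF registered above.
        System.out.println(DigestUtils.md5Hex("test"));
    }
}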

Pyro4 configuration doesn't change

I put the Pyro4 configuration like this at the start of my code:
Pyro4.config.THREADPOOL_SIZE = 1
Pyro4.config.THREADPOOL_SIZE_MIN = 1
I checked: if I try to run two clients at the same time, it says 'rejected: no free workers, increase server threadpool size', so the setting appears to be working. But when I open the console to check the Pyro configuration using "python -m Pyro4.configuration", it returns:
THREADPOOL_SIZE = 40
THREADPOOL_SIZE_MIN = 4
Does someone know why?
When you run python -m Pyro4.configuration, it simply prints the default settings (influenced only by any environment variables you may have set). It runs as a separate Python process, so it cannot know about the settings you changed inside your own code.

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool with this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I have tried so far (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir scheme from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid scheme error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
    Path parent = file.getParent();
    // Fails when the parent is missing or the current user has no write permission on it.
    if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
        throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
    }
}
The path printed in the message is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can test the permissions with the hadoop fs command; for example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
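If you want to see exactly which parent path the tool is going to verify, a small sketch like this (the class name ParentCheck is made up; it only needs hadoop-common on the classpath) prints Path.getParent() for both forms of the URI:
import org.apache.hadoop.fs.Path;

public class ParentCheck {
    public static void main(String[] args) {
        // Print the parent directory that verifyCanWriteParent() would check
        // for each way of writing the output URI.
        String[] uris = {"hdfs:///hostname/dir/", "hdfs://hostname/dir/"};
        for (String uri : uris) {
            Path p = new Path(uri);
            System.out.println(uri + " -> parent: " + p.getParent());
        }
    }
}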
I solved the problem after days of working on it.
The problem is with the line --output-dir hdfs:///hostname/dir/.
First of all, there should not be 3 slashes at the beginning, as I had put in my repeated attempts to make this work; there should be only 2 (as in any valid HDFS URI). I had actually put 3 slashes because otherwise the tool throws an invalid scheme exception! You can easily see in the code that the scheme check is done before the verifyCanWriteParent check.
I had tried to get the hostname by simply running the hostname command on the CentOS machine I was running the tool on. That was the main issue. When I analyzed the /etc/hosts file, I saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs)
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that lets you diagnose the problem and follow a logical path to solve it.
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository. The master is not up to date.
If you just want to run this tool to play with the morphline (using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which involves additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

Hive output contains these 2 warnings at the end. How do I suppress these 2 warnings?

The output of Hive queries that use UDFs contains these 2 warnings at the end. How do I suppress these 2 warnings? Please note that the 2 warnings come right after the query results, as part of the output.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
hadoop version
Hadoop 2.6.0-cdh5.4.0
hive --version
Hive 1.1.0-cdh5.4.0
If you use beeline instead of Hive the error goes away. Not the best solution, but I'm planning to post to the CDH user group asking the same question to see if it's a bug that can be fixed.
This error occurs because an assembly jar was added that contains classes from jcl-over-slf4j.jar (which is causing the stdout messages) and slf4j-log4j12.jar.
You can try couple of things to begin with:
Try removing the assembly jar, in case you are using one.
Look at the following link: https://issues.apache.org/jira/browse/HIVE-12179
This suggests that there is a flag in Hive so that spark-assembly is loaded only if HIVE_ADD_SPARK_ASSEMBLY = "true".
https://community.hortonworks.com/questions/34311/warning-message-in-hive-output-after-upgrading-to.html :
There is also a workaround that avoids any other changes: manually remove the 2 lines from the end of the output files using a shell script.
I tried setting HIVE_ADD_SPARK_ASSEMBLY=false, but it didn't work.
Finally, I found a question posted on the Cloudera community. See: https://community.cloudera.com/t5/Support-Questions/Warning-message-in-Hive-output-after-upgrading-to-hive/td-p/157141
You could try the following command; it works for me!
hive -S -d ns=$hiveDB -d tab=$t -d dunsCol=$c1 -d phase="$ph1" -d error=$c2 -d ts=$eColumnArray -d reporting_window=$rDate -f $dir'select_count.hsql' | grep -v "^WARN" > $gOutPut 2> /dev/null

MapReduce program not producing the required output in distributed mode

I need some help with my MapReduce code.
The code runs perfectly in Eclipse and in standalone mode, but when I package the code and try running it locally in pseudo-distributed mode, the output is not what I expect.
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 0
These are the values I get,
whereas when I run the same code in Eclipse or in standalone mode with the same config and input file I get:
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 4
Can anyone tell me what's wrong?
I tried both ways of building the .jar file: from Eclipse -> Export -> Runnable JAR, and from the terminal as well (javac -classpath hadoop-core-1.0.4 -d classes mapredcode.java && jar -cvf mapredcode.jar -C classes/ .).
And how do I debug this?
Are you using a combiner?
If yes, does the combiner produce output of the same type as the mapper?
In Hadoop, the combiner is run at the framework's discretion, and it may not be running in pseudo-distributed mode in your case.
The combiner in itself is nothing but a reducer that is used to lower the network traffic.
The code should be written such that even if the combiner does not run, the reducer still gets the expected key/value format from the mapper.
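To illustrate that constraint, here is a minimal sketch in the classic word-count pattern (the class names WordCountJob, TokenMapper and SumReducer are made up): the combiner is the same class as the reducer and consumes/emits exactly the map output types, so the job gives the same result whether or not the combiner runs.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    // Mapper emits <Text, IntWritable>, which is exactly what the combiner
    // and the reducer below consume.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Used both as combiner and reducer: input and output types are identical,
    // so the result is the same whether the combiner runs zero, one, or many times.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional optimization only
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}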
Hope it helps.
