hive output consists of these 2 warnings at the end. How do I suppress these 2 warnings - hadoop

Hive query output that is using UDFs consists of these 2 warnings at the end. How do I suppress these 2 warnings. Please note that the 2 warnings come right after the output as part of output.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
hadoop version
Hadoop 2.6.0-cdh5.4.0
hive --version
Hive 1.1.0-cdh5.4.0

If you use beeline instead of Hive the error goes away. Not the best solution, but I'm planning to post to the CDH user group asking the same question to see if it's a bug that can be fixed.

This error occurs due to adding of assembly jar which which contains classes from icl-over-slf4j.jar (which is causing the stdout messages) and slf4j-log4j12.jar.
You can try couple of things to begin with:
Try removing the assembly jar, in case if using.
Look at the following link: https://issues.apache.org/jira/browse/HIVE-12179
This suggest that we can trigger a flag in Hive where spark-assembly is loaded only if HIVE_ADD_SPARK_ASSEMBLY = "true".
https://community.hortonworks.com/questions/34311/warning-message-in-hive-output-after-upgrading-to.html :
Although there is a workaround if to avoid any end time changes and that is to manually remove the 2 lines from the end of the files using shell script.

Have tried to set HIVE_ADD_SPARK_ASSEMBLY=false, but it didn't work.
Finally, I found a post question at Cloudera community. See: https://community.cloudera.com/t5/Support-Questions/Warning-message-in-Hive-output-after-upgrading-to-hive/td-p/157141
You could try the follow command, it works for me!
hive -S -d ns=$hiveDB -d tab=$t -d dunsCol=$c1 -d phase="$ph1" -d error=$c2 -d ts=$eColumnArray -d reporting_window=$rDate -f $dir'select_count.hsql' | grep -v "^WARN" > $gOutPut 2> /dev/null

Related

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file and I tried to run that tool using that command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I tried until now (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir schema from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid schema error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
Path parent = file.getParent();
if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
}
}
In the message printed is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally you can try the permissions with hadoop fs command, for example you can try to create a zero length file in the path:
hadoop fs -touchz /test-file
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there are not 3 slashes at the beginning as I put in my continuous trying to make this work, there are only 2 (as in any valid HDFS URI). Actually I put 3 slashes because otherwise, the tool throws an invalid schema exception! You can easily see in this code that the schema check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the Cent OS machine that I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and I saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that you can diagnose the problem and take a logical path to solve it.
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the cloudera version that you are running and select the same branch on the official repository. The master is not up to date.
If you want to just run this tool to play with the morphline (by using the --dry-run option) without connecting to Solr and playing with it, you can't. You have to specify a Zookeeper endpoint and a Solr collection or a solr config directory, which involves additional work to research on. This is something that can be improved to this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

Bash commands putting out extra information which results into issues with scripts

Okay, hopefully I can explain this correctly as I have no idea what's causing this or how to resolve this.
For some reason bash commands (on a CentOS 6.x server) are displaying more information than "normally" and that causes issues with certain scripts. I have no clue if there is a name for this, but hopefully someone knows a solution for this.
First example.
Correct / good server:
[root#goodserver ~]# vzctl enter 3567
entered into CT 3567
[root#example /]#
(this is the correct behaviour)
Incorrect / bad server:
[root#badserver /]# vzctl enter 3127
Entering CT
entered into CT 3127
Open /dev/pts/0
[root#example /]#
With the "bad" server it will display more information as usual, like:
Entering CT
Open /dev/pts/0
It's like it parsing extra information on what it's doing.
Ofcourse the above is purely something cosmetic, however with several bash scripts we use, these issues are really issues.
A part of the script we use, uses the following command (there are more, but this is mainly a example of what's wrong):
DOMAIN=`vzctl exec $VEID 'hostname -d'`
The result of the above information is parsed in /etc/named.conf.
On the GOOD server it would be added in the named.conf like this:
zone "example.com" {
type master;
file "example.com";
allow-transfer {
200.190.100.10;
200.190.101.10;
common-allow-transfer;
};
};
The above is correct.
On the BAD server it would be added in the named.conf like this:
zone "Executing command: hostname -d
example.com" {
type master;
file "Executing command: hostname -d
example.com";
allow-transfer {
200.190.100.10;
200.190.101.10;
common-allow-transfer;
};
};
So it's add stuff of the action it does, in this example "Executing command: hostname -d"
Another example here when I run the command on a good server and on the bad server.
Bad server:
[root#bad-server /]# DOMAIN=`vzctl exec 3333 'hostname -d'`
[root#bad-server /]# echo $DOMAIN
Executing command: hostname -d example.com
Good server:
[root#good-server ~]# DOMAIN=`vzctl exec 4444 'hostname -d'`
[root#good-server ~]# echo $DOMAIN
example.com
My knowledge is limited, but I have tried several things checking rsyslog and the grub.conf, but nothing seems out of the ordinary.
I have no clue why it's displaying the extra information.
Probably it's something simple / stupid, but I have been trying to solve this for hours now and I really have no clue...
So any help is really appreciated.
Added information:
Both servers use: kernel.printk = 7 4 1 7
(I don't know if that's useful)
Well (thanks to Aaron for pointing me in the right direction) I finally found the little culprit which was causing all the issues I experienced with this script (which worked for every other server, so no need to change that obviously).
The issues were caused by the VERBOSE leven set in vz.conf (located in /etc/vz/ directory). There is an option in there called "VERBOSE" and in my case it was set to 3.
According to OpenVZ's website it does the following:
Increments logging level up from the default. Can be used multiple times.
Default value is set to the value of VERBOSE parameter in the global
configuration file vz.conf(5), or to 0 if not set by VERBOSE parameter.
After I changed VERBOSE=3 to VERBOSE=0 my script worked fine once again (as it did for every other server). :-)
So a big shoutout to Aaron for pointing me in the right direction. The answer is easy when you know where to look!
Sorry to say, but I am kinda disappointed by ndim's reaction. This is the 2nd time he was very unhelpful and rude in his response after that. He clearly didn't read the issue I posted correctly. Oh well.
I would make sure to properly parse the output of the command. In this case, we are only interested in lines of the form
entered into CT 12345
One way of doing this would be to pipe everything through sed and having sed print only the number when the line looks as above (untested, and I always forget which braces/brackets/parens need a backslash in front of them):
whateverthecommand | sed -n 's/^entered into CT ([0-9]{1,})$/\1/p'

Vowpal Wabbit: make test failing for me at test 59

For some reason I am having trouble with the make test statement while installing ```Vowpal Wabbit``. I am getting the following error:
RunTests: test 59: '/usr/bin/timeout 20 ../vowpalwabbit/vw -d train-sets/argmax_data -k -c --passes 20 --search_rollout oracle --search_alpha 1e-8 --search_task argmax --search 2 --holdout_off' failed (exitcode=1)
Anyone have a clue what this could be?
Just run the command which failed (in single quotes) directly from the test directory, and the reason would become obvious.
It is missing data file:
Reading datafile = test/train-sets/argmax_data
can't open: test/train-sets/argmax_data, error = No such file or directory
vw: std::exception
The issue was introduced in a recent check-in and should soon be fixed (hopefully).
Update (2014-05-31): fixed in the most recent commit.

Passing directories to hadoop streaming : some help needed

The context is that I am trying to run a streaming job on Amazon EMR (the web UI) with a bash script that I run like:
-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper
s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE
The input directory has sub-directories in it and these sub-directories have gzipped data files.
The relevant part of mapperScript.sh that fails is :
for filename in "$input"/*; do
dir_name=`dirname $filename`
fname=`basename $filename`
echo "$fname">/dev/stderr
modelname=${fname}.model
modelfile=$model_location/$modelname
echo "$modelfile">/dev/stderr
inputfile=$dirname/$fname
echo "$inputfile">/dev/stderr
outputfile=$output/$fname
echo "$outputfile">/dev/stderr
# Will do some processing on the files in the sub-directories here
done # this is the loop for getting input from all sub-directories
Basically, I need to read the sub-directories in streaming mode and when I run this, hadoop complains saying :
2013-03-01 10:41:26,226 ERROR
org.apache.hadoop.security.UserGroupInformation (main):
PriviledgedActionException as:hadoop cause:java.io.IOException: Not a
file: s3://emrdata/test_data/input/data1 2013-03-01 10:41:26,226
ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching
job : Not a file: s3://emrdata/test_data/input/data1
I am aware that a similar q has been asked here
The suggestion there was to write one's own InputFormat. I am wondering if I am missing something else in the way my script is written / EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.
I have tried giving my input with a "input/*" to EMR as well, but no luck.
It seems that while there may be some temporary workarounds to this, but inherently hadoop doesn't support this yet as you may see that there is an open ticket on this here.
So inputpatth/*/* may work for 2 level of subdierctories it may fail for further nesting.
The best thing you can do for now is get the listing of the files/folders-without-any-subdirectory and add them recursively after creating a csv list of inputPaths. You may use sinple tools like s3cmd for this.

Can we set the multiple generic arguments with -D option in GenericOptionsParser?

I want to pass multiple configuration parameters to my Hadoop job through GenericOptionsParser.
With "-D abc=xyz" I can pass one argument and able to retrieve the same from the configuration object but I am not able to pass the multiple argument.
Is it possible to pass multiple argument?If yes how?
Passed the parameters as -D color=yellow -D number=10
Had the following code in the run() method
String color = getConf().get("color");
System.out.println("color = " + color);
String number = getConf().get("number");
System.out.println("number = " + number);
The following was the o/p in the console
color = yellow
number = 10
I recently ran in to this issue after upgrading from Hadoop 1.2.1 to Hadoop 2.4.1. The problem is that Hadoop's dependency on commons-cli 1.2 was being omitted due to a conflict with commons-cli 1.1 that was pulled in from Cassandra 2.0.5.
After a quick look through the source it looks like commons-cli options that have an uninitialized number of values (what Hadoop's GenericOptionsParser does) default to a limit of 1 in version 1.1 and no limit in 1.2.
I hope that helps!
I tested passing multiple parameters and I used the -D flag multiple times.
$HADOOP_HOME/bin/hadoop jar /path/to/my.jar -D mapred.heartbeats.in.second=80 -D mapred.map.max.attempts=2 ...`
Doing this changed the values to what I specified in the Job's configuration.

Resources