hive-builtins-0.9.0.jar FileNotFoundException - hadoop

I am a newbie and I am trying to run a Hive query:
hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id') FROM src LIMIT 1;
When I execute the above command I get the following error:
Job Submission failed with exception
'java.io.FileNotFoundException(File does not exist:
hdfs://localhost:9100/usr/local/hive/lib/hive-builtins-0.9.0.jar)'
Execution failed with exit status: 2 Obtaining error information
Task failed! Task ID: Stage-1
It is trying to look for hive-builtins-0.9.0.jar in HDFS, but this file is available under $HIVE_HOME/lib. Why should it be uploaded to HDFS?
I have the following settings in ~/.hiverc, which is read when Hive starts:
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
If I add this Hadoop property to hive-site.xml, the query gives me the required output:
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
but ideally I want to set it to
<value>hdfs://localhost</value>
as I have other Hadoop-specific Java programs that use HDFS. What mistake am I making here? Is there a configuration that I need to set at startup?
As requested, here is my $PATH:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hive/bin
Please help.
Many thanks

Related

Write to HDFS/Hive using NiFi

I'm using NiFi 1.6.0.
I'm trying to write to HDFS and to Hive (Cloudera) with NiFi.
On "PutHDFS" I configured "Hadoop Configuration Resources" with the hdfs-site.xml and core-site.xml files and set the directories. When I try to start it I get the following error:
"Failed to properly initialize processor, If still scheduled to run,
NiFi will attempt to initialize and run the Processor again after the
'Administrative Yield Duration' has elapsed. Failure is due to
java.lang.reflect.InvocationTargetException:
java.lang.reflect.InvocationTargetException"
On "PutHiveStreaming" I configured "Hive Metastore URI" with
thrift://..., the database and the table name, and in "Hadoop
Configuration Resources" I put the hive-site.xml location. When
I try to start it I get the following error:
"Hive streaming connect/write error, flow file will be penalized and routed to retry.
org.apache.nifi.util.hive.HiveWriter$ConnectFailure: Failed connecting to EndPoint {metaStoreUri='thrift://myserver:9083', database='mydbname', table='mytablename', partitionVals=[]}:".
How can I solve these errors?
Thanks.
For #1, if you got your *-site.xml files from the cluster, they may refer to components such as the DataNodes by internal IPs that you cannot reach directly from the client. Try setting dfs.client.use.datanode.hostname to true in hdfs-site.xml on the client.
For #2, I'm not sure PutHiveStreaming will work against Cloudera. IIRC they use Hive 1.1.x and PutHiveStreaming is based on 1.2.x, so there may be some Thrift incompatibilities. If that doesn't seem to be the issue, make sure the client can connect to the metastore port (looks like 9083).
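To rule out plain network reachability before looking at version mismatches, a quick TCP check from the NiFi host can help. This is only a minimal sketch: the function name can_connect is made up for illustration, and a successful connection says nothing about Thrift compatibility, only that the port is open.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example call, using the host and port from the error message in the question:
#   can_connect("myserver", 9083)
```

If this returns False from the machine running NiFi, the problem is in the network path (firewall, DNS, wrong host) rather than in the processor configuration.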

Can't get Master Kerberos principal for use as renewer for Talend Batch Jobs

We are trying to use Talend batch (Spark) jobs to access Hive in a Kerberos cluster, but we are getting the error "Can't get Master Kerberos principal for use as renewer".
With standard (non-Spark) jobs in Talend we can access Hive without any issue.
Our observations:
When we run Spark jobs, Talend is able to connect to the Hive
metastore and validate the syntax. For example, if I provide the wrong
table name it returns "table not found".
When we run select count(*) on a table that has no data it returns
"NULL", but if some data is present in HDFS (the table) it fails with the error
"Can't get Master Kerberos principal for use as renewer".
I am not sure exactly what is causing the token problem. Could someone help us find the root cause?
One more thing to add: if I read from or write to HDFS (instead of Hive) using Spark batch jobs, it works. So the only problem is with Hive and Kerberos.
You should include the Hadoop configuration directory in the classpath (:/path/hadoop-configuration). Include all the configuration files in that directory, not only core-site.xml and hdfs-site.xml. This happened to me and that solved the problem.
I had the same problem when starting Spark on k8s:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Can't get Master Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:133)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:243)
at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
I fixed it by adding yarn-site.xml to HADOOP_CONF_DIR.
The yarn-site.xml only contains yarn.resourcemanager.principal:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/_HOST@DM.COM</value>
</property>
</configuration>
This worked for me.
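Both answers come down to a property missing from the configuration directory the job actually sees. A small script can confirm whether yarn.resourcemanager.principal is defined in any *-site.xml there. This is a rough sketch: find_property is a hypothetical helper, and it assumes the standard Hadoop layout of property elements with name and value children.

```python
import os
import xml.etree.ElementTree as ET
from typing import Optional


def find_property(conf_dir: str, name: str) -> Optional[str]:
    """Return the value of a Hadoop property found in any *-site.xml, or None."""
    for filename in sorted(os.listdir(conf_dir)):
        if not filename.endswith("-site.xml"):
            continue
        root = ET.parse(os.path.join(conf_dir, filename)).getroot()
        for prop in root.iter("property"):
            prop_name = (prop.findtext("name") or "").strip()
            if prop_name == name:
                value = prop.findtext("value")
                return value.strip() if value is not None else None
    return None


# Example call (the directory path is an assumption, use your own conf dir):
#   find_property("/etc/hadoop/conf", "yarn.resourcemanager.principal")
```

A result of None for yarn.resourcemanager.principal would be consistent with the "Can't get Master Kerberos principal for use as renewer" error described above.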

Error while adding UDF in hive

I have to add a UDF in Hive.
The query I am trying is:
create function strip1 as 'com.hadoopbook.hive.Strip' using jar '/home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar'
But I am getting an exception:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask. Hive warehouse is non-local, but /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar specifies file on local filesystem. Resources on non-local warehouse should specify a non-local scheme/path
Can anyone tell me how to solve this?
Three options:
Copy the jar to HDFS and use that path.
OR
As the error tells you: in the $HIVE_HOME/conf directory, hive-default.xml and/or hive-site.xml contains the hive.metastore.warehouse.dir property. Prefix this path with hdfs:// and restart/re-run the Hive shell/script:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs:///user/hive/warehouse</value>
<description>location of the warehouse directory</description>
</property>
OR
If you are running Hive queries from the Hive shell:
hive> set hive.metastore.warehouse.dir;
hive.metastore.warehouse.dir=/user/hive/warehouse
The above command prints the path. Prefix it with hdfs:// as below, then re-run your Hive command(s):
hive> set hive.metastore.warehouse.dir=hdfs:///user/hive/warehouse;
You could set the configuration property hive.aux.jars.path to /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/
and create the Hive UDF with the command below:
create function strip1 as 'com.hadoopbook.hive.Strip'
You can first try adding the UDF jar from an HDFS location instead of the local directory:
hive> add jar hdfs:///user/cloudera/hive/udf/Strip.jar;
and then create the Hive function as below:
hive> create function test_function as 'com.hadoopbook.hive.Strip';
Hope this helps :)
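One detail to watch when writing any of the hdfs paths in these answers: in a URI, whatever follows the two slashes after the scheme is the authority (the NameNode host), not the first directory. Python's urlparse applies the same generic URI rules, so it can illustrate the difference. The helper names describe and is_local are made up for this sketch.

```python
from urllib.parse import urlparse


def describe(uri: str) -> dict:
    """Split a URI into the parts that matter for a Hadoop path."""
    parts = urlparse(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


def is_local(uri: str) -> bool:
    """True when the URI has no scheme or uses the file scheme."""
    return urlparse(uri).scheme in ("", "file")


# Two slashes: "user" is parsed as the host, which is almost never intended.
print(describe("hdfs://user/hive/warehouse"))
# {'scheme': 'hdfs', 'authority': 'user', 'path': '/hive/warehouse'}

# Three slashes: empty authority, so the configured default NameNode applies.
print(describe("hdfs:///user/hive/warehouse"))
# {'scheme': 'hdfs', 'authority': '', 'path': '/user/hive/warehouse'}

# A bare path like the jar in the question counts as local.
print(is_local("/home/hduser/Strip.jar"))
# True
```

The fully qualified form, with the NameNode host and port as the authority, avoids the ambiguity altogether.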

How to output Hadoop EL counters from streaming Map Reduce job triggered by Oozie?

I am triggering a streaming MapReduce job using Oozie, for which I would like to collect the following Hadoop EL constants:
MAP_IN: Hadoop mapper input records counter name.
MAP_OUT: Hadoop mapper output records counter name.
REDUCE_IN: Hadoop reducer input records counter name.
REDUCE_OUT: Hadoop reducer output records counter name.
I see that these can be accessed using
${hadoop:counters('mr-action')[RECORDS][REDUCE_OUT]}
However, I have no idea how to get these values to be output back to either the screen via STDOUT or to a file in HDFS on the server from where I'm launching the Oozie workflow.
I've tried passing these values to a shell action and then echo / append to a file, but I believe this is being handled on the data nodes and so I'm not able to see that output. I've also tried setting oozie.action.external.stats.write to true, as one thread suggested, and then calling
oozie job -info -verbose
but I still don't see these counters showing up under an External Stats field. Any suggestions of how to get these counters output will be very helpful.
Before, I was running oozie job -info job-id -verbose, which wasn't displaying external stats. The key was to make the following changes.
In the workflow.xml file, under the action I want to collect counters for, add the following to the configuration:
<action name="mr-action">
<configuration>
<property>
<name>oozie.action.external.stats.write</name>
<value>true</value>
</property>
</configuration>
</action>
Then, after the job has run, do the following on the command line:
oozie job -info job-id@mr-action -verbose
which gives me the counters I was looking for.
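If the lookup needs to be scripted, the command can be assembled as follows. Oozie addresses a single action as the workflow job ID followed by @ and the action name. This is just a sketch: the function name and the sample job ID are hypothetical, and it assumes the oozie client is on the PATH with the server URL already configured.

```python
from typing import List


def oozie_action_info_cmd(job_id: str, action_name: str) -> List[str]:
    """Build the argument list for 'oozie job -info <job-id>@<action> -verbose'."""
    return ["oozie", "job", "-info", f"{job_id}@{action_name}", "-verbose"]


# Hypothetical job ID, for illustration only.
cmd = oozie_action_info_cmd("0000001-000000000000000-oozie-oozi-W", "mr-action")
print(" ".join(cmd))

# To actually run it and capture the output:
#   import subprocess
#   result = subprocess.run(cmd, capture_output=True, text=True)
```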

How to set HADOOP_CLASSPATH via oozie while running HBase job

I'm using CDH5. I'm hitting an HBase bug while running a MapReduce job through Oozie in a fully distributed environment. The job connects to HBase and adds records programmatically. Please see the links below for details of the bug. Note that I cannot modify the MapReduce job code. The job runs fine from the command line after setting the HADOOP_CLASSPATH environment variable, but there seems to be no way to set or override this variable from Oozie, so the job fails when run from Oozie. Has anybody experienced this and found a workaround?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdpch_relnotes-hdp-2.0.9.0-knownissues-hbase.html
https://issues.apache.org/jira/browse/HBASE-11118
You can set HADOOP_CLASSPATH on the system that runs the Oozie server, so it does not need to be sent with every request.
Otherwise, you can set it in XML. In oozie-site.xml set:
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/user/oozie/etc/hadoop</value>
</property>
where /home/user/oozie/etc/hadoop is the absolute path where the Hadoop
configuration files are located.

Resources