Configuring HCatalog, WebHCat with Hive - hadoop

I'm installing Hadoop and Hive, integrated with WebHCat, which will be used to run Hive queries as Hadoop MapReduce jobs.
I installed Hadoop 2.4.1 and Hive 0.13.0 (latest stable versions).
The request I'm sending using the web interface is:
POST: http://localhost:50111/templeton/v1/hive?user.name='hadoop'&statusdir='out'&execute='show tables'
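For reference, roughly the same request issued from the command line with curl (a sketch; WebHCat accepts these parameters as plain POST form fields):
curl -s \
  -d user.name=hadoop \
  -d execute="show tables" \
  -d statusdir=out \
  'http://localhost:50111/templeton/v1/hive'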
The response I get is:
{
"id": "job_local229830426_0001"
}
But in webhcat-console-error.log I find that the exit value of this job is 1, which means an error occurred. Tracking it down, I found: Missing argument for option: hiveconf
This is webhcat-site.xml, which contains the WebHCat configuration (WebHCat was previously known as Templeton):
<configuration>
<property>
<name>templeton.port</name>
<value>50111</value>
<description>The HTTP port for the main server.</description>
</property>
<property>
<name>templeton.hive.path</name>
<value>/usr/local/hive/bin/hive</value>
<description>The path to the Hive executable.</description>
</property>
<property>
<name>templeton.hive.properties</name>
<value>hive.metastore.local=false,hive.metastore.uris=thrift://localhost:9933,hive.metastore.sasl.enabled=false</value>
<description>Properties to set when running hive.</description>
</property>
</configuration>
But the command actually executed is odd, as it has some extra --hiveconf options with no values:
tool.TrivialExecService: Starting cmd: [/usr/local/hive/bin/hive, --service, cli, --hiveconf, --hiveconf, --hiveconf, hive.metastore.local=false, --hiveconf, hive.metastore.uris=thrift://localhost:9933, --hiveconf, hive.metastore.sasl.enabled=false, -e, show tables]
Any idea?

Related

Redirecting to log server for container when view logs of a completed spark jobs run on yarn

I'm running Spark on YARN.
My Spark version is 2.1.1, and the Hadoop version is Apache Hadoop 2.7.3.
While a Spark job is running on YARN in cluster mode, I can view the executor's logs via the stdout/stderr links, like
http://hadoop-slave1:8042/node/containerlogs/container_1500432603585_0148_01_000001/hadoop/stderr?start=-4096
but when the job has completed, viewing the executor's logs via the stdout/stderr links gives an error page like
Redirecting to log server for container_1500432603585_0148_01_000001
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
It then automatically redirects to
http://hadoop-slave1:8042/node/hadoop-master:19888/jobhistory/logs/hadoop-slave1:36207/container_1500432603585_0148_01_000001/container_1500432603585_0148_01_000001/hadoop
and I get another error page like
Sorry, got error 404
Please consult RFC 2616 for meanings of the error code.
Error Details
org.apache.hadoop.yarn.webapp.WebAppException: /hadoop-master:19888/jobhistory/logs/hadoop-slave1:50284/container_1500432603585_0145_01_000002/container_1500432603585_0145_01_000002/oryx: controller for hadoop-master:19888 not found
at org.apache.hadoop.yarn.webapp.Router.resolveDefault(Router.java:232)
at org.apache.hadoop.yarn.webapp.Router.resolve(Router.java:140)
at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:134)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
Actually, I can view the executor's log using this URL once the Spark job has completed:
http://hadoop-master:19888/jobhistory/logs/hadoop-slave1:36207/container_1500432603585_0148_01_000001/container_1500432603585_0148_01_000001/hadoop
It's a little different from the previous URL; it drops the leading "hadoop-slave1:8042/node/".
Does anyone know a better way to view the Spark logs once the job has completed?
I have configured yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>${yarn.resourcemanager.hostname}:19888/jobhistory/logs</value>
</property>
and mapred-site.xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>${yarn.resourcemanager.hostname}:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.admin.address</name>
<value>${yarn.resourcemanager.hostname}:10033</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:19888</value>
</property>
I have encountered this situation: viewing the completed Spark Streaming job's logs through the YARN UI History tab, I got the error below:
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
The solution is to configure yarn-site.xml by adding the key yarn.log.server.url:
<property>
<name>yarn.log.server.url</name>
<value>http://<LOG_SERVER_HOSTNAME>:19888/jobhistory/logs</value>
</property>
Then restart the YARN cluster to reload yarn-site.xml (this step is important!).
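For example, on a plain Apache Hadoop install the restart could look roughly like this (a sketch assuming the standard sbin scripts and that the MapReduce JobHistory server on port 19888 runs on the log-server host):
# on the ResourceManager host
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh
# restart the JobHistory server that serves the aggregated logs on port 19888
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
With log aggregation enabled, the aggregated container logs can usually also be fetched from the command line, e.g. yarn logs -applicationId application_1500432603585_0148.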

Query tables present in external hive from Apache Spark [duplicate]

This question already has answers here:
How to connect Spark SQL to remote Hive metastore (via thrift protocol) with no hive-site.xml?
(11 answers)
Closed 2 years ago.
I am relatively new to the Hadoop ecosystem. My goal is to read Hive tables using Apache Spark and process them. Hive is running on an EC2 instance, whereas Spark is running on my local machine.
To build a prototype, I've installed Apache Hadoop by following the steps described here. I've added the required environment variables as well.
I've started dfs using $HADOOP_HOME/sbin/start-dfs.sh
I've installed Apache Hive by following the steps described here. I've started HiveServer2 and the Hive metastore. I've configured Apache Derby (server mode) in Hive. I've created a sample table 'web_log' and added a few rows to it using Beeline.
I've added the following to Hadoop's core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
And added the following to hdfs-site.xml:
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
I've added core-site.xml, hdfs-site.xml and hive-site.xml to $SPARK_HOME/conf in my local Spark instance.
core-site.xml and hdfs-site.xml are empty, i.e.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
hive-site.xml has the following content:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://ec2-instance-external-dbs-name:9083</value>
<description>URI for client to contact metastore server</description>
</property>
</configuration>
I've started spark-shell and executed the following command:
scala> sqlContext
res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@57d0c779
It seems Spark has created a HiveContext.
I've executed SQL using the command below:
scala> val df = sqlContext.sql("select * from web_log")
df: org.apache.spark.sql.DataFrame = [viewtime: int, userid: bigint, url: string, referrer: string, ip: string]
The columns and their types match the sample table 'web_log' that I created.
Now when I execute scala> df.show, it takes some time and then throws the error below:
16/11/21 18:46:17 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/ec2-instance-private-ip:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3101)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:755)
It seems DFSClient is using the EC2 instance's internal IP. And AFAIK, I didn't start any application on port 50010.
Do I need to install and start any other application?
How can I make sure that DFSClient uses the EC2 instance's external IP or external DNS name?
Is it possible to access Hive from an external Spark instance?
Add the code snippet below to the program you are running:
hiveContext.getConf.getAll.mkString("\n")
This will print which Hive metastore it is connecting to, so you can review any properties that are not correct.
If they are not what you are looking for and you can't adjust them through the config files (due to some limitations, as described in the linked question), you can point to the correct URIs programmatically, like this:
hiveContext.setConf("hive.metastore.uris", "thrift://METASTOREl:9083");
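If editing hive-site.xml is not convenient, the metastore URI can also typically be injected when launching the shell; a sketch (the hostname is a placeholder, and spark.hadoop.* properties are copied into the Hadoop/Hive configuration Spark uses):
# point the embedded HiveContext at the remote metastore
spark-shell --conf spark.hadoop.hive.metastore.uris=thrift://ec2-instance-external-dbs-name:9083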

Running Hive Query in Spark through Oozie 4.1.0.3

I'm getting a "table not found" exception while running a Hive query in Spark using Oozie version 4.1.0.3, as a java action.
I copied hive-site.xml and hive-default.xml from the HDFS path.
The workflow.xml used:
<start to="scala_java"/>
<action name="scala_java">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${nameNode}/user/${wf:user()}/${appRoot}/env/devbox/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.hive.defaults</name>
<value>${nameNode}/user/${wf:user()}/${appRoot}/env/devbox/hive-default.xml</value>
</property>
<property>
<name>pool.name</name>
<value>${etlPoolName}</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>${QUEUE_NAME}</value>
</property>
</configuration>
<main-class>org.apache.spark.deploy.SparkSubmit</main-class>
<arg>--master</arg>
<arg>yarn-cluster</arg>
<arg>--class</arg>
<arg>HiveFromSparkExample</arg>
<arg>--deploy-mode</arg>
<arg>cluster</arg>
<arg>--queue</arg>
<arg>testq</arg>
<arg>--num-executors</arg>
<arg>64</arg>
<arg>--executor-cores</arg>
<arg>5</arg>
<arg>--jars</arg>
<arg>datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar</arg>
<arg>TEST-0.0.2-SNAPSHOT.jar</arg>
<file>TEST-0.0.2-SNAPSHOT.jar</file>
</java>
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Table not found test_hive_spark_t1)
Exception in thread "Driver" org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found test_hive_spark_t1
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:980)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:79)
at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:255)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:255)
A. The X-default config files are just for user information; they are created at install time from the hard-coded defaults in the JARs.
It's the X-site config files that contain the useful information, e.g. how to connect to the metastore (the default for that is "just start an embedded Derby DB with no data inside"... which might explain the "table not found" message!).
B. Hadoop components search for X-site config files in the CLASSPATH, and if they don't find them there, they silently fall back to the defaults.
So you must tell Oozie to download them to the local CWD via <file> instructions.
(Except for an explicit Hive action, which uses another, explicit convention for its specific hive-site.xml, but that's not the case here.)
hive-default.xml is not needed.
Create a custom hive-site.xml that has the hive.metastore.uris property alone.
Pass the custom hive-site.xml via --files hive-site.xml in the Spark arguments.
Remove the job-xml property and the oozie.hive.defaults setting.
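Outside of Oozie, the equivalent spark-submit call would look roughly like this (a sketch; the class, queue and jar names are taken from the workflow above):
spark-submit \
  --master yarn-cluster \
  --class HiveFromSparkExample \
  --queue testq \
  --files hive-site.xml \
  --jars datanucleus-api-jdo-3.2.6.jar,datanucleus-core-3.2.10.jar,datanucleus-rdbms-3.2.9.jar \
  TEST-0.0.2-SNAPSHOT.jar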

get "ERROR: Can't get master address from ZooKeeper; znode data == null" when using Hbase shell

I installed Hadoop 2.2.0 and HBase 0.98.0, and here is what I do:
$ ./bin/start-hbase.sh
$ ./bin/hbase shell
2.0.0-p353 :001 > list
then I got this:
ERROR: Can't get master address from ZooKeeper; znode data == null
Why am I getting this error? Another question:
do I need to run ./sbin/start-dfs.sh and ./sbin/start-yarn.sh before I run HBase?
Also, what are ./sbin/start-dfs.sh and ./sbin/start-yarn.sh used for?
Here is some of my configuration:
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://127.0.0.1:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/Users/apple/Documents/tools/hbase-tmpdir/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/apple/Documents/tools/hbase-zookeeper/zookeeper</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/micmiu/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>io.native.lib.available</name>
<value>false</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
If you just want to run HBase without going into ZooKeeper management for standalone HBase, then remove all the property blocks from hbase-site.xml except the one named hbase.rootdir.
Now run /bin/start-hbase.sh. HBase comes with its own ZooKeeper, which gets started when you run /bin/start-hbase.sh; this will suffice if you are just trying to get things working for the first time. Later you can add distributed-mode configuration for ZooKeeper.
You only need to run /sbin/start-dfs.sh for HBase, since the value of hbase.rootdir is set to hdfs://127.0.0.1:9000/hbase in your hbase-site.xml. If you change it to some location on the local filesystem using file:///some_location_on_local_filesystem, then you don't even need to run /sbin/start-dfs.sh.
hdfs://127.0.0.1:9000/hbase says it's a place on HDFS, and /sbin/start-dfs.sh starts the NameNode and DataNode, which provide the underlying API for accessing the HDFS file system. To learn about YARN, see http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
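To check which daemons are actually up, something like this usually works (a sketch assuming a single-node setup and the standard scripts):
# start HDFS first, because hbase.rootdir points at hdfs://127.0.0.1:9000/hbase
$HADOOP_HOME/sbin/start-dfs.sh
# then start HBase (it also starts its bundled ZooKeeper unless HBASE_MANAGES_ZK=false)
$HBASE_HOME/bin/start-hbase.sh
# NameNode, DataNode, HMaster, HRegionServer (and HQuorumPeer) should all show up here
jps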
This could also happen if the VM or the host machine is put to sleep; ZooKeeper will not stay live.
Restarting the VM should solve the problem.
You need to start ZooKeeper and then run the HBase shell:
{HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
You may also want to check this property in hbase-env.sh:
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
Refer to Source - Zookeeper
One quick solution could be to restart HBase:
1) stop-hbase.sh
2) start-hbase.sh
I had the exact same error. The Linux firewall was blocking connectivity. One can test ports via telnet. A quick fix is to turn off the firewall and see if it fixes it:
Completely disable the firewall on all of your nodes. Note: this command will not survive a reboot of your machines.
systemctl stop firewalld
The long-term fix is to configure the firewall to allow the HBase ports.
Note that your version of HBase may use different ports:
https://issues.apache.org/jira/browse/HBASE-10123
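As a sketch, opening the ports with firewalld could look like the commands below; the port numbers are assumptions (HBase 0.98 defaults to 60000/60020 for the master/regionserver and 2181 for ZooKeeper, while newer releases use 16000/16020 as per the JIRA above), so check your own configuration first:
# open the master, regionserver and ZooKeeper ports on every node, then reload
firewall-cmd --permanent --add-port=60000/tcp --add-port=60020/tcp --add-port=2181/tcp
firewall-cmd --reload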
The output from the HBase shell is quite high-level, and many misconfigurations can cause this message. To help yourself debug, it is much better to look into the HBase log in
/var/log/hbase
to figure out the root cause of the issue.
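For example (a sketch; the exact file name depends on the user and hostname HBase runs under):
# follow the master log to see why the master fails to come up
tail -f /var/log/hbase/hbase-*-master-*.log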
I had the same problem too. In my case, the root cause was hadoop-kms having a port conflict with my hbase-master. Both of them use port 16000, so my HMaster didn't even get started when I invoked the hbase shell. After I fixed that, my HBase worked.
Again, a KMS port conflict might not be your root cause. I strongly suggest looking into /var/log/hbase to find the root cause.
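A quick way to see what is already listening on the master port (a sketch; 16000 is the default for newer HBase versions):
# show which process, if any, already holds the HMaster port
sudo netstat -tlnp | grep 16000
# or
sudo lsof -i :16000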
In my case, with the same error when running HBase, I did not include the ZooKeeper properties in hbase-site.xml and still got the above error messages (per the Apache HBase guide, only two properties are essential: rootdir and distributed).
I could also trace back through the output of the jps command and see that my HRegionServer and HMaster were indeed not properly up and running.
After a stop and start (like a reset), I had these two up and running and could run HBase properly.
If it's happening in VMware or VirtualBox, please restart Cloudera with the command init 1. Check that you have root privileges and retry; hope it will help :)
hbase shell

Error running mapreduce sample in hadoop 0.23.6

I deployed Hadoop 0.23.6 on Ubuntu 12.04 LTS. I am able to copy files across and do file manipulation. I am using YARN for MapReduce.
I am getting the following error when I try to run any MapReduce application using hadoop-mapreduce-examples-0.23.6.jar.
Command used:
bin/hadoop jar hadoop-mapreduce-examples-0.23.6.jar randomwriter -Dmapreduce.randomwriter.mapsperhost=1 -Dmapreduce.job.user.name=$USER -Dmapreduce.randomwriter.bytespermap=10000 -Ddfs.blocksize=536870912 -Ddfs.block.size=536870912 -libjars hadoop-mapreduce-client-app-0.23.6.jar output
Hadoop version: 0.23.6
Container launch failed for container_1364342550899_0001_01_000002 : java.lang.IllegalStateException: Invalid shuffle port number -1 returned for attempt_1364342550899_0001_m_000000_0
Verify your yarn-site.xml configuration. You need to have the properties below configured:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
For more details, have a look at the JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
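After changing yarn-site.xml, the NodeManagers need to be restarted so the new aux-service is loaded; a sketch using the standard daemon script:
# run on each NodeManager host
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager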
