Configure Hive with Hadoop

I have configured Hadoop 2.2.0 as a single-node cluster (I was able to run the example jar).
Now I need to make Hive run queries using this Hadoop installation.
Should I set
mapred.job.tracker
to the value of the
yarn.resourcemanager.resource-tracker.address
property?
I tried that, but I can't see the data loaded into Hive tables in HDFS.

I don't have enough reputation points to add a comment, so I'm trying to help via an answer.
Which Hadoop daemons are currently running? Use ps -eaf | grep "java" to check.
Do you see the JobTracker or the ResourceManager running?
Also, can you elaborate on the steps you performed to install Hive?
I have a screencast, Installing Apache Hive, that walks you through installing Hive. Next, you can follow my blog post, Apache Hive - Getting Started. Hope this helps.
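The daemon check can be scripted roughly as follows. This is only a sketch: the function name detect_hadoop_runtime is made up for illustration, and it assumes the daemons show their main class names (ResourceManager, JobTracker) in the process listing. On Hadoop 2.x with YARN you would expect a ResourceManager rather than a JobTracker, and MapReduce clients are normally pointed at YARN with mapreduce.framework.name=yarn in mapred-site.xml rather than with mapred.job.tracker.

```shell
# Classify a process listing read from stdin.
# Prints "yarn" if a ResourceManager is present, "mr1" if a JobTracker is
# present, and "none" if neither daemon name appears.
# (detect_hadoop_runtime is a hypothetical helper name.)
detect_hadoop_runtime() {
    listing=$(cat)
    case "$listing" in
        *ResourceManager*) echo "yarn" ;;
        *JobTracker*)      echo "mr1" ;;
        *)                 echo "none" ;;
    esac
}

# Live check; the [j] pattern keeps grep from matching its own process.
if command -v ps >/dev/null 2>&1; then
    ps -eaf | grep "[j]ava" | detect_hadoop_runtime
fi
```

If this prints "none", the Hadoop daemons are probably not running at all, and that is worth fixing before looking at the Hive configuration.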

Related

Hive installation

After we install hive-3.2.1 on Hadoop-3.3.0 on Ubuntu, we start the Hive services. I am not sure how Hive identifies the Hadoop services, since we don't provide anything related to Hadoop during the Hive setup process. Does Hive locate Hadoop by means of the HADOOP_HOME environment variable defined in the .bashrc file?
Can someone please confirm my understanding?
Thanks!
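Yes. Hive uses HADOOP_HOME, which can also be set in hive-env.sh, to find the Hadoop installation, and it reads the cluster settings from that installation's configuration directory ($HADOOP_HOME/etc/hadoop in Hadoop 2 and 3, $HADOOP_HOME/conf in Hadoop 1).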
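To see which configuration directory a client such as Hive is likely to pick up, a small check like the one below can help. It is a sketch under assumptions: the function name resolve_hadoop_conf_dir is made up, and the lookup order is illustrative rather than a copy of Hive's own start-up logic.

```shell
# Print the Hadoop configuration directory a client would most likely use.
# Illustrative lookup order (an assumption, not Hive's actual code):
#   1. $HADOOP_CONF_DIR, if set
#   2. $HADOOP_HOME/etc/hadoop  (Hadoop 2.x and 3.x layout)
#   3. $HADOOP_HOME/conf        (Hadoop 1.x layout)
resolve_hadoop_conf_dir() {
    if [ -n "${HADOOP_CONF_DIR:-}" ]; then
        echo "$HADOOP_CONF_DIR"
    elif [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME/etc/hadoop" ]; then
        echo "$HADOOP_HOME/etc/hadoop"
    elif [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME/conf" ]; then
        echo "$HADOOP_HOME/conf"
    else
        echo "Hadoop configuration directory not found" >&2
        return 1
    fi
}

if conf_dir=$(resolve_hadoop_conf_dir); then
    echo "Hadoop configuration directory: $conf_dir"
fi
```

Run it in the same shell (or as the same user) that starts Hive, so that it sees the same .bashrc settings.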

How to check the hadoop distribution used in my cluster?

How can I know whether my cluster has been set up using Hortonworks, Cloudera, or a plain installation of the Hadoop components?
Also, how can I find the port numbers of the various services?
It is difficult to identify the Hadoop distribution from port numbers, since the Apache, Hortonworks, and Cloudera distributions use different port numbers.
Another option is to check for cluster management agents: Cloudera Manager installs an agent start-up script at /etc/init.d/cloudera-scm-agent, and Hortonworks installs the Ambari agent start-up script at /etc/init.d/ambari-agent. Vanilla Apache Hadoop will not have any such agent on the server.
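The agent check can be written as a short script. This is a minimal sketch: the function name detect_cluster_manager and the optional directory argument are made up for illustration, and the two script paths are the ones mentioned in this answer.

```shell
# Guess the cluster manager from agent start-up scripts.
# $1 is the init.d directory to inspect (defaults to /etc/init.d).
# (detect_cluster_manager is a hypothetical helper name.)
detect_cluster_manager() {
    initd=${1:-/etc/init.d}
    if [ -e "$initd/cloudera-scm-agent" ]; then
        echo "cloudera"
    elif [ -e "$initd/ambari-agent" ]; then
        echo "hortonworks"
    else
        echo "none-found"
    fi
}

detect_cluster_manager
```

A result of "none-found" only means that no agent script was found on this machine; it does not prove that the cluster is plain Apache Hadoop.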
Another option is to check the Hadoop classpath. The command below prints it.
`hadoop classpath`
Most Hadoop distributions include the distribution name in the classpath. If the classpath doesn't contain either of the keywords below, the setup is most likely a plain Apache installation.
hdp - (Hortonworks)
cdh - (Cloudera)
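The keyword check can be automated along these lines. It is a rough sketch: classify_distro is a made-up name, and a plain substring match can give a false positive if an unrelated path happens to contain "cdh" or "hdp".

```shell
# Classify a classpath (or any other string) by distribution keyword.
# (classify_distro is a hypothetical helper name.)
classify_distro() {
    # Lower-case the input so that "CDH" and "cdh" are treated alike.
    text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$text" in
        *cdh*) echo "cloudera" ;;
        *hdp*) echo "hortonworks" ;;
        *)     echo "apache" ;;
    esac
}

# Only query the live classpath when the hadoop command is available.
if command -v hadoop >/dev/null 2>&1; then
    classify_distro "$(hadoop classpath)"
fi
```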
The simplest way is to run the hadoop version command. Its output shows which version of Hadoop you have, and also which distribution and distribution version you are running. If you find strings like cdh or hdp in it, cdh stands for Cloudera and hdp for Hortonworks.
For example, on my Cloudera cluster, the first line of the hadoop version output shows the Hadoop version followed by the distribution and its version.
Hope this helps.
The hdfs version command also prints the Hadoop version and its distribution.

How to run Mahout jobs on Spark Engine?

Currently I’m doing some document similarity analysis using the Mahout RowSimilarity job. This can easily be done by running the command ‘mahout rowsimilarity…’ from the console. However, I noticed that this job can also be run on the Spark engine. How can I run this job on Spark?
You can use MLlib as an alternative to Mahout on Spark. All libraries in MLlib run in distributed mode (like MapReduce in Hadoop).
Mahout 0.10 supports running jobs on Spark.
More details:
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
Steps to set up Spark with Mahout:
1. Go to the directory where you unpacked Spark and type sbin/start-all.sh to start Spark locally.
2. Open a browser and point it to http://localhost:8080/ to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://).
3. Define the following environment variables:
export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[url of the Spark master]
4. Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell. You should see the shell start and get the prompt mahout>. Check the FAQ for further troubleshooting.
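The environment setup and launch from the steps above can be sketched as one script. The default paths and the master URL below are placeholders (assumptions), and is_spark_master_url is a made-up helper that only does a rough shape check on the URL copied from the Spark web UI.

```shell
# Return success only if the value looks like a standalone Spark master URL,
# i.e. spark://host:port. This is a rough shape check, not full validation.
# (is_spark_master_url is a hypothetical helper name.)
is_spark_master_url() {
    case "$1" in
        spark://?*:[0-9]*) return 0 ;;
        *)                 return 1 ;;
    esac
}

# Placeholder values (assumptions); replace them with your own paths and URL.
export MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout}"
export SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
export MASTER="${MASTER:-spark://localhost:7077}"

# Start the Mahout Spark shell only when the settings look usable.
if is_spark_master_url "$MASTER" && [ -x "$MAHOUT_HOME/bin/mahout" ]; then
    (cd "$MAHOUT_HOME" && bin/mahout spark-shell)
else
    echo "Check MASTER and MAHOUT_HOME before starting the shell" >&2
fi
```

Checking MASTER first catches the common mistake of pasting the web UI address (http://localhost:8080/) instead of the spark:// URL.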
Please visit the link. It uses the new Mahout 0.10 and works with a Spark server.

Does HCatalog require installation before being used?

Can anyone please tell me whether HCatalog requires installation before use, or whether it can be used just as a jar file?
I have Cloudera running on a VM, and I can use HCatalog for my MR jobs, Pig, and Hive with no problem. I thought the same MR code would work on another platform with Hadoop installed, but obviously that's not the case: an exception is thrown on HCatInputFormat.setInput(). When I run Pig with -useHCatalog, I'm told that the usage is wrong, meaning that it doesn't recognize -useHCatalog as a parameter.
I didn't think about this before, as I have been using HCatalog on Cloudera...
Yes, you need to install and start the HCatalog server. HCatalog should come with the latest Hive tar package.
See the Apache Hive documentation for details.
Basically, you need to:
Set up a MySQL database for HCatalog
Run the server install script
share/hcatalog/scripts/hcat_server_install.sh -r root -d dbroot -h hadoop_home -p portnum
Start the HCatalog server
export HIVE_HOME=hive_home
$HIVE_HOME/sbin/hcat_server.sh start
As pointed out, you do not need to install HCatalog separately if you are working with Hive 0.12 or later.

CDH4 installation using tarball

I have been struggling to install CDH via tarball; there is no document that describes the steps or guides you through them. I have root access on the server and wish to install CDH4 via tarball in pseudo-distributed mode. Can anyone help? Apache Hadoop is also installed on the same server, and I want to install CDH without affecting the existing Apache Hadoop installation.
It will not work, because CDH4 will use the same ports that your existing Apache Hadoop installation is using. It will work if you shut down your existing Hadoop cluster and then start your CDH4 cluster. Otherwise, you have to change all the port numbers for the NameNode, Secondary NameNode, JobTracker, TaskTracker, and DataNode, and for their respective web UIs, which is rather tedious. It would also be helpful if you provided some error logs, so I can point out exactly what the problem is.
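If you do go the route of changing ports, it may help to list the ports both installations intend to use and look for overlaps first. The sketch below uses a made-up helper, find_port_conflicts, and the port lists are example values only; take the real ones from each installation's core-site.xml, hdfs-site.xml, and mapred-site.xml.

```shell
# Print every port that appears in both space-separated lists, one per line.
# (find_port_conflicts is a hypothetical helper name.)
find_port_conflicts() {
    for port in $1; do
        case " $2 " in
            *" $port "*) echo "$port" ;;
        esac
    done
}

# Hypothetical port lists for the two installations (example values only).
apache_ports="8020 50070 50010 50075"
cdh_ports="8020 50070 50090"

find_port_conflicts "$apache_ports" "$cdh_ports"
```

Every port it prints has to be changed in one of the two installations before both can run at the same time.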
