After installing hive-3.2.1 on Hadoop-3.3.0 on Ubuntu, we start the Hive services. I am not sure how Hive identifies the Hadoop services, since we don't provide anything related to Hadoop during the Hive setup process. Does Hive identify Hadoop by means of the HADOOP_HOME environment variable defined in the .bashrc file?
Can someone please confirm my understanding?
Thanks!
Yes. Hive locates Hadoop through the HADOOP_HOME environment variable and reads the cluster settings from Hadoop's configuration directory (etc/hadoop in Hadoop 2.x/3.x); HADOOP_HOME can also be set explicitly in hive-env.sh.
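For illustration, a minimal conf/hive-env.sh (created from hive-env.sh.template in the Hive install) could set it like this; the path below is an example, not a required value:
# $HIVE_HOME/conf/hive-env.sh
export HADOOP_HOME=/usr/local/hadoop-3.3.0   # example path; point this at your actual Hadoop install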
Related
If I install Hadoop using a 'hadoop' user and install Hive using a 'hive' user on the same node (pseudo-distributed mode), how can my Hive access Hadoop?
When I run 'hive --version', I get an error like this:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path.
I think the problem is that the hive user has no access to the hadoop installation, but I don't know how to fix it.
Thanks a lot.
As the error says, $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path.
So edit /home/hive/.bash_profile (assuming you're on Linux, for example) and set one of those environment variables to point at the downloaded Hadoop package.
For example
export HADOOP_HOME=/opt/hadoop # example
export PATH=$HADOOP_HOME/bin:$PATH
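Then reload the profile and re-check; a quick sanity check, assuming the path above matches your actual Hadoop install:
source /home/hive/.bash_profile
hive --version    # should now print the Hive version instead of the HADOOP_HOME error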
I am trying to set up Hive on my local machine. I started all the Hadoop processes and added the {hive}/bin path. On the command prompt I can run Hive commands, and create and read tables. My questions are:
1) Is hive-site.xml an optional file?
2) In the absence of hive-site.xml, how does Hive get information regarding the metastore and other configuration?
If you're running Hive queries from your local machine, which has Hadoop installed, hive-site.xml is not required: you are talking directly to hive/bin in the Hive installation directory, and without a hive-site.xml Hive falls back to its built-in defaults (for example, an embedded Derby metastore created in the working directory). You don't need to tell Hive where to find Hive.
If you wanted to run Hive commands from another machine, but interacting with the Hive instance on your local machine, you'd need hive-site.xml.
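For illustration, a minimal hive-site.xml sketch that makes the metastore and warehouse locations explicit (the values below are placeholders, not settings you must copy):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value> <!-- embedded Derby; swap in a MySQL/Postgres JDBC URL for a shared metastore -->
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value> <!-- HDFS path where managed table data is stored -->
  </property>
</configuration>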
I have configured Hadoop 2.2.0 as a single-node cluster (I was able to run the example jar).
Now I need Hive to run queries on this Hadoop installation.
Should I set the mapred.job.tracker property to the value of yarn.resourcemanager.resource-tracker.address?
I tried that, but I can't see the data loaded into Hive tables in HDFS.
I don't have enough reputation points to add a comment, so trying to help via an answer.
What are the daemons currently running for Hadoop? Use ps -eaf | grep "java" to check.
Do you see the JobTracker running or the ResourceManager?
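For instance, on a YARN (Hadoop 2.x) single-node setup you would expect to see the ResourceManager and NodeManager daemons rather than a JobTracker. A quick way to check is jps, which ships with the JDK (output below is illustrative; the exact process list depends on your configuration):
jps
# 12345 NameNode
# 12456 DataNode
# 12567 ResourceManager
# 12678 NodeManager
# 12789 SecondaryNameNode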
Also, can you elaborate on the steps you performed to install Hive?
I have a screencast, Installing Apache Hive, that walks you through installing Hive. Next, you can follow my blog post Apache Hive - Getting Started. Hope this helps.
Can anyone please tell me whether HCatalog requires installation before it can be used, or whether it can be used just as a jar file?
I have Cloudera running on a VM, and I can use HCatalog for my MR job, Pig, and Hive with no problem. I thought the same MR code would work on another Hadoop platform, but obviously that's not the case: an exception is thrown on HCatInputFormat.setInput(). When I use pig -useHCatalog, I'm told the usage is wrong, meaning it doesn't recognize -useHCatalog as a parameter.
I hadn't thought about this before, as I have been using HCatalog on Cloudera...
Yes, you need to install and start the HCatalog server. HCatalog ships with the latest Hive tar package.
Check the Apache Hive documentation for details.
Basically you need to:
Set up a MySQL database for HCatalog
Run the server install script:
share/hcatalog/scripts/hcat_server_install.sh -r root -d dbroot -h hadoop_home -p portnum
Start the HCatalog server:
export HIVE_HOME=hive_home
$HIVE_HOME/sbin/hcat_server.sh start
As pointed out, you do not need to install HCatalog separately if you are working with Hive 0.12 or later.
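For reference, in those newer Hive releases the HCatalog binaries ship inside the Hive tarball itself; a typical layout looks like this (paths are examples, so verify against your own install):
export HIVE_HOME=/opt/apache-hive-0.13.1-bin     # example path
export HCAT_HOME=$HIVE_HOME/hcatalog             # bundled HCatalog directory
$HCAT_HOME/bin/hcat -e "show tables;"            # quick sanity check against the metastore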
I just started learning Hadoop and Pig (over the last two days!) for one of my future projects.
For experiments, I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (map-reduce mode).
When I initialized Pig by typing the ./bin/pig command, it launched the GRUNT command line and I got a message that Pig had connected to HDFS (localhost:9000); later I was able to access HDFS through Pig without problems.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is: where did Pig pick up the default HDFS configuration (localhost:9000) from? I checked pig.properties but didn't find anything there. I need this info as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how to connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf dir path to the PIG_CLASSPATH environment variable. You can set HADOOP_CONF_DIR as well.
When you start the grunt shell, Pig locates the directory of the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the Namenode and JobTracker.
For reference you may have a look at the Pig shell script to see how env. variables are collected and evaluated.
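For example, this is the kind of core-site.xml entry Pig ends up reading in that case (a sketch; the host and port are whatever your Namenode uses, here matching the localhost:9000 from the question):
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>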
Pig can connect to the underlying HDFS in three ways:
1- Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2- Alternatively, your HADOOP_CONF_DIR may already be set up, pointing at the directory that contains the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3- If neither of these is set, you can also connect to the underlying HDFS by changing pig.properties, which is present under the PIG_HOME/conf dir (see the sketch below).
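A minimal sketch of the pig.properties entries this refers to (the values are placeholders for your own cluster; on Hadoop 2.x, fs.defaultFS is the newer name for fs.default.name):
# PIG_HOME/conf/pig.properties
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001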