Access HBase table from Hadoop MapReduce - hadoop

I want to access an HBase table from Hadoop MapReduce. I'm using Windows XP, Cygwin, hadoop-0.20.2 and hbase-0.92.0.
I am able to run the MapReduce wordcount example successfully on 3 PCs and have verified that Hadoop and HBase are working fine. I can also create tables from the shell.
I have tried many examples, but they are not working. For example, when I try to compile one using
javac Example.java
it gives errors:
org.apache.hadoop.hbase.client does not exist
org.apache.hadoop.hbase does not exist
org.apache.hadoop.hbase.io does not exist
Can anyone help me with this?
- Please give me some example code to access HBase from Hadoop MapReduce.
- Also, how should I compile and execute it?

This website has example HBase/MapReduce code. I haven't tried it, but it looks OK at first glance. Also, what distribution of Hadoop/HBase are you using? Apache? Cloudera?
http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html
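Since the question asks for example code, here is a minimal sketch (not tested on this exact Windows XP/Cygwin setup) of a job that reads an HBase table with a TableMapper and counts its rows. The table name mytable and the output path are placeholders. The javac errors simply mean the HBase jars are not on the compile classpath; compiling with javac -cp "`hbase classpath`" RowCountExample.java (or by listing the jars under hbase-0.92.0 and its lib directory explicitly) should resolve them.

// RowCountExample.java: a minimal sketch of reading an HBase table from MapReduce.
// "mytable" and the output path argument are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RowCountExample {

    // Emits ("rows", 1) for every HBase row handed to the mapper.
    static class RowMapper extends TableMapper<Text, IntWritable> {
        private static final Text KEY = new Text("rows");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(KEY, ONE);
        }
    }

    // Sums the per-mapper counts into a single total.
    static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(RowCountExample.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch rows in batches instead of one per RPC
        scan.setCacheBlocks(false);  // don't pollute the block cache during a full scan

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, RowMapper.class, Text.class, IntWritable.class, job);

        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

To run it, package the class into a jar and launch it with hadoop jar rowcount.jar RowCountExample /some/output/dir, with HADOOP_CLASSPATH set to the output of hbase classpath so the job can find the HBase classes at runtime.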

Related

Hadoop ResourceManager not showing any job records

I installed a Hadoop multi-node cluster based on this link: http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
Then I tried to run the wordcount example in my environment, but when I access the Resource Manager at http://HadoopMaster:8088 to see the job's details, no records show up in the UI.
I also searched for this problem; one suggested solution was Hadoop is not showing my job in the job tracker even though it is running, but in my case I'm only running Hadoop's own wordcount example and didn't add any extra configuration for YARN.
Can anyone who has successfully installed a multi-node Hadoop 2 cluster with a correctly working web UI help me with this issue, or give me a link to install it correctly?
Did you get the output of the wordcount job?
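If the wordcount output did appear, the job most likely ran with the local job runner rather than on YARN, which would explain why nothing shows up at http://HadoopMaster:8088 (this is an assumption, since the question doesn't show mapred-site.xml). A minimal mapred-site.xml sketch that routes MapReduce jobs through YARN so they appear in the ResourceManager UI:

<configuration>
  <!-- Without this property, jobs run in the local runner and never reach the ResourceManager. -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>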

Apache Kylin installation without Sandbox

I was wondering if there are any resources regarding Apache Kylin installation without any sandbox (like Cloudera or Hortonworks) support. I have managed to do the following:
Install Hadoop 2.6
Install Hive
Install HBase
Then I used the binary from the Kylin site and have been able to run it so far. The problem starts when I try to build a cube: the MapReduce job gets stuck at step 2. I suspect it still assumes it is in sandbox mode and is not submitting the job to Hadoop at all (there is no entry in the Hadoop JobTracker).
So I need a solution for two things:
1. A possible configuration of Kylin in a pure Hadoop setup (no sandbox)
2. Some way to make the Kylin setup submit jobs to Hadoop.
There is no such sandbox or non-sandbox configuration in Kylin. Just make sure the machine where Kylin runs has Hadoop set up correctly and you should be fine.
Under the hood, kylin.sh uses hbase classpath and hive -e set | grep 'env:CLASSPATH' to detect the Hadoop settings. Double-check that these commands work as expected if you are not sure which cluster Kylin connects to.
If Kylin has problems submitting MR jobs, check two places. First, the Hadoop Resource Manager: see whether the job has really been submitted or not; sometimes it's just running slowly. Second, check kylin.log for any exceptions. Post the log to the Kylin dev mailing list and someone will be able to help.
You can install hadoop-2.6, hive-0.14 and hbase-0.98.8-hadoop2 with the built-in ZooKeeper, or an external zookeeper-3.5.
Now you can run kylin-v1.1-release on it.
If you still face issues, paste the log here.

Unable to retain HIVE tables

I have set up a single-node Hadoop cluster on Ubuntu. I have installed Hadoop version 2.6 on my machine.
Problem:
Every time I create Hive tables and load data into them, I can see the data by querying them, but once I shut down Hadoop the tables get wiped out. Is there any way I can retain them, or is there a setting I am missing?
I tried some of the solutions provided online, but nothing worked. Kindly help me out with this.
Thanks
B
The Hive table data lives on HDFS; Hive just adds metadata on top and gives users SQL-like commands so they don't have to write basic MR jobs. So if you shut down the Hadoop cluster, Hive can't find the data in the table.
But if you are saying the data is lost when you restart the Hadoop cluster, that's another problem.
It seems you are using the default Derby database as the metastore. Configure the Hive metastore properly; I am pointing you to a link, please follow it:
Hive is not showing tables
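To make the metastore suggestion above concrete: with the default embedded Derby metastore, the metastore_db directory is created in whichever directory you start hive from, so tables appear to vanish if Hive is later started from a different location. Below is a hedged hive-site.xml sketch that pins the Derby metastore to a fixed path (the /home/hduser/hive path is a placeholder, not taken from the question); configuring a MySQL or PostgreSQL metastore is the more robust fix.

<configuration>
  <!-- Keep the embedded Derby metastore in one fixed place instead of ./metastore_db -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/hduser/hive/metastore_db;create=true</value>
  </property>
  <!-- Default HDFS location of the table data itself -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>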

PIG cannot understand hbase table data

I'm running HBase (0.94.13) on a single node for my academic project. After loading data into HBase tables, I'm trying to run Pig (0.11.1) scripts on the data using HBaseStorage. However, this throws an error saying
IllegalArgumentException: Not a host:port pair: �\00\00\00
Here is the load command I'm using in Pig:
books = LOAD 'hbase://booksdb' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:title', '-loadKey true')
    AS (ID:chararray, title:chararray);
I thought this might be because the HBase version Pig was built against differs from the one on my machine, but I can't seem to make it work without downgrading my HBase. Any help?
It seems you are trying to submit a Pig job remotely.
If so, you need to add a few settings to the pig.properties file (or use set setting_name='value' in your script):
hbase.zookeeper.quorum=<node>
hadoop.job.ugi=username,groupname
fs.default.name=hdfs://<node>:port
mapred.job.tracker=hdfs://<node>:port

Access hdfs from outside hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node, or is there a way to access HDFS from outside of Hadoop?
Any other suggestions on how to do this are welcome. Unfortunately, my executables cannot be run within Hadoop.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open, which will give you a stream that acts like a generic open file (see the sketch below).
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically create a bridge to this command line with something like popen.
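A minimal sketch of the first option, assuming a Java client; the namenode address and file path are placeholders, and the fs.default.name setting is only needed when core-site.xml is not on the client's classpath:

// HdfsReadExample.java: read an HDFS file through the FileSystem Java API.
// The namenode address and file path below are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Only needed if core-site.xml is not available on the classpath.
        conf.set("fs.default.name", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/path/to/file/part-r-00000"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // process each line as needed
        }
        reader.close();
        fs.close();
    }
}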
Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 23.1 release as well. Since it's based on a REST API, any language can access it, and Hadoop does not need to be installed on the node that needs the HDFS files. It's also about as fast as the other options mentioned by orangeoctopus.
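Because WebHDFS is plain HTTP, reading a file needs no Hadoop jars at all. A hedged sketch, assuming WebHDFS is enabled (dfs.webhdfs.enabled) and the namenode HTTP port is the then-default 50070; host, path and user name are placeholders:

// WebHdfsReadExample.java: read an HDFS file over the WebHDFS REST API.
// Host, port, path and user name are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsReadExample {
    public static void main(String[] args) throws Exception {
        // op=OPEN returns a redirect to a datanode, which HttpURLConnection follows.
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/path/to/file/part-r-00000"
                + "?op=OPEN&user.name=hadoopuser");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // stream the file contents
        }
        reader.close();
        conn.disconnect();
    }
}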
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem. It can act as an HDFS proxy.
I had a similar issue and asked the appropriate question. I needed to access HDFS/MapReduce services from outside the cluster. After I found a solution, I posted an answer here for HDFS. The most painful issue turned out to be user authentication, which in my case was solved in the simplest possible way (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on the clients, here is a nice Cloudera article on how to configure Maven to build a JAR for this. It worked 100% for my case.
The main difference between submitting a remote MapReduce job and plain HDFS access is only one configuration setting (check the mapred.job.tracker variable).
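For completeness, a hedged sketch of that client-side configuration (hostnames and ports are placeholders; the property names follow the old-style 1.x naming used elsewhere on this page). fs.default.name is enough for HDFS access from outside the cluster, and mapred.job.tracker is the one extra setting needed to submit MapReduce jobs remotely:

// RemoteClusterConfigExample.java: point a client-side Configuration at a remote cluster.
// Hostnames and ports are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RemoteClusterConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Enough for plain HDFS access from outside the cluster.
        conf.set("fs.default.name", "hdfs://namenode-host:9000");

        // The one extra setting needed to submit MapReduce jobs remotely.
        conf.set("mapred.job.tracker", "jobtracker-host:9001");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Home directory on the remote cluster: " + fs.getHomeDirectory());
        fs.close();
    }
}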
