Does anyone know how to create a DataFrame in SparkR from an HBase table? - sparkr

I am trying to create a Spark DataFrame in SparkR using data stored in HBase.
Does anyone know how to specify the data source parameters in SQLContext, or any other way to get around this?

You might want to take a look at this package: http://spark-packages.org/package/nerdammer/spark-hbase-connector.
However, it seems that you can't use it with SparkR yet, and the two other packages that provide a connection between Spark and HBase don't seem to be as advanced as this one.
So I guess you won't be able to create a DataFrame directly from HBase in SparkR.
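For reference, reading from HBase with that connector in Scala looks roughly like the sketch below (based on the package's documented API; the host, table, column family and column names are made up). Since it is Scala-only and yields an RDD rather than a DataFrame, one practical workaround for SparkR today is to expose the HBase table through Hive and read it via the SQL context.

// Paste into spark-shell launched with the spark-hbase-connector package and
// --conf spark.hbase.host=<your-hbase-host>. All names below are illustrative.
import it.nerdammer.spark.hbase._

// Read rows as (rowKey, column1, column2) tuples from a hypothetical "sample_data" table
val rows = sc.hbaseTable[(String, String, Int)]("sample_data")
  .select("column1", "column2")
  .inColumnFamily("cf")

rows.take(10).foreach(println)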

Related

Convert HBase Table Data to CSV

How do I convert HBase table data to a .CSV file? I am trying to convert the table data to CSV format, but I couldn't find any code.
hbase001> list
Table
sample_data
Creating an external Hive table mapped onto the HBase table using HBaseStorageHandler can solve your problem. You can then use "select * from table_name" to copy the data into a CSV table (stored as textfile, fields terminated by ','). Please refer to the link below for reference.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Usage
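A minimal sketch of that approach follows; the two statements can be typed into the Hive shell as-is, or issued from spark-shell through a HiveContext as shown. The table name "sample_data", the column family "cf" and the column layout are illustrative.

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// 1. External Hive table mapped onto the HBase table via HBaseStorageHandler.
hive.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS hbase_sample (rowkey STRING, value STRING)
    |STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    |WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
    |TBLPROPERTIES ("hbase.table.name" = "sample_data")""".stripMargin)

// 2. Copy the rows into a comma-delimited text table, which lands on HDFS as CSV files.
hive.sql(
  """CREATE TABLE sample_csv
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE
    |AS SELECT * FROM hbase_sample""".stripMargin)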
There are plenty of ways to solve your task. You can use Spark, regular MapReduce, or special tools like Sqoop. This task is rather trivial and you can implement it yourself once you learn Hadoop. The quickest way for starters is probably Sqoop. Please get yourself familiar with this powerful tool and play with it.
Good luck!

HDFS and HBase: how do they work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database and I use Talend Big Data (an ETL tool) for my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter for the data I want, is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and still query them, you can create an external table over them using Hive.
The best option is to use Hive on top of the files that are in HDFS. You can use bucketing and partitioning in Hive to improve performance.
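As an illustration, here is a rough sketch of a partitioned external table over comma-delimited files already sitting in HDFS (the path, columns and partition key are made up); it can be run from spark-shell through a HiveContext or typed into the Hive shell directly. Filtering on the partition column means only the matching directory is scanned instead of the whole dataset.

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// External table over raw comma-delimited files already sitting in HDFS.
// Assumed directory layout: /data/events/ds=2016-01-01/, /data/events/ds=2016-01-02/, ...
hive.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS events (user_id STRING, amount DOUBLE)
    |PARTITIONED BY (ds STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION '/data/events'""".stripMargin)

// Register one day's directory as a partition, then filter on it so only
// that directory is read.
hive.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (ds='2016-01-01')")
hive.sql("SELECT user_id, SUM(amount) FROM events WHERE ds = '2016-01-01' GROUP BY user_id").show()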

Hive cannot query tables saved by calling saveAsTable in Spark

I was trying to use Hive to query tables I saved using saveAsTable() provided by Spark DataFrames. Everything works well when I query using hiveContext.sql(). However, when I switch to the Hive CLI and describe the table, the schema shows up as something like col, array, and the table is no longer queryable.
Any ideas how to work through this? Is there a reliable way to make Hive understand the metadata defined in Spark instead of explicitly defining the schema?
Sometimes I use Spark to infer the schema from raw data or read it from certain file formats like Parquet, so I don't want to define by hand tables whose schema could be inferred automatically.
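Roughly, what I am doing looks like this (paths and table names are just placeholders):

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// Schema is inferred from the Parquet files rather than declared by hand.
val df = hive.read.parquet("/data/raw/events.parquet")

// Persist it as a table in the Hive metastore.
df.write.saveAsTable("events_tbl")

// Querying through HiveContext works fine...
hive.sql("SELECT COUNT(*) FROM events_tbl").show()

// ...but `describe events_tbl` in the Hive CLI only shows a single `col array<string>`
// column, and the table cannot be queried from Hive.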
Thanks a lot for any advice!

Basic things about Hadoop and Hive

I have started working with Hadoop recently. There is a table named Checkout that I access through Hive. Below are the HDFS paths the data is loaded to, along with some other info. What information can I get from reading the three lines below?
Path Size Record Count Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00 1.13 TB 9,294,245,800 2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00 1.13 TB 9,290,477,963 2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00 1.12 TB 9,286,199,847 2012-07-03 07:08
So my questions are:
1) First, we load the data into HDFS and then query it through Hive to get the results back, right?
2) Second, looking at the paths above, the one thing that confuses me is: when I query using Hive, will I get data from all three paths above, or only from the most recent one at the top?
As I am new to this, I am having a lot of trouble. Can anyone explain where Hive gets the data from? Do we store all the data in HDFS and then use Hive or Pig to get it back out of HDFS? It would be great if someone could give a high-level overview of Hadoop and Hive.
I think you need to understand the difference between Hive's native (managed) tables and Hive's external tables.
A Hive native table means that you load data into Hive, and Hive takes care of how the data is stored in HDFS. We usually do not care about the directory structure in this case.
A Hive external table means that we put data into some directory (setting partitioning aside for the moment) and tell Hive: this is the table's data, please treat it as such. Hive then lets us query it and join it with other external or regular tables, while it remains our responsibility to add data, delete it, and so on. A sketch of both, tied back to the snapshot paths in the question, is shown below.
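This is only a rough sketch (the column names are made up; the statements can be run through a HiveContext from spark-shell, as here, or in the Hive shell). With the partitioned external table in place, question 2 answers itself: Hive reads whichever partitions the query's WHERE clause selects, not just the newest directory.

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// Native (managed) table: Hive owns the storage layout under its warehouse directory.
hive.sql("CREATE TABLE IF NOT EXISTS checkout_managed (trans_id BIGINT, amount DOUBLE)")

// External table: the data stays where it was written; each dated snapshot directory
// is registered as a partition.
hive.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS checkout_ext (trans_id BIGINT, amount DOUBLE)
    |PARTITIONED BY (snapshot_date STRING)
    |LOCATION '/sys/edw/dw_checkout_trans/snapshot'""".stripMargin)

hive.sql(
  """ALTER TABLE checkout_ext ADD IF NOT EXISTS PARTITION (snapshot_date='2012-07-04')
    |LOCATION '/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00'""".stripMargin)

// Only the 2012-07-04 snapshot directory is scanned here; without the filter,
// all registered partitions would be read.
hive.sql("SELECT COUNT(*) FROM checkout_ext WHERE snapshot_date = '2012-07-04'").show()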

Mahout Hive Integration

I want to combine Hadoop-based Mahout recommenders with Apache Hive, so that my generated recommendations are stored directly in my Hive tables. Does anyone know of tutorials for this?
Hadoop-based Mahout recommenders can store their results in HDFS directly.
Hive also allows you to create a table schema on top of any data using CREATE EXTERNAL TABLE recommend_table, which also specifies the location of the data (LOCATION '/home/admin/userdata';).
This way you are assured that when new data is written to that location, /home/admin/userdata, it is immediately available to Hive and can be queried through the existing table schema recommend_table.
I blogged about this some time back: external-tables-in-hive-are-handy. This approach helps with any kind of MapReduce program output that needs to be immediately available for ad-hoc Hive queries.
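Concretely, that could look roughly like this (the column layout assumes the recommender writes tab-separated lines of a user id followed by a recommendation string, which may not match your job's output; run it through a HiveContext from spark-shell, as here, or in the Hive shell):

val hive = new org.apache.spark.sql.hive.HiveContext(sc)

// External table over the directory the Mahout job writes to.
// Assumed line format: <user_id><TAB><recommendations>; adjust to the job's actual output.
hive.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS recommend_table (user_id BIGINT, recommendations STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |STORED AS TEXTFILE
    |LOCATION '/home/admin/userdata'""".stripMargin)

// As soon as the recommender writes new part files under /home/admin/userdata,
// they are visible to queries like this one.
hive.sql("SELECT * FROM recommend_table LIMIT 10").show()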
