detailed steps for bulk loading in HBase table - hadoop

I am new to HBase. Can someone provide me a detailed example on how bulk loading can be done in a HBase table.
Say for example I have a customer file with 10 columns and 100K rows. I want to load the file in a HBase table.
I have created a HBase table which is managed by HIVE and tried to load the same using LOAD command, but it failed.
Looks like I have to insert the table from HBase only.
hive (Koushik)> CREATE TABLE hive_hbase_emp_sample(eid int, ename string, esal double)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES
> ("hbase.columns.mapping" = ":key,cfstr:enm,cfsal:esl")
> TBLPROPERTIES ("hbase.table.name" = "hive_hbase_emp_sample");
OK
Time taken: 6.404 seconds
hive (Koushik)> load data local inpath '/home/hduser/sample_emp_file' into table hive_hbase_emp_sample;
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD

You cannot direcly use load for targeting a HbaseStorage Handler Non native table instead load data in a staging table and then insert into your Hbase table using select * from staging table

Related

Load Hbase table from hive

I am trying to load the hbase table from hive table, for that I am using the following approach and it works fine if I have only single column family in hbase table, however if I have multiple families it throws error.
Approach
source table
CREATE EXTERNAL TABLE temp.employee_orc(id String, name String, Age int)
STORED AS ORC
LOCATION '/tmp/employee_orc/table';
Create Hive table with Hbase Serde
CREATE TABLE temp.employee_hbase(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
export the hbase files
SET hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE temp.employee_hbase SELECT DISTINCT id, name, Age FROM temp.employee_orc CLUSTER BY id;
Load the hbase table
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/employee_hbase/ 'bda:employee_hbase'
Error
I am getting following error if I have multiple column family in Hbase table,
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_hbase/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0
is there another way to load Hbase table if not this approach?
Bulk load from hive to hbase, The target table can only have a single column family.
bulk load of hbase
You can use hbase bulkload hbase_bulkload with support multiple column family
Or you can use multiple hive table for each column family

Loading in path file to a partitioned table

I'm trying to load a file locally into Hive by running this command:
LOAD DATA INPATH '/data/work/hive/staging/ExampleData.csv' INTO TABLE tablename;
which gives me the error:
SemanticException [Error 10062]: Need to specify partition columns
because the destination table is partitioned (state=42000,code=10062)
An answer I found suggests creating an intermediate table then letting dynamic partitioning kick in to load into a partitioned table.
I've created a table that matches the data and truncated it:
create table temptablename as select * from tablename;
truncate table temptablename
Then loaded the data using:
LOAD DATA INPATH '/data/work/hive/staging/ExampleData.csv' INTO TABLE temptablename;
How do I 'kick in' dynamic partitioning?
1.Load data into temptablename(without partition)
create table temptablename(col1,col2..);
LOAD DATA INPATH '/data/work/hive/staging/ExampleData.csv' INTO TABLE
temptablename;
now once you have data in intermediate table ,you can kick in dynamic
partitioning using following command.
2.INSERT into tablename PARTITION(partition_column) select * from
temptablename;

Index Hbase data to solr via Hive external table

I have crawled some data via Nutch 2.3.1. Data is stored in Hbase 0.98 table. I have created an external table that import data from hbase table. Now I have to index this data to solr 4.10.3. For that I have followed this well known tutorial. I have created hive table like
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
There was some problem when I tried to copy data from hbase posted here. Then I just decide to first index some dummy data. For that I have decided to load data from a file like
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave following error
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem
HADOOP version is 1.2.1
You can't use LOAD DATA for external tables. Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive obviously can't just copy data in case of Solr external table because Solr uses it's own internal data presentation.
You can insert though:
insert into table solr_items select * from tempTable;

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException while inserting data into Hive partitioned table

I have an employee data with 3 departments A,B,C.
I am trying to create partioned table on departments.
I created the table using below command.
create external table Parti_Trail (EmployeeID Int,FirstName
String,Designation String,Salary Int) PARTITIONED BY (Department
String) row format delimited fields terminated by "," location
'/user/sree/HiveTrail';
But this did nt load my table with data in location '/user/sree/HiveTrail'
So I tried to load my table
LOAD DATA INPATH '/user/aibladmin/HiveTrail' OVERWRITE INTO TABLE Parti_SCDTrail PARTITION(department);
But showing
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: department not found in table's partition spec: {department=null}
Why is it so. Am I doing anything wrong?
What happens if we SET hive.exec.dynamic.partition.mode = nonstrict;
While creating partitioned table , do we need to keep data seperated in different folder or whether it automatically get seperated into different partitions
For external tables with partition in Hive you need to run an ALTER statement to update the Metastore for new partitions. Because external tables are not managed by Hive.
Check this link
Hope it helps...!!!

Query HBase table from Hive

Here are the environment details:
Hadoop: 2.4.0
Hive: 0.11.0
HBase: 0.94.18
I created a HBase table and imported 10,000 rows:
hbase(main):008:0> create 'genotype_tbl', 'cf'
Load data to the table.
hbase(main):008:0> count 'hbase_tbl'
10000 row(s) in 176.9310 seconds
I created a Hive table as described in this article (using instructions on this page: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-HiveHBaseIntegration)
CREATE EXTERNAL TABLE hive_tbl(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:info")
TBLPROPERTIES("hbase.table.name" = "hbase_tbl");
However, when I do a count(*) on hive_tbl, it returns 0. There are no errors of any sort. Any help is appreciated.
This issue is resolved. The problem is with the hbase ImportTsv command. columns list was incorrect. Once, that was resolved, I could execute queries from Hive.

Resources