I am trying to load 3 billion records (ORC files) from Hive into HBase using the Hive-HBase integration.
Hive CREATE TABLE DDL:
CREATE EXTERNAL TABLE cs.account_dim_hbase(
  `account_number` string,
  `encrypted_account_number` string,
  `affiliate_code` string,
  `alternate_party_name` string,
  `alternate_party_name` string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,account_dim:encrypted_account_number,account_dim:affiliate_code,account_dim:alternate_party_name")
TBLPROPERTIES ("hbase.table.name" = "default:account_dim");
Hive insert query into HBase. I am running 128 INSERT commands similar to the example below:
insert into table cs.account_dim_hbase select account_number ,encrypted_account_number , affiliate_code ,alternate_party_name,mod_account_number from cds.account_dim where mod_account_number=1;
When I run all 128 inserts at the same time, I get the error below:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 438 actions: org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=2.0G, regionName=jhgjhsdgfjgsdjf, server=cldf0007.com
Please help me fix this, and let me know if I am doing anything wrong. I am using HDP 3.
I loaded the data from Hive using MD5 hashing on the row key field and created the HBase table with pre-defined region splits. Each partition now loads in about 5 minutes (previously it took 20 minutes and hit the exceptions above, which are now gone).
create 'users', 'usercf', SPLITS =>
['10000000000000000000000000000000',
'20000000000000000000000000000000',
'30000000000000000000000000000000',
'40000000000000000000000000000000',
'50000000000000000000000000000000',
'60000000000000000000000000000000',
'70000000000000000000000000000000',
'80000000000000000000000000000000',
'90000000000000000000000000000000',
'a0000000000000000000000000000000',
'b0000000000000000000000000000000',
'c0000000000000000000000000000000',
'd0000000000000000000000000000000',
'e0000000000000000000000000000000',
'f0000000000000000000000000000000']
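To illustrate why MD5-hashed row keys pair well with these split points, here is a minimal Python sketch. It is not the poster's actual code: the function names and the sample account number are made up for illustration. It derives a 32-character hex row key, rebuilds the 15 split keys listed above, and shows which of the 16 regions a key would land in. Inside the Hive query itself, the built-in md5() function would presumably play the role of md5_rowkey.

```python
import hashlib
from typing import List

def md5_rowkey(account_number: str) -> str:
    """Return the 32-character lowercase hex MD5 digest to use as the row key."""
    return hashlib.md5(account_number.encode("utf-8")).hexdigest()

def split_points(width: int = 32) -> List[str]:
    """Build the 15 split keys '1000...0' through 'f000...0' (16 regions)."""
    return [format(i, "x") + "0" * (width - 1) for i in range(1, 16)]

def region_index(rowkey: str, splits: List[str]) -> int:
    """Region a key falls into: the number of split keys that are <= rowkey."""
    return sum(1 for s in splits if s <= rowkey)

if __name__ == "__main__":
    splits = split_points()
    key = md5_rowkey("1234567890")  # hypothetical account number
    print(key, "-> region", region_index(key, splits))
```

Because MD5 output is close to uniformly distributed, the first hex digit spreads writes roughly evenly across the 16 regions instead of concentrating them in one memstore.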
Related
I am trying to load an HBase table from a Hive table. The following approach works fine when the HBase table has a single column family, but it throws an error when the table has multiple families.
Approach
source table
CREATE EXTERNAL TABLE temp.employee_orc(id String, name String, Age int)
STORED AS ORC
LOCATION '/tmp/employee_orc/table';
Create Hive table with Hbase Serde
CREATE TABLE temp.employee_hbase(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
export the hbase files
SET hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE temp.employee_hbase SELECT DISTINCT id, name, Age FROM temp.employee_orc CLUSTER BY id;
Load the hbase table
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/employee_hbase/ 'bda:employee_hbase'
Error
I get the following error if the HBase table has multiple column families:
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_hbase/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0
Is there another way to load the HBase table if this approach cannot work?
For bulk load from Hive to HBase, the target table can only have a single column family.
See: bulk load of HBase
You can use HBase's own bulk load (hbase_bulkload), which supports multiple column families.
Or you can use a separate Hive table for each column family.
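As a rough sketch of the "one Hive table per column family" idea, the Python helper below divides a single hbase.columns.mapping string into one mapping per family, each keeping :key. The function name and the two-family mapping in the usage note are hypothetical, and this only shows how the mapping could be divided. It has not been checked against a cluster, and each resulting Hive table would still need its own DDL and hfile.family.path.

```python
from typing import Dict, List

def split_mapping_by_family(mapping: str) -> Dict[str, str]:
    """Group an hbase.columns.mapping string into one mapping per column family."""
    entries = [e.strip() for e in mapping.split(",") if e.strip()]
    per_family: Dict[str, List[str]] = {}
    for entry in entries:
        if entry == ":key":
            continue  # the row key is repeated in every per-family mapping
        family = entry.split(":", 1)[0]
        per_family.setdefault(family, [":key"]).append(entry)
    return {family: ",".join(cols) for family, cols in per_family.items()}

if __name__ == "__main__":
    # Hypothetical two-family mapping, for illustration only.
    for family, m in split_mapping_by_family(":key,emp:name,dept:title").items():
        print(family, "->", m)
```

Each returned string could then be used as the 'hbase.columns.mapping' value of a separate Hive table, so that every INSERT OVERWRITE writes HFiles for a single family.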
We have an HBase table with one column family that holds 1.5 billion records.
The HBase row count was retrieved using the command
count '<tablename>', {CACHE => 1000000}
The HBase-to-Hive mapping was done with the command below.
create external table stagingdata(
rowkey String,
col1 String,
col2 String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key,n:col1,n:col2')
TBLPROPERTIES('hbase.table.name' = 'hbase_staging_data');
But when we retrieve the Hive row count using the command below,
select count(*) from stagingdata;
it shows only 140 million rows in the Hive-mapped table.
We tried the same approach on a smaller HBase table with 100 million records, and all of the records showed up in the Hive-mapped table.
My question is: why are the full 1.5 billion records not showing up in Hive?
Are we missing anything here?
Your Immediate Answer would be highly appreciated.
Thanks,
Madhu.
What you see in Hive is the latest version per key, not all the versions of a key:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp.
Hive HBase Integration
I have crawled some data via Nutch 2.3.1. The data is stored in an HBase 0.98 table. I have created an external table that imports data from the HBase table. Now I have to index this data into Solr 4.10.3. For that I followed this well-known tutorial. I created a Hive table like this:
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
I ran into a problem when I tried to copy data from HBase (posted here), so I decided to first index some dummy data. For that, I tried to load data from a file like this:
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave the following error:
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem?
The Hadoop version is 1.2.1.
You can't use LOAD DATA for non-native tables (tables backed by a storage handler, as the error message says). Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive obviously can't just copy files in the case of a Solr-backed table because Solr uses its own internal data representation.
You can insert though:
insert into table solr_items select * from tempTable;
I am new to HBase. Can someone provide a detailed example of how bulk loading into an HBase table can be done?
Say, for example, I have a customer file with 10 columns and 100K rows. I want to load the file into an HBase table.
I created an HBase table that is managed by Hive and tried to load it using the LOAD command, but it failed.
It looks like I can only insert into the table from the HBase side.
hive (Koushik)> CREATE TABLE hive_hbase_emp_sample(eid int, ename string, esal double)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES
> ("hbase.columns.mapping" = ":key,cfstr:enm,cfsal:esl")
> TBLPROPERTIES ("hbase.table.name" = "hive_hbase_emp_sample");
OK
Time taken: 6.404 seconds
hive (Koushik)> load data local inpath '/home/hduser/sample_emp_file' into table hive_hbase_emp_sample;
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
You cannot directly use LOAD to target a non-native table backed by HBaseStorageHandler. Instead, load the data into a staging table and then insert into your HBase table using INSERT ... SELECT * FROM the staging table.
Here are the environment details:
Hadoop: 2.4.0
Hive: 0.11.0
HBase: 0.94.18
I created an HBase table and imported 10,000 rows:
hbase(main):008:0> create 'genotype_tbl', 'cf'
Load data to the table.
hbase(main):008:0> count 'hbase_tbl'
10000 row(s) in 176.9310 seconds
I created a Hive table as described in this article (using the instructions on this page: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-HiveHBaseIntegration):
CREATE EXTERNAL TABLE hive_tbl(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:info")
TBLPROPERTIES("hbase.table.name" = "hbase_tbl");
However, when I do a count(*) on hive_tbl, it returns 0. There are no errors of any sort. Any help is appreciated.
This issue is resolved. The problem was with the HBase ImportTsv command: the columns list was incorrect. Once that was fixed, I could execute queries from Hive.
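The post does not say what exactly was wrong with the columns list, so the following is only a guess at the kind of mismatch involved. It is a small Python sanity check, with a hypothetical function name and hypothetical example values, that compares an ImportTsv column spec (the value of importtsv.columns) against the Hive hbase.columns.mapping and reports any columns Hive expects that ImportTsv would never write.

```python
from typing import List

def check_importtsv_columns(importtsv_columns: str, hive_mapping: str) -> List[str]:
    """Return a list of problems; an empty list means the two specs line up."""
    tsv_cols = [c.strip() for c in importtsv_columns.split(",")]
    hive_cols = [c.strip() for c in hive_mapping.split(",")]
    problems = []
    # ImportTsv needs exactly one HBASE_ROW_KEY entry to locate the row key field.
    if tsv_cols.count("HBASE_ROW_KEY") != 1:
        problems.append("importtsv.columns must name HBASE_ROW_KEY exactly once")
    written = {c for c in tsv_cols if c != "HBASE_ROW_KEY"}
    for col in hive_cols:
        if col != ":key" and col not in written:
            problems.append(f"Hive reads {col}, but ImportTsv never writes it")
    return problems

if __name__ == "__main__":
    # Hypothetical specs: the second call shows a qualifier mismatch.
    print(check_importtsv_columns("HBASE_ROW_KEY,cf:info", ":key,cf:info"))
    print(check_importtsv_columns("HBASE_ROW_KEY,cf:value", ":key,cf:info"))
```

If the column written by ImportTsv does not match the one in the Hive mapping, the rows exist in HBase but carry none of the mapped columns, which would be consistent with Hive returning 0 without any error.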