Load HBase table from Hive - hadoop

I am trying to load an HBase table from a Hive table. The following approach works fine when the HBase table has only a single column family, but it throws an error when there are multiple families.
Approach
source table
CREATE EXTERNAL TABLE temp.employee_orc(id String, name String, Age int)
STORED AS ORC
LOCATION '/tmp/employee_orc/table';
Create Hive table with Hbase Serde
CREATE TABLE temp.employee_hbase(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
export the hbase files
SET hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE temp.employee_hbase SELECT DISTINCT id, name, Age FROM temp.employee_orc CLUSTER BY id;
Load the hbase table
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/employee_hbase/ 'bda:employee_hbase'
Error
I get the following error if the HBase table has multiple column families:
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_hbase/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0
Is there another way to load the HBase table, if not this approach?

When bulk loading from Hive into HBase, the target table can only have a single column family.
bulk load of hbase

You can use HBase's own bulk load (hbase_bulkload), which supports multiple column families.
Or you can use a separate Hive table for each column family.

Related

How to create partitioned hive table on dynamic hdfs directories

I am having difficulty getting Hive to discover partitions that are created in HDFS.
Here's the directory structure in HDFS
warehouse/database/table_name/A
warehouse/database/table_name/B
warehouse/database/table_name/C
warehouse/database/table_name/D
A, B, C, D being values of the column type.
when I create a hive table using the following syntax
CREATE EXTERNAL TABLE IF NOT EXISTS
table_name(`name` string, `description` string)
PARTITIONED BY (`type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///tmp/warehouse/database/table_name'
I am unable to see any records when I query the table.
But when I create directories in HDFS as below
warehouse/database/table_name/type=A
warehouse/database/table_name/type=B
warehouse/database/table_name/type=C
warehouse/database/table_name/type=D
it works, and the partitions show up when I check using show partitions table_name.
Is there some configuration in Hive to be able to detect dynamic directories as partitions?
Creating an external table on top of a directory is not enough; the partitions also need to be registered in the metastore. Automatic partition discovery was added in Hive 4.0.0. For earlier versions, use MSCK REPAIR TABLE:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
or its equivalent on EMR:
ALTER TABLE table_name RECOVER PARTITIONS;
And when you create dynamic partitions using INSERT OVERWRITE, the partition metadata is created automatically and the partition folders are in the form key=value.
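For the layout in the question (plain directories A, B, C, D rather than type=A, ...), MSCK REPAIR TABLE will most likely not pick the directories up, since it looks for key=value folder names. One option is to register each directory explicitly with ALTER TABLE ... ADD PARTITION ... LOCATION. Below is a minimal sketch that only builds those statements as strings. The helper name add_partition_statements is made up for illustration, and it assumes the directory names contain no quote characters.

```python
def add_partition_statements(table, partition_column, table_location, directory_names):
    """Build one ALTER TABLE ... ADD PARTITION statement per plain directory.

    Hypothetical helper: it only formats HiveQL text and does not talk to Hive.
    Assumes directory names contain no quotes or other characters needing escaping.
    """
    base = table_location.rstrip("/")
    statements = []
    for name in directory_names:
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (`{partition_column}`='{name}') "
            f"LOCATION '{base}/{name}';"
        )
    return statements


# Values taken from the question: table_name, partition column `type`, directories A-D.
statements = add_partition_statements(
    table="table_name",
    partition_column="type",
    table_location="hdfs:///tmp/warehouse/database/table_name",
    directory_names=["A", "B", "C", "D"],
)
for statement in statements:
    print(statement)
```

The printed statements can then be run in the Hive CLI or beeline; after that, show partitions table_name should list type=A through type=D even though the folders keep their plain names.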

Hive query not reading partition field

I created a partitioned Hive table using the following query
CREATE EXTERNAL TABLE `customer`(
`cid` string COMMENT '',
`member` string COMMENT '',
`account` string COMMENT '')
PARTITIONED BY (update_period string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/user/customer'
TBLPROPERTIES (
'avro.schema.url'='/user/schema/Customer.avsc')
I'm writing to the partitioned location using a MapReduce program. When I read the output files using avro-tools, they show the correct data in JSON format. But when I use a Hive query to display the data, nothing is displayed. If I don't use a partition field during table creation, the values are displayed in Hive. What could be the reason for this? I specify the output location for the MapReduce program as "/user/customer/update_period=201811".
Do I need to add anything in the mapreduce program configuration to resolve this?
You need to run msck repair table once you have loaded a new partition into the HDFS location.
Why do we need to run the msck repair table statement after each ingestion?
Hive stores a list of partitions for each table in its metastore. When new partitions are added directly to HDFS, the metastore (and hence Hive) will not be aware of them unless the user registers the newly added partitions in one of the following ways.
1. Add each partition to the table
hive> alter table <db_name>.<table_name> add partition(`date`='<date_value>')
location '<hdfs_location_of the specific partition>';
(or)
2. Run the metastore check with the repair table option
hive> Msck repair table <db_name>.<table_name>;
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.
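If the MapReduce job already knows the output path it wrote to, option 1 can be derived from that path instead of being typed by hand. The sketch below is an illustration only: the helper names are made up, it just builds the HiveQL text, and it assumes the path uses key=value folder names as in the question.

```python
def partition_spec_from_path(path):
    """Return the key=value path segments as an ordered list of (key, value) pairs."""
    spec = []
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:
            spec.append((key, value))
    return spec


def add_partition_statement(table, path):
    """Format an ALTER TABLE ... ADD PARTITION statement for one output path.

    Hypothetical helper: values are quoted as strings and not escaped.
    """
    spec = partition_spec_from_path(path)
    if not spec:
        raise ValueError(f"no key=value segments in {path!r}")
    spec_sql = ", ".join(f"`{key}`='{value}'" for key, value in spec)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec_sql}) LOCATION '{path}';"
    )


# Output location from the question.
print(add_partition_statement("customer", "/user/customer/update_period=201811"))
```

Running the generated statement after each job registers only the new partition, which avoids scanning the whole table directory as msck repair table does.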

Hive: Does hive support partitioning and bucketing while using external tables

When the PARTITIONED BY or CLUSTERED BY keywords are used while creating Hive tables,
Hive creates separate files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that the data files of external tables are not managed by Hive. So does Hive create additional files for each partition or bucket and move the corresponding data into those files?
Edit - Adding details.
Few extracts from "Hadoop: Definitive Guide" - "Chapter 17: Hive"
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so hive had the ownership of data and created a directory structure for each partition as in the above tree structure.
In case of external table
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by same set of load operations -
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and answers any query by parsing that file. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone confirm or clarify this?
Suppose you partition on date, as this is a common thing to do.
-- The partition column must not be repeated in the regular column list.
-- `date` is quoted because it is a reserved word in recent Hive versions.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (`date` String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mydatabase.mytable ADD PARTITION( `date` = '2017-07-27' );
ALTER TABLE mydatabase.mytable ADD PARTITION( `date` = '2017-07-28' );
And so on.
Finally, you can add your data in the proper location. You will have an external partitioned table.
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next you have to run an MSCK command (metastore consistency check):
msck repair table database.table;
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.

Index Hbase data to solr via Hive external table

I have crawled some data via Nutch 2.3.1. The data is stored in an HBase 0.98 table. I have created an external table that imports data from the HBase table. Now I have to index this data into Solr 4.10.3. For that I have followed this well-known tutorial. I have created a Hive table like this:
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
There was a problem when I tried to copy data from HBase (posted here). So I decided to first index some dummy data, loading it from a file like this:
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave following error
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem?
HADOOP version is 1.2.1
You can't use LOAD DATA for non-native tables, that is, tables backed by a storage handler (STORED BY), such as this Solr table. Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive obviously can't just copy data files in the case of a Solr-backed table, because Solr uses its own internal data representation.
You can insert though:
insert into table solr_items select * from tempTable;

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException while inserting data into Hive partitioned table

I have employee data with 3 departments: A, B, C.
I am trying to create a table partitioned on department.
I created the table using the command below.
create external table Parti_Trail (EmployeeID Int,FirstName
String,Designation String,Salary Int) PARTITIONED BY (Department
String) row format delimited fields terminated by "," location
'/user/sree/HiveTrail';
But this didn't load my table with the data in location '/user/sree/HiveTrail'.
So I tried to load my table:
LOAD DATA INPATH '/user/aibladmin/HiveTrail' OVERWRITE INTO TABLE Parti_SCDTrail PARTITION(department);
But it shows
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: department not found in table's partition spec: {department=null}
Why is that? Am I doing anything wrong?
What happens if we SET hive.exec.dynamic.partition.mode = nonstrict;
When creating a partitioned table, do we need to keep the data separated in different folders, or does it automatically get separated into different partitions?
For external tables with partitions in Hive, you need to run an ALTER statement to update the metastore for new partitions, because external tables are not managed by Hive.
Check this link
Hope it helps...!!!
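Two more things stand out in the question: the LOAD statement targets Parti_SCDTrail while the table that was created is Parti_Trail, and PARTITION(department) names the column without a value. Assuming a Hive version where LOAD DATA needs a static value for each partition column, one way around the error is one LOAD per department. The sketch below only formats those statements. The helper name and the per-department input directories are hypothetical and assume the data has already been split by department.

```python
def load_statements(table, partition_column, inputs_by_value):
    """Build one LOAD DATA statement per partition value, each with a static value.

    Hypothetical helper: it only formats HiveQL text and does not run it.
    """
    return [
        f"LOAD DATA INPATH '{path}' OVERWRITE INTO TABLE {table} "
        f"PARTITION ({partition_column}='{value}');"
        for value, path in sorted(inputs_by_value.items())
    ]


# Hypothetical input directories, one per department; adjust to the real layout.
inputs = {
    "A": "/user/sree/input/A",
    "B": "/user/sree/input/B",
    "C": "/user/sree/input/C",
}
for statement in load_statements("Parti_Trail", "Department", inputs):
    print(statement)
```

If the data is in a single mixed file instead, the usual alternative is to load it into an unpartitioned staging table and use INSERT ... SELECT with dynamic partitioning, which is where hive.exec.dynamic.partition.mode = nonstrict comes in.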

Resources