Can't read data in Presto - can in Hive

I have a Hive DB, and I created an external table stored as Parquet:
CREATE EXTERNAL TABLE `default.table`(
`date` date,
`udid` string,
`message_token` string)
PARTITIONED BY (
`dt` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://Bucket/Folder';
I added partitions to this table, but I can't query the data.
In Hive: I can see the partitions when using "SHOW PARTITIONS default.table", and I get the row count when using "SELECT COUNT(*) FROM default.table".
In Presto: I can see the partitions when using "SHOW PARTITIONS default.table", but when I try to query the data itself it looks like there's no data - an empty result for "SELECT *", and 0 for "SELECT COUNT(*)".
Hive cluster is AWS EMR, version: emr-5.9.0, Applications: Hive 2.3.0, Presto 0.184, instance type: r3.2xlarge.
Does anyone know why I see this difference between Hive and Presto?
Thanks!
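A useful first diagnostic is to compare what the metastore has registered for each partition with what each engine actually scans. A minimal sketch (the partition value is illustrative, and "$path" is the Presto Hive connector's hidden file-path column):
-- In Hive: check where the metastore thinks a partition lives, then
-- compare it with the S3 prefix that actually holds the Parquet files.
SHOW PARTITIONS `default.table`;
DESCRIBE FORMATTED `default.table` PARTITION (dt='2017-01-01');
-- In Presto: "$path" shows which files the connector reads; an empty
-- result means no files under the registered locations were matched.
SELECT "$path", count(*) FROM default."table" GROUP BY 1;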

Related

Hive query not reading partition field

I created a partitioned Hive table using the following query
CREATE EXTERNAL TABLE `customer`(
`cid` string COMMENT '',
`member` string COMMENT '',
`account` string COMMENT '')
PARTITIONED BY (update_period string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/user/customer'
TBLPROPERTIES (
'avro.schema.url'='/user/schema/Customer.avsc')
I'm writing to the partitioned location using a MapReduce program. When I read the output files with avro-tools, they show the correct data in JSON format. But when I use a Hive query to display the data, nothing is displayed. If I don't use the partition field during table creation, then the values are displayed in Hive. What could be the reason for this? I specify the output location for the MapReduce program as "/user/customer/update_period=201811".
Do I need to add anything in the MapReduce program configuration to resolve this?
You need to run MSCK REPAIR TABLE once you have loaded a new partition into the HDFS location.
Why do we need to run the MSCK REPAIR TABLE statement after each ingestion?
Hive stores a list of partitions for each table in its metastore. However, if new partitions are added directly to HDFS, the metastore (and hence Hive) will not be aware of them unless the user adds the new partitions in one of the ways below.
1. Adding each partition to the table:
hive> ALTER TABLE <db_name>.<table_name> ADD PARTITION (`date`='<date_value>')
LOCATION '<hdfs_location_of_the_specific_partition>';
(or)
2. Running a metastore check with the repair table option:
hive> MSCK REPAIR TABLE <db_name>.<table_name>;
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in the metastore to the metastore.
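Applied to the table in this question, that would look like the following (a sketch; the question doesn't name the database, so the bare table name is assumed):
ALTER TABLE customer ADD PARTITION (update_period='201811')
LOCATION 'hdfs://nameservice1/user/customer/update_period=201811';
-- or discover every partition directory under the table location at once:
MSCK REPAIR TABLE customer;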

Load Hbase table from hive

I am trying to load an HBase table from a Hive table. For that I am using the following approach, and it works fine if I have only a single column family in the HBase table; however, with multiple families it throws an error.
Approach
Source table
CREATE EXTERNAL TABLE temp.employee_orc(id String, name String, Age int)
STORED AS ORC
LOCATION '/tmp/employee_orc/table';
Create Hive table with HBase SerDe
CREATE TABLE temp.employee_hbase(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");
Export the HFiles
SET hive.hbase.generatehfiles=true;
INSERT OVERWRITE TABLE temp.employee_hbase SELECT DISTINCT id, name, Age FROM temp.employee_orc CLUSTER BY id;
Load the HBase table
export HADOOP_CLASSPATH=`hbase classpath`
hadoop jar /usr/hdp/current/hbase-client/lib/hbase-server.jar completebulkload /tmp/employee_hbase/ 'bda:employee_hbase'
Error
I am getting the following error if I have multiple column families in the HBase table:
java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://hadoopdev/apps/hive/warehouse/temp.db/employee_hbase/_temporary/0/_temporary/attempt_1527799542731_1180_r_000000_0
Is there another way to load the HBase table, if not this approach?
With the Hive bulk-load approach above, the target HBase table can only have a single column family.
You can use HBase's own bulk load instead, which supports multiple column families.
Or you can use multiple Hive tables, one per column family.
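A minimal sketch of the last suggestion, one staging Hive table per column family, reusing the properties from the question (the second family "dept" and its column are hypothetical):
CREATE TABLE temp.employee_hbase_emp(id String, name String, age int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,emp:name,emp:Age')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/emp", "hive.hbase.generatehfiles"="true");

CREATE TABLE temp.employee_hbase_dept(id String, dept String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,dept:name')
TBLPROPERTIES("hbase.table.name" = "bda:employee_hbase", "hfile.family.path"="/tmp/employee_hbase/dept", "hive.hbase.generatehfiles"="true");
After inserting into each staging table separately, /tmp/employee_hbase/ holds one directory per family, and the single completebulkload command from the question loads them all.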

HBase to Hive Mapping table is not showing up complete data

We have an HBase table with 1 column family and 1.5 billion records in it.
The HBase row count was retrieved using the command:
count '<tablename>', {CACHE => 1000000}
And the HBase to Hive mapping was done with the below command:
create external table stagingdata(
rowkey String,
col1 String,
col2 String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key,n:col1,n:col2')
TBLPROPERTIES('hbase.table.name' = 'hbase_staging_data');
But when we retrieve the Hive row count using the below command,
select count(*) from stagingdata;
it only shows 140 million rows in the Hive-mapped table.
We tried the same approach for a smaller HBase table with 100 million records, and the complete set of records showed up in the Hive-mapped table.
My question is: why are the complete 1.5 billion records not showing up in Hive?
Are we missing anything here?
Your immediate answer would be highly appreciated.
Thanks,
Madhu.
What you see in Hive is the latest version per key, not all the versions of a key. From the Hive HBase Integration wiki:
"there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp."
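One way to sanity-check this is HBase's stock CellCounter job, which reports row, qualifier, and version counts side by side, so you can see whether multiple versions per key account for the difference (the output directory is illustrative):
hbase org.apache.hadoop.hbase.mapreduce.CellCounter hbase_staging_data /tmp/cellcount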

Unable to select from hive table with custom inputformat

We developed a custom input format to process EDI files. After the recent upgrade to CDH 5.8, select * from the table doesn't return any rows.
Hive script:
create external table CustomInputTest
(
all_cols String
)
STORED AS INPUTFORMAT 'parser.mapred.X12InputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/data/EDI'
TBLPROPERTIES ('edi.schema.hdfs.path' = '/user/data/layout/edi.xsl');

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from CustomInputTest;
The same script returns the expected output on Hive 1.1.0-cdh5.4.9.
On CDH 5.8, if the Hive fetch task is disabled to force the query to generate MapReduce, the select query works fine:
set hive.fetch.task.conversion=none;
I've checked the HiveServer logs; I don't see any errors.
How can I fix the issue so that the Hive fetch task works in the new version (CDH 5.8)?
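Until the custom input format works with the fetch task again, the workaround can at least be scoped rather than turning fetch conversion off globally; a sketch using Hive's hive.fetch.task.conversion.threshold property (the 1-byte threshold is illustrative and effectively forces MapReduce for any real input):
set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=1;
select * from CustomInputTest;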

Query HBase table from Hive

Here are the environment details:
Hadoop: 2.4.0
Hive: 0.11.0
HBase: 0.94.18
I created an HBase table and imported 10,000 rows:
hbase(main):008:0> create 'genotype_tbl', 'cf'
Loaded data into the table:
hbase(main):008:0> count 'genotype_tbl'
10000 row(s) in 176.9310 seconds
I created a Hive table following the instructions on the Hive HBase Integration wiki page (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-HiveHBaseIntegration):
CREATE EXTERNAL TABLE hive_tbl(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:info")
TBLPROPERTIES("hbase.table.name" = "hbase_tbl");
However, when I do a count(*) on hive_tbl, it returns 0. There are no errors of any sort. Any help is appreciated.
This issue is resolved. The problem was with the hbase ImportTsv command: the columns list was incorrect. Once that was fixed, I could execute queries from Hive.
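For reference, a minimal sketch of an ImportTsv invocation whose columns list matches the Hive mapping above (the input path and separator are illustrative):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.separator=',' \
-Dimporttsv.columns=HBASE_ROW_KEY,cf:info \
genotype_tbl /user/data/genotype_input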
