Unable to select from Hive table with custom InputFormat - hadoop

We have developed a custom input format to process EDI files. After the recent upgrade to CDH 5.8, select * from table doesn't return any rows.
Hive script:
create external table CustomInputTest
(
all_cols String
)
STORED AS INPUTFORMAT 'parser.mapred.X12InputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/data/EDI'
TBLPROPERTIES ('edi.schema.hdfs.path' = '/user/data/layout/edi.xsl');
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from CustomInputTest;
The same script returns expected output on Hive 1.1.0-cdh5.4.9.
On CDH 5.8, if the Hive fetch task is disabled to force the query to run as a MapReduce job, the select query works fine:
set hive.fetch.task.conversion=none;
I've checked the Hive server logs and don't see any errors.
How can this be fixed so that the Hive fetch task works in the new version (5.8)?
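For reference, a minimal sketch of the session-level workaround described above, using the table from the DDL: disabling fetch-task conversion forces the select to run as a MapReduce job, where the custom InputFormat is picked up.
-- workaround sketch: force MapReduce for this session
set hive.fetch.task.conversion=none;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from CustomInputTest limit 10;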

Related

Hive query not reading partition field

I created a partitioned Hive table using the following query
CREATE EXTERNAL TABLE `customer`(
`cid` string COMMENT '',
`member` string COMMENT '',
`account` string COMMENT '')
PARTITIONED BY (update_period string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
'hdfs://nameservice1/user/customer'
TBLPROPERTIES (
'avro.schema.url'='/user/schema/Customer.avsc')
I'm writing to the partitioned location using a MapReduce program. When I read the output files using avro-tools, they show the correct data in JSON format. But when I query the table in Hive, nothing is displayed. If I don't use the partition field during table creation, the values are displayed in Hive. What could be the reason for this? I specify the output location for the MapReduce program as "/user/customer/update_period=201811".
Do I need to add anything in the mapreduce program configuration to resolve this?
You need to run MSCK REPAIR TABLE once you have loaded a new partition into the HDFS location.
Why do we need to run the MSCK REPAIR TABLE statement after every ingestion?
Hive stores a list of partitions for each table in its metastore. However, when new partitions are added directly to HDFS, the metastore (and hence Hive) will not be aware of them unless the newly added partitions are registered in one of the following ways.
1. Adding each partition to the table:
hive> alter table <db_name>.<table_name> add partition(`date`='<date_value>')
location '<hdfs_location_of the specific partition>';
(or)
2. Run a metastore check with the repair table option:
hive> msck repair table <db_name>.<table_name>;
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add to the metastore any partitions that exist on HDFS but are not yet registered there.
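Applied to the table above, and using the partition directory mentioned in the question (update_period=201811), a minimal sketch of either option:
-- option 1: register the specific partition explicitly
alter table customer add partition (update_period='201811')
location 'hdfs://nameservice1/user/customer/update_period=201811';
-- option 2: let Hive discover any partitions present on HDFS but not in the metastore
msck repair table customer;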

Can't read data in Presto - can in Hive

I have a Hive DB - I created a table, compatible to Parquet file type.
CREATE EXTERNAL TABLE `default.table`(
`date` date,
`udid` string,
`message_token` string)
PARTITIONED BY (
`dt` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://Bucket/Folder'
I added partitions to this table, but I can't query the data.
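For context, a minimal sketch of the kind of statement used to register such a partition (the dt value and path suffix are hypothetical):
alter table `default.table` add partition (dt='2017-10-01')
location 's3://Bucket/Folder/dt=2017-10-01';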
In Hive: I can see the partitions when using "show partitions default.table", and I get a row count when using "select count(*) from default.table".
In Presto: I can see the partitions when using "show partitions from default.table", but when I try to query the data itself it looks like there is no data: an empty result for "select *" and 0 for "select count(*)".
Hive cluster is AWS EMR, version: emr-5.9.0, Applications: Hive 2.3.0, Presto 0.184, instance type: r3.2xlarge.
Does someone know why I get these differences between Hive and Presto?
Thanks!

Spark Sql 1.5 dataframe saveAsTable how to add hive table properties

I am running Spark SQL on Hive. I need to add the auto.purge table property while creating a new Hive table. I tried the code below to add options while calling the saveAsTable method:
inputDF.write.option("auto.purge", "true").saveAsTable(hiveTableName)
The above line of code added the property under WITH SERDEPROPERTIES of the table.
I need to add this property under the TBLPROPERTIES section of the Hive DDL.
Finally I found a solution; I am not sure it is the best one.
Unfortunately, the Spark 1.5 SQL saveAsTable method doesn't accept table properties as input: it builds its own tableProperties map before creating the Hive table.
See the code here:
https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
To add table properties to an existing Hive table, use the ALTER TABLE command:
ALTER TABLE table_name SET TBLPROPERTIES ('auto.purge'='true');
The above command adds the table property to the Hive metastore.
To drop an existing table inside an encryption zone, run the above command before the DROP command.
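A minimal sketch of that sequence, with a hypothetical table name:
-- set the property first, then the drop inside the encryption zone can proceed
ALTER TABLE my_spark_table SET TBLPROPERTIES ('auto.purge'='true');
DROP TABLE my_spark_table;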

Index Hbase data to solr via Hive external table

I have crawled some data with Nutch 2.3.1. The data is stored in an HBase 0.98 table. I have created an external table that imports data from the HBase table. Now I have to index this data into Solr 4.10.3. For that I followed this well-known tutorial and created a Hive table like this:
create external table if not exists solr_items (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,content,url,title")
tblproperties ("solr.url" = "http://localhost:8983/solr/collection1") ;
There was some problem when I tried to copy data from HBase, posted here. So I decided to first index some dummy data, loading it from a file like this:
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE solr_items;
But it gave following error
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
Where is the problem?
Hadoop version is 1.2.1.
You can't use LOAD DATA for non-native (storage-handler-backed) tables. From the Hive LanguageManual DML:
Hive does not do any transformation while loading data into tables.
Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
Hive obviously can't just copy files in the case of a Solr-backed table, because Solr uses its own internal data representation.
You can insert though:
insert into table solr_items select * from tempTable;
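A minimal sketch of that staging approach, assuming tempTable is a native Hive table matching the columns and delimiter above (the staging table itself is hypothetical):
create table tempTable (
id STRING,
content STRING,
url STRING,
title STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INPATH 'data.csv3' OVERWRITE INTO TABLE tempTable;
insert into table solr_items select * from tempTable;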

detailed steps for bulk loading in HBase table

I am new to HBase. Can someone provide a detailed example of how bulk loading can be done into an HBase table?
Say, for example, I have a customer file with 10 columns and 100K rows, and I want to load that file into an HBase table.
I have created an HBase table managed by Hive and tried to load it using the LOAD command, but it failed.
It looks like I have to insert into the table from HBase only.
hive (Koushik)> CREATE TABLE hive_hbase_emp_sample(eid int, ename string, esal double)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES
> ("hbase.columns.mapping" = ":key,cfstr:enm,cfsal:esl")
> TBLPROPERTIES ("hbase.table.name" = "hive_hbase_emp_sample");
OK
Time taken: 6.404 seconds
hive (Koushik)> load data local inpath '/home/hduser/sample_emp_file' into table hive_hbase_emp_sample;
FAILED: SemanticException [Error 10101]: A non-native table cannot be used as target for LOAD
You cannot directly use LOAD to target an HBaseStorageHandler (non-native) table. Instead, load the data into a staging table, and then insert into your HBase table using select * from the staging table.
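A minimal sketch of that staging approach for the table above (the staging table name and the comma delimiter are assumptions about the sample file):
-- stage the flat file in a native Hive table first
create table emp_staging (eid int, ename string, esal double)
row format delimited fields terminated by ',';
load data local inpath '/home/hduser/sample_emp_file' into table emp_staging;
-- then write into the HBase-backed table via insert ... select
insert into table hive_hbase_emp_sample select * from emp_staging;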
