How to create external tables from parquet files in s3 using hive 1.2? - hadoop

I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on performing a SELECT * FROM table_name I am getting null values for all columns except the partitioned column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
And on removing the property 'serialization.format' = '1' I am getting ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
**file_01.snappy.parquet:**
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
**External table stmt:**
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC

Writing the below answer assuming that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).
How was the data created?
Spark supports a case-sensitive schema.
When we use the DataFrame API, it is possible to write using a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example case sensitivity is preserved.
When we create a Hive table on top of the data created from Spark, Hive will be able to read it fine, since it is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark introduced the config spark.sql.hive.caseSensitiveInferenceMode:
object HiveCaseSensitiveInferenceMode extends Enumeration {
  val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}

val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
  .doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
    "table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
    "formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
    "any table backed by files containing case-sensitive field names or queries may not return " +
    "accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
    "case-sensitive schema from the underlying data files and write it back to the table " +
    "properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
    "properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
    "instead of inferring).")
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
  .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT either INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We could check the following to see if the problem is related to schema case sensitivity:
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this)
2. Whether the data was created using Spark
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet(<location>).printSchema)
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark (see the sketch below).
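A minimal spark-shell sketch of checks 1, 3 and 4, reusing the table name and S3 path from the question (the CREATE EXTERNAL TABLE statement to re-run is the one shown above):
scala> spark.sql("set spark.sql.hive.caseSensitiveInferenceMode").show(false)     // check 1: current mode
scala> spark.read.parquet("s3://path_to_parquet/").printSchema                    // check 3: casing of field names in the files
scala> spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")  // check 4: switch the mode,
scala> spark.sql("drop table if exists parquet_test")                             //          drop the table,
// ... re-run the CREATE EXTERNAL TABLE statement from the question and re-add the
// partitions (e.g. MSCK REPAIR TABLE parquet_test), then:
scala> spark.sql("select * from parquet_test").show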

Related

Hive with Hbase integration null

Hi everyone. I am trying to use the HBase integration but ran into a problem: the timestamp field queried through Hive is null.
My SQL is:
CREATE EXTERNAL TABLE hbase_data(nid string, dillegaldate timestamp,
coffense string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES(
"hbase.columns.mapping"=":key,0:DILLEGALTIMESTAMP,0:COFFENSE")
TBLPROPERTIES("hbase.table.name" = "ILLEGAL_DATA");
The statement executes successfully, but when I query through Hive:
select * from hbase_data limit 10;
the column dillegaldate is null. I have googled a lot but still cannot find the problem. Can anyone tell me how to solve it? Thank you very much.
Replace 0:DILLEGALTIMESTAMP with 0:DILLEGALTIMESTAMP#b:
CREATE EXTERNAL TABLE hbase_data(nid string, dillegaldate timestamp, coffense string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping"=":key,0:DILLEGALTIMESTAMP#b,0:COFFENSE")
TBLPROPERTIES("hbase.table.name" = "ILLEGAL_DATA");
A mapping entry must be either :key, :timestamp, or of the form column-family-name:[column-name][#(binary|string)] (the type specification delimited by # was added in Hive 0.9.0; earlier versions interpreted everything as strings).
If no type specification is given, the value of hbase.table.default.storage.type will be used.
Any prefix of the valid values is valid too (i.e. #b instead of #binary).
If you specify a column as binary, the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields.
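For illustration, a minimal sketch of writing such a binary cell with the HBase Java client from Scala; the row key and the epoch-milliseconds value are made up for the example, and the exact byte layout Hive expects for a binary timestamp should be verified against your Hive/HBase versions:
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("ILLEGAL_DATA"))

val put = new Put(Bytes.toBytes("row-001"))            // hypothetical row key
put.addColumn(Bytes.toBytes("0"),                      // column family "0" from the mapping
  Bytes.toBytes("DILLEGALTIMESTAMP"),
  Bytes.toBytes(1558051200000L))                       // assumption: timestamp written as an 8-byte long (epoch millis) via Bytes
table.put(put)

table.close()
connection.close()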
HBaseIntegration

How transfer a Table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
What I tried before, you can read in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem here is that the data is still in rows and not in columns. So I made 3 tables for news, social and all, each with a specific where clause. After that I made 2 joins on the tables, which gives me the result table. So I had 6 tables in all, which is not really performant!)
To sum my problem up: in HBase there are column families which are saved as rows like this.
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a data structure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resultant output of the above query is one row per name with the collected type,value pairs (e.g. verpassen with ["all,1","social,0","news,1"]).
Now I wrote a custom mapper which takes that input column as a parameter and emits the values:
from (select '["all,1","social,0","news,1"]' input from TESTTABLE group by name) d MAP d.input Using 'python test.py' as
all,social,news
Alternatively you can use the output to insert into another table which has the column names name, all, social, news (another way to get the same pivoted layout is sketched below).
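If Spark with Hive support happens to be available (not something the question states), here is a hedged alternative sketch of the same pivot with the DataFrame API, assuming the sample data sits in the Hive table TESTTABLE with columns name, type and val as in the query above:
import org.apache.spark.sql.functions.first

// Pivot the per-type rows into one column per type value.
val pivoted = spark.table("TESTTABLE")
  .groupBy("name")
  .pivot("type", Seq("news", "social", "all"))   // one output column per type value
  .agg(first("val"))

pivoted.show()
// expected shape, based on the sample rows in the question:
// name       news  social  all
// verpassen  1     0       1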
Hope this helps

Not able to query(from Hive) Parquet file created in Pig

I have created a Parquet file in Pig (in the directory outputset):
grunt> STORE extracted INTO './outputset' USING ParquetStorer;
The file has 1 record, as shown below:
grunt> mydata = LOAD './outputset/part-r-00000.parquet' using ParquetLoader;
grunt> dump mydata;
(val1,val2,val3)
grunt> describe mydata;
mydata: {val_0: chararray,val_1: chararray,val_2: chararray}
After this, I created an external table in Hive to read this file:
CREATE EXTERNAL TABLE parquet_test (
field1 string,
field2 string,
field3 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
When I query the table I am able to retrieve the 1 record, but all the fields are NULL, as shown below:
hive> select * from parquet_test;
NULL NULL NULL
What am I missing here?
PS :
Pig version : 0.15.0
Hive version : 1.2.1
You need to match the exact field names written by Pig with the column names in Hive; the describe output above shows them as val_0, val_1, val_2 (one way to double-check what is actually stored is sketched after the DDL).
So your Hive DDL should look like:
CREATE EXTERNAL TABLE parquet_test (
val_0 string,
val_1 string,
val_2 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
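If in doubt about which field names actually ended up in the Parquet files, parquet-tools (used in the first question above) will show them, or, assuming a Spark shell happens to be available (the question does not mention one), so will a quick printSchema:
scala> spark.read.parquet("/home/.../outputset").printSchema
// expect val_0, val_1, val_2 here, matching the Pig describe output above;
// those are the names the Hive columns must use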

HIVE External Table - Set Empty Strings to NULL

Currently I have a HIVE 0.7 instance on Amazon EMR. I am trying to create a duplicate of this instance on a new EMR cluster using Hive 0.11.
In my 0.7 instance I have an external table that will set empty strings to NULL. Here is how I create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename
(column1 string,
column2 string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES ('serialization.null.format' = '');
Data is added to the table like this:
ALTER TABLE tablename
ADD PARTITION (year = '2013', month = '10', day='01')
LOCATION '/location_in_hdfs';
This works great in 0.7 but in 0.11 it doesn't seem to be evaluating my empty strings as NULLS. Interestingly, creating a normal table with the same data and table definition seems to evaluate empty strings as NULLs as expected.
Is there a different way to do this with an external table in 0.11?
Hive's default partition properties are overriding the table properties. Set the SERDE property at the partition level after adding the partition:
ALTER TABLE tablename PARTITION (year = '2013', month = '10', day = '01')
SET SERDEPROPERTIES ('serialization.null.format' = '');

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field where only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as it's not for the primary key.
However, DynamoDB does not allow empty strings nor empty sets as described in the data model.
To work around this, I think you have a couple options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such.
You could also filter these records, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
