Hive with HBase integration returns NULL - hadoop

Hi everyone. I'm trying to use the Hive HBase integration but ran into a problem: a timestamp field queried through Hive comes back NULL.
My SQL is:
CREATE EXTERNAL TABLE hbase_data(nid string, dillegaldate timestamp, coffense string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,0:DILLEGALTIMESTAMP,0:COFFENSE")
TBLPROPERTIES("hbase.table.name" = "ILLEGAL_DATA");
It executes successfully, but when I query through Hive
select * from hbase_data limit 10;
the column dillegaldate is NULL. I googled for a long time but still can't find the problem. Can anyone tell me how to solve it? Thank you very much.

Replace 0:DILLEGALTIMESTAMP with 0:DILLEGALTIMESTAMP#b:
CREATE EXTERNAL TABLE hbase_data(nid string, dillegaldate timestamp, coffense string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,0:DILLEGALTIMESTAMP#b,0:COFFENSE")
TBLPROPERTIES("hbase.table.name" = "ILLEGAL_DATA");
A mapping entry must be either :key, :timestamp, or of the form column-family-name:[column-name][#(binary|string)] (the type specification delimited by # was added in Hive 0.9.0; earlier versions interpreted everything as strings).
If no type specification is given, the value of hbase.table.default.storage.type is used.
Any prefix of a valid value is valid too (e.g. #b instead of #binary).
If you specify a column as binary, the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields.
See the HBaseIntegration wiki page: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
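As an alternative to tagging individual columns with #b, the table-wide default can be set via hbase.table.default.storage.type; a minimal sketch, assuming every non-key cell in ILLEGAL_DATA is written with HBase's Bytes encoding (the row key is kept as a string via the #s prefix):
-- columns without an explicit #string/#binary suffix now default to binary
CREATE EXTERNAL TABLE hbase_data(nid string, dillegaldate timestamp, coffense string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key#s,0:DILLEGALTIMESTAMP,0:COFFENSE")
TBLPROPERTIES("hbase.table.name" = "ILLEGAL_DATA",
              "hbase.table.default.storage.type" = "binary");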

Related

How to create external tables from parquet files in s3 using hive 1.2?

I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on performing a SELECT * on the table I am getting null values for all columns except the partitioned column.
I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue.
And on removing the property 'serialization.format' = '1' I am getting ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.
I checked the parquet files and was able to read the data using parquet-tools:
**file_01.snappy.parquet:**
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}
**External table stmt:**
CREATE EXTERNAL TABLE parquet_test
(
col2 int,
col3 string
)
PARTITIONED BY (col1 date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Result:
col_1 col_2 col_3
5/3/19 NULL NULL
5/4/19 NULL NULL
5/5/19 NULL NULL
5/6/19 NULL NULL
Expected Result:
col_1 col_2 col_3
5/3/19 1234 ABC
5/4/19 124 FHK
5/5/19 12515 UPO
5/6/19 1234 ABC
Writing the answer below assuming that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).
How was the data created?
Spark supports a case-sensitive schema.
When we use the DataFrame APIs, it is possible to write using a case-sensitive schema.
Example:
scala> case class Employee(iD: Int, NaMe: String )
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
|-- iD: integer (nullable = true)
|-- NaMe: string (nullable = true)
Notice that in the above example case sensitivity is preserved.
When we create a Hive table on top of the data written from Spark, Hive is able to read it fine since it is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lower case by default, and the rows returned are null.
To overcome this, Spark has introduced a config spark.sql.hive.caseSensitiveInferenceMode.
object HiveCaseSensitiveInferenceMode extends Enumeration {
val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}
val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
.doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
"table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
"formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
"any table backed by files containing case-sensitive field names or queries may not return " +
"accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
"case-sensitive schema from the underlying data files and write it back to the table " +
"properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
"properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
"instead of inferring).")
.stringConf
.transform(_.toUpperCase(Locale.ROOT))
.checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
INFER_AND_SAVE - Spark infers the schema and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT either INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files correctly.
The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
We could check the following to see if the problem is related to schema case sensitivity (a sketch of the corresponding commands follows this list):
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this)
2. Whether the data was created using Spark
3. If 2 is true, check whether the schema is case sensitive (spark.read.parquet(<location>).printSchema)
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark.
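A minimal sketch of those checks as Spark SQL statements, assuming the parquet_test table from the question (adjust names and the partition-recovery step to your environment):
-- 1. show the current inference mode
SET spark.sql.hive.caseSensitiveInferenceMode;
-- inspect whether an inferred case-sensitive schema was already saved to the table properties
DESC EXTENDED parquet_test;
-- 4. switch to INFER_AND_SAVE, recreate the table, re-register partitions, and re-read
SET spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE;
DROP TABLE IF EXISTS parquet_test;
-- re-run the CREATE EXTERNAL TABLE statement from the question here, then:
MSCK REPAIR TABLE parquet_test;
SELECT * FROM parquet_test LIMIT 10;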

Import multiple column families from hbase to hive

I am trying to move an HBase table having two column families into a Hive table. I am able to move one column family, but how can I move the other one into the same Hive table?
Edit:
I moved one column family using the code below.
CREATE TABLE hbase_hive(key string, firstname string, age string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age")
TBLPROPERTIES("hbase.table.name" = "hl");
But I have one more column family named hb with three columns. How can I achieve this?
Update:
I also tried adding a column from the other column family; below is my code.
CREATE TABLE hbase_hive(key string, firstname string, age string, testname string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");
But I am getting the result below:
819215975 19391121 625678921720 NULL
819617215 19570622 625116365890 NULL
820333876 19640303 623221670810 NULL
824794938 19531211 625278010070 NULL
828093442 19420803 625284904860 NULL
828905771 19320209 625078004220 NULL
829832017 19630722 625178010070 NULL
Instead of values I am getting NULL.
Update:
I tried creating the HBase table using the command below in the hbase shell:
create 'hl','id'
then I created one more column family using:
alter 'hl','pd'
In your HiveQL, you select two columns in column family "id" from the HBase table "hl" into the Hive table. If you want to add more columns (even from other column families), you just need to add them to the table schema and to hbase.columns.mapping. For example:
CREATE TABLE hbase_hive(key string, firstname string, age string, a string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,hb:a")
TBLPROPERTIES("hbase.table.name" = "hl");
see https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-MultipleColumnsandFamilies
I see a couple of issues (more or less serious) with what you wrote:
First of all, I would create an EXTERNAL TABLE instead
You are creating a Hive table with only 3 columns but expecting 4 in the end
You are not explicitly mapping the :key
Your data for 'firstname' and 'age' looks like wild random numbers! :|
I could not test it but the following should be a better starting point:
CREATE EXTERNAL TABLE hbase_hive_hl(key string, firstname string, age string, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");
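If the mapping lines up with the HBase table, a quick sanity check of all four mapped columns (hypothetical, untested):
SELECT key, firstname, age, name FROM hbase_hive_hl LIMIT 10;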

Loading data into Hive table from Notepad

I have loaded data into a Hive table from Notepad. It shows that the data was copied, but when I run a select query it shows NULL. Please let us know what could be the reason.
hive> create table test_sq(k string, v string) stored as sequencefile;
hive> load data local inpath '/tmp/input.txt' into table test_sq;
OK
hive> select * from tesst_t;
OK
NULL NULL
NULL NULL
Notepad: assuming the file is plain text, whereas you have specified the table as sequencefile.
Your create table script should be:
create table test_sq(k string, v string) row format delimited fields terminated by '<delimiter>';
I'm not sure if it is just a typo, but you are querying a different table (tesst_t) instead of the table that you loaded (test_sq).
Can you provide a sample line from your text file?
If you are using tab as the delimiter then you can just use create table test_sq(k string, v string);. In other cases, as venkat has mentioned, use create table test_sq(k string, v string) row format delimited fields terminated by 'single_character_delimiter'. This will work even with the tab delimiter ('\t').
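Putting it together, a minimal sketch assuming /tmp/input.txt is a plain tab-delimited text file (hypothetical sample data; swap in your real delimiter):
-- plain text table; the delimiter must match the file exactly
create table test_sq(k string, v string)
row format delimited fields terminated by '\t'
stored as textfile;
load data local inpath '/tmp/input.txt' into table test_sq;
-- note: query test_sq, not tesst_t
select * from test_sq;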

AWS EMR Hive partitioning unable to recognize any type of partitions

I am trying to process some log files on a bucket in amazon s3.
I create the table :
CREATE EXTERNAL TABLE apiReleaseData2 (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Then I run my HiveQL query and get my desired output in the file without any issues. My directories are set up in the following manner:
s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/ < All the log files for this day >
What I want to do is specify the LOCATION only up to 's3://apireleasecandidate1/regression/transferstatistics/' and then call the
ALTER TABLE <Table Name> ADD PARTITION (<path>)
statement or the
ALTER TABLE <Table Name> RECOVER PARTITIONS;
statement to access the files in the subdirectories. But when I do this there is no data in my table.
I tried the following :
CREATE EXTERNAL TABLE apiReleaseDataUsingPartitions (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';
and then I run the following ALTER command :
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31');
But running a SELECT statement on the table gives no results.
Can someone please guide me on what I am doing wrong?
Am I missing something important?
Cheers
Tanzeel
In HDFS anyway, the partitions manifest in a key/value format like this:
hdfs://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31
I can't vouch for S3 but an easy way to check would be to write some data into a dummy partition and see where it creates the file.
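For example, a hedged way to check with a throwaway partition (hypothetical partition values):
-- register a dummy partition, then look at the location Hive assigns to it
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='1900', month='01', day='01');
DESCRIBE FORMATTED apiReleaseDataUsingPartitions PARTITION (year='1900', month='01', day='01');
-- the Location field should default to <table LOCATION>/year=1900/month=01/day=01
ALTER TABLE apiReleaseDataUsingPartitions DROP PARTITION (year='1900', month='01', day='01');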
ADD PARTITION supports an optional LOCATION parameter, so you might be able to deal with this by saying
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Again I've not dealt with S3 but would be interested to hear if this works for you.
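Alternatively, if the S3 objects can be laid out using the year=/month=/day= convention under the table LOCATION, the recovery statements should pick everything up; a hedged sketch:
-- with objects under s3://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31/
ALTER TABLE apiReleaseDataUsingPartitions RECOVER PARTITIONS;
-- (on newer Hive versions: MSCK REPAIR TABLE apiReleaseDataUsingPartitions;)
SELECT COUNT(*) FROM apiReleaseDataUsingPartitions WHERE year='2013' AND month='12' AND day='31';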

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field where only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they're not for the primary key.
However, DynamoDB does not allow empty strings nor empty sets as described in the data model.
To work around this, I think you have a couple of options:
Define a constant for empty strings like "n/a", and make sure that your data extraction process treats missing values as such (a substitution sketch follows the filtering example below).
You could also filter these records, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
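And for the first option, a sketch of substituting a constant at insert time (the select list is ordered to match dynamo_events' column order; "n/a" is just an example placeholder):
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
       data,
       type,
       CASE WHEN android_network_carrier = '' THEN 'n/a' ELSE android_network_carrier END
FROM json_events
WHERE created_at = 20130114 AND type = 'e';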
