Hive SerDe initialize() function called 8 times

I'm writing a custom Hive SerDe that reads data stored as ORC.
The underlying table contains an event_map column of type MAP<STRING, STRING>. The SerDe reads data from the event_map column and explodes it into separate columns after some additional processing. E.g., the map might be {'key1': 'val1', 'key2': 'val2'}.
Then I'll create an external table like:
ADD JAR hdfs:///user/my-serde.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.expand_data_using_serde (
key1 STRING,
key2 STRING
) PARTITIONED BY (
ds STRING
)
ROW FORMAT SERDE 'com.....MySerDeClass'
WITH SERDEPROPERTIES (
"key1" = "extract_json",
"key2" = "multiply_by_5"
)
STORED AS ORC
LOCATION 'hdfs://dev.db/underlying_table_stored_as_orc';
USE dev_db;
MSCK REPAIR TABLE expand_data_using_serde;
Then the SerDe finds the 'key1' and 'key2' columns, extracts them from the underlying map, and exposes them as separate columns instead of a MAP. I know we can explode this column through a Hive query, but I'm providing this example for simplicity. I.e., the SerDe checks whether the table DDL has key1 as a column and, if so, extracts key1 from the event_map and uses it for that column.
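For reference, here's a minimal sketch of the shape my SerDe takes (class and helper names are illustrative, not my actual code; the "columns" key is the standard Hive table property, and the SERDEPROPERTIES arrive as plain entries in the same Properties object passed to initialize()):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Writable;

public class MySerDeClass extends AbstractSerDe {
  private List<String> columnNames;   // DDL columns, e.g. [key1, key2]
  private Map<String, String> rules;  // column -> rule from SERDEPROPERTIES

  private ObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // "columns" holds the comma-separated DDL column list: "key1,key2"
    columnNames = Arrays.asList(tbl.getProperty("columns").split(","));
    // SERDEPROPERTIES show up as ordinary entries, e.g. key1 -> extract_json
    rules = new HashMap<>();
    for (String col : columnNames) {
      rules.put(col, tbl.getProperty(col));
    }
    // Expose every declared column as a STRING
    List<ObjectInspector> ois = new ArrayList<>();
    for (int i = 0; i < columnNames.size(); i++) {
      ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, ois);
  }

  @Override
  public Object deserialize(Writable row) throws SerDeException {
    // Pull event_map out of the ORC row, apply each column's rule, and emit
    // one value per declared column. applyRule is a placeholder for my processing.
    List<String> out = new ArrayList<>();
    for (String col : columnNames) {
      out.add(applyRule(rules.get(col), row));
    }
    return out;
  }

  private String applyRule(String rule, Writable row) {
    return null; // placeholder: e.g. extract_json / multiply_by_5 on the map value
  }

  @Override public ObjectInspector getObjectInspector() { return inspector; }
  @Override public Class<? extends Writable> getSerializedClass() { return null; }
  @Override public Writable serialize(Object obj, ObjectInspector oi) { return null; }
  @Override public SerDeStats getSerDeStats() { return null; }
}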
The majority of the processing is in the deserialize() function of the SerDe. But I notice that my initialize() method is being called 8 times.
Why is that? deserialize() is called once per row, which makes sense, but initialize() is called repeatedly.
Also, the SerDe has access to the output table schema through the "columns" property; can it access the query as well?
E.g., if my query is
SELECT key1 FROM dev_db.expand_data_using_serde;
Can the SerDe know that we only need to read key1 and skip the processing for key2?
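For illustration, Hive's column pruner publishes the projected column list in the job configuration that initialize() receives; here is a hedged sketch of checking it (whether "hive.io.file.readcolumn.names" is populated by the time a custom SerDe initializes is an assumption to verify on your Hive version and engine):
// Inside initialize(Configuration conf, Properties tbl):
// "hive.io.file.readcolumn.names" is the key used by Hive's ColumnProjectionUtils.
// An empty value may simply mean "no pruning info", so fall back to all columns.
String projected = conf.get("hive.io.file.readcolumn.names", "");
Set<String> needed = projected.isEmpty()
    ? new HashSet<>(columnNames)
    : new HashSet<>(Arrays.asList(projected.split(",")));
// deserialize() could then skip the processing rule for any column not in `needed`.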

Related

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on a CSV file. One of the issues I am coming across is values that contain multiple commas within the string itself. For example, the CSV file contains the following:
[image: CSV file]
When I create an external table in Hive, because there are commas within the "name" column, it shifts the first name to the right, adding another column. This throws all of the data off when you view the table in Hive.
[image: external table result in Hive]
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.tablename (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution - you need to add this line: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

How to make hive table match data using column names and not using ordinal positions

If I have a CSV like -
colName1,colName2
col1Value,col2Value
and a Hive DDL like -
CREATE EXTERNAL TABLE tableName (
col2 STRING,
col1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
tblproperties ("skip.header.line.count"="1");
-- SELECT col2 FROM tableName; gives col1Value
This is obviously because, in the case of text files, Hive matches columns to data fields by ordinal position. If the underlying file is Parquet, the match is done using column names.
I was wondering if there is a Hive SerDe someone has written, or maybe some SerDe property I am missing, that tells Hive to map data field names to Hive table column names, such that in the above example it would return "col2Value" when col2 is queried, even though the ordinal position of col2 in the Hive table and in the data file does not match.
Thanks in advance!
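For the Parquet case mentioned above, the name-versus-position behavior is controlled by a table property; a hedged sketch (the property and its default can vary by Hive version, so verify before relying on it):
ALTER TABLE tableName SET TBLPROPERTIES ('parquet.column.index.access'='false');
-- 'false' matches Parquet fields to Hive columns by name; 'true' matches by ordinal position.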

Use Comma(,) as delimiter in rows in ORC file

I am creating an ORC file in Java. For each row I want the fields to be comma delimited. Here is my Java code:
ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
    String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
this.mWriter = OrcFile.createWriter(fs, fsPath, config, inspector, stripSize,
    CompressionKind.ZLIB, bufferSize, 0);
this.mWriter.addRow(new Text("shekhar,saha"));
this.mWriter.addRow(new Text("ram,shyam"));
this.mWriter.addRow(new Text("jhon,cena"));
this.mWriter.close();
Is this the right way of creating it?
I am trying to load data into a Hive table. This is how I have created my table:
create table demo (name1 STRING, name2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="ZLIB");
But I am not able to load the data. When I read data from the table, it throws a class cast exception: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text.
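Note that ORC is not a delimited text format: each addRow() call writes one typed struct, so FIELDS TERMINATED BY has no effect on an ORC table, and a single Text value like "shekhar,saha" is one string, not two columns. A hedged sketch using the same (older) writer API with a two-field struct via reflection; the Person class is hypothetical:
// Two typed fields per row instead of one comma-joined Text value
public class Person {
  public String name1;
  public String name2;
  public Person(String name1, String name2) { this.name1 = name1; this.name2 = name2; }
}

ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
    Person.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
this.mWriter = OrcFile.createWriter(fs, fsPath, config, inspector,
    stripSize, CompressionKind.ZLIB, bufferSize, 0);
this.mWriter.addRow(new Person("shekhar", "saha"));
this.mWriter.addRow(new Person("ram", "shyam"));
this.mWriter.close();
The matching DDL would then just be create table demo (name1 STRING, name2 STRING) STORED AS ORC, with no delimiter clause.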

How do I Insert data from text table (using MultiDelimitSerDe) to Avro Table?

I noticed that I can use an INSERT INTO statement from a text table to an Avro table when not using the MultiDelimitSerDe. It also works with ROW FORMAT DELIMITED FIELDS TERMINATED BY ",", i.e., a single character.
I create 2 tables - 1 text table and 1 Avro table:
CREATE TABLE example1 (
example STRING,
example2 STRING,
example3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="**")
STORED AS TEXTFILE;

CREATE TABLE example2 (
example STRING,
example2 STRING,
example3 STRING
)
STORED AS AVRO;
I then load data into the example1 table (the file is delimited by "**"), i.e.:
LOAD DATA INPATH 'HDFS-path' INTO TABLE example1;
example1 now has data in it. I want to insert the data from example1 into example2.
INSERT INTO TABLE example2 SELECT * from example1;
This, however, gives a "return code 2" error. I have no idea why I am unable to insert the data using the MultiDelimitSerDe but am able to do it with "ROW FORMAT DELIMITED FIELDS TERMINATED BY". But I need to use a multi-delimiter.
Could anyone help me please?
Have you added the required JAR file?
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' - make sure you have the required JAR file for this (hive-contrib.jar).
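If it isn't already registered in the session, a hedged one-liner (the path is an assumption; use your distribution's actual location):
ADD JAR /path/to/hive-contrib.jar;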

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field where only one contains data. Sometimes the data is null and sometimes it's an empty string; when trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they're not in the primary key.
However, DynamoDB does not allow empty strings nor empty sets as described in the data model.
To work around this, I think you have a couple of options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such (see the sketch after the query below).
You could also filter these records, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
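A sketch of the first option, substituting the constant at write time instead of dropping rows (the CASE expression is illustrative, not part of the original answer):
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
CASE WHEN android_network_carrier = '' THEN 'n/a' ELSE android_network_carrier END,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';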
