Mapping generic JSON field with Redshift Spectrum - hadoop

I am trying to cast a variable-type JSON field in Redshift Spectrum as a plain string, but I keep getting "declared column type VARCHAR for column STRUCT is incompatible".
The JSON data I am trying to query has several fields whose structure is fixed and expected. However, there is one metadata field which is a JSON object with no specific format (anything is valid). For example:
{"fixed_integer": 1, "fixed_date": "2019-01-01", "metadata": {"one": "two", "three": 4}}
{"fixed_integer": 1, "fixed_date": "2019-01-01", "metadata": {"five": [1, 2], "six": false}}
I can create the table with the DDL
CREATE EXTERNAL TABLE my_data(
fixed_integer int,
fixed_date varchar,
metadata varchar
)
without complaints, but when I try to query the data with a simple SELECT metadata FROM my_data I get
declared column type VARCHAR for column STRUCT is incompatible.
I could not find a workaround for it so far.
Has anyone faced this or a similar problem?

The metadata field is not a valid varchar; for it to be read as a varchar, the record would have to look like this
"metadata": '{"one": "two", "three": 4}'
which is not correct JSON.
I think if you create your external table with metadata as a struct you can query it:
CREATE EXTERNAL TABLE my_data(
fixed_integer int,
fixed_date varchar,
metadata struct<details:varchar(4000)>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties (
'dots.in.keys' = 'true'
)
stored as textfile
location '<s3 location>';
When you query the metadata field, you need to reference the struct member:
SELECT metadata.details FROM my_data
Let me know if this works for you.
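A caveat, based on my understanding of how the openx JsonSerDe matches struct fields by key name (treat this as an assumption, not something verified against your data): metadata.details will only resolve to a value when the record actually nests it under a details key, e.g.
{"fixed_integer": 1, "fixed_date": "2019-01-01", "metadata": {"details": "{\"one\": \"two\", \"three\": 4}"}}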

Related

Hive SerDe initialize() function called 8 times

I'm writing a custom Hive SerDe that can read data stored as ORC.
The underlying table contains an event_map column of type MAP<STRING, STRING>. The SerDe reads data from the event_map column and explodes it into separate columns after some additional processing. E.g. the map would be {'key1': 'val1', 'key2': 'val2'}.
Then I create an external table like:
ADD JAR hdfs:///user/my-serde.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.expand_data_using_serde (
key1 STRING,
key2 STRING
) PARTITIONED BY (
ds STRING
)
ROW FORMAT SERDE 'com.....MySerDeClass'
WITH SERDEPROPERTIES (
"key1" = "extract_json",
"key2" = "multiply_by_5"
)
STORED AS ORC
LOCATION 'hdfs://dev.db/underlying_table_stored_as_orc';
USE dev_db;
MSCK REPAIR TABLE expand_data_using_serde;
Then the SerDe finds the 'key1' and 'key2' columns, extracts them from the underlying map, and exposes them as separate columns instead of a MAP. I know we can explode this column through a Hive query, but I'm providing this example for simplicity. That is, the SerDe checks whether the table DDL has key1 as a column, and if so extracts key1 from the event_map and uses it for that column.
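To make that concrete, the plain-Hive query the SerDe replaces would look roughly like the sketch below (the source table name is inferred from the LOCATION path above and may not match the real table):
-- Rough plain-Hive equivalent of the SerDe's output, for illustration only
SELECT
event_map['key1'] AS key1,
event_map['key2'] AS key2,
ds
FROM underlying_table_stored_as_orc;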
Most of the processing happens in the deserialize() function of the SerDe, but I notice that my initialize() method is being called 8 times.
Why is that? deserialize() is called once per row, which makes sense, but initialize() is called repeatedly.
Also, the SerDe has access to the output table schema through the "columns" property; can it access the query as well?
E.g., if my query is
SELECT key1 FROM dev_db.expand_data_using_serde;
Can the SerDe know that we only need to read key1 and skip the processing for key2?

Create AWS Athena table from Parquet file with an array of structs as a column

I am trying to create an AWS Athena table from a Parquet file stored in S3 using the following declaration, for example:
create table "db"."fufu" (
foo array<
struct<
bar: int,
bam: int
>
>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://yada/yada/'
TBLPROPERTIES ('has_encrypted_data'='false');
I consistently get the following error:
line 3:11: mismatched input '<' expecting {'(', 'array', '>'} (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: ...)
The syntax seems legit, and the file loads perfectly fine using Spark's Parquet library, which reads foo as an array of structs.
Any idea what can cause this error?
You need to remove the double quotes from the database name and from the table name. You also need to add EXTERNAL before TABLE.
create external table db.fufu (
foo array<
struct<
bar: int,
bam: int
>
>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://eth-test-ds/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
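Once the table exists, the nested array can be flattened at query time with UNNEST; here is a quick sketch (Athena's Presto engine lets you access the struct fields by name):
-- Flatten the array of structs into one row per element
SELECT item.bar, item.bam
FROM db.fufu
CROSS JOIN UNNEST(foo) AS t(item);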

Import multiple column families from hbase to hive

I am trying to move an HBase table having two column families into a Hive table. I am able to map one column family, but how can I map the other one into the same Hive table?
Edit:
I moved one column family using the code below:
CREATE TABLE hbase_hive(key string, firstname string, age string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age")
TBLPROPERTIES("hbase.table.name" = "hl");
but I have one more column family named hb with three columns. How do I achieve this?
Update:
I also tried adding a column name from a different column family; below is my code:
CREATE TABLE hbase_hive(key string, firstname string, age string, testname string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");
but I am getting the result below:
819215975 19391121 625678921720 NULL
819617215 19570622 625116365890 NULL
820333876 19640303 623221670810 NULL
824794938 19531211 625278010070 NULL
828093442 19420803 625284904860 NULL
828905771 19320209 625078004220 NULL
829832017 19630722 625178010070 NULL
Instead of values, I am getting NULLs.
Update:
I tried creating the HBase table using the command below in the HBase shell:
create 'hl','id'
then I created one more column family using the command below:
alter 'hl','pd'
In your HiveQL, you select two columns from column family "id" of HBase table "hl" into the Hive table. If you want to add more columns (even from other column families), you just need to add them to the table schema and to hbase.columns.mapping. For example:
CREATE TABLE hbase_hive(key string, firstname string, age string, a string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "id:firstname,id:age,hb:a")
TBLPROPERTIES("hbase.table.name" = "hl");
see https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-MultipleColumnsandFamilies
I see a couple of issues (more or less serious) with what you wrote:
First of all, I would create an EXTERNAL TABLE instead
You are creating a Hive table with only 3 columns but expecting 4 in the end
You are not explicitly mapping the :key
Your data for 'firstname' and 'age' looks like wild random numbers! :|
I could not test it but the following should be a better starting point:
CREATE EXTERNAL TABLE hbase_hive_hl(key string, firstname string, age string, name string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:firstname,id:age,pd:name")
TBLPROPERTIES("hbase.table.name" = "hl");
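If you also need the three columns from the hb family, the same pattern extends; I don't know their qualifiers, so c1/c2/c3 below are placeholders to replace with the real ones:
CREATE EXTERNAL TABLE hbase_hive_hl(key string, firstname string, age string, name string, hb_c1 string, hb_c2 string, hb_c3 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:firstname,id:age,pd:name,hb:c1,hb:c2,hb:c3")
TBLPROPERTIES("hbase.table.name" = "hl");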

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field where only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they're not used for the primary key.
However, DynamoDB does not allow empty strings nor empty sets, as described in the data model.
To work around this, I think you have a couple of options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such.
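For example, here is a sketch of that first option applied directly in the INSERT (the column order follows the dynamo_events DDL above; adjust the placeholder as you see fit):
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
CASE WHEN data IS NULL OR data = '' THEN 'n/a' ELSE data END,
type,
CASE WHEN android_network_carrier IS NULL OR android_network_carrier = '' THEN 'n/a' ELSE android_network_carrier END
FROM json_events
WHERE created_at = 20130114 AND type = 'e';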
You could also filter these records out, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";

HIVE order by messes up data

In Hive 0.8 with Hadoop 1.03 consider this table:
CREATE TABLE table (
key int,
date timestamp,
name string,
surname string,
height int,
weight int,
age int)
CLUSTERED BY(key) INTO 128 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Then I tried:
select *
from table
where key=xxx
order by date;
The result is sorted, but everything after the name column is wrong. In fact, all the rows have the exact same values in the respective fields and the surname column is missing. I also have a bitmap index on name and surname and an index on key.
Is there something wrong with my query, or should I be looking into bugs about ORDER BY? (I can't find anything specific.)
It seems like there has been an error in loading data into Hive. Make sure you don't have any special characters in your CSV file that might interfere with the insertion.
Also, you have clustered by the key property. Where does this key come from: the CSV, or some other source? Are you sure that it is unique?
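As a first sanity check, I would run the same filter without the ORDER BY and see whether the columns are already misaligned straight off the load (a sketch using the same placeholder key as above):
-- If surname is already wrong or missing here, the problem is the data load, not ORDER BY
select *
from table
where key=xxx;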

Resources