XML Serde for Hadoop/Hive

I used JSONSerde to process huge amounts of JSON data stored on S3 using Amazon EMR. One of my clients has a requirement to process massive XML data, but I couldn't find any XML SerDe to use with Hive.
Have you folks processed XML with Hive? I would appreciate your suggestions and comments before I start building my own XML SerDe.

I use the following XML parsing SerDe in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);

The XML SerDe jar can be downloaded from:
http://central.maven.org/maven2/com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.0.0/hivexmlserde-1.0.0.0.jar
Put this jar file in /usr/lib/hive/lib.
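If you cannot (or would rather not) copy it into the Hive lib directory, the jar can usually be registered per session instead (the path below is just wherever you saved the download):
ADD JAR /tmp/hivexmlserde-1.0.0.0.jar;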
Once you are done with this, you can use the XML SerDe:
CREATE TABLE xml_bank(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/#customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
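For reference, a record shaped roughly like the following (values made up for illustration) would map onto that table: customer_id comes from the attribute, income from the element text, and the children of demographics and financial are collected into the two maps:
<record customer_id="c-123">
  <income>75000</income>
  <demographics>
    <gender>female</gender>
    <agecat>3</agecat>
  </demographics>
  <financial>
    <loan>yes</loan>
    <mortgage>no</mortgage>
  </financial>
</record>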

Related

Hive SerDe with "\u0000" as delimiter - can't get it to work

I have a dataset whose fields are delimited by the NUL character ("\u0000"). The SerDe sits on top of an S3 location and looks something like this:
CREATE EXTERNAL TABLE `default.ga_serde_test`(
column1 string,column2 string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3a://xxxxxxx/inbound/xxx'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'numFiles'='0',
'numRows'='-1',
'quoteChar'='\"',
'rawDataSize'='-1',
'separatorChar'="\000",
'totalSize'='0'
)
I tried \000, \0, ^#, and NULL as separatorChar - none of them worked. The data is all loaded into the first column, leaving the second column empty.
Could anyone advise?
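For comparison, separatorChar and quoteChar are usually supplied to OpenCSVSerde through WITH SERDEPROPERTIES rather than TBLPROPERTIES. A minimal sketch of that layout (untested against a NUL-delimited file; whether the '\000' escape is honoured may depend on the Hive version):
CREATE EXTERNAL TABLE `default.ga_serde_test`(
column1 string, column2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'='\000',
'quoteChar'='"'
)
STORED AS TEXTFILE
LOCATION 's3a://xxxxxxx/inbound/xxx';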

msck repair table not working on unpartitioned table - hive config issue

I have an unpartitioned EXTERNAL table:
CREATE EXTERNAL TABLE `db.tableName`(
`sid` string,
`uid` int,
`t1` timestamp,
`t2` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<db_location>/tableName'
TBLPROPERTIES (
'serialization.null.format'='',
'transient_lastDdlTime'='1551121065')
When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero.
There are 10 rows in the CSV and I expect to get the count back as 10.
Any help is appreciated.
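For reference, the sequence described amounts to something like this (placeholders as in the question):
-- after copying tableName.csv into s3://<db_location>/tableName/
MSCK REPAIR TABLE db.tableName;    -- MSCK only discovers partition directories under the table location
SELECT COUNT(*) FROM db.tableName; -- returns 0 instead of the expected 10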

Replicating table setup from ORC to Parquet

I have the following table definition with ORC that I would like to replicate to Parquet (there are more fields I am not showing):
CREATE EXTERNAL TABLE `test_a`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://cluster/user/myuser/test_a'
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.create.index'='true',
'orc.stripe.size'='130023424',
'orc.row.index.stride'='64000');
I was wondering how I can replicate this in Parquet. I would like to use ZLIB or something similar for compression, I would like to have indexes, and potentially tune some of the TBLPROPERTIES for Parquet.
CREATE EXTERNAL TABLE `test_b`(
`some_id` int,
`sha_sum` string,
`parent_sha_sum` string,
`md5_sum` string
)
PARTITIONED BY (
`server_date` date
)
CLUSTERED BY (
sha_sum
)
SORTED BY (
sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true'
)
Is there a list of all of the options available for Parquet through TBLPROPERTIES?
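For compression specifically, the Parquet analogue of orc.compress is typically the parquet.compression table property (GZIP being the nearest equivalent to ZLIB; SNAPPY and UNCOMPRESSED are other common values). A minimal sketch, assuming test_b has already been created as above:
ALTER TABLE test_b SET TBLPROPERTIES ('parquet.compression'='GZIP');
-- or, equivalently, add 'parquet.compression'='GZIP' to TBLPROPERTIES at creation time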

what is the use of serde in HIVE

Hi, I'm a beginner with Hive and I found the below in some sample code. Can someone help me understand this piece of code?
CREATE EXTERNAL TABLE emp (
id bigint,
name string,
dept bigint,
salary bigint)
partitioned by (yearofjoining string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='|',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3n://xxx/xxxx/xxx/xxx/xx'
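Loosely speaking, the SerDe tells Hive how to turn the bytes of each file into rows (and back). With LazySimpleSerDe and field.delim='|', each line under the S3 location is split on '|' into id, name, dept, and salary, while yearofjoining comes from the partition directory name rather than from the file itself. A made-up input line would look like:
1001|John Smith|42|55000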

Hive array with hbase in binary type with RegexSerDe

I tried to create a table using RegexSerDe because my data is bytes, and some of the bytes collide with the default delimiter.
CREATE External TABLE f10(key string, arr array<string> )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES("field.delimited"="[,]")
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:arr2" )
TBLPROPERTIES ("hbase.table.name"="f");
but I get this error:
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[1] named arr has type array<string>)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Any ideas?
Any good delimiter?
Or any good SerDe for this?
Hive version 11.
A storage handler comes with its own serde and input/output formats. I am not sure if specifying your own serde along with a storage handler will work.
I am trying to find an answer to a similar problem. Multi-byte delimiters in an HBase table key or value are difficult to manage with Hive.
