How to specify column descriptions in a Parquet schema definition - Hadoop

I am using Cascading to convert text-delimited files to Parquet and Avro files. I am not able to provide a description for columns in the Parquet metadata the same way Avro has it. This would be helpful when anyone using the data set wants some description of a field from the data set itself.
Below is the Parquet schema:
message LaunchApplication {
  required int32 field1;
  required binary field2;
  optional binary field3;
  required binary field4;
}
Below is the Avro schema:
{ "type":"record", "name":"CascadingAvroSchema", "namespace":"", "fields":[
{"name":"field1","type":"int","doc":"10,NOT NULL, KeyField"},
{"name":"field2","type":"string","doc":"5,NOT NULL, FLAG, Indicator},
{"name":"field3","type":["null","string"],"doc":"20,NULL, System Field."},
{"name":"field4","type":"string","doc":"20,NOT NULL,MM/DD/YYYY,Record Changed Date."} ]
}
How do I keep track of the "doc" section of the Avro schema in Parquet as well?

Actually, Parquet supports Avro schemas as well. If you use an Avro schema, Parquet will infer the Parquet schema from it and also store the Avro schema in the file metadata.
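As an illustration, here is a minimal sketch of writing with AvroParquetWriter, assuming the Avro schema above is saved as schema.avsc and parquet-avro is on the classpath (the file names and record values are made up). The complete Avro schema, including the "doc" entries, is kept in the Parquet footer key/value metadata, typically under the key parquet.avro.schema.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteWithDocs {
    public static void main(String[] args) throws Exception {
        // Parse the Avro schema shown above; the "doc" entries travel with it.
        Schema avroSchema = new Schema.Parser().parse(new File("schema.avsc"));

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("launch_application.parquet"))
                .withSchema(avroSchema)   // Parquet schema is derived from the Avro schema
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord record = new GenericData.Record(avroSchema);
            record.put("field1", 10);
            record.put("field2", "Y");
            record.put("field3", null);
            record.put("field4", "01/01/2024");
            writer.write(record);
        }
        // The full Avro schema (with the "doc" strings) ends up in the footer
        // metadata, so readers that use parquet-avro can recover the descriptions.
    }
}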

Related

ADF force format stored in parquet from copy activity

I've created an ADF pipeline that converts a delimited file to parquet in our data lake. I've added an additional column and set its value using the following expression: #convertfromutc(utcnow(),'GMT Standard Time','o'). The problem I am having is that when I look at the parquet file, the value comes back in the US format,
e.g. 11/25/2021 14:25:49
Even if I use #if(pipeline().parameters.LoadDate,json(concat('[{"name": "LoadDate" , "value": "',formatDateTime(convertfromutc(utcnow(),'GMT Standard Time','o')),'"}]')),NULL) to try to force the format on the extra column, it still comes back in the parquet file in the US format.
Any idea why this happens and how I can get this to output into parquet as a proper timestamp?
Mention the format pattern while using the convertFromUtc function, as shown below.
#convertFromUtc(utcnow(),'GMT Standard Time','yyyy-MM-dd HH:mm:ss')
A date1 column was added under Additional columns in the source to get the required date format.
In the source data preview in the mapping, the data shows the format given in the convertFromUtc function.
Output parquet file:
The data preview of the sink parquet file, after copying data from the source, shows the same format.

Can ParquetWriter or AvroParquetWriter store the schema separately?

Do you know whether ParquetWriter or AvroParquetWriter can store the schema separately, without the data?
Right now the schema is written into the parquet file:
ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(file.getName()))
    .withSchema(payload.getSchema())
    .build();
Do you know if it is possible to write only the data, without the schema, into a parquet file?
Thank you!
@ЭльфияВалиева: No, the Parquet metadata (schema) in the footer is necessary to give Parquet readers the schema they need to read the Parquet data.
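As a quick way to see that the schema always lives in the footer, here is a minimal sketch (the file path is hypothetical) that prints the footer schema with ParquetFileReader without reading any row groups:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintFooterSchema {
    public static void main(String[] args) throws Exception {
        // Open the file and inspect only its footer metadata.
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration()))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);  // the schema every Parquet reader relies on
        }
    }
}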

avro data validation with the schema

I'm new to data validation and the related concepts, so please excuse me if this is a simple question; please help me with the steps to achieve this.
Use case: Validating AVRO file (Structure and Data)
Inputs:
We are going to receive Avro files
We will have a schema file in a notepad (e.g. field name, data type, size, etc.)
Validation:
Need to validate the Avro file against the structure (schema, i.e. field, data type, size, etc.)
Need to validate number and decimal formats when viewing from Hive
So far, all I have been able to achieve is getting the schema from the Avro file using the Avro jar.
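For that last step, a minimal sketch (the file name data.avro is made up) of pulling the writer schema out of an Avro data file with the Avro Java API; the extracted fields can then be compared one by one against the expected schema from the notepad file:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaDump {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("data.avro");   // hypothetical input file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            // The writer schema is stored in the Avro file header.
            Schema schema = reader.getSchema();
            System.out.println(schema.toString(true));
            // Print each field's name, type, and doc for comparison with the expected schema.
            for (Schema.Field field : schema.getFields()) {
                System.out.println(field.name() + " : " + field.schema() + " : " + field.doc());
            }
        }
    }
}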

How do i use Sqoop to save data in a parquet-avro file format?

I need to move my data from a relational database to HDFS, but I would like to save the data in the parquet-avro file format. Looking at the sqoop documentation, it seems like my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of the blog/picture below, the way parquet-avro works is that it is a parquet file with the avro schema embedded, plus a converter to convert and save an avro object to a parquet file and vice versa.
My initial assumption is that if I use the sqoop option --as-parquetfile, then the data being saved to the parquet file will be missing the avro schema and the converter won't work. However, upon looking at the sqoop code that saves the data in the parquet file format, it does seem to be using a util related to avro, but I'm not sure what's going on. Could someone clarify? If I cannot do this with sqoop, what other options do I have?
parquet-avro is mainly a convenience layer so that you can read/write data that is stored in Apache Parquet as Avro objects. When you read the Parquet file again with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). Thus you should be fine with --as-parquetfile.
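To illustrate, a minimal sketch (the path is hypothetical) of reading a Parquet file produced with --as-parquetfile back as Avro GenericRecords via parquet-avro; the Avro schema is derived from the Parquet schema when no Avro schema is embedded:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadSqoopParquet {
    public static void main(String[] args) throws Exception {
        // Open the Parquet file that Sqoop wrote; parquet-avro converts each row
        // into an Avro GenericRecord.
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(new Path("/user/hive/warehouse/mytable/part-m-00000.parquet"))
                .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.getSchema().getFullName() + ": " + record);
            }
        }
    }
}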

Flume Hive sink failed to serialize JSON with array

I am trying to load JSON data into Hive via the Flume Hive sink, but it fails with the following error:
WARN org.apache.hive.hcatalog.data.JsonSerDe: Error [java.io.IOException: Field name expected] parsing json text [{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}].
INFO org.apache.flume.sink.hive.HiveWriter: Parse failed : Unable to convert byte[] record into Object : {"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Example of data:
{"id": "12345", "url": "https://mysite", "title": ["MyTytle"]}
Description of Hive table:
id string
url string
title array<string>
time string
# Partitions
time string
The same setup works fine if the JSON data doesn't contain arrays (and the Hive table has no array columns).
Flume version: 1.7.0 (Cloudera CDH 5.10)
Is it possible to load JSON data with arrays via the Flume Hive sink?
Is it possible to load JSON data with arrays via Flume Hive sink?
I assume it is possible, although I have never tried it myself. From:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_HDP_RelNotes/content/ch01s08s02.html
Following serializers are provided for Hive sink:
JSON: Handles UTF8 encoded Json (strict syntax) events and requires no
configuration. Object names in the JSON are mapped directly to columns
with the same name in the Hive table. Internally uses
org.apache.hive.hcatalog.data.JsonSerDe but is independent of the
Serde of the Hive table. This serializer requires HCatalog to be
installed.
So maybe you are implementing something wrong in the SerDe. This user solved the problem of serialising JSON with arrays by applying a regexp first:
Parse json arrays using HIVE
Another thing you may try is to change the SerDe. You have at least these two options (maybe there are more):
'org.apache.hive.hcatalog.data.JsonSerDe'
'org.openx.data.jsonserde.JsonSerDe'
(https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master)
