I am successfully able to generate an Avro file from Sqoop directly.
However, when I look at the schema definition of the generated Avro file, I see:
{
"type" : "record",
"name" : "sqoop_import_QueryResult",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "blabla",
"type" : [ "string", "null" ],
"columnName" : "blabla",
"sqlType" : "12"
}, {
"name" : "blabla",
"type" : [ "string", "null" ],
"columnName" : "blabla",
"sqlType" : "12"
} ]
}
I wonder if I could change the name and the doc to something more meaningful than sqoop_import_QueryResult and "Sqoop import of QueryResult".
Is this possible?
I found out that this is easy and it's quite possible. In your Sqoop command, just pass --class-name; whatever you put after this will become the name of the Avro schema.
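For example, a command roughly like this (the connection string, query, and paths are just placeholders) should produce a schema whose record name is MyMeaningfulRecord:
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/.password \
  --query 'SELECT * FROM my_table WHERE $CONDITIONS' \
  --target-dir /data/my_table \
  --as-avrodatafile \
  --class-name MyMeaningfulRecord \
  -m 1
The generated Avro schema should then show "name" : "MyMeaningfulRecord" instead of the autogenerated name.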
I have the following NiFi flow, with which I am struggling to generate a date out of a Unix timestamp. I have not been able to find a solution since last year :(
First of all, I receive a file from a Kafka processor.
The data comes as text and looks as follows:
exclsns1,1671785280,1671785594,1671785608.
The next step is to use a ConvertRecord processor and generate a Parquet file out of these incoming files.
For that, I have generated the following schemas:
Record Reader --> CSV Reader:
{
"type" : "record",
"name" : "spark_schema",
"fields" : [ {
"name" : "excelReader",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "time",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "starttime",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "endtime",
"type" : [ "null", "string" ],
"default" : null
} ]
}
Record Writer --> Parquet Record Set Writer:
{
"type" : "record",
"name" : "spark_schema",
"fields" : [ {
"name" : "excelReader",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "time",
"type" : [ "null", "long" ],
"default" : null
}, {
"name" : "starttime",
"type": { "type":"int", "logicalType":"date"},
"default" : null
}, {
"name" : "endtime",
"type": { "type":"long", "logicalType":"timestamp-millis"},
"default" : null
} ]
}
Notice that I have tried different types for the data, but none of them solved my issue.
The next step is to go into a PartitionRecord Processor, in which I use a ParquetReader and the same Parquet Record Set Writer controllers.
Besides that, I have defined 6 properties to help me identify why the data is not converted as expected:
a_endtime --> /endtime
a_endtime_converted --> format(/endtime, "yyyy/MM/dd/HH", "GMT")
a_startime --> /starttime
a_startime_converted --> format(/starttime, "yyyy/MM/dd/HH", "GMT")
a_time --> /time
a_time_converted --> format(/time, "yyyy/MM/dd/HH", "GMT")
However, once the flowfile gets on the Success Queue after PartitionRecord, I have the following values:
a_endtime
1671785608
a_endtime_converted
1970/01/20/08
a_startime
1671785594
a_startime_converted
1970/01/20/08
a_time
1671785280
a_time_converted
1970/01/20/08
1671785608 = Friday, December 23, 2022 8:53:28 AM
1671785594 = Friday, December 23, 2022 8:53:14 AM
1671785280 = Friday, December 23, 2022 8:48:00 AM
What am I doing wrong that the same date is generated for every value? Has anybody else faced a similar issue and could give me a hint on how to solve it?
Thank you :)
Unix time is counted in seconds since 1/1/1970.
NiFi is based on Java, and Java time is counted in milliseconds since 1/1/1970.
So you just have to multiply your value by 1000 before formatting it to a date.
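For instance, if the seconds value were copied into a flowfile attribute (the attribute name endtime here is just an example), a NiFi Expression Language expression like this gives the expected result:
${endtime:multiply(1000):format('yyyy/MM/dd/HH', 'GMT')}
With endtime = 1671785608 this evaluates to 2022/12/23/08 instead of 1970/01/20/08.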
I am new to Avro and I am trying to import data in Avro format from SQL Server to HDFS.
Error: org.kitesdk.data.DatasetOperationException: Failed to append {"id": "D22C2475", "create_date": "2020-08-22 14:34:06.0", "modified_date": "2020-08-22 14:34:06.0"} to ParquetAppender{path=hdfs://nameservice1/tmp/schema/.temp/job_1597813536070/mr/attempt_1597813536070_m_000000_0/.d55262cf-e49b-4378-addc-0f85698efb47.parquet.tmp, schema={"type":"record","name":"AutoGeneratedSchema","doc":"Sqoop import of QueryResult","fields":[{"name":"id","type":["null","string"],"default":null,"columnName":"id","sqlType":"1"},{"name":"create_date","type":["null","long"],"default":null,"columnName":"create_date","sqlType":"93"},{"name":"modified_date","type":["null","long"],"default":null,"columnName":"modified_date","sqlType":"93"}],"tableName":"QueryResult"}, fileSystem=DFS[DFSClient[clientName=DFSClient_attempt_1597813536070_m_000000_0_960843231_1, ugi=username (auth:SIMPLE)]], avroParquetWriter=parquet.avro.AvroParquetWriter@7b122839}
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
TABLE -
CREATE TABLE `ticket`(
id string,
create_date string,
modified_date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'avro.schema.url'='hdfs://nameservice1/user/hive/warehouse/schema.db/ticket/.metadata/schemas/1.avsc',
'kite.compression.type'='snappy');
Avro file metadata - hdfs://nameservice1/user/hive/warehouse/schema.db/ticket/.metadata/schemas/1.avsc
{
"type" : "record",
"name" : "AutoGeneratedSchema",
"doc" : "Sqoop import of QueryResult",
"fields" : [ {
"name" : "id",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "id",
"sqlType" : "1"
}, {
"name" : "create_date",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "create_date",
"sqlType" : "93"
}, {
"name" : "modified_date",
"type" : [ "null", "string" ],
"default" : null,
"columnName" : "modified_date",
"sqlType" : "93"
}],
"tableName" : "QueryResult"
}
I fixed the issue. There was a problem with my Avro metadata file. I recreated it and added it to the Hive table with the command below.
alter table table_name set serdeproperties ('avro.schema.url' = 'hdfs://user/hive/warehouse/schema.db/table_name/1.avsc');
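For reference, another way to avoid this kind of type mismatch in the first place is to force Sqoop to map the datetime columns to Java String, so the generated Avro schema matches the string-typed Hive columns. This is only a sketch; the connection details and paths are placeholders:
sqoop import \
  --connect 'jdbc:sqlserver://dbhost;database=mydb' \
  --username myuser \
  --password-file /user/me/.password \
  --table ticket \
  --map-column-java create_date=String,modified_date=String \
  --as-parquetfile \
  --target-dir /user/hive/warehouse/schema.db/ticket \
  -m 1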
I want to create nested JSON from CSV using NiFi.
CSV file:
"Foo",12,"newyork","North avenue","123213"
"Foo1",12,"newyork","North avenue","123213"
"Foo2",12,"newyork","North avenue","123213"
Required JSON:
{
"studentName":"Foo",
"Age":"12",
"address__city":"newyork",
"address":{
"address__address1":"North avenue",
"address__zipcode":"123213"
}
}
I am using the NiFi 1.4 ConvertRecord processor and applying an Avro schema, but I am not able to get the nested JSON.
Avro schema:
{
"type" : "record",
"name" : "MyClass",
"namespace" : "com.test.avro",
"fields" : [ {
"name" : "studentName",
"type" : "string"
}, {
"name" : "Age",
"type" : "string"
}, {
"name" : "address__city",
"type" : "string"
}, {
"name" : "address",
"type" : {
"type" : "record",
"name" : "address",
"fields" : [ {
"name" : "address__address1",
"type" : "string"
}, {
"name" : "address__zipcode",
"type" : "string"
} ]
}
} ]
}
You will need to:
Split your flowfile into individual records using SplitRecord
Convert from CSV to flat JSON files using ConvertRecord
Use the JoltTransformJSON processor to transform your flat JSON into nested JSON objects in your desired format, for example with a shift spec like the one below.
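A JOLT shift spec along these lines (field names taken from the schema above; treat it as a sketch) nests the two address fields under an address object:
[ {
  "operation" : "shift",
  "spec" : {
    "studentName" : "studentName",
    "Age" : "Age",
    "address__city" : "address__city",
    "address__address1" : "address.address__address1",
    "address__zipcode" : "address.address__zipcode"
  }
} ]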
I am curious how a TSV file should look when we are ingesting data from a local TSV file using Druid.
Should it just be like:
Please note this is just for testing:
quickstart/sample_data.tsv file:
name	lastname	email	time
Bob	Jones	bobj#gmail.com	1468839687
Billy	Jones	BillyJ#gmail.com	1468839769
Where this part is my dimensions: name lastname email
And this part is my actual data: Bob Jones bobj#gmail.com 1468839687 Billy Jones BillyJ#gmail.com 1468839769
{
"type" : "index_hadoop",
"spec" : {
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "static",
"paths" : "quickstart/sample_data.tsv"
}
},
"dataSchema" : {
"dataSource" : "local",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none",
"intervals" : ["2016-07-18/2016-07-18"]
},
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "tsv",
"dimensionsSpec" : {
"dimensions" : [
"name",
"lastname",
"email"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "time"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "added",
"type" : "longSum",
"fieldName" : "deleted"
}
]
}
}
}
I had some questions about my spec file as well, since I was not able to find the answers to them in the docs. I would appreciate it if someone could answer them for me :)!
1) I noticed in the example spec they added the line "type" : "index_hadoop" at the very top. What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory? Also, where can I read about the different values I should put for this "type" key in the docs? I didn't find an explanation for that.
2) Again there is a type variable in the ioConfig: "type" : "hadoop". What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory?
3) For the timestampSpec, the time in my TSV file is in GMT. Is there any way I can use this as the format? I read that you should convert it to UTC; is there a way to convert it to UTC while posting the data to the Overlord, or will I have to change all of those GMT times to something like "time":"2015-09-12T00:46:58.771Z"?
Druid supports two ways of ingesting batch data:
Hadoop Index Task
Index Task
The spec you are referring to is for a Hadoop Index Task, hence the "type" is "index_hadoop" and the ioConfig type is "hadoop".
Here is a sample spec for an index task which can read from a local file:
{
"type": "index",
"spec": {
"dataSchema": {
"dataSource": "wikipedia",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": ["page", "language"]
}
}
},
"metricsSpec": [{
"type": "count",
"name": "count"
}, {
"type": "doubleSum",
"name": "added",
"fieldName": "added"
}],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "NONE",
"intervals": ["2013-08-31/2013-09-01"]
}
},
"ioConfig": {
"type": "index",
"firehose": {
"type": "local",
"baseDir": "examples/indexing/",
"filter": "wikipedia_data.json"
}
}
}
}
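For the local TSV file from the question, the parseSpec would use the tsv format with an explicit columns list. This is just a sketch; the column order must match the file, and since the time column is in epoch seconds, "posix" seems the right timestamp format:
"parseSpec": {
  "format": "tsv",
  "delimiter": "\t",
  "columns": ["name", "lastname", "email", "time"],
  "timestampSpec": {
    "column": "time",
    "format": "posix"
  },
  "dimensionsSpec": {
    "dimensions": ["name", "lastname", "email"]
  }
}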
The Avro schema looks like this:
{
"type" : "record",
"name" : "name1",
"fields" :
[
{
"name" : "f1",
"type" : "string"
},
{
"name" : "f2",
"type" :
{
"type" : "array",
"items" :
{
"type" : "record",
"name" : "name2",
"fields" :
[
{
"name" : "time",
"type" : [ "float", "int", "double", "long" ]
}
]
}
}
}
]
}
After reading it in Pig:
grunt> A = load 'data' using AvroStorage();
grunt> DESCRIBE A;
A: {f1: chararray,f2: {ARRAY_ELEM: (time: (FLOAT: float,INT: int,DOUBLE: double,LONG: long))}}
What I want is probably a bag of (f1:chararray, timestamp:double). This is what I did:
grunt> B = FOREACH A GENERATE f1, f2.time AS timestamp;
grunt> DESCRIBE B;
B: {f1: chararray,timestamp: {(time: (FLOAT: float,INT: int,DOUBLE: double,LONG: long))}}
So how do I flatten this record?
I'm new to Pig and Avro, and I don't know whether what I'm trying to do even makes sense. Thanks for your help.