Apache NiFi from unix timestamp to actual date not working - apache-nifi

I have the following NiFi flow, in which I am struggling to generate a date out of a unix timestamp, and I have not been able to find a solution since last year :(
First of all, I receive a file from a Kafka Processor.
The data comes in as text and looks as follows:
exclsns1,1671785280,1671785594,1671785608.
The next step is to use a ConvertRecord and generate a Parquet File out of these incoming files.
For that, I have generated the following schemas:
Record Reader --> CSV Reader:
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "excelReader",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "time",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "starttime",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "endtime",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
Record Writer --> Parquet Record Set Writer
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "excelReader",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "time",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "starttime",
    "type" : { "type" : "int", "logicalType" : "date" },
    "default" : null
  }, {
    "name" : "endtime",
    "type" : { "type" : "long", "logicalType" : "timestamp-millis" },
    "default" : null
  } ]
}
Notice that I have tried different types for the data, but none of them solved my issue.
The next step is to go into a PartitionRecord Processor, in which I use a ParquetReader and the same Parquet Record Set Writer controllers.
Besides that, I have defined 6 properties to help me identify why the data is not converted as expected:
a_endtime --> /endtime
a_endtime_converted --> format(/endtime, "yyyy/MM/dd/HH", "GMT")
a_startime --> /starttime
a_startime_converted --> format(/starttime, "yyyy/MM/dd/HH", "GMT")
a_time --> /time
a_time_converted --> format(/time, "yyyy/MM/dd/HH", "GMT")
However, once the flowfile gets on the Success Queue after PartitionRecord, I have the following values:
a_endtime --> 1671785608
a_endtime_converted --> 1970/01/20/08
a_startime --> 1671785594
a_startime_converted --> 1970/01/20/08
a_time --> 1671785280
a_time_converted --> 1970/01/20/08
1671785608 = Friday, December 23, 2022 8:53:28 AM
1671785594 = Friday, December 23, 2022 8:53:14 AM
1671785280 = Friday, December 23, 2022 8:48:00 AM
What am I doing wrong that causes the same date to be generated for every value? Has anybody else faced a similar issue and could give me a hint on how to solve it?
Thank you :)

Unix time is counted in seconds since 1/1/1970.
NiFi is based on Java, and Java time is counted in milliseconds since 1/1/1970.
So you just have to multiply your value by 1000 before formatting it as a date.
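For example, this can be done with an UpdateRecord processor placed before PartitionRecord (a sketch, not the only option; the field names are the ones from your schema). With Replacement Value Strategy set to "Literal Value", ${field.value} refers to the value of the field named by the property, so the three timestamp fields can be rewritten from seconds to milliseconds:
/time --> ${field.value:multiply(1000)}
/starttime --> ${field.value:multiply(1000)}
/endtime --> ${field.value:multiply(1000)}
After that, the existing format(/endtime, "yyyy/MM/dd/HH", "GMT") expressions in PartitionRecord should produce 2022/12/23/08 for your sample values instead of 1970/01/20/08.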

Related

Elasticsearch conditional query : search nested if not exists check default value on parent

I'm new to Elasticsearch. I have a question about creating conditional search queries between nested and parent types.
I have a mapping below named "homes"
PUT /homes/_mapping
{
  "properties": {
    "name": { "type": "text" },
    "ownerID": { "type": "integer" },
    "status": { "type": "text" }, // active or inactive
    "minDays": { "type": "integer" },
    "dayData": {
      "type": "nested",
      "properties": {
        "date": { "type": "date" },
        "minDays": { "type": "integer" },
        "closed": { "type": "boolean" }
      }
    }
  }
}
Homeowners can define custom minDays for homes day by day. Of course, there is a default value for minDays in the parent.
I want to make a search query by date range
For example, users can search in homes by start and end date. If there is dayData defined with minDays on the start date, it will take the value here, if not, the parent value will be taken for filtering.
How can I achieve that?
Update
I have documents like below
For example, if a user is searching availability for the dates between "2021-01-01" and "2021-01-04", the total day count is "3" for the requested dates. We need to search for homes which have a minDays value greater than or equal to "3".
In this situation:
I need to get docs whose status is "active".
The min day value for _doc/1 is "2", because there is a custom dayData value with minDays. So this property won't be available.
The min day value for _doc/2 is "4". We get the default value from the parent, because there is no custom dayData value. This property will be available.
If the dayData includes a "closed" value of "true" between the searched dates, the home won't be available.
I have tried several search queries to achieve that but failed for this context.
Thank you in advance.
PUT homes/_doc/1
{
  "name" : "Home Test",
  "status" : "active",
  "ownerID" : 1,
  "minDays" : "3",
  "dayData" : [
    {
      "date" : "2021-01-01",
      "minDays" : 2
    },
    {
      "date" : "2021-01-04",
      "minDays" : 5
    },
    {
      "date" : "2021-01-05",
      "closed" : true
    }
  ]
}
PUT homes/_doc/2
{
  "name" : "Home Test 2",
  "status" : "active",
  "ownerID" : 2,
  "minDays" : "4",
  "dayData" : [
    {
      "date" : "2021-05-02",
      "minDays" : 2
    },
    {
      "date" : "2021-05-03",
      "minDays" : 2
    },
    {
      "date" : "2021-05-03",
      "minDays" : 2,
      "closed" : true
    },
    {
      "date" : "2021-05-10",
      "minDays" : 5
    }
  ]
}
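One possible shape for such a query, using the example dates above, is sketched below (a sketch only; it assumes the requested day count is 3, treats the end date as exclusive for the "closed" check, and looks up the nested minDays on the start date):
GET /homes/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "status": "active" } }
      ],
      "must_not": [
        {
          "nested": {
            "path": "dayData",
            "query": {
              "bool": {
                "must": [
                  { "range": { "dayData.date": { "gte": "2021-01-01", "lt": "2021-01-04" } } },
                  { "term": { "dayData.closed": true } }
                ]
              }
            }
          }
        }
      ],
      "should": [
        {
          "nested": {
            "path": "dayData",
            "query": {
              "bool": {
                "must": [
                  { "term": { "dayData.date": "2021-01-01" } },
                  { "range": { "dayData.minDays": { "gte": 3 } } }
                ]
              }
            }
          }
        },
        {
          "bool": {
            "must": [
              { "range": { "minDays": { "gte": 3 } } }
            ],
            "must_not": [
              {
                "nested": {
                  "path": "dayData",
                  "query": {
                    "bool": {
                      "must": [
                        { "term": { "dayData.date": "2021-01-01" } },
                        { "exists": { "field": "dayData.minDays" } }
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
With the two sample documents, this should return _doc/2 but not _doc/1, matching the expected behaviour described above.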

CSV to nested JSON using NiFi

I want to create nested JSON from CSV using NiFi.
CSV file:
"Foo",12,"newyork","North avenue","123213"
"Foo1",12,"newyork","North avenue","123213"
"Foo2",12,"newyork","North avenue","123213"
Required JSON:
{
  "studentName": "Foo",
  "Age": "12",
  "address__city": "newyork",
  "address": {
    "address__address1": "North avenue",
    "address__zipcode": "123213"
  }
}
I am using the NiFi 1.4 ConvertRecord processor and applying an Avro schema, but I am not able to get the nested JSON.
Avro schema:
{
  "type" : "record",
  "name" : "MyClass",
  "namespace" : "com.test.avro",
  "fields" : [ {
    "name" : "studentName",
    "type" : "string"
  }, {
    "name" : "Age",
    "type" : "string"
  }, {
    "name" : "address__city",
    "type" : "string"
  }, {
    "name" : "address",
    "type" : {
      "type" : "record",
      "name" : "address",
      "fields" : [ {
        "name" : "address__address1",
        "type" : "string"
      }, {
        "name" : "address__zipcode",
        "type" : "string"
      } ]
    }
  } ]
}
You will need to:
1) Split your flowfile into individual records using SplitRecord.
2) Convert from CSV to flat JSON files using ConvertRecord.
3) Use the JOLT transform processor to transform your flat JSON into nested JSON objects in your desired format; see the shift spec sketched below.
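For the JOLT step, a shift spec along these lines should work (a sketch based on the field names in the question):
[
  {
    "operation": "shift",
    "spec": {
      "studentName": "studentName",
      "Age": "Age",
      "address__city": "address__city",
      "address__address1": "address.address__address1",
      "address__zipcode": "address.address__zipcode"
    }
  }
]
This keeps studentName, Age, and address__city at the top level and moves the other two fields under a nested address object, which matches the required JSON above.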

time-based when configure an index pattern not working

Hi!
I have an issue with setting a date field as time-based when I configure my index pattern. When I choose my date field as the time field name, I cannot visualize any data in the Discover part.
However, when I uncheck the box named Index contains time-based events, all data appears.
Maybe I forgot something in my mapping? Here is the mapping I've set for this index:
"index_test" : {
"mappings": {
"tr": {
"_source": {
"enabled":true
},
"properties" : {
"id" : { "type" : "integer" },
"volume" : { "type" : "integer" },
"high" : { "type" : "float" },
"low" : { "type" : "float" },
"timestamp" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" }
}
}
}'
}
I am currently trying to use Timelion as well, and it does not seem to find any data to show. I think it cannot because of this time-based box being unchecked... Any idea how to set this timestamp as time-based without losing data access in the Discover part?
Simple question with a simple answer... I just forgot to set the time picker in the top right of the Discover part to show past data.

How should a TSV file be formatted in DRUID?

I am curious how a TSV file should look when we are ingesting data from a local TSV file using DRUID.
Should it just be like:
Please note this is just for testing:
quickstart/sample_data.tsv file:
name      lastname    email               time
Bob       Jones       bobj#gmail.com      1468839687
Billy     Jones       BillyJ#gmail.com    1468839769
Where this part is my dimensions: name lastname email
And this part is my actual data: Bob Jones bobj#gmail.com 1468839687 Billy Jones BillyJ#gmail.com 1468839769
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "quickstart/sample_data.tsv"
      }
    },
    "dataSchema" : {
      "dataSource" : "local",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-18/2016-07-18"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "dimensionsSpec" : {
            "dimensions" : [
              "name",
              "lastname",
              "email"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "deleted"
        }
      ]
    }
  }
}
I had some questions about my spec file as well, since I was not able to find the answers to them in the docs. I would appreciate it if someone could answer them for me :)!
1) I noticed in the example spec they added the line "type" : "index_hadoop" at the very top. What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory? Also, where can I read about the different values I should put for this "type" key in the docs? I didn't find an explanation for that.
2) Again, there is a type variable in the ioConfig: "type" : "hadoop". What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory?
3) For the timestampSpec, the time in my TSV file is in GMT. Is there any way I can use this as the format? I read that you should convert it to UTC; is there a way to convert to UTC while posting the data to the Overlord, or will I have to change all of those GMT times to UTC, similar to this: "time":"2015-09-12T00:46:58.771Z"?
Druid supports two ways of ingesting batch data:
1) Hadoop Index Task
2) Index Task
The spec you are referring to is for a Hadoop Index Task, hence its "type" is "index_hadoop" and its ioConfig type is "hadoop".
Here is a sample spec for an Index Task, which can read from a local file:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["page", "language"]
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }, {
        "type": "doubleSum",
        "name": "added",
        "fieldName": "added"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "examples/indexing/",
        "filter": "wikipedia_data.json"
      }
    }
  }
}
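For a local TSV file like the one in the question, the same Index Task skeleton applies; roughly speaking, only the parseSpec and the firehose change. A sketch (it assumes the column order name, lastname, email, time from the sample file, and that the time values are POSIX seconds, which is what "posix" tells the timestampSpec):
"parseSpec": {
  "format": "tsv",
  "columns": ["name", "lastname", "email", "time"],
  "timestampSpec": {
    "column": "time",
    "format": "posix"
  },
  "dimensionsSpec": {
    "dimensions": ["name", "lastname", "email"]
  }
}
and the ioConfig section points at the local file:
"ioConfig": {
  "type": "index",
  "firehose": {
    "type": "local",
    "baseDir": "quickstart/",
    "filter": "sample_data.tsv"
  }
}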

Customize the AVRO file generated by Sqoop

I am successfully able to generate an Avro file from Sqoop directly.
However, when I look at the schema definition of the generated Avro file, I see:
{
  "type" : "record",
  "name" : "sqoop_import_QueryResult",
  "doc" : "Sqoop import of QueryResult",
  "fields" : [ {
    "name" : "blabla",
    "type" : [ "string", "null" ],
    "columnName" : "blabla",
    "sqlType" : "12"
  }, {
    "name" : "blabla",
    "type" : [ "string", "null" ],
    "columnName" : "blabla",
    "sqlType" : "12"
  } ]
}
I wonder if I could change the name and the doc to something more meaningful than sqoop_import_QueryResult and "Sqoop import of QueryResult".
Is this possible?
I found out that this is easy and it's quite possible. In your sqoop command, just pass "--class-name"; whatever you put after it will become the name of the Avro schema.
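For example (a sketch; the connection string, credentials, query, and paths below are placeholders, not values from the question):
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser \
  --password-file /user/me/db.password \
  --query 'SELECT id, name FROM customers WHERE $CONDITIONS' \
  --target-dir /data/customers_avro \
  --as-avrodatafile \
  --class-name Customer \
  -m 1
The generated Avro schema should then use Customer as its record name instead of sqoop_import_QueryResult.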
