Why does Kafka Connect treat timestamp columns differently? - jdbc

I have a Kafka Connect configuration set up to pull data from DB2. I'm not using Avro, just the out-of-the-box JSON converter. Among the columns in the DB are several timestamp columns, and when they are streamed, they come out like this:
"Process_start_ts": 1578600031762,
"Process_end_ts": 1579268248183,
"created_ts": 1579268247984,
"updated_ts": {
"long": 1579268248182
}
}
The last column is rendered with this sub-element, though the other 3 are not. (This will present problems for the consumer.)
The only thing I can see is that in the DB, that column alone has a default value of null.
Is there some way I can force this column to render in the message as the prior 3?

Try flattening your message using a Kafka Connect transformation (SMT).
The configuration snippet below shows how to use the Flatten transform to concatenate field names with the period (.) delimiter character (you have to add these lines to the connector config):
"transforms": "flatten",
"transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.flatten.delimiter": "."
As a result, your JSON message should look like this:
{
"Process_start_ts": 1578600031762,
"Process_end_ts": 1579268248183,
"created_ts": 1579268247984,
"updated_ts.long": 1579268248182
}
See the example Flatten SMT for JSON.

I'm not sure created is any different from the first two. The value is still a long; only the key is all lower case. It's not clear how it would know what the default should be. Are you sure you're not using the AvroConverter? If not, it's not clear which fields would have defaults.
The updated time is nested like that based on the Avro (or structured JSON) Kafka Connect conventions, which include the type name as part of the record to explicitly denote the type of a nullable field.
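If the records are in fact going through Avro somewhere, a nullable column typically becomes a union in the schema, and the JSON encoding of a union wraps the value with the branch type name, which is exactly the {"long": ...} shape above. As a sketch (the field name is from the example; the default is an assumption based on the column's null default):
{ "name": "updated_ts", "type": ["null", "long"], "default": null }
A non-nullable column would just have "type": "long", which is why the other three fields come out as plain numbers.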

Related

How to use the field cardinality repeating in Render-CSV BW step?

I am building a generic CSV output module with a variable number of columns. The Data Format resource in BW (5.14) lets you define a repeating item and thus offers a list of items that I could use to map data to in the Render CSV step.
But when I run this with data for more than one column (and loops), only one column is generated.
Is the feature broken, or am I using it wrongly?
Alternatively, I defined "enough" optional columns in the data format and mapped each field separately - not really a generic solution.
It looks like in BW 5, when using Data Format and Parse Data to parse text, repeating elements aren't supported.
Please see https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-27133
The workaround is to use the Data Format resource, Parse Data, and Mapper activities together. First use Data Format and Parse Data to parse the text into XML where every element represents one line of the text. Then use the Mapper activity and the tib:tokenize-allow-empty XSLT function to tokenize every line and get sub-elements for each field in the line.
The linked article also has a workaround implementation attached.

Migrating table with PutDatabaseRecord with different column name at the target table

I need to migrate the data from a DB2 table to an MSSQL table, but one column has a different name (same datatype).
Db2 table:
NROCTA,NUMRUT,DIASMORA2
MSSQL table:
NROCTA,NUMRUT,DIAMORAS
As you can see, DIAMORAS is different.
I'm using the following flow:
ExecuteSQL -> SplitAvro -> PutDatabaseRecord
In PutDatabaseRecord I have an AvroReader as the Record Reader, configured this way:
Schema Access Strategy: Use Embedded Avro Schema.
Schema Text: ${avro.schema}
The flow only inserts the first two columns. How can I map the DIASMORA2 column to the DIASMORAS column?
Thanks in advance!
First thing, you probably don't need SplitAvro in your flow at all, unless there's some logical subset of rows that you are trying to send as individual transactions.
For the column name change, use UpdateRecord and set the field /DIASMORAS to the record path /DIASMORA2, and change the name of the field in the AvroRecordSetWriter's schema from DIASMORA2 to DIASMORAS.
That last part is a little trickier since you are using the embedded schema in your AvroReader. If the schema will always be the same, you can stop the UpdateRecord processor and put in an ExtractAvroMetadata processor to extract the avro.schema attribute. That will put the embedded schema in the flowfile's avro.schema attribute.
Then before you start UpdateRecord, start the ExecuteSQL and ExtractAvroMetadata processors, then inspect a flow file in the queue to copy the schema out of the avro.schema attribute. Then in your AvroRecordSetWriter in ConvertRecord, instead of Inheriting the schema, you can choose to Use Schema Text, then paste in the schema from the attribute, changing DIASMORA2 to DIASMORAS. This approach puts values from the DIASMORA2 field into the DIASMORAS field, but since DIASMORA2 is not in the output schema, it is ignored, thereby effectively renaming the field (although under the hood it is a copy-and-remove).
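As an illustration, the schema text pasted into the AvroRecordSetWriter might look something like the sketch below, with only the field name changed (the record name and column types here are assumptions; copy whatever the embedded schema in the avro.schema attribute actually shows):
{
  "type": "record",
  "name": "mytable",
  "fields": [
    { "name": "NROCTA",    "type": ["null", "long"] },
    { "name": "NUMRUT",    "type": ["null", "string"] },
    { "name": "DIASMORAS", "type": ["null", "long"] }
  ]
}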

Nifi add attribute from DB

I am currently getting files from FTP in NiFi, but I have to check some conditions before I fetch the file. The scenario goes something like this.
List FTP -> Check Condition -> Fetch FTP
In the Check Condition part, I have to fetch some values from the DB and compare them with the file name. So can I use UpdateAttribute to fetch some records from the DB and make it like this?
List FTP -> Update Attribute (from DB) -> Route on Attribute -> Fetch FTP
I think your flow looks something like below.
Flow:
1. ListFTP //to list the files
2. ExecuteSQL //to execute a query in the DB (sample query: select max(timestamp) db_time from table)
3. ConvertAvroToJSON //convert the result of ExecuteSQL to JSON format
4. EvaluateJsonPath //keep Destination as flowfile-attribute and add a new property db_time as $.db_time
5. RouteOnAttribute //check the filename timestamp vs the extracted timestamp using NiFi expression language
6. FetchFile //if the condition is true, then fetch the file
RouteOnAttribute Configs:
I have assumed the filename is something like fn_2017-08-2012:09:10 and ExecuteSQL has returned 2017-08-2012:08:10.
Expression:
${filename:substringAfter('_'):toDate("yyyy-MM-ddHH:mm:ss"):toNumber()
:gt(${db_time:toDate("yyyy-MM-ddHH:mm:ss"):toNumber()})}
With the above expression, the filename value is the one set by ListFTP, the db_time attribute is the one added by the EvaluateJsonPath processor, and we convert both timestamps to numbers before comparing them.
Refer to this link for more details regarding the NiFi expression language.
If I understand your use case correctly, you are using the external DB only for tracking purposes, so only the latest processed timestamp is needed. In that case, I would suggest you use the DistributedMapCache processors and controller services offered by NiFi instead of relying on an external DB.
With this method, your flow would be like:
ListFile --> FetchDistributedMapCache --(success)--> RouteOnAttribute -> FetchFile
Configure FetchDistributedMapCache
Cache Entry Identifier - This is the key for your Cache. Set it to something like lastProcessedTime
Put Cache Value In Attribute - Whatever name you give here will be added as a FlowFile attribute with its value being the Cache value. Provide a name, like latestTimestamp or lastProcessedTime
Configure RouteOnAttribute
Create a new dynamic property (which becomes a routing relationship) by clicking the (+) button in the Properties tab. Give it a name, like success or matches. Let's assume your filenames are of the format somefile_1534824139, i.e. a name, then an _, then the epoch timestamp appended.
In that case, you can leverage the NiFi Expression Language and make use of the functions it offers. For the new dynamic relationship, you can have an expression like:
success - ${filename:substringAfter('_'):gt(${lastProcessedTimestamp})}
This is with the assumption that, in FetchDistributedMapCache, you have configured the property Put Cache Value In Attribute with the value lastProcessedTimestamp.
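One addition worth noting (this part is not in the answer above, so treat it as a sketch): for the cache to stay current, the latest processed timestamp also has to be written back after a successful fetch. PutDistributedMapCache stores the FlowFile content under the configured key, so one way is:
FetchFile --(success)--> ReplaceText --> PutDistributedMapCache
ReplaceText - Replacement Strategy: Always Replace, Replacement Value: ${filename:substringAfter('_')}
PutDistributedMapCache - Cache Entry Identifier: lastProcessedTime (the same key used in FetchDistributedMapCache), pointing at the same DistributedMapCacheClientService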
Useful Links
https://community.hortonworks.com/questions/83118/how-to-put-data-in-putdistributedmapcache.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates

retrieve a record using Dynamics 365 Web API with empty alternate key

Problem
I'm trying to query records from Dynamics 365 using the Web API where a key attribute is null.
Approach
For testing purposes I created an entity with a string attribute, an integer attribute, and a decimal attribute. I then created an alternate key and put those three attributes in it, making this combination of attributes unique. I then created some records.
Querying the data using the Web API works as expected. I created a record with those values:
{
"my_string_key": "s1",
"my_integer_key": 1,
"my_decimal_key": 1.23
}
And query it like this:
/my_entities(my_string_key='s1',my_integer_key=1,my_decimal_key=1.23)
This returns the desired record.
However, I can't get it to work when any of the key fields is empty; it always returns a 400 with the message "Bad Request - Error in query syntax". Just for clarification: I purposely created records where one of the key attributes is empty, like this:
{
"my_integer_key": 1,
"my_decimal_key": 1.23
}
{
"my_string_key": "s1",
"my_decimal_key": 1.23
}
{
"my_string_key": "s1",
"my_integer_key": 1
}
Notice how each record is missing one of the key attributes, which effectively leaves it at null.
I tried using =null, =empty, ='', = and =Microsoft.OData.Core.ODataNullValue but nothing works. I tried to query them like this:
/my_entities(my_string_key=null,my_integer_key=1,my_decimal_key=1.23)
/my_entities(my_string_key='s1',my_integer_key=null,my_decimal_key=1.23)
/my_entities(my_string_key='s1',my_integer_key=1,my_decimal_key=null)
I also tried omitting the attribute with null in it, like this:
/my_entities(my_integer_key=1,my_decimal_key=1.23)
/my_entities(my_string_key='s1',my_decimal_key=1.23)
/my_entities(my_string_key='s1',my_integer_key=1)
None of this works. Can anyone provide a solution?
Final Words
I know that key attributes shouldn't be empty because that's not the point of key attributes. My scenario, however, requires handling this specific case.
I also know that I could just use the $filter parameter, but I'd like a solution without having to resort to this "workaround". After all, it is a key and I'd like to treat it as one.
I think your efforts prove that key attributes cannot be empty. So, time for another approach.
Create a custom attribute and fill it with the combined values of the three key attributes. You can do this in a plugin registered in the pre-operation stage of the Create (and possibly Update) messages of your entity.
Use the custom attribute as the key attribute.
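For example, if the plugin wrote the three values into a single attribute called my_combined_key (a hypothetical name, using a pipe as separator), the retrieve collapses to a single-attribute alternate key that is never empty:
/my_entities(my_combined_key='s1|1|1.23')
/my_entities(my_combined_key='|1|1.23')
The second URL would match the record whose string part is empty.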

How to get sorted rows out of Cassandra when using RandomPartitioner and Hector as client?

To improve my skills with Hector and Cassandra, I'm trying different methods to query data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in reverse chronological order, with the most recently posted message first.
In plain SQL it is possible to use ORDER BY. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating a secondary index on a column with a timestamp as its value, but I can't figure out how to obtain the data. I'm sure that I would have to use at least two queries.
My column family looks like this:
create column family messages
with comparator = UTF8Type
and key_validation_class=LongType
and compression_options =
{sstable_compression:SnappyCompressor, chunk_length_kb:64}
and column_metadata = [
{column_name: message, validation_class: UTF8Type},
{column_name: index, validation_class: DateType, index_type: KEYS}
];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible, I'd like to know how it's done with the CQL syntax and without.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
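As a rough Hector sketch of that idea (not from the answer above, and the column family layout is assumed): say there is a column family message_buckets with a LongType comparator, where the row key is a day bucket such as "2012-06-14", each column name is the message timestamp in milliseconds, and the column value is the message text. A reversed slice query then returns the newest messages in that bucket first:

import me.prettyprint.cassandra.serializers.LongSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class LatestMessages {
    public static void main(String[] args) {
        // Cluster and keyspace names are placeholders for this sketch.
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MessageKeyspace", cluster);

        // Slice one day bucket of "message_buckets", newest columns first.
        SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
        query.setColumnFamily("message_buckets");
        query.setKey("2012-06-14");           // which time bucket (row) to read
        query.setRange(null, null, true, 50); // reversed = true -> latest 50 messages

        QueryResult<ColumnSlice<Long, String>> result = query.execute();
        for (HColumn<Long, String> column : result.get().getColumns()) {
            System.out.println(column.getName() + " -> " + column.getValue());
        }
    }
}

To page further back in time, you would run the same query against the previous bucket, or pass the last column name you saw as the range start.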
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
