Filter an imported dataset in Azure Data Factory

I have a JSON file which I need to filter down to only show the data for the last 2 days.
Is there a way to add an expression to do this so that I can sink the dataset which contains data from the last 2 days?
Also, can it be done using the filter option in a pipeline or am I required to create a dataflow for this sort of problem?

I agree with @Mark Kromer: you should use a Data Flow. It has a Filter activity and can achieve this more easily.
The filter needs to parse/inspect the data inside the file and possibly traverse hierarchies.
Here is an example which filters for data with date > "2020-12-01":
Filter: (screenshot)
Output preview: (screenshot)
Filter based on your date column to keep only the data from the last 2 days.
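For reference, a Filter expression that keeps only the last 2 days could look roughly like this (a sketch; eventDate is a hypothetical name for the date column in your source, and you may need toDate()/toTimestamp() with a format string depending on how the dates are stored):

    toDate(eventDate) >= addDays(currentDate(), -2)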

Related

Trying to add a field based on conditionals with NiFi

I'm new to Apache NiFi and currently using it to route message data to various locations. I'm looking to add some fields based on a set of conditionals.
Currently I have a GetFile processor that reads log files ---> ExtractGrok that applies a grok pattern to parse them ---> ConvertRecord to convert from Grok to JSON. The next part is where I'm stumped.
In my JSON I have a field refresh_time. I need to create 2 new fields based on conditions on refresh_time,
something along the lines of if refresh_time < 10 then cache = 1; else if refresh_time > 10 then reprocess = 1
The end goal here is numeric fields cache and reprocess that can be used down the road in aggregations.
What would be the best way to add 2 numerical fields based on a condition? Is there a processor for adding additional fields or updating the record to include new fields?
Thanks.
There are a couple of ways you could achieve what you want.
One option (More readable)
A QueryRecord processor lets you write a SQL statement across your Records and split them by the result. E.g.
Add a dynamic property called cache with a value SELECT * FROM FLOWFILE WHERE refresh_time < 10.
Add a dynamic property called refresh with a value SELECT * FROM FLOWFILE WHERE refresh_time > 10.
The QueryRecord will now have the relationships failure, original, cache and refresh.
Branching off from cache and refresh will be one UpdateRecord each, with Replacement Value Strategy set to Literal Value.
For the cache relationship, you can add a new dynamic property called cache with a value 1. For the refresh relationship, you can add a new dynamic property called refresh with a value 1.
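Put together, the two-processor setup could look roughly like the sketch below. Note that UpdateRecord dynamic property names are RecordPaths, so the new fields are addressed as /cache and /reprocess (the /reprocess name follows the question's fields rather than the wording above; adjust to taste):

    QueryRecord
      cache   = SELECT * FROM FLOWFILE WHERE refresh_time < 10
      refresh = SELECT * FROM FLOWFILE WHERE refresh_time > 10

    UpdateRecord (fed from the 'cache' relationship)
      Replacement Value Strategy = Literal Value
      /cache = 1

    UpdateRecord (fed from the 'refresh' relationship)
      Replacement Value Strategy = Literal Value
      /reprocess = 1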
Similar option (Possibly more performant)
If you want to avoid the additional UpdateRecord, you can add fields in the QueryRecord with something like this:
Two dynamic properties set as:
cache = SELECT *, 1 AS cache FROM FLOWFILE WHERE refresh_time < 10
reprocess = SELECT *, 1 AS reprocess FROM FLOWFILE WHERE refresh_time > 10
This option may be more performant due to fewer disk reads.
This gist is an example of the second option; you can import it into NiFi to try it out.
Also, FYI there is a GrokReader that you could use in ConvertRecord to parse with Grok straight to JSON, potentially skipping the ExtractGrok.

How to set start and end rows, or an interval of rows, for CSV in NiFi?

I want to get a particular part of an Excel file in NiFi. My NiFi template looks like this:
GetFileProcessor
ConvertExcelToCSVProcessor
PutDatabaseRecordProcessor
I need to parse the data between steps 2 and 3.
Is there a solution for getting specific rows and columns?
Note: if there is an option to trim the output within ConvertExcelToCSVProcessor, that would also work for me.
You can use Record processors between ConvertExcelToCSV and PutDatabaseRecord.
To remove or override a column, use UpdateRecord. This processor can receive your data via a CSVReader and prepare output for PutDatabaseRecord or QueryRecord. Check View usage -> Additional Details...
To filter rows or columns, use QueryRecord.
Here is an example: it receives data through a CSVReader and does some aggregations; you can do filtering in the same way, according to the docs.
This post also helped me understand Records in NiFi.
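For instance, to keep only certain columns and a certain row range, a QueryRecord dynamic property could hold a query along these lines (a sketch; col_a, col_b and the row numbers are hypothetical, and the exact OFFSET/FETCH syntax depends on the Calcite SQL dialect QueryRecord uses):

    SELECT col_a, col_b
    FROM FLOWFILE
    OFFSET 4 ROWS FETCH NEXT 10 ROWS ONLY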

How to use the repeating field cardinality in the Render CSV BW step?

I am building a generic CSV output module with a variable number of columns. The Data Format resource in BW (5.14) lets you define a repeating item, and thus offers a list of items that I can map data to in the Render CSV step.
But when I run this with data for more than one column (and loops), only one column is generated.
Is the feature broken, or am I using it wrongly?
As an alternative, I defined "enough" optional columns in the data format and mapped each field separately, but that is not a really generic solution.
It looks like in BW 5, when using Data Format and Parse Data to parse text, repeating elements aren't supported.
Please see https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-27133
The workaround is to use Data Format resource, Parse Data and Mapper
activities together. First use Data Format and Parse Data to parse the
text into the xml where every element represents one line of the text.
Then use Mapper activity and tib:tokenize-allow-empty XSLT function to
tokenize every line and get sub-elements for each field in the lines.
The linked article also has a workaround implementation attached.

Hasura GraphQL how to group query by month and year?

Is it possible in GraphQL or Hasura to group the results by month or year? I'm currently getting the result list back as a flat array, sorted by the date attribute of the model. However, I'd like to get back 12 subarrays corresponding to each month of the year.
From the docs: this is not natively supported.
Derived data or data transformations lead to views. Using the PostgreSQL EXTRACT function you can have a separate month field derived from the date ... but still as a flat array.
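As a sketch of that view approach (table and column names are made up; adapt them to your schema, then track the view in Hasura so it becomes queryable):

    CREATE VIEW events_with_month AS
    SELECT
      e.*,
      EXTRACT(YEAR  FROM e.created_at) AS year,
      EXTRACT(MONTH FROM e.created_at) AS month
    FROM events AS e;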
With some deeper customization you can probably achieve the desired result ... but GraphQL [tree, array] structures are more for embedding, not for views ...
How many records are you processing? Hundreds? Client-side conversion (done easily from Apollo Client data at the React component/container [view] level) may be good enough [especially with the extracted month field].
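A minimal client-side grouping sketch (plain JavaScript; rows and created_at are placeholder names for the query result and its date field):

    // rows: flat, date-sorted array as returned by the GraphQL query
    const rows = [{ created_at: "2021-01-05" }, { created_at: "2021-02-10" }];
    const byMonth = {};
    for (const row of rows) {
      const key = row.created_at.slice(0, 7); // "YYYY-MM"
      (byMonth[key] = byMonth[key] || []).push(row);
    }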
PS. You can get many result sets grouped in arrays if you 'glue' several queries together at the top level (copies of the same query, each filtered to one month) ... but that's probably not a recommended solution.

Add a field to a document

In particular, we have tons of log messages constantly coming in with the following format:
Jul 23 09:24:16 mmr mmr-core[5147]: Aweg3AOMTs_1563866656876839.mt
Jul 23 09:24:18 mmr mmr-core[5210]: Aweg3AOMTs_1563866656876839.0.dn
There are different id numbers (1563866656876839) and two possible suffixes (mt/dn).
We parse it with logstash and store these messages in one index.
When an id number with the mt suffix gets a dn suffix within 1 hour, it means GOOD, and the document should get a new field status with the value approved. If not, the field value should be disapproved.
So in the end a new index isn't needed :D But I'm still curious how to achieve this, and whether it is even possible to create and fill a new field in a document based on a time condition, so to speak...
Thank you for your reply!
Yes, it is possible to create and fill a new field in a document based on a time condition.
First you have to create three aggregate filters with task_id set to your id: one filter to create a map, a second filter to submit the map as an event, and a last filter with the timeout option for your timeout scenario.
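A condensed Logstash sketch of that idea (the id and suffix field names are assumed to already be extracted by your grok pattern; the timeout handling is collapsed into the first filter, and the exact status logic will likely need tuning):

    filter {
      if [suffix] == "mt" {
        aggregate {
          task_id => "%{id}"
          code => "map['status'] = 'disapproved'"
          map_action => "create"
          # if no matching dn event arrives within 1 hour, emit the map as a 'disapproved' event
          push_map_as_event_on_timeout => true
          timeout => 3600
          timeout_task_id_field => "id"
        }
      }
      if [suffix] == "dn" {
        aggregate {
          task_id => "%{id}"
          code => "map['status'] = 'approved'; event.set('status', 'approved')"
          map_action => "update"
          end_of_task => true
        }
      }
    }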
