Can a task's start condition depend on the success of a PIPE in Snowflake?

I have a requirement where 3 different files are loaded into a single table by 3 different PIPEs. I want my target process to be triggered only once all 3 files have been loaded to my stage.
I don't want to run my target process multiple times.
So is there any way to make the start condition of a task depend on PIPE success?
I went through the documentation but didn't find any such info. Is there a way of implementing this that I might be missing?

The general way to implement this pattern is with streams. Your pipes would load to three separate tables, each with a stream on it. You can then have a task that runs on a schedule, with the WHEN parameter calling SYSTEM$STREAM_HAS_DATA three times, once per stream. Because the conditions are combined with AND, the task only runs once all three streams have data, that is, once all three pipes have loaded something. Example:
CREATE TASK mytask1
WAREHOUSE = mywh
SCHEDULE = '5 minute'
WHEN
SYSTEM$STREAM_HAS_DATA('MYSTREAM') AND SYSTEM$STREAM_HAS_DATA('MYSTREAM2')
AND SYSTEM$STREAM_HAS_DATA('MYSTREAM3')
AS
<Do stuff.>;
You have a couple of options here. You can:
1. use the data in the streams to do whatever you want to in the task, or
2. use the data in the streams to fill the single table that the three pipes were originally filling.
If you choose option 1, you might then also want to create a view that replaces your original single table.
If you choose option 2, you can set up a task that runs using the AFTER clause to do whatever it is that you want to do.

Related

How to pass data between tasks in Spring Cloud Composed Task?

It appears to me that, out of the box, Spring Cloud Composed Task does not support passing parameters between tasks.
Can you please suggest some options for the requirement below?
a) I have a composed task in which a downloader is the first task. Once that completes, two more tasks, say Item and Item Group, run. Once those complete, a transformation takes place.
b) I need to run the above composed task for different stores (e.g. store no. 1, 2, etc.).
Even if we use a database to pass the parameters between tasks, what unique id can we use to relate them to the composed task?

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
Eg. I have a process where I want to getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} also runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever even get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
I'm new to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice on this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.
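To make the "list on a timer, emit only what is new" behavior concrete, here is a minimal plain-Java model of it. This is only a sketch of the idea described above, not NiFi's actual implementation: the class name StatefulLister, the listNew method, and the use of a single "last seen" timestamp as the stored state are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a stateful lister; not a NiFi class. */
class StatefulLister {
    // State kept between runs: newest modification time already reported.
    private long lastSeenMillis = Long.MIN_VALUE;

    /**
     * One scheduled run. Takes the current directory contents as
     * (file name -> modification time) and returns only the files that are
     * newer than anything reported by earlier runs.
     */
    List<String> listNew(Map<String, Long> modifiedTimes) {
        List<String> fresh = new ArrayList<>();
        long newest = lastSeenMillis;
        // TreeMap gives a stable, name-sorted iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(modifiedTimes).entrySet()) {
            if (e.getValue() > lastSeenMillis) {
                fresh.add(e.getKey());
                newest = Math.max(newest, e.getValue());
            }
        }
        lastSeenMillis = newest; // remember progress for the next run
        return fresh;            // an empty list corresponds to a no-op run
    }
}
```

Calling listNew repeatedly on an unchanged directory returns an empty list after the first call, which is why the listing segment can run independently of the segment that writes the files.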

Always read first n lines on spring batch job restart

I am using the Spring Batch module to read a complex file with multi-line records. The first 3 lines in the file always contain a header with a few common fields.
These common fields will be used in the processing of subsequent records in the file. The job is restartable.
Suppose the input file has 10 records (please note the number of records may not be the same as the number of lines, since records can span multiple lines).
Suppose the job runs for the first time, starts reading the file from line 1, processes the first 5 records, and fails while processing the 6th record.
During this first run, since the job has also parsed the header (the first 3 lines in the file), the application can successfully process the first 5 records.
Now when the failed job is restarted, it starts from the 6th record and hence does not read the header this time. Since the application requires certain values contained in the header, the job fails. I would like suggestions for making the restarted job always read the header and then continue from where it left off (the 6th record in the above scenario).
Thanks in advance.
I guess the file in question does not change between runs? Then it's not necessary to re-read it; my solution builds on this assumption.
If you use one step, you can:
1. implement a LineCallbackHandler
2. give it access to the step ExecutionContext (it's easy with annotations, but can also be done with interfaces; just extend StepExecutionListenerSupport)
3. save the header values into the ExecutionContext
4. extract them from the context and use them where you want to
It should work for a restart as well, because Spring Batch saves the values from the first run and will provide the complete ExecutionContext for subsequent runs.
You can make a 2-step job where:
The first step reads the first 3 lines as header information and puts everything you need into the job context (which is therefore saved in the DB for future executions if the job fails). If this step fails, the header info will be read again; if it passes, you can be sure the header info is always in the job context.
The second step can use the same file as input, but this time you tell it to skip the first 3 lines and read the rest as is. This way you get restartability on that step, and each time the job fails it will resume where it left off.
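The control flow shared by both answers (keep the header in a persisted context, resume the record step from a saved offset) can be sketched in plain Java. This is a simulation under stated assumptions, not Spring Batch API: the class RestartableHeaderJob, the context map standing in for the persisted ExecutionContext, the keys "header" and "nextRecord", and the simplification that each record is a single line are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of the restart logic; not Spring Batch classes. */
class RestartableHeaderJob {
    static final int HEADER_LINES = 3;

    // Stand-in for the persisted job context that survives a failed run.
    final Map<String, Object> context = new HashMap<>();
    final List<String> processed = new ArrayList<>();

    /**
     * Runs both steps. failAtRecord is the 0-based record index at which a
     * failure is simulated, or -1 for no failure.
     */
    void run(List<String> lines, int failAtRecord) {
        // Step 1: read the header only if no earlier run has stored it.
        if (!context.containsKey("header")) {
            context.put("header", String.join("|", lines.subList(0, HEADER_LINES)));
        }
        String header = (String) context.get("header");

        // Step 2: skip the header lines and resume from the saved offset.
        int start = (Integer) context.getOrDefault("nextRecord", 0);
        for (int i = start; HEADER_LINES + i < lines.size(); i++) {
            if (i == failAtRecord) {
                throw new IllegalStateException("simulated failure at record " + i);
            }
            processed.add(header + " -> " + lines.get(HEADER_LINES + i));
            context.put("nextRecord", i + 1); // checkpoint after each record
        }
    }
}
```

Calling run once with a failure index and then again with -1 on the same object mimics a failed execution followed by a restart: the second call finds the header already in the context and continues at the checkpointed record instead of at the top of the file.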

How to delete input files after successful mapreduce

We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be to do a multiple-input job: read the directory for the files before you run the job and pass those, instead of a directory, to the job (then delete the files in the list after the job is done).
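A minimal sketch of that "snapshot the inputs, run, then delete only the snapshot" approach follows. It uses java.nio on a local directory purely for illustration; on HDFS you would list and delete through the Hadoop FileSystem API instead. The class SnapshotThenDelete, its method names, and the Predicate standing in for configuring the job inputs and calling job.waitForCompletion(true) are assumptions, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical driver helper; local-filesystem stand-in for the HDFS case. */
class SnapshotThenDelete {
    /** Lists the regular files present in inputDir at this moment. */
    static List<Path> snapshot(Path inputDir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        return files;
    }

    /**
     * Passes the snapshot to the job and deletes exactly those files, and
     * only if the job reports success. Files that arrive while the job is
     * running are not in the snapshot and are left for the next run.
     */
    static boolean runAndCleanUp(Path inputDir, Predicate<List<Path>> job) throws IOException {
        List<Path> inputs = snapshot(inputDir);
        boolean ok = job.test(inputs);
        if (ok) {
            for (Path p : inputs) {
                Files.deleteIfExists(p);
            }
        }
        return ok;
    }
}
```

Because deletion happens in the driver after the job returns, a retried map attempt still finds its input, which is the failure mode described in the question.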
Based on the situation you describe, I can suggest the following solution:
1. Monitoring the directory in which the archives land should be done by a separate process. That process can use a metadata table (in MySQL, for example) to record status entries based on what it sees in the directory. The metadata entries can also be used to check for duplicates.
2. Based on the metadata entries, a separate process can then handle triggering the MapReduce job. It can check a status in the metadata before triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Windows Workflows - While Activity for creating multiple tasks not working

I am using a While activity to create multiple tasks for a workflow. The code executes fine and the task is created when the loop runs only once. But when the loop runs twice or more, only one task gets created. Also, the WF status shows as Error Occurred.
All I want to do here is create multiple tasks (the number of tasks depends on an entered column value) for the same user. Is it possible to use 'while' in this scenario? Or is there any other way to go about it?
NB: I am using state machine workflow.
You may want to use a Replicator Activity which will in turn "clone" its child-activities. It can be run parallel or sequentially.
I found Working with the Replicator Activity and an Until Condition useful.
Otherwise without the Replicator, there is just one Task Activity.
In either case, make sure to assign a new Guid to the TaskId property. However, as an annoying "feature": it will not work if you just assign the TaskId property (I know, I tried and was like "Wth?!?"). Instead, bind the TaskId to a Field/Property and then assign to that.
