Souffle datalog: Stop execution when one record of a relation exists - static-analysis

Is there a way to make Soufflé stop execution as soon as one record of a relation is found? For example, say I have .decl relationA(x:number); once one record relationA(some number) is derived, execution should stop. Alternatively, is it possible to make Soufflé flush relations to disk before execution finishes?
Thank you

If you have your "some number" beforehand, you may be able to reduce the program output by using the Magic Set feature: https://souffle-lang.github.io/magicset
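A minimal sketch of what that can look like, assuming the value of interest is known up front. The relation and rule names below are made up for illustration, and the pragma spelling is as described on the page above (check it against your Soufflé version):

// Hypothetical example: we only care whether relationA holds for one known value.
.decl edge(x:number, y:number)
.input edge

.decl relationA(x:number)
relationA(y) :- edge(_, y).

// Bind the value we are interested in and derive only what is needed for it.
.decl query(x:number)
query(42).

.decl hit(x:number)
hit(x) :- query(x), relationA(x).
.output hit

// Enable the magic set transformation so evaluation is driven by the bound
// query value instead of materialising relationA in full first.
.pragma "magic-transform" "*"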

Related

Can we have a task start condition dependent on the success condition of a PIPE in Snowflake?

I have a requirement where 3 different files will be loaded into a single table by 3 different pipes. I want my target process to be triggered only once all 3 files have been loaded to my stage.
I don't want to run my target process multiple times.
So is there any way to make the start condition of a task depend on PIPE success?
I went through the documentation but didn't find any such info; or is there a way of implementing this that I might be missing?
The general way to implement this pattern is with streams. Your pipes would load into three separate tables, each with a stream on it. You can then have a task that runs on a schedule, with the WHEN parameter combining three SYSTEM$STREAM_HAS_DATA checks. This ensures that your task only runs when all three pipes have completed successfully. Example:
CREATE TASK mytask1
  WAREHOUSE = mywh
  SCHEDULE = '5 minute'
WHEN
  SYSTEM$STREAM_HAS_DATA('MYSTREAM')
  AND SYSTEM$STREAM_HAS_DATA('MYSTREAM2')
  AND SYSTEM$STREAM_HAS_DATA('MYSTREAM3')
AS
  <Do stuff.>;
You have a couple of options here. You can:
1. use the data in the streams to do whatever you want to do in the task, or
2. use the data in the streams to fill the single table that the three pipes were originally filling.
If you choose option 1, you might then also want to create a view that replaces your original single table.
If you choose option 2, you can set up a downstream task that runs using the AFTER clause to do whatever it is that you want to do.
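A minimal sketch of that AFTER chaining, assuming mytask1 from above loads the stream data into the original table, and with my_downstream_proc as a placeholder for your follow-up logic (both names are illustrative, not from the question):

-- Child task: it is triggered when its parent task finishes a run.
CREATE TASK mytask2
  WAREHOUSE = mywh
  AFTER mytask1
AS
  CALL my_downstream_proc();

Note that a task chained with AFTER has no SCHEDULE of its own; it runs whenever the parent task completes.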

Nifi - Process the files based on count or time elapsed?

I have the following flow:
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify if:
the total number of flowfiles (from FetchFile) is, say, 200; OR
the time elapsed (since the last signal) is, say, 3 hours.
I think the 1st condition is easy to achieve: I can have a Groovy script that reads the number of flowfiles and, if there are 200, routes to SUCCESS, otherwise rolls back the session.
But how do I also check whether the n flowfiles in the queue (where n can be less than 200) have been waiting for more than 3 hours or so?
Update
Here is the problem: we currently have a batch process (~200 files, which can grow based on business needs in the future). We have a NiFi pipeline, i.e. List, Fetch, basic validation on checksum, etc., and a processing step (calling the SQL), which is working fine.
As per the business, throughout the day there can be corrections to the data, so all or some of the files may come in again to "re-process". That is also fine and working.
Now, as per new requirements, we need to run a process after this "batch" is completed. In the best case, I can have a MergeContent processor with a maximum bin size of n and have it signal or notify my new processor.
However, as explained above, throughout the day a few or all of the files can be processed again, so my "n" may not match the new number of files re-processed. Hence, even in this case, if say 3 hours have elapsed, then irrespective of "n" not matching the new number of reprocessed files, I should notify the new process to run again.
Hence, I am looking for an "n files OR m hours elapsed" check.
I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum bin count of 200
You can use an ExecuteScript processor as you noted
You can write a value (the current timestamp) to a DistributedMapCacheServer when the Notify processor executes, then read it back with a FetchDistributedMapCache processor and use a simple Expression Language statement to compare it against the current timestamp (see the sketch below)
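A sketch of that comparison, assuming the fetched timestamp lands in a flowfile attribute named last.notify.time (a made-up attribute name) and you route in RouteOnAttribute when more than 3 hours (10,800,000 ms) have passed:

${now():toNumber():minus(${last.notify.time:toNumber()}):gt(10800000)}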
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj

Maximum Number of Users Per Mapping in ODI12c

I am new to ODI. While working on an ODI project, I am facing one issue.
I have 10 mappings in ODI12c, and all of them use the same target table. Due to a performance issue, I want at most 2 users to be able to execute mappings at a time (max 2 mappings), since they use the same target table. If more than 2 users use that same target, the additional executions should not run.
How should I implement this in ODI12c?
You can do something, but not exactly what you asked. You can set up an option called "Concurrent Execution Controller" and tell a scenario to wait until its previous execution is finished.
So, you can do the following:
1. create a package
2. create scenarios for all the mappings
3. create a variable
4. inside the package, call a scenario (no matter which scenario) and, for the scenario name, put the variable (see image below)
5. generate a scenario for the package
6. double-click on the scenario of the package, choose "Limit Concurrent Executions", then choose "Wait to Execute" and set the wait polling interval to X seconds
7. execute the package scenario and, when the variable is prompted, enter the name of the mapping you want to execute
Please tell me if you need more info.

How do readers keep track of current position in case query result changes?

After reading this answer (by Michael Minella) to "Spring batch chunk processing, how does the reader work if the result set changes?":
I assume that with JdbcPagingItemReader, the query is run again for each page. In this case, when reading a new page, it is possible that a new record was inserted at a position before the page starts, causing the last record of the previous page to be processed again.
Does this mean that, in order to prevent a record from being reprocessed, I must always set a "processed already" flag manually in the input data and check it before writing?
Is this a feasible approach?
The same question applies to the JdbcCursorItemReader when the process is interrupted (power outage) and restarted. What happens if a new record has been inserted before the current index that is saved in the ExecutionContext?
Your assumptions are right.
In the case of the JdbcPagingItemReader, this will also depend on the transaction isolation level of your transaction (READ_COMMITTED, READ_UNCOMMITTED, ...).
In the case of the JdbcCursorItemReader, you have to ensure that the query returns the exact same result set (including order) in the case of a restart. Otherwise, the results are unpredictable.
In the batches I'm writing, I often save the result of the selection into a CSV file in the first step and configure the reader with "saveState=false" if I cannot guarantee that the selection will produce the same results in the case of a crash. So, if the first step fails, a restart will produce a completely new CSV file. After the first step, all the entries that need to be processed are in a file, and of course this file cannot change; therefore, in the case of a restart, continuing from the last successful chunk is possible from the second step onward.
Edited:
Using a "state-column" works well, if you have a single step that does the reading (having the state-column in its where-clause), processing and writing/updating (the state-column to 'processed') the state. You just have to start the job again as a new launch, if such a job fails.

Correct use-case for Concurrency::task_group

The documentation says: "task_group class represents a collection of parallel work which can be waited on or canceled."
1). Do I take it to mean that tasks need to be logically related (but broken down) and that you will ideally need to wait on them elsewhere to collate the results?
IOW, is it possible to use task_group to just schedule asynchronous tasks that basically have no relation to each other (as an analogy: sort of like dumping each iteration of some processing activity into a queue and picking it up for execution by another thread)? Each of them would just execute and die away, and as a result I wouldn't even have to wait on or cancel them.
(I do understand that the task_group dtor will throw an exception if I don't cancel or wait on incomplete tasks. Let's forget that for the moment and focus only on whether I am using it for the right purpose.)
This page has an explanation of task groups - not bad.
In a nutshell,
use task groups (the concurrency::task_group class or the concurrency::parallel_invoke algorithm) when you want to decompose parallel work into smaller pieces and then wait for those smaller pieces to complete.
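A minimal sketch of that decompose-then-wait pattern with concurrency::task_group from the Microsoft Concurrency Runtime (<ppl.h>); the work itself is made up for illustration:

#include <ppl.h>        // concurrency::task_group
#include <array>
#include <numeric>
#include <iostream>

int main()
{
    std::array<long long, 4> partial{};   // one slot per piece of decomposed work

    concurrency::task_group tg;
    for (int i = 0; i < 4; ++i)
    {
        // Schedule one piece of the decomposed work; each piece fills its own slot.
        tg.run([i, &partial] {
            long long sum = 0;
            for (int v = i * 250; v < (i + 1) * 250; ++v)
                sum += v;
            partial[i] = sum;
        });
    }

    // Wait for every piece to finish before collating the results.
    tg.wait();

    std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
}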
