How to pass a variable to the ADF Execute Pipeline activity?

Environment:
I have around 100 pipelines that run on a number of triggers.
Outcome: I want to create a master pipeline that calls those 100 pipelines.
Currently, I've created a list of pipeline names and put them into an array. Then I was hoping to use a ForEach activity with an Execute Pipeline activity to pass those names.
Issue: it seems that the Execute Pipeline activity does not take variables, or it is not obvious how to do it.
I do not want to create the master pipeline manually, as it can change often, and I hope there is a better way to do it than by hand.

You are correct that the "Invoked pipeline" setting of the Execute Pipeline activity does not support a variable value: the Pipeline name must be known at design time. This makes sense when you consider parameter handling.
One way around this is to create an Azure Function to execute the pipeline. This answer has the .NET code I leverage in my pipeline management work. It's a couple of years old, so it probably needs an update. If you need the pipelines to run sequentially, you'll need to build a larger framework to monitor and manage the executions, which is also discussed in that answer. There is a concurrency limit (~40 per pipeline, I believe), so you couldn't run all 100 simultaneously.
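That answer's .NET code isn't reproduced here, but as a rough illustration of the underlying mechanism, here is a sketch in Go that starts a pipeline by name through the Data Factory "Pipelines - Create Run" REST endpoint. The subscription ID, resource group, factory name, pipeline names, and token handling are placeholders; real code would obtain the bearer token via Azure AD.

package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// runPipeline triggers one pipeline run by name via the ADF REST API.
func runPipeline(name, token string) error {
	url := fmt.Sprintf(
		"https://management.azure.com/subscriptions/%s/resourceGroups/%s"+
			"/providers/Microsoft.DataFactory/factories/%s/pipelines/%s/createRun?api-version=2018-06-01",
		"SUB_ID", "RESOURCE_GROUP", "FACTORY_NAME", name) // placeholders

	req, err := http.NewRequest("POST", url, strings.NewReader("{}"))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Printf("%s -> %s\n", name, resp.Status)
	return nil
}

func main() {
	token := os.Getenv("AZURE_TOKEN") // placeholder: acquire via Azure AD in real code
	for _, p := range []string{"Pipeline1", "Pipeline2"} { // your array of pipeline names
		if err := runPipeline(p, token); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}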

Related

Control parallelism in Apache Beam Dataflow pipeline

We are experimenting with Apache Beam (using the Go SDK) and Dataflow to parallelize one of our time-consuming tasks. For a little more context, we have a caching job that takes a set of queries, runs them against the database, and caches the results. Each query may take from a few seconds to many minutes, and we want to run them in parallel for quicker task completion.
We created a simple pipeline that looks like this:
// Create initial PCollection.
startLoad := beam.Create(s, "InitialLoadToStartPipeline")
// Emits a unit of work along with query and date range.
cachePayloads := beam.ParDo(s, &getCachePayloadsFn{Config: config}, startLoad)
// Emits a cache response which includes errCode, errMsg, time etc.
cacheResponses := beam.ParDo(s, &cacheQueryDoFn{Config: config}, cachePayloads)
...
The number of units getCachePayloadsFn emits is not large: mostly in the hundreds, and at most a few thousand in production.
The issue is that cacheQueryDoFn is not executed in parallel; the queries run sequentially, one by one. We confirmed this by logging the goroutine ID, process ID, and start and end times in StartBundle and ProcessElement of the caching function: there is no overlap in execution.
We want the queries to always run in parallel, even if there are just 10 of them. From our understanding of the documentation, the runner creates bundles from the overall input; bundles run in parallel, while elements within a bundle run sequentially. Is there a way to control the number of bundles, or any other way to increase parallelism?
Things we tried:
Setting num_workers=2 and autoscaling_algorithm=None. This starts two VMs, but the Setup method that initializes the DoFn runs on only one VM, and that VM handles the entire load.
Finding the sdk_worker_parallelism option here, but we are not sure how to set it correctly. We tried beam.PipelineOptions.Set("sdk_worker_parallelism", "50"), with no effect.
By default, Create is not parallel, and all of the DoFns are fused into the same stage as the Create, so they have no parallelism either. See https://beam.apache.org/documentation/runtime/model/#dependent-parallellism for more information on this.
You can explicitly force a fusion break with the Reshuffle transform.
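For illustration, here is the question's pipeline with the fusion break added (reusing the DoFn types from the question); beam.Reshuffle materializes and redistributes the intermediate elements, so the expensive query DoFn is no longer fused with the Create:

// Create initial PCollection.
startLoad := beam.Create(s, "InitialLoadToStartPipeline")
// Emits a unit of work along with query and date range.
cachePayloads := beam.ParDo(s, &getCachePayloadsFn{Config: config}, startLoad)
// Fusion break: materialize and redistribute the payloads across workers.
redistributed := beam.Reshuffle(s, cachePayloads)
// Emits a cache response which includes errCode, errMsg, time etc.
cacheResponses := beam.ParDo(s, &cacheQueryDoFn{Config: config}, redistributed)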

Parallel Processing with Starting New Task - front end screen timeout

I am running an ABAP program that works with a huge amount of data. The SAP documentation says that I should use remote-enabled function modules with the addition STARTING NEW TASK to process the data.
So my program first selects all the data, breaks it into packages, and calls a function module with one package of data at a time for further processing.
This is my pseudo code:
* Select the keys in packages of 500 and collect them in a table of tables.
SELECT keyfield FROM mysap_table INTO TABLE key_table PACKAGE SIZE 500.
  APPEND key_table TO all_keys_table.
ENDSELECT.

* Dispatch one asynchronous RFC call per package.
LOOP AT all_keys_table ASSIGNING <fs_table>.
  CALL FUNCTION 'Z_MASS_PROCESSING'
    STARTING NEW TASK 'TEST'
    DESTINATION IN GROUP DEFAULT
    EXPORTING
      it_data = <fs_table>.
ENDLOOP.
But I am surprised to see that the calls to my function module use dialog work processes instead of background work processes.
So now I have run into the problem that one of my dialog processes was killed after 60 minutes because of a timeout.
To me, it seems that STARTING NEW TASK is not the right solution for parallel processing of mass data.
What is the alternative?
As already mentioned, this is not an easy topic that can be handled with a few lines of code. The general steps you have to carry out thoughtfully to gain the desired benefit are (a generic sketch follows the list):
1) Get the free work processes available for parallel processing
2) Slice your data into packages to be processed
3) Call an RFC-enabled function module asynchronously for each package with the available work processes; handle waiting for free work processes if there are more packages than processes
4) Receive your results asynchronously
5) Wait until everything is processed, merge the data back together, and make sure that every package was handled properly
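ABAP specifics aside, steps 1-5 amount to a bounded worker pool that dispatches packages and collects their results. Here is a sketch of just that orchestration, written in Go for illustration (the process function and the package contents are invented for the example); the real aRFC code is in the links below:

package main

import (
	"fmt"
	"sync"
)

// process stands in for the RFC-enabled function module (step 3's work).
func process(pkg []string) string {
	return fmt.Sprintf("processed %d keys", len(pkg))
}

func main() {
	// Step 2: the data, already sliced into packages (dummy data here).
	packages := [][]string{{"k1", "k2"}, {"k3"}, {"k4", "k5"}}

	// Step 1: cap concurrency at the number of free "work processes".
	const maxWorkers = 4
	sem := make(chan struct{}, maxWorkers)

	results := make([]string, len(packages))
	var wg sync.WaitGroup

	for i, pkg := range packages {
		wg.Add(1)
		sem <- struct{}{} // Step 3: block until a worker slot is free.
		go func(i int, pkg []string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = process(pkg) // Step 4: receive the result asynchronously.
		}(i, pkg)
	}

	wg.Wait() // Step 5: wait for all packages, then merge and verify.
	for _, r := range results {
		fmt.Println(r)
	}
}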
Although it is bad practice to just post links, the code is very long and would make this answer very messy; therefore, take a look at the following links:
Example1-aRFC
Example2-aRFC
Example3-aRFC
Other RFC variants (e.g. qRFC, tRFC, etc.) can be found here with a short description, but sadly I cannot give you further insight on them.
EDIT:
Regarding the process type of aRFC:
In parallel processing, a job step is started as usual in a background processing work process. (...) While the job itself runs in a background process, the parallel processing tasks that it starts run in dialog work processes. Such dialog work processes may be located on any SAP server.
The server group is specified with GROUP (default: parallel_generators; see transaction RZ12) and can have its own resources reserved just for parallel processing. If your process times out, you have to slice your packages into smaller sizes.
I think the best way to do parallel processing in SAP is the Bank Parallel Processing framework, as Jagger mentioned. Unfortunately it is rarely mentioned in any resource and is not well documented.
Actually, the best documentation I found was in this book:
https://www.sap-press.com/abap-performance-tuning_2092/
Yes, it's tricky. It cost me about 5 or 6 days to get it going, but the results were good.
Everything is situated in package BANK_PP_JOBCTRL, and you can use that name for googling.
The main idea there is to divide all your work into steps (simplified):
1. Preparation
2. Parallel processing
2.1. Processing preparation
2.2. Processing
(Actually there are more steps.)
The first step is not parallelized. Here you prepare all your data for parallel processing and divide it into 'pieces' which will be processed in parallel.
The content of a piece, in turn, can be IDs or preloaded data.
After that, you can run step 2 in parallel.
The great benefit of all this is that an error in one piece of parallel work won't crash all of your processing.
I recommend you check the demo in function group BANK_API_PP_DEMO.
To implement parallel processing, you need to do a bit more than just add that clause. The information is contained in this help topic. A lot of design effort needs to be devoted to ensuring that the communication and result-merging overhead of the parallel processing does not negate the performance advantage gained in the first place, and that the referential integrity of the data is maintained even when some of the parallel tasks fail. Do not underestimate the complexity of this task.
You could make use of the bgRFC technique. This is a newer method of background processing from SAP.
In addition to the already existing IN BACKGROUND TASK, bgRFC offers the possibility to configure and monitor all calls which run through it.
You can read more documentation on the different possibilities here. This all (of course) depends on your SAP version.

Run a TeamCity configuration N times

Among my TeamCity configurations, I decided to make something like an aging test* and run a single configuration 100 times.
Can I do this in a few simple clicks?
*aging test: a test showing that, over time/aging, the results do not change.
As of now, this is not possible from the UI. If you queue one build configuration several times without any changes, the runs will be merged and only one will be executed. If you want to run 100, you have to trigger them one by one, each after the previous one has finished executing.
A better solution is to trigger the builds from a script using the REST API (for more details, see the documentation here): if the builds have different values in custom parameters, they will all be put in the queue.
HOW: define a dummy custom parameter and trigger the build from a script in a loop, passing the loop variable's value as the parameter value. TeamCity will then treat them as different builds and execute all of them.
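As a sketch of that loop in Go (the server URL, credentials, build configuration ID MyProject_AgingTest, and the dummy parameter name iteration are all placeholders):

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	const server = "https://teamcity.example.com" // placeholder URL
	client := &http.Client{}

	for i := 1; i <= 100; i++ {
		// The dummy parameter value differs per request, so the queue
		// does not merge the 100 runs into one.
		body := fmt.Sprintf(
			`<build><buildType id="MyProject_AgingTest"/>`+
				`<properties><property name="iteration" value="%d"/></properties></build>`, i)

		req, err := http.NewRequest("POST", server+"/httpAuth/app/rest/buildQueue", strings.NewReader(body))
		if err != nil {
			panic(err)
		}
		req.SetBasicAuth("user", "password") // placeholder credentials
		req.Header.Set("Content-Type", "application/xml")

		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Printf("queued run %d: %s\n", i, resp.Status)
	}
}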

Force build to top of queue when triggered

I have a single agent and many builds. There are frequently several builds in the queue that take an hour apiece to execute. I want to trigger, daily at a specific time, a build that takes less than five seconds but needs to run immediately (next in the queue). Is there any way to do this?
Build priorities are suggested in various places, but they do not help: I set the priority to the maximum value of 100, and the build was still placed 15th out of 17 in the queue.
You can use the TeamCity REST API to trigger the build and put it at the top of the queue, by making use of the triggering option queueAtTop="true".
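A minimal sketch of such a trigger in Go (the server URL, credentials, and build configuration ID are placeholders; queueAtTop is an attribute of the build element in the REST request body):

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// queueAtTop="true" asks TeamCity to place the queued build first in line.
	body := `<build queueAtTop="true"><buildType id="MyProject_FastBuild"/></build>`

	req, err := http.NewRequest("POST",
		"https://teamcity.example.com/httpAuth/app/rest/buildQueue",
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("user", "password") // placeholder credentials
	req.Header.Set("Content-Type", "application/xml")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("queued:", resp.Status)
}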
I ended up working around the problem by moving this build to another, practically dedicated TeamCity agent, which means it executes promptly. This is not a good solution, and I would prefer to accept an actual answer if anyone is able to offer one.

Windows Workflows - While Activity for creating multiple tasks not working

I am using a While activity to create multiple tasks for a workflow. The code executes fine and the task is created when the loop runs only once, but when the loop runs twice or more, only one task is created. Also, the workflow status shows as 'Error Occurred'.
All I want to do here is create multiple tasks (the number of tasks depends on an entered column value) for the same user. Is it possible to use 'while' in this scenario? Or is there another way to go about it?
NB: I am using a state machine workflow.
You may want to use a Replicator activity, which will in turn "clone" its child activities. It can run them in parallel or sequentially.
I found Working with the Replicator Activity and an Until Condition useful.
Otherwise, without the Replicator, there is just the one Task activity.
In either case, make sure to assign a new Guid to the TaskId property. However, as an annoying "feature", it will not work if you just assign the TaskId property directly (I know, I tried and was like "Wth?!?"). Instead, bind the TaskId to a field/property and then assign to that.
