Jenkins scripted pipeline - sequential steps

I have this command in my scripted pipeline:
parallel map_of_stages
map_of_stages is a map of stages.
In this example, "parallel" is what actually executes the stages.
But what if I need those stages to run sequentially?
Is there something like "parallel" but for serial execution of stages?
Thanks

"sequential" isn't a keyword you can use in Jenkins pipelines, but the steps are executed sequentially unless they are in a paralell block anyway, so you shouldn't need it.
This has some useful info:
https://jenkins.io/doc/book/pipeline/syntax/#sequential-stages

Source: Stack Overflow question
As found in another question:
map_of_stages.each { key, value ->
    value()
}

Related

Read from memory for full pipeline, read from files if retry or partial pipeline

How can I use the pipeline to run from memory or from a file? I think the features are there, but I am not sure how to write the pipeline like this.
My use case is:
normal pipeline, from step 1 to step 10
run from step 2 to step 10
Imagine that at step 1 I write a dataframe to a CSV, and step 2 needs to read from it. If I am running from step 1, I would want to pass that dataframe in memory (to save the read time). But if I start running from step 2, I will need to read from the CSV.
What is the best practice to do so with Kedro?
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#pipeline-with-circular-dependencies
I can think of 2 ways, depending on your use case.
a) You could use separate environments for this. When running the full pipeline, you use some environment, say regular, where you don't have a catalog entry for the dataset in question (hence it will be turned into a MemoryDataSet), while in a separate dev environment you have an entry in your catalog.yml to save it as a CSV. But it does mean you'd have to run dev from node 1 in order to generate the CSV to be used for subsequent runs.
kedro run --env regular
kedro run --env dev
kedro run -e dev --from-nodes node2
Relevant docs: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
b) Another way to do it, if you always want the first node to write to CSV, is to have node1 return 2 outputs (the same data), one as a pandas.CSVDataSet and one as a MemoryDataSet, and to define different pipelines: a complete pipeline, where the second node reads from memory, and a partial one, where you don't have node1 and node2 reads from the CSV dataset (sketched below).
kedro run --pipeline complete
kedro run --pipeline partial
https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#running-a-pipeline-by-name
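A rough sketch of option (b) in Python, with illustrative names only (step1, step2, raw_input, my_data_csv, my_data_mem and final_output are assumptions, not from the question); the two Pipeline objects would then be registered under the names "complete" and "partial" in the project's pipeline registry:
from kedro.pipeline import Pipeline, node

def step1(raw_df):
    df = raw_df.copy()    # stand-in for the real step 1 logic
    return df, df         # same data returned twice, once per output dataset

def step2(df):
    return df.describe()  # stand-in for the real step 2 logic

complete = Pipeline([
    # my_data_csv is declared in catalog.yml as pandas.CSVDataSet;
    # my_data_mem has no catalog entry, so Kedro keeps it as a MemoryDataSet.
    node(step1, inputs="raw_input", outputs=["my_data_csv", "my_data_mem"]),
    node(step2, inputs="my_data_mem", outputs="final_output"),
])

partial = Pipeline([
    # no step1 here: step2 reads the CSV written by an earlier "complete" run
    node(step2, inputs="my_data_csv", outputs="final_output"),
])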
In addition to the options suggested by @Lorena Balan, you can use a CachedDataSet. Your catalog entry will look similar to this:
my_cached_dataset:
  type: CachedDataSet
  dataset:
    type: pandas.CSVDataSet
    filepath: path/to/file
The cached dataset will save the data using the regular underlying dataset and will also populate its internal cache; the subsequent load will then pull the data from that in-memory cache. If the cache is empty (your scenario 2), CachedDataSet on load will pull the data from the underlying [CSV] file.
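For reference, a rough Python equivalent of that catalog entry (assuming a Kedro version where CachedDataSet is importable from kedro.io and CSVDataSet from kedro.extras.datasets.pandas):
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import CSVDataSet

# Wraps the CSV dataset: save() writes the file and also fills an in-memory cache,
# so a later load() in the same run returns the cached object without re-reading the CSV.
# If nothing is cached yet (your scenario 2), load() falls back to the underlying file.
my_cached_dataset = CachedDataSet(dataset=CSVDataSet(filepath="path/to/file"))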

Does Kedro support Checkpointing/Caching of Results?

Let's say we have multiple long-running pipeline nodes.
It seems quite straightforward to checkpoint or cache the intermediate results, so that when nodes after a checkpoint are changed or added, only those nodes must be executed again.
Does Kedro provide functionality to make sure that, when I run the pipeline, only those steps are executed that have changed?
Also the reverse: is there a way to make sure that all steps that have changed are executed?
Let's say a pipeline producing some intermediate result changed; will it be executed when I run a pipeline that depends on the output of the first?
TL;DR: Does Kedro have makefile-like tracking of what needs to be done and what not?
I think my question is similar to issue #341, but I do not require support of cyclic graphs.
You might want to have a look at the IncrementalDataSet alongside the partitioned dataset documentation, specifically the section on incremental loads with the incremental dataset, which has a notion of "checkpointing", although checkpointing there is a manual step and not automated like in a makefile.
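A hedged sketch of that manual checkpointing flow, assuming a Kedro version where IncrementalDataSet is importable from kedro.io; the path and dataset type below are illustrative:
from kedro.io import IncrementalDataSet

# Partitions live as separate CSV files under this folder; the dataset keeps a
# checkpoint of the last processed partition alongside them.
parts = IncrementalDataSet(path="data/01_raw/batches", dataset="pandas.CSVDataSet")

new_partitions = parts.load()   # only partitions not yet covered by the checkpoint
for partition_id, data in new_partitions.items():
    pass                        # process just the new partitions here

parts.confirm()                 # manually advance the checkpoint ("checkpointing")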

how to terminate mapreduce job after checking for a condition?

I already found this solution. But, as the answer says, it's unsafe to do so; is there any safer way to do this using the new MapReduce library (org.apache.hadoop.mapreduce)?
I wanted to terminate a MapReduce job that runs in a loop, so I solved this problem by using counters as follows:
public static enum SOLUTION_FLAG {
    SOLUTION_FOUND;
}
I took help from this site,
How to use the counters in Hadoop?
Based on the value of this flag, I decided whether the task could be skipped, and when a job ends, at the end of each loop iteration, I checked the value of this flag.
Let me know if I'm doing it correctly.

Passing multiple arguments to external programs in a Pipeline

I'm trying to build a pipeline for NGS data.
I made a small example pipeline for passing commands to the shell. The example pipeline has two scripts that are called from the shell and that just concatenate (sumtool.py) and multiply (multool.py) values in many dataframes (10 in this case). My wrapper (wrapper.py) handles the input and passes the commands that run the scripts in order. Here is the relevant part of the code from the wrapper:
from functools import wraps
from subprocess import Popen

def run_cmd(orig_func):
    @wraps(orig_func)
    def wrapper(*args, **kwargs):
        cmdls = orig_func(*args, **kwargs)
        cmdc = ' '.join(str(arg) for arg in cmdls)
        cmd = cmdc.replace(',', '')
        Popen(cmd, shell=True).wait()
    return wrapper

@run_cmd
def runsumtool(*args):
    return args

for file in getcsv():
    runsumtool('python3', 'sumtool.py', '--infile={}'.format(file), '--outfile={}'.format(dirlist[1]))
This works all right, but I want to be able to pass all the commands for the first script at once (for all the dataframes), wait for them to finish, and then run the second script with all the commands at once for every dataframe. Since Popen(cmd, shell=True).wait() waits for each command to finish before the next one starts, it takes much longer.
I tried to incorporate luigi for a solution, but I wasn't successful at running external programs or passing multiple I/Os with luigi. Any tip on that is appreciated.
Another solution I'm imagining is passing the samples individually, all at once, but I'm not sure how to put that in Python (or any other language, really). This would also solve the I/O problem with luigi.
thanks
Note 1: This is a small example pipeline I built. My main purpose is to call programs like bwa and picard in a pipeline ... which I cannot import.
Note2: I'm using Popen from subprocess already. You can find it between lines 4 and 5.
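A minimal sketch of the batching idea described above, using only subprocess (getcsv and dirlist are the helpers from the snippet; the multool.py flags are assumed to mirror sumtool.py's):
from subprocess import Popen

def run_batch(cmds):
    procs = [Popen(cmd) for cmd in cmds]   # start every command without waiting
    for p in procs:
        p.wait()                           # then block until the whole batch is done

files = list(getcsv())
sum_cmds = [['python3', 'sumtool.py', '--infile={}'.format(f), '--outfile={}'.format(dirlist[1])]
            for f in files]
mul_cmds = [['python3', 'multool.py', '--infile={}'.format(f), '--outfile={}'.format(dirlist[1])]
            for f in files]

run_batch(sum_cmds)   # all sumtool.py runs execute concurrently
run_batch(mul_cmds)   # multool.py only starts after every sumtool.py run has finished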

Better to use task dependencies or task.doLast in Gradle?

After building my final output file with Gradle I want to do 2 things: update a local version.properties file and copy the final output file to some specific directory for archiving. Let's assume I already have 2 methods implemented that do exactly what I just described, updateVersionProperties() and archiveOutputFile().
I'm now wondering what's the best way to do this...
Alternative A:
assembleRelease.doLast {
    updateVersionProperties()
    archiveOutputFile()
}
Alternative B:
task myBuildTask(dependsOn: assembleRelease) << {
    updateVersionProperties()
    archiveOutputFile()
}
And here I would call myBuildTask instead of assembleRelease as in alternative A.
Which one is the recommended way of doing this and why? Is there any advantage of one over the other? Would like some clarification please... :)
Whenever you can, model new activities as separate tasks. (In your case, you might add two more tasks.) This has many advantages:
Better feedback as to which activity is currently executing or failed
Ability to declare task inputs and outputs (reaping all benefits that come from this)
Ability to reuse existing task types
More possibilities for Gradle to execute tasks in parallel
Etc.
Sometimes it isn't easily possible to model an activity as a separate task. (One example is when it's necessary to post-process the outputs of an existing task in place. Doing this in a separate task would result in the original task never being up-to-date on subsequent runs.) Only then should the activity be attached to an existing task with doLast.
