Automating an ETL data pipeline in KNIME using Jenkins? - jenkins-pipeline

I am trying to automate an ETL data pipeline in KNIME using Jenkins. Can anyone suggest an approach for this pipeline?
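One common approach is to have Jenkins call KNIME's headless batch executor from a pipeline stage. Below is a minimal declarative-pipeline sketch assuming KNIME is installed on the build agent; the install path, workflow directory, agent label, and cron trigger are placeholders rather than anything from the original question.

```groovy
// Jenkinsfile - minimal sketch for running a KNIME workflow in batch mode.
// All paths, the agent label, and the schedule below are placeholders.
pipeline {
    agent { label 'etl' }             // an agent with KNIME installed
    triggers { cron('H 2 * * *') }    // optional: run the ETL nightly as well as on demand
    environment {
        KNIME_HOME   = '/opt/knime'                   // hypothetical KNIME install dir
        WORKFLOW_DIR = '/var/knime/workflows/my_etl'  // hypothetical workflow dir
    }
    stages {
        stage('Run KNIME workflow') {
            steps {
                // KNIME's batch application runs the workflow headlessly and exits
                // non-zero on failure, which in turn fails the Jenkins build.
                sh '''
                  "$KNIME_HOME/knime" -nosplash -consoleLog -reset -nosave \
                    -application org.knime.product.KNIME_BATCH_APPLICATION \
                    -workflowDir="$WORKFLOW_DIR"
                '''
            }
        }
    }
    post {
        failure {
            echo 'KNIME ETL workflow failed - check the console log above.'
        }
    }
}
```

From there you can add further stages (validation, notifications) and archive KNIME's log as a build artifact if you want the run history kept in Jenkins.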

Related

How to identify which input needs to be passed to Kettle jobs when running them in the PDI tool?

We have many .ktr Kettle jobs. We are new to these and have found a way to run them in the PDI (Pentaho Data Integration) tool, but we are unable to identify which input file needs to be passed to these Kettle jobs when executing them.
Can anyone please explain how to work out, or where to check, which input file to pass to these Kettle jobs during execution?
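For what it's worth, .ktr transformations (and .kjb jobs) are plain XML, and any named parameters they expect are declared in a <parameters> block, so one way to see what a job wants is simply to read that block out of the file. Below is a small plain-Groovy sketch; the file path is a placeholder, and this only covers named parameters, not inputs wired up inside individual steps.

```groovy
// Sketch: list the named parameters declared in a Kettle .ktr/.kjb file.
// The path is a placeholder; run with any Groovy installation.
def ktr = new XmlSlurper().parse(new File('/path/to/transformation.ktr'))

// Named parameters live under <parameters><parameter>...</parameter></parameters>;
// search the whole tree so the exact nesting (.ktr vs .kjb) does not matter.
ktr.'**'
   .findAll { it.name() == 'parameter' && it.parent().name() == 'parameters' }
   .each { p ->
       println "name: ${p.name.text()}  default: ${p.default_value.text()}  description: ${p.description.text()}"
   }
```

These are the values you would then pass on the command line, e.g. via Pan/Kitchen's -param:NAME=value options, when running the job outside Spoon.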

Jenkins declarative pipeline - how to manage Jenkins server

How can I have a declarative Jenkins pipeline that is able to manage the Jenkins server itself? For example:
a pipeline that can query which jobs I have in a folder and then disable/enable those jobs
query which agents are available and trigger a job on a specific agent
The pipeline global variable currentBuild has a property called rawBuild that provides access to the Jenkins model for the current build. From there you can get to many of the Jenkins internals.
I'm not sure what you will find in the way of agent querying and job triggering; have a look, as there are (or were) plugins that offered alternatives to the default model.
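To make the internal-model route concrete, here is a rough sketch of the kind of Groovy you could put in a script block inside your pipeline. The folder and job names are placeholders, the calls into Jenkins internals need script approval (or a trusted shared library), and setDisabled on a Pipeline job assumes a reasonably recent workflow-job plugin.

```groovy
// Rough sketch only - these calls into Jenkins internals require running outside
// the Groovy sandbox or approving the signatures; names are placeholders.
script {
    def controller = jenkins.model.Jenkins.get()

    // 1. Enumerate the jobs in a folder and disable them.
    def folder = controller.getItemByFullName('my-folder')
    for (job in folder.items) {
        echo "found job: ${job.fullName}"
        job.setDisabled(true)   // freestyle jobs also expose disable()/enable()
    }

    // 2. See which agents exist and whether they are online.
    for (c in controller.computers) {
        echo "${c.name ?: 'built-in'} online=${c.online}"
    }

    // 3. Trigger another job. Pinning it to a particular agent is normally done by the
    //    target job itself (label expression or a node parameter), not by the caller.
    build job: 'my-folder/other-job', wait: false
}
```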

Reindexing/Updating Elasticsearch using Logstash on Jenkins

I would like to automate the process of updating Elasticsearch with the latest data on demand and, secondly, recreating the index along with feeding data, using a Jenkins job.
I am using the jdbc input plugin to fetch data from two different databases (PostgreSQL and Microsoft SQL Server). When the Jenkins job is triggered on demand, Logstash should run the config file and perform the tasks described above. We also have a cron job running on the same server (AWS) where the on-demand Logstash job would run. The issue is that the job triggered via Jenkins starts another Logstash process alongside the Logstash already run by the cron job on the AWS server. This ends up leaving multiple Logstash processes running without terminating them once the on-demand work is done.
Is there a way to achieve this scenario? Is there a way to terminate the Logstash started via the Jenkins job, or some sort of queue that would let us handle our on-demand Logstash requests?
PS: I am new to ELK stack
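One way to stop Jenkins-triggered runs from piling up next to the cron-driven Logstash is to give the on-demand run its own config file and data path, kill any previous on-demand instance before starting a new one, and let Logstash exit on its own: a jdbc input with no schedule runs its statement once and the process then shuts down after flushing. The Jenkinsfile below is only a rough sketch; the agent label, paths, and config file name are placeholders.

```groovy
// Rough sketch of an on-demand Logstash run from Jenkins.
// Agent label, paths, and the config file name are placeholders.
pipeline {
    agent { label 'elk' }                    // the AWS server where Logstash lives
    options { disableConcurrentBuilds() }    // no two on-demand runs overlap
    stages {
        stage('Stop previous on-demand Logstash') {
            steps {
                // Match only the on-demand config so the cron-managed Logstash is untouched.
                sh 'pkill -f "logstash.*ondemand-reindex.conf" || true'
            }
        }
        stage('Run on-demand Logstash') {
            steps {
                // A separate --path.data lets this instance run alongside the cron one.
                // With no `schedule` on the jdbc inputs, the queries run once and
                // Logstash exits after flushing, leaving no stray process behind.
                sh '''
                  /usr/share/logstash/bin/logstash \
                    -f /etc/logstash/ondemand/ondemand-reindex.conf \
                    --path.data /var/lib/logstash-ondemand
                '''
            }
        }
    }
}
```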

Different tools available for creating data pipelines

I need to create data pipelines in Hadoop. I have data import, data export, and data-cleaning scripts set up, and now need to put them together in a pipeline.
I have been using Oozie for data import and export schedules, but now need to integrate R scripts for the data-cleaning process as well.
I see Falcon is used for the same.
1) How do I install Falcon in Cloudera?
2) What other tools are available for creating data pipelines in Hadoop?
2) I'm tempted to answer NiFi (from Hortonworks): since this post on LinkedIn it has grown a lot and it is very close to replacing Oozie. At the time of writing, the main difference between Oozie and NiFi is where they run: NiFi runs on an external cluster, while Oozie runs inside Hadoop.

Copy on-prem data to S3 using AWS Data Pipeline

How do I import data from an on-prem SQL database into Amazon S3 using AWS Data Pipeline?
Links to any tutorials would greatly help.
You will need to run a Task Runner on-prem. That Task Runner can execute a pipeline that exports data from a SQL database to S3.
If your database is MySQL, Data Pipeline can export the data directly, as shown in the tutorial below:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-copydata-mysql.html
If it is not MySQL, you need to write an export job yourself, e.g. use Talend DI to create an ETL script and wrap the Talend job in a shell activity (ShellCommandActivity).
There is also an on-prem Task Runner template:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-localmysqltords.html
