Apache NiFi Registry deployment using a git repo as the flow repository - apache-nifi

We would like to use NiFi Registry with git as the storage engine. I modified providers.xml accordingly and was able to save the flows there.
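For reference, the wiring looks roughly like this. This is only a sketch assuming a default install layout; the XML property names come from the NiFi Registry admin guide, while the paths, remote name, and credentials are placeholders:

# conf/providers.xml: replace the default FileSystemFlowPersistenceProvider
# element inside <providers> with the git-backed provider, e.g.:
#
#   <flowPersistenceProvider>
#       <class>org.apache.nifi.registry.provider.flow.git.GitFlowPersistenceProvider</class>
#       <property name="Flow Storage Directory">./flow_storage</property>
#       <property name="Remote To Push">origin</property>
#       <property name="Remote Access User">git-user</property>
#       <property name="Remote Access Password">git-access-token</property>
#   </flowPersistenceProvider>
#
# The Flow Storage Directory must already be a git clone with the remote configured:
cd /opt/nifi-registry                                         # assumed install location
git clone git@example.com:team/nifi-flows.git ./flow_storage  # repo that will hold the flows
./bin/nifi-registry.sh stop && ./bin/nifi-registry.sh start   # restart to pick up the provider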
Challenges:
There is no two-way sync. We can only save flows modified by a NiFi user; if we modify a flow directly in the git repository, the change is not reflected in NiFi Registry.
There is no review or approval process in NiFi Registry. A user has to log in to the nifi-registry server, create a branch, and issue a pull request.
As a workaround, we can delete the database file (H2) and restart NiFi Registry.
Lastly, everything should be automated in CI/CD, as we do for a regular Maven project.
Any suggestions?

The purpose of the git storage is mostly to let users visualize the differences through tools like GitHub, or any other tool that supports diffs; plus, by pushing to a remote you also get a remote backup of the flow content. It is not meant to be modified outside of the application, just like you wouldn't bypass an application, go right into its database, and start changing data.

Related

How to update flows from dev to prod with state

I have a NiFi flow that keeps some state with the ListS3 processor.
I have a dev instance and a prod instance.
I want options for deploying from dev to prod where the state is kept and where I don't have to manually go in and change all the processors and process groups.
It seems like this can't be done with templates, based on the following Stack Overflow question:
how does NiFi ListFile maintain its timestamp?
edit:
Just so there is no misunderstanding: I want to keep the prod state when deploying.
It sounds like you aren't using NiFi Registry, so you're downloading a flow template and then importing it. This can't preserve state, as it's not the same flow.
You should be using NiFi Registry to version control your flows, which supports this Dev -> Prod workflow.
Build your flow in Dev NiFi and version it to the Registry.
In Prod, add a new Process Group and select the Import option when it asks you for a name. You'll be able to pick your versioned flow.
Run your flow so that it stores some state. View the processor's state to verify.
Now update the flow in Dev and commit the local change to Registry.
Then update the flow in Prod to the latest version from Registry. It will preserve state on the stateful processor. You can also script this promotion, as sketched below.
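If you later want to script that promotion rather than click through the UI, the NiFi Toolkit CLI can drive the same Registry operations. This is only a rough sketch: the URLs and identifiers are placeholders, and the exact flags should be checked against your toolkit version.

# on the prod side: import the versioned flow once, then bump versions afterwards
TOOLKIT=/opt/nifi-toolkit/bin/cli.sh                        # assumed toolkit location
$TOOLKIT registry list-buckets -u http://registry:18080     # find the bucket id
$TOOLKIT registry list-flows -b <bucket-id> -u http://registry:18080
# first deployment: creates a new process group bound to the versioned flow
$TOOLKIT nifi pg-import -b <bucket-id> -f <flow-id> -fv 1 -u http://prod-nifi:8080
# subsequent deployments: upgrade in place, which keeps processor state
$TOOLKIT nifi pg-change-version -pgid <process-group-id> -fv 2 -u http://prod-nifi:8080

The pg-change-version step is what keeps the component state, since the existing process group is upgraded in place rather than re-imported.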
For detailed steps on installing & using Registry, see these links:
https://nifi.apache.org/docs/nifi-registry-docs/html/getting-started.html
https://pierrevillard.com/2018/04/09/automate-workflow-deployment-in-apache-nifi-with-the-nifi-registry/
https://alasdairb.com/2021/03/22/nifi-in-production-nifi-registry/
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.2.0/versioning-a-dataflow/content/connecting-to-a-nifi-registry.html
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.0/getting-started-with-nifi-registry/content/import-a-versioned-flow.html
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.0/getting-started-with-nifi-registry/content/save-changes-to-a-versioned-flow.html
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.0/getting-started-with-nifi-registry/content/start-version-control-on-a-process-group.html

GitLab Custom CI configuration path and merge request

For one of our repositories we set "Custom CI configuration path" inside GitLab to a remote gitlab-ci.yml. We want to do this to prevent developers from changing the gitlab-ci.yml file (as protected files are only available in EE Premium and up). But apart from this purpose, the Custom CI configuration path feature should still work for Merge Requests.
Being in repo
group1/repo1
we set
.gitlab-ci.yml#group1/repo1-ci
The repo1-ci repository exists, and CI works correctly when we push to the configured branches, etc.
For Merge Request functionality GitLab tells us:
Detached merge request pipeline #123 failed for ...
Project group1/repo1-ci not found or access denied!
We added the developers to the repo1-ci repo as Developers so they could read the files. It does not help. In any case, the expectation is that the pipeline does not run with user permissions, so it should simply find the gitlab-ci.yml file.
Any ideas on this?
So our expectations were right, and it seems we have to add one important thing to our considerations:
If a user interacts with the Merge Request features in the GitLab UI and you are using "Custom CI configuration path" for your gitlab-ci.yml file, please ensure that:
the user has at least read permission on that remote file, even if you moved it to another repo on purpose (e.g. to use enhanced file protection in Premium/Ultimate or to push/merge-protect the branches for the Developer role)
the permission change has actually been applied to the user's running session
The last part failed for our users, as it only worked a day later. It seems they just continued working from their open merge request page, and GitLab checks accessibility from that session (using a cookie, token, or something that was not updated with the new access to the remote repo/file).
It works!

How to restore flows from git in NiFi Registry?

I'm using GitFlowPersistenceProvider in NiFi Registry 0.3. Today I created another NiFi Registry instance and wanted to load all flows from the previous one using the same provider. Unfortunately nothing happens: neither buckets nor flows are recreated. I tried creating all the buckets manually, but even then no flows are imported.
GitFlowPersistenceProvider documentation states:
When NiFi Registry starts, this provider reads through Git commit histories and lookup these bucket.yml files to restore Buckets and Flows for each snapshot version.
What should I do to load existing flows into a new NiFi Registry using GitFlowPersistenceProvider?
Unfortunately that documentation is not totally accurate. Currently there is a metadata DB which defaults to an embedded H2, but can also be Postgres, and then the flow storage. You would need to restore both in order to spin up a new instance with the same data.
In the next release there is a new feature where if you start a new instance with a completely empty DB (i.e. no buckets) and the git flow provider, then it will restore everything.
You can do the same by stopping nifi-registry 0.4.0, deleting the database file (if any), and then starting NiFi Registry to rebuild the database from the git repo, as sketched below.
https://issues.apache.org/jira/browse/NIFIREG-209
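A minimal sketch of that rebuild, assuming the default embedded H2 location from nifi-registry.properties; adjust the paths for your install and back the files up first if unsure:

cd /opt/nifi-registry                    # assumed install location
./bin/nifi-registry.sh stop
rm -f database/nifi-registry-primary.*   # default embedded H2 metadata files
./bin/nifi-registry.sh start             # on startup, GitFlowPersistenceProvider
                                         # repopulates buckets/flows from the git history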

How to push from Gitlab to Github with webhooks

My Google-fu is failing me for what seems obvious if I can only find the right manual.
I have a Gitlab server which was installed by our hosting provider
The Gitlab server has many projects.
For some of these projects, I want Gitlab to automatically push to a remote repository (in this case Github) every time there is a push from a local client to Gitlab.
Like this: client --> gitlab --> github
Any tags and branches should also be pushed.
AFAICT I have 3 options:
Configure the local client with two remotes, and push simultaneously to Gitlab and Github. I want to avoid this because developers.
Add a git post-receive hook in the repository on the Gitlab server. This would be most flexible (I have sufficient Linux experience to write shell scripts as git hooks) and I have found documentation on how to do this, but I want to avoid this too because then the hosting provider will need to give me shell access.
Use webhooks in Gitlab. I am unfamiliar with even the basics of webhooks, and I am unable to locate understandable documentation or a simple step-by-step example. This is the documentation from Gitlab that I found, and I do not understand it: http://demo.gitlab.com/help/web_hooks/web_hooks
I would appreciate good pointers, and I will summarize and document a solution when I find it.
EDIT
I'm using this Ruby code for a web hook:
require 'sinatra/base'
require 'json'

class PewPewPew < Sinatra::Base
  post '/pew' do
    # GitLab POSTs the push event as a JSON body
    push = JSON.parse(request.body.read)
    puts "I got some JSON: #{push.inspect}"
  end
end
Next: find out how to tell the gitlab server that it has to push a repository. I am going back to the GitLab API.
EDIT
I think I have an idea. On the server where I run the webhook, I pull from GitLab and then push to Github. I can even do some "magic" (running tests, building jars, deploying to Artifactory, ...) before I push to GitHub. In fact it would be great if Jenkins were able to push to a remote repository after a successful build; then I wouldn't need to write my own webhook, because I'm pretty sure Jenkins already provides a webhook for Gitlab, either natively or via a plugin. But I don't know. Yet.
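For what it's worth, the pull-then-push idea itself is just a mirror clone plus a mirror push. A sketch with placeholder URLs; this is what the webhook handler (or a Jenkins job) would run on each push event:

# one-time setup
git clone --mirror git@gitlab.example.com:group/project.git project.git
cd project.git
git remote add github git@github.com:myorg/project.git

# on every push event
git fetch -p origin        # update all branches and tags from GitLab
git push --mirror github   # replicate them to GitHub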
EDIT
I solved it in Jenkins.
You can set more than one git remote in a Jenkins job. I used Git Publisher as a Post-Build Action and it worked like a charm, exactly what I wanted.
Option 1 (two remotes on the client) would work, of course.
Option 2 (a post-receive hook) is possible but dangerous, because gitlab-shell automatically symlinks hooks into repositories for you, and those are necessary for permission checks: https://github.com/gitlabhq/gitlab-shell/tree/823aba63e444afa2f45477819770fec3cb5f0159/hooks so I'd rather stay away from it.
Option 3, web hooks, is not suitable directly: they make an HTTP request with a fixed format on certain events (in your case, push), not Git protocol requests.
Of course, you could write a server that consumes the hook, clones, and pushes, but a service (single push and no deployment) or GitLab CI (already implements hook management) would be strictly better solutions.
Services would be the best option if someone implemented one: they live in the source tree, would do a single push, and require no extra deployment overhead.
GitLab CI, or other CIs like Jenkins, is the best option currently available. They are essentially already-implemented servers for the webhooks, which automatically clone for you: all you have to do then is push from them (a minimal push step is sketched below).
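If you take the CI route, the job body reduces to a single push to the second remote. A sketch, assuming the runner has already cloned and checked out the commit and that a GitHub token and branch name are available as variables you define yourself (GITHUB_TOKEN and BRANCH_NAME below are placeholders, not GitLab-provided):

# push the checked-out branch and the tags to GitHub using a token stored in a CI variable
git push "https://${GITHUB_TOKEN}@github.com/myorg/project.git" "HEAD:refs/heads/${BRANCH_NAME}"
git push "https://${GITHUB_TOKEN}@github.com/myorg/project.git" --tags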
The keywords you want to Google for are "gitlab mirror github". That led me to Gitlab repository mirroring, for instance. There seems to be no perfect, easy solution today.
Also, this has already been proposed at the feature request forum: http://feedback.gitlab.com/forums/176466-general/suggestions/4614663-automatic-push-to-remote-mirror-repo-after-push-to Always check there ;) Go and upvote the request.
The key difficulty now is how to store the push credentials.
I solved it in Jenkins. You can set more than one git remote in a Jenkins job. I used Git Publisher as a Post-Build Action and it worked like a charm, exactly what I wanted.
I added "-publisher" jobs that run after the corresponding build job has built successfully. I could have done it in one job, but I decided to split it up. The build jobs are triggered by a web hook in GitLab; the publisher jobs use a #daily schedule from the BuildResultTrigger plugin.

configuring compatible development and production sites

I am developing a Magento site.
I have access to a local host and a remote host and would like to configure development and production environments. On the remote host I restore the database data that was backed up on the local host, but when I do so I overwrite the host's base URL, and this causes the site to be redirected to a nonexistent URL when the page is loaded. How can I avoid this clash?
I want to be able to develop either (a) on http://remotehost/foobardev and back up my data to http://remotehost/foobar, or otherwise (b) develop on http://localhost/foobar and deploy on http://remotehost/foobar. I want to know how to transfer the database data back and forth without overwriting the values found in Magento Admin Panel -> System -> Configuration -> Web -> Unsecure Base URL / Secure Base URL when I run mysql and use the source command to reinstate the database entries from the development site onto the production site.
So, I would like an easier way to restore the database contents without overwriting the base URL configured in the Magento admin panel, as doing so causes a redirect to a nonexistent or wrong place on each page load and thus renders the system unusable.
Not exactly an SO type of question. Magento EE has staging built in and can merge your data as well. You have to understand that syncing data from dev to live is not easily possible without a serious sync framework that keeps track of the state of every row and column, knows which data is new and which is old, and resolves syncing conflicts.
Here's your flow, based on the assumption that you are using CE, which does not have data migration tools bundled:
set up the live database and accept that data will move only from live to dev and never from dev to live, as you don't have data migrations. Every config change you need to make and preserve at the database level, make it on the live database (test it out in the dev environment first, then create it in live)
make a shell script, Fabric script, or whatever deployment script you are comfortable with that exports a live DB dump, drops the dev database if it exists, creates a new one, imports the live dump into it, and then runs a pre- or post-import SQL script that changes/deletes environment-dependent config values (like base_url, secure_base_url, etc.); a minimal sketch follows this list
to avoid double data entry, always create the attributes and config values you need to preserve with Magento setup scripts.
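A minimal sketch of that sync script for Magento CE; the database names, credentials, and URLs are placeholders, while core_config_data and the web/*/base_url paths are the standard Magento config locations:

#!/bin/sh
# pull the live DB into dev, then fix environment-dependent config values
mysqldump -u root -p live_magento > live_dump.sql
mysql -u root -p -e "DROP DATABASE IF EXISTS dev_magento; CREATE DATABASE dev_magento"
mysql -u root -p dev_magento < live_dump.sql
mysql -u root -p dev_magento -e "
    UPDATE core_config_data
       SET value = 'http://localhost/foobar/'
     WHERE path IN ('web/unsecure/base_url', 'web/secure/base_url');"
# clear the Magento config cache afterwards so the new URLs take effect,
# e.g. rm -rf var/cache/* in the dev Magento root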
The same goes for code; here's a common setup scenario with live, stage, and development environments:
one master version control repository (preferably bare, just so nobody can change files there directly) based on a clean Magento versions tree
separate branches for each environment (live, stage, dev(n)) and a verified code flow from dev (where you develop and the codebase may be broken) to stage (where the release candidate resides, ready for testing, and does not change) and from stage to live (where your live code is in a stable state)
every developer works on a checkout from the dev branch, commits to their own dev branch, and then pushes changes to dev, where they can be evaluated to decide whether they are mature enough for staging
stage is where the release candidate lives and where the client (or automated tests) can check whether it's ready to be released; no one ever changes code here, and code comes only from the dev branch
live is the running version, where no one ever changes any code directly. If tests pass, code can come here from stage only.
So to visualise it better, imagine your codebase residing in git:
myproject_magento_se (your project git repository on bitbucket.org, GitHub, or wherever you can host it)
--> master (branch with all clean Magento versions, from your current one to the latest)
--> dev (while on master: git checkout -b dev, or branch from a specific version on master)
--> stage (while on dev: git checkout -b stage)
--> live (while on stage: git checkout -b live)
and imagine your hosts set up like this:
www.mylivesite.com = git clone yourgitrepo; git checkout live;
stage.mylivesite.com = git clone yourgitrepo; git checkout stage;
dev.mylivesite.com = git clone yourgitrepo; git checkout dev;
For all this, you'd better have deployment scripts that do the switching and the code and database lifting between environments at the push of a button.
Here are a few common actions that you need to perform daily with every software project:
move/reset data from live to stage and from live to dev (with obfuscation calls if needed to scramble or change client-related data)
move code from dev to stage
move code from stage to live (sketched after this list)
reset/recreate any dev environment with the live state (data and code)
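A minimal sketch of the stage-to-live code promotion; the repository, host, and path are placeholders, and the data reset is sketched earlier:

git checkout live
git merge stage                   # promote the tested release candidate
git push origin live
ssh deploy@www.mylivesite.com \
    'cd /var/www/mylivesite && git fetch origin && git reset --hard origin/live'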
Have fun :) and go through this thread as well: https://superuser.com/questions/90301/sync-two-mysql-databases and any others you can find by searching SO for similar topics.
