Clear the cache of FetchDistributedMapCache processor - apache-nifi

How to clear the cache of FetchDistributedMapCache processor in Apache NiFi?
I tried deleting the persisted directory and also tried giving a new directory altogether, but it still fetches old data. Thanks for your help.

You should be able to stop the DistributedMapCacheClient and DistributedMapCacheServer, then delete the existing DistributedMapCacheServer and create a new one with the same port as the previous one, then start them back up.

Inside NiFi, you could create a new DistributedMapCacheServer and point your processor at that instead. Outside of NiFi, I've written a Groovy script that lets you interact with the DistributedMapCacheServer from the command line. The API only allows you to remove entries you know about; in the upcoming NiFi 1.2.0 release, you will be able to remove entries using a regular expression for the keys (implemented in NIFI-3627). At that point I will update the Groovy script to enable that feature.
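If you only need to evict a handful of known keys from inside the flow, here is a minimal, hypothetical sketch of what that looks like against the DistributedMapCacheClient API. It assumes you already hold a reference to the client controller service (for example, inside a custom processor); the class name and key are illustrative, not part of any existing script.

// Minimal sketch: removing a known entry through NiFi's DistributedMapCacheClient API.
// "client" is assumed to be an already-obtained reference to the controller service.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
import org.apache.nifi.distributed.cache.client.Serializer;

public class CacheEntryRemover {

    // Keys are serialized as UTF-8 bytes, matching how string keys are written by
    // PutDistributedMapCache and looked up by FetchDistributedMapCache.
    private static final Serializer<String> STRING_SERIALIZER =
            (String value, OutputStream out) -> out.write(value.getBytes(StandardCharsets.UTF_8));

    public static boolean removeEntry(DistributedMapCacheClient client, String key) throws IOException {
        // Returns true if the server actually held (and removed) the entry.
        return client.remove(key, STRING_SERIALIZER);
    }
}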

Related

Impossible to add extensions on NiFi

I want to add 3 extensions to NiFi (nifi-encryptMD5-nar-1.0.nar-unpacked, nifi-getOperator-nar-1.0-SNAPSHOT.nar-unpacked, nifi-splitAttributeValue-nar-1.0.nar-unpacked).
I added the extension folders to the directory /opt/nifi/nifi-1.9.2/work/nar/extensions/.
Then when I restart the NiFi service, NiFi shuts down and does not come back up. When I force the start with the user nifi, NiFi starts, but the extensions have been deleted from the directory /opt/nifi/nifi-1.9.2/work/nar/extensions/.
You have to put the *.nar packages into the nifi/lib directory.
NiFi will extract them automatically on startup into the nifi/work folder.
As daggett says, you need to use the .nar files, not any unpacked directories.
In your nifi.properties there will be two or more properties that provide locations for NiFi libraries:
nifi.nar.library.directory=./lib
nifi.nar.library.autoload.directory=./extensions
nifi.nar.library.directory.<something>=./<yourdir>
The first is the default and contains all the basic NiFi files. It is only checked on startup and any valid nars found are unpacked in the work directory and loaded. Generally you don't want to add anything here except in test environments as it complicates upgrades.
The second is empty by default but it is scanned every 30 seconds for new .nars. These will be unpacked and loaded if possible, but only for new libraries. Already loaded libraries will not be reloaded.
This is a good location to add your validated custom libraries without having to restart NiFi.
The third and any further ones need to be added manually to the properties file. These are loaded on startup only and are useful if you have a lot of custom processors and want to keep them organized.
In your situation I'd put the .nars in the extensions folder and check the logs to see if they were loaded successfully. You'll then need a full refresh of the browser window (Shift+F5 I think) before they show up in the list of processors.
In a cluster setup, add the .nars on all nodes and verify their availability before trying to add them to the canvas or things might get messy.

Search Templates | Loading & Caching the script file locally

Context:
We are moving from ES 5.X to ES 7.X
Earlier we were using the Jest client; now we are planning to use the ES High-Level Client (EHLC)
Our search queries are complex and we are planning to use SearchTemplate API
We will store template files locally & cache them to reduce the overhead of I/O
What I have tried so far:
I've read the documentation of EHLC and I can't find a mechanism to load & cache script files directly from the file system
I can see that we can store the script in ES, which we don't want to do, since presumably we won't have change logs there.
Question:
Is there a built-in mechanism to use a locally stored file as a script in EHLC? Or should we use inline scripts and load & cache the script file using custom code?
Based on the comments I'd suggest the following:
Keep track of the templates with git.
Monitor the changes and trigger a pub/sub message whenever applicable (PR merges etc.).
Configure your pub/sub handler to update the stored search template in ES.
Otherwise, with local loading and caching, machines running slightly older EHLC processes wouldn't be notified of the most recent changes in git and would keep using stale scripts.
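If you do go with stored templates, here is a rough sketch of what the pub/sub handler's update step and the query side could look like with the high-level client. The template id, index name, and file path are illustrative placeholders, not anything prescribed by EHLC.

// Sketch of registering/updating a stored search template and querying with it
// via the Java High Level REST Client. Template id, index name and file path are
// illustrative placeholders.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import org.elasticsearch.action.admin.cluster.storedscripts.PutStoredScriptRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.script.ScriptType;
import org.elasticsearch.script.mustache.SearchTemplateRequest;
import org.elasticsearch.script.mustache.SearchTemplateResponse;

public class SearchTemplateSync {

    // Called by the pub/sub handler whenever the template changes in git.
    // The file is expected to contain {"script": {"lang": "mustache", "source": ...}}.
    public static void upsertTemplate(RestHighLevelClient client, String templateId,
                                      String templatePath) throws Exception {
        String body = new String(Files.readAllBytes(Paths.get(templatePath)), StandardCharsets.UTF_8);
        PutStoredScriptRequest request = new PutStoredScriptRequest()
                .id(templateId)
                .content(new BytesArray(body), XContentType.JSON);
        client.putScript(request, RequestOptions.DEFAULT);
    }

    // Query side: reference the stored template by id instead of shipping the source inline.
    public static SearchTemplateResponse search(RestHighLevelClient client, String templateId,
                                                Map<String, Object> params) throws Exception {
        SearchTemplateRequest request = new SearchTemplateRequest();
        request.setRequest(new SearchRequest("my-index"));
        request.setScriptType(ScriptType.STORED);
        request.setScript(templateId);
        request.setScriptParams(params);
        return client.searchTemplate(request, RequestOptions.DEFAULT);
    }
}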

How to plug in a process of identifying sensitive information somewhere in ETL pipeline?

Hope you are doing well!
We have already developed an ETL pipeline using Apache NiFi, which gets triggered only when the client uploads a source data file from the portal. After that, the data in the source file goes through various layers, gets transformed, and is stored back into the warehouse (i.e. Hive).
Goal: identify sensitive information and mask it so that the end user won't see the actual data.
Identifying sensitive data & masking strategy: we will make use of open-source tools to achieve this goal as follows.
Data Steward Studio: this tool allows me to identify sensitive information and tag it properly.
Apache Atlas: once the data steward has confirmed a tag, it will be pushed into Apache Atlas.
Apache Ranger: finally, we can define a tag-based masking policy using Apache Ranger which will allow or deny access for specific users.
For more details on the above solution, please see this link:
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: in order to feed the data to the DSS tool, it first has to be loaded into a Hive table. That is fine, but we cannot stop the existing ETL flow in between and then start the identification of sensitive information. The above solution requires a manual process which I want to get rid of and automate; that is, it should be plugged in somewhere within the NiFi pipeline. But so far, as per my understanding, DSS does not allow us to do something like that.
Manual process:
Create an asset collection
Accept/reject the suggested tags within DSS.
If we cannot plug the identification process into the pipeline, then the client's sensitive data will be exposed and visible to everyone in the team. I want something that lets us de-identify sensitive data before it actually gets loaded into HDFS or Hive tables.
Please share your response on this problem if anyone has already worked in this particular area.
  
I did not test it, but here are my thoughts on this challenge.
1. Set up the system such that data is NOT visible to everyone (or anyone) by default
2. Load the data into Hive
3. Let the profilers run and accept their suggestions
4. Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. how to automate steps 3 and 4, and whether you can solve this with tags alone or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive will be encrypted in the first place and will not be visible to the stewards. Once the tagging has been done, then in the subsequent part of the pipeline (where I'm assuming you're using NiFi as well) you can decrypt the content back as required.
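Purely to illustrate the encrypt-on-ingest / decrypt-downstream round trip described above (EncryptContent itself is configured in the NiFi UI rather than coded, and it works on flow file content rather than individual fields), here is a generic, hypothetical AES-GCM sketch; the key handling and sample value are placeholders:

// Generic illustration of encrypting a sensitive value before load and decrypting it
// downstream. This is NOT how EncryptContent works internally; it only shows the idea.
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class ValueCrypto {

    private static final int GCM_TAG_BITS = 128;
    private static final int IV_BYTES = 12;

    public static String encrypt(SecretKey key, String plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        // Prepend the IV so the downstream decrypting step can recover it.
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(out);
    }

    public static String decrypt(SecretKey key, String encoded) throws Exception {
        byte[] in = Base64.getDecoder().decode(encoded);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_BITS, in, 0, IV_BYTES));
        byte[] plaintext = cipher.doFinal(in, IV_BYTES, in.length - IV_BYTES);
        return new String(plaintext, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        SecretKey key = gen.generateKey();
        String masked = encrypt(key, "4111-1111-1111-1111");
        System.out.println(masked + " -> " + decrypt(key, masked));
    }
}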

How to move an Elasticsearch index using the file system?

Use case:
I have created ES indexes mywebsiteindex-yyyymmdd and mysharepointindex-yyyymmdd on my laptop/dev machine. I want to export/zip an index as a file. The file may be migrated by someone who has credentials to the target machine, and the zip/file may be imported into the target Elastic folder.
You can abstract the words 'machine', 'folder', and 'zip' above. The focus is 'transfer the index as a file and reimport it at a target which I may not be able to reach over http/tcp/ftp/ssh'.
Is there any Python/other script out there that can export from source and import to target? A script that hides the internal complexities of node/cluster count differences between dev/prod etc., and just moves the index.
Note: I already referred to the below page, so no need to reiterate the same
https://www.elastic.co/guide/en/cloud/current/ec-migrate-data.html
There are some options:
You can use the snapshot and restore API to create a snapshot of your index and restore it in your new instance (this is the recommended way; a sketch with the high-level client follows below).
You can use the reindex API in your new instance to reindex your index from remote.
You can use Logstash with your old instance as an input and your new instance as the output.
And you can write a script/application using one of the supported clients to query your index, export it to a file, read that file, and import it into your new instance (Logstash can also do that).
But you can't simply move your data files; this is neither supported nor recommended by Elastic.
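For the snapshot and restore route, here is a rough sketch with the Java High Level REST Client and a shared filesystem ("fs") repository. The repository name, location, and index name are placeholders, and the location has to be whitelisted via path.repo in elasticsearch.yml on the nodes.

// Rough sketch of option 1 (snapshot and restore) with the Java High Level REST Client.
import org.elasticsearch.action.admin.cluster.repositories.put.PutRepositoryRequest;
import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotRequest;
import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class IndexMover {

    // Run against the source cluster: register the repo and snapshot the index into it.
    public static void snapshot(RestHighLevelClient source) throws Exception {
        PutRepositoryRequest repo = new PutRepositoryRequest("my_fs_repo")
                .type("fs")
                .settings(Settings.builder().put("location", "/tmp/es-backup").build());
        source.snapshot().createRepository(repo, RequestOptions.DEFAULT);

        CreateSnapshotRequest snap = new CreateSnapshotRequest("my_fs_repo", "snap_1")
                .indices("mywebsiteindex-20200101")
                .waitForCompletion(true);
        source.snapshot().create(snap, RequestOptions.DEFAULT);
        // The repository directory can now be zipped and carried to the target machine.
    }

    // Run against the target cluster after the repository directory has been copied over
    // and registered there under the same name.
    public static void restore(RestHighLevelClient target) throws Exception {
        RestoreSnapshotRequest restore = new RestoreSnapshotRequest("my_fs_repo", "snap_1")
                .waitForCompletion(true);
        target.snapshot().restore(restore, RequestOptions.DEFAULT);
    }
}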

Disabling/Pause database replication using ML-Gradle

I want to disable the Database Replication from the replica cluster in MarkLogic 8 using ML-Gradle. After updating the configurations, I also want to re-enable it.
There are tasks for enabling and disabling flexrep in ml-gradle, but I couldn't find any such thing for Database Replication. How can this be done?
ml-gradle uses the Management API to handle configuration changes. Database Replication is controlled by sending a PUT request to /manage/v2/databases/[id-or-name]/properties. Update your ml-config/databases/content-database.json file (the example file does not include that property yet) to include database-replication, with replication-enabled set to true.
To see what that object should look like, you can send a GET request to the properties endpoint.
You can create your own command to set replication-enabled - see https://github.com/rjrudin/ml-gradle/wiki/Writing-your-own-management-task
I'll also add a ticket for making official commands - e.g. mlEnableReplication and mlDisableReplication, with those defaulting to the content database, and allowing for any database to be specified.
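For reference, a bare-bones, hypothetical sketch of that PUT using Apache HttpClient directly rather than a custom ml-gradle command. The host, port, credentials, database name, and especially the JSON body are placeholders; do the GET against the properties endpoint first and reuse the database-replication object it returns. Apache HttpClient will negotiate the Manage app server's authentication challenge (typically digest) from the credentials provider.

// Hypothetical sketch: toggling replication via the Management API properties endpoint.
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ToggleReplication {

    public static void setReplicationEnabled(String database, boolean enabled) throws Exception {
        BasicCredentialsProvider creds = new BasicCredentialsProvider();
        creds.setCredentials(new AuthScope("localhost", 8002),
                new UsernamePasswordCredentials("admin", "admin"));

        // Placeholder body: fetch the real database-replication object with a GET first
        // and send it back with replication-enabled toggled.
        String body = "{ \"database-replication\": { \"replication-enabled\": " + enabled + " } }";

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultCredentialsProvider(creds)
                .build()) {
            HttpPut put = new HttpPut("http://localhost:8002/manage/v2/databases/" + database + "/properties");
            put.setEntity(new StringEntity(body, ContentType.APPLICATION_JSON));
            try (CloseableHttpResponse response = client.execute(put)) {
                System.out.println(response.getStatusLine());
            }
        }
    }
}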
