nifi Could not find Process Group with ID - apache-nifi

After installing nifi, I am trying to create a flow to test hdfs-nifi connectivity. But I am getting following error continuously for every click on the dashboard.
I am the root. so I have complete access to the components.

Did you copy a flow.xml.gz file from an existing instance of NiFi, or have controller services, reporting tasks, or other components which reference a group that no longer exists?
Try searching for the UUID using the search bar in the top right, or shut down NiFi, and use the following terminal commands to look for any references to this process group ID (double check or copy/paste the UUID because I typed it from looking at your screenshot):
cd $NIFI_HOME
gunzip -k conf/flow.xml.gz
grep '1861df7a-0168-1000-3931-028d9eb92cbd' conf/flow.xml
You should remove the referenced process group (you can back up the flow.xml.gz first if you are concerned about data loss (this would be the flow definition, not any flowfiles, content, or provenance data).

Related

How to get Validation error message into a Attribute from ValidateResult Processor of Nifi

I am trying to validate a json using ValidateRecord Processor via Avroschemaregistry. I need to store validation error message into a sql table, so i tried to capture the error message in attribute but i am unable to capture the error message in attribute, any idea how to do it
After your ValidateRecord Processor, you can choose to route flow files which are 'invalid' to a separte log and route them to your sql table, you can do the same if they 'fail'. I am assuming from the 'error message' you mean the 'bullentin' which would occur when the Processor can neither validate or invalidate the flow file based on your schema.
A potential solution to this is to use the SiteToSiteBulletinReportingTask
Screenshot of SiteToSiteBulletinReportingTask
You can build a dataflow to receive these bulletin events, manipulate them as you want and store them in a location of your choice for your auditing needs.
From the sounds of it, the SiteToSiteBulletinReportingTask should be able to achieve what you want. To implement this, add a iteToSiteBulletinReportingTask to the 'Reporting Tasks' in the NiFi Settings: Reporting Tasks in NiFi Settings
You can name your input port and have it flow towards your SQL store and you should have what you're after.
You need to allow NiFi nodes to receive data via site-to-site on the input port and you also need to grant the correct permissions on the root process group so the nodes are able to see the component, view and modify the data.
Side note: I would usually log everything, and have all failures and invalid route to log files, which I put to store, e.g. HBase/SQL. One suggestion I've seen is configure the logging subsystem to additionally send specific error categories to your destination of choice (e.g. active notification vs passive parsing of logs). NiFi is leveraging a very flexible logback system (an evolution of log4j). The best part - changes to the $NIFI_HOME/conf/logback.xml configuration file do not require an instance restart, will be picked up within 30 seconds or less.

How to plug in a process of identifying sensitive information somewhere in ETL pipeline?

Hope you are doing well !
We have already developed ETL pipeline using apache NiFi. Which gets trigger only when client uploads source data file from portal.After that, the data present inside source file goes through various layers,gets transformed and stored back to warehouse(i.e. hive).
Goal : To identify sensitive information and mask it so that end user won't see actual data.
Identify Sensitive data & masking strategy : We will make use of open source tool to achieve this goal as follow.
Data steward studio : This tool allow me to identify sensitive information and tag it properly.
Apache Atlas : Once data steward user has confirmed the tag then that tag will be pushed into Apache atlas.
Apache ranger : At the final, we can define tag based-masking policy using Apache ranger which will allow or deny to specific user. 
For more details on above solution , please visit link.
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem : In order to feed the data to DSS tool, it should be loaded first in hive table. That is fine. But we cannot stop the existing ETL flow in-between and then start identification process of sensitive information. The above solution must require some manual process which i want to get rid of and make it automated.that is, it should be plugged in somewhere within NiFi pipeline.But so far, as per my understanding DSS do not allow us to do something like that.
Manual Process :
Create Asset collection
Accept/Reject suggested tags within DSS.
If we cannot plug identification process in pipeline, then client sensitive data will be exposed to everyone and visible to everyone in team. I want something where we can de-identify sensitive data before it actually get loaded into HDFS or hive tables.
Please write your response to me on the same problem, if anyone has already worked into this particular area.
  
I did not test it, but here are my thoughts on this challenge.
Set up the system such that data is NOT visible to everyone(or anyone) by default
Load the data into hive
Let the profilers run and accept its suggestions
Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. How to automate step 3/4 and whether you can just solve this with tags or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
One idea might be to use EncryptContent of nifi (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive will be encrypted in the first place and would not be visible to the stewards. Once the tagging has been done - then in the subsequent part of the pipeline (where I'm assuming you're using nifi as well) - you can decrypt back content as required.

How to run spark-jobs outside the bin folder of spark-2.1.1-bin-hadoop2.7

I have an existing spark-job, the functionality of this spark-job is to connect kafka-server get the data and then storing the data into cassandra tables, now this spark-job is running on server inside spark-2.1.1-bin-hadoop2.7/bin but whenever I am trying to run this spark-job from other location, Its not running, this spark-job contains some JavaRDD related code.
Is there any chance, I can run this spark-job from outside also by adding any dependency in pom or something else?
whenever I am trying to run this spark-job from other location, Its not running
spark-job is a custom launcher script for a Spark application, perhaps with some additional command-line options and packages. Open it, review the content and fix the issue.
If it's too hard to figure out what spark-job does and there's no one nearby to help you out, it's likely time to throw it away and replace with the good ol' spark-submit.
Why don't you use it in the first place?!
Read up on spark-submit in Submitting Applications.

MonetDB - issue with dumping results of select statement into file

I'm using "COPY SELECT ... INTO file" statement from within application code. After file is ready, next step is to move the file to different location. The only problem is that file created by MonetDB has only root permissions so my application code can't touch it. Is there a way to configure MonetDB so dumps are saved as specified user? or my only solution is to iterate results in batches in application and save to file that way. Dumps can range from several MB to 1GB.
You could run MonetDB as the same user that your application server is configured for. Also, both your application server and MonetDB probably should not run as 'root'.
There is no direct support to export files with different permissions. You could try configuring the umask for the user that the starts the DB.

check directory of oracle logs

I'm using the check_logfiles nagios plugin to monitor Oracle alert logs. It works wonderfully for that purpose.
However I also need to monitor and entire directory of oracle trace logs for errors. This is because the oracle database is always creating new log files with different names.
What I need to know is the best way to scan an entire directory of oracle trace logs to find out which ones match patterns that specify oracle alerts.
Using check logfiles I tried specifying these options -
--criticalpattern='ORA-00600|ORA-00060|ORA-07445|ORA-04031|Shutting
down instance'
and to specify the directory of logs -
--logfile='/global/cms/u01/app/orahb/admin/opbhb/udump/'
and
--logfile="/global/cms/u01/app/orahb/admin/opbhb/udump/*"
Neither of which have any effect. The check runs but returns ok. Does anyone know if this nagios plugin called check_logfiles can monitor a directory of files rather than just a single file? Or perhaps there is another, better way to achieve the same goal of monitoring a bunch of files that can't be specified ahead of time?
Use a script which:
Opens each file
Copies entries which match the pattern
Outputs the matches to a file

Resources