How to configure Reporting Task 'SiteToSiteBulletinReportingTask 1.17.0' from the NiFi CLI - apache-nifi

Can anybody let me know how I can configure the NiFi reporting task 'SiteToSiteBulletinReportingTask 1.17.0' using a NiFi CLI command?

Related

Run Azure Databricks Notebook from Apache NiFi

I am new to Apache NiFi. Is there any way to run an Azure Databricks notebook from NiFi, or does that have to be done with a different tool?
You can run an Azure Databricks notebook from Apache NiFi.
To connect to Databricks data in Apache NiFi:
Download the CData JDBC Driver for Databricks installer, unzip the package, and run the JAR file to install the driver.
Copy the CData JDBC Driver JAR file (and license file if it exists), cdata.jdbc.databricks.jar (and cdata.jdbc.databricks.lic), to the Apache NiFi lib subfolder, for example, C:\nifi-1.3.0-bin\nifi-1.3.0\lib.
On Windows, the default location for the CData JDBC Driver is C:\Program Files\CData\CData JDBC Driver for Databricks.
Start Apache NiFi. For example:
cd C:\nifi-1.3.0-bin\nifi-1.3.0\bin
run-nifi.bat
Lastly, navigate to the Apache NiFi UI in your web browser, typically http://localhost:8080/nifi.
You can refer to this article (https://www.cdata.com/kb/tech/databricks-jdbc-apache-nifi.rst) for more information.
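As a rough sketch of the copy-and-start steps on Windows, assuming the driver JAR sits under a lib subfolder of the default CData install directory and NiFi lives at the path used above (adjust both to your own installation; copy the .lic file only if one exists):
copy "C:\Program Files\CData\CData JDBC Driver for Databricks\lib\cdata.jdbc.databricks.jar" "C:\nifi-1.3.0-bin\nifi-1.3.0\lib"
copy "C:\Program Files\CData\CData JDBC Driver for Databricks\lib\cdata.jdbc.databricks.lic" "C:\nifi-1.3.0-bin\nifi-1.3.0\lib"
cd C:\nifi-1.3.0-bin\nifi-1.3.0\bin
run-nifi.bat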

Running Spark jobs on EMR using Airflow

I have an EC2 instance and an EMR cluster. I want to run Spark jobs on EMR using Airflow. Where would Airflow need to be installed for this?
On EC2 instance.
On EMR master node.
I am considering using the SparkSubmit operator for this. What arguments should I provide while creating the Airflow task?
You will be installing Airflow on EC2, and I would suggest installing a containerized version of it. See this answer.
For submitting Spark jobs you will need the EmrAddStepsOperator from Airflow, and you will need to provide the step for spark-submit.
(Note: if you are starting the cluster from the script, you will need to use EmrCreateJobFlowOperator as well; see details here.)
A typical submit step will look something like this:
spark_submit_step = [
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            # command-runner.jar lets EMR run an arbitrary command, here spark-submit
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--jars',
                     '/emr/instance-controller/lib/bootstrap-actions/1/spark-iforest-2.4.0.jar,/home/hadoop/mysql-connector-java-5.1.47.jar',
                     '--py-files',
                     '/home/hadoop/mysqlConnect.py',
                     '/home/hadoop/main.py',
                     # any arguments your main.py expects (placeholder literals/variables)
                     'custom_argument',
                     another_custom_argument,
                     yet_another_custom_argument]
        }
    }
]
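To wire this step into a DAG, the step list is typically handed to EmrAddStepsOperator, with an EmrStepSensor waiting for the step to finish. A minimal sketch assuming an already-running cluster and the Amazon provider package; the DAG id, task ids, cluster id, and connection id below are illustrative placeholders:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG(dag_id='emr_spark_submit_example',  # illustrative DAG id
         start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:

    # Submit the spark-submit step defined above to an existing EMR cluster.
    add_step = EmrAddStepsOperator(
        task_id='add_spark_step',
        job_flow_id='j-XXXXXXXXXXXXX',  # placeholder cluster id, or pulled via XCom from EmrCreateJobFlowOperator
        aws_conn_id='aws_default',
        steps=spark_submit_step,
    )

    # Wait for the submitted step to reach a terminal state.
    watch_step = EmrStepSensor(
        task_id='watch_spark_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
        aws_conn_id='aws_default',
    )

    add_step >> watch_step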

Why Hadoop commands don't work on Google Cloud Shell

After creating a cluster for my project in Google Dataproc, I tried to type several Hadoop commands (like hadoop fs -ls). Unfortunately it appears Cloud Shell doesn't see Hadoop at all!
-bash: hadoop: command not found
Someone on stackoverflow said:
"It doesn't work in Cloud Shell because it doesn't have Hadoop CLI
utilities pre-installed."
But I've no idea how to install or activate it. Maybe through cluster creation, but I had issues creating the cluster through the Dataproc API, so I created it through Cloud Shell instead.
What should I do to use Hadoop commands in Cloud Shell properly?
Apparently Hadoop commands work only on the cluster's VM instances, not in the project-level Cloud Shell. So make sure you connect to a cluster node via Compute Engine -> VM instances -> [your node] (INSTANCES tab) over SSH.
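For example, from Cloud Shell you can SSH straight onto the master node with gcloud (Dataproc names the master <cluster-name>-m; the cluster name and zone below are placeholders):
gcloud compute ssh my-cluster-m --zone=us-central1-a
# once on the node, the Hadoop CLI is available
hadoop fs -ls /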

CDH 5.3.2 - Need to restart Impala daemon from shell/script

I am using a CDH 5.3.2 cluster and have a requirement to be able to start/stop Impala daemons from a script. The command mentioned in the Cloudera docs
sudo service impala-server start
works fine on my CDH 5.10 local VM, but on the CDH 5.3.2 cluster I get the error "impala-server: unrecognized service". On checking in /etc/init.d I see that no such service is listed either (while it is listed in the 5.10 version).
Then I tried to restart the service directly from the Impala bin directory:
cd /usr/bin
./impalad stop
However, I am running into the below error now:
E0918 11:55:27.815739 12046 JniFrontend.java:622] FileSystem is file:///
W0918 11:55:27.817589 12046 JniFrontend.java:534] Cannot detect CDH version. Skipping Hadoop configuration checks
E0918 11:55:27.817620 12046 impala-server.cc:210] Unsupported file system. Impala only supports DistributedFileSystem but the configured filesystem is: LocalFileSystem.fs.defaultFS(file:///) might be set incorrectly
E0918 11:55:27.817631 12046 impala-server.cc:212] Aborting Impala Server startup due to improper configuration
I checked core-site.xml in Cloudera Manager and fs.defaultFS is correctly set, so I'm not sure where it's picking the value up from. Any pointers on how to go further on this?
The init.d service packages to start Impala from the command line are meant to be used for CDH users who do NOT want to use Cloudera Manager. The right way to start and stop Impala on a Cloudera Manager cluster is to use the CM API:
https://cloudera.github.io/cm_api/apidocs/v17/index.html
start cluster service API
stop cluster service API
commands API
The tutorial shows how to use the CM APIs but for your situation you probably need to do:
$ curl -X POST -u USER:PASSWORD \
'CM_URL/api/v1/clusters/CLUSTERNAME/services/IMPALA_SERVICE_NAME/commands/stop'
replacing USER, PASSWORD, CM_URL, CLUSTERNAME, and IMPALA_SERVICE_NAME with the appropriate values. The curl command will return a command ID.
Then poll this API with the command ID to see that the start/stop operation completed.
$ curl -u USER:PASSWORD 'CM_URL/api/v1/commands/COMMAND_ID'
However, if you still want to use the init.d service packages then you'll need to install the impala-server package.
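For example, on a node where the CDH package repository is already configured, something along these lines (yum shown; use your distribution's package manager):
sudo yum install impala-server
sudo service impala-server start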

How to Configure MR1 on CDH 5.1 VM

I have installed the CDH 5.1 VM on my machine. CDH 5.1 is set to MR2 (YARN) by default. I want to change the configuration from MR2 to MR1. Please let me know the changes that need to be made.
Just do the steps to set the MR1 configuration as given in the CDH 5.1.2 documentation:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_mr_cluster_deploy.html#topic_11_3
Then use the hadoop command, and not the yarn command, to run the jar.
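For example, with an illustrative jar, main class, and HDFS paths:
hadoop jar my-mr-job.jar com.example.MyJob /user/cloudera/input /user/cloudera/output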
