NiFi: loading files without polling

I'm new to using this tool to work with Cloudera.
We need to upload files from an FTP folder to an HDFS folder.
The basic flow is defined with the classic GetSFTP and PutHDFS processors.
My question is the following:
Is it possible for NiFi to do a one-time upload of the data, without the polling option?
Is there any way to disable polling?
Thank you for your attention.
Daniele Consalvo
Infordata Analytics Team

ListSFTP -> FetchSFTP -> PutHDFS
Rather than GetSFTP, use the List/Fetch pattern above. You can configure ListSFTP to run on a schedule, but if you really only want it to run once:
Start FetchSFTP and PutHDFS.
Use 'Run Once' on ListSFTP.
ListSFTP will run once, then stop.

Related

How can I use NiFi to read/write directly from ADLS without HDInsight

We would like to use NiFi to connect to ADLS (using PutHDFS and FetchHDFS) without having to install HDInsight. Subsequently we want to use Azure Databricks to run Spark jobs, and we are hoping that this can be done using NiFi's ExecuteSparkInteractive processor. In all the examples I could find, HDP or HDInsight invariably seem to be required.
Can anyone share pointers on how this can be done without needing HDP or HDInsight?
Thanks in advance.
As far as I can tell, ADLS won't work well (or work at all) with the *HDFS processors available in Apache NiFi. A feature request was made (NIFI-4360) and a subsequent PR was raised for it (#2158), but after a brief review not much progress has been made. You can fork or copy that code base and review it yourself.
I did a test setup more than a year ago. The PutHDFS processor worked with some additional classpath resources. The following dependencies were required:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
See also the following blog for more details. You can copy the libs, the core-site.xml and hdfs-site.xml from an HDInsight setup to the machine where NiFi is running. You should also set dfs.adls.home.mountpoint properly, pointing to the root or a data directory. Be aware that this is not officially supported, so perhaps you should also consider Azure Data Factory or StreamSets as an option for data ingest.
PutHDFS does not expect a classic Hadoop cluster in the first place; it expects a core-site.xml only by convention. As you will see in the example below, a minimal config file is enough to make PutHDFS work with ADLS.
Using the NiFi PutHDFS processor to ingest into ADLS is simple. The following steps lead to the solution:
Have ADLS Gen1 set up (ADLS has been renamed ADLS Gen1).
Additionally, have OAuth authentication set up for your ADLS account. See here.
Create an empty core-site.xml for configuring the PutHDFS processor.
Update core-site.xml with the following properties (I am using client-key auth in this example; a small connectivity check using the same properties is sketched after these steps):
fs.defaultFS = adl://<yourADLname>.azuredatalakestore.net
fs.adl.oauth2.access.token.provider.type = ClientCredential
fs.adl.oauth2.refresh.url = <Your Azure refresh endpoint>
fs.adl.oauth2.client.id = <Your application id>
fs.adl.oauth2.credential = <Your key>
Update your NiFi PutHDFS processor to refer to the core-site.xml created in the previous step and to the additional ADLS libraries (hadoop-azure-datalake-3.1.1.jar and azure-data-lake-store-sdk-2.3.1.jar).
Update the upstream processors and test.
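Before wiring this into NiFi, it can help to verify the credentials outside of NiFi. The following is a minimal sketch (not part of the original answer) that uses the plain Hadoop FileSystem API with the same fs.adl.* properties shown above; the account name, refresh endpoint, client id and key are placeholders, and it assumes the hadoop-azure-datalake and azure-data-lake-store-sdk jars listed above are on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdlsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The same properties as in the core-site.xml above; values are placeholders.
        conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("fs.adl.oauth2.refresh.url", "<Your Azure refresh endpoint>");
        conf.set("fs.adl.oauth2.client.id", "<Your application id>");
        conf.set("fs.adl.oauth2.credential", "<Your key>");

        // Listing the root of the lake proves that the credentials and jars work.
        FileSystem fs = FileSystem.get(
                new URI("adl://<yourADLname>.azuredatalakestore.net"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}

If this lists the directory, the same properties should work when placed in the core-site.xml that PutHDFS points at.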

Using Apache NiFi in a Docker instance, for a beginner

So I want, very basically, to be able to spin up a container which runs NiFi with a template I already have. I'm very new to containers, and fairly new to NiFi. I think I know how to spin up a NiFi container, but not how to make it automatically run my template every time.
You can use the apache/nifi Docker container found here as a starting point, and use a Docker RUN/COPY command to inject your desired flow. There are three ways to load an existing flow into a NiFi instance.
Export the flow as a template (an XML file containing the exported flow segment) and import it as a template into your running NiFi instance. This requires the "destination" NiFi instance to be running and uses the NiFi API.
Create the flow you want, manually extract the entire flow from the "source" NiFi instance by copying $NIFI_HOME/conf/flow.xml.gz, and overwrite the flow.xml.gz file in the "destination" NiFi's conf directory. This does not require the destination NiFi instance to be running, but it must occur before the destination NiFi starts.
Use the NiFi Registry to version control the original flow segment from the source NiFi and make it available to the destination NiFi. This seems like overkill for your scenario.
I would recommend Option 2, since you already have the flow set up the way you want it. Simply use COPY /src/flow.xml.gz /destination/flow.xml.gz in your Dockerfile.
If you literally want it to "run my template every time", you probably want to ensure that the processors are all in the enabled/running state (showing a "Play" icon) when you copy off the flow.xml.gz file, and that nifi.flowcontroller.autoResumeState=true is set in your nifi.properties.

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build an architecture for the following use case:
I have a web application where users can upload files (PDF and PPTX) and directories to be processed. After the upload is complete, the web application puts these files and directories in HDFS, then sends messages to Kafka with the paths to these files.
A Spark application reads the messages from Kafka Streaming, collects them on the master (driver), and after that processes them. I collect the messages first because I need to move the code to the data, not move the data to where the message is received. I understood that Spark assigns jobs to executors that already have the file locally.
I have issues with Kafka because I was forced to collect the messages first for the reason above, and when I want to create a checkpoint the app crashes "because you are attempting to reference SparkContext from a broadcast variable", even though the code ran fine before adding checkpointing (I use the SparkContext there because I need to save data to Elasticsearch and PostgreSQL). I don't know how exactly I can do code upgrades under these conditions.
I read about the Hadoop small-files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another part of the small-files problem is the large number of mappers and reducers created for the computation, but I don't understand whether this problem also exists in Spark.
What is the best architecture for this use case?
How do I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there a way to add jobs to a running Spark application through some REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100 MB), or do I need to use SequenceFile? I think SequenceFile isn't for my use case because I need to reprocess some files randomly.
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure a source that monitors a directory and extracts newly received files.
For data processing/transformation, it depends what task you are solving. You have probably decided on Spark Streaming. Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html). There are other options available, e.g. Apache Storm, which also combines very well with Flume. Some transformations can be applied in Flume itself as well.
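For illustration, here is a minimal Java sketch of the push-based Flume receiver described in the linked integration guide. It is not from the original answer; it assumes the spark-streaming-flume artifact for your Spark version is on the classpath, and the host, port and batch interval are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("flume-streaming-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Flume's Avro sink pushes events to this host:port (placeholders).
        JavaReceiverInputDStream<SparkFlumeEvent> events =
                FlumeUtils.createStream(jssc, "worker-host", 41414);

        // Replace with real processing; here we only count events per batch.
        events.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}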
For data archival: do not store/archive the files directly in Hadoop unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase.
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regularly archive them into zip files, HBase, a Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (a.k.a. HDF, Hortonworks Data Flow). It uses queues internally and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running in a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).

How to get data from temp files of hadoop?

I have an application that transfers data from remote systems to HDFS using MapReduce. However, I am lost when I have to deal with issues like network failure, that is, when the connection to a remote data source is lost and the data is no longer accessible to my MapReduce application. I can always restart the job, but when the data is huge, restarting is an expensive option. I know MapReduce creates a temp folder, but will it put data there? Can I read that data out, and can I then somehow start reading the rest of the data?
A MapReduce job can write arbitrary files, not only the ones managed by Hadoop.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// create a file on whatever filesystem the configuration points to (HDFS here)
FSDataOutputStream out = fs.create(new Path(fileName));
Using this code you create arbitrary files that behave like normal files in the local filesystem. Then you handle connection exceptions so that, when a source becomes unreachable, you close the file cleanly and record somewhere (e.g. in HDFS itself) that an interruption happened and at which point.
In the case of FTP, you could write just the list of file paths and folders. When the job finishes downloading a file, write its path to the downloaded list, and when an entire folder has been downloaded write the folder path, so that on resume you will not have to traverse a directory's contents to check that all its files were downloaded.
At startup, on the other hand, the program checks this file to decide whether the previous attempt failed and, if so, where to resume the download.
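As an illustration only (the file name and helper class are hypothetical, not from the original answer), that bookkeeping could look like this: record each completed path in a "downloaded list" file in HDFS and read it back at startup to decide what to skip. Since older HDFS versions do not support append, the sketch simply rewrites the list.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DownloadTracker {
    private final FileSystem fs;
    private final Path listFile;  // e.g. /ingest/_downloaded_list (illustrative path)

    public DownloadTracker(Configuration conf, Path listFile) throws Exception {
        this.fs = FileSystem.get(conf);
        this.listFile = listFile;
    }

    // Read the list written by a previous run; anything in it can be skipped.
    public Set<String> alreadyDownloaded() throws Exception {
        Set<String> done = new HashSet<String>();
        if (!fs.exists(listFile)) {
            return done;
        }
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(listFile), StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            done.add(line.trim());
        }
        in.close();
        return done;
    }

    // Record a completed file (or folder) path. Older HDFS versions cannot append,
    // so the whole list is rewritten with the new entry added.
    public void markDownloaded(Set<String> done, String path) throws Exception {
        done.add(path);
        FSDataOutputStream out = fs.create(listFile, true);  // true = overwrite
        for (String p : done) {
            out.write((p + "\n").getBytes(StandardCharsets.UTF_8));
        }
        out.close();
    }
}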
In general, Hadoop will kill your job if it is not writing or reading anything within a timeout. Your application can tell it to wait, but in general it is not good to have an idle job, so it is better to end the job cleanly instead of waiting for the network to work again.
You can also create your own file writer, this way:
conf.setOutputFormat(MyOwnOutputFormat.class);
Your file writer could save its own temporary files in the format you prefer, so if the application crashes you know how the files were saved.
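MyOwnOutputFormat above is a placeholder name, so purely as an illustration, a custom writer for the old mapred API could be sketched like this (Text keys and values assumed; the record format is entirely up to you):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class MyOwnOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final FSDataOutputStream out = fs.create(file, progress);
        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                // Write records in whatever crash-recovery-friendly format you prefer.
                out.writeBytes(key.toString() + "\t" + value.toString() + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}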
HDFS saves files in blocks of 64 MB by default, and when a job fails you may not even have a temporary file unless you use your own writer.
This is a generic solution; it depends on what the source of the data is (FTP, Samba, HTTP...) and whether it supports resuming downloads.
EDIT: in the case of FTP, you could just use csync to synchronize an FTP server with your local filesystem, and hdfs-fuse to mount an HDFS filesystem. It works when you have many small files.
You haven't specified what tool you are using to ingest data into HDFS/Hadoop.
Some of the tools you can use to ingest data into HDFS/Hadoop that support recoverability are Flume, Scribe and Chukwa (for log files), all of which support various configurable levels of file-transfer reliability guarantees, and Sqoop for transferring relational database data into HDFS or Hive, etc.

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they get it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly. It is really easy to install and easy to manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
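For illustration, a minimal sketch of such an application-side logger, assuming the fluent-logger-java library is on the classpath and a Fluentd agent is listening on the default localhost:24224 (the tag, label and field names are placeholders):

import java.util.HashMap;
import java.util.Map;
import org.fluentd.logger.FluentLogger;

public class FluentdLoggingSketch {
    // Connects to the local Fluentd agent on the default port (24224).
    private static final FluentLogger LOG = FluentLogger.getLogger("app");

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<String, Object>();
        data.put("from", "userA");
        data.put("to", "userB");
        // Emits an event tagged "app.follow" that Fluentd can route to WebHDFS.
        LOG.log("follow", data);
    }
}

Fluentd's match rules then decide where each event goes, e.g. to the WebHDFS output mentioned above.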
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so the data can be queried online.
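A rough sketch of that pattern (the paths and job name are placeholders, and the mapper/reducer are omitted): each run writes to a fresh timestamped output directory, which a later job can pick up and load into a database.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PeriodicLogJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-logs");
        job.setJarByClass(PeriodicLogJob.class);
        // Mapper/reducer setup omitted; this only shows the input/output layout.
        FileInputFormat.addInputPath(job, new Path("/incoming_logs"));
        // A fresh output folder per run, so earlier results are never overwritten.
        FileOutputFormat.setOutputPath(job,
                new Path("/processed_logs_" + System.currentTimeMillis()));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}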
I'd recommend using Flume to collect the log files from your servers into HDFS.
