NiFi PutFTP not efficient - ftp

I have a NiFi flow that sends more than 50 files per minute using the PutFTP processor. The server has limited resources, but I need to send at a faster pace. I looked at the FTP server logs (not NiFi's), and this is my conclusion:
A new FTP connection (session) is created for every file. Is there an option to send many files over one session? (Connect to port 21, authenticate once, and then send many files over different data ports.)
When sending one file, many CWD (Change Working Directory) commands are sent. For example, sending a file to /myfiles/test/dest/file.txt:
CWD /
CWD /myfiles
CWD /
CWD /myfiles/test
CWD /
CWD /myfiles/test/dest
This is not efficient. Is there any way to improve PutFTP? Is this a bug?

First question: use run duration
A new FTP connection (session) is created for every file. Is there an
option to send many files over one session? (Connect to port 21,
authenticate once, and then send many files over different data ports.)
First (if it fits your use case), you can use the MergeContent processor to merge multiple smaller flow files into one bigger flow file and feed it to PutFTP.
Second, the PutFTP processor has the SupportsBatching annotation:
Marker annotation a Processor implementation can use to indicate that
users should be able to supply a Batch Duration for the Processor. If
a Processor uses this annotation, it is allowing the Framework to
batch ProcessSessions' commits, as well as allowing the Framework to
return the same ProcessSession multiple times...
Source: https://github.com/apache/nifi/blob/master/nifi-api/src/main/java/org/apache/nifi/annotation/behavior/SupportsBatching.java
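For illustration, here is a minimal, hypothetical processor skeleton showing how the annotation is declared (this is not the real PutFTP source; PutFTP simply carries the same annotation in the NiFi codebase):
import org.apache.nifi.annotation.behavior.SupportsBatching;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical minimal processor, for illustration only.
@SupportsBatching
public class MyBatchingProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        // With @SupportsBatching, the framework may batch this session's
        // commits according to the Run Duration configured in the UI.
    }
}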
Increase the Run Duration of your PutFTP processor toward higher throughput so that the same task is used for many flow files. You may also want to adjust the Maximum Batch Size in the Properties tab to accommodate that change.
Read more about it here:
Documentation: Run Duration
Understanding NiFi processor's "Run Duration" functionality.
What should be Ideal Run-duration and Run schedule configuration in nifi processors
Second question: inspect source code
When sending one file, many CWD (Change Working Directory) commands
are sent. For example, sending file to /myfiles/test/dest/file.txt
By inspecting FTPTransfer.java you can see that the put method does the following:
put -> get client
put -> get client -> resetWorkingDirectory -> changeWorkingDirectory(homeDirectory)
put -> setAndGetWorkingDirectory
This might be the behavior you discovered.
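For comparison, this is what "one session, many files" looks like with the plain Apache Commons Net FTPClient, which NiFi's FTPTransfer uses under the hood (host, credentials, directory, and file names below are placeholders):
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class SingleSessionUpload {
    public static void main(String[] args) throws IOException {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com", 21);   // one control connection
        ftp.login("user", "pass");            // authenticate once
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        ftp.changeWorkingDirectory("/myfiles/test/dest"); // a single CWD
        for (String name : new String[]{"a.txt", "b.txt", "c.txt"}) {
            try (FileInputStream in = new FileInputStream(name)) {
                // Each STOR opens its own data connection, but the
                // control session and login are reused for every file.
                ftp.storeFile(name, in);
            }
        }
        ftp.logout();
        ftp.disconnect();
    }
}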

Related

Download one file at a time through the same session in Apache Camel FTP

I want to implement the following use case with Apache Camel FTP:
On a remote location, 0 to n files are stored.
When I receive a command, using FTP, I want to download one file as a byte array (which one does not matter), if any files are available.
When the file is downloaded, I want to save it in a database as a blob.
Then I want to delete the stored/processed file on the remote location.
Wait for the next download command and, once it is received, go back to step 1.
The files have to be downloaded through the same FTP session.
My problem is that if I use a normal FTP route, it downloads all available files.
When I tell the route to only download one, I have to create a new route for the other files and I cannot reuse the FTP session.
Is there a way to implement this use case with Apache Camel FTP?
Camel-ftp doesn't consume all available files at once; it consumes them individually, one after another, meaning that each file gets processed separately. If you need to process them in some specific order, you can try sorting by file name or modified date with the sortBy option.
If you want to control when a file gets downloaded, i.e. when the command gets called, you can call the FTP consumer endpoint using pollEnrich.
Example:
// 1. Loads one file from ftp-server with timeout of 3 seconds.
// 2. logs the body and headers
from("direct:example")
.pollEnrich("ftp:host:port/directoryName", 3000)
.to("log:loggerName?showBody=true&showHeaders=true");
You can call the direct consumer endpoint with a ProducerTemplate, which you can obtain from the CamelContext, or change it to whatever consumer endpoint fits your use case.
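For example (a minimal sketch in the style of the snippets above; camelContext is assumed to be an already-started CamelContext):
// Obtain a ProducerTemplate and trigger the route once per command.
ProducerTemplate template = camelContext.createProducerTemplate();
// Fires direct:example; pollEnrich then fetches one file from the FTP server.
byte[] downloaded = template.requestBody("direct:example", null, byte[].class);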
If you need a dynamic URI, you can use simple to provide the URI for pollEnrich and provide the timeout afterwards.
from("direct:example")
.pollEnrich()
.simple("ftp:host:port/directoryName?fileName=${headers.targetFile}")
.timeout(3000)
.to("log:loggerName?showBody=true&showHeaders=true");

Executing NIFI InvokeHTTP processor once during a flow rather than on a per inbound flowfile basis

I have a NiFi flow that moves files from one FTP server to another. The flow starts with the ListSFTP processor and ends with the PutSFTP processor. The password required to authenticate with the PutSFTP processor is stored inside another application that exposes REST endpoints to get the password. I want to get the password once and use it to put all fetched files into the destination SFTP server. Please advise where/how I can use the InvokeHTTP processor in this case so that it doesn't get invoked for each flow file (it doesn't make sense to fetch the password on a per-flowfile basis).
Create a parallel flow that is scheduled on a time basis, depending on how long the auth token lives.
For example, if the token is valid for 60 minutes, then schedule it every 45 minutes:
RequestAuth (InvokeHTTP) --> PutDistributedMapCache
And inside your main flow, use FetchDistributedMapCache instead of InvokeHTTP.

Saving the jmeter result .jtl files in the Slaves machine

I have configured JMeter distributed testing, and I'm able to successfully trigger the test from the master to the slave machine. But the result files are not being generated on the slave machine, even though I explicitly added a Listener to the test plan.
Can anybody help with this?
Thanks in advance
Question answered in JMeter group by #glinius:
in user.properties, add: mode=StrippedBatch
This will:
remove some data from the SampleResults, such as the response body; but do you need the response body during a high-load test? No, definitely not!
send Sample Results as batches rather than one per sample, reducing CPU, IO, and network round trips
Adding the listener itself is not sufficient; you need to specify the location of the .jtl file in the listener. The Simple Data Writer is a good choice.
The user running the JMeter slave process must have write permissions to the folder specified. See the How to Save Response Data in JMeter article for more details if needed. If you want to save the response data, make sure to set the mode=Standard property.
Also make sure to provide a valid resultcollector.action_if_file_exists property: APPEND if you want to add new results to the existing file, or DELETE if you want to overwrite the old results with the new ones.
A property can be passed via the -G command-line argument from the master or via the -J command-line argument from the slave. More information: Full list of command-line options
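For example (a sketch; the test plan name and slave host names are placeholders):
# user.properties on the master (sender properties are client-configured by default)
mode=StrippedBatch
resultcollector.action_if_file_exists=APPEND
Or from the command line, sending the properties to all slaves with -G:
jmeter -n -t plan.jmx -R slave1,slave2 -Gmode=StrippedBatch -Gresultcollector.action_if_file_exists=APPEND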

ListFile processor, force processor to list full directory everytime

My use case:
Some processing somewhere else adds files to some dir (_use_it) -> my flow is called using REST -> now I want my process to read all files from the mentioned directory (_use_it).
I want to read all files from this directory every time, not just changed/added files. I can't start/stop the process; this flow has to run as a background process.
I think I am looking for the ListFile processor to run once, then stop, and then, when it runs again, forget its previous state. "Some twisted logic" :)
Thanks
1. Using GetFile Processor:
You can use the GetFile processor instead of the ListFile + FetchFile processors; the GetFile processor doesn't store state.
The GetFile processor gets all the files in the directory every time.
Keep Source File: If true, the file is not deleted after it has been
copied to the Content Repository; this causes the file to be picked up
continually and is useful for testing purposes. If not keeping the
original, NiFi will need write permissions on the directory it is
pulling from, otherwise it will ignore the file.
(or)
2. Using ListFile Processor:
Making use of the NiFi REST API, we can clear the state of the ListFile processor; the processor will then list all files in the directory every time.
Clear state of the processor:
POST
/processors/{id}/state/clear-requests
Before starting the "list all files in the directory" flow:
1. Use the REST API to stop the ListFile processor.
2. Clear the state of the ListFile processor.
3. Start the ListFile processor.
Refer to this and this to stop the processor via the REST API.
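A minimal sketch of those three calls using Java's built-in HTTP client (the base URL, processor id, and revision versions are placeholders; the revision version in each request must match what NiFi currently reports for the processor):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResetListFile {
    static final String BASE = "http://localhost:8080/nifi-api/processors/";
    static final String ID = "your-processor-id";   // placeholder

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // 1. Stop the ListFile processor.
        send(http, "PUT", BASE + ID + "/run-status",
             "{\"revision\":{\"version\":1},\"state\":\"STOPPED\"}");
        // 2. Clear its state.
        send(http, "POST", BASE + ID + "/state/clear-requests", "");
        // 3. Start it again (the revision version increments after each change).
        send(http, "PUT", BASE + ID + "/run-status",
             "{\"revision\":{\"version\":2},\"state\":\"RUNNING\"}");
    }

    static void send(HttpClient http, String method, String url, String body)
            throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .method(method, HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp =
                http.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}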

Implement Multiple client reads a file and multiple servers writes to a file via Client Server

Below is a question I was asked in an interview:
A datacenter has 10,000 servers. We have a single syslog driver which collates the logs from all the servers in the datacenter and writes them to a single file called syslog.log.
Let's say the datacenter has 1,000 admins. At any point in time, any admin can log in to the syslog server and invoke a command, say
getlog --serverid --severity
and the command should continuously tail the logs matching the conditions provided by the user until he interrupts it.
Any number of users can concurrently log in to this server and run this command. Each request should be honoured, but with one condition: at any given point in time there can be only one file descriptor open for the syslog.log file.
Implement getlog such that it satisfies the above condition.
I described my approach as a critical-section problem: we can use a mutex/semaphore to lock the file until a user finishes. But he was expecting something like a client-server model.
How to serve this functionality using client and server architecture?
What is the best approach to solve this?
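One way to sketch the client-server idea (a hypothetical minimal example: a single server process owns the only descriptor on syslog.log and fans new lines out to every connected client; per-client filtering on --serverid and --severity is omitted for brevity):
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: one server process holds the single file
// descriptor on syslog.log; admins connect as clients over TCP.
public class LogServer {
    private static final List<PrintWriter> clients = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws IOException {
        // The only thread reading syslog.log: one open descriptor total.
        new Thread(LogServer::tailLoop).start();
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket sock = server.accept(); // one connection per getlog client
                clients.add(new PrintWriter(sock.getOutputStream(), true));
            }
        }
    }

    private static void tailLoop() {
        try (BufferedReader in = new BufferedReader(new FileReader("syslog.log"))) {
            while (true) {
                String line = in.readLine();
                if (line == null) { Thread.sleep(200); continue; } // wait for new data
                for (PrintWriter out : clients) out.println(line); // fan out to clients
            }
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}
In this model, getlog is just a thin client that connects to the server, sends its filter, and prints what it receives; the single-descriptor constraint is satisfied by construction rather than by locking.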
