ExecuteProcess vs. ExecuteStreamCommand in NiFi - apache-nifi

NiFi documentation defines ExecuteProcess and ExecuteStreamCommand as follows:
ExecuteProcess
Runs an operating system command specified by the user and writes the output of that command to a FlowFile. If the command is expected to be long-running, the Processor can output the partial data on a specified interval. When this option is used, the output is expected to be in textual format, as it typically does not make sense to split binary data on arbitrary time-based intervals.
ExecuteStreamCommand
Executes an external command on the contents of a flow file, and creates a new flow file with the results of the command.
Both of these definitions mention the word "command"; however, one (ExecuteProcess) says it executes an OS command, while the other (ExecuteStreamCommand) says it executes an external command.
What is the difference between these two?
Is my understanding/guess correct that "OS command" implies something like local OS tools (e.g. ping, curl, netstat, etc.) and "external command" implies something that is not necessarily OS-native but still runs as a command-line/shell tool on the host OS (e.g. java -jar somejar.jar)?

At least from my understanding, this is the main difference I found between ExecuteProcess and ExecuteStreamCommand.
The ExecuteProcess processor doesn't support an incoming connection, so the command has to be independent or the starting point of a job. ExecuteStreamCommand, on the other hand, does allow an incoming connection and can read an existing FlowFile.
Also, ExecuteProcess does support running command-line/shell tools; as the documentation specifies, if just the name of an executable is provided, it must be in the user's environment PATH.
You can do a quick test run with a sample Java or Python command in NiFi.
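For example, to run something like java -jar somejar.jar from the question, the two processors might be configured roughly as follows. Treat this as a sketch: the jar path is made up, and the exact property names and argument delimiters should be checked against your NiFi version's documentation.
ExecuteProcess (no incoming connection; runs on the processor's schedule):
Command: java
Command Arguments: -jar /opt/tools/somejar.jar
ExecuteStreamCommand (requires an incoming connection; the FlowFile content is streamed to the command's STDIN):
Command Path: java
Command Arguments: -jar;/opt/tools/somejar.jar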

Related

Apache Nifi: Is there a limit on the output produced by the ExecuteStreamCommand Processor

I am running into a strange issue and just want to rule out a possibility by asking this question.
I am executing Java code via this processor. It works fine with certain command-line arguments, but with others the output is blank. However, when I run the exact same command in a terminal window on the host, the output is fine. The output written to STDOUT is about 1 MB.
My question is: is there a limit on the size of the data that this processor can read from STDOUT into the output flow file?
I am on NiFi 1.9.1
Thank you.
If you are using the property "Output Destination Attribute", then the output is put in a flow file attribute and is limited by the value of "Max Attribute Length", with a default of 256 characters.
If you are not putting the output in an attribute, then there is no limit.
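So if the goal is to keep the output in an attribute, the fix is usually just a matter of the processor's properties, something like the following (the attribute name and length here are only illustrative):
Output Destination Attribute: command.output
Max Attribute Length: 4096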

in nifi, how to call an external program that ask an input file and an output file in parameters

I have an external program, an ebook converter, to convert .epub to .txt.
This converter requires a file as input and another file as output. The filenames are important here because the extension is used to determine which conversion should be made; also, from what I saw while testing, the program performs seeks on the input file. Those constraints prevent the use of named pipes or STDIN redirection.
For another project, or at least a POC, I'll have to encapsulate an existing bundle of tools that work the same way as above and recreate the workflows between them in NiFi.
In the short term, writing a custom processor is not possible.
So how should I do it?
Here are a couple of possible solutions I found:
Create a PutFile processor that will put the file in a temporary location, chain it with an ExecuteStreamCommand that will execute the external command and put its output in a temporary location, and chain that with a FetchFile processor. The issue here is that I have to find a way to clean up the temporary files.
Another solution would be to create a script for ExecuteScript that does something like: write the flowfile to disk with a filename based on attributes, execute the external command, read the output file back into the flowfile, and perform some cleanup. But from what I found, it's not that easy to write to disk from this processor, right?
So which direction should I go? Any advice?
You are on the right track. I think you could do a sequence of processors like the following:
UpdateAttribute - create attributes for the input and output file names.
PutFile - write the temp input file.
ExecuteProcess - run the conversion utility. I recommend wrapping this in a shell script, which gives you an opportunity to clean up the temp input file when it completes (see the wrapper sketch after this list).
FetchFile - read the conversion output into the FlowFile content. Use Completion Strategy of Delete File to clean up the converted file.
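A minimal sketch of such a wrapper, written here in Python rather than shell: the converter binary name and the idea of passing the input and output paths as arguments are assumptions, so adapt it to whatever your actual converter expects.
#!/usr/bin/env python
# convert_wrapper.py - hypothetical wrapper around the ebook converter.
# Usage: python convert_wrapper.py <input.epub> <output.txt>
import os
import subprocess
import sys

def main():
    in_path, out_path = sys.argv[1], sys.argv[2]
    # "ebook-convert" stands in for whatever converter you actually call;
    # it decides the conversion from the two file extensions.
    result = subprocess.call(["ebook-convert", in_path, out_path])
    # Clean up the temp input file written by PutFile once the converter is done.
    if os.path.exists(in_path):
        os.remove(in_path)
    sys.exit(result)

if __name__ == "__main__":
    main()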

Hadoop/Yarn/Spark can I call command line?

Short version of my question: We need to call the command line from within a Spark job. Is this feasible? The cluster support group indicated this could cause problems with memory.
Long version: I have a job that I need to run on a Hadoop/MapR cluster processing packet data captured with tshark/wireshark. The data is binary packet data, one file per minute of capture. We need to extract certain fields from this packet data, such as IP addresses. We have investigated options such as jNetPcap, but this library is a bit limited. So it looks like we need to call the tshark command line from within the job and process the response. We can't do this directly during capture, as we need the capture to be as efficient as possible to avoid dropping packets. Converting the binary data to text outside the cluster is possible, but this is 95% of the work, so we may as well run the entire job as a non-distributed job on a single server. This limits the number of cores we can use.
Command line to decode is:
tshark -V -r somefile.pcap
or
tshark -T pdml -r somefile.pcap
Well, it is not impossible. Spark provides a pipe method, which can be used to pipe data to an external process and read its output. The general structure could look something like this:
val files: RDD[String] = ??? // List of paths, one per element
// pipe() feeds each element to the external command's STDIN, one per line,
// and returns the command's STDOUT lines as the elements of a new RDD
val processed: RDD[String] = files.pipe("some Unix pipe")
Still, from your description it looks like GNU Parallel could be a much better choice than Spark here.
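For reference, the GNU Parallel route could be as simple as something like the following, one tshark invocation per capture file (the directory is made up; output handling is left to you):
parallel tshark -T pdml -r {} ::: /data/captures/*.pcap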

Bosun adding external collectors

What is the procedure for defining new external collectors in Bosun using scollector?
Can we write Python or shell scripts to collect data?
The documentation around this is not quite up to date. You can do it as described in http://godoc.org/bosun.org/cmd/scollector#hdr-External_Collectors, but we also support JSON output, which is better.
Either way, you write something and put it in the external collectors directory, followed by a frequency directory, and then an executable script or binary. Something like:
<external_collectors_dir>/<freq_sec>/foo.sh.
If the directory frequency is zero (0), then the script is expected to be continuously running, and you put a sleep inside the code (this is my preferred method for external collectors). The script outputs the telnet format, or the undocumented JSON format, to stdout. Scollector picks it up and queues that information for sending.
I created an issue to get this documented not long ago https://github.com/bosun-monitor/bosun/issues/1225. Until one of us gets around to that, here is the PR that added JSON https://github.com/bosun-monitor/bosun/commit/fced1642fd260bf6afa8cba169d84c60f2e23e92
Adding to what Kyle said, you can take a look at some existing external collectors to see what they output. Here is one written in Java that one of our colleagues wrote to monitor JVM stuff. It uses the text format, which is simply:
metricname timestamp value tag1=foo tag2=bar
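As a sketch only (the metric name, tag, and sampled value below are made up), a continuously-running external collector in Python that emits this text format could look like this:
#!/usr/bin/env python
# Hypothetical collector for a <external_collectors_dir>/0/ directory: it runs
# forever, prints telnet-format lines to stdout, and sleeps between samples.
import sys
import time

def sample():
    # Replace with a real measurement; this just reports a constant.
    return 1

while True:
    now = int(time.time())
    # metricname timestamp value tag1=foo tag2=bar
    print("example.collector.heartbeat %d %d host=myhost" % (now, sample()))
    sys.stdout.flush()  # scollector reads stdout, so flush after every line
    time.sleep(60)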
If you want to use the JSON format, here is an example from one of our collectors:
{"metric":"exceptional.exceptions.count","timestamp":1438788720,"value":0,"tags":{"application":"AdServer","machine":"ny-web03","source":"NY_Status"}}
And you can also send metadata:
{"Metric":"exceptional.exceptions.count","Name":"rate","Value":"counter"}
{"Metric":"exceptional.exceptions.count","Name":"unit","Value":"errors"}
{"Metric":"exceptional.exceptions.count","Name":"desc","Value":"The number of exceptions thrown per second by applications and machines. Data is queried from multiple sources. See status instances for details on exceptions."}`
Or send error messages to stderr:
2015/08/05 15:32:00 lookup OR-SQL03: no such host

redis: EVAL and the TIME

I like the Lua scripting for Redis, but I have a big problem with TIME.
I store events in a SortedSet.
The score is the time, so that in my application I can view all events in a given time window.
redis.call('zadd', myEventsSet, TIME, EventID);
OK, but this is not working: I cannot access TIME (the server time).
Is there any way to get the time from the server without passing it as an argument to my Lua script? Or is passing the time as an argument the best way to do it?
This is explicitly forbidden (as far as I remember). The reasoning behind this is that your Lua functions must be deterministic and depend only on their arguments. What if this Lua call gets replicated to a slave with a different system time?
Edit (by Linus G Thiel): This is correct. From the redis EVAL docs:
Scripts as pure functions
A very important part of scripting is writing scripts that are pure functions. Scripts executed in a Redis instance are replicated on slaves by sending the script -- not the resulting commands.
[...]
In order to enforce this behavior in scripts Redis does the following:
Lua does not export commands to access the system time or other external state.
Redis will block the script with an error if a script calls a Redis command able to alter the data set after a Redis random command like RANDOMKEY, SRANDMEMBER, TIME. This means that if a script is read-only and does not modify the data set it is free to call those commands. Note that a random command does not necessarily mean a command that uses random numbers: any non-deterministic command is considered a random command (the best example in this regard is the TIME command).
There is a wealth of information on why this is, how to deal with this in different scenarios, and what Lua libraries are available to scripts. I recommend you read the whole documentation!
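If you do go the route of passing the time in from the client, a minimal sketch with the Python redis client could look like the following (the key and event names are made up, and using the client's clock rather than the server's is itself an assumption you may not want to make):
import time
import redis

r = redis.Redis()

# The script stays deterministic: the timestamp arrives as ARGV[1]
# instead of being read inside Lua via TIME.
ADD_EVENT = """
return redis.call('zadd', KEYS[1], ARGV[1], ARGV[2])
"""
add_event = r.register_script(ADD_EVENT)

# Score the event with the client's clock and add it to the sorted set.
add_event(keys=["myEventsSet"], args=[time.time(), "event:42"])

# Later, read back a one-hour window with ZRANGEBYSCORE.
events = r.zrangebyscore("myEventsSet", time.time() - 3600, time.time())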
