I see some processors have both GetXXX and ConsumeXXX variants (GetKafka an ConsumeKafka for example). When to a name a processor with GetXXX over ConsumeXXX?
Note: I understand the technical differences between ConsumeKafka and GetKafka. My question is mainly on the naming convention.
There is no real meaning behind the names and technically they could be named anything. The typical convention is to start the processor name with a verb that describes the action being performed, followed by the system/thing being acted on.
"Get" processors are typically processors that have no incoming connection and pull data from some external source, and "Put" processors are typically processors that deliver data to an external system.
When the first Kafka processors were developed using the 0.8 Kafka client, they were called GetKafka and PutKafka. The community then wanted to also support Kafka 0.9 at the same time, so ConsumeKafka and PublishKafka were implemented which better aligned with Kafka's terminology, and also provided another name since they couldn't also be called GetKafka and PutKafka.
Related
I have a processor that generates time series data in JSON format. Based on the received data I need to make a forecast using machine learning algorithms on python. Then write the new forecast values to another flow file.
The problem is: when you run such a python script, it must perform many massive preprocessing operations: queries to a database, creating a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, then for each flow file the script will be run every time. Is this true?
Can I make in NIFI a python script that starts once and receives the flow files many times, storing the history of previously received data. Or do I need to make an HTTP service that will receive data from NIFI?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and this is generally very effective. The initialization and individual flowfile trigger logic is cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) with the execution phase (onTrigger). Some examples exist but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
Make a Custom Processor that can do that. Java is more appropriate. I believe you can do pretty much every with Java you just need to find libraries. Yes, there might be some issues with some initialization and preprocessing that can be handled by all that in the init function of nifi that will allow you preserve the state of certain components.
Link in my use case I had to build a custom processor that could take in images and apply count the number of people in that image. For that, I had to load a deep learning model once in the init method and after through on trigger method, it could be taking the reference of that model every time it processes an image.
I have a question about Nifi and its capabilities as well as the appropriate use case for it.
I've read that Nifi is really aiming to create a space which allows for flow-based processing. After playing around with Nifi a bit, what I've also come to realize is it's capability to model/shape the data in a way that is useful for me. Is it fair to say that Nifi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like it could be a viable approach. The rest of this is under that assumption.
While NiFi employs dataflow through principles and design closely related to flow based programming (FBP) as a means, the function is a matter of getting data from point A to B (and possibly back again). Of course, systems aren't inherently talking in the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense though, it is simply getting the data to those that need it (systems, users, etc) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great talk that may be in line with this idea of data shaping in the context of Data Science at a Meetup. The slides of which are available [2].
If you have any additional questions, I would point you to check out the community mailing lists [3] and ask any additional questions so you can dig in more and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks so I'll be careful to use that term here. What I do think in what you're asking is very clear is that Apache NiFi is a great system to use to help mold the data into the right format and schema and content you need for your follow-on analytics and processing. NiFi has an extensible model so you can add processors that can do this or you can use the existing processors in many cases and you can even use the ExecuteScript processors as well so you can write scripts on the fly to manipulate the data.
I've read some performance claims about how Elixir and Erlang use hardware, and I'm trying to see if I understand their basis. Some background:
First, Erlang supports writing nested lists of immutable strings (iolists) to IO (files, sockets, etc) and uses writev and the strings' memory addresses to do so (see Evan Miller's blog post on this).
Second, the docs for an Erlang web framework called Chicago Boss say:
Erlang Respects Your RAM!
Erlang is different from other platforms because when rendering a server-side template, it doesn't create a separate copy of a web page in memory for each connected client. Instead, it constructs pointers to the same pieces of immutable memory across multiple requests.
So if two people request two different profile pages at the same time, they're actually sent the same chunks of memory for the header, footer, and other shared template snippets. The result is a server that can construct complex, uncached web pages for hundreds of users per second without breaking a sweat.
Third, a book about an Elixir (Erlang VM) web framework called Phoenix says:
Templates are precompiled. Phoenix doesn’t need to copy strings for each rendered template. At the hardware level, you’ll see caching come into play for these strings where it never did before.
From looking at the source, I know that this framework uses iolists to represent a completed response template.
Putting all this together, I think what's being implied is that if a web framework uses writev to tell the OS to send the same header and footer strings from the same memory locations, one web request after another, the hardware will be able to say "oh, I know that value, it's already in CPU cache so I don't have to look in RAM for it."
Is that right? (I have very little understanding of system calls and hardware.) If not, any ideas on how hardware caching is involved?
(Bonus if you can tell me how to see or infer what's happening.)
Yes, it's mostly the processor caches that help you. The time needed to retrieve the data is smaller as it's in a faster memory (ie the CPU caches).
Some pointers for understanding what the caches are and how they work:
https://www.quora.com/How-does-the-cache-memory-in-a-computer-work
http://www.hardwaresecrets.com/how-the-cache-memory-works/
http://lwn.net/Articles/252125/
To see this, measure how much a request takes (client side) in the normal server operation. After that have a separate process within the same vm that constantly creates and writes to disk a very large string (it probably has to be megabytes in size - whatever the size of the L2/L3 caches on your process are). Remeasure how much the request takes - if done correctly this should be at least 1 order of magnitude slower.
I am currently in the process of writing an ElasticSearch Nifi processor. Individual inserts / writes to ES are not optimal, instead batching documents is preferred. What would be considered the optimal approach within a Nifi processor to track (batch) documents (FlowFiles) and when at a certain amount batch them in? The part I am most concerned about is if ES is unavailable, down, network partition, etc. prevents the batch from being successful. The primary point of the question, is given that Nifi has content storage for queuing / back-pressure, etc. is there a preferred method for using that to ensure no FlowFiles get lost if a destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try and get an idea of the preferred approach for batching inside of a processor, but can't seem to find anything specific. Any suggestions would be appreciated.
Good chance I am overlooking some basic functionality baked into Nifi. I am still fairly new to the platform.
Thanks!
Great question and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know it has been ack'd by the recipient. In this sense it offers at least-once semantics. If the protocol you're using supports two-phase commit style semantics you can get pretty close to the ever elusive exactly-once semantic. Much of the details of what you're asking about here will depend on the destination systems API and behavior.
There are some examples in the apache codebase which reveal ways to do this. One way is if you can produce a merged collection of events prior to pushing to the destination system. Depends on its API. I think PutMongo and PutSolr operate this way (though the experts on that would need to weigh in). An example that might be more like what you're looking for can be found in PutSQL which operates on batches of flowfiles to send in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
Will keep an eye here but can get the eye of a larger NiFi group at users#nifi.apache.org
Thanks
Joe
I am currently designing system which should allow access to database. Assumptions are as follows:
Database should has access layer. The access layer should provide objects that represents database tables. (This would be done using some ORM framework).
Client which want to get data from database, should get object from access layer first, and then get data using those objects.
Clients could use Python, Java or C++.
Access layer is based on Java.
There won't be to many clients, but they will be opearating on large amounts of data.
The question which is hard for me is what technology should be used for passing object between acces layer and clients. I consider using ZeroC ICE, Apache Thrift or Google Protocol Buffers.
Does anyone have opinion which one is worth using?
This is my research for Protocol Buffers:
Advantages:
simple to use and easy to start
well documented
highly optimized
defining object data structure in java-like language
automatically generating implementation of setters and getters and build methods for Python, Java and C++
open-source bidnings for other languages
object could be extended without affecting old version of an applications
there are many of open-source RpcChanel and RpcController implementation (not tested)
Disadvantages:
need to implement object transfer
objects structure have to be defined before use, so we can't add some fields on the fly (Updated: there are posibilities to do that, see the comments)
if there is a need for reading one object's filed, we have to parse whole file (in contrast, in XML we could ignore chosen tags)
if we want to use RPC for invoke object methods, we need to define services and deliver RpcChanel and RpcController implementation
This is my research for Apache Thrift:
Advantages:
provide compiler that generates source code for supported languages (classes, all things that are important)
allow defining optional fields in the structures ( when we do not set value on a field, the size of transfered data is lower)
enable point out some methods that are "one way" (returning nothing and client after invokation do not wait for answer from server about completion processing of query)
support collections (maps, lists, sets), objects, primitives serialization (deserialization), constants, enumerations, exceptions
most of problems, errors are solved and explained
provide different methods of serialization: (TBinaryProtocol...) and different ways of exchanging data: (TBufferedTransport, TZlibTransport... )
compiler produces classes (structures) for languages thaw we can extend by adding some new methods.
possible to add fields to protocol(server as well as client) and remove other- old code and new one can properly interact(some rules in update)
enable asynchronous calls
easy to use
Disadvantages:
documentation - contains some errors that sometimes it is really hard to get to know what is the source of the problem
not allways problems are well taged (when we look for solution in the Internet).
not support overloading for service methods
tutorials cover only simple examples of thrift usage
hard to start
ICE ZeroC:
Is better than Protocol Buffers, because I wouldn't need to implement object passing by myself via e.g. sockets. ICE also gives ServantLocators which can provide management of connections.
The question is: whether ICE is much slower and less efficient than the PB?