Providing REST APIs: Apache NiFi vs Spring Boot [closed]

We're currently evaluating some open source tools at our company to decide how we should create our REST APIs in the future.
The final candidates are Apache NiFi and Spring. I'm familiar with Spring, and it's relatively easy to implement APIs that satisfy our needs with it.
However, I'm not sure whether NiFi is the better tool here, or whether it's even designed to be used purely as an API provider.
Generally, our APIs do the following:
Parse JSON payload/input parameters (sometimes quite complex XQuery stuff on the payload)
Send that information to Oracle DB functions, where the main logic resides
Parse the Oracle output and send an appropriate HTTP response
If anyone with NiFi or Spring experience (or both) has some more insight into which is the better alternative here, I'd greatly appreciate it. Thanks in advance!

NiFi isn't specifically designed for creating RESTful APIs, but there's no reason you couldn't achieve this in NiFi. After all, the use case you describe is pretty much just moving data: receive payload -> parse data -> send to Oracle -> respond.
You can build complex HTTP handling logic with the NiFi HandleHttpRequest and HandleHttpResponse processors.
You can easily work with JSON in NiFi; either using the concept of Records with JsonTreeReader, or using something like JoltTransformJSON.
You can interact with DBs, including Oracle, using DBCPConnectionPool, and then run SQL using PutSQL, ExecuteSQL, QueryDatabaseTable (and their corresponding Record variants, e.g. ExecuteSQLRecord).
You'll also gain some of the benefits of NiFi out of the box, e.g. fault tolerance, clustering, scaling out, visibility, lineage etc.
NiFi is a no-code approach, so it's a vastly different experience from developing a Spring application. You'll need to learn the dos and don'ts of NiFi, how to properly structure flows, how to scale, etc. You can also extend NiFi with custom development, but you'd have to learn the NiFi structure and APIs.
Obviously, you could achieve all of this with Spring too; if your needs are very simple (you won't need to scale out, you don't need guaranteed fault tolerance, etc.), or if your API is going to branch into wider use cases than you described here, it will probably be easier since you already have Spring experience.
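For comparison, a Spring Boot sketch of the flow you describe might look roughly like this. The Oracle function name, request fields, and response shape are placeholders, not your actual API:

```java
import java.util.Map;

import org.springframework.http.ResponseEntity;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.simple.SimpleJdbcCall;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final SimpleJdbcCall processOrderCall;

    public OrderController(JdbcTemplate jdbcTemplate) {
        // "PROCESS_ORDER" is a placeholder for whichever Oracle function holds the main logic
        this.processOrderCall = new SimpleJdbcCall(jdbcTemplate)
                .withFunctionName("PROCESS_ORDER");
    }

    @PostMapping("/orders")
    public ResponseEntity<Map<String, Object>> create(@RequestBody Map<String, Object> payload) {
        // 1. parse/validate the JSON payload (Spring has already deserialized it)
        // 2. hand the relevant fields to the Oracle function
        String result = processOrderCall.executeFunction(String.class, payload);
        // 3. translate the Oracle output into an HTTP response
        return ResponseEntity.ok(Map.of("result", result));
    }
}
```

Whether that is preferable to a NiFi flow mostly comes down to the operational points below.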
There are other considerations: how you version control flows (NiFi has NiFi Registry), external dependencies (NiFi clustering requires ZooKeeper), overhead (NiFi has its UI for building flows), deployment (NiFi's disk requirements for its repositories, OS support, etc.), management/support (are you comfortable supporting NiFi/Registry/ZooKeeper if there are issues), upgrades, etc.

I would also go for the Spring Framework-based approach. NiFi is more of an ETL-like tool.

@McLovin, love the name.
As a long-time user of NiFi, I have NiFi APIs in production with several huge enterprises. I also have experience with Spring and can appreciate both approaches.
However, the number one qualifier for NiFi here, in my opinion, is that I, or whomever I am training in delivery, don't have to write any code. This is amazing! I also love the ability to set an inbound port in NiFi to accept anything, or to allow or deny requests based on my flow logic, which I can change at any time. I can modify the flow while it is live. I can add more logic while it is live. I can capture exceptions, send notifications, and build in the ability to replay requests.
I would absolutely choose NiFi over Spring to create a scalable API.


Elasticsearch in microservices architecture, design question [closed]

I am designing a system, where I have several microservices communicating via middleware.
Now, every blueprint about microservices underlines that microservices must be autonomous and each of them must handle their own data. Currently, each microservice in my system does store data in a relational database.
I have a new requirement to implement full-text search; each of my microservices stores potentially searchable entities.
I was thinking of using an Elasticsearch cluster with several indexes; the indexes would serve as boundaries that separate the data coming from the various microservices. I would like to underscore that I plan to use ES only as a search engine, not as a system of record.
Here is my dilemma:
1. Should I allow each microservice to handle ES interactions directly (just as it handles caching and persistence)?
2. Or should I create a separate microservice (let's call it "search"), which would be the one that interacts with the ES cluster?
I am leaning towards 1, because each microservice already has to be autonomous in persistence and caching, so it can handle full-text search too.
It will be interesting to hear different opinions.
UPDATE:
Here is why I think each microservice should handle its searches individually:
To me, a full-text search capability is similar to the persistence and caching layers; each microservice knows its business model best and is responsible for implementing those layers individually.
If I introduce one more microservice just for doing searches, I'll have one extra possible point of failure; the same goes for using Pub/Sub as a middleman if we do not want direct interaction between the search microservice and the rest of the pack.
On the contrary, using ES directly, which is a highly available SaaS, eliminates that single point of failure.
All write requests will be fast and there will be no lag. Information will be searchable right away. This will guarantee a seamless user experience.
I do not see search as another business process (maybe my understanding is flawed). To me, it is just a nice-to-have feature, not part of core functionality. However, once implemented, I want it to provide a great user experience.
This model of having an individual search microservice reminds me of the CQRS (command query responsibility segregation) architectural pattern, where I'd first push the data to the DB in my microservice A, then publish it to the message broker (command); a message would be picked up from the queue by a consumer and pushed into ES. The frontend, on the read path (query), would go directly to the search microservice.
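A rough sketch of the consumer side of what I have in mind, assuming Kafka as the broker and the Elasticsearch low-level REST client (all names here are hypothetical):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderIndexer {

    private final RestClient esClient = RestClient.builder(
            new HttpHost("localhost", 9200, "http")).build();

    // "order-events" and the message format are assumptions for illustration
    @KafkaListener(topics = "order-events", groupId = "search-indexer")
    public void onOrderEvent(String orderJson) throws Exception {
        // index the event into the "orders" index; in practice the document id
        // would come from the event itself so re-delivery stays idempotent
        Request request = new Request("POST", "/orders/_doc");
        request.setJsonEntity(orderJson);
        esClient.performRequest(request);
    }
}
```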
I have never seen this pattern implemented for searching. It makes sense in a big data world, where one microservice ingests the data, a worker process aggregates it for analytics and pushes it into an aggregated data table or a separate data store, and only then does the data become queryable via a separate microservice that enables fetching the analytics data.
Are there any publications out there, or successful implementations, of the CQRS pattern for ES (taking into consideration that ES is not used as the primary system of record but as a full-text search engine)?
Another search service would be over-abstracting it.
What I would do:
Use X-Pack security RBAC, which is now free, to lock down the indices for each microservice to an account that the service is configured to use: https://www.elastic.co/blog/security-for-elasticsearch-is-now-free
Use search templates in Elasticsearch to move the search logic out of the services and into ES, then have the services call the templates.
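For example, calling a stored search template from a service might look roughly like this with the Elasticsearch low-level Java REST client; the template id, index name, and parameter are made-up placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class TemplateSearch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {

            // run the stored template "customer-search" against the "customers" index,
            // passing only parameters; the query DSL itself lives in Elasticsearch
            Request request = new Request("GET", "/customers/_search/template");
            request.setJsonEntity("{\"id\": \"customer-search\", \"params\": {\"name\": \"smith\"}}");

            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```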
I would go with a separate Search service. There are several reasons for that.
It's another (business) process, so you can be more flexible. Let's say you have a CustomerMasterData service and a CustomerAddress service, but the search requirement is to be able to search either by customer name or by address. Having two different servers/ES indexes will not make your life easier. With a separate search service, however, you can actually build an index that holds data from different sources.
A service should own its data, which means that Search should be the only service with direct access to the ES index.
Filling the ES index can be kept separate and done via communication with the other services. I would do it via a messaging system: for instance, the Search service sends a sync request, and the other services listening on the queue send out their data. This keeps things independent.

APACHE NIFI vs APACHE AIRFLOW vs APACHE FALCON? Which suits best in the scenario below? [closed]

I am developing a solution in Java which communicates with a set of devices through REST APIs that belong to different vendors. So for each vendor, there is a set of processes that I have to perform inside my solution; however, these processes change from vendor to vendor. The following are the high-level processes that need to be performed.
Retrieve an XML file from a folder
Process the XML file
Perform some image processing
Schedule a job and execute it at the scheduled time
Store data in a MySQL DB and perform some REST calls to outside APIs
So one vendor might have all of the above processes, but another might not have some of them (e.g., image processing). The selected solution should provide the following:
I should be able to create custom workflows for new vendors
Need to identify any failures that occur within the workflow and perform retries
Should be able to execute some functions in parallel (e.g., image processing)
Scalable
Open source
So I was told to look into workflow managers like NiFi/Airflow/Falcon. I did some research on them but couldn't settle on the most suitable solution.
NOTE: There is NO requirement to use Hadoop or any other cluster, and the data flow frequency is not that high.
Currently, I am thinking of using NiFi, but can anyone please give me your opinion on that? What would be the best solution for my use case?
Apache NiFi is not a workflow manager in the way that Apache Airflow or Apache Oozie are. It is a dataflow tool: it routes and transforms data. It is not intended to schedule jobs, but rather allows you to collect data from multiple locations, define discrete steps to process that data, and route that data to different destinations.
Apache Falcon is different again, in that it allows you to more easily define and manage HDFS datasets. It is effectively data management within an HDFS cluster.
Based on your description, NiFi would be a useful addition for your requirements. It would be able to collect your XML files, process them in some manner, store the data in MySQL, and perform REST calls. It is also easily configurable for new vendors and tolerates failures well. It performs most functions in parallel, can be scaled out into a NiFi cluster with multiple host machines, and was designed with performance and reliability in mind.
What I am unsure about is the image processing. There are some processors (extract image metadata, resize image), but otherwise you would need to develop a new processor in Java, which is relatively easy. Or, if the image processing uses Python or some other scripting language, you can use one of the ExecuteScript processors.
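To give a feel for what a custom processor involves, here is a bare-bones sketch; the class name and the pass-through processing step are placeholders, and a real processor would also declare properties and do the actual image work:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class ProcessImage extends AbstractProcessor {

    static final Relationship SUCCESS = new Relationship.Builder()
            .name("success").description("Processed images").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // stream the content through; the actual image manipulation would go here
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                in.transferTo(out); // placeholder: pass the bytes through unchanged
            }
        });
        session.transfer(flowFile, SUCCESS);
    }
}
```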
'Scheduling jobs' using NiFi is not recommended.
Full disclosure: I am an Apache NiFi contributor.
I am using NiFi with a use case similar to the OP's. Regarding scheduling, I like how NiFi works with Kafka: I have some scripts scheduled with a crontab frequency that just add a message to Kafka topics; NiFi listens to those topics and then starts the orchestration for loading, transforming, fetching, indexing, storing, etc. You can also always handle an HttpRequest, so you can build a kind of "webhook receiver" in order to trigger a process from an external HTTP POST. Once again, for simple deployments (the ones you plug and play on a single machine), a cron job nails the task. For image processing, I have an OCR image reader in Python connected with an ExecuteScript processor, and a facial recognition step with OpenCV behind an ExecuteCommand processor; NiFi's automatic back-pressure has solved many of the problems I ran into when running the Python script and the command by themselves.
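To illustrate, the trigger my cron job sends is nothing more than a small message dropped onto a Kafka topic, something like the sketch below; the topic name and payload are made up, and a ConsumeKafka processor on the NiFi side picks it up:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FlowTrigger {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // the NiFi flow starts when this message arrives on the topic it listens to
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("nifi-triggers", "start-nightly-load"));
        }
    }
}
```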

Spring Cloud Netflix & Spring Cloud Data Flow microservice architecture

I'm developing an application that must both handle events coming from other systems and provide a REST API. I want to split the application into microservices, and I'm trying to figure out which approach I should use. Two toolkits caught my attention, Spring Cloud Netflix and Spring Cloud Data Flow, but it's not clear to me whether they can be integrated and how.
As an example, suppose we have the following functionality in the system:
1. information about users
2. card of orders
3. product catalog
4. sending various notifications
5. obtaining information about the orders from third-party systems
6. processing, filtering, and transformation of order events
7. processing of various rules based on orders and sending notifications
8. sending information about user orders from third-party systems to other users using websockets (with pre-filtering)
Points 1-4: here I see the classical microservice architecture. Framework: the Spring Cloud Netflix stack.
Points 5-8: it's best to use an event-driven approach. Toolkit: Spring Cloud Data Flow.
The question is how to build communication between these platforms.
In particular, the POPULATE ORDER DETAILS SERVICE must transform the incoming orders and save additional information (in case it is needed) in the database. The ORDER RULE EXECUTOR SERVICE should obtain the currently saved rules, execute them, and send notifications. The WEB SOCKET SERVICE should send order information only if a particular user has set the filters, and the ORDER SAVER SERVICE should store the information about the transformed orders in the database.
1. Communication between the microservices within the two platforms could go through the API GATEWAY, but in this case I have the following questions:
Does the Spring Cloud platform allow working with microservices that way?
Performance: the number of events is very large, which could significantly slow down the processing of events. Is it possible to use other approaches, for example communication not through the API Gateway but through an in-memory cache?
2. Since some functionality intersects between these services, I have a question about what a "microservice" is in the understanding of the Spring Cloud Stream framework. In particular, does it make sense to have separate services? Can a microservice in Spring Cloud Stream have a REST API, work with the database, and simultaneously process events? Does such a diagram make sense, and is it possible to build such a stack at the moment?
The question is which of these approaches is more correct, and what Spring Cloud Stream means by "microservice".
Given the limited information in the post, it is hard to advise on all the matters pertaining to this type of architecture, but I'll attempt to share some specifics and point to samples. For the same reason, it is hard to solve for your needs end to end. From the surface, it appears you're attempting to build event-driven applications and wondering whether Spring Cloud Stream (SCSt) and Spring Cloud Data Flow (SCDF) could help.
They can, yes.
The Order, User, and Catalog seem like domain objects, and they would all come together to solve a use case: for instance, querying for the number of orders for a particular product, grouped by user. There are a few samples that articulate the data flow between the entities to solve similar problems. Here's a live code walkthrough of event-driven systems in action, and there's an example of a social-graph application, too.
Though these event-driven applications can run standalone as individual services with the help of a message broker (e.g. Kafka or RabbitMQ), you could of course also register them in SCDF and use them in the SCDF DSL to build a coherent data pipeline. We are expanding the direct capabilities in SCDF for these types of use cases, but there are ways to orchestrate them today with the current abilities, too. Follow spring-cloud/spring-cloud-#2331#issuecomment-406444350 for more details.
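For instance, a minimal Spring Cloud Stream processor in the functional style might look like the sketch below; the Order/EnrichedOrder types and the enrichment step are placeholders, and the binding to Kafka/RabbitMQ destinations is done through configuration:

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class OrderEnricherApplication {

    public static void main(String[] args) {
        SpringApplication.run(OrderEnricherApplication.class, args);
    }

    // Bound to an input and an output destination on the broker via
    // spring.cloud.stream.* properties when a binder is on the classpath.
    @Bean
    public Function<Order, EnrichedOrder> populateOrderDetails() {
        return order -> new EnrichedOrder(order, lookUpDetails(order));
    }

    private OrderDetails lookUpDetails(Order order) {
        // placeholder for the "save additional information" step described above
        return new OrderDetails();
    }

    // hypothetical domain types, just to make the sketch compile
    record Order() {}
    record OrderDetails() {}
    record EnrichedOrder(Order order, OrderDetails details) {}
}
```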
I hope this gives an idea. Try to build something small using SCSt/SCDF, prove it out, and expand to more complex use-cases.

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through a few technologies with which I can automate or set up the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where each can be used. I am new to this, so if anyone could explain it a bit, it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of StreamSets is relatively limited, so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing, you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, view helpful in-line documentation, and more. Almost all other systems, by comparison, have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive, change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when all of that is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows, let things run, and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow, replay the object, and repeat. Really helpful.
3) Purpose built data repositories
NiFi's out-of-the-box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design, which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write-ahead log implementation, and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes), or we can transform data by simply reading from the original and writing out a new version. Again, very efficient. Couple that with the provenance capability I mentioned a moment ago and it provides a really powerful platform.
Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like the size of the data involved. The NiFi API was built to honor that fact, so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written, so again you get really helpful information for establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally, and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the StreamSets company that NiFi is file-oriented. I'm not really sure what the difference is between a file, a record, a tuple, an object, or a message in generic terms, but the reality is that when data is in the flow it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow, it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the StreamSets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally into some special NiFi format, nor do we have to convert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that, because it would mean that even the most trivial of cases would have problematic performance implications; luckily NiFi does not have that problem. Further, had we gone that route, handling diverse datasets like media (images, video, audio, and more) would be difficult, but we're on the right track, and NiFi is used for things like that all the time.
Finally, as you continue with your project, if you find there are things you'd like to see improved or you'd like to contribute code, we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to check out:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While both have a lot of similarities, such as a web-based UI, and both are used for ingesting data, there are a few key differences. They also both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. This means that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, or in a clustered mode, but it runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDF) is more mature and StreamSets is more lightweight.
Both are easy to use, both have strong capability. And StreamSets could easily
Both have companies behind them, Hortonworks and Cloudera respectively.
Obviously there are more contributors working on NiFi than on StreamSets, and of course NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a top-level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc., but the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract: https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer: https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have," which imposes a bit of a restriction if you want to deploy your dataflows further toward the edge, where the devices generating the data live. Apache MiNiFi, a sub-project of NiFi, can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Design strategy for Microservices in .NET [closed]

What would be a good way for .NET microservices to communicate with each other? Would peer-to-peer communication be better (for performance), using NetMQ (a port of ZeroMQ), or would it be better via a bus (NServiceBus or RhinoBus)?
Also, would you break up your data access layer into microservices too?
-Indu
A service bus based design allows your application to leverage the decoupling middleware design pattern. You have explicit control over how each microservice communicates. You can also throttle traffic. However, it really depends on your requirements. Please refer to this tutorial on building and testing microservices in .NET (C#).
We are starting down this same path. Like all hot new methodologies, you must be careful that you are actually achieving the benefits of using a microservices approach.
We have evaluated Azure Service Fabric as one possibility. As a place to host your applications, it seems quite promising. There is also an impressive API if you want your applications to integrate tightly with the environment. This integration could likely answer your questions. The caveat is that the API is still in flux (it's improving) and documentation is scarce. It also feels a bit like vendor lock-in.
To keep things simple, we have started out by letting our microservices be simple, stateless applications that communicate via REST. The endpoints are well documented and contain a contract version number as part of the URI. We intend to introduce more sophisticated forms of interaction later as the need arises (i.e., performance).
To answer your question about the "data access layer": my opinion is that each microservice should persist state in whatever way is best for that service. The actual storage is private to the microservice, and other services may only use that data through its public API.
We've recently open-sourced our .NET microservices framework, which covers a couple of the patterns needed for microservices. I recommend at least taking a look to understand what is needed when you go into this kind of architecture.
https://github.com/gigya/microdot
