Is Spring XD the right tool choice? - hadoop

We're building an M2M IoT platform and part of the ecosystem is a Big Data storage and analytics component.
The platform connects devices at one end and provides a streaming data output using ActiveMQ to interface with the Big Data application layer.
I'm currently designing this middle layer, which accepts machine data, runs real-time processes, and stores the data into a Hadoop storage module.
From what I see, Spring XD seems able to orchestrate this process from ingestion through filtering, processing, analytics, and export to Hadoop.
However, I do not know anyone who has done something like this. Has anyone here executed something similar? I need your feedback on the choice of tool for the middleware.

Spring XD works well with RabbitMQ; for ActiveMQ you can use the JMS connector.
For more information, take a look at Spring Integration, which is Spring XD's main underpinning and has been around forever.
Spring XD can run on YARN and uses ZooKeeper for coordination, both of which are very solid.
I have seen it used for orchestration of big data in a few places.
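Since the answer points at the JMS connector and Spring Integration, here is a minimal sketch of what the ActiveMQ ingestion side could look like with Spring Integration's Java DSL. The broker URL, the "machine.data" queue name, and the downstream handling are assumptions for illustration, not part of the original setup.

```java
import org.apache.activemq.ActiveMQConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.jms.dsl.Jms;

@Configuration
@EnableIntegration
public class MachineDataIngestConfig {

    @Bean
    public ActiveMQConnectionFactory connectionFactory() {
        // Hypothetical broker URL; point this at your ActiveMQ instance.
        return new ActiveMQConnectionFactory("tcp://localhost:61616");
    }

    @Bean
    public IntegrationFlow machineDataFlow(ActiveMQConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Jms.messageDrivenChannelAdapter(connectionFactory)
                        .destination("machine.data"))               // assumed queue name
                .filter(String.class, payload -> !payload.isEmpty()) // drop empty readings
                .handle(message -> {
                    // In Spring XD a processor/sink module (e.g. the hdfs sink)
                    // would take over here; this sketch just logs the payload.
                    System.out.println("Received: " + message.getPayload());
                })
                .get();
    }
}
```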

Related

Spring Batch and Kafka

I am a junior programmer in banking. I want to build a microservice system that gets data from Kafka and processes it; after that, it saves the result to a database and sends the final data to a client app. What technology can I use? I plan to use Spring Batch and Kafka. Can those technologies be implemented in my project, or is there a better alternative?
To process data from a Kafka topic, I recommend using the Kafka Streams API, in particular Spring's Kafka Streams support.
Kafka Streams and Spring
And to store the data in a database, you should use a Kafka Sink Connector.
Kafka Connect
This approach is very common and easy if your company has a Kafka ecosystem.
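For a concrete picture of the processing step, here is a minimal Kafka Streams sketch. The topic names ("payments.raw", "payments.processed") and the trivial transformation are assumptions; the output topic is what a Kafka sink connector could then load into your database.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read raw records, apply the business transformation, write the result.
        KStream<String, String> raw = builder.stream("payments.raw");
        raw.mapValues(value -> value.toUpperCase())   // stand-in for the real processing logic
           .to("payments.processed");                 // a sink connector can load this topic into the DB

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```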
In terms of alternatives, here you will find an interesting comparison:
https://scramjet.org/blog/welcome-to-the-family
3 in 1 serverless
Scramjet takes a slightly different approach: three platforms in one.
Both the free product (https://hub.scramjet.org/, for installation on your own server) and the cloud platform are available; the cloud platform is currently also free in its beta version (https://scramjet.org/#join-beta).

logstash vs spring cloud data flow, which one is suitable for data preprocessing?

I'm using Spring Boot along with Elasticsearch to build a search system for my website.
I have some data that I need to push into Elasticsearch. This data (a product, for example) must be processed first: it is passed to another microservice that filters the JSON, adds some fields for better search results, does some calculations, and returns the object I want to store. Is it possible to do this with Logstash, or do I need to use Spring Cloud Data Flow?
What I want to do:
save a product (product service)
log the saved product or stream it
process it before storage (another service)
save the document (Elasticsearch server)
Thanks in advance.
Obviously it depends on various factors but I can try to provide some insights on Spring Cloud Data Flow from the technical standpoint.
If you want to construct a streaming pipeline where your filtering apps are connected via a messaging system that carries this flow of data, you can check out Spring Cloud Data Flow.
Spring Cloud Data Flow (and the underlying frameworks it builds on, such as Spring Cloud Stream and Spring Cloud Task) provides operational benefits for managing your streaming pipelines, but it may not make sense if you don't need a data pipeline with a messaging system. In that case, you would just stick to a simple Spring Boot app that implements the whole filtering model. As soon as you start distributing these applications, loosely coupled via a messaging system, Spring Cloud Data Flow becomes handy.
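To make the "simple filtering app" concrete, here is a minimal sketch of such a processor written in Spring Cloud Stream's functional style. The Product type, its fields, and the enrichment logic are assumptions; the same function could later be registered in SCDF and composed into a pipeline in front of an Elasticsearch sink.

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class ProductEnricherApplication {

    public static void main(String[] args) {
        SpringApplication.run(ProductEnricherApplication.class, args);
    }

    // Spring Cloud Stream binds this Function to an input and an output
    // destination on the configured message broker (RabbitMQ, Kafka, ...).
    @Bean
    public Function<Product, Product> enrichProduct() {
        return product -> {
            // Filter/augment the payload before it is indexed downstream.
            product.setSearchText(product.getName() + " " + product.getCategory());
            return product;
        };
    }
}

// Hypothetical payload class, trimmed to the accessors used above.
class Product {
    private String name;
    private String category;
    private String searchText;

    public String getName() { return name; }
    public String getCategory() { return category; }
    public void setSearchText(String searchText) { this.searchText = searchText; }
}
```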
Please check out the SCDF guide to understand its features and recipes, learn more about what SCDF can offer, and choose what fits your case.

Spring dataflow and GCP Pub Sub

I'm building an event-driven microservice architecture, which is supposed to be Cloud agnostic (as much as possible). Since this is initially going in GCP and I don't want to spend a long time in configurations and all that, I was going to use GCP's Pub/Sub directly for the event queue and would take care of other Cloud implementations later, but then I came across Spring Cloud Dataflow, which seemed nice because these are Spring Boot microservices and I needed a way to orchestrate them.
Does Spring Cloud Data Flow support Pub/Sub as its event queue?
Would it make my life easier in terms of configuration and setup to go that path, rather than choosing a non-native broker?
It'd be useful to first unpack Spring Cloud Stream's "binder abstraction": by using this framework, you'd have a portable event-driven streaming application that can run locally on your laptop or on any cloud of your choice, against the desired message broker.
Learn more about the binder abstraction here. Here are all the available binder implementations to choose from. Google Pub/Sub is an option, and that binder is maintained by Google here.
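As a rough sketch of what that portability looks like in code, the consumer below contains no Pub/Sub-specific API calls at all; putting the Google Cloud Pub/Sub binder on the classpath (instead of, say, the Rabbit or Kafka binder) and setting the binding properties is what targets Pub/Sub. The function name and the "orders" destination are assumptions.

```java
import java.util.function.Consumer;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class OrderEventsApplication {

    public static void main(String[] args) {
        SpringApplication.run(OrderEventsApplication.class, args);
    }

    // Bound by Spring Cloud Stream to whatever broker the binder on the
    // classpath targets, e.g.
    // spring.cloud.stream.bindings.handleOrder-in-0.destination=orders
    @Bean
    public Consumer<String> handleOrder() {
        return payload -> System.out.println("Received event: " + payload);
    }
}
```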
Now, let's talk about Spring Cloud Data Flow (SCDF). Once you have built the streaming applications, you can use SCDF to design and create a data pipeline made of such applications. There's also the option to mix in and reuse the collection of utility applications that we build, maintain, and release. The utility applications can be packaged with the Google Pub/Sub binder or other binders. More details here.
When you deploy the data pipeline, SCDF will resolve and download the individual applications to deploy them natively on platforms like Kubernetes or Cloud Foundry. We have users doing the same in a variety of cloud infrastructure (VMs, Bare-metal, EC2, Rackspace, etc.), including DIY platforms, too.
In addition to automating the deployment of the applications, SCDF automates the configuration setup based on naming conventions derived from the combination of the stream/task name and the application name. When the apps bootstrap, they will have automatically received the connection configuration from SCDF, as well as the destination/topic to connect to, along with the other metadata needed to reason about a collection of apps as a "stream" or a "task/batch" data pipeline. This allows you to monitor and manage the pipelines centrally.
Lastly, SCDF has the native ability to rolling-upgrade or rolling-downgrade one or many applications in a data pipeline without impacting the upstream or downstream consumers in production. More details here. There's a webinar recording (demo starts at ~41.25) on how to do this with CI/CD automation.

Azure alternative to spring cloud dataflow process

I'm looking for the Azure alternative to the Data Flow model of source-processor-sink.
I want the three entities to be separate microservices. I want to use messaging as a link between these three.
Basically, Source app takes the data from another service and sends it to processor while processor app acts on it and sends relevant notification/alert to sink.
I'm aware I can use RabbitMQ for the messaging, but I need to know which one will be better on Azure: Service Bus topics or Event Hubs? And how can I use them?
At the moment, there isn't a Spring Cloud Stream binder implementation for Azure Event Hubs.
Unless we have this, neither the out-of-the-box nor custom apps can be built as messaging microservices, where Spring Cloud Stream provides the programming model and Spring Cloud Data Flow lets you orchestrate the individual microservices into a data pipeline (i.e., source-processor-sink) via the DSL or the drag-and-drop GUI.
Microsoft was exploring the binder implementation in the past; possibly it will end up in the Azure Spring Boot project. Feel free to drop an issue on their backlog.

Use EIP and integration solutions to distribute layers on cloud?

I want to adopt a solution of EIP for cloud deployment for a web application:
The application will be developed in such an approach that each layer (e.g. data, service, web) will come out as a separate module and artifact.
Each layer can be deployed on a different virtual resource in the cloud. In this regard, web nodes will somehow find the related service nodes, and likewise service nodes are connected to data nodes.
Objects in the service layer provide REST access to the services in the application. The web layer is supposed to use the REST services from the service layer to complete requests for users of the application.
For the above requirement to deliver a "highly scalable" application on the cloud, it seems that solutions such as Apache Camel, Spring Integration, and Mule ESB are significant options.
There are other discussions on this topic, such as a question and a blog post, but I was wondering if anybody has had specific experience with such a deployment scheme on "the cloud"? I'd be thankful for any ideas and shared experiences. TIA.
To me this looks a bit like overengineering. Is there a real reason that you need to separate all those layers? What you describe looks a lot like the J2EE applications from some years ago.
How about deploying all layers of the application onto each node and just using simple Java calls or OSGi services to communicate?
This approach has several advantages:
Less complexity
No serialization or DTOs
Transactions are easy / no distributed transactions necessary
Load balancing and failover are much easier, as you can do them at the web layer only
Performance is probably a lot higher
You can implement such an application using spring or blueprint (on OSGi).
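As a rough illustration of that suggestion, here is a minimal Spring sketch where the web layer calls the service layer through an injected interface, so there is no remoting, serialization, or DTO mapping between layers. All class and endpoint names are hypothetical.

```java
import java.util.List;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
public class ShopApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShopApplication.class, args);
    }
}

// Service layer: a plain interface plus implementation, deployed in the same JVM.
interface OrderService {
    List<String> findOpenOrders();
}

@Service
class DefaultOrderService implements OrderService {
    @Override
    public List<String> findOpenOrders() {
        // In a real application this would call the data layer, also in-process.
        return List.of("order-1", "order-2");
    }
}

// Web layer: talks to the service layer via constructor injection,
// i.e. a local Java call at runtime rather than a remote REST call.
@RestController
class OrderController {

    private final OrderService orderService;

    OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    @GetMapping("/orders/open")
    public List<String> openOrders() {
        return orderService.findOpenOrders();
    }
}
```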
Another option is to use a modern Java EE server. If this interests you, take a look at some of Adam Bien's courses. He shows how to use Java EE in a really lean way.
For communication between nodes I have had good experiences with Camel and CXF, but you should try to avoid remoting as much as possible.
