We want to add a trace ID (via MDC) to the logging of multiple REST services, so we can follow all calls in our logs based on that ID.
Is Spring Cloud Sleuth suitable as a distributed logging framework with 100% sampling all the time, or are we misusing it?
We are concerned about the overhead of 100% sampling.
Yes, you can. You can modify the sampling percentage at runtime if necessary. We wouldn't be building a tool that is not suitable for distributed systems.
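The idea in the question (one ID carried across services and printed on every log line) can be sketched in plain Java. This is not Sleuth's API: the header name X-Trace-Id, the class, and its methods are hypothetical, and a ThreadLocal stands in for the logging MDC.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TraceIdSketch {
    // Hypothetical header name; real tracing libraries define their own.
    static final String TRACE_HEADER = "X-Trace-Id";

    // Per-thread holder standing in for a logging MDC.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Reuses the caller's trace ID if present, otherwise starts a new trace. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String id = incomingHeaders.get(TRACE_HEADER);
        return (id == null || id.isEmpty()) ? UUID.randomUUID().toString() : id;
    }

    /** Runs the handler with the trace ID bound to the current thread. */
    static void handle(Map<String, String> incomingHeaders, Runnable handler) {
        CURRENT.set(resolveTraceId(incomingHeaders));
        try {
            handler.run();
        } finally {
            CURRENT.remove(); // avoid leaking IDs across pooled threads
        }
    }

    /** Headers to attach to an outgoing call so the next service logs the same ID. */
    static Map<String, String> outgoingHeaders() {
        Map<String, String> headers = new HashMap<>();
        String id = CURRENT.get();
        if (id != null) {
            headers.put(TRACE_HEADER, id);
        }
        return headers;
    }

    /** Prefixes every log line with the current trace ID. */
    static void log(String message) {
        System.out.println("[traceId=" + CURRENT.get() + "] " + message);
    }

    public static void main(String[] args) {
        Map<String, String> incoming = new HashMap<>();
        incoming.put(TRACE_HEADER, "abc-123");
        handle(incoming, () -> {
            log("handling request");
            log("calling downstream with " + outgoingHeaders());
        });
    }
}
```

The cost of this kind of correlation is one map lookup and one string per request, which is why log correlation on its own is usually cheap; whatever is done with the collected trace data afterwards is a separate cost to measure.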
We're working on a project and want to use a feature-toggling tool such as ff4j or Togglz, but we have hard performance constraints: we need the tool with the lowest execution time. I've looked briefly at ff4j and Togglz, but I don't know which is the better fit, or whether there are other tools I should consider.
Context of the project: it's a Netflix-style microservices architecture, so we have Eureka, Ribbon, Zuul and the microservices.
Otherwise, if you have another solution in mind, such as developing a sidecar, please give me some ideas.
Thank you in advance :)
Disclaimer: I created FF4j, so I won't give you an answer about performance comparison. I will provide architecture design principles instead.
Microservices means distributed architecture so you will have to store the state of your features in a common persistence storage (DB).
The cost of a feature toggle framework is not the time needed to evaluate the feature state predicate (it is a simple condition); it is the time needed to access the data in the persistence storage.
FF4j supports both Redis and Consul:
Redis seems a good candidate, as it is very fast for put/get operations and can be distributed.
Consul is also a good idea for distributed microservices: it provides a key-value store.
Eureka may do the same, I don't know; ff4j does not have a store for it yet.
If you have to store your features in a slower DB such as a SQL database, then you might consider caching. FF4j provides a cache proxy to handle such use cases.
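The caching idea can be sketched as a small read-through cache with a time-to-live in front of the slow store. This is not FF4j's cache proxy: the class name, the constructor parameters and the Function standing in for the SQL lookup are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

public class CachedFeatureStore {
    /** Cached flag value plus the time it was loaded. */
    private static final class Entry {
        final boolean enabled;
        final long loadedAtMillis;

        Entry(boolean enabled, long loadedAtMillis) {
            this.enabled = enabled;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Function<String, Boolean> slowStore; // stands in for a SQL lookup
    private final long ttlMillis;
    private final LongSupplier clock;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public CachedFeatureStore(Function<String, Boolean> slowStore, long ttlMillis, LongSupplier clock) {
        this.slowStore = slowStore;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached state, reloading from the slow store once the entry is older than the TTL. */
    public boolean isEnabled(String feature) {
        long now = clock.getAsLong();
        Entry entry = cache.get(feature);
        if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
            entry = new Entry(slowStore.apply(feature), now);
            cache.put(feature, entry);
        }
        return entry.enabled;
    }

    public static void main(String[] args) {
        CachedFeatureStore store = new CachedFeatureStore(
                name -> name.equals("new-checkout"), // pretend database query
                5_000L,
                System::currentTimeMillis);
        System.out.println(store.isEnabled("new-checkout"));
        System.out.println(store.isEnabled("legacy-banner"));
    }
}
```

The trade-off is staleness: a toggle flipped in the database takes up to one TTL to reach every service instance, so the TTL should be chosen with that delay in mind.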
Other considerations:
Put the administration console only in a backend application, not in each microservice (security + performance overhead)
Feature Toggle can do more with Configuration Management and monitoring.
You may want to have a look at this 15min talk exactly on that subject. LIVE DEMO starting at 7:10
and related github repository for sample with Spring-Cloud
I have been working with Apache Spark + Scala for over 5 years now (Academic and Professional experiences). I always found Spark/Scala to be one of the robust combos for building any kind of Batch or Streaming ETL/ ELT applications.
But lately, my client decided to use Java Spring Batch for 2 of our major pipelines :
Read from MongoDB --> Business Logic --> Write to JSON File (~ 2GB | 600k Rows)
Read from Cassandra --> Business Logic --> Write JSON File (~ 4GB | 2M Rows)
I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry but I was unable to comprehend the need of making this move.
My Questions here are:
Has anybody compared the performances between Apache Spark and Java Spring Batch?
What could be the advantages of using Spring Batch over Spark?
Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk(), partition etc. in the official docs, but I was not convinced of its true distributedness. After all, Spring Batch is running on a single JVM instance, isn't it?
I'm unable to wrap my head around these. So, I want to use this platform for an open discussion between Spring Batch and Apache Spark.
As the lead of the Spring Batch project, I’m sure you’ll understand I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.
Has anybody compared the performances between Apache Spark and Java Spring Batch?
No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.
What could be the advantages of using Spring Batch over Spark?
Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is using new technology X worth the retraining or hiring overhead of technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is minimal. I can take any developer who is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework but also because of all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc).
No dedicated infrastructure required
When running Spark in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark’s own clustering installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for specific use cases. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot’s executable JAR concept, there is nothing to set up in advance, even if running a distributed application on the same cloud-based infrastructure you run your other workloads on.
More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, processing things like chunks of records within a transaction are not.
Spring Batch, on the other hand, provides a complete suite of components for declarative input and output. Reading and writing flat files, XML files, from databases, from NoSQL stores, from messaging queues, writing emails...the list goes on. Spring Batch provides all of those out of the box.
Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
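To make the chunk and skip ideas concrete, here is a minimal plain-Java sketch of the pattern. It is not Spring Batch's API or implementation: the class, method and parameter names are hypothetical, and transactions and retry are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class ChunkSkipSketch {
    /**
     * Processes items one by one, hands them to the writer in chunks of chunkSize,
     * and skips items whose processing throws, up to skipLimit skips.
     * Returns the number of skipped items.
     */
    static <I, O> int run(List<I> items, int chunkSize, int skipLimit,
                          Function<I, O> processor, Consumer<List<O>> writer) {
        int skipped = 0;
        List<O> chunk = new ArrayList<>();
        for (I item : items) {
            try {
                chunk.add(processor.apply(item));
            } catch (RuntimeException e) {
                skipped++;
                if (skipped > skipLimit) {
                    throw new IllegalStateException("skip limit exceeded", e);
                }
                continue; // bad record is dropped, the job carries on
            }
            if (chunk.size() == chunkSize) {
                writer.accept(new ArrayList<>(chunk)); // one write per full chunk
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(new ArrayList<>(chunk)); // final partial chunk
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1", "2", "x", "4", "5");
        int skipped = ChunkSkipSketch.<String, Integer>run(
                input, 2, 1,
                s -> Integer.parseInt(s),
                chunk -> System.out.println("writing " + chunk));
        System.out.println("skipped " + skipped);
    }
}
```

In a real batch framework the write of each chunk would typically sit inside a transaction, so a failed write rolls back only that chunk rather than the whole job.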
Is Spring Batch “truly distributed”
One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:
Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The master delegates work to the workers based on the mechanism of orchestration (many options here). Full restartability, error handling, etc. are all available for this approach with minimal network overhead (transmission of metadata describing each partition only) to the remote JVMs. Spring Cloud Task also provides extensions to Spring Batch that allow for cloud native mechanisms for dynamically deploying the workers.
Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages can surpass the overhead of the added network traffic.
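The difference in network load between the two modes comes from what the master sends. With partitioning, only a description of each partition crosses the wire, for example an ID range that each worker then reads for itself. A rough sketch of that splitting step follows; it is plain Java rather than Spring Batch's partitioning API, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartitionSketch {
    /** Metadata describing one partition: an inclusive range of record IDs. */
    static final class Partition {
        final long minId;
        final long maxId;

        Partition(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public String toString() {
            return "[" + minId + ".." + maxId + "]";
        }
    }

    /** Splits the inclusive range minId..maxId into at most gridSize contiguous partitions. */
    static List<Partition> partition(long minId, long maxId, int gridSize) {
        long total = maxId - minId + 1;
        long base = total / gridSize;
        long remainder = total % gridSize;
        List<Partition> result = new ArrayList<>();
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long size = base + (i < remainder ? 1 : 0); // spread the remainder over the first partitions
            long end = start + size - 1;
            result.add(new Partition(start, end));
            start = end + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Each element is all a worker would need to receive; the rows themselves stay in the database.
        System.out.println(partition(1, 10, 3));
    }
}
```

With remote chunking, by contrast, the master would read the records itself and ship them to the workers, which is where the heavier network load comes from.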
There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):
Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation
My application is heavily relying on asynchronous web-services.
It is built with Spring Boot 1.5.x, which allows me to use the standard Java 8 CompletableFuture<T> to produce deferred async responses. For more info see
https://nickebbitt.github.io/blog/2017/03/22/async-web-service-using-completable-future
Spring Boot 2.0.x now comes with a starter that supports the reactive paradigm. Spring WebFlux is the framework that implements reactive HTTP.
Since I have my API implemented as described in the first paragraph, will I gain much by redoing my services to use the non-blocking reactive approach? In a nutshell, I'll have a non-blocking API either way, right?
Is there an example of how to convert an async API that is based on CompletableFuture<T> to Mono<T>\Flux<T>?
I was thinking of getting rid of the servlet-based server altogether (Jetty in my case) and going with Netty + Reactor.
Needless to say that I am new to the whole reactive paradigm.
I would like to hear your opinions.
I have two things to say:
Q: Is there an example how to convert async API that is based on CompletableFuture to Mono\Flux?
A:
1) You have to configure the endpoint in a slightly different way: https://docs.spring.io/spring/docs/current/spring-framework-reference/web-reactive.html
2) CompletableFuture to Mono\Flux example: Mono.fromFuture(...)
As for the question "will I gain much by redoing my services to use non-blocking reactive approach": the general answer is provided in the documentation at https://docs.spring.io/spring-framework/docs/current/reference/html/web-reactive.html#webflux-performance and it is no.
Performance has many characteristics and meanings. Reactive and non-blocking generally do not make applications run faster. They can, in some cases, (for example, if using the WebClient to run remote calls in parallel). On the whole, it requires more work to do things the non-blocking way and that can slightly increase the required processing time.
The key expected benefit of reactive and non-blocking is the ability to scale with a small, fixed number of threads and less memory. That makes applications more resilient under load, because they scale in a more predictable way. In order to observe those benefits, however, you need to have some latency (including a mix of slow and unpredictable network I/O). That is where the reactive stack begins to show its strengths, and the differences can be dramatic.
This is the general answer, but the specifics will depend on your application, so you must measure and see. I would start by recreating a simple part of the application and checking the performance of both versions in an isolated environment.
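A first measurement along those lines can be done with the CompletableFuture API the question already uses. The sketch below compares two simulated remote calls made one after the other with the same two calls started together; the class, the call names and the fixed latency are hypothetical, and Thread.sleep only stands in for network latency (a truly non-blocking client would not hold a thread while waiting).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ParallelCallsSketch {
    /** Simulates a remote call with a fixed latency (stand-in for real I/O). */
    static CompletableFuture<String> remoteCall(String name, long latencyMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(latencyMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return name + "-result";
        });
    }

    /** Waits for the first call to finish before starting the second. */
    static String sequential(long latencyMillis) {
        String a = remoteCall("a", latencyMillis).join();
        String b = remoteCall("b", latencyMillis).join();
        return a + "," + b;
    }

    /** Starts both calls first, then combines their results. */
    static String parallel(long latencyMillis) {
        CompletableFuture<String> a = remoteCall("a", latencyMillis);
        CompletableFuture<String> b = remoteCall("b", latencyMillis);
        return a.thenCombine(b, (x, y) -> x + "," + y).join();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        sequential(200);
        long sequentialMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        start = System.nanoTime();
        parallel(200);
        long parallelMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("sequential: " + sequentialMillis + " ms, parallel: " + parallelMillis + " ms");
    }
}
```

Both variants return the same value; only the elapsed time differs, which is the kind of effect worth measuring on a real endpoint before committing to a rewrite.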
Is it sensible to use Spring in the server side of an in memory data grid based application?
My gut feeling tells me that it is nonsense in a low latency high performance system. A colleague of mine is insisting on including Spring in it. What are the pros and cons of such inclusion?
My position is that Spring is OK to use in the client but is too heavy for the server: it brings in too many dependencies and is one more leaky abstraction to think about.
Data Grid systems are memory and I/O intensive in general. Using Spring does not affect that (you may argue that Spring creates a lot of beans but with proper Garbage Collection tuning this is not a problem).
On the other hand using Spring (or any other DI) helps you structure and test your code.
So if you are implementing some sort of server based on a data grid system, pay attention to properly tuning GC and the sockets in your OS (memory buffers and socket memory). Those will give you much more benefit than cutting out DI.
First, I'm surprised by the "leaky abstraction" comment. I've never heard anyone criticize Spring for this. In fact, it's just the opposite. Spring removes the implementation details of infrastructure such as data grids from your application code and provides a consistent and familiar programming model, allowing you to focus on business logic. Spring does a lot to enhance configuration and access to data grids, especially Gemfire, and generally does not create any runtime overhead per se. During initialization of a Spring application, Spring uses tools like reflection and AOP internally which may increase the start up time of an application, but this has no impact on runtime performance. Spring has been proven in many high-throughput, low-latency production applications. In extreme cases, things like network latency and serialization, concerns external to Spring, are normally the biggest factors affecting performance.
"Spring brings in too many dependencies" is a common complaint, but is a fallacy. I would say Spring brings in the exact right amount of dependencies for what it needs to do. Additionally, Spring Boot starters and the platform BOM do a lot to simplify dependency management so you don't need to worry about version incompatibilities or explicitly declaring common dependencies. I'll have to side with your colleague on this one.
I have a web service written in Scala and built on top of Twitter's Finagle RPC system. We are now hitting some performance issues. We have external API components and a database layer.
I am planning to install Zipkin in order to have a service-level tracing system. This will allow me to know where the bottleneck is at the service level.
I am wondering, though, whether there are frameworks out there to monitor performance inside my application layer. The application is a suite of filters that are applied consecutively to my data, and I would like to know which filters take the most time to compute. I have heard about JVM profiling, but it seems a little overkill for what I want to do. What would you recommend? Thanks for your help.
Well, before digging into JVM internals or setting up all the infrastructure needed by Zipkin, you could simply start by measuring some application-level metrics.
You could try the Metrics library via this Scala API.
Basically, you manually set up counters and gauges at specific points of your application, and they will help you diagnose your bottleneck.
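As a rough illustration of what such manual instrumentation amounts to, here is a plain-Java sketch that wraps each filter with a timer and a call counter. It does not use the Metrics library's API: the class and method names are hypothetical, and filters are modelled as simple UnaryOperator functions rather than Finagle filters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.UnaryOperator;

public class FilterTimingSketch {
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Wraps a filter so that every invocation adds to its total time and call count. */
    <T> UnaryOperator<T> timed(String name, UnaryOperator<T> filter) {
        LongAdder nanos = totalNanos.computeIfAbsent(name, k -> new LongAdder());
        LongAdder count = calls.computeIfAbsent(name, k -> new LongAdder());
        return input -> {
            long start = System.nanoTime();
            try {
                return filter.apply(input);
            } finally {
                nanos.add(System.nanoTime() - start);
                count.increment();
            }
        };
    }

    long callCount(String name) {
        LongAdder count = calls.get(name);
        return count == null ? 0 : count.sum();
    }

    long totalNanos(String name) {
        LongAdder nanos = totalNanos.get(name);
        return nanos == null ? 0 : nanos.sum();
    }

    public static void main(String[] args) {
        FilterTimingSketch metrics = new FilterTimingSketch();
        UnaryOperator<String> trim = metrics.timed("trim", String::trim);
        UnaryOperator<String> upper = metrics.timed("upper", String::toUpperCase);

        // Apply the filters consecutively, as the application does with its own chain.
        String result = upper.apply(trim.apply("  hello  "));
        System.out.println(result);
        System.out.println("trim: " + metrics.callCount("trim") + " call(s), "
                + metrics.totalNanos("trim") + " ns");
        System.out.println("upper: " + metrics.callCount("upper") + " call(s), "
                + metrics.totalNanos("upper") + " ns");
    }
}
```

Dividing the total time by the call count gives a mean per filter, which is usually enough to spot the slow one; a metrics library adds percentiles and reporting on top of the same idea.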