When I was new to Drools I went through some forums and developed an application with Drools configured using the KnowledgeBuilder APIs and StatefulKnowledgeSessions. At that time the Drools files were few in number and were packaged with the application.
When profiling, I found that Drools was consuming a lot of memory and that the memory allocation rate (TLABs) was high. This made me wonder whether I need a caching solution so that KnowledgeSessions are not created every time a request reaches the application.
My application supports more than 100 types of events, and for every event I have 3 different types of Drools files, which I have to execute one at a time to filter intermediate results.
Also, the number of Drools files in the application and the frequency of changing or adding rules are increasing, so I have to externalize the Drools files from the application package.
I have different Drools files for different needs, and not all of them are required every time, so I am thinking of keeping them behind a REST service backed by a NoSQL database, into which they are loaded whenever we want to change them.
This way, I think the application can GET only the Drools files it needs from the service (I will use naming conventions for the files) and, if required, cache them locally (using Guava caching or any in-process cache, evicting entries that are no longer needed). My guess is that this would bring the memory consumption and allocation rate down.
Is this design sound? If so, how do I read Drools files from a cache, a StringBuilder, or memory?
Currently I read them from the file system using the API below:
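```java
KnowledgeBuilder knowledgeBuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
knowledgeBuilder.add(ResourceFactory.newClassPathResource("drools_conventions.drl"), ResourceType.DRL);
KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
// Without this call the knowledge base contains no rules.
kbase.addKnowledgePackages(knowledgeBuilder.getKnowledgePackages());
// The variable is named kbase, not knowledgeBase.
StatefulKnowledgeSession kSession = kbase.newStatefulKnowledgeSession();
List filter = new ArrayList<>();
kSession.setGlobal("Rule", filter);
// ... insert facts and fire rules here ...
kSession.dispose();
```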
Can I cache the KnowledgeBase/StatefulKnowledgeSession?
Since I am new to Drools, I would like opinions and suggestions so that I can learn to implement the best possible solution for this situation.
I think you should take a look at stateless sessions. If you just insert facts, fire rules, and then immediately dispose of the KieSession, that is the best option for you. Beyond that, newer Drools versions already provide a caching mechanism for sessions.
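The caching idea from the question can be sketched in plain Java without any Drools dependency. The assumption here is that the compiled rule base is the expensive, reusable part, while a session is created and disposed of per request. `CompiledRules`, `RuleBaseCache`, and the source loader are hypothetical stand-ins, not Drools APIs; in a real application the loader would fetch the DRL text (for example from the REST service) and the compile step would use the Drools builder.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for an expensive-to-build compiled rule base.
final class CompiledRules {
    final String name;
    final String source;

    CompiledRules(String name, String source) {
        this.name = name;
        this.source = source;
    }
}

// Caches one compiled rule base per rule-file name.
final class RuleBaseCache {
    private final Map<String, CompiledRules> cache = new ConcurrentHashMap<>();
    private final Function<String, String> sourceLoader; // assumed: fetches DRL text by name
    final AtomicInteger compilations = new AtomicInteger(); // counts cache misses

    RuleBaseCache(Function<String, String> sourceLoader) {
        this.sourceLoader = sourceLoader;
    }

    // Compiles on the first request for a name and reuses the result afterwards.
    CompiledRules get(String ruleName) {
        return cache.computeIfAbsent(ruleName, name -> {
            compilations.incrementAndGet();
            return new CompiledRules(name, sourceLoader.apply(name));
        });
    }

    // Call this when a rule file is changed in the external store.
    void invalidate(String ruleName) {
        cache.remove(ruleName);
    }
}
```

Each request would then look up the cached rule base by its naming convention and open a short-lived session on it, instead of rebuilding everything from the DRL source.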
I have been working with Apache Spark + Scala for over 5 years now (academic and professional experience). I have always found Spark/Scala to be one of the most robust combinations for building any kind of batch or streaming ETL/ELT application.
But lately, my client decided to use Java Spring Batch for 2 of our major pipelines:
Read from MongoDB --> Business Logic --> Write to JSON File (~ 2GB | 600k Rows)
Read from Cassandra --> Business Logic --> Write JSON File (~ 4GB | 2M Rows)
I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry, but I was unable to comprehend the need for this move.
My Questions here are:
Has anybody compared the performances between Apache Spark and Java Spring Batch?
What could be the advantages of using Spring Batch over Spark?
Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk(), partition, etc. in the official docs, but I was not convinced that it is truly distributed. After all, Spring Batch runs on a single JVM instance, doesn't it?
I'm unable to wrap my head around these. So, I want to use this platform for an open discussion comparing Spring Batch and Apache Spark.
As the lead of the Spring Batch project, I’m sure you’ll understand I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.
Has anybody compared the performances between Apache Spark and Java Spring Batch?
No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.
What could be the advantages of using Spring Batch over Spark?
Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is using new technology X worth the retraining or hiring overhead of technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is minimal. I can take any developer that is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework but also because of all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc).
No dedicated infrastructure required
When running Spark in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark’s own clustering installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for specific use cases. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot’s executable JAR concept, there is nothing to set up in advance, even if running a distributed application on the same cloud-based infrastructure you run your other workloads on.
More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, processing things like chunks of records within a transaction is not.
Spring Batch, on the other hand, provides a complete suite of components for declarative input and output. Reading and writing flat files, XML files, from databases, from NoSQL stores, from messaging queues, writing emails...the list goes on. Spring Batch provides all of those out of the box.
Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
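To make the chunk and skip ideas above concrete, here is a minimal sketch in plain Java. It is not Spring Batch code, and `ChunkProcessor`, `chunkSize`, and `skipLimit` are hypothetical names used only for illustration. Items are processed one at a time, written in fixed-size chunks (each write standing in for one transaction), and a failing item is skipped until a skip limit is exceeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration of chunk-oriented processing with a skip limit.
final class ChunkProcessor<I, O> {
    private final int chunkSize;
    private final int skipLimit;
    int skipped = 0; // number of items skipped so far

    ChunkProcessor(int chunkSize, int skipLimit) {
        this.chunkSize = chunkSize;
        this.skipLimit = skipLimit;
    }

    // Processes every item and hands each complete chunk to the writer.
    void run(List<I> items, Function<I, O> processor, Consumer<List<O>> writer) {
        List<O> chunk = new ArrayList<>();
        for (I item : items) {
            try {
                chunk.add(processor.apply(item));
            } catch (RuntimeException e) {
                // Skip the bad item; fail the job once the limit is exceeded.
                if (++skipped > skipLimit) {
                    throw e;
                }
            }
            if (chunk.size() == chunkSize) {
                writer.accept(chunk); // in a real job: one transaction per chunk
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(chunk); // final partial chunk
        }
    }
}
```

With a chunk size of 2 and a skip limit of 1, the input `"1", "2", "x", "4", "5"` parsed with `Integer::parseInt` is written as two chunks, `[1, 2]` and `[4, 5]`, and the unparsable `"x"` is counted as skipped.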
Is Spring Batch “truly distributed”
One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:
Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The masters delegate work to workers based on the mechanism of orchestration (many options here). Full restartability, error handling, etc. is all available for this approach with minimal network overhead (transmission of metadata describing each partition only) to the remote JVMs. Spring Cloud Task also provides extensions to Spring Batch that allow cloud native mechanisms to dynamically deploy the workers.
Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages can surpass the overhead of the added network traffic.
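The partitioning idea can be sketched in plain Java, with threads standing in for remote worker JVMs. This is not the Spring Batch API; `Partition` and `RangePartitioner` are hypothetical names. The point it illustrates is that the master only hands out small metadata (an ID range), and each worker reads and processes its own slice of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical partition metadata: the only thing sent to a worker.
final class Partition {
    final long minId;
    final long maxId; // inclusive

    Partition(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }
}

final class RangePartitioner {
    // Splits [minId, maxId] into at most gridSize contiguous, non-overlapping ranges.
    static List<Partition> partition(long minId, long maxId, int gridSize) {
        List<Partition> partitions = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        for (long start = minId; start <= maxId; start += size) {
            partitions.add(new Partition(start, Math.min(start + size - 1, maxId)));
        }
        return partitions;
    }

    // Each worker handles its own range; here it just reports how many rows it covered.
    static long processAll(List<Partition> partitions) {
        ExecutorService workers = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Partition p : partitions) {
                results.add(workers.submit(() -> p.maxId - p.minId + 1));
            }
            long rows = 0;
            for (Future<Long> f : results) {
                rows += f.get();
            }
            return rows;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }
}
```

For example, `RangePartitioner.partition(1, 10, 3)` yields the ranges 1 to 4, 5 to 8, and 9 to 10. In remote chunking, by contrast, the master would read the rows itself and send the data, not just the range, to the workers.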
There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):
Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation
I want to create a Java EE application using JSF and the Spring Framework on the WildFly application server. One of the key requirements is:
Plug-and-play modules: this means that when I update my application or add a new module to it
(obviously updating bean.xml, web.xml, POJO classes, JARs, etc.),
the changes take effect without redeploying my *.war file and without restarting WildFly.
This is a complicated requirement for a few reasons. How will you handle changes to your DB schema/entity model? How will you handle sessions which are in progress at the time of the upgrade and are actively using the 'old' code? How do you handle changes to container managed code, code that is managed by the container only at deployment time, for example new EJBs etc?
One approach I have seen used in production to achieve some of these requirements is to do rolling updates with application versioning and full schema backwards compatibility. This is done in a clustered environment which is fronted by proxy servers that can allow active sessions using the old version of the application to continue until finished and ensure that new sessions go to servers/contexts containing the new version of the code. So you end up still deploying WARs which contain the new version of your code, and eventually undeploy the old versions when all old sessions have ended/expired. To do this you have to assume the burden in your code to fully support working against two simultaneous versions of your model when new versions introduce changes to it. This is not a trivial burden. You also have to assume the burden of the extra infrastructure to route sessions appropriately.
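The routing rule described above can be sketched as follows. This is a simplified illustration in plain Java, not the configuration of any particular proxy; `VersionRouter` and its method names are hypothetical. Sessions stay pinned to the version that was current when they started, new sessions go to the newest version, and an old version becomes safe to undeploy once no active session is pinned to it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical router that pins each session to the version it started on.
final class VersionRouter {
    private final Map<String, String> sessionToVersion = new ConcurrentHashMap<>();
    private volatile String currentVersion;

    VersionRouter(String initialVersion) {
        this.currentVersion = initialVersion;
    }

    // Called once the new WAR has been deployed alongside the old one.
    void rollOut(String newVersion) {
        currentVersion = newVersion;
    }

    // Existing sessions keep their version; new sessions get the current one.
    String route(String sessionId) {
        return sessionToVersion.computeIfAbsent(sessionId, id -> currentVersion);
    }

    // Called when a session ends or expires.
    void sessionEnded(String sessionId) {
        sessionToVersion.remove(sessionId);
    }

    // An old version can be undeployed when no active session still uses it.
    boolean canUndeploy(String version) {
        return !version.equals(currentVersion) && !sessionToVersion.containsValue(version);
    }
}
```

The sketch covers only the routing decision. The harder part, supporting two versions of the model against one schema at the same time, still has to be handled in the application code.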
I know a product like JRebel will let you do hot deploys of code (even things like EJBs), the idea being that it shortens the develop/test cycle, but I haven't seen it used outside of a development environment. Also, you would still have to deal with active sessions that were started on the old version/model.
I have a Spring-WS web service that has three issues:
Slow startup time
Slow generation of the dynamic WSDL
Heavy usage of PermGen (app has to be 1.6 compatible)
Currently, the spring-ws-servlet.xml file has several <context:component-scan> elements for autowired dependencies. Two of these scan nearly everything in two external libraries containing Hibernate DAO and Entity classes. Similarly, the Hibernate session factory bean scans a large number of entities from these two libraries.
So, my questions:
Obviously, we would see at least some performance improvement by limiting the scope of the <context:component-scan> elements. But really, would it be that much?
Similarly, would I see improvements by limiting the scope of what Entities are scanned by the session factory?
Making these changes will NOT be a quick process (alter code, test, etc). Therefore, if anyone can add their wisdom, I would greatly appreciate it.
I am developing a Spring-WS application on Google Cloud and have the same problem with slow startup time. The biggest difference I noticed came when I moved to AspectJ compile-time weaving using aspectj-maven-plugin. If you haven't done this yet, try it. The results may vary depending on your code and deployment environment. In the cloud every file operation is much slower, which may be why this worked so well for me.
We're in the process of redesigning a large application (a web portal). We are supposed to use the existing database from the old application. We are planning to use CQ for hosting the pages and supporting authoring on those pages.
Now that we have settled on CQ, the question is how to integrate CQ with an external framework such as Spring (to use JdbcTemplate) or Hibernate to access data from the database. I have the following options:
Integrate CQ with Hibernate to leverage caching, transaction management, object mapping, etc. The catch is that Hibernate can only be used for data access, not for other purposes such as making the RESTful calls that we also require.
Or integrate CQ with Spring to leverage JdbcTemplate for data access; Spring can also help with caching, transaction management, and making RESTful calls. The catch is that using JdbcTemplate will cause the following problems:
a. LOC will increase and the code will be hard to maintain
b. Query strings are hard to maintain if a table changes
Or use both and leverage the advantages of each framework wherever required.
Should I plan to integrate CQ with both frameworks? If so, what problems will that cause in terms of:
- Ease of Use
- Productivity
- Maintainability
- Stability
- Performance
- Ease of Troubleshooting
If it's data integration that you're after, CQ5 is based on Apache Sling which allows for accessing arbitrary data sources via its ResourceProvider mechanism. This was originally a read-only mechanism but read-write functionality was recently added.
I have always used a Java singleton class for my basic caching needs.
Now the project is using Ehcache, and without looking deeply into the source code I cannot figure out what was wrong with the singleton pattern.
That is, what are the benefits of using the Ehcache framework, other than that caching can be configured through XML and annotations without writing boilerplate code (i.e. a static HashMap)?
It depends on what you need from your caching mechanism. Ehcache provides a lot of cool features, which would take a lot of well-designed code to implement manually:
LRU, LFU and FIFO cache eviction policies
Flexible configuration
many more ...
I would recommend going through them at http://ehcache.org/about/features and deciding whether you really need any of them in your project.
The most important one:
The ability to overflow to disk - this is something you don't have in a normal HashMap, and writing something like that is far from trivial. Ehcache can function as a simple-to-configure key-value database.
Even if you don't use overflow to disk, there is a lot of boilerplate to write in your own implementation. If loading the whole database into memory were possible, then using an in-memory database that persists on write and restores on startup would be the solution. But memory is limited, so you have to remove elements from memory. But which ones, and based on what? You must also ensure that cache elements are not too old; older elements should be replaced. If you need to remove elements from the cache, you should start with the outdated ones. But should you do that when a user requests something? It will slow down the request. Or should you start your own thread?
With Ehcache you have a library in which all those issues are addressed and tested.
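To give a sense of the boilerplate involved, here is a minimal hand-rolled cache in plain Java that covers only two of the concerns above: least-recently-used eviction and expiry of old entries. It is a sketch under simplifying assumptions, not how Ehcache is implemented: `SimpleLruTtlCache` is a hypothetical name, the class is not thread-safe, expired entries are only removed lazily on read, and there is no disk overflow.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical hand-rolled cache: LRU eviction plus time-to-live. Not thread-safe.
final class SimpleLruTtlCache<K, V> {
    // A cached value together with the time at which it expires.
    private static final class Timed<T> {
        final T value;
        final long expiresAt;

        Timed(T value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock; // injected so that expiry can be tested
    private final Map<K, Timed<V>> map;

    SimpleLruTtlCache(final int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder = true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    void put(K key, V value) {
        map.put(key, new Timed<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null for missing or expired entries; expired entries are removed lazily.
    V get(K key) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAt) {
            map.remove(key);
            return null;
        }
        return entry.value;
    }

    int size() {
        return map.size();
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. Making this safe for concurrent use, evicting in the background, and persisting to disk are the parts that get hard quickly.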
There is also a clustered, closed-source version of Ehcache, which gives you a distributed cache. That might be one reason to consider using Ehcache.