Evaluation of ehcache in web application - ehcache

Is it good practice to store data in Ehcache to improve the performance of a web application when the data is updated frequently?

It all depends on how many reads you have relative to writes. Updates become costlier because each one must also update or invalidate the cached copy, so the time gained on reads has to offset that.
Ehcache handles concurrent access. However, its operations are atomic, not transactional. So if you read multiple values from different caches, updates can land in between the reads. But that's the same for a database. Also, you can use XA transactions to keep your cache writes in sync with the database.
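To make the read/write trade-off concrete, here is a rough back-of-the-envelope sketch. The class name, the parameters, and the cost model (a miss pays for the cache lookup plus the database read, and every write pays a fixed extra cost to update or invalidate the cache) are assumptions for illustration, not something Ehcache provides.

```java
// Rough break-even estimate for putting a cache in front of a database.
// All cost figures are hypothetical inputs, not measurements.
final class CacheBreakEven {
    /**
     * Net time saved per interval, in the same unit as the cost arguments.
     * Positive: the cache pays off. Negative: updates cost more than reads save.
     */
    static double netGain(long reads, long writes, double hitRatio,
                          double dbReadCost, double cacheReadCost,
                          double extraWriteCost) {
        // A hit costs one cache lookup; a miss costs the lookup plus the database read.
        double cachedReadCost = cacheReadCost + (1.0 - hitRatio) * dbReadCost;
        double savedOnReads = reads * (dbReadCost - cachedReadCost);
        // Every write pays an extra cost to update or invalidate the cached entry.
        double lostOnWrites = writes * extraWriteCost;
        return savedOnReads - lostOnWrites;
    }
}
```

With 1000 reads, 100 writes, a 90% hit ratio, a database read cost of 10, a cache read cost of 1 and an extra write cost of 2, the estimate is positive (about 7800 units saved). Keep in mind that frequent updates also tend to lower the hit ratio, which this simple model takes as a fixed input.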

Related

Maintain consistency between multiple datastores

I'm writing a real-time application that needs fast access to some resources. I'm using a relational database and Redis. I use the relational database for safe storage of the resources and Redis for fast access to those same resources.
The problem I'm facing is that the code for maintaining consistency between these two stores becomes very complicated. I need to check if writes are being done to both and if one of those fails, undo the one that worked.
I thought of using something like Kafka, where the write would be sent to a specific topic and different consumers (SQLConsumer and RedisConsumer) would write to the respective databases. This way, the consumers would retry indefinitely and achieve eventual consistency. Ideally, the message wouldn't be committed until the write was successful.
Is this a common/correct approach? Is there another way in which I could improve my architecture?
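A minimal sketch of the idea described in the question, assuming idempotent "set key to value" writes. It uses an in-memory list as a stand-in for the topic and does not call any Kafka API; WriteLog, StoreConsumer and drain are hypothetical names. Each consumer keeps its own offset and advances it only after its store accepted the write, so a failed write is retried on the next pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// In-memory stand-in for a topic with independent consumers; no Kafka API is used here.
final class WriteLog {
    /** One write request: set key to value. Idempotent, so replays are harmless. */
    static final class Event {
        final String key;
        final String value;
        Event(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<Event> events = new ArrayList<>();

    synchronized void append(String key, String value) { events.add(new Event(key, value)); }
    synchronized int size() { return events.size(); }
    synchronized Event get(int offset) { return events.get(offset); }
}

/** Applies events to one store and advances its offset only after a successful write. */
final class StoreConsumer {
    private final WriteLog log;
    private final BiConsumer<String, String> store; // hypothetical sink, e.g. a SQL or Redis writer
    private int committedOffset = 0;

    StoreConsumer(WriteLog log, BiConsumer<String, String> store) {
        this.log = log;
        this.store = store;
    }

    /** Applies pending events in order; stops at the first failure so it is retried next call. */
    int drain() {
        while (committedOffset < log.size()) {
            WriteLog.Event event = log.get(committedOffset);
            try {
                store.accept(event.key, event.value);
            } catch (RuntimeException failure) {
                break; // offset not advanced: the same event is retried later
            }
            committedOffset++; // "commit" only after the write succeeded
        }
        return committedOffset;
    }
}
```

Because each consumer tracks its own offset, a slow or failing Redis writer does not block the SQL writer; the two stores simply converge once the failing one recovers.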

Cache only specific tables in Spring boot

I have a table with millions of rows (98% reads, maybe 1-2% writes) which references a couple of other config tables (with maybe 20 entries each). What are the best practices for caching the tables in this case? I cannot cache the table with millions of rows, but at the same time I don't want to hit the DB for the config tables. Is there a workaround for this? I'm using Spring Boot, and the data is in Postgres.
Thanks.
First of all, let me refer to this:
"What are the best practices for caching the tables in this case"
I don't think you should "cache tables" as you say. In the application, you work with the data, and this is what should be cached. This means the object that you cache should already be in a structure that includes these relations. Of course, in order to fetch the whole object from the database, you can use JOINs, but once the object is cached that no longer matters: the translation from the relational model to the object model has already been done.
Now the question is too broad, because the actual answer can vary depending on the technologies you use, the nature of the data, and so forth.
You should answer the following questions before you design the cache (the list is off the top of my head, but hopefully you'll get the idea):
What is the cache invalidation strategy? You say there are 2% writes. When the data gets updated, the data in the cache may become stale. Is that OK?
A generalization of the previous question: if you have multiple instances (JVMs) of the same application, and one of them triggers an update to the DB data, what should happen to the other instances' caches?
How long can stale/invalid data reside in the cache?
Do the use cases of your application access all the data from the tables with the same frequency, or is some data more "interesting" (for example, the oldest data is not read, but the latest data is always "hot")? With millions of rows, the JVM probably doesn't hold all of these objects in the heap at the same time, so there should be some "slice" of this data...
What are the performance implications of having the cache? How does it affect the GC behavior?
What technologies can be used in your case? (Due to regulations or licensing, some technologies may simply not be available; this is more common in large organizations.)
Based on these observations you can go with:
In-memory cache:
Spring integrates with various in-memory cache technologies, and you can also use them without Spring at all. To name a few:
Google Guava cache (for older Spring cache implementations)
Caffeine (for newer Spring cache implementations)
In memory map of key / value
In memory but in another process:
Redis
Infinispan
Now, these caches are slower than those listed in the previous category but can still be significantly faster than the DB.
Data Grids:
Hazelcast
Off-heap memory-based caches (this means that you store the data off-heap, so it's not subject to garbage collection)
Postgres related solutions. For example, you can still go to db, but since you can opt for keeping the index in-memory the queries will be significantly faster.
Some ORM mapping specific caches (like hibernate has its cache as well).
Some kind of mix of all above.
Implement your own solution - well, this is something that probably you shouldn't do as the first attempt to address the issue, because caching can be tricky.
Finally, let me point to a very interesting session given by Michael Plod about caching. I believe it will help you find the solution that works best for you.
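As one illustration of the "in-memory map of key/value" option applied to the small config tables, here is a minimal time-to-live cache sketch. It is plain Java, not a Spring or Caffeine API; TtlCache, the loader function and the injectable clock are hypothetical names chosen for this example. The TTL is the answer to "how long can stale data reside in the cache", and invalidate is meant to be called from the code path that performs the rare writes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Minimal time-to-live cache for small, rarely changing lookups (e.g. config rows).
// Class and parameter names are illustrative; this is not a Spring or Caffeine API.
final class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;
        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical database lookup
    private final long ttlMillis;        // upper bound on how long stale data may be served
    private final LongSupplier clock;    // injectable so tests can control time

    TtlCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        final long now = clock.getAsLong();
        // compute() runs atomically per key, so concurrent callers trigger a single load.
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old == null || now - old.loadedAtMillis >= ttlMillis) {
                return new Entry<V>(loader.apply(k), now);
            }
            return old;
        });
        return entry.value;
    }

    /** Call after a write so the next read reloads from the database. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

In production the clock would typically be System::currentTimeMillis. This only covers a single JVM; with multiple instances, the other instances' caches stay stale until their own TTL expires unless invalidations are broadcast somehow.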

What's the performance penalty of long lived DB transactions interleaved with one another?

Could anyone explain, or point me to a good source that explains, the impact of long-lived database transactions when other transactions are involved?
I'm having difficulty understanding the real performance impact on an application of transactions where most of the queries are reads and maybe two or three are writes, given the different isolation levels.
Mostly I would like to understand it in the situation where:
Neither the rows read nor the rows updated are involved in any other transaction.
The rows read are involved in another transaction but not the rows being updated and this other transaction is read only.
The rows read are involved in another transaction but not the rows being updated, and this other transaction is modifying some data being read. I understand that here it also matters whether the data is read before or after it is modified.
Both the rows read and the rows updated are involved in another transaction also modifying the data.
These questions come up in the context of an application using microservices where all application-layer services are annotated with @Transactional, using JPA and PostgreSQL, and where, to transform the data, they need to make network calls to other microservices within the transaction to fetch some other values.

storing data in secondary database

Our application (Java, Spring, Hibernate) uses Postgres to store data.
We are looking to add an analysis engine to the application. I want to explore using a NoSQL DB to run the analysis on. This is partly an attempt to learn NoSQL, and partly to free the main application activity from the performance penalty (as much as possible).
So, I want the data changes to also sync to the NoSQL DB (in addition to Postgres). Any sync mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible? Can anyone point me to frameworks/technologies/ideas that address this issue of the same data going to two different data stores?
The simplest solution would be sending data to a Postgres read replica and running your analytics queries on that. The performance impact is minimal and this would save a lot of time compared to alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, which is a columnar datastore that is optimized for the types of analytics queries typically performed.

How to avoid database query storms using cache-aside pattern

We are using a PostgreSQL database and AppFabric Server, running a moderately busy ASP.NET MVC e-commerce site.
Following the cache-aside pattern we request data from our cache, and if it is not available, we query the database.
This approach results in 'query storms' where the database receives multiple queries for the same data in a short space of time, while a given object in the cache is being refreshed. This issue is exacerbated by longer-running queries, and obviously multiple requests for the same data can cause the query to run longer, forming an unpleasant feedback loop.
One solution to this problem is to use read-locking on the cache. However this can itself cause performance issues in a web farm situation (or even on a single busy web server) as web servers are blocked on reads for no reason, in case there is a database query taking place.
Another solution is to drop the cache-aside pattern and seed the cache independently. This is the approach we have taken to mitigate the immediate issues we are seeing with this problem, however it is not possible with all data.
Am I missing something here? And what other approaches have people taken to avoid this behaviour?
Depending on the number of servers you have and your current cache architecture it may be worthwhile to evaluate adding a server-level (or in-process) cache as well. In effect you use this as a fallback cache, and it's especially helpful where hitting the primary storage (database) is either very resource intensive or slow.
When I've used this I've used the cache-aside pattern for the primary cache and a read-through design for the secondary--in which the secondary is locking and ensures the database isn't over-saturated by the same request. With this architecture a primary cache-miss results in at most one query per entity per server (or process) to the database.
So the basic workflow is:
1) Try to retrieve from primary / shared cache pool
* If successful, return
* If unsuccessful, continue
2) Check in-process cache for value
* If successful, return (optionally seeding primary cache)
* If unsuccessful, continue
3) Get lock by cache key (and double-check in-process cache, in case it's been added by another thread)
4) Retrieve object from primary persistence (db)
5) Seed in-process cache and return
I've done this using injectable wrappers: my cache layers all implement the relevant IRepository interface, and StructureMap injects the correct stack of caches. This keeps the actual cache behaviors flexible, focused, and easy to maintain despite being fairly complex.
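The five-step workflow above can be sketched roughly as follows. This is written in Java rather than the .NET stack the answer describes, and it is only an illustration: TwoLevelCache is a hypothetical name, the shared cache is modelled as a plain Map instead of a remote cache client, and there is no eviction or expiry on the in-process cache or on the lock table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the workflow: shared cache -> in-process cache -> per-key lock -> database.
// The shared cache is modelled as a plain Map; in practice it would be a remote client.
final class TwoLevelCache<K, V> {
    private final Map<K, V> shared;                                  // stand-in for the primary/shared pool
    private final Map<K, V> local = new ConcurrentHashMap<>();       // in-process fallback cache
    private final Map<K, Object> locks = new ConcurrentHashMap<>();  // one lock object per key (never pruned here)
    private final Function<K, V> database;                           // hypothetical loader for the primary persistence

    TwoLevelCache(Map<K, V> shared, Function<K, V> database) {
        this.shared = shared;
        this.database = database;
    }

    V get(K key) {
        V value = shared.get(key);              // 1) primary / shared cache
        if (value != null) {
            return value;
        }
        value = local.get(key);                 // 2) in-process cache
        if (value != null) {
            shared.put(key, value);             //    optionally seed the primary cache
            return value;
        }
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        synchronized (lock) {                   // 3) lock by cache key
            value = local.get(key);             //    double-check: another thread may have loaded it
            if (value == null) {
                value = database.apply(key);    // 4) primary persistence
                if (value != null) {
                    local.put(key, value);      // 5) seed the in-process cache
                }
            }
        }
        return value;
    }
}
```

With this shape, several threads missing on the same key at the same time queue on that key's lock, and only the first one runs the database query; requests for other keys are not blocked.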
We've used AppFabric successfully with the seeding strategy you mention above. We actually do use both solutions:
Seed known data where possible (we have a limited set, so this is actually easy for us to figure out)
Within each cache access method, make sure to do look-aside as necessary, and populate cache on retrieval from data store.
The look-aside is necessary, as items may be evicted due to memory pressure, or simply because they were missed in the seeding operation. We have a "warming" service that pulses on an interval (an hour) and keeps the cache populated with the necessary data. We keep analysis on cache misses, and use that to tweak our warming strategy if we see frequent misses during the warming interval.
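A rough sketch of combining seeding with look-aside, in Java for illustration only. WarmedCache and its methods are hypothetical names. Promoting a key into the warm set automatically once it has been missed a given number of times is an assumption made here to show one way of acting on miss statistics; the answer itself only says the misses are analysed and the warming strategy is tweaked.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

// Seeded cache with look-aside fallback and miss tracking (illustrative names only).
final class WarmedCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, LongAdder> misses = new ConcurrentHashMap<>();
    private final Set<K> warmKeys = ConcurrentHashMap.newKeySet();
    private final Function<K, V> database; // hypothetical data store lookup

    WarmedCache(Set<K> knownKeys, Function<K, V> database) {
        this.warmKeys.addAll(knownKeys);
        this.database = database;
    }

    /** Look-aside read: fall back to the data store on a miss and record the miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses.computeIfAbsent(key, k -> new LongAdder()).increment();
            value = database.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** One warming pulse: promote frequently missed keys (an assumption), then reload the warm set. */
    void warm(long missThreshold) {
        misses.forEach((key, count) -> {
            if (count.sum() >= missThreshold) {
                warmKeys.add(key);
            }
        });
        misses.clear();
        for (K key : warmKeys) {
            V value = database.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
    }

    /** Simulates eviction, e.g. under memory pressure. */
    void evict(K key) {
        cache.remove(key);
    }
}
```

The hourly pulse could be driven by calling warm from a task scheduled with ScheduledExecutorService.scheduleAtFixedRate; that wiring is left out to keep the sketch short.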
