One DAO per thread or threadsafe DAO?

I'm wondering if there's an accepted practice in a multi-threaded app: should I have one DAO per thread, or simply make one DAO a thread-safe singleton?

This depends a lot on the mechanism you're using for data access. If you have very scalable data access and lots of threads, using some form of thread-static data access can be advantageous.
If you don't have scalable data access, your provider doesn't support multiple threads per process, or you just don't need the scalability at that point, using a singleton with appropriate synchronization is simpler and easier to implement.
For most business-style applications, I personally think the singleton approach is easier to maintain, and probably better - if for no other reason than it's much, much easier to test effectively. Having multiple threads for data access is likely not required, as data access is probably not going to be a bottleneck that affects usability (if you design correctly and batch requests appropriately).

Use the approach that best suits your application architecture, unless:
1) Your data access objects are expensive to create, in which case you should lean toward a thread-safe singleton.
2) Your objects maintain mutable state, as in the Active Record pattern, in which case a shared singleton is unsafe and one DAO per thread is the better fit. (Immutable DAO configuration state, like timeout thresholds, doesn't count.)
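To make the tradeoff concrete, here is a minimal sketch of both options, assuming a plain JDBC setup; UserDao, findNameById and the DataSource wiring are hypothetical illustrations, not anything prescribed by the question:

```java
import javax.sql.DataSource;

// Option 1: a stateless, thread-safe singleton DAO. Safe to share across
// threads because it holds no mutable state; each call checks out its own
// connection from the pool.
class UserDao {
    private final DataSource dataSource; // immutable configuration state

    UserDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    String findNameById(long id) throws java.sql.SQLException {
        try (var conn = dataSource.getConnection();
             var stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            stmt.setLong(1, id);
            try (var rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}

// Option 2: one DAO per thread, for providers that are not thread-safe or
// for DAOs that carry mutable per-use state.
class PerThreadDaoHolder {
    private static final ThreadLocal<UserDao> DAO =
            ThreadLocal.withInitial(() -> new UserDao(lookupDataSource()));

    static UserDao current() { return DAO.get(); }

    private static DataSource lookupDataSource() {
        // placeholder: wire in your connection pool here
        throw new UnsupportedOperationException("wire in your pool here");
    }
}
```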


Is the REPLICATE DATA pattern a good option to minimize synchronous microservice communication?

In a world of microservices, one microservice often needs to invoke another, either synchronously or asynchronously.
With synchronous communication, I understand that availability suffers, as both services need to be available during the call.
To minimize this synchronous communication, one possible solution is DATA REPLICATION at the client service: the client service keeps its data up to date by listening to events published by the other services.
In my view, this is not a good choice, as we are duplicating data that might become stale, and it adds database overhead.
In what scenario is the above pattern the best fit?
Microservices are distributed systems. This means that they are constrained by the CAP theorem, which basically means you have a choice between:
Sacrifice availability to preserve consistency: this would (among other things) lead to one service invoking functionality in another in a synchronous way. If the other service is unavailable, so is all functionality in this service which depends on that service's functionality.
Sacrifice consistency to preserve availability: you build services to be autonomous and not depend on other services being up. This leads in fairly short order to services not sharing databases and to asynchronous replication of data (because if service A has synchronously replicated data from service B, then service B being down doesn't affect A's availability, but A being down affects B's availability): with asynchronous replication, the best you can hope for is eventual consistency.
The choice between those two (if you happen to have the ability to freeze the entire universe if there's a network partition, you might be able to sacrifice partition tolerance for consistency and availability) is ultimately a business question (it's worth noting that there's a continuum of approaches between those extremes). How much are you spending on storage and on designing an (arguably) more complex system vs. how much are you losing by being unavailable?
It should be noted that the universe is inherently eventually consistent: the sun could have gone supernova a few minutes ago and we can't know it for a few minutes more.
As for the concern about duplicated data: chances are the data is already duplicated (backups) and in any database worth using the data is duplicated (the write-ahead log).
As for concrete situations: it's a lot harder to think of one where aiming for strong consistency is strictly the most suitable option.
But for an example, consider a chain of coffee shops. We have a cash register service and we have a loyalty/rewards service. Data from the loyalty/rewards service is needed by the cash register (if a customer is redeeming a "50% off a latte" reward you'd want the register to know that it's valid), and every transaction (at least those with a loyalty ID) at the register should be known by the rewards service.
If we want the reward redemptions to be consistent, then it implies that if the loyalty/rewards service is inaccessible from the register, no rewards can be redeemed. There's a nonzero chance that a customer who can't redeem a reward just walks out (and a further nonzero chance that they never get coffee from you again).
Conversely, if we want both services to have a consistent view then we're demanding that if the power's out at any store we can't determine new rewards, or if the loyalty/rewards service is inaccessible from the register, no new sales can be made.
The solution is for both services to maintain the data they need to function, even if another service controls updates to that data. They'll eventually catch up. In the case of reward redemption, assuming the unavailability happens rarely enough, it may even be desirable to have the cash register perform a preliminary validation and if that passes, assume that the reward is valid and submit it later to the loyalty/reward service.
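As an illustration of that last idea, here is a minimal sketch; all names (RegisterRewardStore, Reward, Redemption) are invented for the example. The register validates against a local, event-fed replica and queues redemptions for later reconciliation with the loyalty service:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// The register keeps a local, asynchronously updated replica of reward data,
// so it stays available even when the loyalty service is unreachable.
class RegisterRewardStore {
    private final Map<String, Reward> localReplica = new ConcurrentHashMap<>();
    private final Queue<Redemption> outbox = new ConcurrentLinkedQueue<>();

    // Applied whenever a "reward changed" event arrives from the loyalty service.
    void onRewardEvent(Reward reward) {
        localReplica.put(reward.id(), reward);
    }

    // Preliminary validation against the (possibly slightly stale) replica;
    // the redemption is queued and confirmed with the loyalty service later.
    boolean tryRedeem(String rewardId, String loyaltyId) {
        Reward reward = localReplica.get(rewardId);
        if (reward == null || reward.expired()) {
            return false; // fail fast on locally-known-bad rewards
        }
        outbox.add(new Redemption(rewardId, loyaltyId));
        return true; // optimistic: the loyalty service reconciles eventually
    }

    record Reward(String id, boolean expired) {}
    record Redemption(String rewardId, String loyaltyId) {}
}
```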

Which caching mechanism to use in my Spring application in the scenarios below

We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database, and when calling another service we need to fetch data from the DB (based on a mapping) and call that service.
To avoid the database hit, we want to cache all mapping data and use the cache to retrieve the data and call the service API.
So our ask is: add data to the cache when it gets created in the database (this could add up to millions of records) and remove it from the cache when the status column's value becomes "xyz" (for example), or based on an eviction policy.
Should we use an in-memory cache like Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of "don't build it until you need it"; however, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example, using interfaces). Adding it to an unprepared system is always possible, but much more expensive (in terms of hours) and complicated.
OK, to the actual question. Disclaimer: Hazelcast employee.
In general, Hazelcast, Ehcache, Redis and the others are all good candidates for caching. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep everything in memory. Depending on the size you want to cache, that may not be optimal. In that case Hazelcast might be the better option, as we partition data across a cluster and optimize access down to a single network hop, with minimal overhead beyond network latency.
The second question is about serialization. Do you want to store information in a highly optimized serialization format (which needs code to transform it into something human-readable), or do you want to store it as JSON?
The third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, at the cost of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses: either it is multithreaded and scales nicely, or it is a single-thread design that becomes a bottleneck once you exhaust that thread. You can work around that with more processes, but that is a workaround rather than utilizing today's systems to the fullest.
In more general terms, each of the systems you mention would do the job. The best tool, however, should be selected through a POC/prototype against your real-world use case. The important bit is "real world": a single thread performs impressively under low pressure (obviously way faster), but once exhausted it becomes a major bottleneck (again, obviously delaying responses).
I hope this helps a bit since, at least to me, any answer along the lines of "yes, we are the best option" would be an immediate no-go coming from whoever said it.
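One practical note on the "prepare for it early, via interfaces" point above: Spring's own cache abstraction keeps the provider pluggable. Here is a minimal sketch; the MappingService, cache name and methods are hypothetical, while @Cacheable/@CacheEvict are the real Spring annotations:

```java
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// Spring's cache abstraction keeps the provider pluggable: the same
// annotations work whether the CacheManager underneath is backed by
// Ehcache, Hazelcast, Redis, or a simple ConcurrentHashMap.
@Service
public class MappingService {

    // Cached on first lookup; subsequent calls skip the database.
    @Cacheable(cacheNames = "mappings", key = "#id")
    public Mapping findMapping(long id) {
        return loadFromDatabase(id);
    }

    // Evict when the row reaches the terminal status, mirroring the
    // "remove when status becomes xyz" requirement from the question.
    @CacheEvict(cacheNames = "mappings", key = "#id")
    public void onStatusChangedToXyz(long id) {
        // nothing to do beyond eviction in this sketch
    }

    private Mapping loadFromDatabase(long id) {
        // placeholder: repository lookup goes here
        throw new UnsupportedOperationException("repository lookup goes here");
    }

    public record Mapping(long id, String status) {}
}
```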
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Transactions in Berkeley DB. Fast?

When there is only one writer to a Berkeley DB, is it worth using transactions?
Do transactions cause a significant slowdown? (In percent, please.)
You use transactions if you require the atomicity that they provide. Perhaps you need to abort the transaction, undoing everything in it? Or perhaps you need the semantic that should the application fail, a partially completed transaction is aborted. Your choice of transactions is based on atomicity, not performance. If you need it, you need it.
If you don't need atomicity, you may not need durability either, and non-durable operation is significantly faster!
Transactions with DB_INIT_TXN in Berkeley DB are not significantly slower than other models, although maintaining a transactional log generally requires all data to be written to the log before being written to the database.
For a single writer and multiple readers, try the DB_INIT_CDB model, because the code is much simpler. Locks in the INIT_CDB model are per-table, so overall throughput might be worse than an INIT_TXN model because of coarse-grained per-table lock contention.
Performance will depend on access patterns more than on whether one uses the DB_INIT_TXN or DB_INIT_CDB model.
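For illustration, here is a minimal transactional sketch using Berkeley DB Java Edition (com.sleepycat.je). Note that the DB_INIT_TXN/DB_INIT_CDB flags discussed above belong to the base (C) product, so this only shows the transactional pattern; the environment path and keys/values are invented:

```java
import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.Transaction;

public class TxnExample {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true); // similar in spirit to DB_INIT_TXN

        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database db = env.openDatabase(null, "example", dbConfig);

        // All-or-nothing: both puts commit together or neither does.
        Transaction txn = env.beginTransaction(null, null);
        try {
            db.put(txn, new DatabaseEntry("k1".getBytes()), new DatabaseEntry("v1".getBytes()));
            db.put(txn, new DatabaseEntry("k2".getBytes()), new DatabaseEntry("v2".getBytes()));
            txn.commit();
        } catch (RuntimeException e) {
            txn.abort(); // undo everything in the transaction
            throw e;
        }

        db.close();
        env.close();
    }
}
```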

How to deal with Java EE concurrency

Please let me know the best practices for providing application concurrency in a software project. I would like to use Hibernate for ORM, Spring to manage the transactions, and MySQL as the database.
My concurrency requirement is to let as many users as possible connect to the database,
make CRUD operations and use the services. But I do not want stale data.
How do I handle data concurrency issues in the DB?
How do I handle application concurrency? If two threads access my object simultaneously, will it corrupt the state of my object?
What are the best practices?
Do you recommend defining isolation levels in Spring methods, i.e.
@Transactional(isolation = Isolation.READ_COMMITTED)? How do I make that decision?
I have come up with the items below; I would like to know your feedback and how to address them:
a. How to handle data concurrency issues in the DB?
Use a version tag and timestamp.
b. How to handle application concurrency?
Use optimistic locking. Don't use synchronization, but create objects for each request (prototype scope).
c. What are the best practices?
Cache objects whenever possible.
By using transactions, and @Version fields in your entities if needed.
Spring beans are singletons by default, and are thus shared by threads. But they're usually stateless, and thus inherently thread-safe. Hibernate entities shouldn't be shared between threads, and they aren't unless you explicitly do it yourself: each transaction, running in its own thread, has its own Hibernate session, loading its own entity instances.
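A minimal sketch of the @Version approach; the Account entity and its fields are hypothetical:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

// Optimistic locking: Hibernate adds "WHERE version = ?" to UPDATE statements
// and throws an OptimisticLockException if another transaction changed the row
// in the meantime, so stale data is never silently overwritten.
@Entity
public class Account {

    @Id
    private Long id;

    private long balance;

    @Version // incremented automatically on every successful update
    private int version;

    // getters and setters omitted for brevity
}
```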
Much too broad to be answered.
The default isolation level of your database is usually what you want. It is READ_COMMITTED with most databases I know about.
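If you do want to pin the isolation level on a specific method, it is a one-line annotation. A minimal sketch, where TransferService and its method are hypothetical:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class TransferService {

    // Usually unnecessary: omitting "isolation" means Isolation.DEFAULT,
    // which defers to the database's own default (READ_COMMITTED on most).
    @Transactional(isolation = Isolation.READ_COMMITTED)
    public void transfer(long fromId, long toId, long amount) {
        // debit and credit within one transaction
    }
}
```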
Regarding your point c (cache objects whenever possible), I would say that this is exactly what you shouldn't do. Caching makes your app stateful, difficult to cluster, much more complex, and you'll have to deal with the staleness of the cache. Don't cache anything until:
you have a performance problem
you can't tweak your algorithms and queries to solve the performance problem
you have proven that caching will solve the performance problem
you have proven that caching won't cause problems worse than the performance problem
Databases are fast, and already cache data in memory for you.

Performance impact of having a data access layer/service layer?

I need to design a system which has these basic components:
A webserver which will be getting ~100 requests/sec. The webserver only needs to dump data into the raw data repository.
A raw data repository with a single table that receives those ~100 rows/sec from the webserver.
A raw data processing unit (simple processing, not much: removing invalid raw data, inserting missing components into damaged raw data, etc.)
A processed data repository
Does it make sense in such a system to have a service layer on which all components would be built? All inter-component interaction will go through the service layers. While this would make the system easily upgradeable and maintainable, would it not also have a significant performance impact since I have so much traffic to handle?
Here's what can happen unless you guard against it.
In the communication between layers, some format is chosen, like XML. Then you build it and run it and find out the performance is not satisfactory.
Then you mess around with profilers, which leave you guessing what the problem is.
When I worked on a problem like this, I used the stackshot technique and quickly found the problem. You would have thought it was I/O. NOT. It was that converting data to XML, and parsing XML to recover the data structure, was taking roughly 80% of the time. It wasn't too hard to find a better way to do that. Result - a 5x speedup.
What do you see as the costs of having a separate service layer?
How do those costs compare with the costs you must incur? In your case that seems to be at least
a network read for the request
a database write for raw data
a database read of raw data
a database write of processed data
Plus some data munging.
What sort of services do you have in mind? Perhaps
saveRawData()
getNextRawData()
writeProcessedData()
Why is the overhead any more than a procedure call? "Service" does not need to imply "separate process" or "web-service marshalling" (see the sketch below).
I contend that structure is always of value; separation of concerns in your application really matters. In comparison with database activity, a few procedure calls will rarely cost much.
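To make that concrete, a minimal sketch assuming the service names suggested above; the RawData/ProcessedData types and the in-process class are invented. The "service" is just an interface, so a call costs no more than a method invocation, and a remote or queue-backed implementation can be substituted later without changing callers:

```java
// "Service" need not mean a separate process: define the operations as a
// plain interface and the cost of a call is just a method invocation.
interface RawDataService {
    void saveRawData(RawData data);
    RawData getNextRawData();
    void writeProcessedData(ProcessedData data);
}

// An in-process implementation keeps today's performance; a remote or
// queue-backed one can be swapped in later behind the same interface.
class InProcessRawDataService implements RawDataService {
    @Override public void saveRawData(RawData data) { /* INSERT into raw table */ }
    @Override public RawData getNextRawData() { /* SELECT next unprocessed row */ return null; }
    @Override public void writeProcessedData(ProcessedData data) { /* INSERT processed row */ }
}

record RawData(String payload) {}
record ProcessedData(String payload) {}
```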
In passing: persisting the raw data might best be done via a queuing system. You can then get some natural scaling by having many queue readers on separate machines if you need them. In effect, the queuing system naturally introduces some service-like concepts.
Personally, I feel that you might be focusing too much on low-level implementation details when designing the system. Before looking at how to lay out the components, assemblies or services, you should be thinking of how to architect the system.
You could start with the following high level statements from which to build your system architecture around:
Confirm the technical skill set of the development team and the operations/support team.
Agree on an initial finite list of systems that will integrate with your service, the protocols they support and some SLAs.
Decide on the messaging strategy.
Understand how you will deploy your service/system.
Decide on the choice of middleware (ESBs, Message Brokers, etc), databases (SQL, Oracle, Memcache, DB2, etc) and 3rd party frameworks/tools.
Decide on your caching and data latency strategy.
Break your application into the various areas of business responsibility - This will allow you to split up the work and allow easier communication of milestones during development/testing and implementation.
Design each component as required to meet the areas of responsibility. The areas of responsibility should automatically lead you to decide on how to design component, assembly or service.
Obviously not all of the above will match your specific case but I would suggest that they should at least be given some thought.
Good luck.
Abstraction and tiering will introduce latency, but the real question is, what are you GAINING to make the cost(s) worthwhile? Loose coupling, governance, scalability, maintainability are worth real $.
Even the best-designed layered app will exhibit more latency than an app talking directly to a DB. Users who know the original system will feel the difference. They may not like it, so this can be a political issue as much as a technical one.
