I need to be able to insert/update objects at a consistent rate of at least 8000 objects every 5 seconds in an in-memory HSQL database.
I have done some comparative performance testing between Spring/Hibernate/JPA and pure JDBC, and I have found a significant difference in performance with HSQL. With Spring/Hib/JPA, I can insert 3000-4000 of my 1.5 KB objects (with a one-to-many and a many-to-many relationship) in 5 seconds, while with direct JDBC calls I can insert 10,000-12,000 of those same objects.
I cannot figure out why there is such a huge discrepancy. I have tweaked the Spring/Hib/JPA settings a lot trying to get close in performance, without luck. I want to use Spring/Hib/JPA for future expandability and because the foreign key relationships (one-to-many and many-to-many) are difficult to maintain by hand, but the performance requirements seem to point towards using pure JDBC.
Any ideas of why there would be such a huge discrepancy?
We have similar experience comparing Hibernate with JDBC in batch mode (Statement#executeBatch()). Basically, it seems like Hibernate just doesn't do that well with bulk operations. In our case, the Hibernate implementation was fast enough on our production hardware.
What you may want to do, is to wrap your database calls in a DAO, giving your application a consistent way of accessing your data. Implement your DAOs with Hibernate where it's convenient, and with JDBC where the performance requirements call for it.
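As a rough illustration of that split, here is a minimal sketch (the DAO interface, the `Event` entity and the SQL are hypothetical, not taken from the question): Hibernate handles the convenient path, while the JDBC implementation uses `PreparedStatement` batching for the performance-critical path.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

import org.hibernate.Session;
import org.hibernate.SessionFactory;

// Minimal example entity (hypothetical).
class Event {
    private long id;
    private String payload;
    long getId() { return id; }
    String getPayload() { return payload; }
}

// Hypothetical DAO contract: callers don't care which persistence mechanism is used.
interface EventDao {
    void saveAll(List<Event> events);
}

// Convenient path: Hibernate/JPA handles the mapping and relationships.
class HibernateEventDao implements EventDao {
    private final SessionFactory sessionFactory;

    HibernateEventDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Override
    public void saveAll(List<Event> events) {
        Session session = sessionFactory.getCurrentSession();
        for (Event e : events) {
            session.save(e);
        }
    }
}

// Performance-critical path: plain JDBC with batched inserts.
class JdbcEventDao implements EventDao {
    private final Connection connection;

    JdbcEventDao(Connection connection) {
        this.connection = connection;
    }

    @Override
    public void saveAll(List<Event> events) {
        try (PreparedStatement ps =
                 connection.prepareStatement("INSERT INTO event (id, payload) VALUES (?, ?)")) {
            for (Event e : events) {
                ps.setLong(1, e.getId());
                ps.setString(2, e.getPayload());
                ps.addBatch();          // accumulate rows client-side
            }
            ps.executeBatch();          // send the whole batch in one round trip
        } catch (SQLException ex) {
            throw new RuntimeException(ex);
        }
    }
}
```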
As a minimum, you need to do batch inserts in Hibernate: http://www.hibernate.org/hib_docs/reference/en/html/batch.html Saves a lot of round-trip time.
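A minimal sketch of the batching pattern from that chapter (the entity name is a placeholder, and it assumes `hibernate.jdbc.batch_size` is set to a matching value, e.g. 50):

```java
import java.util.List;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

// Batch-insert pattern: flush and clear the session periodically so the
// first-level cache doesn't grow unbounded and JDBC batching can kick in.
// Assumes hibernate.jdbc.batch_size=50 in the configuration.
public class BatchInsertExample {
    private static final int BATCH_SIZE = 50;

    public void insertAll(SessionFactory sessionFactory, List<MyEntity> entities) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            int i = 0;
            for (MyEntity entity : entities) {
                session.save(entity);
                if (++i % BATCH_SIZE == 0) {
                    session.flush();   // push the batch of INSERTs to the database
                    session.clear();   // detach the flushed entities from the session
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}

class MyEntity { /* placeholder for your mapped entity */ }
```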
And, as Justice mentioned, the primary goal of Hibernate is not machine performance but developer performance. Having said that, it's usually possible to achieve performance comparable to JDBC (not equal, but not that much worse).
Never use one technology for all problems.
Choose the technology based on the problem at hand.
Of course JPA or Hibernate is slower than JDBC; JDBC operates at a lower level than JPA.
Also, a database professional using JDBC can write more optimized SQL than JPA will generate.
If you have a critical point where speed is required, JPA is not your choice.
Hibernate maintains a first-level cache of objects to use in dirty checking as well as to act as a Unit of Work and Identity Map. This adds to the overhead, especially in bulk-type operations. For bulk operations, you may want to investigate StatelessSession, which doesn't maintain this state.
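A minimal sketch of the StatelessSession approach (the entity name is a placeholder); note that a stateless session bypasses the first-level cache, cascades and most event listeners:

```java
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

// Bulk insert via a StatelessSession: no first-level cache, no dirty checking,
// entities go more or less straight to JDBC.
public class StatelessBulkInsert {
    public void insertAll(SessionFactory sessionFactory, Iterable<MyEntity> entities) {
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            for (MyEntity entity : entities) {
                session.insert(entity);   // immediate INSERT, nothing is tracked afterwards
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}

class MyEntity { /* placeholder for a mapped entity */ }
```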
All that mapping ... it can get a little bit expensive, with all the arcane logic and all the reflection and consistency-checking that it has to do.
The point of mapping is not to boost performance, of course. Typically, you take a performance hit. But what you lose in performance, you (can) gain many times over in developer productivity, consistency, testability, reliability, and so many more coveted attributes. Typically, when you need the extra performance and you don't want to give up mapping, you drop in some more hardware.
We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database, and while calling another service we need to fetch data from the DB (based on a mapping) and call that service.
So, to avoid the database hit, we want to keep all mapping data in a cache and use it to retrieve the data and call the service API.
So our ask is: add data to the cache when it is created in the database (this could add up to millions of records) and remove it from the cache when the status in one of the columns becomes "xyz" (for example), or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of "don't build it until you need it"; however, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it to an unprepared system is always possible, but much more expensive (in terms of hours) and complicated.
OK, on to the actual question; disclaimer: I'm a Hazelcast employee.
In general, Hazelcast, Ehcache, Redis and others are all good caching candidates. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep the full data set in memory. Depending on the size you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition the data across a cluster and optimize access down to a single network hop, with minimal overhead beyond network latency.
The second question would be around serialization. Do you want to store information in a highly optimized serialization format (which needs code to transform it into something human-readable), or do you want to store it as JSON?
The third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, at the cost of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses: either multithreaded and nicely scaling, or a single-thread design which becomes a bottleneck once you exhaust that thread. The latter can be worked around with more processes, but it is just that: a workaround to utilize today's systems to the fullest.
In more general terms, each of the systems you mentioned would do the job. The best tool, however, should be selected via a POC / prototype and your real-world use case. The important bit is "real world", as a single thread behaves amazingly under low pressure (obviously way faster), but when exhausted it becomes a major bottleneck (again, obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
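For reference, a minimal sketch of the partitioned-map approach described above, assuming the Hazelcast 3.x API; the map name, key/value types and TTL are made up for illustration:

```java
import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Each cluster member owns a partition of the entries, so the data set can
// exceed a single machine's heap; reads are at most one network hop away.
public class MappingCacheExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, String> mappings = hz.getMap("mappings");

        // Populate the cache when a record is created in the database.
        mappings.put(42L, "{\"service\":\"billing\",\"target\":\"xyz\"}",
                     30, TimeUnit.MINUTES);   // per-entry TTL as a simple eviction policy

        // Remove an entry explicitly when its status column changes.
        mappings.remove(42L);

        hz.shutdown();
    }
}
```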
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html
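If you go that route, the cached table can then be read and written from Java with any memcached client. A minimal sketch using the spymemcached library; the host, port and key naming here are assumptions, and the actual key-to-column mapping comes from the plugin configuration described in the linked documentation:

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

// Talks the memcached protocol to the InnoDB memcached plugin (default port 11211),
// so these gets/sets go to the underlying InnoDB table without SQL parsing.
public class InnodbMemcachedExample {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        client.set("mapping:42", 0, "some-value");   // 0 = no expiry
        Object value = client.get("mapping:42");
        System.out.println(value);

        client.shutdown();
    }
}
```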
We are moving our database from Oracle to CouchDB, and one of the use cases is to implement distributed transaction management.
For example: read data from a JMS queue and update it in multiple documents; if anything fails, revert everything and throw an exception back to the JMS queue.
As we know, CouchDB does not support distributed transaction management.
Can you please suggest any alternative strategy to implement this or any other way out?
More than the technical aspects, I feel you might be interested in the bottom line of this.
As mentioned, distributed transactions are not possible; the notion doesn't even exist, because it is not necessary. Indeed, unlike in the relational world, 95% of the time when you feel that you need them it means that you are doing something wrong.
I'll be straightforward with you: dumping your relational data into CouchDB will end up being a nightmare both for writes and for reads. For the former you'll ask: how can I do transactions? For the latter: how can I do joins? Both are impossible and are concepts which do not even exist.
The convenient conclusion too many people reach is that "CouchDB is not enterprise-ready or ACID enough". But the truth is that you need to take the time to rethink your data structures.
You need to rethink your data structures and make them document-oriented, because if you don't you are outside the intended usage of CouchDB, and as you know that is risky territory.
Read up on DDD and aggregate design, and turn your records into DDD entities and aggregates. So there'd be an ETL layer into CouchDB. If you don't have the time to do that, I'd recommend not using CouchDB, as much as I love it.
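To make that concrete, a small hypothetical sketch: instead of spreading an order over several relational rows, the whole aggregate becomes one CouchDB document, so a single document write is atomic and no cross-document transaction is needed. Jackson is used here only to show the JSON shape; the class and field names are made up.

```java
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

// An aggregate in the DDD sense: the order and its lines live and change together,
// so they are stored as ONE document and updated with ONE (atomic) document write.
class Order {
    public String _id;            // CouchDB document id
    public String status;
    public List<OrderLine> lines; // embedded in the document, not joined

    static class OrderLine {
        public String productId;
        public int quantity;
    }
}

public class AggregateDocumentExample {
    public static void main(String[] args) throws Exception {
        Order order = new Order();
        order._id = "order:2024-0001";
        order.status = "NEW";
        order.lines = List.of(new Order.OrderLine());

        // The JSON printed below is what would be PUT to CouchDB as a single document.
        System.out.println(new ObjectMapper().writeValueAsString(order));
    }
}
```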
CouchDB doesn't have the properties that are necessary for distributed transactions, so it's impossible. All major distributed transaction algorithms (the two-phase commit protocol, RAMP and Percolator-style distributed transactions; you can find details in this answer) require linearizability at the record level. Unfortunately CouchDB is an AP solution (in the CAP theorem sense), so it can't even guarantee record-level consistency.
Of course you can disable replication to make CouchDB consistent, but then you'll lose fault tolerance. Another option is to use CouchDB as a storage layer and build a consistent database on top of it, but that is overkill for your task and doesn't use any CouchDB-specific feature. The third option is to use CRDTs, but that only works if your transactions are commutative.
When there is only one writer to a Berkeley DB, is it worth using transactions?
Do transactions cause a significant slowdown? (In percent, please.)
You use transactions if you require the atomicity that they provide. Perhaps you need to abort the transaction, undoing everything in it? Or perhaps you need the semantic that should the application fail, a partially completed transaction is aborted. Your choice of transactions is based on atomicity, not performance. If you need it, you need it.
If you don't need atomicity, you may not need durability. Then, that is significantly faster!
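To make the atomicity point concrete, here is a minimal sketch using the Berkeley DB Java Edition API; the DB_INIT_TXN/DB_INIT_CDB flags discussed in the next answer belong to the core (C) edition, so JE is used here purely for illustration, and the paths and key names are made up:

```java
import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.Transaction;

// A write wrapped in a transaction: either both puts land, or neither does.
public class BdbTxnExample {
    public static void main(String[] args) {
        File dir = new File("/tmp/bdb-env");
        dir.mkdirs();

        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);               // enable the transactional subsystem
        Environment env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database db = env.openDatabase(null, "example", dbConfig);

        Transaction txn = env.beginTransaction(null, null);
        try {
            db.put(txn, new DatabaseEntry("a".getBytes()), new DatabaseEntry("1".getBytes()));
            db.put(txn, new DatabaseEntry("b".getBytes()), new DatabaseEntry("2".getBytes()));
            txn.commit();                               // both records become durable together
        } catch (RuntimeException e) {
            txn.abort();                                // undo the partial work
            throw e;
        }

        db.close();
        env.close();
    }
}
```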
Transactions with DB_INIT_TXN in Berkeley DB are not significantly slower than the other models, although generally maintaining a transactional log requires all data to be written to the log before being written to the database.
For a single writer and multiple readers, try the DB_INIT_CDB model because the code is much simpler. Locks in the INIT_CDB model are per-table, so overall throughput might be worse than with an INIT_TXN model because of coarse-grained per-table lock contention.
Performance will depend on access patterns more than on whether one uses the DB_INIT_TXN or DB_INIT_CDB model.
Please let me know the best practices for providing application concurrency in a software project. I would like to use Hibernate for ORM, Spring to manage the transactions, and MySQL as the database.
My concurrency requirement is to let as many users as possible connect to the database, perform CRUD operations, and use the services, but I do not want stale data.
How do I handle data concurrency issues in the DB?
How do I handle application concurrency? What if two threads access my object simultaneously; will it corrupt the state of my object?
What are the best practices?
Do you recommend defining isolation levels on Spring methods, e.g. @Transactional(isolation = Isolation.READ_COMMITTED)? How do I make that decision?
I have come up with the items below and would like to know your feedback and how to address them:
a. How do I handle data concurrency issues in the DB?
Use a version tag and timestamp.
b. How do I handle application concurrency?
Provide optimistic locking. Don't use synchronization; instead, create objects for each request (prototype scope).
c. What are the best practices?
Cache objects whenever possible.
By using transactions, and @Version fields in your entities if needed.
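A minimal sketch of the @Version approach (the entity and fields are made up): the provider adds the version column to the UPDATE's WHERE clause and throws an optimistic-locking exception if another transaction changed the row in the meantime.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

// Optimistic locking: the version number is checked and incremented on every update.
// If two transactions load version 5 and both try to save, the second UPDATE matches
// no row (the version is already 6) and an optimistic-locking exception is thrown.
@Entity
public class Account {

    @Id
    private Long id;

    private java.math.BigDecimal balance;

    @Version
    private int version;   // managed entirely by Hibernate/JPA; never set it yourself

    // getters/setters omitted for brevity
}
```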
Spring beans are singletons by default, and are thus shared by threads. But they're usually stateless, and thus inherently thread-safe. Hibernate entities shouldn't be shared between threads, and they aren't unless you explicitly do it yourself: each transaction, running in its own thread, will have its own Hibernate session, loading its own entity instances.
Much too broad to be answered.
The default isolation level of your database is usually what you want. It is READ_COMMITTED with most databases I know about.
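If you do want to be explicit about it, the isolation level can be set per transactional method; the service and method names below are hypothetical:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Isolation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    // READ_COMMITTED (the usual database default): no dirty reads,
    // but non-repeatable reads are still possible within the transaction.
    @Transactional(isolation = Isolation.READ_COMMITTED)
    public void placeOrder(long customerId) {
        // load the customer, insert the order, update stock ... all in one transaction
    }
}
```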
Regarding your point c (cache objects whenever possible), I would say that this is exactly what you shouldn't do. Caching makes your app stateful, difficult to cluster, much more complex, and you'll have to deal with the staleness of the cache. Don't cache anything until
you have a performance problem
you can't tweak your algorithms and queries to solve the performance problem
you have proven that caching will solve the performance problem
you have proven that caching won't cause problems worse than the performance problem
Databases are fast, and already cache data in memory for you.
I am working on developing an application which handles about 100,000 searches every day. We can safely assume that there are about the same number of updates/insertions/deletions in the database daily. The current application uses native SQL, and we intend to migrate it to Hibernate and use Hibernate Search.
As there are continuous changes in the database records, we need to enable automatic indexing. The management has concerns about the performance impact automatic indexing can cause.
It is not possible to have a scheduled batch indexing as the changes in the records have to be available for search as soon as they are changed.
I have searched for some kind of performance statistics but have found none.
Can anybody who has already worked on Hibernate Search and faced a similar situation share their thoughts?
Thanks for the help.
Regards,
Shardul.
It might work fine, but it's hard to guess without a baseline. I have experience with even more searches / day and after some fine tuning it works well, but it's impossible to know if that will apply for your scenario without trying it out.
If normal tuning fails and NRT doesn't prove fast enough, you can always shard the indexes, use a multi-master configuration, and plug in a distributed second-level cache such as Infinispan: all combined, the architecture can achieve linear scalability, provided you have the time to set it up and reasonable hardware.
It's hard to say what kind of hardware you will need, but it's a safe bet that it will be more efficient than native SQL solutions. I would suggest making a POC and seeing how far you can get on a single node; if the kind of queries you have are a good fit for Lucene, you might not need more than a single server. Beware that Lucene is much faster at queries than at updates, so since you estimate you'll have the same amount of writes and searches, the problem is unlikely to be the number of searches/second but rather the writes (updates)/second and the total data (index) size. The latest Hibernate Search introduced an NRT index manager, which suits such use cases well.
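For orientation, a minimal Hibernate Search sketch, assuming Hibernate Search 4.x-style annotations; the entity and field names are made up. With automatic indexing enabled, the Lucene index is updated as part of the same flow as the database change, and searches run against Lucene rather than SQL scans:

```java
import java.util.List;

import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.lucene.search.Query;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Automatic indexing: every insert/update/delete of Product through Hibernate
// also updates the Lucene index, so search results reflect changes right away
// (or near-real-time with the NRT index manager).
@Entity
@Indexed
class Product {
    @Id
    Long id;

    @Field
    String name;

    @Field
    String description;
}

public class SearchExample {
    @SuppressWarnings("unchecked")
    public List<Product> search(Session session, String keywords) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        Query luceneQuery = fullTextSession.getSearchFactory()
                .buildQueryBuilder().forEntity(Product.class).get()
                .keyword().onFields("name", "description").matching(keywords)
                .createQuery();
        return fullTextSession.createFullTextQuery(luceneQuery, Product.class).list();
    }
}
```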