I am a student who has recently started learning Spring and JPA. While developing a GET API with conditions, I started wondering which approach is better in terms of performance.
When data has to be queried based on conditions, JPQL or Querydsl is usually used to generate dynamic queries. Can you tell me why generating a dynamic query like this and fetching only the necessary data is better than fetching all the data and then applying the Java Stream filter() function?
Also, can you tell me why generating fewer queries is advantageous in terms of performance?
I know that generating fewer queries has a performance advantage, but I don't really understand why.
Can you tell me why generating a dynamic query like this and looking up only the necessary data is better than using the java stream filter() function after looking up the entire data?
In general, accessing the database or any other external storage is much more expensive than most operations on the Java side because of network latency. If you query all the data and then use e.g. list.stream().filter(), a significant amount of data is transferred over the network. Conversely, if you query only the data filtered on the DB side, the amount transferred is lower.
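The difference in transferred data can be illustrated with a small simulation. This is only a sketch under stated assumptions: the in-memory list stands in for a remote table, and the class FilterPlacementDemo, its Order type, and the rowsTransferred counter are all hypothetical names invented for illustration, not part of JPA or any real driver.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for a remote table; counts rows that "cross the network".
class FilterPlacementDemo {
    static final class Order {
        final int id;
        final String status;
        Order(int id, String status) { this.id = id; this.status = status; }
    }

    static final List<Order> TABLE = new ArrayList<>();
    static int rowsTransferred = 0;

    static {
        // 10,000 rows, of which every 100th is OPEN (100 rows in total).
        for (int i = 0; i < 10_000; i++) {
            TABLE.add(new Order(i, i % 100 == 0 ? "OPEN" : "CLOSED"));
        }
    }

    // Simulates "SELECT * FROM orders": every row is shipped to the client.
    static List<Order> fetchAll() {
        rowsTransferred += TABLE.size();
        return new ArrayList<>(TABLE);
    }

    // Simulates "SELECT * FROM orders WHERE ...": only matching rows are shipped.
    static List<Order> fetchWhere(Predicate<Order> condition) {
        List<Order> result = TABLE.stream().filter(condition).collect(Collectors.toList());
        rowsTransferred += result.size();
        return result;
    }

    static int rowsTransferredWhenFilteringInJava() {
        rowsTransferred = 0;
        List<Order> open = fetchAll().stream()
                .filter(o -> o.status.equals("OPEN"))
                .collect(Collectors.toList());
        return rowsTransferred;
    }

    static int rowsTransferredWhenFilteringInDatabase() {
        rowsTransferred = 0;
        List<Order> open = fetchWhere(o -> o.status.equals("OPEN"));
        return rowsTransferred;
    }

    public static void main(String[] args) {
        System.out.println("filter in Java: " + rowsTransferredWhenFilteringInJava());     // 10000
        System.out.println("filter in DB:   " + rowsTransferredWhenFilteringInDatabase()); // 100
    }
}
```

Both approaches end up with the same 100 open orders, but the first one moves 10,000 rows to get them. In a real application the second variant would correspond to a WHERE clause generated by JPQL or Querydsl.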
Note that while this is true in general, there may be cases where filtering on the Java side is more efficient. This depends heavily on several things:
query complexity
amount of data
database structure (schema, indices, column types etc.)
As for the number of queries, the same considerations apply: each query has execution costs and data transfer costs, so the fewer queries you have, the better. And again, this is not an axiom: in some cases several lightweight queries with grouping/filtering on the Java side might be faster than one huge and complicated SQL query.
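The per-query cost can be made concrete by counting round trips. The following is a rough sketch, not a benchmark: RoundTripDemo, findById, findByIds, and the fixed latency per trip are hypothetical and only model the idea that every query pays a network round trip.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory "users" table that counts simulated round trips.
class RoundTripDemo {
    static final Map<Integer, String> USERS = new HashMap<>();
    static int roundTrips = 0;

    static {
        for (int id = 1; id <= 50; id++) {
            USERS.put(id, "user" + id);
        }
    }

    // One simulated query per id: "SELECT name FROM users WHERE id = ?"
    static String findById(int id) {
        roundTrips++;
        return USERS.get(id);
    }

    // One simulated query for all ids: "SELECT name FROM users WHERE id IN (...)"
    static List<String> findByIds(List<Integer> ids) {
        roundTrips++;
        List<String> names = new ArrayList<>();
        for (int id : ids) {
            names.add(USERS.get(id));
        }
        return names;
    }

    static int tripsOneByOne(List<Integer> ids) {
        roundTrips = 0;
        for (int id : ids) {
            findById(id);
        }
        return roundTrips;
    }

    static int tripsBatched(List<Integer> ids) {
        roundTrips = 0;
        findByIds(ids);
        return roundTrips;
    }

    // Assumes a fixed latency per round trip in milliseconds (illustrative only).
    static long estimatedLatencyMs(int trips, long latencyPerTripMs) {
        return trips * latencyPerTripMs;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>(USERS.keySet());
        long assumedLatencyMs = 2; // assumption, not a measurement
        System.out.println("one by one: " + estimatedLatencyMs(tripsOneByOne(ids), assumedLatencyMs) + " ms");
        System.out.println("batched:    " + estimatedLatencyMs(tripsBatched(ids), assumedLatencyMs) + " ms");
    }
}
```

The model deliberately ignores query execution time, which is why it cannot show the opposite case mentioned above: one very complex query can cost more on the database side than the round trips it saves.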
Related
Is using query.aggregate(pipeline) in mongoDB more efficient than using normal queries such as query.equalTo, or query.greaterThan?
Aggregate queries definitely require much less code, but that alone doesn't seem to justify the complexity they bring with all the additional parentheses and abbreviations.
Normal queries seem more straightforward, but are they inferior in performance? What is a good use case for aggregate queries vs normal ones?
I'm building a REST API in Ruby with JRuby+Sinatra running on top of the Trinidad web server.
One of the functionalities of the API will be getting very large datasets from a database and storing them in a middle caching/non relational DB layer. This is for performing filter/sorting/actions on top of that dataset without having to rebuild it from the database.
We're looking into a good/the best solution for implementing this middle layer.
My thoughts:
Using a non relational database like Riak to store the datasets and having a caching layer (like Cache Money) on top.
Notes:
Our datasets can be fairly large
Since you asked for an opinion, I'll give you mine... I think MongoDB would be a good match for your needs:
http://www.mongodb.org/
I've used it to store large, historical datasets for a couple of years now that just keep getting bigger and bigger, and it remains up to the task. I haven't even needed to delve into "sharding" or some of the advanced features.
The reasons I think it would be appropriate for the application you describe are:
It is an indexed, schemaless document store which means it can be very "dynamic" with fields being added or removed
I've benchmarked its performance against some SQL databases: for large "flat" data it performs orders of magnitude better in some cases.
https://github.com/guyboertje/jmongo will let you access MongoDB from JRuby
We know that LINQ is a layer built on top of the ADO.NET stack. It is a very nice feature and makes database querying much better, but LINQ is an additional layer, so it adds some overhead to translate LINQ queries to SQL queries and map the results back, whereas in ADO.NET we write the SQL queries directly.
My question is: when does LINQ perform faster than the normal ADO.NET methods?
When the time saved in writing all those queries in raw SQL and managing all the other translation etc allows you to spend more time on finding performance bottlenecks.
LINQ isn't about outperforming SQL. It's about making code simpler and clearer, so you can concentrate on more important aspects. There may occasionally be times where the natural LINQ expression of query ends up with faster SQL than you'd have come up with yourself - although there are plenty of times the opposite will happen, too. You should still look at the SQL being generated, and profile it accordingly.
You will always be able to beat LINQ backed by a db with a stored procedure accessed from ADO and then either acted on directly or (if you must deal with objects) used to construct an object with just the amount of data required for the task in hand.
However, LINQ lets us very quickly create a query which returns just that information needed for that task by returning anonymous objects.
To do the same with custom code per query would require either to not stop dealing with ADO at other layers (fraught in several ways) and/or to create a very large amount of objects that duplicate most of their functionality, but share no code.
So, while it can be beaten on performance, it can't be beaten in this case without a lot of rather repetitive code. And it can beat the more natural approach (to return entity objects with bloat we won't use) on performance.
Finally, even in cases where it doesn't win, it can still be faster to write, and clearer about how the operation relates to the way the entities are defined (this latter is the main reason I'm quite fond of it).
What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping or a normalized technique or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
That said, for ridiculous amounts of traffic, good relational db performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
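As a rough illustration of how such a cache takes load off the database, here is a minimal cache-aside sketch. It is an assumption-laden toy: a ConcurrentHashMap stands in for a key-value store such as memcached, the class name CacheAsideDemo is hypothetical, and expiry and invalidation are left out entirely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache-aside wrapper: look in the cache first, fall back to the database.
class CacheAsideDemo {
    // Stand-in for an external key-value store such as memcached.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the (slow) database lookup.
    private final Function<String, String> database;
    private final AtomicInteger databaseHits = new AtomicInteger();

    CacheAsideDemo(Function<String, String> database) {
        this.database = database;
    }

    String get(String key) {
        // computeIfAbsent only calls the loader when the key is not cached yet;
        // a null result from the loader is not stored.
        return cache.computeIfAbsent(key, k -> {
            databaseHits.incrementAndGet();
            return database.apply(k);
        });
    }

    int databaseHits() {
        return databaseHits.get();
    }

    public static void main(String[] args) {
        CacheAsideDemo store = new CacheAsideDemo(key -> "value-for-" + key);
        store.get("a");
        store.get("a"); // served from the cache
        store.get("b");
        System.out.println("database hits: " + store.databaseHits()); // 2
    }
}
```

Three reads result in only two database lookups. A real deployment would also need a strategy for stale entries, which is where most of the difficulty lies.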
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: Define for yourself what high-traffic means:
50.000 Page-Views per day?
500.000 Page-Views per day?
5.000.000 Page-Views per day?
more?
Then calculate this down to probable peak page-views per minute and per second.
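This calculation can be sketched in a few lines. The peak factor (how much busier the peak is than the daily average) is purely an assumption you have to estimate for your own site, and the class name TrafficEstimate is made up for illustration.

```java
// Hypothetical helper for turning page-views per day into peak rates.
class TrafficEstimate {
    static final int SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

    // Average page-views per second if traffic were spread evenly over the day.
    static double averagePerSecond(long pageViewsPerDay) {
        return (double) pageViewsPerDay / SECONDS_PER_DAY;
    }

    // Peak page-views per second, assuming the peak is peakFactor times the average.
    static double peakPerSecond(long pageViewsPerDay, double peakFactor) {
        return averagePerSecond(pageViewsPerDay) * peakFactor;
    }

    static double peakPerMinute(long pageViewsPerDay, double peakFactor) {
        return peakPerSecond(pageViewsPerDay, peakFactor) * 60;
    }

    public static void main(String[] args) {
        long perDay = 5_000_000L;
        double assumedPeakFactor = 3.0; // assumption, not a measured value
        System.out.printf("average/s: %.1f%n", averagePerSecond(perDay));
        System.out.printf("peak/s:    %.1f%n", peakPerSecond(perDay, assumedPeakFactor));
        System.out.printf("peak/min:  %.1f%n", peakPerMinute(perDay, assumedPeakFactor));
    }
}
```

For 5.000.000 page-views per day this gives an average of about 58 per second, which is a far smaller number than the daily total suggests.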
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
A relational database can be, if fully optimized, amazingly fast when joining tables!
A relational database can also be hit only seldom when it is used as a back-end to populate a cache or to fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone for the denormalized table route as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may want to read http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.
Neither matters if you aren't caching properly.
What is actually better? Having classes with complex queries responsible to load for instance nested objects? Or classes with simple queries responsible to load simple objects?
With complex queries you make fewer trips to the database, but the class will have more responsibility.
Or simple queries where you will need to go more to database. In this case however each class will be responsible for loading one type of object.
The situation I'm in is that loaded objects will be sent to a Flex application (DTO's).
The general rule of thumb here is that server roundtrips are expensive (relative to how long a typical query takes) so the guiding principle is that you want to minimize them. Basically each one-to-many join will potentially multiply your result set so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds generally).
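The multiplication effect is simple arithmetic. The sketch below assumes, purely for illustration, that every parent row has exactly the given number of rows in each joined child table and that the child tables are independent of each other; JoinFanOut and joinedRows are hypothetical names.

```java
// Hypothetical estimate of result-set size when joining one-to-many relations.
class JoinFanOut {
    // Rows returned when a parent table is joined to several independent
    // one-to-many child tables, each with a fixed number of children per parent.
    static long joinedRows(long parentRows, int... childrenPerParent) {
        long rows = parentRows;
        for (int children : childrenPerParent) {
            rows *= children;
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(joinedRows(100));        // 100: no join
        System.out.println(joinedRows(100, 5));     // 500: one one-to-many join
        System.out.println(joinedRows(100, 5, 20)); // 10000: two one-to-many joins
    }
}
```

Each additional one-to-many join multiplies the row count, so at some point splitting the query becomes cheaper than transferring the multiplied result set.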
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or fetching it in a separate query, but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
I don't think that either option is actually better. It depends on the specifics of your application, its architecture, the DBMS used and other factors.
E.g. we used multiple simple queries in our standalone solution. But when we evolved our product towards a lightweight, internet-accessible solution, we discovered that our framework made a huge number of requests, and that killed performance because of network latency. So we substantially reworked our framework to use aggregated complex queries. Meanwhile, we still maintained our standalone solution and moved from Oracle Lite to Apache Derby. And once more we found that some of our new complex queries had to be simplified, as Derby took too long to execute them.
So look at your real problem and solve it appropriately. I think that simple queries are a good starting point if there are no strong objections against them.
From a gut feeling I would say:
Go with the simple way as long as there is no proven reason to optimize for performance. Otherwise I would put the "complex objects and query" approach in the basket of premature optimization.
If you find that there are real performance implications then you should in the next step optimize the roundtripping between flex and your backend. But as I said before: This is a gut feeling, you really should start out with a definition of "performant", start simple and measure the performance.