Linq To Sql vs Entity Framework Performance - performance

I have been searching for recent performance benchmarks that compare L2S and EF and couldnt find any that tested calling stored procedures using the released version of EF. So, I ran some of my own tests and found some interesting results.
Do these results look right? Should I be testing it in a different way?
One instance of the context, one call of the sproc:
(dead link)
One instance of the context, multiple calls of the same sproc:
(dead link)
Multiple instances of the context, multiple calls of the same sproc:
(dead link)

I think you should test it in a somewhat different way, in order to distinguish startup costs vs. execution costs. The Entity Framework, in particular, has substantial startup costs resulting from the need to compile database views (although you can do this in advance). Likewise, LINQ has a notion of a compiled query, which would be appropriate if executing a query multiple times.
For many applications, query execution costs will be more important than startup costs. For some, the opposite may be true. Since the performance characteristics of these are different, I think it's important to distinguish them. In particular, averaging startup costs into the average cost of a query executed repeatedly is misleading.

This looks to be a pretty measurement of performance between LINQ to SQL and Entity Framework.
http://toomanylayers.blogspot.com/2009/01/entity-framework-and-linq-to-sql.html

I did a couple of test asp.net pages trying to see which performs better. My test was:
Delete 10,000 records
Insert 10,000 records
Edit 10,000 records
Databind the 10,000 records to a GridView and display on the page
I was expecting LinqToSQL to be faster but doing the above LinqToSQL takes nearly 2 minutes while LinqToEntities takes less than 20 seconds.
At least for this test it seems LinqToEntities is faster. My results seem to match yours as well.
I didn't try Inserting/Editing/Deleting/Displaying more than 1 table joined together though.
I'm interested in finding out more... or if my test isn't a valid type of test I'd be interested in seeing some real tests.

Related

Neo4j multi-client massive insertion - REST very poor performance - other ways?

I'm trying to benchmark Neo4j massive insertion in client-server environment.
So far I've found that there are only two ways to do it:
use REST
implement server extension
I can say upfront that our design requires to be able to insert from many concurrently running processes/machines, so using batch insert with direct connection is not an option.
I would also like to avoid having to implement server extension as we already have tight schedule.
I benchmarked massive insertion via REST from just a single client, sending 2 kinds of very simple Cypher queries:
create (vertex:V {guid: {guid}, vtype: {vtype}, random1: {random1}, random2: {random2} })
match (a:V {guid: {a} }) match (b:V {guid: {b} }) create (a)-[:label]->(b)
Guid field had an index.
Results so far are very poor around (10k V + 40k E) in 13 minutes, compared to competing products like Titan or Orient, which provide efficient server out of the box and throughput at around (10k V + 40k E) per 1 minute.
I tried longer lasting transactions, and query parameters, none give any significant gains. Furthermore, the overhead from REST is very small as I tested dummy queries and they execute much much faster (and both client and server are on the same machine). I also tried inserting from multiple threads - performance does not scale up.
I found another StackOverflow question, where advise was to batch inserts into large requests containing thousands of commands and periodically commit.
Unfortunatelly, due to the nature of how we generate the data, batching the requests is not feasible. Ideally we'd like the inserts to be atomic operations and have the results appear as soon as they are executed (no need for transactions in fact).
Thus my questions are:
are my Cypher queries optimal for the insertion?
are the results so far in line with what can be achieved with REST (or can I squeeze much more from REST) ?
are there any other ways to perform efficient multi-client massive insertion?
I have a number of thoughts/questions that don't fit very well in a comment ;)
What version of Neo4j are you using? 2.3 introduced some things which might help
When you say you have an index, do you mean the new style and not the legacy indexes? The newer indexes are created with CREATE INDEX ON :V(guid) and apply to the combination of a label and a property. You can try your queries in the web console prefixed with PROFILE to see if the query is hitting the index and where it might be slow
If you can have the data in a CSV format you might look into the LOAD CSV clause in Cypher. That's also a batch sort of thing, so it might not be as useful
I don't think it would help performance much, but this is a bit nicer to read:
match (a:V {guid: {a} }), (b:V {guid: {b} }) create (a)-[:label]->(b)
I know it's of no help now, but Neo4j 3.0 is planned to have a new compressed binary socket protocol called Bolt which should be an improvement over REST. It's estimated for Q2
I know a lot of these suggestions probably aren't too helpful, but they're things to think about. There's also a public Slack chat for Neo4j here:
http://neo4j.com/blog/public-neo4j-users-slack-group/
I'll share this question there to see if anybody has any ideas
EDIT:
Max DeMarzi passed on one of this articles on queueing requests which might be useful:
http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/
Looks like you'd need to write a bit of Java, but he lays it out for you

Calculating results in a scalable way based on transaction data in web app?

Possible duplicate:
Database design: Calculating the Account Balance
I work with a web app which stores transaction data (e.g. like "amount x on date y", but more complicated) and provides calculation results based on details of all relevant transactions[1]. We are investing a lot of time into ensuring that these calculations perform efficiently, as they are an interactive part of the application: i.e. a user clicks a button and waits to see the result. We are confident, that for the current levels of data, we can optimise the database fetching and calculation to complete in an acceptable amount of time. However, I am concerned that the time taken will still grow linearly as the number of transactions grow[2]. I'd like to be able to say that we could handle an order of magnitude more transactions without excessive performance degradation.
I am looking for effective techniques, technologies, patterns or algorithms which can improve the scalability of calculations based on transaction data.
There are however, real and significant constraints for any suggestion:
We currently have to support two highly incompatible database implementations, MySQL and Oracle. Thus, for example, using database specific stored procedures have roughly twice the maintenance cost.
The actual transactions are more complex than the example transaction given, and the business logic involved in the calculation is complicated, and regularly changing. Thus having the calculations stored directly in SQL are not something we can easily maintain.
Any of the transactions previously saved can be modified at any time (e.g. the date of a transaction can be moved a year forward or back) and calculations are expected to be updated instantly. This has a knock-on effect for caching strategies.
Users can query across a large space, in several dimensions. To explain, consider being able to calculate a result as it would stand at any given date, for any particular transaction type, where transactions are filtered by several arbitrary conditions. This makes it difficult to pre-calculate the results a user would want to see.
One instance of our application is hosted on a client's corporate network, on their hardware. Thus we can't easily throw money at the problem in terms of CPUs and memory (even if those are actually the bottleneck).
I realise this is very open ended and general, however...
Are there any suggestions for achieving a scalable solution?
[1] Where 'relevant' can be: the date queried for; the type of transaction; the type of user; formula selection; etc.
[2] Admittedly, this is an improvement over the previous performance, where an ORM's n+1 problems saw time taken grow either exponentially, or at least a much steeper gradient.
I have worked against similar requirements, and have some suggestions. Alot of this depends on what is possible with your data. It is difficult to make every case imaginable quick, but you can optimize for the common cases and have enough hardware grunt available for the others.
Summarise
We create summaries on a daily, weekly and monthly basis. For us, most of the transactions happen in the current day. Old transactions can also change. We keep a batch and under this the individual transaction records. Each batch has a status to indicate if the transaction summary (in table batch_summary) can be used. If an old transaction in a summarised batch changes, as part of this transaction the batch is flagged to indicate that the summary is not to be trusted. A background job will re-calculate the summary later.
Our software then uses the summary when possible and falls back to the individual transactions where there is no summary.
We played around with Oracle's materialized views, but ended up rolling our own summary process.
Limit the Requirements
Your requirements sound very wide. There can be a temptation to put all the query fields on a web page and let the users choose any combination of fields and output results. This makes it very difficult to optimize. I would suggest digging deeper into what they actually need to do, or have done in the past. It may not make sense to query on very unselective dimensions.
In our application for certain queries it is to limit the date range to not more than 1 month. We have in aligned some features to the date-based summaries. e.g. you can get results for the whole of Jan 2011, but not 5-20 Jan 2011.
Provide User Interface Feedback for Slow Operations
On occasions we have found it difficult to optimize some things to be shorter than a few minutes. We shirt a job off to a background server rather than have a very slow loading web page. The user can fire off a request and go about their business while we get the answer.
I would suggest using Materialized Views. Materialized Views allow you to store a View as you would a table. Thus all of the complex queries you need to have done are pre-calculated before the user queries them.
The tricky part is of course updating the Materialized View when the tables it is based on change. There's a nice article about it here: Update materialized view when urderlying tables change.
Materialized Views are not (yet) available without plugins in MySQL and are horribly complicated to implement otherwise. However, since you have Oracle I would suggest checking out the link above for how to add a Materialized View in Oracle.

when does LINQ outperforms ado.net

we know that linq is a layer built on top on the ado.net stack. it is very nice feature and makes database querying much better but linq is an additional layer and thus it adds some overhead to translate linq queries to sql queries and maps back the results while in ado.net we write the sql queries directly.
my question is when does linq performs faster than using the normal ado.net methods.
When the time saved in writing all those queries in raw SQL and managing all the other translation etc allows you to spend more time on finding performance bottlenecks.
LINQ isn't about outperforming SQL. It's about making code simpler and clearer, so you can concentrate on more important aspects. There may occasionally be times where the natural LINQ expression of query ends up with faster SQL than you'd have come up with yourself - although there are plenty of times the opposite will happen, too. You should still look at the SQL being generated, and profile it accordingly.
You will always be able to beat LINQ backed to a db with a stored procedure accessed from ADO and then either acted on directly or (if you must deal with objects) used to construct a an object with just the amount of data required for the task in hand.
However, LINQ lets us very quickly create a query which returns just that information needed for that task by returning anonymous objects.
To do the same with custom code per query would require either to not stop dealing with ADO at other layers (fraught in several ways) and/or to create a very large amount of objects that duplicate most of their functionality, but share no code.
So, while it can be beaten on performance, it can't be beaten in this case without a lot of rather repetitive code. And it can beat the more natural approach (to return entity objects with bloat we won't use) on performance.
Finally, even in cases where it doesn't win, it can still be faster to write, and clearer hot the operation relates to the way the entities are defined (this latter is the main reason I'm quite fond of it).

Pitfalls in prototype database design (for performance viability testing)

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact to different statements of adding indexes? What's the impact of indexing a 256 byte varchar? Will reporting queries even use your indexes? etc
Now that you've got the right approach, and generally understand how 90% of your model will perform - you're often done forecasting performance with most small to medium size databases. If you've got a huge one, there's a ton at stake, you've got to get more precise (might need to order more hardware), or have a few wacky spots in the design - then focus your tests on just this. Generate reasonable test data - and (important) stats that you'd see in production. And look to see what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate - but you should at least be able to get reasonably close.
Running performance tests against various putative implementation of a conceptual model is not naive so much as heroically forward thinking. Alas I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual for SELECT * FROM my_table WHERE blah but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a column of Animals will perform much slower than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!

One complex query vs Multiple simple queries

What is actually better? Having classes with complex queries responsible to load for instance nested objects? Or classes with simple queries responsible to load simple objects?
With complex queries you have to go less to database but the class will have more responsibility.
Or simple queries where you will need to go more to database. In this case however each class will be responsible for loading one type of object.
The situation I'm in is that loaded objects will be sent to a Flex application (DTO's).
The general rule of thumb here is that server roundtrips are expensive (relative to how long a typical query takes) so the guiding principle is that you want to minimize them. Basically each one-to-many join will potentially multiply your result set so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds generally).
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or doing them as a separately query but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
I don't think that any option is actually better. It depends on your application specific, architecture, used DBMS and other factors.
E.g. we used multiple simple queries with in our standalone solution. But when we evolved our product towards lightweight internet-accessible solution we discovered that our framework made huge number of request and that killed performance cause of network latency. So we sufficiently reworked our framework for using aggregated complex queries. Meanwhile, we still maintained our stand-alone solution and moved from Oracle Light to Apache Derby. And once more we found that some of our new complex queries should be simplified as Derby performed them too long.
So look at your real problem and solve it appropriately. I think that simple queries are good for beginning if there are no strong objectives against them.
From a gut feeling I would say:
Go with the simple way as long as there is no proven reason to optimize for performance. Otherwise I would put the "complex objects and query" approach in the basket of premature optimization.
If you find that there are real performance implications then you should in the next step optimize the roundtripping between flex and your backend. But as I said before: This is a gut feeling, you really should start out with a definition of "performant", start simple and measure the performance.

Resources