DataReader or DataSet in WinForms with long trips to the database? - performance

I've got a WinForms app that will be used in the US and China. The SQL Server 2005 database is in the US, so data access is going to be slower for the people in China. I'm deciding between a DataReader and a DataSet for best performance. The data will be loaded into business objects immediately upon retrieval.
Question: Which performs better (DataReader/DataSet) pulling data from a database that's far away? I have read that the DataReader goes back to the database for each .Read(), so if the connection is slow to begin with, will the DataSet be a better choice here?
Thanks

The performance difference between a DataReader and a DataSet will barely be measurable compared to the database round trips if you're expecting long-distance/slow links.
DataSets might use more memory though, which might be a concern if you're dealing with a lot of data.
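To make the trade-off concrete, here is a minimal sketch of loading one result set straight into business objects with a DataReader, which is effectively what a DataSet fill does internally anyway. The Customer class, the Customers table, and the LoadCustomers helper are illustrative assumptions, not anything from the question:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Hypothetical business object used only for illustration.
    class Customer
    {
        public int Id;
        public string Name;
    }

    static List<Customer> LoadCustomers(string connectionString)
    {
        var customers = new List<Customer>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id, Name FROM Customers", conn))
        {
            conn.Open();
            // One round trip: the rows are streamed back over the open
            // connection, and each Read() just advances through that stream -
            // it does not issue a new query against the server.
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    customers.Add(new Customer
                    {
                        Id = reader.GetInt32(0),
                        Name = reader.GetString(1)
                    });
                }
            }
        }
        return customers;
    }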

It depends on the amount of data: you cannot store too large an amount in memory (in a DataSet).
Two approaches to your problem come to mind:
- parallelisation (System.Threading)
- BackgroundWorker
The first can improve performance in LINQ to SQL scenarios. The second helps end users have a better experience (a non-blocked UI); see the sketch below.
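A rough sketch of the BackgroundWorker idea, assuming the LoadCustomers helper from the earlier sketch, an available connection string, and a hypothetical customerBindingSource on the form; the slow call runs off the UI thread and the result is bound when it completes:

    using System.ComponentModel;
    using System.Windows.Forms;

    // Inside the Form: kick off the load without freezing the UI.
    void LoadCustomersAsync(string connectionString)
    {
        var worker = new BackgroundWorker();
        worker.DoWork += (s, e) =>
        {
            // Runs on a thread-pool thread, so the long round trip to the
            // US database does not block the UI thread.
            e.Result = LoadCustomers(connectionString);
        };
        worker.RunWorkerCompleted += (s, e) =>
        {
            // Back on the UI thread here, so it is safe to touch controls.
            if (e.Error == null)
                customerBindingSource.DataSource = e.Result;   // hypothetical BindingSource
        };
        worker.RunWorkerAsync();
    }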

I think it doesn't matter since the connection is the bottleneck.
Both use the same mechanism to fetch the data (ADO.NET/DataReader).
If you can, compress the query result on the server and then send it to the client. That would improve performance over a slow link.
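SQL Server 2005 has no built-in result compression, so in practice this means putting a small middle tier (e.g. a web service) next to the database and shipping a compressed payload to the remote clients. A hedged sketch of that idea, compressing a binary-serialized DataSet with GZipStream (the middle tier itself is an assumption):

    using System.Data;
    using System.IO;
    using System.IO.Compression;
    using System.Runtime.Serialization.Formatters.Binary;

    // Server/middle-tier side: binary-serialize and gzip the DataSet.
    static byte[] CompressDataSet(DataSet ds)
    {
        ds.RemotingFormat = SerializationFormat.Binary;   // far smaller than XML
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                new BinaryFormatter().Serialize(gzip, ds);
            }
            return ms.ToArray();
        }
    }

    // Client side: reverse the process after the slow transfer.
    static DataSet DecompressDataSet(byte[] payload)
    {
        using (var ms = new MemoryStream(payload))
        using (var gzip = new GZipStream(ms, CompressionMode.Decompress))
        {
            return (DataSet)new BinaryFormatter().Deserialize(gzip);
        }
    }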

It depends on what database it is - it would be a bad situation if it were Access.
It also depends on how much data is moved around and on the usage pattern: will the users in China only read, or will they read and write common data? Do they need to see all of the data?
The idea is to separate the data (if it helps the scenario) and merge it back.

It doesn't matter which you choose, since the DataSet uses a DataReader to fill itself. Try to avoid calling the db wherever possible, by caching results or by getting extra data. A few calls that get extra data will probably outperform a lot of small pecks at the tables. Maybe a BackgroundWorker could preload some data that you know you will be using.
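For example, a naive in-memory cache (the cache key and the LoadCustomers helper from the earlier sketch are illustrative) so that repeated requests for the same reference data skip the round trip entirely:

    using System.Collections.Generic;

    static readonly Dictionary<string, List<Customer>> Cache =
        new Dictionary<string, List<Customer>>();

    static List<Customer> GetCustomersCached(string connectionString)
    {
        List<Customer> result;
        if (!Cache.TryGetValue("customers", out result))
        {
            result = LoadCustomers(connectionString);   // one round trip
            Cache["customers"] = result;                // reused afterwards
        }
        return result;
    }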

Just for other readers: the DataReader is MUCH more performant. Obviously, these users have not tried using both and actually tested the difference. Load 1,000 records with a DataReader and 1,000 with a DataSet. Then try limiting the DataSet to 10 records (using the adapter's Fill overload so that the 1,000 are read, but only 10 are populated into the DataSet).
I really don't know why DataSets perform so badly during the fill operation, but the difference is huge. It's much faster to create your own collections and fill them with a DataReader than to use the very bloated and slow DataSet.
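If you would rather check this claim against your own schema and network than take anyone's word for it, a rough Stopwatch comparison along these lines (table and column names are illustrative) will settle it for your data:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Diagnostics;

    static void CompareFillTimes(string connectionString)
    {
        const string sql = "SELECT TOP 1000 Id, Name FROM Customers";

        var sw = Stopwatch.StartNew();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read()) { /* materialize business objects here */ }
            }
        }
        Console.WriteLine("DataReader: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        using (var adapter = new SqlDataAdapter(sql, connectionString))
        {
            var table = new DataTable();
            adapter.Fill(table);
        }
        Console.WriteLine("DataTable fill: {0} ms", sw.ElapsedMilliseconds);
    }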

Related

Tableau Online slow performance

I have a Redshift database with a huge quantity of data, and we use a live connection and Tableau Online. The problem is that the load times are really high - almost five minutes. What can I try to improve this?
There are hundreds of factors that impact performance. I would start with workbook optimization. Workbook Performance Tips
I have been in a similar situation myself, although with a different DB. Basically you have to make a choice between a live connection to your DB, which, as you've seen, can suffer performance issues if you have a lot of data, or an extract.
Tableau wants you to use extracts because this is where they can really help you improve the performance of a workbook over large data sets, but I have been in situations where there was a requirement for live data and Tableau's extract schedules did not suit my needs.
If you have no options but to use a live connection then consider whether you could partition your data and connect the workbook to a part of it to improve performance, or possibly pre-aggregate some of the historical data to make it more manageable.
It may also be worth thinking about whether you need the whole dashboard to connect to live data, or whether you could feed live data via a smaller query to a couple of workbooks and have the rest feed off extracted data.
As I'm sure you can see, there is no one-stop solution, it depends what works best for you and the users of your reports.

Use Vertica Database for OLTP data?

Can Vertica Database be used for OLTP data?
And if so what are the pros and cons on doing this?
Looking for a Vertica vs Oracle fight :) Since Oracle licensing is so costly, would Vertica do the job for a better price?
Thanks all
Using Vertica as a transactional database is a bad idea. It's designed to be a data warehousing tool. Essentially, it reads and writes data in an optimized fashion. Lots of transactions? That's not what it is designed to do.
I would recommend that you look into VoltDB. Michael Stonebraker, who is the force behind Vertica, founded that company as well. His basic philosophy is that Oracle, SQL Server, et al. do not do well for high performance because they are designed to do everything. The future is databases designed for specific tasks.
So he had some concepts for data warehousing which became Vertica. For transactional databases, there's VoltDB. Not owned by HP, for the record.
For the record, I haven't used VoltDB. From what I know, it isn't as mature as Vertica is as a solution but it looks like it has a ton of promise.
HP Vertica is a column store database. The nature of the way that data is organised within a column store does not lend itself to rapid writes.
HP Vertica gets around this by having a WOS (Write Optimised Store) and ROS (Read Optimised Store which is file based).
Data is moved out of the WOS into the ROS fairly rapidly and the ROS itself has a "merge up" process that takes small ROS files and merges them together to form larger and therefore more easily scanned files.
If you tried to use Vertica for OLTP then what would happen would be that you'd get loads of ROS containers and possibly hit the default limit of 1024 ROS containers very quickly.
If you fronted the store with some form of queuing mechanism to pass records through in larger batches, then this would result in fewer, larger ROS files. It would work, but if your OLTP system needs to read data very soon after writing it, it would not fit the use case.
The WOS/ROS mechanism is a neat workaround for the fundamental write-performance penalty of a column store, but fundamentally Vertica is not an OLTP DB; rather, it is a data mart technology that can ingest data in near real time.
I think there are different ways to read into this question.
Can you use Vertica as an OLTP database?
First I'll define this question a bit. An OLTP database means the database itself is responsible for the transaction processing, not simply receiving somewhat normalized data.
My answer here is absolutely not, unless perhaps it is a single user database. There is practically no RI, no RI locking, table locks on DELETE/UPDATE, and you're likely to accumulate a delete vector in normal OLTP type usage.
You can work around some of these with some extensive middleware programming (distributed locks, heavy avoidance of DELETE/UPDATE, etc). But why? There are tons of options out there that are not Oracle, don't carry a huge price tag but give you everything you need for OLTP.
Can you use Vertica to ingest and query OLTP data?
Yes, definitely. Best to use Vertica towards its strengths, though. Queries in Vertica tend to have a fair amount of overhead, and you can plow through large amounts of data with ease, even normalized. I would not use Vertica primarily to run point queries, grabbing a few rows here and there. It isn't that you can't, but you can't do it with the same concurrency as other databases that are meant for this purpose.
TL;DR Use the right tool for the right job. I really love using Vertica, but just because I like to swing a hammer doesn't mean that every problem is a nail.
This question is a little old now, but I'll share my experience.
I would not suggest Vertica for OLTP unless you very carefully consider your workload.
As mentioned in other answers, Vertica has 2 types of storage. ROS is the Read Optimized Storage and WOS is the Write Optimized Storage. WOS is purely in memory, so it performs better for inserts but queries more slowly, as all the small updates need to be queried and unioned. Vertica can handle small loads in theory, but in practice it didn't work out very well for us performance-wise. There are also drawbacks to the WOS, namely that when the database fails, the WOS is not necessarily preserved when it rolls back to the last good epoch. (The ROS isn't either, but in practice you lose a lot less from the ROS.)
ROS is a lot more reliable and gives better read performance, but you will never be able to handle more than a certain number of queries without a careful design. Although Vertica is horizontally scalable, in practice large tables get segmented across all nodes, and therefore queries must run on all nodes. So adding more nodes doesn't mean handling more concurrent queries; it just means less work per query. If your tables are small enough to be unsegmented, then this might not be an issue for you.
Also worth noting is that OLTP typically implies lots of concurrent transactions, so you'll need to plan resource pools very carefully. By default, Vertica sets the planned concurrency of the general resource pool to the minimum of the number of cores per server and RAM/2GB. Essentially, this value determines the default memory allocation PER NODE for a segmented query. Therefore, by default Vertica will not let you run more concurrent queries than you have cores. You can adjust this value, but once you hit a cap on memory there is not much you can do, because the memory is allocated per node, so adding more nodes doesn't even help. If you hit any errors at all for resource pool memory allocations, that is the first setting you should look at.
Additionally, Vertica is bad with deletes and updates (which resolve to a delete and an insert in the background), so if these are a regular part of your workload then Vertica is probably a bad choice. Personally, we use MySQL for our dimension tables that require deletes/updates and then periodically sync that data into Vertica to use for joins.
Personally, I use Vertica as an OLTP-ish, realtime-ish database. We batch our loads into 5-minute intervals, which keeps Vertica happy in terms of how many and how large the inserts are. These batches are inserted using COPY DIRECT so that they avoid the WOS entirely (only do this if they are large batches, as this forces ROS container creation and can be bad if you do it too often). As many projections as possible are unsegmented to allow better scale-out, since this makes queries hit only one node and allocate memory on only one node. It has worked well for us so far, and we load about 5 billion rows a day with realtime querying from our UI.
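For illustration, the batch-load step above boils down to something like this. It's a sketch only: the DSN, the events table, and the staged file path are assumptions, and it presumes the Vertica ODBC driver is installed on the loading machine:

    using System.Data.Odbc;

    // Every few minutes, load the staged batch file straight into ROS.
    // DIRECT skips the WOS, which is what you want for large batches.
    static void LoadBatch(string verticaOdbcConnectionString, string stagedCsvPath)
    {
        using (var conn = new OdbcConnection(verticaOdbcConnectionString))
        using (var cmd = conn.CreateCommand())
        {
            conn.Open();
            // Path is concatenated for brevity; validate it in real code.
            cmd.CommandText =
                "COPY events FROM LOCAL '" + stagedCsvPath + "' DELIMITER ',' DIRECT";
            cmd.ExecuteNonQuery();
        }
    }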
Up_one - considering the telecom use-case - are you doing CDR or something else?
To answer your original question: yes, Vertica may be a great fit, but it depends on how you are loading the data, how you are doing updates, what your data size is, and what your SLA is. I am really familiar with this space because I implemented Vertica at a telecom I worked for at the time.

What makes Oracle more scalable?

Oracle seems to have a reputation for being more scalable than other RDBMSes. After working with it a bit, I can say that it's more complex than other RDBMSes, but I haven't really seen anything that makes it more scalable than other RDBMSes. But then again, I haven't really worked on it in a whole lot of depth.
What features does Oracle have that are more scalable?
Oracle's RAC architecture is what makes it scalable: it can load balance across nodes, and parallel queries can be split up and pushed to other nodes for processing.
Some of the tricks, like loading blocks from another node's buffer cache instead of going to disk, make performance a lot more scalable.
Also, the maintainability of RAC with rolling upgrades help make the operation of a large system more sane.
There is also a different aspect of scalability - storage scalability. ASM makes increasing the storage capacity very straightforward. A well-designed ASM-based solution should scale past the hundreds-of-terabytes mark without needing to do anything very special.
Whether these make Oracle more scalable than other RDBMSs, I don't know. But I think I would feel less happy about trying to scale up a non-Oracle database.
Cursor sharing is (or was) a big advantage over the competition.
Basically, the same query plan is used for matching queries. An application will have a standard set of queries it issues (e.g. get the orders for this customer id). The simple way is to treat every query individually: if you see 'SELECT * FROM ORDERS WHERE CUSTOMER_ID = :b1', you look at whether table ORDERS has an index on CUSTOMER_ID, etc. As a result, you can spend as much work looking up metadata to get a query plan as actually retrieving the data. With simple keyed lookups, a query plan is easy. Complex queries with multiple tables joined on skewed columns are harder.
Oracle has a cache of query plans, and older/less used plans are aged out as new ones are required.
If you don't cache query plans, there's a limit to how smart you can make your optimizer as the more smarts you code into it, the bigger impact you have on each query processed. Caching queries means you only incur that overhead the first time you see the query.
The 'downside' is that for cursor sharing to be effective you need to use bind variables. Some programmers don't realise that, write code that doesn't get shared, and then complain that Oracle isn't as fast as MySQL.
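As an illustration of the bind-variable point (a sketch assuming ODP.NET, i.e. Oracle.ManagedDataAccess, an already-open connection, and made-up table and column names): because the SQL text is identical on every call, Oracle can reuse the cached plan instead of hard-parsing a new statement per customer id.

    using System;
    using Oracle.ManagedDataAccess.Client;

    static void PrintOrders(OracleConnection conn, int customerId)
    {
        // The :customer_id bind variable keeps the statement text constant,
        // so the shared cursor/plan is reused across calls.
        using (var cmd = new OracleCommand(
            "SELECT order_id, order_date FROM orders WHERE customer_id = :customer_id",
            conn))
        {
            cmd.Parameters.Add(new OracleParameter("customer_id", customerId));
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}  {1}", reader.GetValue(0), reader.GetValue(1));
            }
        }
    }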
Another advantage of Oracle is the UNDO log. As a change is made, the 'old version' of the data is written to an undo log. Other databases keep old versions of the record in the same place as the record, which requires VACUUM-style cleanup operations or you bump into space and organisation issues. This is most relevant in databases with high update or delete activity.
Also, Oracle doesn't have a central lock registry; a lock bit is stored on each individual data record. SELECT doesn't take a lock. In databases where SELECT locks, you can have multiple users reading data and locking each other or preventing updates, introducing scalability limits. Other databases would lock a record when a SELECT was done to ensure that no one else could change that data item (so it would be consistent if the same query or transaction looked at the table again). Oracle uses UNDO for its read-consistency model (i.e. looking up the data as it appeared at a specific point in time).
Tom Kyte's "Expert Oracle Database Architecture" from Apress does a good job of describing Oracle's architecture, with some comparisons with other RDBMSs. Worth reading.

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is too early to give much detail about this project, but here is an overview:
Lots of writes and lots of reads on the same tables, very real-time
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeably, lots and lots of aggregate queries will be implemented
Each row of data may contain lots of attributes to deal with
I am considering the following as a solution:
Use a distributed-hash-table sort of persistence layer (not S3, but an in-house one)
Use something like Hadoop/Hive (is there a .NET replacement?) for any analytical processing across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of Ajax, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (Red Gate's is good), you can simulate that kind of workload quite easily.
Simulating that workload on a non-relational or cloud database may mean writing your own tooling.
"Foreseeably, lots and lots of aggregate queries will be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PKs and FKs using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
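In the .NET context of the question, one concrete form of "the RDBMS's tools" is SqlBulkCopy. A hedged sketch, with an assumed dbo.FactSales destination table and a DataTable of pre-keyed rows:

    using System.Data;
    using System.Data.SqlClient;

    // Stream a pre-keyed batch of fact rows into the datamart in one bulk
    // operation instead of issuing row-by-row INSERTs.
    static void LoadFactBatch(string connectionString, DataTable factRows)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.FactSales";
            bulk.BatchSize = 10000;      // commit in chunks
            bulk.BulkCopyTimeout = 0;    // no timeout for large loads
            bulk.WriteToServer(factRows);
        }
    }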
Get Ralph Kimball's Data Warehouse Toolkit books.
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need a lot of third-party tools - use the RDBMS's own tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures on top of a well-designed physical data model (indexes, partitions, etc.).
Teradata can handle a lot of data and is scalable.

Is there a major performance gain by using stored procedures?

Is it better to use a stored procedure, or to do it the old way with a connection string and all that good stuff? Our system has been running slow lately, and our manager wants us to see if we can speed things up a little, so we were thinking about changing some of the old database calls over to stored procedures. Is it worth it?
The first thing to do is check that the database has all the necessary indexes set up. Analyse where your code is slow, and examine the relevant SQL statements and the indexes relating to them. See if you can rewrite the SQL statement to be more efficient. Check that you aren't re-preparing an SQL statement for every iteration of a loop instead of once outside it (see the sketch below).
Moving an SQL statement into a stored procedure isn't going to help if it is grossly inefficient in implementation. However, the database will know how best to optimise the SQL and won't need to do it repeatedly. It can also make the client-side code cleaner by turning a complex SQL statement into a simple procedure call.
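For instance, a minimal sketch of the "prepare once, execute many" point (the Products table, the parameters, and the assumption that the connection is already open are all illustrative):

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static void UpdatePrices(SqlConnection conn, IDictionary<int, decimal> newPrices)
    {
        // Assumes conn is already open.
        using (var cmd = new SqlCommand(
            "UPDATE Products SET Price = @price WHERE ProductId = @id", conn))
        {
            cmd.Parameters.Add("@price", SqlDbType.Money);
            cmd.Parameters.Add("@id", SqlDbType.Int);
            cmd.Prepare();   // parsed and planned once, outside the loop

            foreach (var pair in newPrices)
            {
                cmd.Parameters["@price"].Value = pair.Value;
                cmd.Parameters["@id"].Value = pair.Key;
                cmd.ExecuteNonQuery();   // reuses the prepared statement
            }
        }
    }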
I would take a quick look at Stored Procedures are EVIL.
So long as your calls are consistent, the database will cache the execution plan (MS SQL, anyway). The strongest remaining reason for using stored procedures is easy and reliable security management.
If I were you, I'd first look at adding indexes where required. Also run a profiling tool to examine what is taking long and whether the SQL needs to be changed, e.g. adding more WHERE clauses or restricting the result set.
You should consider caching where you can.
Stored procedures will not make things faster.
However, rearranging your logic will have a huge impact. The tidy, focused transactions that you design when thinking of stored procedures are hugely beneficial.
Also, stored procedures tend to use bind variables, whereas application code sometimes relies on building SQL statements on the fly. A small, fixed set of SQL statements with bind variables is fast; dynamic SQL statements are slow.
An application which is "running slow lately" does not need coding changes.
Measure. Measure. Measure. "slow" doesn't mean much when it comes to performance tuning. What is slow? Which exact transaction is slow? Which table is slow? Focus.
Control all change. All. What changed? OS patch? RDBMS change? Application change? Something changed to slow things down.
Check for constraints in scale. Is a table slowing down because 80% of the data is history that you use for reporting once a year?
Stored procedures are never the solution to performance problems until you can absolutely point to a specific block of code which is provably faster as a stored procedure.
Stored procedures can really help if they avoid sending huge amounts of data and/or avoid round trips to the server, so they can be valuable if your application has one of those problems.
After you finish your research, you will realise there are two extreme views at opposite ends of the spectrum. Historically, the Java community has been against stored procs due to the availability of frameworks such as Hibernate; conversely, the .NET community has used stored procs more, a legacy that goes back as far as the VB5/6 days. Put all this information in context and stay away from the extreme opinions on either side of the coin.
Speed should not be the primary factor in deciding for or against stored procs. You can achieve stored-proc performance using inline SQL with Hibernate and other frameworks. Consider maintenance and which other programs, such as reports and scripts, could use the same stored procs as your application. If your scenario requires multiple consumers of the same SQL code, stored procedures are a good candidate and maintenance will be easier. If this is not the case and you decide to use inline SQL, consider externalizing it in config files to facilitate maintenance.
At the end of the day, what counts is what will make your particular scenario a success for your stakeholders.
If your server is getting noticeably slower in your busy season, it may be because of saturation rather than anything inefficient in the database. Basic queuing theory tells us that a server gets hyperbolically slower as it approaches saturation.
The basic relationship is 1/(1-X) where X is the proportion of load. This describes the average queue length or time to wait before being served. Therefore a server that is getting saturated will slow down very rapidly when the load spikes.
A server that is 25% loaded will have an average service time of 1.333K for some constant K (loosely, K is the time for the machine to perform one transaction). A server that is 50% loaded will have an average service time of 2K and a server that is 90% loaded will have an average service time of 10K. Given that the slowdowns are hyperbolic in nature, it often doesn't take a large change in overall load to produce a significant degradation in response time.
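Restated as a formula (with \rho as the load fraction and K as the single-transaction service time, matching the numbers above):

    T(\rho) = \frac{K}{1 - \rho}, \qquad T(0.25) \approx 1.33K, \quad T(0.5) = 2K, \quad T(0.9) = 10K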
Obviously this is somewhat simplistic as the server will be processing multiple requests concurrently (there are more elaborate queuing models for this situation), but the broad principle still applies.
So, if your server is experiencing transient loads that are saturating it, you will experience patches of noticeable slow-down. Note that these slow-downs need only be in one bottlenecked area of the system to slow the whole process down. If you are only experiencing this now in a busy season there is a possibility that your server has simply hit a constraint on a resource, rather than being particularly slow or inefficient.
Note that this possibility is not antithetical to the possibility of inefficiencies in the code. You may find that the way to ease the bottleneck is to tune some of your queries.
In order to tell if the system is bottlenecked, start gathering profiling information. If you can find resources with a large number of waits, this should give you a good starting point.
The final possibility is that you need to upgrade your server. If there are no major inefficiencies in the code (this might well be the case if profiling doesn't indicate any disproportionately large bottlenecks) you may simply need bigger hardware. I have no idea what your volumes are, but don't discount the possibility that you may have outgrown your server.
Yes, stored procs are a step towards achieving good performance. The main reason is that stored procedures can be pre-compiled and their execution plans cached.
However, you first need to analyse where your performance bottlenecks really are, so that you approach this exercise in a structured way.
As suggested in one of the other responses, try to analyse where the problem is using a profiler tool - e.g. do you need to create indexes?
Cheers
Like all of the above posts suggest, you first want to clean up your SQL statements and have appropriate indexes. Caching can be tricky; I can't comment without more detail on what you are trying to accomplish.
But one thing about sprocs: make sure you don't let them generate dynamic SQL statements,
because for one it defeats the purpose, and it can be subject to SQL injection attacks - this happened in one of the projects I looked into.
I would recommend sprocs mainly for updates, and then for select statements.
good luck :)
You can never say in advance. You must do it and measure the difference because in 9 out of 10 cases, the bottleneck is not where you think.
If you use a stored procedure, you don't have to transmit intermediate data back and forth. On the other hand, DBs are usually slow at executing complex stored procedures (with loops, higher math, etc.). So it really depends on how much work you would need to do, how slow your network is, how fast the DB executes this particular code, and so on.