Tableau Online slow performance

I have a Redshift database with a huge quantity of data, and we use a live connection in Tableau Online. The problem is that the load times are really high, almost five minutes. What can I try to improve this?

There are hundreds of factors that impact performance. I would start with workbook optimization; see Workbook Performance Tips.

I have been in a similar situation myself, although with a different DB. Basically you have to choose between a live connection to your DB, which, as you've seen, can suffer performance issues when you have a lot of data, or an extract.
Tableau wants you to use extracts because this is where they can really help you improve workbook performance over large data sets, but I have been in situations where there was a requirement for live data and Tableau's extract schedules did not suit my needs.
If you have no option but to use a live connection, then consider whether you could partition your data and connect the workbook to just a part of it to improve performance, or possibly pre-aggregate some of the historical data to make it more manageable.
It may also be worth thinking about whether you need the whole dashboard to connect to live data, or whether you could feed live data via a smaller query to a couple of workbooks and have the rest feed off extracted data.
As I'm sure you can see, there is no one-stop solution; it depends on what works best for you and the users of your reports.

Related

How can I speed up PouchDB?

Since replacing MongoDB with PouchDB in my Ionic app, the app feels a little sluggish, and I would like to know if there is a way to speed it up. The database we are talking about currently contains fewer than 100 documents and is slow even when usage is purely local. We are using secondary indexes. Is this the cause of the performance drop? Would we be better off using allDocs() and then searching through the results manually? I read that it would be faster, but the posts were over a year old and things may have changed since then. I also tried using the websql adapter, but it didn't really affect the speed. Are there other adapters or things I could try?
On such a small database, a secondary index would not be faster than allDocs() in my experience, but I would not expect the performance difference to be noticeable either way (I have used both on a small local database). You might try "compacting" the databases regularly if you have not already, as this can make the database smaller and more efficient. Like you, I have tried different adapters (IndexedDB and websql) but could not see much difference in speed.

What is the best way to extract big data to file?

I am using Oracle as a DBMS and Tuxedo for application server.
Customer has the need to export data from Oracle to SAMFILE for interface purpose.
Unfortunately, the total number of records is huge (over 10 million), so
I was wondering what is the best practice to extract big amounts of data to a file on the database server.
I am used to creating a cursor and fetching a record then writing to file.
Is there a better i.e. faster way to handle this? It is a recurring task.
I suggest you read Adrian Billington's article on tuning UTL_FILE. It covers all the bases. Find it here.
The important thing is buffering records, thereby reducing the number of file I/O calls. You will need to benchmark the different implementations to see which works best in your situation.
Pay attention to his advice on query performance. Optimising file I/O is pointless if most of the time is spent on data acquisition.
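If you end up driving the extract from a client program rather than from UTL_FILE on the server, the same buffering principle applies. Here is a minimal C#/ADO.NET sketch, not a definitive implementation; the connection string, query, column layout and the Oracle.ManagedDataAccess provider are all assumptions. The point is one large buffered writer instead of an I/O call per record.

```csharp
// Sketch only: export rows with buffered file I/O instead of a write per record.
// Connection string, query and the Oracle.ManagedDataAccess provider are assumptions.
using System.IO;
using System.Text;
using Oracle.ManagedDataAccess.Client;

class Export
{
    static void Main()
    {
        using var conn = new OracleConnection("User Id=app;Password=...;Data Source=ORCL");
        conn.Open();

        using var cmd = new OracleCommand("SELECT id, name, amount FROM big_table", conn);
        using var reader = cmd.ExecuteReader();

        // One StreamWriter with a large buffer: rows accumulate in memory and are
        // flushed in big chunks, so the number of file I/O calls stays small.
        using var writer = new StreamWriter("export.dat", false, Encoding.UTF8, bufferSize: 1 << 20);
        var line = new StringBuilder();
        while (reader.Read())
        {
            line.Clear();
            line.Append(reader.GetValue(0)).Append('|')
                .Append(reader.GetValue(1)).Append('|')
                .Append(reader.GetValue(2));
            writer.WriteLine(line.ToString());
        }
    }   // the writer flushes its buffer and closes here
}
```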

Datareader or Dataset in Winform with long trips to database?

I've got a Winform app that will be used in the US and China. The SQL Server 2005 database is in the US, so the data access is going to be slower for the people in China. I'm deciding between using a DataReader and a Dataset for best performance. The data will immediately be loaded into business objects upon retrieval.
Question: Which performs better (DataReader/DataSet) pulling data from a database that's far away? I have read that the DataReader goes back to the database for each .Read(), so if the connection is slow to begin with, will the DataSet be a better choice here?
Thanks
The performance difference between a DataReader and a DataSet will barely be measurable compared to the database round trips if you're expecting long-distance/slow links.
DataSets might use more memory though, which might be a concern if you're dealing with a lot of data.
It depends on the amount of data: you cannot store too large an amount in memory (DataSet).
Two approaches for your problem come to mind:
- parallelisation (System.Threading)
- BackgroundWorkers
The first can improve performance in LINQ to SQL cases. The second can give end users a better experience (a non-blocked UI).
I think it doesn't matter, since the connection is the bottleneck.
Both use the same mechanism to fetch the data (ADO.NET/DataReader).
If you can, you might compress the query result on the server and THEN send it to the client. That would improve performance.
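As a rough sketch of that idea, assuming a service boundary where the server can hand the client a byte array: serialize the DataSet to XML and gzip it before it crosses the slow link, then inflate it on the client. The names here are made up for illustration.

```csharp
// Sketch only: compress a query result on the server, decompress on the client.
// The service boundary and method names are assumptions for illustration.
using System.Data;
using System.IO;
using System.IO.Compression;

static class ResultTransfer
{
    // Server side: serialize the DataSet to XML and gzip it before sending.
    public static byte[] Compress(DataSet ds)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
        {
            ds.WriteXml(gzip, XmlWriteMode.WriteSchema);
        }
        return buffer.ToArray();
    }

    // Client side: inflate the payload and rebuild the DataSet.
    public static DataSet Decompress(byte[] payload)
    {
        using var input = new MemoryStream(payload);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        var ds = new DataSet();
        ds.ReadXml(gzip, XmlReadMode.ReadSchema);
        return ds;
    }
}
```

The server would call Compress on the DataSet right after filling it; the client calls Decompress and maps the result into its business objects as before.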
It depends on what database it is; it's a bad situation if it is Access.
It also depends on how much data is moved around and on the usage style: will users in China just read, or read/write common data? Do they need to see all the data?
The idea is to separate the data (if it helps the scenario) and merge it back.
It doesn't matter which you choose, since the DataSet uses a DataReader to fill itself. Try to avoid calling the DB wherever possible, by caching results or by getting extra data. A few calls that get extra data will probably outperform a lot of small pecks at the tables. Maybe a BackgroundWorker could preload some data that you know you will be using.
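As a minimal sketch of that preloading idea (LoadLookupTables and _lookupCache are made-up names for this example), a BackgroundWorker can pull reference data on a worker thread while the form stays responsive:

```csharp
// Sketch: preload data on a BackgroundWorker so the UI stays responsive while it arrives.
// LoadLookupTables() and _lookupCache are hypothetical names for this example.
using System.ComponentModel;
using System.Data;
using System.Windows.Forms;

public class MainForm : Form
{
    private readonly BackgroundWorker _preloader = new BackgroundWorker();
    private DataSet _lookupCache;

    public MainForm()
    {
        _preloader.DoWork += (s, e) => e.Result = LoadLookupTables();               // worker thread
        _preloader.RunWorkerCompleted += (s, e) => _lookupCache = (DataSet)e.Result; // back on UI thread
        _preloader.RunWorkerAsync();   // start preloading as soon as the form is created
    }

    private DataSet LoadLookupTables()
    {
        // Hypothetical: one round trip that pulls several small reference tables at once.
        var ds = new DataSet();
        // ... fill ds from the database here ...
        return ds;
    }
}
```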
Just for other readers: the DataReader is MUCH more performant. Obviously, these users have not tried using both and actually tested the difference. Load 1,000 records with a DataReader and 1,000 with a DataSet. Then try limiting the records for the DataSet to 10 records (using the adapter's Fill method so that the 1,000 are loaded, but only 10 are populated/filled into the DataSet).
I really don't know why DataSets perform so badly during the fill operation, but the difference is huge. It's much faster to create your own collection and fill it with a DataReader than to use the very bloated and slow DataSet.
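If you want to check this against your own schema, a rough benchmark is easy to throw together. Here is a minimal sketch (the connection string, table name and query are placeholders) timing a raw SqlDataReader loop against SqlDataAdapter.Fill into a DataSet:

```csharp
// Rough benchmark sketch: DataReader loop vs DataAdapter.Fill into a DataSet.
// The connection string and query are placeholders; adjust for your own schema.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;

class ReaderVsDataSet
{
    const string ConnStr = "Server=...;Database=...;Integrated Security=true";
    const string Query = "SELECT TOP 1000 * FROM SomeTable";

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(Query, conn))
        {
            conn.Open();
            using var reader = cmd.ExecuteReader();
            while (reader.Read()) { /* map into your business object here */ }
        }
        Console.WriteLine($"DataReader: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        using (var conn = new SqlConnection(ConnStr))
        using (var adapter = new SqlDataAdapter(Query, conn))
        {
            var ds = new DataSet();
            adapter.Fill(ds);   // loads all 1,000 rows into the in-memory DataSet
        }
        Console.WriteLine($"DataSet:    {sw.ElapsedMilliseconds} ms");
    }
}
```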

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is too early to give much detail about this project, but some overview follows:
Lots of writes and lots of reads on the same tables, very real-time
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
We foresee lots and lots of usage of aggregate queries
Each row of data may contain lots of attributes to deal with
I am suggesting/considering the following as a solution:
Use a distributed hash table sort of persistence (not S3 but an in-house one)
Use Hadoop/Hive or the like (any replacement in .NET?) for any analytical processing across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of ajaxification, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external harddrive and sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
Simulate that workload on a non-relational or cloud database and you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
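To make that query shape concrete from the .NET side, here is a hedged sketch; the fact/dimension table and column names, and the SQL Server connection, are invented for the example:

```csharp
// Sketch of the "SUM over fact JOIN dimension GROUP BY attribute" query shape.
// Table/column names and the connection string are invented for the example.
using System;
using System.Data.SqlClient;

class DailySales
{
    static void Main()
    {
        const string sql = @"
            SELECT d.calendar_date, SUM(f.amount) AS total_amount
            FROM   sales_fact f
            JOIN   date_dim   d ON d.date_key = f.date_key
            GROUP BY d.calendar_date";

        using var conn = new SqlConnection("Server=...;Database=dw;Integrated Security=true");
        using var cmd = new SqlCommand(sql, conn);
        conn.Open();
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetDateTime(0):d}  {reader.GetDecimal(1)}");
    }
}
```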
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PKs and FKs using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
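If a datamart happens to live in SQL Server, the load can also be driven from .NET with SqlBulkCopy. This is only a sketch (the file layout, table name and connection string are assumptions), and the RDBMS's own bulk loader is usually at least as fast:

```csharp
// Sketch: bulk-load a pipe-delimited flat file into a fact table with SqlBulkCopy.
// File layout, table name and connection string are assumptions for the example.
using System.Data;
using System.Data.SqlClient;
using System.IO;

class BulkLoad
{
    static void Main()
    {
        // Stage the flat file rows in a DataTable (fine for a sketch; stream for very large files).
        var table = new DataTable();
        table.Columns.Add("date_key", typeof(int));
        table.Columns.Add("product_key", typeof(int));
        table.Columns.Add("amount", typeof(decimal));

        foreach (var line in File.ReadLines("sales_fact.dat"))
        {
            var parts = line.Split('|');
            table.Rows.Add(int.Parse(parts[0]), int.Parse(parts[1]), decimal.Parse(parts[2]));
        }

        using var conn = new SqlConnection("Server=...;Database=dw;Integrated Security=true");
        conn.Open();
        using var bulk = new SqlBulkCopy(conn) { DestinationTableName = "sales_fact", BatchSize = 50_000 };
        bulk.WriteToServer(table);   // one streamed bulk insert instead of row-by-row INSERTs
    }
}
```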
Get Ralph Kimball's Data Warehouse Toolkit books.
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need to go using a lot of third-party tools: use the RDBMS's tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures over a well-designed physical data model (indexes, partitions, etc.).
Teradata can handle a lot of data and is scalable.

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this; obviously it needs to be reliable, and it should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema, for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing multiple threads/processes to handle the actual storage of the events (see the sketch after this list).
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
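Here is the queueing sketch mentioned above. WebEvent and SaveBatch are hypothetical placeholders: web-facing threads enqueue events cheaply, and one background writer drains the queue and stores events in batches, so storage latency never blocks capture.

```csharp
// Sketch: decouple event capture from storage with an in-process producer/consumer queue.
// WebEvent and SaveBatch are hypothetical placeholders for this example.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class WebEvent
{
    public DateTime OccurredAtUtc { get; set; }
    public string Type { get; set; }
    public string Payload { get; set; }
}

class EventCapture
{
    private readonly BlockingCollection<WebEvent> _queue = new BlockingCollection<WebEvent>(100_000);

    // Called by the web-facing threads: cheap, never touches the database.
    public void Capture(WebEvent e) => _queue.Add(e);

    // Call at shutdown so the writer can drain what's left and exit.
    public void Stop() => _queue.CompleteAdding();

    // One background consumer drains the queue and writes in batches.
    public Task StartWriter() => Task.Run(() =>
    {
        var batch = new List<WebEvent>(500);
        foreach (var e in _queue.GetConsumingEnumerable())
        {
            batch.Add(e);
            if (batch.Count >= 500)
            {
                SaveBatch(batch);   // hypothetical bulk insert into the transactional schema
                batch.Clear();
            }
        }
        if (batch.Count > 0) SaveBatch(batch);
    });

    private void SaveBatch(List<WebEvent> batch)
    {
        // ... bulk insert / SqlBulkCopy the batch here ...
    }
}
```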
I'm surprised none of the answers here cover Hadoop and HDFS; I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and large processing times, you would use HDFS (a distributed file system on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data crunching requirements are) and run MapReduce queries against your data to produce reports.
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: buy best-in-class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how it would fit into a data storage strategy. The processing requirement is limited, and processing is where EC2 is strong; your primary goal is efficient storage and retrieval.
