I'm working on a business intelligence project for banking transactions. After completing the ETL phase, my supervisor asked me to research the difference between the tabular and multidimensional models and decide which one is more suitable for our needs. After choosing to work with the tabular model, I noticed that I have to choose between Import and Live Connection to connect Power BI to our model.
So here are the questions that have come to my mind:
* How and when does the tabular model use memory?
* How and when does Power BI Import mode use memory?
* What exactly should I import into Power BI from my tabular model?
* Does Import mode import the model that is already held in the memory cache, or something else?
* How much memory do I need if the size of my data warehouse database is approximately 7 GB?
NB: I'm still not too familiar with Power BI, so maybe I'm asking the questions in the wrong context.
I would be grateful if anyone could help me with this.
I tried to use Import mode to import my whole model, but I always run into memory problems.
Should I use live connection instead?
Your question isn't clear, so here are a few options for you.
SSAS Tabular, Azure Analysis Services (AAS) and Power BI use the same underlying engine for the tabular model, the VertiPaq engine. Power BI is a superset of SSAS Tabular and currently has more focus from the internal project team; Microsoft is currently trying to move customers from AAS to Power BI.
> my Data Warehouse DB is approximately 7GB
Importing the data will create a copy of the data from the data source and hold it in memory. The dataset will not have a 1-to-1 relationship in size, as the VertiPaq engine will compress the data down, so you will have to test this.
However, you don't just have to plan for sufficient memory to hold the dataset; you also have to remember that memory is used when querying the data. For example, a FILTER function basically returns a table, and that intermediate table is held in memory until the results of the measure are computed and returned. Memory is also used when dataflows are being processed, even though they write to blob storage and are not held in memory. There is a data model size restriction of 1 GB for Power BI Pro; the size restrictions are larger for Power BI Premium.
DirectQuery and Live Connection have a far lower memory overhead than importing, as they do not hold the full data model, just the result set generated and returned by the data source. In most cases this will be quite small, but if you are returning detailed data, it will take up more memory. In DirectQuery mode you can also use aggregations to store a subset of the data in Power BI rather than querying the data source.
If you are using SSAS Tabular/AAS, you should not really use Import mode in Power BI, as you'll be building the measures and data model twice; use Live Connection instead. If you wish to connect Power BI straight to the data source, use DirectQuery, but you have to ensure that your data source can respond to the queries generated by Power BI quickly, so it should be in a star schema, indexed, and have enough scale to handle queries quickly.
I am using a database of about 500 GB. I want to visualize different columns to study the relationships between them using Power BI. However, there are performance issues while loading graphs.
I am using DirectQuery (DQ) mode.
It's annoying to wait 10 minutes for each visual to load.
Could anyone tell me if it's a good idea to use Power BI for visualisation/making dashboards out of 500 GB of data?
What is the maximum database size we can use in DQ mode to create visuals efficiently?
DQ doesn't have a defined limit; MS have shown demos using a petabyte database. In this case, for long-running queries on a database, you have a few options:
* Understand what queries are being run and optimise your indexing strategy, for example by adding a covering index (see the SQL sketch after this list)
* Optimise your data source by using a columnstore index to move it into memory
* Create a database or table(s) with the necessary subset of data from your main data
* Examine what objects are being used, and remove nested logic, views on top of views, scalar conditions, etc.
The petabyte example by MS also used aggregation mode (mentioned by WB in their answer) to store a subset of the data.
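As a rough illustration of the first two options, here is a minimal T-SQL sketch; the table and column names are hypothetical and the exact syntax assumes SQL Server:

```sql
-- Option 1: covering index on a hypothetical fact table.
-- INCLUDE carries the columns the report returns, so the query
-- can be answered from the index without touching the base table.
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
    ON dbo.FactSales (OrderDateKey)
    INCLUDE (CustomerKey, SalesAmount, Quantity);

-- Option 2 (alternative): clustered columnstore index.
-- Column-wise, compressed storage suits the scan-and-aggregate
-- queries that DirectQuery tends to generate.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
    ON dbo.FactSales;
```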
I have used DirectQuery to sit over data sources in the 200 GB range, but these have mostly been standard star-schema data warehouses or a defined reporting table, both of which had the relevant indexes, covering indexes, or columnstore indexes to allow more efficient retrieval of data. DirectQuery mode will slow down due to the number of queries it has to run on the data source based on the measures, the relationships, and the connection overhead. Another factor can be the number of visuals on a page, as each visual is a query and each one has to run on the data source.
You might want to look at aggregations in Power BI. You can basically import aggregate tables into Power BI that satisfy the needs of most of your visuals, and resort to DirectQuery for the details that you might rarely need. When properly configured, aggregations will be cached, and visuals that hit the aggregation will make use of that, while those that don't will seamlessly query the DQ source.
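For example, a minimal sketch of the kind of aggregate table you might build in the source database and then import (the table and column names are hypothetical):

```sql
-- Hypothetical pre-aggregated table: one row per day per product,
-- far smaller than the detail-level fact table it summarises.
CREATE TABLE dbo.FactSales_DailyAgg (
    OrderDateKey int            NOT NULL,
    ProductKey   int            NOT NULL,
    SalesAmount  decimal(19, 4) NOT NULL,
    OrderCount   int            NOT NULL
);

INSERT INTO dbo.FactSales_DailyAgg (OrderDateKey, ProductKey, SalesAmount, OrderCount)
SELECT OrderDateKey,
       ProductKey,
       SUM(SalesAmount),
       COUNT(*)
FROM dbo.FactSales
GROUP BY OrderDateKey, ProductKey;
```

In Power BI you would import this table, leave the detail fact table in DirectQuery, and map the two in the Manage aggregations dialog so that visuals at day/product grain hit the imported copy.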
Also, the VertiPaq engine with its columnar store is quite efficient at compressing data. So given some smart modelling (get rid of unneeded high-cardinality columns), you might actually end up with a much smaller model than your original data if you import everything.
Your mileage may vary.
As to the dataset limit itself, I believe it's 1GB/dataset when uploading to the service.
What is the (relative) performance of the various Power BI data sources?
Specifically between SharePoint Online, Azure Blob and Azure Data Lake.
We're looking at pushing some data into one of these for consumption by Power BI.
As these are classed as file sources, you will be limited to importing data, to a 1 GB dataset size, and to a refresh frequency of 8 times a day.
It will depend on the volume and type. If it is CSV files, there is not much between Blob and Data Lake; a base read of 1 x 1 GB, without any transformations, takes about 5-8 minutes. For multiple files, it will depend on the number of files.
For SharePoint, will it be a list or documents in a library? From testing, about 30,000 items in a list can take about 20-30 minutes, but again it will depend on the structure, for example how wide it is.
If you are pushing data into something and it has a known structure, use an Azure SQL Database instead; then you can use DirectQuery, so the data is always up to date.
We are using ElasticSearch for search capability in our product. This works fine.
Now we want to provide self-service business intelligence to our customers. Reporting on the operational database sucks due to the performance impact: at run time, calculating the average 'order resolution time' across 10 million records would not return results in time. The traditional way is to create a data mart by loading the operational data using ETL and summarizing it, and then use a reporting engine to offer metrics and reports to customers. This approach works but increases the total cost of ownership for our customers.
I am wondering if anybody has used ElasticSearch as the intermediate data store for reporting. Can Kibana serve the data exploration and visualization need?
We have the same needs.
Tools like Qlik, Power BI, and Tableau require you to grow the overall infrastructure stack, and when you are designing a solution to ship to customers without the possibility of sharing anything, they may not be the best possible option in terms of both cost and complexity.
I have used DevExtreme by DevExpress. Its server-side approach using a custom store is very efficient for handling and performing operations on large amounts of data. With MySQL and MSSQL databases, I have myself performed grouping, sorting, filtering, and summaries on 10 million rows using DevExtreme.
Apache Superset seems to be an answer. https://superset.apache.org/docs/intro
We are currently interested in evaluating Datameer and have a few questions. Are there any Datameer users who can answer them:
Since Datameer works off HDFS, are the querying speeds similar to those of Hive? How does the querying speed compare with columnar databases?
Since Hadoop is known for high latency, is it advisable to use Datameer for real-time querying?
Thank you.
Ravi
Regarding 1:
Query speeds are comparable to Hive.
But Datameer is a lot faster in the design phase of your "query". Datameer provides a real-time preview of what the results of your "query" would look like, which happens in memory and not on the cluster. The preview is based on a representative sample of your data. It's only a preview, not the final results, but it gives you constant feedback on whether your analytics make sense while designing.
To test a Hive query you would have to execute it, which makes the design process very slow.
Datameer's big advantages over Hive are:
* Loading data into Hadoop is much easier. No static schema creation, no ETL, etc. Just use a wizard to download data from your database, log files, social media, etc.
* Designing analytics or making changes is a lot faster and can even be done by non-technical users.
* No need to install anything else, since Datameer includes all you need for importing, analytics, scheduling, security, visualization, etc. in one product.
If you have real-time requirements, you should not pull data directly out of Datameer, Hive, Impala, etc. Columnar storage makes some processing faster but will still not be low latency. But you can use those tools together with a low-latency database: use Datameer/Hive/Impala for the heavy lifting to filter and pre-aggregate big data into smaller data, and then export that into a database. In Datameer you can set this up very easily using one of Datameer's wizards.
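For illustration only, the heavy-lifting step boils down to a pre-aggregation query along these lines (generic SQL; the table and column names are hypothetical), whose much smaller result would then be exported to the low-latency serving database:

```sql
-- Pre-aggregate a large raw order table down to one row per
-- customer per day; only this compact result set is exported
-- to the low-latency database for real-time reporting.
SELECT customer_id,
       CAST(created_at AS DATE)  AS order_date,
       COUNT(*)                  AS order_count,
       AVG(resolution_seconds)   AS avg_resolution_seconds
FROM raw_orders
GROUP BY customer_id, CAST(created_at AS DATE);
```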
Hope this helps,
Peter Voß (Datameer)
I am going to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is very early to give much detail about this project, but here is an overview:
* Lots of writes and lots of reads on the same tables, very real-time
* Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
* Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented
* Each row of data may contain lots of attributes to deal with
I am suggesting the following as a solution:
* Use a distributed hash table sort of persistence (not S3 but an in-house one)
* Use Hadoop/Hive-like tools (any replacement in .NET?) for any analytical processing across the nodes
* Implement the GUI in ASP.NET/Silverlight (with lots of Ajax, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
To simulate that workload on a non-relational or cloud database, you might well end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
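A minimal sketch of that query shape, with hypothetical fact and dimension names:

```sql
-- Star-schema aggregation: scan the fact table, join to one
-- dimension, and group by a dimension attribute.
SELECT d.region_name,
       SUM(f.sales_amount) AS total_sales,
       COUNT(*)            AS order_count
FROM fact_sales AS f
JOIN dim_region AS d
  ON d.region_key = f.region_key
GROUP BY d.region_name;
```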
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
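As a sketch of that bulk-load step, assuming SQL Server's BULK INSERT and a hypothetical table and file path:

```sql
-- Load a pre-processed flat file (keys already assigned) straight
-- into the fact table using the RDBMS's own bulk-load tool.
BULK INSERT dbo.fact_sales
FROM 'C:\staging\fact_sales.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- allow a minimally logged load
);
```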
Get Ralph Kimball's Data Warehouse Toolkit books.
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures over a well-designed physical data model (indexes, partitions, etc.).
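For instance, a minimal sketch of the kind of stored procedure such a GUI would call (all names here are hypothetical):

```sql
-- The GUI calls this procedure instead of sending ad hoc SQL,
-- keeping the aggregation logic in the database next to the
-- indexes and partitions that support it.
CREATE PROCEDURE dbo.usp_GetDailySales
    @FromDate date,
    @ToDate   date
AS
BEGIN
    SET NOCOUNT ON;

    SELECT d.calendar_date,
           SUM(f.sales_amount) AS total_sales
    FROM dbo.fact_sales AS f
    JOIN dbo.dim_date   AS d
      ON d.date_key = f.date_key
    WHERE d.calendar_date BETWEEN @FromDate AND @ToDate
    GROUP BY d.calendar_date;
END;
```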
Teradata can handle a lot of data and is scalable.