What is the (relative) performance of the various Power BI data sources?

What is the (relative) performance of the various Power BI data sources?
Specifically, between SharePoint Online, Azure Blob and Azure Data Lake.
We're looking at pushing some data into one of these for consumption by Power BI.

As these are classed as file sources, you will be limited to importing data, to a 1GB dataset size, and to a refresh frequency of 8 times a day.
It will depend on the volume and type. If it is CSV files, there is not much between Blob and Data Lake: a base read of a single 1GB file takes about 5-8 minutes, without any transformations. For multiple files, it will depend on the number of files.
For SharePoint, will it be a list, or documents in a library? In my testing, about 30,000 items in a list can take about 20-30 minutes, but again it will depend on the structure, for example how wide the list is.
If you are pushing data into something and it has a known structure, use an Azure SQL Database. Then you can use Direct Query, so the data is always up to date.
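To make the timings above easier to compare, here is a minimal Python sketch that converts them into rough throughput figures. It only does arithmetic on the numbers quoted in this answer; the helper names are made up for illustration, and 1 GB is assumed to be 1024 MB.

```python
# Rough throughput implied by the timings quoted above (illustrative arithmetic only).

def mb_per_second(size_gb: float, minutes: float) -> float:
    """Average MB/s when reading size_gb in the given time (assumes 1 GB = 1024 MB)."""
    return size_gb * 1024 / (minutes * 60)


def items_per_second(items: int, minutes: float) -> float:
    """Average list items read per second."""
    return items / (minutes * 60)


# 1 GB of CSV from Blob / Data Lake in 5-8 minutes
blob_fast = mb_per_second(1, 5)
blob_slow = mb_per_second(1, 8)

# 30,000 SharePoint list items in 20-30 minutes
list_fast = items_per_second(30_000, 20)
list_slow = items_per_second(30_000, 30)

print(f"Blob/Data Lake CSV read: {blob_slow:.1f}-{blob_fast:.1f} MB/s")
print(f"SharePoint list read:    {list_slow:.1f}-{list_fast:.1f} items/s")
```

These are averages over a single refresh, so they say nothing about how the sources behave with many small files or very wide lists.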

Related

Difference between SSAS and Power BI in Memory usage

I'm working on a business intelligence project for banking transactions. After completing the ETL phase, my supervisor asked me to research the difference between the tabular and the multidimensional models and which one is better suited to our needs. After choosing to work with the tabular model, I noticed that I have to choose between import and live connection to connect Power BI to our model.
So here are the questions that have come to my mind:
*How and when does the tabular model use memory?
*How and when does Power BI import use memory?
*What exactly should I import into Power BI from my tabular model?
*Does import mode import the model that is already using the memory cache, or something else?
*How much memory do I need if the size of my Data Warehouse DB is approximately 7GB?
NB: I'm still not too familiar with Power BI, so maybe I'm asking the questions in the wrong context.
I would be so grateful if anyone could help me with this.
I tried to use import mode to import my whole model, but there is always a memory problem.
Should I use live connection instead?
Your question isn't clear, so here are a few options for you.
SSAS Tabular, Azure Analysis Services (AAS) and Power BI use the same underlying engine for the tabular model, the VertiPaq engine. Power BI is a superset of SSAS Tabular, and currently has more focus from the internal project team. MS are currently trying to move customers from AAS to Power BI. See here.
"my Data Warehouse DB is approximately 7GB"
Importing the data will create a copy of the data from the data source and hold it in memory. The dataset will not have a 1 to 1 relationship in size with the source, as the VertiPaq engine will compress the data down. So you will have to test this.
However, you don't just have to plan for sufficient memory to hold the dataset; you have to remember that memory will be used in querying the data too. For example, a FILTER function basically returns a table, and that query table will be held in memory until the results of the measure are computed and returned. Memory will also be used when dataflows are being processed, even though they write to blob storage rather than being held in memory. There is a data model size restriction of 1GB for Power BI Pro, but the size restrictions are larger for Power BI Premium.
Direct Query and live connection have a far lower memory overhead than importing, as Power BI will not be holding the full data model, just the result set generated and returned by the data source. In most cases this will be quite low, but if you are returning detailed data it will take up more memory. In Direct Query mode you can also use aggregations to store a subset of the data in Power BI, rather than querying the data source.
If you are using SSAS Tabular/AAS, you should not really use Import mode in Power BI, as you'll be building the measures and data model twice; use Live Connection instead. If you wish to build the model in Power BI, then use Direct Query. However, you have to ensure that your data source can respond quickly to the queries generated by Power BI, so it should be in a star schema, indexed, and have enough scale to handle the queries.
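As a back-of-the-envelope aid for the sizing question, the sketch below shows how such an estimate could be structured in Python. The compression ratios (5x and 10x) and the 50% query-time headroom are placeholder assumptions, not measured or documented figures; only the 7GB source size and the 1GB Pro limit come from the discussion above. The real numbers have to come from testing an actual import.

```python
# Sizing sketch. The compression ratios and headroom below are ASSUMED
# placeholders for illustration; measure your own model to get real values.

PRO_LIMIT_GB = 1.0  # Power BI Pro data model limit mentioned above
SOURCE_GB = 7.0     # warehouse size from the question


def estimate_import_memory_gb(source_gb, compression_ratio, query_headroom):
    """Return (compressed model size, peak memory including query headroom)."""
    compressed = source_gb / compression_ratio
    peak = compressed * (1 + query_headroom)
    return compressed, peak


for ratio in (5, 10):  # assumed compression ratios
    compressed, peak = estimate_import_memory_gb(SOURCE_GB, ratio, query_headroom=0.5)
    verdict = "fits" if compressed <= PRO_LIMIT_GB else "exceeds"
    print(f"{ratio}x: model ~{compressed:.2f} GB ({verdict} the Pro limit), "
          f"peak ~{peak:.2f} GB while querying")
```

The point of the sketch is only that the model size and the working memory are two separate budgets: a model that fits under the limit can still run short of memory when measures materialise large intermediate tables.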

Load 600+ million records in Synapse Dedicated Pool with Oracle as Source

I am trying to do a full load of a very large table (600+ million records) which resides in an on-premises Oracle database. My destination is Azure Synapse Dedicated Pool.
I have already tried the following:
Using the ADF Copy activity with source partitioning, as the source table has 22 partitions
Increasing the Copy parallelism and DIU to a very high level
Still, I am able to fetch only 150 million records in 3 hrs, whereas the ask is to complete the full load in around 2 hrs, as the source will be frozen to users during that time frame so that Synapse can copy the data.
How can a full copy of the data be done from Oracle to Synapse in that time frame?
For a change, I tried loading data from Oracle to ADLS Gen 2, but it's slow as well.
There are a number of factors to consider here. Some ideas:
how fast can the table be read? What indexing / materialized views are in place? Is there any contention at the database level to rule out?
Recommendation: ensure database is set up for fast read on the table you are exporting
as you are on-premises, what is the local network card setup and throughput?
Recommendation: ensure local network setup is as fast as possible
as you are on-premises, you must be using a Self-hosted Integration Runtime (SHIR). What is the spec of this machine? eg 8GB RAM, SSD for spooling etc as per the minimum specification. Where is this located? eg 'near' the datasource (in the same on-premises network) or in the cloud. It is possible to scale out SHIRs by having up to four nodes but you should ensure via the metrics available to you that this is a bottleneck before scaling out.
Recommendation: consider locating the SHIR 'close' to the datasource (ie in the same network)
is the SHIR software version up-to-date? This gets updated occasionally so it's good practice to keep it updated.
Recommendation: keep the SHIR software up-to-date
do you have ExpressRoute or are you going across the internet? ER would probably be faster
Recommendation: consider ExpressRoute. Alternatively, consider Data Box for a large one-off export.
you should almost certainly land directly to ADLS Gen 2 or blob storage. Going straight into the database could result in contention there, and you are dealing with Synapse concepts such as transaction logging, DWU, resource class and queuing contention among others. View the metrics for the storage in the Azure portal to determine whether it is under stress. If it is under stress (which I think unlikely), consider multiple storage accounts
Recommendation: load data to ADLS2. Although this might seem like an extra step, it provides a recovery point and avoids the contention issues caused by attempting to do the extract and load all at the same time. I would only load directly to the database if you can prove it goes faster and you definitely don't need the recovery point
what format are you landing in the lake? Converting to parquet is quite compute intensive for example. Landing to the lake does leave an audit trail and give you a position to recover from if things go wrong
Recommendation: use parquet for a compressed format. You may need to optimise the file size.
ultimately the best thing to do would be one big bulk load (say taking the weekend) and then do incremental upserts using a CDC mechanism. This would allow you to meet your 2 hour window.
Recommendation: consider a one-off big bulk load and CDC / incremental loads to stay within the timeline
In summary, it's probably your network but you have a lot of investigation to do first, and then a number of options I've listed above to work through.
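To put the gap in numbers, here is a small Python sketch that works through the figures from the question (150 million rows in 3 hours observed, 600 million rows in 2 hours required). It is arithmetic only, and the variable names are just for illustration.

```python
# Arithmetic on the figures from the question; nothing here is measured.

TOTAL_ROWS = 600_000_000
OBSERVED_ROWS = 150_000_000
OBSERVED_HOURS = 3
TARGET_HOURS = 2

observed_rate = OBSERVED_ROWS / (OBSERVED_HOURS * 3600)  # rows per second today
required_rate = TOTAL_ROWS / (TARGET_HOURS * 3600)       # rows per second needed
speedup = required_rate / observed_rate
hours_at_current_rate = TOTAL_ROWS / (OBSERVED_ROWS / OBSERVED_HOURS)

print(f"observed: {observed_rate:,.0f} rows/s")
print(f"required: {required_rate:,.0f} rows/s")
print(f"speed-up needed: {speedup:.1f}x")
print(f"full load at current rate: {hours_at_current_rate:.0f} hours")
```

In other words, the pipeline has to go roughly six times faster end to end, which is why it is worth finding the single slowest stage before changing anything else.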
wBob provided a good summary of things you could look at to increase your transfer speed. In addition to that, you could try bulk exporting your data into chunks of data files and transferring the files in parallel to Azure Data Lake or Azure Blob storage. This way you can maximize your network throughput.
Once the data is on the datalake, you can scale up your Synapse instance and take advantage of fast loads using the COPY command.
I faced the same problem in our organization, and the fastest way to get the data out of SQL Server was using bcp into a fast storage layer.
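The chunk-and-transfer-in-parallel idea can be sketched in Python as follows. This is only an outline under stated assumptions: `export_and_upload` is a hypothetical placeholder that stands in for the real work (exporting a row range to a file and uploading it to the lake), and the chunk size and worker count are arbitrary illustrative values, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(total_rows, chunk_rows):
    """Return half-open (start, end) row ranges that together cover total_rows."""
    return [
        (start, min(start + chunk_rows, total_rows))
        for start in range(0, total_rows, chunk_rows)
    ]


def export_and_upload(chunk):
    """HYPOTHETICAL placeholder: export rows [start, end) to a file and upload it.

    A real version would run the export tool and the storage upload here.
    This stub only reports how many rows the chunk covers.
    """
    start, end = chunk
    return end - start


def run_parallel(total_rows, chunk_rows, workers):
    """Process all chunks on a thread pool and return the total rows handled."""
    chunks = split_into_chunks(total_rows, chunk_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(export_and_upload, chunks))


moved = run_parallel(600_000_000, chunk_rows=25_000_000, workers=8)
print(f"rows covered by all chunks: {moved:,}")
```

Threads are a reasonable fit here because the real work would be I/O-bound (database reads and network uploads); the useful number of workers is whatever keeps the network link busy without overloading the source.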

Maximum size of database for DQ mode in Power BI

I am using a database of around 500 GB. I want to visualize different columns to study the relationships between them using Power BI. However, there are performance issues while loading graphs.
I am using DQ mode.
It's annoying to wait 10 minutes for each visual to load.
Could anyone tell me if it's a good idea to use Power BI for visualisation and dashboards on 500 GB of data?
What is the maximum database size we can use in DQ mode to create visuals efficiently?
DQ doesn't have a defined limit; MS have shown demos using a petabyte database. In this case, for long-running queries on a database, you have a few options:
Understand what queries are being run and optimise your indexing strategy, for example by adding a covering index
Optimise your data source by using a column store index to move it in memory
Create a database or table(s) with the necessary subset of data from your main data
Examine what objects are being used, and remove nested logic, views on top of views, scalar conditions etc
The petabyte example by MS also used aggregation mode (mentioned by WB in their answer) to store a subset of the data
I have used Direct Query to sit over data sources that have been around the 200GB range. However, these have mostly been standard star schema data warehouses, or a defined reporting table, both of which had the relevant indexes, covering indexes, or column store indexes to allow more efficient retrieval of data. Direct Query mode will slow down due to the number of queries that it has to run on the data source, based on the measures, relationships and the connection overhead. Another factor can be the number of visuals on a page, as each visual is a query and each one has to run on the data source.
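As a small illustration of the covering-index suggestion, the Python sketch below uses SQLite (through the standard `sqlite3` module) as a stand-in for the real data source. The table, column and index names are made up for the example, and the exact plan text differs between engines and versions; the point is only that once an index contains every column the query touches, the engine can answer from the index alone.

```python
import sqlite3

# SQLite is used here purely as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO sales (region, amount, notes) VALUES (?, ?, ?)",
    [("north" if i % 2 else "south", float(i), "x" * 50) for i in range(1000)],
)

query = "SELECT SUM(amount) FROM sales WHERE region = 'north'"


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(query)  # no suitable index yet, so the table is scanned

# The index holds both the filter column and the aggregated column,
# so the query no longer needs to touch the base table.
conn.execute("CREATE INDEX idx_sales_region_amount ON sales (region, amount)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

With Direct Query the equivalent step is to capture the SQL that Power BI actually sends and design the index around the columns in those queries.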
You might want to look at aggregates in Power BI. You can basically import aggregate tables to Power BI that would satisfy needs for most of your visuals and resort to Direct Query for details that you might rarely need. When properly configured, aggregations will be cached and visuals that hit the aggregation will make use of that while those that don't will seamlessly query the DQ source.
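To show the idea behind aggregations, here is a conceptual Python sketch; it is not how Power BI implements them. A small pre-computed summary answers any question that only needs the columns it was grouped by, and everything else falls through to the detail rows, which stand in for the Direct Query source. All table contents and names are invented for the example.

```python
from collections import defaultdict

# Detail rows stand in for the large Direct Query source.
detail_rows = [
    {"date": "2024-01-01", "region": "north", "customer": "a", "amount": 10.0},
    {"date": "2024-01-01", "region": "north", "customer": "b", "amount": 5.0},
    {"date": "2024-01-01", "region": "south", "customer": "a", "amount": 7.0},
    {"date": "2024-01-02", "region": "north", "customer": "c", "amount": 3.0},
    {"date": "2024-01-02", "region": "south", "customer": "b", "amount": 8.0},
]


def sum_amount(rows, group_cols):
    """Total the amount column per combination of group_cols."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[c] for c in group_cols)] += row["amount"]
    return dict(totals)


# The imported aggregate table: one row per (date, region).
AGG_COLS = ("date", "region")
agg_rows = [
    dict(zip(AGG_COLS, key), amount=amount)
    for key, amount in sum_amount(detail_rows, AGG_COLS).items()
]


def total_amount(group_cols):
    """Answer from the aggregate when it covers group_cols, else from the detail."""
    if set(group_cols) <= set(AGG_COLS):
        return "aggregate", sum_amount(agg_rows, group_cols)
    return "detail", sum_amount(detail_rows, group_cols)


print(total_amount(("region",)))    # served by the small aggregate table
print(total_amount(("customer",)))  # needs the detail rows
```

The design choice that matters is which columns go into the aggregate: they should match the grain most visuals use, so that the expensive detail path is the exception.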
Also, the VertiPaq engine with its columnar store is quite efficient at compressing data. So given some smart modelling (get rid of unneeded high-cardinality columns), you might actually end up with a much smaller model than your original data, even with everything imported.
Your mileage may vary.
As to the dataset limit itself, I believe it's 1GB/dataset when uploading to the service.

Storing Images / Media Files in Oracle

I want to store a large number of media files in Oracle. I believe I can store these files in the form of BLOBs using a PL/SQL procedure. However, I want to make sure there is no impact on the resolution or quality of the media files. Also, are there any considerations that I need to account for when storing media files in an Oracle DB?
Storing and retrieving files in a BLOB does not impact image quality or resolution. Oracle treats them as binary objects, and what you store is what you get when you retrieve it.
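The "what you store is what you get" behaviour of a binary column is easy to check with a round trip. The Python sketch below uses SQLite through the standard `sqlite3` module as a stand-in, not Oracle, and random bytes in place of a real image; the table and file names are made up. The same kind of hash comparison could be run against an Oracle BLOB column to confirm nothing is altered.

```python
import hashlib
import os
import sqlite3

# Random bytes stand in for an image or video file.
original = os.urandom(256 * 1024)

# SQLite is used as a stand-in for Oracle; the table layout is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")
conn.execute(
    "INSERT INTO media (name, content) VALUES (?, ?)", ("photo.jpg", original)
)

(retrieved,) = conn.execute(
    "SELECT content FROM media WHERE name = ?", ("photo.jpg",)
).fetchone()

# Identical hashes mean the file came back byte for byte.
same = hashlib.sha256(original).digest() == hashlib.sha256(retrieved).digest()
print("byte-for-byte identical:", same)
```

Any change in quality would therefore have to come from code that re-encodes or resizes the file before the insert, not from the database column itself.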
Typically, any such modifications are made in application-layer logic before the data is stored in the BLOB. For text-based files, compressing them before storing saves some disk space; for images, the resolution or image size is typically reduced to cut the file size. These are decisions taken while designing the application, as part of the application architecture, to reduce the overall storage requirement.
Also, consider whether this is going to be the right design for you. There are implications in terms of storage requirements, application performance and scalability. There are several threads on Stack Overflow discussing the advantages of storing images in RDBMS vs NoSQL databases vs filesystems. The average size of the files also matters a lot.
some links:
Storing images in NoSQL stores
NoSQL- Is it suitable for storing images?
Storing very big files in database
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about this project. An overview is as follows:
Lots of writes and Lots of reads on same tables, very realtime
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented
Each row of data may contain lots of attributes to deal with
I am suggesting the following as a solution:
Use a distributed hash table sort of persistence (not S3 but an in-house one)
Use Hadoop/Hive or the like (any replacement in .NET?) for any analytical processing across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of ajaxification, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external harddrive and sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
Simulate that workload on a non-relational or cloud database and you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
Get Ralph Kimball's Data Warehouse Toolkit books.
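The pattern described above (flat files with pre-assigned keys, a simple bulk load, then SELECT SUM() ... GROUP BY over fact JOIN dimension) can be sketched in a few lines of Python. SQLite via the standard `sqlite3` module stands in for the RDBMS, the two in-memory CSV strings stand in for the flat files, and all table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Flat files with keys already assigned (in-memory strings for the example).
DIM_CSV = "product_key,category\n1,books\n2,games\n"
FACT_CSV = "product_key,quantity,amount\n1,2,20.0\n1,1,10.0\n2,3,90.0\n"

conn = sqlite3.connect(":memory:")  # SQLite stands in for the RDBMS
conn.execute(
    "CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT)"
)
conn.execute(
    "CREATE TABLE fact_sales ("
    "product_key INTEGER REFERENCES dim_product(product_key), "
    "quantity INTEGER, amount REAL)"
)

# Bulk load: keys come straight from the files, so no lookups are needed.
dim_rows = [
    (int(r["product_key"]), r["category"])
    for r in csv.DictReader(io.StringIO(DIM_CSV))
]
fact_rows = [
    (int(r["product_key"]), int(r["quantity"]), float(r["amount"]))
    for r in csv.DictReader(io.StringIO(FACT_CSV))
]
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", dim_rows)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", fact_rows)

# Every report query has the same simple shape.
result = conn.execute(
    """
    SELECT d.category, SUM(f.amount), COUNT(*)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(result)
```

A production load would use the database's own bulk loader rather than row inserts from a script; the sketch is only meant to show how little structure the datamart needs.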
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean. You don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures on top of a well-designed physical data model (indexes, partitions, etc).
Teradata can handle a lot of data and is scalable.
