Maximum size of database for DQ mode in Power BI - performance

I am using a database of about 500 GB. I want to visualize different columns in Power BI to study the relationships between them. However, there are performance issues while loading visuals.
I am using DirectQuery (DQ) mode.
It's annoying to wait 10 minutes for each visual to load.
Could anyone tell me if it's a good idea to use Power BI for visualisation/dashboards over 500 GB of data?
What is the maximum database size we can use in DQ mode and still create visuals efficiently?

DQ doesn't have a defined limit; MS has shown demos using a petabyte-scale database. In your case, for long-running queries on the database, you have a few options:
Understand what queries are being run and optimise your indexing strategy, for example by adding a covering index (see the sketch after this answer).
Optimise your data source, for example by using a columnstore index to serve it from memory.
Create a database or table(s) with the necessary subset of data from your main data.
Examine what objects are being used, and remove nested logic: views on top of views, scalar conditions, etc.
The petabyte example by MS also used aggregations (mentioned by WB in their answer) to store a subset of the data.
I have used DirectQuery over data sources in the 200 GB range, but these have mostly been standard star-schema data warehouses or a defined reporting table, both of which had the relevant indexes, covering indexes, or columnstore indexes to allow more efficient retrieval of data. DirectQuery mode will slow down because of the number of queries it has to run against the data source, driven by the measures, the relationships and the connection overhead. Another factor can be the number of visuals on a page: each visual is a query, and each one has to run against the data source.
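For the indexing options above, the exact DDL depends on your source; here is a minimal sketch in SQL Server syntax using a hypothetical FactSales table (names are illustrative, not from the question):
-- Covering index: the filtered columns form the key and the selected columns are INCLUDEd,
-- so the query can be answered from the index alone without touching the base table.
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
ON dbo.FactSales (OrderDate, StoreKey)
INCLUDE (SalesAmount, Quantity);

-- Columnstore index: stores the table column-wise and compressed, which suits the
-- scan/aggregate style of query that DirectQuery visuals tend to generate.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;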

You might want to look at aggregations in Power BI. You can basically import aggregate tables into Power BI that satisfy the needs of most of your visuals, and resort to DirectQuery only for details that you rarely need. When properly configured, aggregations will be cached; visuals that hit the aggregation will use the cache, while those that don't will seamlessly query the DQ source.
Also, the VertiPaq engine with its columnar store is quite efficient at compressing data. So with some smart modelling (get rid of unneeded high-cardinality columns), you might actually end up with a much smaller model than your original data if you import it all.
Your mileage may vary.
As to the dataset limit itself, I believe it's 1GB/dataset when uploading to the service.
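For reference, the aggregation table that Power BI imports is usually just a pre-summarised table (or view) built in the source. A rough sketch in generic SQL, assuming a hypothetical FactSales table at daily grain (on SQL Server you would use SELECT ... INTO instead of CREATE TABLE AS):
-- Daily aggregate that most visuals can be answered from; Power BI caches this
-- and falls back to DirectQuery against FactSales only for detail-level visuals.
CREATE TABLE FactSales_DailyAgg AS
SELECT OrderDate,
       StoreKey,
       SUM(SalesAmount) AS SalesAmount,
       SUM(Quantity)    AS Quantity,
       COUNT(*)         AS RowCountAgg
FROM FactSales
GROUP BY OrderDate, StoreKey;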

Related

Difference between SSAS and Power BI in Memory usage

I'm working on a business intelligence project for banking transactions. After completing the ETL phase, my supervisor asked me to research the difference between the Tabular and Multidimensional models and which one is more adaptable to our needs. After choosing to work with the Tabular model, I noticed that I have to choose between Import and Live Connection to connect Power BI to our model.
So here are the questions that have come to my mind:
*How and when does the Tabular model use memory?
*How and when does Power BI Import mode use memory?
*What exactly should I import into Power BI from my Tabular model?
*Does Import mode import the model that is already using the memory cache, or something else?
*How much memory do I need if the size of my Data Warehouse DB is approximately 7GB?
NB: I'm still not too familiar with Power BI, so maybe I'm asking the questions in the wrong context.
I would be so grateful if anyone could help me with this.
I tried to use Import mode to import my whole model, but there is always a memory problem.
Should I use live connection instead?
Your question isn't clear, so here are a few options for you.
SSAS Tabular, Azure Analysis Services (AAS) and Power BI use the same underlying engine for the tabular model, the VertiPaq engine. Power BI is a superset of SSAS Tabular and currently has more focus from the internal project team. MS is currently trying to move customers from AAS to Power BI. See here.
my Data Warehouse DB is approximately 7GB
Importing the data will create a copy of the data from the data source and hold it in memory. The dataset will not have a 1-to-1 relationship in size, as the VertiPaq engine will compress the data down, so you will have to test this.
However, you don't just have to plan for sufficient memory to hold the dataset; you have to remember that memory will be used in querying the data too. For example, a FILTER function basically returns a table, and that query table will be held in memory until the results of the measure are computed and returned. Memory will also be used when dataflows are being processed, even though they write to blob storage and are not held in memory. There is a data model size restriction of 1GB for Power BI Pro, but the size restrictions are larger for Power BI Premium.
Direct Query and Live Connection have a far lower memory overhead than importing, as Power BI does not hold the full data model, just the result set generated and returned from the data source. In most cases this will be quite low, but if you are returning detailed data, it will take up more memory. For Direct Query mode you can also use aggregations to store a subset of the data in Power BI, rather than querying the data source.
If you are using SSAS Tabular/AAS, you should not really use Import mode in Power BI, as you'll be building the measures and data model twice. If you use SSAS Tabular/AAS, you should use Live Connection. If you wish to query the warehouse directly from Power BI instead, then use Direct Query; however, you have to ensure that your data source can respond to the queries generated by Power BI quickly, so it should be in a star schema, indexed, and have enough scale to handle queries quickly.

Vertica: how to restrict database size

Could you please help me with the following issue?
I have installed a Vertica cluster. I can't understand how I can restrict the database size by time or by size. For example, data in the database must be deleted when it is older than 30 days or when a database size of 100 GB is reached (whichever comes first).
There is no automated way of doing this, and no logical way of "restricting database size". You can't just trim "data" from a "database".
What you are talking about (in terms of limiting data older than 30 days) needs to be done at the table level. You would need some kind of date field and delete anything older than 30 days. However, I would advise against deleting rows in this way. It is non-performant and can cause queries against the table to be slow: see DELETE and UPDATE Performance Considerations. The best way of doing this would be to partition the table by day and create an automated script (bash, python, etc.) that each day drops the partition corresponding to the date 30 days ago: see Dropping Partitions.
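A rough sketch of that approach in Vertica SQL, with a hypothetical events table (exact function names and arguments vary a little between Vertica versions, so check the documentation for yours):
-- Partition the table by calendar day so old data can be dropped cheaply.
CREATE TABLE public.events (
    event_id  INT,
    event_ts  TIMESTAMP NOT NULL,
    payload   VARCHAR(1000)
)
PARTITION BY event_ts::DATE;

-- Run this daily from a script: drop the partition that is now 30 days old.
-- (Recent Vertica versions; older releases use DROP_PARTITION instead.)
SELECT DROP_PARTITIONS('public.events',
                       (CURRENT_DATE - 30)::VARCHAR,
                       (CURRENT_DATE - 30)::VARCHAR);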
As for deleting data if the size of the "database" goes above 100 GB: this requirement is extremely vague and would be impossible to enforce. Say you have 50 tables and several of them grow so that the total size of the database is over 100 GB; how would you decide which table to prune? This also has to be done table by table (or, technically, at the projection level, since that is where the data is actually stored).
To see the compressed size (size on disk) of the database you can use this query:
SELECT SUM(used_bytes) / ( 1024^3 ) AS database_size_gb
FROM projection_storage;
However, since data can only be deleted with a DELETE or DROP PARTITION statement on a table, it would also be helpful to see the size of each table. You can do this by using this query:
SELECT projection_schema, anchor_table_name, SUM(used_bytes) / ( 1024^3 ) AS table_size_gb
FROM projection_storage
GROUP BY 1, 2
ORDER BY 3 DESC;
From the results you can decide which tables you want to prune.
A couple of notes (as a Vertica DBA):
Data is stored in projections. Having too many projections on a single table can not only cause queries to be slow but will also increase the overall data footprint. Avoid using too many projections (especially too many superprojections, don't have more than two per table, and most tables will only need one). Use the database designer or follow the guidelines in the documentation for creating custom projections: Design Fundamentals.
Also, another trick to keep database size down is to use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function. Unless your projections are created with the database designer, they will likely only contain the auto encoding. Using the DESIGNER_DESIGN_PROJECTION_ENCODINGS function will help you to pick the most optimal encoding for each column. I have seen properly encoded projections take up a mere 2% disk size compared to the previously un-optimized projection. That is rare, but in my experience you will still see at least a 20-40% reduction in size. Do not be afraid to use this function liberally. It is one of my favorite tools as a Vertica DBA.
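A minimal example of calling it (the parameters shown are my assumption of the common form; verify against the documentation for your Vertica version):
-- Analyze an existing projection and write the recommended CREATE PROJECTION
-- script with optimized encodings to a file, without deploying it automatically.
SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS(
    'public.events_super',        -- projection(s) or table to analyze
    '/tmp/events_encodings.sql',  -- destination file for the proposed DDL
    FALSE,                        -- deploy: FALSE = only write the script
    TRUE                          -- reanalyze encodings even where already set
);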

How to achieve Data Sharding in Endeca (data partitioning)

Currently Oracle Commerce Guided Search (Endeca) supports only language-specific partitions (i.e., one MDEX per language). For systems with a huge data volume (say ~100 million records across ~200 stores), has anyone successfully implemented data partitioning (sharding) based on a logical group of data (i.e., one MDEX per group of stores) so that the large set of data can be divided into smaller sets?
If so, what precautions to be taken while indexing data and strategies for querying the Assembler?
I don't think this is possible. Endeca used to support the Adgidx, which allowed you to split or shard the MDEX, but that is no longer supported. Oracle's justification for removing it is that, with multithreading and multi-core processors, it is no longer necessary. Apache Solr, however, does support sharding.
The large set of data can be broken into smaller sets, where each set would be attributed to a property, say record.type, which would identify the different sets. So, basically we are normalizing the records in the Endeca index.
Now, while querying Endeca, we can use record relationship navigation queries (record-to-record relationships with a relationship filter applied) to bring back records of the different types.
However, you might have to obtain a RRN license to enable the RRN feature in the mdex engine.

Doing analytical queries on large dynamic sets of data

I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally, there should be a way to correlate individual data units to an aggregated result, i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed; but this is expected.)
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases and would like to hear your suggestions. I have taken a look at Hive/Pig, which fits the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files compared to an indexed database (SQL or NoSQL).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
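To make that split concrete, here is a small sketch in generic SQL (all table and column names are hypothetical): immutable attributes live in one table, state changes are appended to a log, and a query joins each unit's latest state back to its attributes at query time.
-- Immutable attributes, written once per unit.
CREATE TABLE units (
    unit_id    BIGINT PRIMARY KEY,
    category   VARCHAR(50),
    created_at TIMESTAMP
);

-- Append-only state log: every state change is a new row, never an update in place.
CREATE TABLE unit_state_log (
    unit_id    BIGINT,
    state      VARCHAR(20),
    changed_at TIMESTAMP
);

-- Aggregate on an attribute using each unit's latest known state
-- (effectively a snapshot as of the moment the query runs).
SELECT u.category, s.state, COUNT(*) AS unit_count
FROM units u
JOIN (
    SELECT unit_id, state,
           ROW_NUMBER() OVER (PARTITION BY unit_id ORDER BY changed_at DESC) AS rn
    FROM unit_state_log
) s ON s.unit_id = u.unit_id AND s.rn = 1
GROUP BY u.category, s.state;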
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data which has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net
There is also Shard-Query, which can be used in combination with InnoDB and MySQL partitioning, and also supports spreading data over many machines. This will satisfy high update rates and provide query parallelism for fast aggregation.
Of course, you can combine the two together.

Does Oracle 11g automatically index fields frequently used for full table scans?

I have an app using an Oracle 11g database. I have a fairly large table (~50k rows) which I query thus:
SELECT omg, ponies FROM table WHERE x = 4
Field x was not indexed, I discovered. This query happens a lot, but the thing is that the performance wasn't too bad. Adding an index on x did make the queries approximately twice as fast, which is far less than I expected. On, say, MySQL, it would've made the query ten times faster, at the very least. (Edit: I did test this on MySQL, and saw a huge difference there.)
I'm suspecting Oracle adds some kind of automatic index when it detects that I query a non-indexed field often. Am I correct? I can find nothing even implying this in the docs.
As has already been indicated, Oracle 11g does NOT dynamically build indexes based on prior experience. It is certainly possible, and indeed happens often, that adding an index under the right conditions will produce the order-of-magnitude improvement you note.
But as has also already been noted, 50K (seemingly short?) rows is nothing to Oracle. The Oracle database in fact has a great deal of intelligence that allows it to scan data without indexes most efficiently. Every new release of the Oracle RDBMS gets better at moving large amounts of data. I would suggest to you that the reason Oracle was so close to its "best" timing even without the index as compared to MySQL is that Oracle is just a more intelligent database under the covers.
However, the Oracle RDBMS does have many features that touch upon the subject area you have opened. For example:
10g introduced a feature called AUTOMATIC SQL TUNING, which is exposed via a GUI known as the SQL TUNING ADVISOR. This feature is intended to analyze queries on its own, in depth, and includes the ability to do WHAT-IF analysis of alternative query plans, including simulation of indexes which do not actually exist. However, this would not explain any performance difference you have seen, because the feature needs to be turned on, and it does not actually build any indexes; it only makes recommendations for the DBA to create indexes, among other things.
11g includes AUTOMATIC STATISTICS GATHERING which when enabled will automatically collect statistics on database objects as it deems necessary based on activity on those objects.
Thus the Oracle RDBMS is doing what you have suggested, dynamically altering its environment on its own based on its experience with your workload over time in order to improve performance. Creating indexes on the fly is just not one of the things it does yet. As an aside, this has been hinted at by Oracle in private several times, so I figure it is in the works for some future release.
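For reference, the SQL Tuning Advisor can also be driven from PL/SQL rather than the GUI; a minimal sketch (task and table names are illustrative):
-- Create and run a tuning task for one statement, then read the report.
DECLARE
  l_task VARCHAR2(64);
BEGIN
  l_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(
              sql_text  => 'SELECT omg, ponies FROM my_table WHERE x = 4',
              task_name => 'tune_x_query');
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'tune_x_query');
END;
/
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_x_query') FROM dual;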
Does Oracle 11g automatically index fields frequently used for full table scans?
No.
In regard to the MySQL issue, which storage engine you use can make a difference.
"MyISAM relies on the operating system for caching reads and writes to the data rows, while InnoDB does this within the engine itself."
Oracle will cache the table/data rows, so it won't need to hit the disk. Depending on the OS and hardware, there's a chance that MySQL with MyISAM had to physically read the data off disk each time.
~50K rows, depending greatly on how big each row is, could conceivably be stored in under 1000 blocks, which could be quickly read into the buffer cache by a full table scan (FTS) in under 50 multi-block reads.
Adding appropriate index(es) will allow queries on the table to scale smoothly as the data volume and/or access frequency goes up.
"Adding an index on x did make the
queries approximately twice as fast,
which is far less than I expected. On,
say, MySQL, it would've made the query
ten times faster, at the very least."
How many distinct values of X are there? Are they clustered in one part of the table or spread evenly throughout it?
Indexes are not some voodoo device: they must obey the laws of physics.
Edit
"Duplicates could appear, but as it is, there are none."
If that column has neither a unique constraint nor a unique index, the optimizer will choose an execution path on the basis that there could be duplicate values in that column. This is the value of declaring the data model as accurately as possible: the provision of metadata to the optimizer. Keeping the statistics up to date is also very useful in this regard.
You should have a look at the estimated execution plan for your query, before and after the index has been created. (Also, make sure that the statistics are up-to-date on your table.) That will tell you what exactly is happening and why performance is what it is.
50k rows is not that big a table, so I wouldn't be surprised if the performance was decent even without the index. Thus, adding the index to the equation can't really bring much improvement to query execution speed.
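A quick way to do both checks, assuming the table is called my_table (the EXEC shortcut is SQL*Plus/SQL Developer syntax):
-- Refresh optimizer statistics for the table, then compare plans with and without the index.
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MY_TABLE');

EXPLAIN PLAN FOR
  SELECT omg, ponies FROM my_table WHERE x = 4;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);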
