Has anyone else had issues (and solved them!) with very slow queries in Power BI when using the "Edit Queries" (Power Query / M) side of Power BI?
I am using multiple nested queries (created with the Reference option, not Duplicate) to get different aggregation levels, all based on a single table read from a large CSV file.
I expected the data to be read once to the base table and then each derived table would extract data from the locally stored base table. But it seems to go back to the source data multiple times. This takes the run time to over 15 minutes.
Are there options to stop Power BI from going back to the source for each of these?
Answer: it turned out that the folder where the input CSV was held was extraordinarily slow. When we moved the file to test alternatives, the speed problem went away.
I am not sure that you can avoid multiple re-reads of an input Excel (or CSV) file in Power BI, though.
Related
I'm using Oracle Report Builder 9.0.4.1.0 and I have a heavy report that defines a large number of queries. I think not all of those queries are actually used in the report; some are not linked to any layout object.
Is there an easy way to detect which queries (or other objects) aren't used at all in a specific report, instead of deleting a query, compiling, running, and verifying one by one whether each is used or not?
Thanks
If there is an easy way to do that, I don't know it. A long time ago, when Reports 1.x was used, reports were saved in the database, so you could write a query to fetch the metadata you're interested in. I never did that myself, but it would have been an option. Now, all you have is an RDF (or a JSP) file.
However, a few suggestions, if I may.
Open the Paper Layout editor. Click a repeating frame and observe its property palette; it contains information about the group it belongs to. The group can be viewed in the Data Model view.
As there usually aren't that many repeating frames, you should be able to identify queries that don't feed any frames, i.e. that don't contribute to the final result.
Another option is to put a condition
WHERE 1 = 2
into each query in turn so that it won't return any rows. Run the report and check what's missing, then remove the condition so that you get values again. Move on to the second query, and so forth. That's a little tedious and time-consuming, but it should still be faster than deleting queries.
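For example, in a hypothetical report query (the table, columns and bind variable here are just illustrative), the temporary condition would look like this:

SELECT ename, deptno, sal
FROM   emp
WHERE  deptno = :p_deptno
AND    1 = 2   -- temporary: suppresses all rows for this query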
You can also output the report results to an XML file. Each query that returns data will contribute something inside its XML tags.
I have some huge CSV files and it takes me quite a long time to load them in Power BI. I assume that's normal the first time I load them. But here is the problem: every time I alter the data in the Query Editor and then close & apply my changes, Power BI reloads the whole files and once again takes a long time. Isn't it possible for Power BI to only reload or reread the altered data? (I know about the "Enable load" and "Include in report refresh" options, but they don't help.)
I don't know if I made myself clear; if not, let me know what you don't understand.
The main problem here is the performance of Power BI, which always reloads the whole file(s) when you alter the query.
Thanks a lot.
There's no Power BI solution for this - ref the many popular Ideas on their Community site for "incremental load" etc.
My typical workarounds are:
Pre-load CSV data to SQL Server or similar. PBI development will be much quicker (e.g. effective test filters) and you can possibly pre-aggregate in SQL (see the sketch after this list).
Pre-process Queries using Excel Power Query, saving the results as Excel Tables. You can copy and paste the Query definitions between PBI and Excel.
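To illustrate the first workaround, here is a rough T-SQL sketch; the table, view, file path and column names are assumptions for illustration, not taken from the question:

-- Load the CSV into SQL Server once (hypothetical path and table)
BULK INSERT dbo.SalesRaw
FROM 'C:\data\sales.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
GO

-- Pre-aggregate so Power BI pulls far fewer rows on each refresh
CREATE VIEW dbo.SalesMonthly AS
SELECT YEAR(SaleDate)  AS SaleYear,
       MONTH(SaleDate) AS SaleMonth,
       SUM(Amount)     AS TotalAmount
FROM dbo.SalesRaw
GROUP BY YEAR(SaleDate), MONTH(SaleDate);
GO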
I inherited a report from a developer who combined 5 reports into one SSRS report. It looks like he just copied and pasted each tablix from the original reports one below the other. This was done so that when the user exports to Excel they get each report on a separate tab. I've never built a combined SSRS report like this before, so I'm only now analyzing how the whole thing works. A major problem I'm finding is that it runs extremely slowly, about 10 minutes, seemingly because it has to run all 5 queries. Each stored procedure is listed separately as a dataset. Does anyone know a better way to combine multiple SSRS reports onto one page, or at least how to make this thing faster?
The first step to improving performance for an SSRS report is to determine what the bottleneck is. Run a query against the view named ExecutionLog4 in the ReportServer database. For each recent execution of a report, the view will give you a record that includes 3 critical fields: TimeDataRetrieval, TimeProcessing, and TimeRendering.
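As a minimal sketch of that query (the view name follows this answer; the exact view and column names can vary by SSRS version, so treat them as assumptions):

SELECT TOP (50)
       ItemPath,            -- which report ran
       TimeStart,
       TimeDataRetrieval,   -- ms spent running the dataset queries
       TimeProcessing,      -- ms spent processing the retrieved data
       TimeRendering        -- ms spent rendering the output
FROM   ReportServer.dbo.ExecutionLog4
ORDER BY TimeStart DESC;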
TimeDataRetrieval indicates how long (in milliseconds) it takes for all of the queries to run and return your datasets. If this number is high, then you will need to tune your queries or eliminate some of them to improve performance. You can run a profiler trace to identify which of the procedures is running slowly.
Keep in mind also that subreports fire their dataset queries each time they are rendered in the report. So even a minor performance hiccup in a subreport's dataset gets magnified by the number of executions.
TimeProcessing indicates how much time the report server spends manipulating the retrieved data. If this number is high, you may want to consider moving aggregate calculations that are run many times within the report to the SQL side.
TimeRendering indicates how long the server takes to actually render the report. If this number is high, consider avoiding or simplifying expressions used on visual properties that repeat over and over again. This scenario is less common than the other two, in my experience.
Furthermore, here are some tips I've picked up that help to avoid performance issues:
-Avoid using row visibility expressions if you expect a large number of rows to be returned.
-Hiding an object does not prevent dataset execution. If your datasets have a similar structure, consider combining them and using object filters to limit what is displayed in different sections. Or use an IF statement in your stored procedure if you only intend to display one of several choices depending on data or parameters (a sketch follows this list).
-Try to limit the number of column groupings in a large tablix. Each additional grouping multiplies the number of rows of data that may be returned to pivot into those groupings.
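A rough sketch of that stored-procedure branching (procedure, parameter, table and column names are made up for illustration):

CREATE PROCEDURE dbo.GetReportData
    @Mode varchar(20)
AS
BEGIN
    IF @Mode = 'Summary'
        -- summary branch: one aggregated result set
        SELECT Region, SUM(SalesAmount) AS TotalSales
        FROM   dbo.SalesFact
        GROUP BY Region;
    ELSE
        -- detail branch: row-level data
        SELECT Region, OrderID, SalesAmount
        FROM   dbo.SalesFact;
END;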
More info on SSRS performance can be found at
https://technet.microsoft.com/en-us/library/bb522806(v=sql.105).aspx
This was written for 2008R2, but seems mostly applicable to 2012 as well.
Give all that a shot, then post back here with a more specific question if you get stuck.
I have a (growing) table of data with 40,000 rows and 20 columns.
I need to group this data (by month and week) and perform some simple operations (+ and /) between rows/columns.
I must be able to change the period in question and the specific rows to sum up. I know how to use macros/pivots/formulas, but I haven't started yet, and I would like the recalculation process to be as fast as possible, not a case of clicking a button and then everything freezing for minutes.
Do you have any idea on what could be the most efficient solution?
Thank you
Excel has its limits when it comes to storing and analyzing data at the same time.
If you're planning to build a growing database in MS Excel, at some point you will add so much data that the Excel files stop working (or using them won't be time-effective).
Before you get to that point, you should be looking at alternative storage options as a scalable data solution.
They can be simple, like an Access DB, SQLite, PostgreSQL, MariaDB, or even PowerPivot (though this can have its own issues).
Or more complex, like storing the data in a database, then adding an analysis cube and pulling smaller slices of data from these databases into Excel for analysis and reporting.
Regardless of what you end up doing you will have to change how Excel interacts with the data.
You need to move all of the raw data to another system (Access or SQL are the easiest, but Excel supports a lot of other DB options) and pull smaller chunks of data back into Excel for time-effective analysis.
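As a rough illustration of what that smaller chunk might look like (the table, column and parameter names are assumptions, not from the question), Excel would pull back a pre-grouped summary instead of the raw 40,000 rows:

DECLARE @PeriodStart date = '2024-01-01',   -- hypothetical period bounds
        @PeriodEnd   date = '2024-06-30';

SELECT YEAR(EntryDate)  AS EntryYear,
       MONTH(EntryDate) AS EntryMonth,      -- use DATEPART(week, EntryDate) for a weekly view
       SUM(Amount)      AS TotalAmount
FROM   dbo.Entries
WHERE  EntryDate BETWEEN @PeriodStart AND @PeriodEnd
GROUP BY YEAR(EntryDate), MONTH(EntryDate)
ORDER BY EntryYear, EntryMonth;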
Useful Links:
SQL Databases vs Excel
Using Access or Excel to manage your data
I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
Here facts has about 422,000 rows and Product has 600 rows; the grouping comes back with 4 rows.
First I tested this query on SqlServer and got a result back in about 150ms.
With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 seconds).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
I got the result back on the first run in about 3 seconds; the next run was about 900 ms and then it stabilized at about 500 ms.
From reading around, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process, so the scan and join are very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan and send it to all the drillbit processes, which then look up the data sources, access the data using the connectors, run the query, and return the results to the first node for aggregation; only then do you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
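For reference, the JSON-to-Parquet conversion can be done inside Drill itself with a CTAS statement along these lines (the dfs.tmp workspace must be writable, and the file path and table name here are illustrative):

-- write subsequent CREATE TABLE output as Parquet
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`facts_parquet` AS
SELECT * FROM dfs.`/data/facts.json`;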
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500 ms with Parquet is actually impressive considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
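A minimal sketch of that columnstore option, assuming the fact table from the question lives in the dbo schema:

-- Store the fact table in compressed, column-oriented format
CREATE CLUSTERED COLUMNSTORE INDEX cci_facts
ON dbo.facts;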
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines to run faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how it translates a SQL query into an executable plan that fetches and processes data from, theoretically, any source of data. I deliberately didn't say "data source" so people won't think only of databases or other software-based data management systems.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it starts performing whatever is needed to execute the query; that may be filtering, sorting, joining, projecting (selecting specific columns), etc.
So by default Drill doesn't use any of the source's capabilities for processing the queried data. In fact, the source may not support any such capability at all!
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly recall when I think about Drill's performance is this one:
SELECT a.CUST_ID,
       (SELECT COUNT(*) FROM SALES.CUSTOMERS WHERE CUST_ID < a.CUST_ID) rowNum
FROM SALES.CUSTOMERS a
ORDER BY CUST_ID
Only because of the < comparison operator, Drill has to load the whole table (i.e. actually a Parquet file), SORT IT, then perform the join.
This query took around 18 minutes to run on my machine, which is not a particularly powerful machine, but still, the effort Drill needs to process this query must not be ignored.
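For comparison (this is not from the original answer, just an illustration of the same result), the running count can also be expressed with a window function, which lets the engine sort the table once instead of re-scanning it per row:

SELECT CUST_ID,
       RANK() OVER (ORDER BY CUST_ID) - 1 AS rowNum   -- count of rows with a strictly smaller CUST_ID
FROM   SALES.CUSTOMERS
ORDER BY CUST_ID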
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data, and probably other things that I can't think of at the moment; you may find more information in other answers.