I have a large dimension and it is taking more and more time to process. I would like to decrease the processing time as much as possible.
There are literally hundreds of different articles on how to process SSAS objects as efficiently and as quickly as possible.
There are lots of tips and tricks that one can apply to speed up dimension and cube processing. I have applied all, or at least a large majority, of them and I am still not happy with the result.
I have a large dimension built on top of a table.
It has around 60 million records and it keeps growing fast.
Rows are either added to it or deleted from it; updates are not possible.
I am looking for a solution that will allow me to perform an incremental processing of my dimension.
I know that the data from previous months will not change. I would like to do something similar to partitioning my cube, but on the dimension.
I am using SQL Server 2012 and, to my knowledge, dimension partitioning is not supported.
I am currently using ProcessUpdate on my dimension. I tried processing by attribute and by table, but both give almost the same result. I have hierarchies and attribute relationships, some set to rigid, and I only include the attributes that are truly needed, etc.
ProcessUpdate has to read all the records in the dimension, even those that I know have not changed. Is there a way to partition a dimension? If I could tell SSAS to only process the last 3-4 weeks of data in my dimension and not touch the rest, it would greatly speed up my processing time.
I would appreciate your help.
OK, so I did a bit of research and I can confirm that incremental dimension processing is not supported.
It is possible to do ProcessAdd on a dimension, but if you have records that were deleted or updated you cannot use it.
It would be a useful thing to have, but Microsoft hasn't developed it and I don't think they will.
Incremental processing of any table is, however, possible in tabular cubes.
So if you have a similar requirement and your cube is not too complex, creating a tabular cube is the way to go.
Due to the corona lockdown, I don't have easy access to my more knowledgeable colleagues, so I'm hoping for a few possible recommendations here.
We do quarterly and yearly "freezes" of a number of statistical entities with a large number (1-200) of columns. Everyone then uses these "frozen" versions as a common basis for all statistical releases in Denmark. Currently, we simply create a new table for each version.
There's a demand to test if we can consolidate these several hundred tables to 26 entity-based tables to make programming against them easier, while not harming performance too much.
A "freeze" is approximately 1 million rows and consists of: Year + Period + Type + Version.
For example:
2018_21_P_V1 = Preliminary Data for 2018 first quarter version 1
2019_41_F_V2 = Final Data for 2019 yearly version 2
I am simply not very experienced in the world of partitions. My initial thought was to partition on Year + Period and subpartition on Type + Version, but I am no longer sure this is the right approach, nor do I have a clear picture of which partitioning type would solve the problem best.
I am hoping someone can recommend an approach as it would help me tremendously and save me a lot of time "brute force" testing a lot of different combinations.
Based on the situation you described, I highly recommend that you use partitioning. No doubt about it.
It is highly effective and easy to use. You can read the Oracle documentation about partitioning, or search the web, to understand how to get started.
In general, when you partition a table, Oracle treats each partition as a separate table, so don't worry about the speed of fetching data.
The most important step is to choose the best field(s) to base your partitions on. I used a date in number/int format (e.g. 20190506) for daily partitions, or 201907 for monthly partitions. You should design and test it.
The next step is to decide about sub-partitions. In some cases you don't really need them; it depends on your data structure and your expectations of the data. What do you want to do with the data? Which fields are more important (used in WHERE clauses, ...)?
Then create index(es) for each partition. Very important.
Another important point is that using partitions may change the way you write your PL/SQL. For example, a query that touches two or more partitions at once cannot be pruned down to a single partition, so it is often better to select and fetch data from different partitions one by one.
And don't worry about 1 million records. I have used partitioning for tables far larger than this and it works fine.
Good luck.
What is the best practice for including Created By, Created Timestamp, Modified By, Modified Timestamp into a dimensional model?
The first two never change. The last two will change slowly for some data elements but rapidly for other data elements. However, I'd prefer a consistent approach so that reporting users become familiar with it.
Assume that I really only care about the most recent value; I don't need history.
Is it best to put them into a dimension knowing that, for highly-modified data, that dimension is going to change often? Or, is it better to put them into the fact table, treating the unchanging Created information much the same way a sales order number becomes a degenerate dimension?
In my answer I will assume that these additional columns do NOT define the validity of the dimension record, and that you are talking about a Slowly Changing Dimension type 1.
So we are in fact talking about dimensional metadata here, about who / which process created or modified the dimensional row.
I would always put this kind of metadata in the dimension, because:
It is related to changes in the dimension, and these changes happen independently of the fact table.
It keeps the fact table small. In general it is advised to keep fact tables as small as possible: if your fact table contained 5 dimensions, this would mean adding 5 * 4 = 20 extra columns to the fact table, which would seriously bloat it and hurt performance.
I need some suggestions on using d3.js for visualizing big data. I am pulling data from HBase and storing it in a JSON file for visualization with d3.js. When I pull a few hours of data, the JSON file is around 100MB and can easily be visualized with d3.js, but the filtering using dc.js and crossfilter is a little slow. However, when I pull a dataset covering one week, the JSON file grows to more than 1GB, and when I try to visualize it using d3.js, dc.js and crossfilter, the visualization does not work properly and the filtering is no longer possible. Can anyone give me an idea whether there is a good solution to this, or do I need to work with a different platform instead of d3?
I definitely agree with what both Mark and Gordon have said before. But I must add what I have learnt in the past months as I scaled up a dc.js dashboard to deal with pretty big datasets.
One bottleneck is, as pointed out, the size of your datasets when it translates into thousands of SVG/DOM or Canvas elements. Canvas is lighter on the browser, but you still have a huge amount of elements in memory, each with their attributes, click events, etc.
The second bottleneck is the complexity of your data. The responsiveness of dc.js depends not only on d3.js, but also on crossfilter.js. If you inspect the Crossfilter example dashboard, you will see that the size of the data they use is quite impressive: over 230000 entries. However, the complexity of those data is rather low: just five variables per entry. Keeping your datasets simple helps scaling up a lot. Keep in mind that five variables per each entry here means about one million values in the browser's memory during visualization.
Final point, you mention that you pull the data in JSON format. While that is very handy in Javascript, parsing and validating big JSON files is quite demanding. Besides, it is not the most compact format. The Crossfilter example data are formatted as a really simple and tight CSV file.
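If you do decide to switch to CSV, a small server-side conversion step is usually enough. Here is a rough sketch in Python; the field names are hypothetical, so adjust them to whatever your HBase extract actually contains:

```python
import csv
import json

# Convert pulled records from a JSON array of objects into a flat CSV
# before serving them to the browser. The field names ("timestamp",
# "metric", "value") are hypothetical.
def json_to_csv(json_path, csv_path, fields=("timestamp", "metric", "value")):
    with open(json_path) as f:
        records = json.load(f)

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

# json_to_csv("week_of_data.json", "week_of_data.csv")
```

d3.csv can then load the resulting file directly, and you skip the heavy JSON parse in the browser.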
In summary, you will have to find the sweet spot between size and complexity of your data. One million data values (size times complexity) is perfectly feasible. Increase that by one order of magnitude and your application might still be usable.
As @Mark says, canvas versus DOM rendering is one thing to consider. For sure, the biggest expense in web visualization is DOM elements.
However, to some extent crossfilter can mitigate this by aggregating the data into a smaller number of visual elements. It can get you up into the hundreds of thousands of rows of data. 1GB might be pushing it, but 100s of megabytes is possible.
But you do need to be aware of what level you are aggregating at. So, for example, if it's a week of time series data, probably bucketing by the hour is a reasonable visualization, for 7*24 = 168 points. You won't actually be able to perceive many more points, so it is pointless asking the browser to draw thousands of elements.
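For example, a server-side pre-aggregation step along these lines would produce those hourly buckets before the data ever reaches the browser. This is a rough sketch using pandas, with hypothetical column names:

```python
import pandas as pd

# Pre-aggregate a week of raw events into hourly buckets on the server,
# so the browser only has to draw about 7 * 24 = 168 points instead of
# millions of rows. Column names ("ts", "value") are hypothetical.
def hourly_buckets(csv_path):
    df = pd.read_csv(csv_path, parse_dates=["ts"])
    return (
        df.set_index("ts")
          .resample("1h")["value"]
          .agg(["count", "sum", "mean"])  # keep whichever aggregates the charts need
          .reset_index()
    )

# hourly_buckets("week_of_data.csv").to_csv("week_hourly.csv", index=False)
```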
I'm looking for a way to quickly obtain an accurate average inventory for reporting purposes.
BACKGROUND:
I'm looking for a way to determine Gross Margin Return On Investment (GMROI) for inventory items where the inventory levels are not constant over time (i.e. some items may be out of stock and then overstocked, whilst others will be constant and never out of stock).
GMROI = GrossProfit / AverageInventory
say, over 1 year.
These need to be obtained on the fly; batch processing is not an option.
THE PROBLEM:
Given that the relational database used only has current stock levels,
I can calculate back to a historic stock level, say:
HistoricStock = CurrentStock - Purchases + Sales
But I really want an average inventory, not just a single point in time.
I could calculate back a series of points and then average them, but I'm worried about the calculation overhead (and, to a lesser extent, the accuracy), given that I want to do this on the fly.
I could create a data warehouse and bank the data, but I'm concerned about blowing out the database size (i.e. stock holding per barcode per location per day for, say, 2 years).
From memory, the integral of the inventory/time graph divided by the time interval gives the average inventory, but how do you integrate real-world data without a formula, or without lots of small time strips?
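Thinking about it a bit more: stock levels are piecewise constant between movements, so that integral should reduce to an exact sum of level * duration rather than lots of small strips. Something along these lines, with hypothetical field names:

```python
from datetime import datetime

# Time-weighted average inventory over [start, end]. Stock is piecewise
# constant between movements, so the "integral" is an exact sum of
# level * duration; no small time strips are needed.
# `movements` is a hypothetical list of (timestamp, quantity_delta)
# tuples, where purchases are positive and sales are negative.
def average_inventory(current_stock, movements, start, end):
    total = 0.0            # accumulates stock_level * seconds
    level = current_stock  # stock level in force just below `boundary`
    boundary = end

    for ts, delta in sorted(movements, reverse=True):
        if ts >= end:
            level -= delta                     # unwind movements after the window
            continue
        if ts < start:
            break
        total += level * (boundary - ts).total_seconds()
        level -= delta                         # level just before this movement
        boundary = ts

    total += level * (boundary - start).total_seconds()
    return total / (end - start).total_seconds()

# average_inventory(
#     current_stock=120,
#     movements=[(datetime(2020, 3, 1), -5), (datetime(2020, 6, 1), 30)],
#     start=datetime(2020, 1, 1),
#     end=datetime(2021, 1, 1),
# )
```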
Any ideas or references would be appreciated.
Thanks B.
In general this seems like a good case for developing an inventory fact table, but exactly how you would implement it depends a lot on your data and source systems.
If you haven't already, I would get the Data Warehouse Toolkit; chapter 3 is about inventory data management. As you mentioned, you can create an inventory fact table and load a daily snapshot of inventory levels from the source system, then you can easily calculate whatever averages you need from the data warehouse, not from the source system.
You mentioned that you're concerned about the volume of data, although you didn't say how many rows per day you would add. But data warehouses can be designed to handle very large tables using table partitioning or similar techniques, and you could also calculate "running averages" after adding each day's data if the calculation takes a very long time for any reason.
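To illustrate the running-average idea, here is a minimal sketch; the per-item structure and the snapshot format are hypothetical:

```python
# Maintain running averages per item as each day's snapshot is loaded,
# so reports never have to rescan the full snapshot history. `running`
# maps item_id -> (cumulative_stock, days_counted); both the structure
# and the snapshot format are hypothetical.
def update_running_averages(running, todays_snapshot):
    for item_id, stock_on_hand in todays_snapshot.items():
        total, days = running.get(item_id, (0, 0))
        running[item_id] = (total + stock_on_hand, days + 1)
    return running

def running_average(running, item_id):
    total, days = running[item_id]
    return total / days if days else 0.0

# running = {}
# update_running_averages(running, {"SKU-1": 120, "SKU-2": 35})
# update_running_averages(running, {"SKU-1": 110, "SKU-2": 40})
# running_average(running, "SKU-1")   # -> 115.0
```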
I work with a web app which stores transaction data (e.g. like "amount x on date y", but more complicated) and provides calculation results based on details of all relevant transactions[1]. We are investing a lot of time into ensuring that these calculations perform efficiently, as they are an interactive part of the application: i.e. a user clicks a button and waits to see the result. We are confident that, for the current levels of data, we can optimise the database fetching and calculation to complete in an acceptable amount of time. However, I am concerned that the time taken will still grow linearly as the number of transactions grows[2]. I'd like to be able to say that we could handle an order of magnitude more transactions without excessive performance degradation.
I am looking for effective techniques, technologies, patterns or algorithms which can improve the scalability of calculations based on transaction data.
There are, however, real and significant constraints on any suggestion:
We currently have to support two highly incompatible database implementations, MySQL and Oracle. Thus, for example, database-specific stored procedures have roughly twice the maintenance cost.
The actual transactions are more complex than the example given, and the business logic involved in the calculation is complicated and regularly changing. Thus having the calculations stored directly in SQL is not something we can easily maintain.
Any of the transactions previously saved can be modified at any time (e.g. the date of a transaction can be moved a year forward or back) and calculations are expected to be updated instantly. This has a knock-on effect for caching strategies.
Users can query across a large space, in several dimensions. To explain, consider being able to calculate a result as it would stand at any given date, for any particular transaction type, where transactions are filtered by several arbitrary conditions. This makes it difficult to pre-calculate the results a user would want to see.
One instance of our application is hosted on a client's corporate network, on their hardware. Thus we can't easily throw money at the problem in terms of CPUs and memory (even if those are actually the bottleneck).
I realise this is very open ended and general, however...
Are there any suggestions for achieving a scalable solution?
[1] Where 'relevant' can be: the date queried for; the type of transaction; the type of user; formula selection; etc.
[2] Admittedly, this is an improvement over the previous performance, where an ORM's n+1 problems saw the time taken grow exponentially, or at least with a much steeper gradient.
I have worked against similar requirements, and have some suggestions. A lot of this depends on what is possible with your data. It is difficult to make every imaginable case quick, but you can optimize for the common cases and have enough hardware grunt available for the rest.
Summarise
We create summaries on a daily, weekly and monthly basis. For us, most of the transactions happen in the current day, but old transactions can also change. We keep a batch record and, under it, the individual transaction records. Each batch has a status to indicate whether the transaction summary (in table batch_summary) can be used. If an old transaction in a summarised batch changes, the batch is flagged as part of that transaction to indicate that the summary is not to be trusted. A background job will re-calculate the summary later.
Our software then uses the summary when possible and falls back to the individual transactions where there is no summary.
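To illustrate the read path, here is a rough sketch of the fallback logic; the field names and the two fetch functions (standing in for your data access layer) are hypothetical:

```python
# Use pre-computed batch summaries where they are still trusted, and
# aggregate the raw transactions only for batches whose summaries have
# been invalidated. Field names (summary_ok, total, amount) and the
# fetch_* callables are hypothetical.
def total_for_batches(batches, fetch_summary, fetch_transactions):
    total = 0
    stale_batches = []

    for batch in batches:
        if batch["summary_ok"]:
            total += fetch_summary(batch["id"])["total"]
        else:
            stale_batches.append(batch["id"])

    # Fall back to the detail rows only for the flagged batches.
    for batch_id in stale_batches:
        total += sum(txn["amount"] for txn in fetch_transactions(batch_id))

    return total
```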
We played around with Oracle's materialized views, but ended up rolling our own summary process.
Limit the Requirements
Your requirements sound very broad. There can be a temptation to put all the query fields on a web page and let the users choose any combination of fields and output results. This makes it very difficult to optimize. I would suggest digging deeper into what they actually need to do, or have done in the past. It may not make sense to query on very unselective dimensions.
In our application, certain queries are limited to a date range of no more than 1 month. We have also aligned some features to the date-based summaries, e.g. you can get results for the whole of Jan 2011, but not for 5-20 Jan 2011.
Provide User Interface Feedback for Slow Operations
On occasion we have found it difficult to optimize some things to take less than a few minutes. In those cases we shift the job off to a background server rather than have a very slow-loading web page. The user can fire off a request and go about their business while we work out the answer.
I would suggest using Materialized Views. Materialized Views allow you to store a view as you would a table, so all of the complex queries you need are pre-calculated before the user queries them.
The tricky part is, of course, updating the Materialized View when the tables it is based on change. There's a nice article about it here: Update materialized view when underlying tables change.
Materialized Views are not (yet) available without plugins in MySQL and are horribly complicated to implement otherwise. However, since you have Oracle, I would suggest checking out the link above for how to add a Materialized View in Oracle.