QlikView - Performance improvement with 35M+ records

I am stuck on a performance and latency issue in a QlikView dashboard with 35M records.
I would like to know how to improve my model / front-end expressions to speed things up.
To give some perspective:
I have daily weather data for 8 years, by state, for 20+ countries.
Table fields:
State_code
Area polygon (from KML file)
Date
Stat (Min/Max/Avg)
Weather_field (Temperature / Wind Direction / Pressure)
Value (numeric)
The user can select any date range within the 8 years, plus filters on Stat, Country, Weather_field, etc.
I used nested IFs in the front-end expression to cover every possible combination, e.g. if Stat = Max and Weather_field = Temperature and the date falls within the selected range, calculate max(Value).
How can I improve this? Please help.
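For illustration only (this is Python/pandas, not QlikView script, and the column names simply mirror the fields listed above): the general idea is to filter the fact table once by the user's selections and then aggregate, rather than evaluating a nested conditional for every combination.

    import pandas as pd

    def aggregate_weather(df, stat, weather_field, start, end):
        """Filter once by the user's selections, then aggregate Value.

        df is assumed to hold the fields from the question: State_code,
        Date (datetime), Stat, Weather_field, Value.
        """
        mask = (
            (df["Stat"] == stat)
            & (df["Weather_field"] == weather_field)
            & (df["Date"] >= start)
            & (df["Date"] <= end)
        )
        agg = {"Max": "max", "Min": "min", "Avg": "mean"}[stat]
        return df.loc[mask].groupby("State_code")["Value"].agg(agg)

    # Example: max temperature per state over a selected date range
    # result = aggregate_weather(df, "Max", "Temperature", "2015-01-01", "2015-12-31")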

Related

PowerBI - Displaying the average of row figures in a matrix

I've been Googling around this problem for hours and haven't found a solution that suits my needs.
I have a large data set with agent activities and the total time in seconds each activity lasts. I'm pulling this together in a matrix, to display agent names on the left and the start date of each week across the top like so:
This is working as intended (I've used a measure to convert the seconds into hours) but I need the average of the displayed weeks as another column, or to replace the Total column.
I've tried solutions involving DAX measures but none are applicable, likely because I'm using a custom column (WeekStart) to roll up my numbers into weeks. Adding more complexity, I have 2 filters on the matrix: one to exclude any weeks older than 5 weeks in the past and another to exclude any future weeks.
In Excel I'd just add another column next to the table, averaging the 5 cells to the left of it. I could add it to the data table with a SUMIFS checking the Activity date is within the week range and dividing the result by 5. I can't do either of these in PowerBI and I'm new to the software so I'm at a loss as to how to do this.
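For what it's worth, here is a minimal pandas sketch (not DAX) of the calculation being asked for, assuming hypothetical columns Agent, WeekStart and DurationSeconds; missing weeks are skipped by the mean rather than counted as zero:

    import pandas as pd

    def weekly_average_hours(df, weeks_back=5):
        """Average hours per agent over the last `weeks_back` weekly buckets.

        Assumed (hypothetical) columns: Agent, WeekStart (datetime),
        DurationSeconds. Mirrors the matrix filters: drop future weeks
        and weeks older than `weeks_back` weeks.
        """
        today = pd.Timestamp.today().normalize()
        cutoff = today - pd.Timedelta(weeks=weeks_back)
        recent = df[(df["WeekStart"] >= cutoff) & (df["WeekStart"] <= today)]

        # Weekly totals per agent (seconds -> hours), one column per week
        weekly = (
            recent.groupby(["Agent", "WeekStart"])["DurationSeconds"]
            .sum()
            .div(3600)
            .unstack("WeekStart")
        )
        # Row average of the displayed weeks, like the extra Excel column
        weekly["AverageOfWeeks"] = weekly.mean(axis=1)
        return weekly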

DAX formula, CROSSFILTER function not returning expected result

I'm obtaining wrong results from a DAX formula and I can't understand why.
In my database I have articles that are composed of multiple tools, which are produced from blank tools. One blank can be used to produce multiple tools. I need to calculate blank sales over 3 time periods: last 6, last 12 and last 24 months.
This is my Power BI model:
The time period table I used for the time period slicer and the measure look like this:
To obtain the Blank's sales volumes, I created 3 measures:
When I use the last formula, which I thought would return the right amount of Blanks sold by article and time period, I obtain strange results.
When I select "last 24 months" time period, everything looks fine:
When I select "Last 12 months", the total is fine, but the total by article is wrong:
Finally, if I select "Last 6 months" time period, all the results are totally wrong:
The curious thing is that I checked the result by running a SQL query on the database, and the DAX formula does return the right result (1466 for the selected time period), but only when used in a card, without filtering by article number.
I have no other filters that affect the visuals.
Could you help me understand why I'm not obtaining the right result, or suggest a better way to reach the desired results?
I'm guessing (at least part of) the problem is that you are backing up from different end dates because LASTDATE(Sales[DocumentDate]) can return different values for different ArticleNo.
I'm not sure what value you actually want for that date, possibly LASTDATE('Dates Table'[Date]), but I'm pretty sure you want it consistent across different ArticleNo.
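To make that concrete, here is a small pandas sketch (not DAX) of how a per-article last date shifts the look-back window, while a single shared end date keeps it consistent; the data and column names are made up for illustration:

    import pandas as pd

    sales = pd.DataFrame({
        "ArticleNo": ["A", "A", "B", "B"],
        "DocumentDate": pd.to_datetime(
            ["2021-01-15", "2021-06-30", "2020-11-01", "2021-03-10"]),
        "Qty": [10, 5, 7, 3],
    })
    window = pd.DateOffset(months=6)

    # Per-article equivalent of LASTDATE(Sales[DocumentDate]): each article
    # gets its own end date, so "last 6 months" starts from a different
    # point per article.
    per_article_end = sales.groupby("ArticleNo")["DocumentDate"].max()
    inconsistent = sales[sales.apply(
        lambda r: r["DocumentDate"] > per_article_end[r["ArticleNo"]] - window,
        axis=1)]

    # Consistent end date across all articles (e.g. the last date of the
    # dates table / selected period).
    global_end = sales["DocumentDate"].max()
    consistent = sales[sales["DocumentDate"] > global_end - window]

    print(inconsistent.groupby("ArticleNo")["Qty"].sum())  # A: 15, B: 10
    print(consistent.groupby("ArticleNo")["Qty"].sum())    # A: 15, B: 3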

Functions to calculate max wind speed, average wind speed and median

I am totally new to data structures and algorithms. While learning and trying to pick up the various functions, I was asked to propose data structures for 5 requirements based on sample data in a CSV file.
Each file contains a year's worth of data for multiple sensors. Data for each date-time recording are on separate rows. Within each row, the value for each sensor is separated by a comma. There are a total of 105,120 rows per year/file. Currently the client has 10 years of data, which is about a million records.
I am supposed to find out:
The maximum wind speed of a specified month and year.
The median wind speed of a specified year.
Average wind speed for each month of a specified year. Display the data in the order of month (Jan, Feb, Mar, ...)
Total solar radiation for each month of a specified year. Display the data in a descending order of the solar radiation (i.e. month with the highest total solar radiation will display first).
Given a date, show the times for the highest solar radiation for that date. There can be one or more time values with the same highest solar radiation. Display the list of times in descending order (e.g. 24:00, 23:00, 22:00, etc.)
As I am new to data structures, I have been thinking hard about which algorithms to propose for the above.
I am thinking I can use:
a BST (binary search tree) to solve question 1;
linear for question 2;
constant to sort and linear to find the average for question 3;
linear for both questions 4 and 5.
Does anyone have a better suggestion or sample pseudocode to share, or advice on how I should start?
Regards, Heaptie
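As a starting point (assumptions flagged below), here is a minimal Python sketch of a plain linear-scan approach for tasks 1-3; at roughly a million rows a single pass per query (plus one sort for the median) is enough, and no BST is needed. The same accumulate-per-month pattern also covers tasks 4 and 5. The column names 'Timestamp' and 'WindSpeed' are hypothetical; adjust them to the real CSV header.

    import csv
    from collections import defaultdict
    from statistics import median

    def load_rows(path):
        """Yield (timestamp_string, wind_speed) pairs from the CSV.

        Assumes hypothetical columns 'Timestamp' (ISO format, e.g.
        '2020-03-15 10:30') and 'WindSpeed'.
        """
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row["Timestamp"], float(row["WindSpeed"])

    def max_wind_speed(path, year, month):
        prefix = f"{year:04d}-{month:02d}"
        return max(s for ts, s in load_rows(path) if ts.startswith(prefix))

    def median_wind_speed(path, year):
        return median(s for ts, s in load_rows(path) if ts.startswith(f"{year:04d}"))

    def monthly_average_wind_speed(path, year):
        totals, counts = defaultdict(float), defaultdict(int)
        for ts, s in load_rows(path):
            if ts.startswith(f"{year:04d}"):
                month = int(ts[5:7])
                totals[month] += s
                counts[month] += 1
        # Ordered by month: Jan, Feb, Mar, ...
        return {m: totals[m] / counts[m] for m in sorted(totals)}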

Multidimensional analysis in Hive/Impala

I have a denormalized table, say Sales, that looks like:
SalesKey
SalesOfParts, SalesOfEquipments, CostOfSales as numeric measures
Industry, Country, State, Sales area, Equipment id, Customer id, Year of sale, Month of sale and some more similar dimensions (12 dimensions in total).
I need to support aggregation queries on Sales, like the total number of sales in a year or month, the total cost, etc.
These aggregates also need to be filterable, e.g. total sales in 2013-04 for the Manufacturing industry and customer XYZ.
I have these dimension tables and facts in hive/impala.
I do not think I can build a cube over all the dimensions. I read a paper on how to do OLAP over many dimensions:
http://www.vldb.org/conf/2004/RS14P1.PDF
It basically suggests materializing cubes over small dimension fragments and doing some runtime computation when a query spans multiple cubes.
I am not sure how to implement this model in Hive/Impala. Any pointers/suggestions will be awesome.
EDIT: I have about 10 million rows in the Sales table. The number of dimensions is nowhere near 100; it is around 12 (might go up to 15), but each has a fairly high cardinality.
I would build the cubes using third-party software. For example, icCube is an in-memory OLAP server that can handle 10 million rows over 12 dimensions without any issue, with sub-second response times across all dimensions. Moving 10 million rows out of Hive does not seem to be a problem (you could use the JDBC driver for that purpose). icCube is specifically designed to handle high sparsity well.
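If you prefer to stay inside Hive/Impala, here is a rough sketch of the fragment idea from the paper, written in Python/pandas purely for illustration (in practice each fragment would be a CREATE TABLE ... AS SELECT ... GROUP BY summary table in Impala); dimension and measure names are hypothetical:

    import pandas as pd
    from itertools import combinations

    # Hypothetical dimension and measure columns of the Sales table
    DIMENSIONS = ["Industry", "Country", "CustomerId", "YearOfSale", "MonthOfSale"]
    MEASURES = ["SalesOfParts", "SalesOfEquipments", "CostOfSales"]

    def materialize_fragments(sales, max_dims=3):
        """Pre-aggregate the fact table over every small dimension subset.

        Returns a dict mapping a tuple of dimensions to its aggregate table;
        in Hive/Impala each entry would be a materialized summary table.
        """
        fragments = {}
        for k in range(1, max_dims + 1):
            for dims in combinations(DIMENSIONS, k):
                fragments[dims] = sales.groupby(list(dims))[MEASURES].sum()
        return fragments

    def query(fragments, filters):
        """Answer a filtered aggregate from the smallest covering fragment."""
        for dims, frag in sorted(fragments.items(), key=lambda kv: len(kv[0])):
            if set(filters) <= set(dims):
                out = frag.reset_index()
                for col, val in filters.items():
                    out = out[out[col] == val]
                return out[MEASURES].sum()
        raise ValueError("no materialized fragment covers these dimensions")

    # Example: total sales in 2013-04 for the Manufacturing industry
    # fragments = materialize_fragments(sales_df)
    # totals = query(fragments, {"YearOfSale": 2013, "MonthOfSale": 4,
    #                            "Industry": "Manufacturing"})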

Best approach: transfer daily values from one year to another

I will try to explain what I want to accomplish. I am looking for an algorithm or approach, not the actual implementation in my specific system.
I have a table with actuals (incoming customer requests) on a daily basis. These actuals need to be "copied" into the next year, where they will be used as a basis for planning the amount of requests in the future.
The smallest timespan for planning, on a technical basis, is a "period", which consists of at least one day. A period boundary always falls at the end of a week or at the end of a month. This means that if a week lies in both May and June, it will be split into two periods.
Here's an example:
2010-05-24 - 2010-05-30 Week 21 | Period_Id 123
2010-05-31 - 2010-05-31 Week 22 | Period_Id 124
2010-06-01 - 2010-06-06 Week 22 | Period_Id 125
We did this to reduce the amount of data, because we have a few thousand items, each with 365 daily values. For planning, this is reduced to "a few thousand x 65" (or whatever the period count is per year). I can aggregate a month or a week by combining all periods that belong to it. The important thing is that I could still use daily values, then find the corresponding period and add the value there if necessary.
What I need is an approach for aggregating the actuals for every (working) day, week or month into next year's equivalent period. My requirements are not fixed here. The actuals have a certain distribution, because certain deadlines and habits are reflected in the data. I would like to preserve this as far as possible, but planning is never completely accurate, so I can make a compromise here.
Don't know if this is what you're looking for, but this is a strategy for calculating the forecasts using flexible periods:
First define a mapping for each day in next year to the corresponding day in this year. Then when you need a forecast for period x you take all days in that period and sum the actuals for the matching days.
With this you can precalculate every week/month but create new forecasts if the contents of periods change.
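A minimal Python sketch of that strategy, assuming the actuals are a dict keyed by date and that "corresponding day" means the same weekday near the same calendar position (both are assumptions; swap in whatever mapping rule fits your business):

    from datetime import date, timedelta

    def corresponding_day(next_year_day):
        """Map a day in next year to this year's day with the same weekday,
        close to the same calendar position. (Feb 29 would need special
        handling; the rule itself is an assumption.)"""
        candidate = next_year_day.replace(year=next_year_day.year - 1)
        shift = (next_year_day.isoweekday() - candidate.isoweekday()) % 7
        if shift > 3:
            shift -= 7
        return candidate + timedelta(days=shift)

    def forecast_for_period(actuals, period_days):
        """Sum this year's actuals over the days that map to the period.

        actuals: dict of date -> value for this year.
        period_days: iterable of next-year dates that make up the period.
        """
        return sum(actuals.get(corresponding_day(d), 0) for d in period_days)

    # Example: a one-day period next year, like period 124 in the question
    # forecast = forecast_for_period(actuals, [date(2011, 5, 31)])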
Map weeks to weeks. The first full week of this year to the first full week of the next. Don't worry about "periods" and aggregation; they are irrelevant.
Where a missing holiday leaves a hole in the data, just take the values for the same day of the previous week or the next week, and do the same at the beginning/end of the year.
Now for each day of the week, combine the results for the year and look for events more than, say, two standard deviations from the mean (if you don't know what that means then skip this step), and look for correlations with known events like holidays. If a holiday doesn't show an effect in this test then ignore it. If you find an effect, shift it to compensate for the different date next year. Don't worry about higher-order effects, you don't have enough data to pin them down.
Now draw in periods wherever you like and aggregate all you want.
Don't make any promises about the accuracy of these predictions; there's no way to know it. Don't worry about whether this is the best possible way; it isn't, but it's as good as any you're likely to find. You can spend as much more time and effort fine-tuning this as you wish; it might raise expectations, but it's not likely to make the results much more accurate, and it's about as likely to make them worse.
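A small sketch of the outlier check described above, in Python, assuming the year's actuals are a dict of date -> value (illustrative only):

    from statistics import mean, stdev

    def flag_unusual_days(daily_values, threshold=2.0):
        """Group values by weekday and flag days more than `threshold`
        standard deviations from that weekday's mean.

        daily_values: dict mapping datetime.date -> numeric actual.
        Returns (date, value) pairs worth checking against known events
        such as holidays.
        """
        by_weekday = {}
        for day, value in daily_values.items():
            by_weekday.setdefault(day.isoweekday(), []).append(value)

        stats = {wd: (mean(vals), stdev(vals))
                 for wd, vals in by_weekday.items() if len(vals) > 1}

        flagged = []
        for day, value in daily_values.items():
            m, s = stats.get(day.isoweekday(), (value, 0.0))
            if s > 0 and abs(value - m) > threshold * s:
                flagged.append((day, value))
        return sorted(flagged)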
There is no a priori way to answer that question. You have to look at your data and decide what the important parameters are (day of week, week number, month, season, outside temperature?) based on the results.
For example, if many of your customers are Jewish or Muslim, then the Gregorian calendar, ISO week numbers and all that won't help you much, because Jewish and Muslim holidays (and therefore user behaviour) are determined by other calendars.
Another example: trying to predict iPhone search volume from last year's searches doesn't sound like a good idea. The important timescales seem to be much longer than a year (the technology becoming mainstream over the years) and much shorter than a year (specific events that affect us for days to weeks).
