I have transactional data which contains customer information as well as stores they shopped from. I can count the number of different stores each customer used by a simple DISTINCTCOUNT([Site Name]) measure.
There are millions of customers and I want to make a simple summary table which shows the sum of # customers who visited X number of stores. Like a histogram. Maximum stores they visited is 6, minimum is 1.
I know there are multiple ways to do this but I am new to DAX and can't do what I think yet.
The easiest way:
Assuming your DISTINCTCOUNT([Site Name]) measure is called CustomerStoreCount ...
Add a new dimension table, StoreCount, to your model containing a single column, StoreCount. Populate it with the values 1,2,3,4,5,6 (... up to maximum number of stores.)
Create a measure, ThisStoreCount = MAX(StoreCount[StoreCount]).
Create a base customer count measure, TotalCustomers:=DISTINCTCOUNT(CustomerTable[Customer])
Create a contextual measure, CustomersWhoVisitedXNumberOfStores := CALCULATE ( TotalCustomers, FILTER(VALUES(CustomerTable[Customer]), ThisStoreCount = CustomerStoreCount) )
On your pivot table / reporting tool, etc. use StoreCount[StoreCount] on the axes and CustomersWhOVisitedXNumberOfStores as the measure.
So basically walk through the customer list (since there's no relationship between StoreCount and CustomerTable), compare that customer's CustomerStoreCount with the maximum StoreCount[StoreCount] value, which for each StoreCount[StoreCount] value is ... drum roll itself. If it matches, keep it, otherwise filter it out; you end up with a count of customers whose store visits equals the value of StoreCount[StoreCount].
And of course the more general modeling hint: when you want to display a metric by something (i.e. customer count by number of stores visited), that something is an attribute, not a metric.
Related
I'm trying to create a calculated column based on a derived measure in SSAS cube, this measure which will count the number of cases per order so for one order if it has 3 cases it will have the value 3.
Now I'm trying to create a bucket attribute which says 1caseOrder,2caseOrder,3caseOrder,3+caseOrder. I tried the below one
IF([nrofcase] = 1, "nrofcase[1]", IF([nrofcase] = 2, "nrofcase[2]",
IF([nrofcase] = 3, "nrofcase[3]", "nrofcase[>3]") )
But it doesn't work as expected, when the level of the report is changed from qtr to week it was suppose to recalculate on different level.
Please let me know if it case work.
Calculated columns are static. When the column is added and when the table is processed, the value is calculated and stored. The only way for the value to change is to reprocess the model. If the formula refers to a DAX measure, it will use the measure without any of the context from the report (eg. no row filters or slicers, etc.).
Think of it this way:
Calculated column is a fact about a row that doesn't change. It is known just by looking at a single row. An example of this is Cost = [Quantity] * [Unit Price]. Cost never changes and is known by looking at the Quantity and Unit Price columns. It doesn't matter what filters or context are in the report. Cost doesn't change.
A measure is a fact about a table. You have to look at multiple rows to calculate its value. An example is Total Cost = SUM(Sales[Cost]). You want this value to change depending on the context of time, region, product, etc., so it's value is not stored but calculated dynamically in the report.
It sounds like for your data, there are multiple rows that tell you the number of cases per order, so this is a measure. Use a measure instead of a calculated column.
I have created a matrix report that has a Row Group of Person and a column group of Subject with Child groups of SubmittedBy and Grading Name. The data that is being pulled through is a mixture of numerical and alphabetic data. I am needing to do a count per row of how many of a certain criteria appear. For example, I wish to count by person how many 1's they have achieved, or how many A* they have achieved (this is all grading data from a school). Is there a way to do this with matrix data?
If there is a way, can you then get a count using certain criteria, for example, if grading Name = focus count how many WB grades there are per person (row)?
This is the Design
This shows a subsection of data (as it is a very large dataset), but with SchoolID, Forename and Surname cut off to preserve anonymity.
A simple way would be to add Total to the Subject row group and use the count() function:
[
Tableau:
This may seem simple, but I ran out of the usual tricks I've used in other systems.
I want a variance column. Essentially adding a member 'Variance' to the Act/Plan dimension which only contains the members 'Actual' and 'Plan'
I've come in where the data structure and reporting is set up like so:
Actual | Plan
Profit measure
measure 2
measure 3
etc
The goal is to have a Variance column (calculated and not part of the Actual/Plan dimension)
Actual | Plan | Variance
Profit measure
measure 2
measure 3
etc
There are solutions where it works for one measure only, and I've looked into that.
ie, create calculated field as such
Profit_Actual | Profit_Plan | Variance
You put this on the columns, and you get a grid that I want... except a grid with only 1 measure.
This does not work if I want to run several measures on rows. Essentially the solution above will only display the Profit measure, not Measure 1_Actual , Measure 2_Plan etc.
So I tried a trick where I grouped a the 3 calculated measures, ie Profit_Actual | Profit_Plan | Profit_Variance as 'Profit_Measure'
Created a parameter list - 'Actual', 'Plan', 'Variance'
Now I can half achieve my goal, by having the parameter on columns and the 'Profit Measure' on Rows (so I can have Measure 123_group etc down on rows too). Trouble is, I found that parameters are single select only. Only if it can display all options in the custom paramater at once, I would've solved my problem.
Any ideas on how I can achieve the Variance column I want?
Virtually adding a member to a dimension/Calculated fieds/tricks/workaround
Thank you
Any leads is appreciated
Gemmo
Okay. First thing, I had a really hard time trying to understand how your data is organized, try to be more clear (say how each entry in your database looks like, and not how a specific view in Tableau looks like).
But I think I got it. I guess you have a collection of entries, and each entry has a number of measure fields (profits and etc.) and an Act/Plan field, to identify whether that entry is an actual value or a planned value. Is that correct?
Well, if that's the case, I'm sorry to say you have to calculate a variance field for each dimension. Think about it, how your original dataset is structured. Do you think you can add a single field "Variance" to represent the variance of each measure? Well, you can, store the values in a string, and then collect it back using some string functions, but it's not very practical. The problem is that each entry have many measures, if it had only 1 measure, than 1 single variance field would suffice.
So, if you can re-organize your data, what would be an easier to work set (but with many more entries) is something with the fields: Measure, Value, Actual/Plan. The measure field would have a string to identify what you're measuring in that entry. Value would be a number to represent the actual measure. And the Actual/Plan is the same. For instance:
Measure Value Actual/Plan
Profit 100 Actual
So, each line in your current model would become n entries, where n is the number of measures you have right now. So a larger dataset in a way, but easier to work with. Think about, now you can have a calculated field, and use some table calculations to calculate the variance only for that measure and/or Actual/Plan. Just use WINDOW_VAR, and put Measure and/or Actual/Plan in the partition.
Table calculations are awesome, take a look at this to understand it better. http://onlinehelp.tableausoftware.com/current/pro/online/en-us/help.htm#calculations_tablecalculations_understanding_addressing.html
I generally like to have my data staged such that Actual is its own column and Plan is its own column in the data being fed to Tableau. It makes calculations so much easier.
If your data is such that there is a column called "Actual/Plan" and every row is populated with either "Actual" or "Plan" and there is another column called "Value" or "Measure" that is populated with the values, you can force Tableau to make them columns assuming you can't or won't rearrange your data.
Create a calculated field called "Actual" with the following calc:
IF [Actual/Plan] = 'Actual' THEN [Value] END
Similarly, create a calculated field called "Plan" with the following calc:
IF [Actual/Plan] = 'Plan' THEN [Value] END
Now, you can finally create your "Variance" and "Variance %" calculations (respectively):
SUM([Actual]) - SUM([Plan])
[Variance] / SUM([Plan])
We have a scenario like this:
Millions of records (Record 1, Record 2, Record 3...)
Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
To which group did Record 15544 belong, three years ago? (Call this Group g).
What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
Find all records that have at any time belonged to Group g.
For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.
How about creating two tables:
(recordID, time-> groupID) - key is recordID, time - sorted by
recordID, and secondary by time (Let that be map1)
(groupID, time-> List) - key is groupID, time - sorted by
recordID, and secondary by time (Let that be map2)
At each record change:
Retrieve the current groupID of the record you are changing
set t <- current time
create a new entry to map1 for old group: (oldGroupID,t,list') - where list' is the same list, but without the entry you just moved out from there.
Add a new entry to map1 for new group: (newGroupId,t,list'') - where list'' is the old list for the new group, with the changed record added to it.
Add a new entry (recordId,t,newGroupId) to map1
During query:
You need to find the entry in map2 that is 'closest' and smaller than
(recordId,desired_time) - this is classic O(logN) operation in
sorted data structure.
This will give you the group g the element belonged to at the desired time.
Now, look in map1 similarly for the entry with key closest but smaller than (g,desired_time). The value is the list of all records that are at the group at the desired time.
This requires quite a bit of more space (at constant factor though...), but every operation is O(logN) - where N is the number of record changes.
An efficient sorted DS for entries that are mostly stored on disk is a B+ tree, which is also implemented by many relational DS implementations.
This is applicable to Google App Engine, but not necessarily constrained for it.
On Google App Engine, the database isn't relational, so no aggregate functions (such as sum, average etc) can be implemented. Each row is independent of each other. To calculate sum and average, the app simply has to amortize its calculation by recalculating for each individual new write to the database so that it's always up to date.
How would one go about calculating percentile and frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
Is there some algorithm to calculate or approximate density/frequency/percentile distribution that can be calculated over a period of time?
By the way, the data is indeterminate in that the maximum and minimum may be all over the place. So the distribution would have to take approximately 95% of the data and only do a density based on that.
Getting the whole row (with that limit of 1000 at a time...) over and over again in order to get a single number per row is sure unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (to a limit of I believe 1 MB per query, so with 4-byte numbers no more than 250,000 numbers per list).
So when adding a number also fetch the latest "added data values list" entity, if full make a new one instead, append the new number, save it. Probably no need to be transactional if a tiny error in the statistics is no killer, as you appear to imply.
If the data for an item can be changed have separate entities of the same kind recording the "deleted" data values; to change one item's value from 23 to 45, add 23 to the latest "deleted values" list, and 45 to the latest "added values" one -- this covers item deletion as well.
It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
This is the most obvious approach to me, why are you are you trying to avoid it?