olap4j - calculations on member grouping - Mondrian

I'm trying to write an olap4j (Mondrian) query that will group the rows by ranges.
Assume we have counts of cards per child and each child's age.
I want to sum the card counts by age range, so I will have totals for ages 0-5, 5-10, 10-15 and so on.
Can this be done with olap4j?

You need to define calculated members for that:
WITH MEMBER [Age].[0-4] AS Aggregate([Age].[0] : [Age].[4])
MEMBER [Age].[5-9] AS Aggregate([Age].[5] : [Age].[9])
etc.
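Since the question asks about olap4j specifically, here is a minimal sketch of running such a query through olap4j against Mondrian. The JDBC URL, schema path, cube name ([Cards]) and measure name ([Measures].[Card Count]) are placeholders, not part of the original question:

import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;
import org.olap4j.Position;

public class AgeRangeQuery {
    public static void main(String[] args) throws Exception {
        // Mondrian's olap4j driver; connection string and schema path are placeholders.
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
        Connection connection = DriverManager.getConnection(
                "jdbc:mondrian:Jdbc=jdbc:mysql://localhost/cards_db;"
                + "Catalog=file:/path/to/CardsSchema.xml;");
        OlapConnection olapConnection = connection.unwrap(OlapConnection.class);

        // The calculated members aggregate the individual age members into ranges.
        String mdx =
              "WITH MEMBER [Age].[0-4] AS Aggregate([Age].[0] : [Age].[4]) "
            + "     MEMBER [Age].[5-9] AS Aggregate([Age].[5] : [Age].[9]) "
            + "SELECT {[Age].[0-4], [Age].[5-9]} ON COLUMNS, "
            + "       {[Measures].[Card Count]} ON ROWS "
            + "FROM [Cards]";

        OlapStatement statement = olapConnection.createStatement();
        CellSet cellSet = statement.executeOlapQuery(mdx);

        // One cell per age range: print the range name and the summed card count.
        for (Position column : cellSet.getAxes().get(0).getPositions()) {
            for (Position row : cellSet.getAxes().get(1).getPositions()) {
                System.out.println(column.getMembers().get(0).getName() + ": "
                        + cellSet.getCell(column, row).getFormattedValue());
            }
        }
        connection.close();
    }
}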
Alternatively, you may want to re-design your dimension table. I'm guessing you have age as a degenerate dimension in the fact table. I suggest creating a separate dimension dim_age with a structure like this:
age_id, age, age_group
0, null, null
1, 0, 0-4
2, 1, 0-4
(...)
Then it's easy to define the first level of the dimension based on age_group.

Related

Sum attributes of related tables after performing division on them

I couldn't come up with an appropriate title; please excuse that.
The situation is the following:
I've got two tables: montages and orders, where Montage belongs to Order.
My goal is to build a single MySQL query that returns a single float value representing a sum of values over multiple montages. For each montage in the query I need to divide the budget of its order by the number of montages which belong to the same order. The result of this division should be an attribute of each montage. Finally, I want to sum those attributes and retrieve a single value.
I've tried a lot of variation of something like the following, but none seemed to be written in correct syntax, so I kept getting errors:
$sum = App\Montage::where(/*this doesn't matter*/)
->join('orders', 'montages.order_id', '=', 'orders.id') //join the orders table
->select('montages.*, orders.budget') //include the budget column
->selectRaw('count(* where order_id = order_id) as all') //count all the montages of the same order and assign that count to the current montage
->selectRaw('(orders.budget / all) as daily_payment') //divide the budget of the order by the count; store the result as `daily_payment`
->sum('daily_payment') //sum the daily payments
I'm really lost with the proper syntax and can't figure it out. I'd estimate this to be a rather trivial SQL task for people who know their stuff, but unfortunately I don't seem to be one of them... Any help is greatly appreciated!

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is, how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given the "userA, productX" style input above, how do I convert it to something like
0,0
1,1
0,2
where userA is mapped to row 0 of the matrix, userB is mapped to row 1, productX is mapped to column 0 and so on.
Given the size of the data, I would have to use Hadoop MapReduce, but I can't think of a foolproof way of doing this efficiently.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily, but I am having trouble coming up with an efficient approach to (3); (4) will be solved once we solve (3).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
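For reference, a minimal sketch of what that single reducer could look like with Hadoop's org.apache.hadoop.mapreduce API (class and type choices are illustrative); it only makes the enumeration explicit, the bottleneck stays the same:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Every userId was emitted under the same key, so this one reducer sees all of them
// and numbers them with a single running counter.
public class EnumerateUserIdsReducer extends Reducer<Text, Text, Text, LongWritable> {

    private long counter;

    @Override
    protected void setup(Context context) {
        counter = 0; // as described above: initialize the counter in setup()
    }

    @Override
    protected void reduce(Text key, Iterable<Text> userIds, Context context)
            throws IOException, InterruptedException {
        for (Text userId : userIds) {
            context.write(new Text(userId), new LongWritable(counter));
            counter++;
        }
    }
}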
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1,2,3....N (where N is configurable. N = 100 for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrix row Id)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed.
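For illustration, here is a rough sketch of the mapper side of solution 2; the partition count N, the counter group name, and the input format are assumptions. The per-partition counts gathered in the counters would still have to be handed to the reducers somehow (for example by reading the finished job's counters in a second pass) before the start offsets can be computed:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assigns each userId to one of N random partitions and counts the partition sizes.
public class PartitionUserIdsMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int N = 100;                 // number of partitions, configurable
    private final Random random = new Random();
    private final IntWritable partition = new IntWritable();
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input: the dump of unique userIds from step (1), one id per line.
        userId.set(value.toString().trim());

        int p = random.nextInt(N);                    // uniformly sampled partition
        partition.set(p);
        context.write(partition, userId);

        // One Hadoop counter per partition; the reducers need these totals
        // to compute start_of_partition for their row ids.
        context.getCounter("partition_sizes", String.valueOf(p)).increment(1);
    }
}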
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

RethinkDB: Can I group by fields between dates efficiently?

I'd like to group by multiple fields, between two timestamps.
I tried something like:
r.table('my_table').between(r.time(2015, 1, 1, 'Z'), r.now(), {index: "timestamp"}).group("field_a", "field_b").count()
This takes a lot of time since my table is pretty big. I started thinking about using an index in the group part of the query, but then I remembered that it's impossible to use more than one index in the same ReQL query.
Can I achieve what I need efficiently?
You could create a compound index, and then efficiently compute the count for any of the groups without computing all of them:
r.table('my_table').indexCreate('compound', function(row) {
  return [row('field_a'), row('field_b'), row('timestamp')];
})
r.table('my_table').between(
  [field_a_val, field_b_val, r.time(2015, 1, 1, 'Z')],
  [field_a_val, field_b_val, r.now()],
  {index: 'compound'}
).count()

Calculated Measure in Analysis Services

AdventureWorksDW has the Financial Reporting fact table. I have a similar fact table where each fact row contains only the FKs to the dimension tables and a value. The measure gets its context from a DimAccount dimension. Are there any code samples that show how to compute a simple ratio in a calculated member between two measures of the AdventureWorks Financial Reporting sample?
So basically I would like to see say Total Long term Debt / Total Assets from AdventureWorksDW? What I need is the expression or MDX.
Thanks in advance.
Use a query like this:
with member [Account].[Accounts].[Balance Sheet].[Debt by Assets] as
IIf([Account].[Accounts].[Assets] <> 0,
[Account].[Accounts].[Long Term Liabilities] / [Account].[Accounts].[Assets],
null
)
,format_string = "0.00%"
select {
[Account].[Accounts].[Assets],
[Account].[Accounts].[Long Term Liabilities],
[Account].[Accounts].[Debt by Assets]
}
on columns,
{ [Measures].[Amount] }
on rows
from [Adventure Works]
You can define members in any hierarchy, not only in the Measures hierarchy. In the definition, you should put the parent member before the name of the new member to tell Analysis Services its position in the hierarchy. This matters more for CREATE MEMBER in the cube calculation script than for WITH MEMBER, as it influences where client tools will display the member.

Looping within the results of a Pig Group By

Let's say I have a game with player ids. Each id can have multiple character names (playerNames) and we have a score for each of those names. I would like to total all the scores per id, and calculate each playerName's percentage of that total.
So, for instance:
id playerName playerScore
01 Test 45
01 Test2 15
02 Joe 100
would output
id {(playerName, playerScore, percentScore)}
01 {(Test, 45, .75), (Test2, 15, .25)}
02 {(Joe, 100, 1.0)}
Here's how I did it:
data = LOAD 'someData.data' AS (id:int, playerName:chararray, playerScore:int);
grouped = GROUP data BY id;
withSummedScore = FOREACH grouped GENERATE SUM(data.playerScore) AS summedPlayerScore, FLATTEN(data);
withPercentScore = FOREACH withSummedScore GENERATE data::id AS id, data::playerName AS playerName, ((double) playerScore / summedPlayerScore) AS percentScore;
percentScoreIdGroup = GROUP withPercentScore BY id;
Currently, I do this with 2 GROUP BY statements, and I was curious if they were both necessary, or if there's a more efficient way to do this. Can I reduce this to a single GROUP BY? Or, is there a way I can iterate over the bag of tuples and add percentScore to all of them without flattening the data?
No, you cannot do this without two GROUP statements, and the reason is more fundamental than just Pig:
To get the total number of points you need a linear pass through the player's scores.
Then you need another linear pass over the player's scores to calculate the fraction; you cannot do this before you know the sum.
Having said that, if the number of playerNames per id is small, I'd write a UDF that takes a bag of player scores and outputs a bag of score-per-playerName tuples, since each GROUP generates another reducer and the process becomes ridiculously slow. A UDF that takes the bag would have to do those two linear passes as well, but if the bags are small enough it won't matter, and it will certainly be an order of magnitude faster than creating another reducer.
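As a rough illustration of that suggestion, here is a sketch of such a UDF in Java; the class name, the zero handling, and the assumption that it is passed a bag of (playerName, playerScore) tuples are mine:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Takes a bag of (playerName, playerScore) tuples and returns a bag of
// (playerName, playerScore, percentScore) tuples, doing both linear passes
// in memory instead of in a second GROUP.
public class PercentScore extends EvalFunc<DataBag> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag scores = (DataBag) input.get(0);

        // First pass: total score for this group.
        double total = 0;
        for (Tuple t : scores) {
            total += ((Number) t.get(1)).doubleValue();
        }

        // Second pass: emit each name and score together with its share of the total.
        DataBag out = bagFactory.newDefaultBag();
        for (Tuple t : scores) {
            Tuple result = tupleFactory.newTuple(3);
            result.set(0, t.get(0));
            result.set(1, t.get(1));
            result.set(2, total == 0 ? 0.0 : ((Number) t.get(1)).doubleValue() / total);
            out.add(result);
        }
        return out;
    }
}

It could then be called once per group, roughly as percentScores = FOREACH grouped GENERATE group AS id, PercentScore(data.(playerName, playerScore)); replacing the SUM/FLATTEN and second GROUP combination.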
