Handling multiple grains within a star schema - etl

I'm trying to model a business process that is inherently measured at multiple grains. Usually this would necessitate one fact table per grain, but because this is a single business process and only one of the dimensions is at a mixed grain (for some records), I'm not sure a separate fact table makes the most sense.
The process itself is based on measuring a Research Application. Each application can have applicants, funders, collaborators, and so forth. Additionally, each application can be managed by an organisation. For the M:N relationships I'm using bridge tables and weighting factors. The problem lies with the organisation dimension, which models a slightly ragged hierarchy as fixed-depth attributes.
dim_organisation
id, organisation, faculty, school, division, unit
Each fact record has the same dimensionality with the exception of this dimension. Sometimes the application is managed by a faculty (level 2 in the hierarchy), and sometimes by a school (level 3 in the hierarchy). Furthermore, the fact record itself will only contain the business key for one of those levels e.g. school_code or faculty_code.
Here's how I believe the problem can and should be solved but I'd like some validation of this approach and / or some better proposals if necessary:
The initial dim_organisation table is populated from an external master data source. The data is always balanced, i.e. there are no missing intermediate levels, but it is ragged, so that some entries end at the school level, whereas others go right down to the unit level:
id, organisation, faculty, school, division, unit
1, org A, faculty A, school A, NULL, NULL
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, NULL, NULL, NULL
Because these records are at different grains I've copied down the last non-NULL level to complete the hierarchy:
id, organisation, faculty, school, division, unit
1, org A, faculty A, school A, school A, school A
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, faculty C, faculty C, faculty C
This ensures that every record in dim_organisation is at the same grain, and it is a standard approach for handling slightly ragged hierarchies. In addition, each of these levels has its own code, e.g. L456 for a level 4 division or L521 for a level 5 unit. These are the business keys obtained from the source system.
Therefore, I can only refer to a single record in the dimension by combining all of the level codes accordingly. At the moment I'm creating a hash key on these level codes and storing the value in a lookup column on the dimension.
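To make that concrete, here is a minimal Python sketch of the copy-down and hash-key idea, assuming the dimension rows arrive as dictionaries; MD5 is used purely as an example hash, and all codes and names are illustrative:

import hashlib

LEVELS = ["organisation", "faculty", "school", "division", "unit"]

def copy_down(row):
    # Fill missing lower levels with the last non-NULL level so that
    # every dimension row ends up at the same (level 5) grain.
    filled = dict(row)
    last = None
    for level in LEVELS:
        if filled.get(level) is None:
            filled[level] = last
        last = filled[level]
    return filled

def level_hash(codes):
    # Build a deterministic lookup value from the ordered level codes
    # (the business keys) to store in the dimension's lookup column.
    return hashlib.md5("|".join(codes).encode("utf-8")).hexdigest()

row = {"organisation": "org A", "faculty": "faculty A",
       "school": "school A", "division": None, "unit": None}
print(copy_down(row))   # division and unit become "school A"
print(level_hash(["ORG1", "L2F01", "L3S07", "L3S07", "L3S07"]))  # hypothetical codes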
Assuming this approach is correct, I then have fact records coming in as follows:
application_id, organisation_id, applicant_id, ...
1, L456, 99
2, L321, 50
3, L549, 20
As you can see, the application fact is linked to my organisation dimension at different grains, e.g. level 4, level 3, level 5 and so forth. Because of the changes I've made to the dimension, I believe I now need to do the following (a short code sketch follows the worked example below):
1. Lookup the level code from dim_organisation.
2. Return the parent levels.
3. Copy down the level value associated with the fact to level 5.
4. Hash the keys and lookup the corresponding dimensional record.
For example:
1. Lookup L456 to return Division e.g. "Research and Engineering".
2. Return parents: "UoM" -> "Faculty of R&D" -> "School of Engineering".
3. Copy levels: L1 -> L2 -> L3 -> "Research and Engineering" (L4) -> "Research Engineering" (L5).
4. Now we have all the levels (parents + cascaded) to give us a unique record to look up in dim_organisation.
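If it helps, here is a rough Python sketch of those four steps, assuming an in-memory index from each level code to its code path is built from dim_organisation (the codes and names below are invented for illustration):

import hashlib

# Hypothetical index: every level code maps to the ordered list of codes
# from the top of the hierarchy down to itself, e.g. L456 (a level 4
# division) maps to the [organisation, faculty, school, division] codes.
code_to_path = {
    "L456": ["ORG1", "L2RD", "L3ENG", "L456"],
}

def copy_down_codes(path, depth=5):
    # Step 3: cascade the last known code down to the lowest level (level 5).
    return path + [path[-1]] * (depth - len(path))

def organisation_lookup_key(level_code):
    path = code_to_path[level_code]      # steps 1-2: the code and its parents
    full_path = copy_down_codes(path)    # step 3: complete the hierarchy
    key = "|".join(full_path)            # step 4: hash, then use the result to
    return hashlib.md5(key.encode("utf-8")).hexdigest()  # find the dimension row

print(organisation_lookup_key("L456"))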
I'd like to know if this approach makes sense or if there is a better and more intuitive way of doing this? It's slightly messy because of the source data that I'm dealing with but that's the data reality I have to work with.

You've done well with your dimension by pushing the ragged hierarchy down to the lowest grain. Now have your fact record reference the dimension's unique row identifier.
1, org A, faculty A, school A, school A, school A
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, faculty C, faculty C, faculty C
If the fact event is related to school A, the fact would store row id #1.
The only caveat to this approach is that the real level of the dim should be identifiable by the content. In other words, if School A is West Side High School and faculty C is Mr West, you wouldn't want them both described as "WEST". If the content of each level is fully descriptive then this model will work just fine.
I have used this exact same approach to model an organizational hierarchy containing up to 10 levels of reporting.
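As a toy illustration of that lookup (dimension rows copied from the example above; the ids and function name are invented):

# Pushed-down dimension rows: surrogate id plus the five copied-down levels.
dim_organisation = [
    (1, "org A", "faculty A", "school A", "school A", "school A"),
    (2, "org A", "faculty B", "school B", "division B", "unit B"),
    (3, "org A", "faculty C", "faculty C", "faculty C", "faculty C"),
]

def row_id_for(org, faculty, school, division, unit):
    # Return the surrogate key of the dimension row matching the full path.
    for row in dim_organisation:
        if row[1:] == (org, faculty, school, division, unit):
            return row[0]
    raise KeyError("no matching dimension row")

# A fact event related to school A stores row id 1.
print(row_id_for("org A", "faculty A", "school A", "school A", "school A"))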

Related

Database Relational Algebra: How to find actors who have played in ALL movies produced by "Universal Studios"?

Given the following relational schemas, where the primary keys are in bold:
movie(movieName, whenMade);
actor(actorName, age);
studio(studioName, location, movieName);
actsIn(actorName, movieName);
How do you find the list of actors who have played in EVERY movie produced by "Universal Studios"?
My attempt:
π actorName ∩ (σ studioName=“Universal Studios” studio) |><| actsIn, where |><| is the natural join
Are you supposed to use cartesian product and/or division? :\
Here are the two steps that you should follow:
Write an expression to find the names of the movies produced by "Universal Studios" (the result is a relation with a single attribute).
Divide the relation actsIn by the relation obtained in the first step.
This should give you the expected result (i.e. a relation with the names of the actors that have played in every movie of "Universal Studios").
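Spelled out in the same notation as the question, the two steps might look like this (with ÷ for relational division):

UnivMovies = π movieName (σ studioName = "Universal Studios" (studio))
Result = actsIn ÷ UnivMovies

The division returns exactly those actorName values that appear in actsIn paired with every movieName in UnivMovies.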

Design of Bayesian networks: Understanding the difference between "States" and "Nodes"

I'm designing a small Bayesian Network using the program "Hugin Lite".
The problem is that I have difficulty understanding the difference between "Nodes" (the visual circles) and "States" (which are the "fields" of a node).
I will write one example where it is clear to me, and another which I can't understand.
The example I understand:
There are two women (W1 and W2) and one man (M).
M has a child with W1. The child's name is C1.
Then M has a child with W2. The child's name is C2.
The resulting network is:
The four possible STATES of every node (W1, W2, M, C1, C2) are:
AA: the person has two genes "A"
Aa/aA: the person has one gene "A" and one gene "a"
aa: the person has two genes "a"
Now the example that I can't understand:
The data given:
Total (authorized or not) payments while a person is in a foreign country (travelling): 5% (of course, the other 95% of transactions are made in the home country)
NOT AUTHORIZED payments while TRAVELLING: 1%
NOT AUTHORIZED payments while in HOME COUNTRY: 0.2%
NOT AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 10%
AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 1%
TOTAL (authorized or not) payments while TRAVELLING and to a FOREIGN country: 90%
What I've drawn is the following.
But I'm not sure if it's correct or not. What do you think? Then I'm supposed to fill in a "probability table" for each node. But what should I write?
Probability table:
Any hint about the network's correctness and how to fill in the table is really appreciated.
Nodes are random variables (RVs), that is, "things" that can be in different states and therefore carry some uncertainty, so you assign probabilities to those states. For example, if you have an RV Person, it could have states such as [Man, Woman] with their corresponding probabilities; if you want to relate it to another RV, Credit Worthiness [Good, Bad], you can "marry" Person and Credit Worthiness to get the combination of both RVs and their states.
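As a plain illustration of that distinction (the numbers are made up), in code terms a node is the variable, the states are its possible values, and the probability table assigns a probability to each state or to each combination of parent states:

# Node "Person" with two states and a made-up prior.
person_states = ["Man", "Woman"]
p_person = {"Man": 0.5, "Woman": 0.5}

# Node "CreditWorthiness" with two states, conditioned on its parent "Person".
credit_states = ["Good", "Bad"]
p_credit_given_person = {
    "Man":   {"Good": 0.7, "Bad": 0.3},
    "Woman": {"Good": 0.8, "Bad": 0.2},
}

# Each conditional distribution must sum to 1 over the child's states.
for dist in p_credit_given_person.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9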
This is homework, so I don't want to just tell you the answer. Instead, I'll make an observation and ask a few questions. The observation is that you want your arrows going from cause to effect.
So. Is the payment authorization status a/the cause of the location? Or is the location a/the cause of the payment authorization?
Also, do you really need four variables for each of travelling, home, foreign, and local? Or might some smaller number of variables suffice?

Database design for bus reservation

I'm developing a reservation module for buses and I have trouble designing the right database structure for it.
Let's take the following case:
Buses go from A to D with stopovers at B and C. A passenger can reserve a ticket for any route, i.e. from A to B, from C to D, from A to D, etc.
So each route can have many "subroutes", and the bigger ones contain the smaller ones.
I want to design a table structure for routes and stops in a way that makes it easy to search for free seats. So if someone reserves a seat from A to B, that seat would still be available from B to C or D.
All ideas would be appreciated.
I'd probably go with a "brute force" structure similar to this basic idea:
(There are many more fields that should exist in the real model. This is only a simplified version containing the bare essentials necessary to establish relationships between tables.)
The ticket "covers" stops through TICKET_STOP table, For example, if a ticket covers 3 stops, then TICKET_STOP will contain 3 rows related to that ticket. If there are 2 other stops not covered by that ticket, then there will be no related rows there, but there is nothing preventing a different ticket from covering these stops.
Liberal usage or natural keys / identifying relationships ensures two tickets cannot cover the same seat/stop combination. Look at how LINE.LINE_ID "migrates" alongside both edges of the diamond-shaped dependency, only to be merged at its bottom, in the TICKET_STOP table.
This model, by itself, won't protect you from anomalies such as a single ticket "skipping" some stops - you'll have to enforce some rules through the application logic. But, it should allow for a fairly simple and fast determination of which seats are free for which parts of the trip, something like this:
SELECT *
FROM STOP CROSS JOIN SEAT
WHERE STOP.LINE_ID = :line_id
  AND SEAT.BUS_NO = :bus_no
  AND NOT EXISTS (
        SELECT *
        FROM TICKET_STOP
        WHERE TICKET_STOP.LINE_ID = :line_id
          AND TICKET_STOP.BUS_ID = :bus_no
          AND TICKET_STOP.TRIP_NO = :trip_no
          AND TICKET_STOP.SEAT_NO = SEAT.SEAT_NO
          AND TICKET_STOP.STOP_NO = STOP.STOP_NO
  )
(Replace the parameter prefix : with what is appropriate for your DBMS.)
This query essentially generates all combinations of stops and seats for the given line and bus, then discards those that are already "covered" by some ticket on the given trip. The combinations that remain "uncovered" are free for that trip.
You can easily add STOP.STOP_NO IN ( ... ) or SEAT.SEAT_NO IN ( ... ) to the WHERE clause to restrict the search to specific stops or seats.
From the perspective of a bus company:
Usually one route is considered as a series of sections, like A to B, B to C, C to D, etc. The occupancy is calculated for each of those sections separately. So if the bus leaves A full and some people get off at C, a user can still buy a ticket from C.
We calculate it this way: each route has an ID, and each section belongs to that route ID. If a user buys a ticket for more than one section, each of those sections is marked as taken. For the next passenger, the system checks whether all sections along the way are available.
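A rough Python sketch of that section-based check (the stop names and data structures are made up):

# Stops along the route, in order; a section is the leg between two
# consecutive stops: ("A", "B"), ("B", "C"), ("C", "D").
STOPS = ["A", "B", "C", "D"]
SECTIONS = list(zip(STOPS, STOPS[1:]))

# booked[seat_no] is the set of sections already sold for that seat.
booked = {
    12: {("A", "B"), ("B", "C")},   # seat 12 is taken from A to C
    13: set(),
}

def sections_between(origin, destination):
    # Sections a passenger occupies when travelling origin -> destination.
    i, j = STOPS.index(origin), STOPS.index(destination)
    return set(SECTIONS[i:j])

def seat_available(seat_no, origin, destination):
    # A seat is free for the trip if none of its sections are already booked.
    return booked[seat_no].isdisjoint(sections_between(origin, destination))

print(seat_available(12, "C", "D"))   # True: seat 12 frees up at C
print(seat_available(12, "A", "D"))   # False: overlaps A-B and B-C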

Algorithm for multicriterial arrangement

Let me describe the problem in the form of a small fiction story.
The story
In a Brave New World, new cities are built in a couple of days and only need to be populated. Moreover, there is no more long, boring hiring process, no interviews and subjective decisions: every person passes several tests, and their results are used to find the best employees.
When a new city is built, a number of companies place their offices there and ask Super Mind to find the best employees for them, given a way to calculate a person's score for their particular company. People, on their side, ask Super Mind to find work for them: they give it a list of companies where they would like to work, together with corresponding priorities. Super Mind is very humanistic, so its task is to find an arrangement such that people get to the best companies they want, even if some companies are left without employees at all.
Formal definition
Now let me define the task more formally.
E - number of employees seeking a job.
C - number of companies.
S(e,c) - score of employee e for company c.
Pr(e,c) - priority of company c in a personal "wishlist" of employee e.
P(c) - # of positions available in company c.
Task: obtain a list of (e, c) tuples satisfying the following conditions:
employees with a higher S(e,c) should always go first (e.g. if there's only one position left in company c and there are 2 candidates for it, it should be guaranteed that the employee with the higher score gets the position).
employees should get to the highest-priority company still available to them.
My algorithm
The only algorithm I can think of that guarantees all conditions is as follows. First I create a list of all possible applications from employees to companies, A(e, c, s, p), where s is the score of employee e for company c and p is the company's priority for this employee. Then I sort all applications by score and run the following recursive procedure:
def arrange(As, Ps, not_approved, approved):
    # As           - list of applications left, sorted by score (descending)
    # Ps           - dict: company -> number of positions left
    # not_approved - set of not approved applications
    # approved     - set of approved applications (holds the intermediate result)
    if not As:
        return approved
    a, As_rest = As[0], As[1:]
    if cant_be_hired(a):
        # no places left in the company this application targets
        return arrange(As_rest, Ps, not_approved | {a}, approved)
    elif highest_priority(a):
        # this application has the highest of the employee's remaining priorities
        Ps_after = dict(Ps)
        Ps_after[a.c] -= 1   # one position fewer in a's company
        return arrange(As_rest, Ps_after, not_approved, approved | {a})
    else:
        # the application can be accepted, but the employee still has
        # higher-priority companies left: check what happens if we skip it
        check_result = arrange(As_rest, Ps, not_approved | {a}, approved)
        if employee_is_hired_for_better_job(a, check_result):
            # the employee gets a higher-priority job anyway,
            # so check_result is already the answer
            return check_result
        else:
            # otherwise accept this application and proceed with the rest
            return arrange(As_rest, Ps, not_approved, approved | {a})
But, of course, this algorithm has a very high computational complexity. Dynamic programming with cached check results helps a bit, but it is still too slow.
I was thinking of some kind of conditional optimization algorithm that always converges; however, I'm not familiar enough with this field to find an appropriate one.
So, is there a better algorithm?

Looking for a model to represent this problem, which I suspect may be NP-complete

(I've changed the details of this question to avoid NDA issues. I'm aware that if taken literally, there are better ways to run this theoretical company.)
There is a group of warehouses, each of which is capable of storing and distributing 200 different products, out of a possible 1000 total products that Company A manufactures. Each warehouse is stocked with 200 products and assigned orders, which it then fills from its stock on hand.
The challenge is that each warehouse needs to be self-sufficient. There will be an order for an arbitrary number of products (5-10 usually), which is assigned to a warehouse. The warehouse then packs the required products for the order, and ships them together. For any item which isn't available in the warehouse, the item must be delivered individually to the warehouse before the order can be shipped.
So, the problem lies in determining the best warehouse/product configurations so that the largest possible number of orders can be packed without having to order and wait for individual items.
For example (using products each represented by a letter, and warehouses capable of stocking 5 product lines):
Warehouse 1: [A, B, C, D, E]
Warehouse 2: [A, D, F, G, H]
Order: [A, C, D] -> Warehouse 1
Order: [A, D, H] -> Warehouse 2
Order: [A, B, E, F] -> Warehouse 1 (+1 separately ordered)
Order: [A, D, E, F] -> Warehouse 2 (+1 separately ordered)
The goal is to use historical data to minimize the number of individually ordered products in the future. Once the warehouses have been set up a certain way, the software would just determine which warehouse could handle an order with minimal overhead.
This immediately strikes me as a machine learning style problem. It also seems like a combination of certain well known NP-Complete problems, though none of them seem to fit properly.
Is there a model which represents this type of problem?
If I understand correctly, you have two separate problems:
Predict what each warehouse should pre-buy
Get the best warehouse for an order
For the first problem, I point you to the Netflix Prize: it was almost the same problem, and great solutions have been proposed. (My data mining handbook is at home and I can't remember the precise keyword to google, sorry. Try "data mining time series".)
For the second one, this is a problem for Prolog.
Set a cost for separately ordering an item
Set a cost for, say, proximity to the customer
Set the cost of already owning the product to 0
Make the rule to get a product: buy it if you don't have it, get it from stock if you do
Make the rule to get all products: apply the rule above for each product
Get the cost of this rule
Gently ask Prolog for a solution. If it's not good enough, ask for more.
If you don't want to use Prolog, there are several constraint libraries out there. Just google "constraint library <insert your programming language here>".
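If you go the cost-based route without Prolog, the "best warehouse for an order" part could be sketched like this in Python (the costs and stock are made up to mirror the example above):

# Each warehouse stocks a set of products; an order is a set of products.
warehouses = {
    "Warehouse 1": {"A", "B", "C", "D", "E"},
    "Warehouse 2": {"A", "D", "F", "G", "H"},
}

COST_MISSING_ITEM = 10   # made-up cost of ordering an item separately
COST_IN_STOCK = 0        # a product already in stock costs nothing extra

def order_cost(order, stock):
    # Cost of filling the order from this warehouse's stock.
    missing = order - stock
    return len(missing) * COST_MISSING_ITEM + len(order & stock) * COST_IN_STOCK

def best_warehouse(order):
    # Pick the warehouse that fills the order with minimal overhead.
    return min(warehouses, key=lambda name: order_cost(order, warehouses[name]))

print(best_warehouse({"A", "B", "E", "F"}))   # Warehouse 1 (+1 separately ordered)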
The first part of the problem (which items are frequently ordered together) is sometimes known as the co-occurrence problem, and is a big part of the data mining literature. (My recollection is that the problem is in NP, but there exist quite good approximate algorithms).
Once you have co-occurrence data you are happy with, you are still left with the assignment of items to warehouses. It's a little like the set-covering problem, but not quite the same. This problem is NP-hard.
