How to deal with reporting slowly changing dimensions - etl

For a client I am creating a data warehouse in which we have some slowly changing dimensions (or facts, if that is even a thing?). For example, we want to report the annual recurring revenue (ARR) for subscriptions, and we want both the currently active and the expired subscriptions in there so that we can see the ARR over a timeline.
The data we retrieve looks like this:
subscription_id  account_id  ARR  start_date  end_date
1                1           10   01-01-2022  31-03-2022
2                2           20   01-01-2022  31-12-2022
3                1           5    01-04-2022  31-11-2022
So in this case the same account (account_id 1) renewed its subscription on 01-04-2022. In the report for 2022 we want to see the ARR for every month of 2022. I've looked into slowly changing dimensions, but what I cannot see in that concept is how to report both the currently active license and its history in a dashboard. If, for example, we want to visualize the monthly ARR for all of 2022 in a dashboarding tool, we want to see both subscriptions for account_id 1 over the course of the year, not just the currently active one. This seems to be very tricky to do in most dashboarding tools.
To overcome this I've done the following. I created a calendar table with an interval of 1 month and I cross join it with the table above to generate a fact table. The end result would look like:
timestamp   account_id  ARR
01-01-2022  1           10
01-01-2022  2           20
01-02-2022  1           10
...         ...         ...
01-11-2022  1           10
01-11-2022  2           20
01-11-2022  2           20
This makes it really easy for the user of the reporting tool to filter on a specific month and show the ARR between two dates and across multiple subscriptions. It does generate a lot of extra data, but storage space is not an issue at the moment. It also turns this into more of a transactional-style table, even though the ARR is not really a transaction (i.e. it is not a product sold on a specific date).
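A minimal sketch of the calendar expansion described above, written in Python rather than SQL purely for illustration. The helper names (month_starts, arr_for_month) and the tuple layout are assumptions, and because 31-11-2022 is not a valid calendar date, the sketch assumes 30-11-2022 as the end date of subscription 3.

```python
from datetime import date


def month_starts(start, end):
    """Yield the first day of every month that overlaps [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


# (subscription_id, account_id, arr, start_date, end_date)
subscriptions = [
    (1, 1, 10, date(2022, 1, 1), date(2022, 3, 31)),
    (2, 2, 20, date(2022, 1, 1), date(2022, 12, 31)),
    (3, 1, 5, date(2022, 4, 1), date(2022, 11, 30)),  # assumed end date
]

# One fact row per subscription per month in which it is active.
fact_rows = [
    (month, account_id, arr)
    for _, account_id, arr, start, end in subscriptions
    for month in month_starts(start, end)
]


def arr_for_month(month):
    """Total ARR over all subscriptions active in the given month."""
    return sum(arr for m, _, arr in fact_rows if m == month)
```

With this data, arr_for_month(date(2022, 4, 1)) adds subscription 2 and the renewed subscription 3 for account 1, which is the behaviour the dashboard needs.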
My question is: Are there better ways of generating a fact table where the source data contains a date range?

Related

Query to prevent booking overlap

I'm building an app in Oracle APEX and trying to find a query that prevents people from booking a room that is already booked. I managed to find a query that prevents picking a date that starts or ends within an existing booking, but I can't find how to prevent overlapping. By that I mean: if someone books a conference room from Feb 2nd to Feb 5th, someone else can still book the same room from Feb 1st to Feb 7th. That is what I'm trying to prevent. Thanks for the help!
Here's my first query
SELECT RES_ID_LOC FROM WER_RES
WHERE (CAST(RES_DATE_ARRIVE AS DATE) < CAST(TRY_RESERVE_START_DATE AS DATE) OR CAST(RES_DATE_DEPART AS DATE) > CAST(TRY_RESERVE_START_DATE AS DATE))
AND (CAST(RES_DATE_ARRIVE AS DATE) < CAST(TRY_RESERVE_END_DATE AS DATE) OR CAST(RES_DATE_DEPART AS DATE) > CAST(TRY_RESERVE_END_DATE AS DATE))
The main issue you'll have here is concurrency, namely (in chronological order):
User 1: runs the overlap check query, sees Room 5 is free, and inserts a row to book it
User 2: runs the overlap check query, sees Room 5 is free, and inserts a row to book it
User 1: commits
User 2: commits
And voila! You have data corruption (a double booking), even though the code all ran as you expected.
To avoid this, you'll need some way to lock a resource that multiple users might want to book. So let's say you have a ROOMS table (the list of available rooms) and a BOOKINGS table which is a child of ROOMS.
Then your logic will need to be something like:
select * from ROOMS where ROOM_NO = :selected_room for update;
This gives someone exclusive access to the room to check for bookings.
Now you can run your overlap check on that room against the BOOKINGS table. If that passes, then you insert your booking and commit the change to release the lock on the ROOMS row.
As an aside, take care with simply casting strings to dates, because you're at the whim of the item's format mask matching the database default. Better to explicitly use TO_DATE with a known format mask.
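As a rough in-memory analogue of the lock-then-check flow above, here is a Python sketch. It is not Oracle code: the per-room threading.Lock stands in for the row lock taken by SELECT ... FOR UPDATE, the dictionaries stand in for the ROOMS and BOOKINGS tables, and the names overlaps, try_book and ROOM_NUMBERS are assumptions. The overlap test assumes inclusive start and end dates.

```python
import threading
from datetime import date


def overlaps(start_a, end_a, start_b, end_b):
    """True when two inclusive date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a


# Hypothetical stand-ins for the ROOMS rows (one lock each) and BOOKINGS.
ROOM_NUMBERS = [5]
room_locks = {room_no: threading.Lock() for room_no in ROOM_NUMBERS}
bookings = {room_no: [] for room_no in ROOM_NUMBERS}  # room_no -> [(start, end)]


def try_book(room_no, start, end):
    """Check for overlap and insert while holding the room's lock."""
    with room_locks[room_no]:  # plays the role of SELECT ... FOR UPDATE
        for booked_start, booked_end in bookings[room_no]:
            if overlaps(start, end, booked_start, booked_end):
                return False  # room already taken for part of the range
        bookings[room_no].append((start, end))
        return True  # leaving the block releases the lock, like COMMIT
```

Booking Room 5 from 2 to 5 February and then trying 1 to 7 February returns True and then False, because the second range encloses the first. The single comparison in overlaps covers starts inside, ends inside, and fully enclosing ranges.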

I would like to create an efficient Bigtable row key

I would like to create an optimal row key in Bigtable. I have a table channel_data with 3 columns: channel_id,date,fan_count.
channel_id  date        fan_count
1           2022-03-01  5000
1           2022-03-02  6000
2           2022-03-01  200
2           2022-03-02  300
3           2022-03-03  1000
Users of our application can set up brands/buckets by adding multiple channels. Users can choose any random channel_id.
I want to design an efficient row key to fetch aggregated fan_count in a date range for a brand.
Let's say the user creates a brand with channel_id 1 and 3 and wishes to see the sum of all fans for the period 2022-03-01 to 2022-03-03.
The result should be 5000+6000+1000=12000
You have a few options here. Because you're looking to do queries based on date, you should probably make that the end part of your rowkey so you can scope down by the brand's channels first. You could also use timestamped cells to store multiple values for each channel, perhaps a week or month of data per row so it is grouped together that way, but this isn't necessary.
Perhaps a rowkey like channel_id/yyyy-mm-dd is what you'd want. You can choose to store the date and channel info in the table, but it isn't necessary since you'd have it in your ids. You can just treat Bigtable like a key/value store in this instance which might be more optimal depending on your scenario.
If you choose to store a month of data per row, you would just make the rowkey something like channel_id/yyyy-mm and just timestamp each value for the day.
Either way for your queries, if you need multiple channels, then you could just do multiple reads or a multi-prefix scan. Let me know if this helps clarify the schema design and if you have more questions.
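To make the channel_id/yyyy-mm-dd idea concrete, here is a small Python sketch that emulates a sorted key/value store with a plain dictionary and bisect. It deliberately does not use the Bigtable client library; row_key and sum_fans are hypothetical names, and each per-channel slice stands in for one range read over that channel's keys.

```python
from bisect import bisect_left, bisect_right


def row_key(channel_id, day):
    """Build a key of the form channel_id/yyyy-mm-dd."""
    return f"{channel_id}/{day}"


# The sample data from the question, stored as key -> fan_count.
rows = {
    row_key(1, "2022-03-01"): 5000,
    row_key(1, "2022-03-02"): 6000,
    row_key(2, "2022-03-01"): 200,
    row_key(2, "2022-03-02"): 300,
    row_key(3, "2022-03-03"): 1000,
}
sorted_keys = sorted(rows)  # Bigtable keeps rows sorted by key


def sum_fans(channel_ids, start_day, end_day):
    """Sum fan_count over one key-range read per channel in the brand."""
    total = 0
    for channel_id in channel_ids:
        lo = bisect_left(sorted_keys, row_key(channel_id, start_day))
        hi = bisect_right(sorted_keys, row_key(channel_id, end_day))
        total += sum(rows[key] for key in sorted_keys[lo:hi])
    return total
```

For the brand made of channels 1 and 3, sum_fans([1, 3], "2022-03-01", "2022-03-03") gives 12000, matching the expected result in the question. Because the date is the trailing part of the key, each channel's date range is a contiguous run of keys.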

Grouping then getting the sum of data in dynamic dated columns

I am currently working on a resource tracker for my company and I have each individual's capacity figure by week (weeks are in the columns and each person's information is in a row). I need to be able to sum all the time in a specific month for each job role so I can report on it.
I have thought about grouping the dates by selecting 4 weeks, but because my fields are dynamic and some months span 5 weeks, that would not accurately report the month's figures.
Unfortunately, you can't pivot the information because the dates are in the columns rather than the rows.
I have yet to find any formula/code that can be used to get that information.
In the picture, I have added the information that I would like to be able to dynamically sum. The red outlines the month and the green outlines the job role information.
So I would like to be able to sum all the information under "July", and then do the same for the other months, so I can give my stakeholders a monthly figure of how many days of capacity there are for each person/job role in that month.
=ARRAYFORMULA(QUERY({INDIRECT("Sheet1!B3:B"&COUNTA(Sheet1!B3:B)+2),
MMULT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(Sheet1!E2:Z),
"where month(Col1)+1=7 and year(Col1)=2019", 0)),
"where Col1 is not null offset 1", 0), ROW(INDIRECT("A1:A"&COUNTA(
FILTER(Sheet1!A2:2, MONTH(Sheet1!A2:2)=7, YEAR(Sheet1!A2:2)=2019))))^0)},
"select Col1, sum(Col2) group by Col1 label sum(Col2)''", 0))
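For readers who want to see the logic of that formula outside the spreadsheet, here is a hedged Python sketch of the same idea: pick the week columns whose header date falls in the requested month, then sum those columns per job role. The sample dates, roles, and capacities are invented for illustration and are not taken from the original sheet.

```python
from collections import defaultdict
from datetime import date

# Hypothetical header row: one date per weekly column.
week_starts = [
    date(2019, 7, 1), date(2019, 7, 8), date(2019, 7, 15),
    date(2019, 7, 22), date(2019, 7, 29), date(2019, 8, 5),
]

# Hypothetical data rows: (job_role, capacity in days for each week).
people = [
    ("Developer", [5, 5, 4, 5, 3, 5]),
    ("Developer", [5, 0, 5, 5, 5, 5]),
    ("Analyst", [2, 3, 5, 5, 5, 4]),
]


def monthly_capacity(year, month):
    """Sum capacity per job role over the columns dated in the given month."""
    columns = [
        index for index, day in enumerate(week_starts)
        if day.year == year and day.month == month
    ]
    totals = defaultdict(int)
    for role, capacities in people:
        totals[role] += sum(capacities[index] for index in columns)
    return dict(totals)
```

Because the columns are chosen by their header date rather than by a fixed count of four, a five-week month such as this July is handled without any special casing.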

Cognos 11 Crosstab - need a value that doesn't have a reference to the column values

Crosstab report works 99%.
About 20 rows, all but one are ok.
5 columns - Company Division.
The rows are things like cost, revenue, revenue 2, etc.
All the rows that work have three attributes I'm using to select them:
Fiscal Year
Period
Solution.
The problem is that there is a table that lists a YTD rate for each period. This table is not division specific; it's company wide.
All the tables are linked to the accounting period table that has fiscal year and period. So the overall query limits data to fiscal year (?pFiscalYear?) and period <= ?pPeriod?, based on prompt page results.
The source table has this:
FY_CD PD_NO ACT_CURR_RT ACT_YTD_RT
2018 1 0.36121715 0.36121715
2018 2 0.32471476 0.34255512
2018 3 0.25240906 0.31210183
2018 4 0.33154745 0.31925874
Note the YTD rate is not an average of any of the other numbers.
When I select the ACT_YTD_RT, as a row, I want the ACT_YTD_RT that matches the selected period.
What I get is the average if I set the aggregation to average, or the lowest if I set it to other aggregations. So sometimes it looks right (if I run for period 1, 2 or 3, as the rate kept falling), and sometimes it's wrong (period 4 returns .3121 instead of .3192).
I've tried a number of different methods and can generate garbage data (totals, min, max, average) and crossjoins but can't figure out how to get the value I'm looking for.
I want YTD_RT where fiscal year =?pFiscal? and period = ?pPeriod?.
I tried a straight if then clause:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
but I get an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
If I create another query that generates the right response and try to include it, I get a crossjoin error that the query I'm referencing is trying to crossjoin several other items in the crosstab query.
A union doesn't work (different number of columns).
Not sure how a join would work since the division doesn't exist in the rate table.
I maybe could create a view in the database that did a crossjoin of the division table and the rate table, add that to the framework and then I wouldn't have a crossjoin since the solution would be in the rate "table" (really view), but that seems wrong somehow.
If I could just write a freaking parameterized query direct to the database I'd be done. But in Cognos 11 crosstabs I can't find a place for a SQL query object. And that shouldn't be necessary.
I've spent hours and hours chasing this in circles.
Anybody have any ideas?
Thanks
Paul
So the earlier problem was that this:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
Generated an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
To fix the above, I had to add a cross join of the division table and the rate table as a view in the database. Then add that to the framework. Then build the data item this way:
total (
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
)
And now the "total" provides the missing group by. And the crossjoin in the database provides the division information so the crosstab is happy.
I still think there should have been an easier way to do this, but I have a functioning hammer at the moment.
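The difference between the default aggregations and the conditional total can be checked against the sample rates with a short Python sketch. This is plain Python, not Cognos expression syntax; the function names are assumptions, and the list filter mirrors the report's period <= ?pPeriod? restriction.

```python
# (FY_CD, PD_NO, ACT_CURR_RT, ACT_YTD_RT) rows from the source table.
rates = [
    (2018, 1, 0.36121715, 0.36121715),
    (2018, 2, 0.32471476, 0.34255512),
    (2018, 3, 0.25240906, 0.31210183),
    (2018, 4, 0.33154745, 0.31925874),
]


def ytd_rates_in_scope(fiscal_year, period):
    """YTD rates the report query sees: same year, period <= selected one."""
    return [
        ytd for year, period_no, _, ytd in rates
        if year == fiscal_year and period_no <= period
    ]


def conditional_total(fiscal_year, period):
    """Analogue of total(if year and period match then ACT_YTD_RT)."""
    return sum(
        ytd for year, period_no, _, ytd in rates
        if year == fiscal_year and period_no == period
    )


in_scope = ytd_rates_in_scope(2018, 4)
lowest = min(in_scope)                    # what a minimum aggregation returns
average = sum(in_scope) / len(in_scope)   # what an average aggregation returns
selected = conditional_total(2018, 4)     # the rate for period 4 only
```

For period 4 the minimum is the period 3 rate (0.31210183), which explains the .3121 seen in the report, while the conditional total picks out 0.31925874 because only one row satisfies the condition.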

Complex Queries in ELK?

I've successfully set up the ELK stack. ELK gives me great insights into the data. However, I'm not sure how to fetch the following result.
Let's say I have the columns user_id and action. The values in action can be installed, activated, engagement and click. If a particular user has performed the activity installed on 21 May and again on 21 June, then when fetching results for the month of June, ELK should not return users who have already performed that activity before. For example, for the following table:
Date     UserID  Activity
1 May    1       Activated
3 May    2       Activated
6 May    1       Click
8 May    2       Activated
11 June  1       Activated
12 June  1       Activated
13 June  1       Click
User1 and User2 activated on 1 May and 3 May respectively. User2 also activated on 8 May. So when I filter the users for the month of May with the activity Activated, it should return a count of 2, i.e.
1 May 1 Activated
3 May 2 Activated
User2's row on 8 May is removed because that user has performed the same activity before.
Now if I write the same query for the month of June, it should return nothing, because the same users have performed the same activity earlier.
How can I write this query in ELK?
This type of relational query is not possible with Elasticsearch.
You would need to add another column (FirstUserAction) and either populate it when the data is loaded, or schedule a task (in whatever scripting/programming language you're comfortable with) to periodically calculate and update the values for this column.
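One way such a FirstUserAction value could be computed before the data is indexed is sketched below in Python. The year 2023 is an arbitrary assumption (the question gives only day and month), the function names are hypothetical, and the indexing step into Elasticsearch itself is left out.

```python
from datetime import date

# (date, user_id, action) rows from the question; the year is assumed.
events = [
    (date(2023, 5, 1), 1, "Activated"),
    (date(2023, 5, 3), 2, "Activated"),
    (date(2023, 5, 6), 1, "Click"),
    (date(2023, 5, 8), 2, "Activated"),
    (date(2023, 6, 11), 1, "Activated"),
    (date(2023, 6, 12), 1, "Activated"),
    (date(2023, 6, 13), 1, "Click"),
]


def flag_first_actions(rows):
    """Append a boolean that is True only for a user's first row per action."""
    seen = set()
    flagged = []
    for day, user_id, action in sorted(rows):  # oldest event first
        key = (user_id, action)
        flagged.append((day, user_id, action, key not in seen))
        seen.add(key)
    return flagged


def count_first(flagged, month, action):
    """Count first-time occurrences of an action within a calendar month."""
    return sum(
        1 for day, _, act, is_first in flagged
        if is_first and day.month == month and act == action
    )
```

With the flag stored alongside each document, the monthly report becomes a simple filter on the flag, the month, and the action. On the sample data, the count for Activated is 2 in May and 0 in June, as the question expects.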
