One of our data sources sends a feed containing a daily aggregate of the data, i.e. a periodic snapshot. For example:
shop, day, sales
bobs socks, 2019-01-01, 45
bobs socks, 2019-01-02, 50
bobs socks, 2019-01-03, 10
janes coats, 2019-01-01, 500
janes coats, 2019-01-02, 55
janes coats, 2019-01-03, 100
I know of two ways to model this in a data vault raw vault:
Multi-Active Satellite
Here we allow each satellite to have multiple rows per hub key.
create table dbo.HubShop (
    ShopName nvarchar(50) not null,
    constraint pk_HubShop primary key (ShopName)
);
create table dbo.SatDailyShopSales (
    ShopName nvarchar(50) not null,
    SalesDate date not null,
    SalesAmount money not null,
    LoadTimestamp datetime2(7) not null,
    constraint pk_SatDailyShopSales primary key (ShopName, SalesDate, LoadTimestamp)
);
This is easy to implement but we now have a bi-temporal element to the satellite.
Snapshot Hub
create table dbo.HubShop (
    ShopName nvarchar(50) not null,
    constraint pk_HubShop primary key (ShopName)
);
create table dbo.HubSnapshot (
    SalesDate date not null,
    constraint pk_HubSnapshot primary key (SalesDate)
);
create table dbo.LinkDailyShopSnapshot (
    LinkDailyShopSnapshotHash binary(32) not null,
    ShopName nvarchar(50) not null,
    SalesDate date not null,
    constraint pk_LinkDailyShopSnapshot primary key (LinkDailyShopSnapshotHash)
);
create table dbo.SatDailyShopSales (
    LinkDailyShopSnapshotHash binary(32) not null,
    SalesAmount money not null,
    LoadTimestamp datetime2(7) not null,
    constraint pk_SatDailyShopSales primary key (LinkDailyShopSnapshotHash, LoadTimestamp)
);
This second solution adds an extra hub which just stores a list of dates and a link for the intersection between date and shop.
The second solution feels cleaner but requires more joins.
Which is the correct model? Are there any better solutions?
As far as I understand the Data Vault modelling approach, satellites are there to store accurate time slices of your data warehouse.
This means that if I am given a specific date and select all hubs and links (with no end date, or an end date later than the specific date), and then for each of them the satellite entry with max(loaddate) where loaddate <= the specific date, I should get the full representation of the real-world data state as of that date.
Applied to your question, this means that your second solution fits these requirements, because you can still import "changes" from the source system as new time slices and therefore model the correct timeline of information in the DWH.
To put it as an example, let's say your source system has the state:
shop, day, sales
bobs socks, 2019-01-01, 45
bobs socks, 2019-01-02, 50
bobs socks, 2019-01-03, 10
janes coats, 2019-01-01, 500
janes coats, 2019-01-02, 55
janes coats, 2019-01-03, 100
and you import this data on 2019-01-03 23:30:00.
On January 4th at 12:10:00, though, the "janes coats" sales team corrects the 2019-01-03 figure to only 90 sales.
In your first solution this leaves you with updating the satellite entry with hub key "janes coats" and load date "2019-01-03" to 90, effectively losing your accurate DWH history.
So your DWH only stores the following afterwards:
shop, day, sales
bobs socks, 2019-01-01, 45
bobs socks, 2019-01-02, 50
bobs socks, 2019-01-03, 10
janes coats, 2019-01-01, 500
janes coats, 2019-01-02, 55
janes coats, 2019-01-03, 90
In your second solution, by contrast, you simply insert a new satellite time slice for the shop snapshot hash (for business key "janes coats" with date "2019-01-03") with load date "2019-01-04 12:10:00" and sales 90.
LINK
shop, day, ID (think of ID as a hash)
bobs socks, 2019-01-01, 1
bobs socks, 2019-01-02, 2
bobs socks, 2019-01-03, 3
janes coats,2019-01-01, 4
janes coats,2019-01-02, 5
janes coats,2019-01-03, 6
SALES Satellite
Link ID, loaddate, sales
1, 2019-01-03 23:30:00, 45
2, 2019-01-03 23:30:00, 50
3, 2019-01-03 23:30:00, 10
4, 2019-01-03 23:30:00, 500
5, 2019-01-03 23:30:00, 55
6, 2019-01-03 23:30:00, 100 !
6, 2019-01-04 12:10:00, 90 !
So you can easily see in your system that you got the correction of sales numbers at 2019-01-04 12:10:00 and that they were 100 before that.
The way I think of it, the only update action allowed in the Data Vault model is setting an EndDate in a link table, and deletes are never allowed. Then you have a full DWH history that is available and reproducible.
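The "max(loaddate) where loaddate <= specific date" lookup described above can be sketched as a small query. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than SQL Server, the table names link and sat and the helper sales_as_of are hypothetical, and only a few of the example rows are loaded.

```python
import sqlite3

# Hypothetical miniature of the link + satellite tables from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link (id INTEGER PRIMARY KEY, shop TEXT, day TEXT);
    CREATE TABLE sat  (link_id INTEGER, loaddate TEXT, sales INTEGER,
                       PRIMARY KEY (link_id, loaddate));
    INSERT INTO link VALUES (5, 'janes coats', '2019-01-02'),
                            (6, 'janes coats', '2019-01-03');
    INSERT INTO sat VALUES (5, '2019-01-03 23:30:00', 55),
                           (6, '2019-01-03 23:30:00', 100),
                           (6, '2019-01-04 12:10:00', 90);
""")

def sales_as_of(as_of):
    """Return {(shop, day): sales} from the newest time slice loaded on or before as_of."""
    rows = conn.execute("""
        SELECT l.shop, l.day, s.sales
        FROM link l
        JOIN sat s ON s.link_id = l.id
        WHERE s.loaddate = (SELECT MAX(s2.loaddate)
                            FROM sat s2
                            WHERE s2.link_id = l.id AND s2.loaddate <= ?)
    """, (as_of,)).fetchall()
    return {(shop, day): sales for shop, day, sales in rows}

# ISO-formatted timestamps compare correctly as plain strings.
before = sales_as_of("2019-01-04 00:00:00")  # before the correction arrived
after = sales_as_of("2019-01-05 00:00:00")   # after the correction arrived
print(before[("janes coats", "2019-01-03")], after[("janes coats", "2019-01-03")])
```

Because the correction is stored as an additional row instead of an update, both the old and the new figure stay queryable.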
I'm trying to rank suppliers by their cases sold over the last 12 months (MAT12_cs) in a matrix in Power BI.
Here is some sample data:
Table_sales
Supplier, Product, Account, Rep, MAT12_cs
Sup1, Prod1, Acc1, Rep1, 56
Sup1, Prod1, Acc2, Rep2, 45
Sup1, Prod2, Acc1, Rep1, 43
Sup1, Prod2, Acc2, Rep2, 66
Sup2, Prod3, Acc1, Rep1, 15
Sup2, Prod4, Acc3, Rep2, 104
Sup3, Prod5, Acc4, Rep3, 86
Sup3, Prod5, Acc1, Rep1, 80
Here is the result I'm expecting:
Supplier, MAT12_cs, Rank
Sup1, 210, 1
Sup3, 166, 2
Sup2, 119, 3
Total, 495
I tried RANKX in a measure:
Rank = RANKX(Table_sales,SUM(MAT12_CS))
It gives 1 everywhere.
I tried something like this but something is missing to make it work I think:
Rank =
VAR ProdSales = SUM ( 'Table_sales'[MAT12_cs] )
VAR tblSales =
    SUMMARIZE (
        'Table_sales',
        'Table_sales'[Supplier],
        "Total Sales", SUM ( 'Table_sales'[MAT12_cs] )
    )
RETURN
    IF ( ProdSales > 0, COUNTROWS ( FILTER ( tblSales, [Total Sales] > ProdSales ) ) + 1, BLANK () )
This gives me totals; I don't know what I should replace COUNTROWS with to get a ranking.
Create a measure (I am calling your table "Sales" for short):
Total Sale = SUM ( Sales[MAT12_cs] )
Create another measure:
Sale Rank =
IF (
    HASONEVALUE ( Sales[Supplier] ),
    RANKX ( ALL ( Sales[Supplier] ), [Total Sale] )
)
Put these measures into a matrix or table against suppliers.
Explanation:
You must use ALL(Table) instead of just 'Table' in RANKX. Without ALL, RANKX does not see the entire data (as it must, to rank all sales); it only sees the filtered table. For example, in the first row it only sees sales for supplier 1, because your table "Sales" is filtered in this row by Sup1. As a result, RANKX is ranking just one record, which is why you are getting 1 in every line. When we use ALL, RANKX (correctly) sees all the data.
After getting access to all suppliers, RANKX iterates over them one by one, calculates the sales for each supplier, and then ranks them.
The HASONEVALUE part is needed to suppress the rank on the total row.
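The effect of the filter context can be mimicked outside DAX. The following is a rough Python analogy, not how the engine works internally: the rank helper and the visible parameter are hypothetical names, and the rank is taken as one plus the number of visible suppliers with a strictly larger total, which matches RANKX's default descending order.

```python
from collections import defaultdict

# (Supplier, MAT12_cs) pairs from the sample data in the question.
rows = [("Sup1", 56), ("Sup1", 45), ("Sup1", 43), ("Sup1", 66),
        ("Sup2", 15), ("Sup2", 104), ("Sup3", 86), ("Sup3", 80)]

# Equivalent of the [Total Sale] measure evaluated per supplier.
totals = defaultdict(int)
for supplier, cases in rows:
    totals[supplier] += cases

def rank(supplier, visible):
    """Rank a supplier among the suppliers that are visible to the ranking."""
    return 1 + sum(1 for other in visible if totals[other] > totals[supplier])

# Without ALL: each matrix row only "sees" its own supplier, so every rank is 1.
filtered = {s: rank(s, [s]) for s in totals}
# With ALL: every row sees all suppliers, so the ranks differ.
unfiltered = {s: rank(s, list(totals)) for s in totals}

print(filtered)
print(unfiltered)
```

The first dictionary reproduces the "1 everywhere" symptom; the second matches the expected ranking.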
I have one Oracle DB with ~40 tables. Some of them have IDs = 1, 2, 3, 4, 5... and constraints.
Now I want to "copy" the data from all of these tables to another Oracle DB which already has the same tables.
The problem is that the other DB also has records (possibly with the same IDs = 1, 2, 3, 77, 88...) and I don't want to lose them.
Is there an automated way to copy data from one table to another while shifting the IDs and keeping the constraints valid?
1, 2, 3, 77, 88 +
**1, 2, 3, 4, 5**
=
1, 2, 3, 77, 88, **89, 90, 91, 92, 93**
Or do I need to do it myself?
insert into new.table
select new.sequence_id.nextval, t.* from old.table t
and then save the new.id to old.id mapping, and so on, for all 40 tables?
It's a slightly dirty solution, but if all IDs are numeric you can first update the old IDs to negative numbers (ID = -1 * ID), or just negate them on the fly in the SELECT statement, and then do the insert. That way all your IDs stay consistent, the constraints remain valid, and the copied rows can live alongside the existing data.
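The negation idea can be tried out on a toy schema. This is a sketch under assumptions, not Oracle code: it uses Python's built-in sqlite3 module, and the tables parent, child, src_parent and src_child are hypothetical stand-ins for the real target and source tables. Note that foreign key columns have to be negated along with the primary keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the constraint during the copy
conn.executescript("""
    -- Hypothetical target tables that already contain data.
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child  (id INTEGER PRIMARY KEY,
                         parent_id INTEGER NOT NULL REFERENCES parent(id));
    INSERT INTO parent VALUES (1, 'target a'), (77, 'target b');
    INSERT INTO child  VALUES (1, 77);

    -- Hypothetical source tables whose IDs collide with the target's.
    CREATE TABLE src_parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE src_child  (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL);
    INSERT INTO src_parent VALUES (1, 'source a'), (2, 'source b');
    INSERT INTO src_child  VALUES (1, 2);
""")

# Negate IDs on the fly: primary keys and the foreign keys that point at them.
conn.execute("INSERT INTO parent SELECT -id, name FROM src_parent")
conn.execute("INSERT INTO child SELECT -id, -parent_id FROM src_child")

parent_ids = [r[0] for r in conn.execute("SELECT id FROM parent ORDER BY id")]
child_rows = conn.execute("SELECT id, parent_id FROM child ORDER BY id").fetchall()
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(parent_ids, child_rows, violations)
```

Parents are copied before children so that every negated foreign key already has a row to point at.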
First, export the data with expdp; second, import it with impdp, using REMAP_SCHEMA to map it to the new schema name.
I am currently doing some testing and need a large amount of data (around 1 million rows).
I am using the following table:
CREATE TABLE OrderTable(
    OrderID INTEGER NOT NULL,
    StaffID INTEGER,
    TotalOrderValue DECIMAL (8,2),
    CustomerID INTEGER);
ALTER TABLE OrderTable ADD CONSTRAINT OrderID_PK PRIMARY KEY (OrderID);
CREATE SEQUENCE seq_OrderTable
MINVALUE 1
START WITH 1
INCREMENT BY 1
CACHE 10000;
and want to insert 1000000 random rows into it with the following rules:
OrderID needs to be sequential (1, 2, 3 etc...)
StaffID needs to be a random number between 1 and 1000
CustomerID needs to be a random number between 1 and 10000
TotalOrderValue needs to be a random decimal value between 0.00 and 9999.99
Is this even possible? I think I could generate each of these values with an update statement; however, I am not sure how to generate a million rows in one go.
Thanks for any help on this matter.
This is how I would randomly generate the number on update:
UPDATE StaffTable SET DepartmentID = DBMS_RANDOM.value(low => 1, high => 5);
For testing purposes I created the table and populated it in one shot, with this query:
CREATE TABLE OrderTable(OrderID, StaffID, CustomerID, TotalOrderValue)
as (select level, ceil(dbms_random.value(0, 1000)),
ceil(dbms_random.value(0,10000)),
round(dbms_random.value(0,10000),2)
from dual
connect by level <= 1000000)
/
A few notes: it is better to use NUMBER as the data type; NUMBER(8,2) is the equivalent of DECIMAL(8,2). When populating this kind of table it is also much more efficient to use the "hierarchical query without PRIOR" trick (the "connect by level <= ..." trick) to generate the order IDs.
If your table is created already, insert into OrderTable (select level...) (same subquery as in my code) should work just as well. You may be better off adding the PK constraint only after you create the data though, so as not to slow things down.
A small sample from the table created (total time to create the table on my cheap laptop - 1,000,000 rows - was 7.6 seconds):
SQL> select * from OrderTable where orderid between 500020 and 500030;
ORDERID STAFFID CUSTOMERID TOTALORDERVALUE
---------- ---------- ---------- ---------------
500020 666 879 6068.63
500021 189 6444 1323.82
500022 533 2609 1847.21
500023 409 895 207.88
500024 80 2125 1314.13
500025 247 3772 5081.62
500026 922 9523 1160.38
500027 818 5197 5009.02
500028 393 6870 5067.81
500029 358 4063 858.44
500030 316 8134 3479.47
I have two SQLite tables, Salesmen and Sales. Here's what the original CSV files looked like; I've already loaded them into SQLite (this is just so you can see how the tables are laid out):
Salesmen Table
id, name
1, john
2, luther
3, bob
Sales Table
id, salesmen_id, sales_amount
1, 1, 100
2, 3, 20
3, 2, 35
4, 3, 25
5, 1, 55
6, 2, 200
7, 2, 150
My question is: how do I write a function in Ruby that returns all the salesmen's names, sorted by their total sales amount? I know this requires a join, but I'm not entirely sure what the query should look like.
I want the new table to look like this:
New Table
name, total_sales
luther, 385
john, 155
bob, 45
The new sqlite query should be in this format:
$db.execute %q{
SELECT account_name, units, unit_price
FROM accounts, positions
...
}
Thanks in advance
I think this is what you want:
SELECT name, SUM(sales_amount) AS total_sales
FROM salesmen INNER JOIN sales ON sales.salesmen_id = salesmen.id
GROUP BY salesmen.id, salesmen.name
ORDER BY total_sales DESC
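A join of this kind can be checked against the sample data from the question. The sketch below uses Python's built-in sqlite3 module purely as a test harness; in Ruby the same SQL string would be passed to $db.execute. The function name salesmen_by_total_sales and the total_sales alias are illustrative choices, and the query orders by the summed amount because the expected output is sorted by total sales.

```python
import sqlite3

# Load the sample rows from the question into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesmen (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (id INTEGER PRIMARY KEY, salesmen_id INTEGER, sales_amount INTEGER);
    INSERT INTO salesmen VALUES (1, 'john'), (2, 'luther'), (3, 'bob');
    INSERT INTO sales VALUES (1, 1, 100), (2, 3, 20), (3, 2, 35), (4, 3, 25),
                             (5, 1, 55), (6, 2, 200), (7, 2, 150);
""")

QUERY = """
    SELECT salesmen.name, SUM(sales.sales_amount) AS total_sales
    FROM salesmen
    INNER JOIN sales ON sales.salesmen_id = salesmen.id
    GROUP BY salesmen.id, salesmen.name
    ORDER BY total_sales DESC
"""

def salesmen_by_total_sales():
    """Return (name, total_sales) tuples, largest total first."""
    return conn.execute(QUERY).fetchall()

result = salesmen_by_total_sales()
print(result)
```

An INNER JOIN drops salesmen without any sales; a LEFT JOIN from salesmen would keep them with a NULL total.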
I am trying to get a running total calculation for a set of data that does not have any Dates. I have a column with Product Categories and a column with Sales Dollars for each Product Category from the Fact Table. I need to add a running total column as shown below. I am new to DAX and looking for some help on calculating a running total.
Category                  Sales $   Cumulative Sales $
FOOD SERVICE                 9051                 9051
HOT FOOD                     1880                10931
GRILL                        1815                12746
FRESH SANDWICHES             1189                13935
FRESH BAKERY                 1100                15035
PACKAGED BAKERY              1074                16109
COLD SNACKS                   645                16754
FAST FOOD                     388                17142
FRESH BAKERY MULTI-DAY        252                17394
ENTREES/SALAD                 180                17574
NACHOS                        126                17700
BREAD                         120                17820
Grand Total                 17820
I think you have to have it ordered by something. If you want it ordered by Sales as you have it above you could create a calculated column to store the order as follows:
= RANKX ( ALL ( Table1 ), [Sales], [Sales] )
Then create a new calculated column for your cumulative total:
=
CALCULATE (
    SUM ( [Sales] ),
    FILTER (
        ALL ( Table1 ),
        [CalculatedColumn2] <= EARLIER ( Table1[CalculatedColumn2] )
    )
)
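The logic of the two calculated columns (rank by sales, then sum every row whose rank is at or above the current one) can be cross-checked outside DAX. This is a plain Python sketch of the same arithmetic on the figures from the question, not a DAX replacement; the variable names are illustrative.

```python
from itertools import accumulate

# Category -> Sales $ figures taken from the question.
sales = {
    "FOOD SERVICE": 9051, "HOT FOOD": 1880, "GRILL": 1815,
    "FRESH SANDWICHES": 1189, "FRESH BAKERY": 1100, "PACKAGED BAKERY": 1074,
    "COLD SNACKS": 645, "FAST FOOD": 388, "FRESH BAKERY MULTI-DAY": 252,
    "ENTREES/SALAD": 180, "NACHOS": 126, "BREAD": 120,
}

# Step 1: order categories by sales, largest first (the role of the rank column).
ordered = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Step 2: running total over that order (the role of the cumulative column).
running = list(accumulate(amount for _, amount in ordered))

for (category, amount), cumulative in zip(ordered, running):
    print(f"{category:<24}{amount:>8}{cumulative:>10}")
```

The last running value equals the grand total, which is a quick sanity check for the DAX column as well.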