Data warehouse error related to a GROUP BY mistake - Oracle

So I have two tables: CLIENTS_TEST and VACATIONS_TEST. I need to create a warehouse that has two dimensions:
d1_month and d2_destination. The main table that should contain the fields of these two dimensions is the fact table. The fact table also needs two more fields: the total number of vacations and the total price.
The problem is that after I insert data into the fact table, the GROUP BY line breaks down and I don't know why. I tried grouping by other fields, but it still fails.
CREATE TABLE CLIENTS_TEST(
IDclient NUMBER(7) PRIMARY KEY,
name_client VARCHAR2(40) ,
adress VARCHAR2(50),
city VARCHAR2(30),
county VARCHAR2(20));
CREATE TABLE VACATIONS_TEST(
number_contract NUMBER(6) PRIMARY KEY,
IDclient NUMBER(7) REFERENCES CLIENTS_TEST(IDclient),
date_contract DATE,
destination VARCHAR2(40),
price NUMBER(10));
INSERT INTO CLIENTS_TEST VALUES (1, 'name1','adr1', 'Timisoara', 'Timis');
INSERT INTO CLIENTS_TEST VALUES (2, 'name2','adr2', 'Arad', 'Arad');
INSERT INTO CLIENTS_TEST VALUES (3, 'name3','adr3', 'Cluj', 'Cluj');
INSERT INTO CLIENTS_TEST VALUES (4, 'name4','adr4', 'Arad', 'Arad');
INSERT INTO CLIENTS_TEST VALUES (5, 'name5','adr5', 'Timisoara', 'Timis');
INSERT INTO CLIENTS_TEST VALUES (6, 'name6','adr6', 'Cluj', 'Cluj');
INSERT INTO VACATIONS_TEST VALUES (11, 2, TO_DATE('2/05/2020','dd/mm/yyyy'), 'dest2', 4000);
INSERT INTO VACATIONS_TEST VALUES (12, 3, TO_DATE('4/05/2020','dd/mm/yyyy'), 'dest5', 4000);
INSERT INTO VACATIONS_TEST VALUES (13, 6, TO_DATE('8/05/2020','dd/mm/yyyy'), 'dest3', 3000);
INSERT INTO VACATIONS_TEST VALUES (14, 1, TO_DATE('10/05/2020','dd/mm/yyyy'), 'dest4', 3000);
INSERT INTO VACATIONS_TEST VALUES (15, 3, TO_DATE('12/05/2020','dd/mm/yyyy'), 'dest1', 2000);
INSERT INTO VACATIONS_TEST VALUES (16, 5, TO_DATE('15/05/2020','dd/mm/yyyy'), 'dest2', 4000);
INSERT INTO VACATIONS_TEST VALUES (17, 2, TO_DATE('18/05/2020','dd/mm/yyyy'), 'dest4', 5000);
INSERT INTO VACATIONS_TEST VALUES (18, 6, TO_DATE('21/05/2020','dd/mm/yyyy'), 'dest5', 3000);
INSERT INTO VACATIONS_TEST VALUES (19, 1, TO_DATE('24/05/2020','dd/mm/yyyy'), 'dest1', 4000);
INSERT INTO VACATIONS_TEST VALUES (20, 4, TO_DATE('27/05/2020','dd/mm/yyyy'), 'dest3', 6000);
INSERT INTO VACATIONS_TEST VALUES (21, 3, TO_DATE('3/06/2020','dd/mm/yyyy'), 'dest5', 4000);
INSERT INTO VACATIONS_TEST VALUES (22, 4, TO_DATE('6/06/2020','dd/mm/yyyy'), 'dest2', 3000);
INSERT INTO VACATIONS_TEST VALUES (23, 6, TO_DATE('7/06/2020','dd/mm/yyyy'), 'dest4', 6000);
INSERT INTO VACATIONS_TEST VALUES (24, 2, TO_DATE('9/06/2020','dd/mm/yyyy'), 'dest5', 5000);
INSERT INTO VACATIONS_TEST VALUES (25, 5, TO_DATE('11/06/2020','dd/mm/yyyy'), 'dest2', 4000);
INSERT INTO VACATIONS_TEST VALUES (26, 3, TO_DATE('14/06/2020','dd/mm/yyyy'), 'dest1', 3000);
INSERT INTO VACATIONS_TEST VALUES (27, 2, TO_DATE('17/06/2020','dd/mm/yyyy'), 'dest4', 6000);
INSERT INTO VACATIONS_TEST VALUES (28, 6, TO_DATE('19/06/2020','dd/mm/yyyy'), 'dest5', 4000);
INSERT INTO VACATIONS_TEST VALUES (29, 1, TO_DATE('21/06/2020','dd/mm/yyyy'), 'dest3', 2000);
INSERT INTO VACATIONS_TEST VALUES (30, 5, TO_DATE('27/06/2020','dd/mm/yyyy'), 'dest4', 3000);
INSERT INTO VACATIONS_TEST VALUES (31, 4, TO_DATE('1/07/2020','dd/mm/yyyy'), 'dest5', 3000);
INSERT INTO VACATIONS_TEST VALUES (32, 3, TO_DATE('4/07/2020','dd/mm/yyyy'), 'dest1', 2000);
INSERT INTO VACATIONS_TEST VALUES (33, 1, TO_DATE('6/07/2020','dd/mm/yyyy'), 'dest3', 2000);
INSERT INTO VACATIONS_TEST VALUES (34, 4, TO_DATE('10/07/2020','dd/mm/yyyy'), 'dest1', 3000);
INSERT INTO VACATIONS_TEST VALUES (35, 6, TO_DATE('12/07/2020','dd/mm/yyyy'), 'dest2', 4000);
INSERT INTO VACATIONS_TEST VALUES (36, 5, TO_DATE('15/07/2020','dd/mm/yyyy'), 'dest3', 3000);
INSERT INTO VACATIONS_TEST VALUES (37, 4, TO_DATE('22/07/2020','dd/mm/yyyy'), 'dest4', 2000);
INSERT INTO VACATIONS_TEST VALUES (38, 2, TO_DATE('24/07/2020','dd/mm/yyyy'), 'dest1', 4000);
INSERT INTO VACATIONS_TEST VALUES (39, 1, TO_DATE('27/07/2020','dd/mm/yyyy'), 'dest5', 2000);
INSERT INTO VACATIONS_TEST VALUES (40, 5, TO_DATE('29/07/2020','dd/mm/yyyy'), 'dest4', 4000);
Create table d1_month (month_contract number(2) primary key);
Insert into d1_month select distinct extract(month from date_contract) from vacations_test;
Create table d2_destination(destination varchar2(40) Primary key);
insert into d2_destination select distinct destination from vacations_test;
Create table fact (
month_contract number(2) references d1_month(month_contract),
destination varchar2(40) references d2_destination(destination),
nr_vacations number(10),
total_price number(20),
primary key(nr_vacations, total_price));
Insert into fact
select extract(month from VACATIONS_TEST.date_contract), VACATIONS_TEST.destination,count(VACATIONS_TEST.IDclient),
sum(VACATIONS_TEST.price)
from vacations_test, clients_test
WHERE VACATIONS_TEST.IDclient=CLIENTS_TEST.IDclient
group by VACATIONS_TEST.destination, extract(month from VACATIONS_TEST.date_contract); -- error

It is not the GROUP BY that fails; the fact table's primary key is being violated:
SQL> insert into fact (month_contract, destination, nr_vacations, total_price)
2 select
3 extract(month from vacations_test.date_contract),
4 vacations_test.destination,
5 count(vacations_test.idclient),
6 sum(vacations_test.price)
7 from vacations_test, clients_test
8 where vacations_test.idclient = clients_test.idclient
9 group by vacations_test.destination,
10 extract(month from vacations_test.date_contract)
11 ;
insert into fact (month_contract, destination, nr_vacations, total_price)
*
ERROR at line 1:
ORA-00001: unique constraint (SCOTT.SYS_C007160) violated
SQL>
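To confirm which constraint the system-generated name refers to, a quick lookup in the standard Oracle data-dictionary views does it (the constraint name below is the one from the error message):

-- which constraint is SYS_C007160, and on which table?
select constraint_name, constraint_type, table_name
from user_constraints
where constraint_name = 'SYS_C007160';

-- which columns does it cover?
select column_name, position
from user_cons_columns
where constraint_name = 'SYS_C007160';

Given the INSERT that failed, it will point at the fact table's primary key.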
What to do?
If the fact table's primary key isn't correctly set, change it.
If the primary key is OK but the data violates it, then make sure the query returns unique combinations of the two primary key columns, (nr_vacations, total_price).
How? Maybe by adding a WHERE clause to the query, or ... who knows? I don't, but you should.
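For illustration, here is a minimal sketch of the first option, assuming the intended grain of the fact table is one row per (month, destination) combination, i.e. the grouping columns rather than the measures identify a row:

drop table fact;

create table fact (
month_contract number(2) references d1_month(month_contract),
destination varchar2(40) references d2_destination(destination),
nr_vacations number(10),
total_price number(20),
-- the grouping columns, not the measures, identify a fact row
primary key (month_contract, destination));

insert into fact (month_contract, destination, nr_vacations, total_price)
select extract(month from v.date_contract),
       v.destination,
       count(v.IDclient),
       sum(v.price)
from vacations_test v
join clients_test c on c.IDclient = v.IDclient
group by v.destination, extract(month from v.date_contract);

With that key the GROUP BY output can never collide, because each group produces exactly one (month_contract, destination) pair.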

Related

Procedure which returns several rows conditionally

I have a procedure which returns a number as an OUT parameter (let's call it out_parameter_result).
According to this number I need to add rows conditionally.
Pseudocode example (don't mind the conditions):
if(bitand(out_parameter_result, 1) = 1)
result.add(select 1 from dual)
if(bitand(out_parameter_result, 2) = 2)
result.add(select 2 from dual)
if(bitand(out_parameter_result, 4) = 4)
result.add(select 4 from dual)
return cursor(or resultset) which will contain 1,2,4.
Different from original, but works fine in my case.
SELECT * FROM TABLE WHERE id IN (
DECODE(bitand(v_info, 1), 1, 0, 1),
DECODE(bitand(v_info, 2), 2, 0, 2),
DECODE(bitand(v_info, 4), 4, 0, 3)
);
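If you need the result set to literally contain the matching bit values (1, 2, 4), a minimal sketch of the original pseudocode as plain SQL, assuming v_info is available as a bind variable, would be:

-- returns one row per bit that is set in :v_info
select 1 as val from dual where bitand(:v_info, 1) = 1
union all
select 2 from dual where bitand(:v_info, 2) = 2
union all
select 4 from dual where bitand(:v_info, 4) = 4;

For :v_info = 7 this returns the rows 1, 2 and 4; opening it as a ref cursor from the procedure gives the result set described in the question.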

Any form for a year-to-date or rolling sum function in Power Query?

I'm quite new to Power Query. I have a column for the date, called MyDate, format (dd/mm/yy), and another variable called TotalSales. Is there any way of obtaining a variable TotalSalesYTD, with the year-to-date sum of TotalSales for each row? I've seen you can do that in Power Pivot or Power BI, but didn't find anything for Power Query.
Alternatively, is there a way of creating a variable TotalSales12M, for the rolling sum of the last 12 months of TotalSales?
I wasn't able to test this properly, but the following code gave me your expected result:
let
initialTable = Table.FromRows({
{#date(2020, 5, 1), 150},
{#date(2020, 4, 1), 20},
{#date(2020, 3, 1), 54},
{#date(2020, 2, 1), 84},
{#date(2020, 1, 1), 564},
{#date(2019, 12, 1), 54},
{#date(2019, 11, 1), 678},
{#date(2019, 10, 1), 885},
{#date(2019, 9, 1), 54},
{#date(2019, 8, 1), 98},
{#date(2019, 7, 1), 654},
{#date(2019, 6, 1), 45},
{#date(2019, 5, 1), 64},
{#date(2019, 4, 1), 68},
{#date(2019, 3, 1), 52},
{#date(2019, 2, 1), 549},
{#date(2019, 1, 1), 463},
{#date(2018, 12, 1), 65},
{#date(2018, 11, 1), 45},
{#date(2018, 10, 1), 68},
{#date(2018, 9, 1), 65},
{#date(2018, 8, 1), 564},
{#date(2018, 7, 1), 16},
{#date(2018, 6, 1), 469},
{#date(2018, 5, 1), 4}
}, type table [MyDate = date, TotalSales = Int64.Type]),
ListCumulativeSum = (numbers as list) as list =>
let
accumulator = (listState as list, toAdd as nullable number) as list =>
let
previousTotal = List.Last(listState, 0),
combined = listState & {List.Sum({previousTotal, toAdd})}
in combined,
accumulated = List.Accumulate(numbers, {}, accumulator)
in accumulated,
TableCumulativeSum = (someTable as table, columnToSum as text, newColumnName as text) as table =>
let
values = Table.Column(someTable, columnToSum),
cumulative = ListCumulativeSum(values),
columns = Table.ToColumns(someTable) & {cumulative},
toTable = Table.FromColumns(columns, Table.ColumnNames(someTable) & {newColumnName})
in toTable,
yearToDateColumn =
let
groupKey = Table.AddColumn(initialTable, "$groupKey", each Date.Year([MyDate]), Int64.Type),
grouped = Table.Group(groupKey, "$groupKey", {"toCombine", each
let
sorted = Table.Sort(_, {"MyDate", Order.Ascending}),
cumulative = TableCumulativeSum(sorted, "TotalSales", "TotalSalesYTD")
in cumulative
}),
combined = Table.Combine(grouped[toCombine]),
removeGroupKey = Table.RemoveColumns(combined, "$groupKey")
in removeGroupKey,
rolling = Table.AddColumn(yearToDateColumn, "TotalSales12M", each
let
inclusiveEnd = [MyDate],
exclusiveStart = Date.AddMonths(inclusiveEnd, -12),
filtered = Table.SelectRows(yearToDateColumn, each [MyDate] > exclusiveStart and [MyDate] <= inclusiveEnd),
sum = List.Sum(filtered[TotalSales])
in sum
),
sortedRows = Table.Sort(rolling, {{"MyDate", Order.Descending}})
in
sortedRows
There might be more efficient ways to do what this code does, but if the size of your data is relatively small, then this approach should be okay.
For the year-to-date cumulative, the data is grouped by year and sorted in ascending date order, and then a running-total column is added.
For the rolling 12-month total, the data is grouped into 12-month windows and the sales are totaled within each window. The totaling is a bit inefficient (all rows are re-processed, as opposed to only those which have entered or left the window), but you might not notice it.
Table.Range could have been used instead of Table.SelectRows when creating the 12-month windows, but I figured Table.SelectRows makes fewer assumptions about the input data (i.e. whether it's sorted, whether any months are missing, etc.) and is therefore safer/more robust.

How to execute this sorting process in pyspark?

I have tried map, mapValues and sort, but nothing works.
The task is described as follows: deduplicate each value list by similarity (the second element of each pair); when similarities are equal, keep the user with the smallest ID (the first element of each pair).
And the list of Key-Value pair is :
[
(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (20, 0.2), (19, 0.2)]),
(19, [(30, 0.5), (8, 0.2), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])
]
I want to get:
[
(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (19, 0.2)]),
(19, [(30, 0.5), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])
]
Thanks a lot!
Pretty much straightforward; you just need to figure out the necessary transformations:
data = [(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (20, 0.2), (19, 0.2)]),
(19, [(30, 0.5), (8, 0.2), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])]
rdd1 = sc.parallelize(data)
# flatten so each record is (key, (ID, similarity))
rdd2 = rdd1.flatMapValues(lambda x: x)
# re-key by (key, similarity) so tied similarities meet on the same key
rdd3 = rdd2.map(lambda x: ((x[0], x[1][1]), x[1][0]))
# among tied similarities, keep the smallest ID
rdd4 = rdd3.reduceByKey(min)
# restore the original shape: (key, (ID, similarity))
rdd5 = rdd4.map(lambda x: (x[0][0], (x[1], x[0][1])))
# collect the surviving pairs per key
rdd6 = rdd5.reduceByKey(lambda x, y: [x, y])
rdd6.collect()
[(9, (26, 0.2)),
(26, (9, 0.2)),
(18, (2, 0.5)),
(30, [(6, 0.25), (19, 0.5)]),
(2, (18, 0.5)),
(6, [(30, 0.25), (19, 0.2)]),
(19, [(30, 0.5), (6, 0.2)])]
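Note that reduceByKey(lambda x, y: [x, y]) only works here because no key keeps more than two pairs, and keys with a single pair stay bare tuples instead of one-element lists, as visible above. A sketch that matches the expected output exactly, reusing rdd5 from the snippet above, groups instead:

# group all surviving pairs per key into a list,
# ordered by similarity descending as in the expected output
rdd6 = rdd5.groupByKey().mapValues(
    lambda vals: sorted(vals, key=lambda v: v[1], reverse=True))
rdd6.collect()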

How to compare values across multiple columns?

Assume all value columns have the same datatype. I would like a SELECT query that returns each id together with the highest of all its values.
Table Structure:
table_a: id, value1, value2, value3, value4, value5
Example data:
id, value1, value2, value3, value4, value5
2, 125, 256, 133, 400, 67
3, 14, 14, 14, 3, 6
4, 325, 441, 441, 975, 3
Example desired results:
id, highest_value
2, 400
3, 14
4, 975
I started down the path of a CASE statement but that got messy fast. I tried a sub-select but couldn't get that to work. Is there a clean way to compare several column values to each other?
In this case the GREATEST function will do the job.
with t1(id1, val1, val2, val3, val4, val5) as
(
select 2, 125, 256, 133, 400, 67 from dual union all
select 3, 14, 14, 14, 3, 6 from dual union all
select 4, 325, 441, 441, 975, 3 from dual
)
select id1
, greatest(val1, val2, val3, val4, val5) Res
from t1
Result:
Id1 Res
---------------
2 400
3 14
4 975
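One caveat: Oracle's GREATEST returns NULL as soon as any of its arguments is NULL. If the value columns are nullable, a hedged variant, using the column names from the question and assuming 0 is below any real value in the data, would be:

select id,
       -- NVL each column so a single NULL doesn't null out the whole result
       greatest(nvl(value1, 0), nvl(value2, 0), nvl(value3, 0),
                nvl(value4, 0), nvl(value5, 0)) as highest_value
from table_a;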

Opposite of a GROUP in apache pig latin?

Let's say I have the following input in apache pig:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
and I want to convert these 2 rows into the following 7 rows:
(123, (1, 2) )
(123, (3, 4) )
(666, (8, 9) )
(666, (10, 11) )
(666, (3, 4) )
i.e. this is sorta 'doing the opposite of a GROUP'. Is this possible in Pig Latin?
Take a look at FLATTEN. It does what you probably need.
However, using your notation above, it looks like the list of tuples is a tuple. This should be a bag for this to work properly.
Instead of:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
You should be representing your data as:
(123, { (1, 2), (3, 4) } )
(666, { (8, 9), (10, 11), (3, 4) } )
Then, once the data is in this form, in a relation called (say) grouped, you can do:
O = FOREACH grouped GENERATE $0, FLATTEN($1);
