How to calculate day specific data from accumulated data - oracle

I have an Oracle table with values for two accounts and each record will have Date field. First day of the week will have only data relevant to the Day 1 but when we see the data for Day2 in a week it has accumulated data. So we need to subtract Day2 data from previous day data to calculate exact data for Day2.Similar approach for Day3..Day7.
Please suggest the best approach in SQL query to handle this requirement. I am very sorry to bother you. I am totally new to SQL.Really appreciate your valuable inputs.As an example, there are 6 columns with header are given below
Center Entity Bonus Year Period Incentive
MANUFACTURING NEW YORK 1200 FY18 31-12-2017 120
MANUFACTURING NEW YORK 1500 FY18 01-01-2018 250
MANUFACTURING NEW YORK 1800 FY18 01-01-2018 320
So assuming Dec 31, 2017 is the first day of the week, the data record will show only data for that day 1. When we move on to Day 2 of the week i.e. Jan 01, 2018, it has accumulated data which includes Day 1 and day2. So we need to subtract Day2 data from Day1 data to calculate exact data for data 2. 1500 - 1200 = 300 is the exact value for Day 2. Similar approach we need to follow for Day3, day4, Day5,Day6 and Day7.
Expected output is given below
Center Entity Bonus Year Period Incentive
MANUFACTURING NEW YORK 1200 FY18 01-01-2018 120
MANUFACTURING NEW YORK 300 FY18 01-01-2018 130
MANUFACTURING NEW YORK 300 FY18 01-01-2018 70

You could use a simple LAG() function with NVL().
select Center, Entity, Bonus,Year,Period,
Incentive - NVL( LAG(Incentive , 1) OVER ( ORDER BY Period ), 0) Incentive
FROM yourtable;
DEMO

You can do a self join with you table on period and do the subtraction from previous date's data.
looks like you have typo in the test data and result, the dates should be incremental as stated in description, but it has duplicate.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE t
(Center varchar2(13), Entity varchar2(8), Bonus int, Year varchar2(4), Period DATE, Incentive int)
;
INSERT INTO t (Center, Entity, Bonus, Year, Period, Incentive)
VALUES ('MANUFACTURING', 'NEW YORK', 1200, 'FY18', DATE '2017-12-31', 120);
INSERT INTO t (Center, Entity, Bonus, Year, Period, Incentive)
VALUES ('MANUFACTURING', 'NEW YORK', 1500, 'FY18', DATE '2018-01-01', 250);
INSERT INTO t (Center, Entity, Bonus, Year, Period, Incentive)
VALUES ('MANUFACTURING', 'NEW YORK', 1800, 'FY18', DATE '2018-01-02', 320);
;
Query:
select t1.center,
t1.entity,
t1.bonus - nvl (t2.bonus,0) bonus,
t1.year,
t1.period,
t1.incentive - nvl(t2.incentive,0) incentive
from t t1
left outer join t t2
on t1.period = t2.period + 1
order by t1.period
Results:
| CENTER | ENTITY | BONUS | YEAR | PERIOD | INCENTIVE |
|---------------|----------|-------|------|----------------------|-----------|
| MANUFACTURING | NEW YORK | 1200 | FY18 | 2017-12-31T00:00:00Z | 120 |
| MANUFACTURING | NEW YORK | 300 | FY18 | 2018-01-01T00:00:00Z | 130 |
| MANUFACTURING | NEW YORK | 300 | FY18 | 2018-01-02T00:00:00Z | 70 |

Related

In hiveql, what is the most elegant/performatic way of calculating an average value if some of the data is implicitly not present?

In Hiveql, what is the most elegant and performatic way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? i.e. Considering a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the november rows, it will return ~722.78, but the average should take into account the days that are not shown have the same balance as the previous register. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always have exactly one row for each date/balance and given that I cannot change how this data is stored (and probably shouldn't since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning a NULL for the absent days, and then using case to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table where you will have a calendar (table with all the possible dates), call it dimension_date. This is a normal industry standard to have a calendar table, you could probably download this sample data over the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data to hdfs/hive then process the raw data and create a table with historical data where you keep inserting in increment manner. You can look more into data warehousing to get the proper definition but I call this a fact table - f_employee_balance.
This will re-create the original table with missing dates and populate the missing balance with earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average for by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined step 3 and 4 together; this would save you from creating extra table. When you are in the big data world, you don't worry much about wasting extra disk space or development time. You can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concept and hive analytical queries.

Reorder factored matrix columns in Power BI

I have a matrix visual in Power BI. The columns are departments and the rows years. The values are counts of people in each department each year. The departments obviously don't have a natural ordering, BUT I would like to reorder them using the total column count for each department in descending order.
For example, if Department C has 100 people total over the years (rows), and all the other departments have fewer, I want Department C to come first.
I have seen other solutions that add an index column, but this doesn't work very well for me because the "count of people" variable is what I want to index by and that doesn't already exist in my data. Rather it's a calculation based on individual people which each have a department and year.
If anyone can point me to an easy way of changing the column ordering/sorting that would be splendid!
| DeptA | DeptB | DeptC
------|-------|-------|-------
1900 | 2 | 5 | 10
2000 | 6 | 7 | 2
2010 | 10 | 1 | 12
2020 | 0 | 3 | 30
------|-------|-------|-------
Total | 18 | 16 | 54
Order: #2 #3 #1
I don't think there is a built-in way to do this like there is for sorting the rows (there should be though, so go vote for a similar idea here), but here's a possible workaround.
I will assume your source table is called Employees and looks something like this:
Department Year Value
A 1900 2
B 1900 5
C 1900 10
A 2000 6
B 2000 7
C 2000 2
A 2010 10
B 2010 1
C 2010 12
A 2020 0
B 2020 3
C 2020 30
First, create a new calculated table like this:
Depts = SUMMARIZE(Employees, Employees[Department], "Total", SUM(Employees[Value]))
This should give you a short table as follows:
Department Total
A 18
B 16
C 54
From this, you can easily rank the totals with a calculated column on this Depts table:
Rank = RANKX('Depts', 'Depts'[Total])
Make sure your new Depts table is related to the original Employees table on the Department column.
Under the Data tab, use Modeling > Sort by Column to sort Depts[Department] by Depts[Rank].
Finally, replace the Employees[Department] with Depts[Department] on your matrix visual and you should get the following:

DAX - Finding Previous Year Equivalent, based on Fiscal Calendar retaining time/row hierarchy

I have 2 tables that have a many to one relationship defined by a 'FiscalWeekEndDate' value, like so;
WordsTable
FiscalWeekEndDate | WordGroup | IndexedVolume
01/01/2017 | Dining | 1,000
01/01/2017 | Shopping | 2,000
08/01/2017 | Dining | 2,000
08/01/2017 | Sports | 5,000
FiscalDatesTable
FiscalWeekEndDate | FiscalWeek | FiscalMonth | FiscalQuarter | FiscalYear
01/01/2017 | 21 | 5 | 2 | 2017
08/01/2017 | 22 | 5 | 2 | 2017
I'm trying to create a simple PY equivalent measure in DAX (previous year), that works at all levels of time hierarchy - i.e. PY for chosen fiscal week, month, quarter or year. I don't want a YTD (year to date) measure.
This is what I have so far;
Previous Year Indexed Volume:=
CALCULATE([IndexedVolume],
FILTER(ALL('FiscalDatesTable'),'FiscalWeeksTable'[FiscalWeek]=
MAX('FiscalDatesTable'[FiscalWeek]) && 'FiscalDatesTable'[FiscalYear] =
MAX('FiscalDatesTable'[FiscalYear]) - 1))
Which is returning this type of result;
Fiscal Year 2017: 7,000
Fiscal Quarter 2: 7,000
Fiscal Month 5: 7,000
Fiscal Week 22: 7,000
Fiscal Week 21: 3,000
Basically, the Max Fiscal Week value (22: 7,000) is returned for the Month, Quarter or Year dimension. I understand that's a problem due to my DAX Filter, but unsure what else to try.
The desired result would give me the correct Previous Year IndexedVolume value, regardless of the time dimension chosen. i.e. like below;
Fiscal Year 2017: 10,000
Fiscal Quarter 2: 10,000
Fiscal Month 5: 10,000
Fiscal Week 22: 7,000
Fiscal Week 21: 3,000
A few hours of research and I finally cracked it by introducing a 'PreviousFiscalWeekEndDate' into the logic.
Here is the approach if anyone finds it useful.
Previous Year Indexed Volume:=
CALCULATE([IndexedVolume],
FILTER (
ALL ( 'FiscalDatesTable' ),
CONTAINS (
VALUES ( 'FiscalDatesTable'[PreviousYearFiscalWeekEndDate] ),
'FiscalDatesTable'[PreviousYearFiscalWeekEndDate],
'FiscalDatesTable'[FiscalWeekEndDate]
)
)
)

sort_array order by a different column, Hive

I have two columns, one of products, and one of the dates they were bought. I am able to order the dates by applying the sort_array(dates) function, but I want to be able to sort_array(products) by the purchase date.
Is there a way to do that in Hive?
Tablename is
ClientID Product Date
100 Shampoo 2016-01-02
101 Book 2016-02-04
100 Conditioner 2015-12-31
101 Bookmark 2016-07-10
100 Cream 2016-02-12
101 Book2 2016-01-03
Then, getting one row per customer:
select
clientID,
COLLECT_LIST(Product) as Prod_List,
sort_array(COLLECT_LIST(date)) as Date_Order
from tablename
group by 1;
As:
ClientID Prod_List Date_Order
100 ["Shampoo","Conditioner","Cream"] ["2015-12-31","2016-01-02","2016-02-12"]
101 ["Book","Bookmark","Book2"] ["2016-01-03","2016-02-04","2016-07-10"]
But what I want is the order of the products to be tied to the correct chronological order of purchases.
It is possible to do it using only built-in functions, but it is not a pretty site :-)
select clientid
,split(regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',cast(date as string),product)))),'[^:]*:([^,]*(,|$))','$1'),',') as prod_list
,sort_array(collect_list(date)) as date_order
from tablename
group by clientid
;
+----------+-----------------------------------+------------------------------------------+
| clientid | prod_list | date_order |
+----------+-----------------------------------+------------------------------------------+
| 100 | ["Conditioner","Shampoo","Cream"] | ["2015-12-31","2016-01-02","2016-02-12"] |
| 101 | ["Book2","Book","Bookmark"] | ["2016-01-03","2016-02-04","2016-07-10"] |
+----------+-----------------------------------+------------------------------------------+

Create a table of dates in rows

I have a Oracle Database;
and I want to create a table with two columns, one contain id and the other contain incremented dates in rows.
i want to specify in my PL/SQL code the limit dates, and the code will generate the rows between the two limit dates (from and to).
This is an example output result :
+-----+--------------------+
| id |dates |
+-----+--------------------+
| 1 |01/02/2011 04:00:00 |
+-----+--------------------+
| 2 |01/02/2011 05:00:00 |
+-----+--------------------+
| 3 |01/02/2011 06:00:00 |
+-----+--------------------+
| 4 |01/02/2011 07:00:00 |
+-----+--------------------+
| 5 |01/02/2011 08:00:00 |
....
...
..
| 334 |05/03/2011 023:00:00|
+-----+--------------------+
You haven't exactly deluged us with details, but this is the sort of construct you want:
select level as id
, &&start_date + ((level-1) * (1/24) as dates
from dual
connect by level <= ((&&end_date - &&start_date)*24)
/
This assumes your input values are whole days, You will need to adjust the maths if your start or end date contains a time component.
You would need to start with a date baseline:
vBaselineDate := TRUNC(SYSDATE);
OR
vBaselineDate := TO_DATE('28-03-2013 12:00:00', 'DD-MM-YYYY HH:MI:SS');
Then increment the baseline by adding fractions of a day depending on how large you want the range, eg: 1 minute, 1 hour etc.
FOR i IN 1..334 LOOP
INSERT INTO mytable
(id, dates)
VALUES
(i, (vBaselineDate + i/24));
END LOOP;
COMMIT;
1/24 = 1 hour.
1/1440 = 1 minute;
Hope this helps.

Resources