Merging datasets with 2 different time variables in SAS

Hey guys,
For those regularly browsing this site, sorry for yet another question (I did solve my last question myself, though!).
I have another problem with merging datasets; accounting for time in datasets is a real pain in the ass. I successfully managed to merge on months in my previous datasets, but it seems I have a final dataset which only has quarter as a time count variable. So where all my normal databases have month 1-xxx as an indicator of time, this database has quarter as an indicator of time.
I still want to add the variables of this last database, let's call it TVOL, into my WORK database.
Quick summary
QUARTER: Quarter 0 = JAN1996-MAR1996
Month: Month 0 = JAN1996
Example: TVOL
TVOL    Ticker   Quarter
1500    AA       -1
52546   BB       15
Example: WORK
BETA   Ticker   Month
1.52   AA       2
1.54   BB       3
Example: Merged
BETA   TVOL   Ticker   Month
1.52   500    AA       2
I now want to merge these 2 tables using the following relationship: if the month is in quarter 1, the data of quarter 0 has to be used. So if I have an observation in WORK with date 2FEB1996 (which lies in quarter 0), the TVOL of quarter -1 has to be put behind this observation.
Something like: if the month falls in quarter i, use the data of quarter i-1.
Also, as TVOL is measured quarterly and I have to put it in monthly, I have to take the average, so (TVOL/3) should be added as a variable.
Thanks!

Ok, so I solved my problem!
data test;
  set test;
  /* count quarters from 01APR1996, so a record in calendar quarter i
     gets assigned quarter i-1 (the one-quarter lag described above) */
  Quarter = intck('qtr', '01apr96'd, recdats);
  put _all_;
run;

proc sort data=test;
  by ticker quarter;
run;

proc sort data=wtvol;
  by ticker quarter;
run;

data test;
  merge test(in=a) wtvol(in=b);
  by ticker quarter;
  frommerg = a;
  fromwtvol = b;
run;

/* keep only observations present in both datasets, then drop the flags */
data test;
  set test;
  if frommerg = 0 or fromwtvol = 0 then delete;
  drop frommerg fromwtvol;
run;
I created a quarter variable in my base dataset and merged the 2 sets based on quarter and ticker.
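For anyone with the same setup but no date variable, here is a minimal sketch of the same idea worked directly from the month counter (assuming month 0 = JAN1996 and quarter 0 = JAN-MAR 1996 as above; the dataset and variable names are illustrative, not from the original post):
data work_q;
  set work;
  /* month in calendar quarter i -> merge against TVOL quarter i-1 */
  quarter = floor(month/3) - 1;
run;
proc sort data=work_q; by ticker quarter; run;
proc sort data=tvol;   by ticker quarter; run;
data merged;
  merge work_q(in=a) tvol(in=b);
  by ticker quarter;
  if a and b;             /* keep matches only */
  tvol_month = tvol / 3;  /* spread the quarterly figure over 3 months */
run;
For the example above, month 2 (MAR1996) maps to quarter -1, so AA picks up TVOL 1500 and tvol_month 500, matching the merged table shown in the question.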

Related

Oracle - determine and return the specific hour of data with the highest sum of the values

I think I can do this in a more roundabout way using arrays, scripting, etc., BUT is it possible to sum up (aggregate) all the values for each "hour" of data in a database for a given field? Basically, I am trying to determine which hour in a day's worth of data had the highest sum, preferably without having to loop through 24 times for each day I want to look at. For example, let's say I have a table called "table" that contains columns for times and values as follows:
Time Value
00:00 1
00:15 1
00:30 2
00:45 2
01:00 1
01:15 1
01:30 1
01:45 1
If I summed up by hand, I would get the following
Sum for 00 Hour = 6
Sum for 01 Hour = 4
So, in this example 00 Hour would be my "largest sum" hour. I'd like to end up returning simply which hour had the highest sum, and what that value was...the other hours don't matter in this case.
Can this be done all in a single ORACLE query, or does it need to be done outside the query with some scripting, working with the times and values separately? If not in a single query, maybe I could just grab the sum for each hour and run multiple queries - one per hour - then push each hour to an array and just use the max of that array? I know there is a SUM() function in Oracle, but how to tell it to "sum all the hours and just return the hour with the highest sum" escapes me. Hope all this makes sense. lol
Thanks for any advice to make this easier. :-)
The following query should do what you are looking for:
SELECT SUBSTR(time, 1, 2) AS hour,
       SUM(amount) AS total_amount
FROM test_data
GROUP BY SUBSTR(time, 1, 2)
ORDER BY total_amount DESC
FETCH FIRST ROW WITH TIES;
The query uses the SUM function, grouping by the hour part of your time column (time is assumed to be stored as text here, so SUBSTR(time, 1, 2) extracts the hour). It then orders the results by the summed amounts in descending order and returns only the top row; WITH TIES keeps every hour that shares the maximum sum.
Here is a DBFiddle showing the query in use (LINK)
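If you're on a version before Oracle 12c, where the FETCH FIRST row-limiting clause isn't available, a sketch of the same idea using an analytic RANK() in a subquery (same assumed test_data table and columns):
SELECT hour, total_amount
FROM (SELECT SUBSTR(time, 1, 2) AS hour,
             SUM(amount) AS total_amount,
             RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
      FROM test_data
      GROUP BY SUBSTR(time, 1, 2))
WHERE rnk = 1;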

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? I.e., consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days that are not shown have the same balance as the previous record. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row per date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using a case expression to pick the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert sample data into the original table daily_employee_balance:
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table holding a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is a normal industry standard, and you can probably download sample data for one off the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for the entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
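(Side note: if you don't want to hand-type the calendar rows, the dates themselves can be generated inside Hive. A minimal sketch, assuming Hive 0.13+ where posexplode, space() and date_add are available:)
-- split(space(29), ' ') yields a 30-element array; posexplode turns it
-- into offsets 0..29, which date_add shifts from the first of the month
select date_add('2018-11-01', d.i) as iso_date
from (select 1 as dummy) t
lateral view posexplode(split(space(29), ' ')) d as i, x;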
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table of historical data that you keep inserting into incrementally. You can look into data warehousing to get the proper definitions, but I call this a fact table - f_employee_balance.
This will re-create the original table with missing dates and populate the missing balance with earlier known balance.
--inner query gets the cross product of employees and all possible dates;
--the outer left join then fills in the missing dates and balances
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
       --last_value(..., true) skips NULLs, carrying the last known balance
       --forward; nvl populates dates before the first record with 0
       nvl(last_value(r.balance, true)
           over (partition by q1.employee_id order by q1.iso_date
                 rows between unbounded preceding and current row), 0) as balance,
       month, year
from (
  select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
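(Derived by hand from the sample data, the first November rows of f_employee_balance come out as below; note how 2018-11-01 through 2018-11-04 carry forward the last known balance, 1125.7 from 2018-10-29:)
employee_id  iso_date    balance
103          2018-11-01  1125.7
103          2018-11-02  1125.7
103          2018-11-03  1125.7
103          2018-11-04  1125.7
103          2018-11-05  2937.74
103          2018-11-06  3000
...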
Step 4: Analytics
The query below will give you the true average per month:
select employee_id, monthly_avg, month, year from (
  select employee_id,
         row_number() over (partition by employee_id, year, month) as row_num,
         avg(balance) over (partition by employee_id, year, month) as monthly_avg,
         month, year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;
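Side note: because f_employee_balance holds exactly one row per employee per day, the row_number/window-average combination above can arguably be replaced by a plain GROUP BY; a sketch:
select employee_id, avg(balance) as monthly_avg, month, year
from f_employee_balance
group by employee_id, year, month
order by year, month;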
Step 5: Conclusion
You could have just combined steps 3 and 4; that would save you from creating an extra table. When you are in the big data world, you don't worry much about spending extra disk space or development time: you can easily add another disk or node and automate the process using workflows. For more information, look into data warehousing concepts and Hive analytical queries.

Hive script - How to transform a table / find the average of certain records according to one column's name?

I want to transform a Hive table by aggregating based on averages. However, I don't want the average value of an entire column; I want the average of the records in that column that have the same type in another column.
Here's an example, easier than trying to explain:
TABLE I HAVE:
Timestamp  CounterName  CounterValue  MaxCounterValue  MinCounterValue
00:00      Counter1     3             3                1
00:00      Counter2     4             5                2
00:00      Counter3     1             4                1
00:00      Counter4     6             6                1
00:05      Counter1     3             5                2
00:05      Counter2     2             2                2
00:05      Counter3     4             5                4
00:05      Counter4     6             6                5
...
TABLE I WANT:
CounterName  AvgCounterValue  MaxCounterValue  MinCounterValue
Counter1     3                5                1
Counter2     3                5                2
Counter3     2.5              5                1
Counter4     6                6                1
So I have a list of a bunch of counters, each of which has multiple records (one per 5-minute time period). Every time each counter is logged, it has a value, a max value during those 5 minutes, and a min value. I want to aggregate this huge table so that it has just one record per counter, holding the overall average value for that counter across all the records in the table, and then the overall min/max value of the counter in the table.
The reason this is difficult is that all the documentation only says how to aggregate by the average of a whole column in one table - I don't know how to split it up into groups.
Here's the script I've started with:
FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT
HighCounters.CounterName AS CounterName,
HighCounters.CounterValue AS CounterValue,
HighCounters.MaxCounterValue AS MaxCounterValue,
HighCounters.MinCounterValue AS MinCounterValue
GROUP BY HighCounters.CounterName;
And I don't know where to go from there... any ideas? Thanks!!
I think I solved my own problem:
FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT
HighCounters.CounterName AS CounterName,
AVG(HighCounters.CounterValue) AS CounterValue,
MAX(HighCounters.MaxCounterValue) AS MaxCounterValue,
MIN(HighCounters.MinCounterValue) AS MinCounterValue
GROUP BY HighCounters.CounterName;
Does this look right to you?

Postgres timeline simulator

I want to order search results by (age group, rank), and have age groups of 1 day, 1 week, 1 month, 6 months, etc. I know I can get the "days old" with
SELECT NOW()::DATE - created_at::DATE FROM blah
and am thinking of doing a CASE statement based on that, but am I barking up the right tree performance-wise? Is there a nicer way?
You can also create a separate table with the interval definitions and labels. However, this comes at the cost of an extra join to get the data.
create table distance (
d_start int,
d_end int,
d_description varchar
);
insert into distance values
(1,7,'1 week'),
(8,30,'1 month'),
(31,180,'6 months'),
(181,365,'1 year'),
(366,999999,'more than one year')
;
with
sample_data as (
select *
from generate_series('2013-01-01'::date,'2014-01-01'::date,'1 day') created_at
)
select
created_at,
d_description
from
sample_data sd
join distance d on ((current_date-created_at::date) between d.d_start and d.d_end)
;
I'm using the function below to update an INT column stored on the table for performance reasons, and running an occasional update task. What's nice that way is that it's only necessary to run it against a small subset of the data once per hour (anything less than about a week old), and every 24 hours it can be run against anything more than a week old (perhaps even a weekly task for even older stuff).
CREATE OR REPLACE FUNCTION age_group(_date timestamp) RETURNS int AS
$$
DECLARE
  days_old int;
  age_group int;
BEGIN
  days_old := current_date - _date::DATE;
  age_group := CASE
    WHEN days_old < 2 THEN 0
    WHEN days_old < 8 THEN 1
    WHEN days_old < 30 THEN 2
    WHEN days_old < 90 THEN 3
    ELSE 4
  END;
  RETURN age_group;
END;
$$
LANGUAGE plpgsql;
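A sketch of the periodic refresh described above (the posts table, its cached age_group column, and the one-week cutoff are illustrative assumptions, not from the original post):
-- hourly task: refresh the cached bucket only for recent rows,
-- where the bucket can still change quickly
UPDATE posts
SET age_group = age_group(created_at)
WHERE created_at > now() - interval '7 days';

-- daily (or weekly) task: older rows, whose bucket changes slowly
UPDATE posts
SET age_group = age_group(created_at)
WHERE created_at <= now() - interval '7 days';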

Eliminate pairs of observations under the condition that observations can have more than one possible partner observation

In my current project we have had several occasions where we had to implement matching based on varying conditions. First, a more detailed description of the problem.
We have a table test:
key Value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
Now we want to apply a rule so that, inside a group (defined by the value of key), pairs with a sum of 0 are eliminated.
The expected result would be:
key value
1 10
1 20
Sort order is not relevant.
The following code is an example of our solution.
We want to eliminate the observations with my_id 2 and 7, and additionally 2 of the 3 observations with amount 10.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
/* get all possible matches represented by pairs of my_id */
proc sql noprint;
create table zwischen_erg as
select a.my_id as a_id,
b.my_id as b_id
from test as a inner join
test as b on (a.alias=b.alias)
where a.amount=-b.amount;
quit;
/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
by a_id b_id;
run;
data zwischen_erg1;
set zwischen_erg;
by a_id;
if first.a_id then tmp_id1 = 0;
tmp_id1 +1;
run;
proc sort data=zwischen_erg;
by b_id a_id;
run;
data zwischen_erg2;
set zwischen_erg;
by b_id;
if first.b_id then tmp_id2 = 0;
tmp_id2 +1;
run;
proc sql;
create table delete_ids as
select zwischen_erg1.a_id as my_id
from zwischen_erg1 as erg1 left join
zwischen_erg2 as erg2 on
(erg1.a_id = erg2.a_id and
erg1.b_id = erg2.b_id)
where tmp_id1 = tmp_id2
;
quit;
/* use delete_ids as filter */
proc sql noprint;
create table erg as
select a.*
from test as a left join
delete_ids as b on (a.my_id = b.my_id)
where b.my_id=.;
quit;
The algorithm seems to work; at least nobody has found input data that caused an error. But nobody could explain to me why it works, and I don't understand in detail how it works.
So I have a couple of questions.
Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
If it does work correctly, how does the algorithm work in detail? Especially the part where tmp_id1 = tmp_id2.
Is there a better algorithm to eliminate corresponding pairs?
Thanks in advance and happy coding
Michael
As an answer to your third question: the following approach seems simpler to me, and probably more performant (since it has no joins).
/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
create view V_INTERMEDIATE_VIEW as
select key, abs(Value) as Value_abs, sum(sign(value)) as balance
from INPUT_DATA
group by key, Value_abs
;
quit;
*The balance variable here means how many times more often we saw the positive than the negative of this value, i.e. how many of either the positive or the negative we were not able to eliminate. For key=1, value_abs=10 in the question's example, balance is 1 (three 10s against two -10s), so exactly one 10 is output below;
/*Now output*/
data OUTPUT_DATA (keep=key Value);
set V_INTERMEDIATE_VIEW;
Value = sign(balance)*Value_abs; *Put the correct value back;
do i=1 to abs(balance) by 1;
output;
end;
run;
If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.
data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
set INPUT_DATA;
value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
by key value_abs; *we will encounter the negatives of each value and then the positives;
run;
data OUTPUT_DATA (keep=key value);
set INTERMEDIATE_DATA;
by key value_abs;
retain balance 0;
balance = sum(balance,sign(value));
if last.value_abs then do;
value = sign(balance)*value_abs; *set sign depending on what we have in excess;
do i=1 to abs(balance) by 1;
output;
end;
balance=0; *reset balance for next value_abs;
end;
run;
NOTE: thanks to Joe for some useful performance suggestions.
I don't see any bugs after a quick read. But "zwischen_erg" could have a lot of unnecessary many-to-many matches which would be inefficient.
This seems to work (though not guaranteed), and might be more efficient. It's also shorter, so perhaps easier to see what's going on.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
proc sort data=test;
by alias amount;
run;
data zwischen_erg;
set test;
by alias amount;
if first.amount then occurrence = 0;
occurrence+1;
run;
proc sql;
create table zwischen as
select
a.my_id,
a.alias,
a.amount
from zwischen_erg as a
left join zwischen_erg as b
on a.amount = (-1)*b.amount and a.occurrence = b.occurrence
where b.my_id is missing;
quit;
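(To see why the occurrence counter pairs things off: for the sample data, zwischen_erg works out as below, derived by hand. The join matches the n-th occurrence of each amount to the n-th occurrence of its negative, so my_id 1 pairs with 2 and 6 with 7, while the third 10 (my_id 8), 8000, -16000 and 700 find no partner and survive.)
my_id  alias  amount  occurrence
4      aaa    -16000  1
2      aaa    -10     1
7      aaa    -10     2
1      aaa    10      1
6      aaa    10      2
8      aaa    10      3
5      aaa    700     1
3      aaa    8000    1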
