Merging datasets with two different time variables in SAS
Hey guys,
For those who browse this site regularly: sorry for yet another question (I did, however, solve my last question myself!).
I have another problem with merging datasets; accounting for time in datasets is a real pain. I successfully managed to merge on months in my previous datasets, but it turns out my final dataset only has quarter as a time count variable. So where all my normal datasets have month 1-xxx as the time indicator, this dataset has quarter as the time indicator.
I still want to add the variables of this last dataset, let's call it TVOL, to my WORK dataset.
Quick summary:
Quarter: quarter 0 = JAN1996-MAR1996
Month: month 0 = JAN1996
Example: TVOL

  TVOL    Ticker   Quarter
  1500    AA       -1
  52546   BB       15

Example: WORK

  BETA   Ticker   Month
  1.52   AA       2
  1.54   BB       3

Example: Merged

  BETA   TVOL   Ticker   Month
  1.52   500    AA       2
I now want to merge these two tables using the following relationship:
If the month is in quarter 1, the data of quarter 0 has to be used. So for an observation in WORK dated 2FEB1996, the TVOL of quarter -1 has to be attached to that observation. In general: if the month is in quarter i, use the data of quarter i-1.
Also, since TVOL is measured quarterly and I need it monthly, I have to take the average, so TVOL/3 should be added as a variable.
Thanks!
Ok, so I solved my problem!
data test;
set test;
* Counting quarters from 01APR1996 shifts every month back one quarter,
  so each month merges with the PREVIOUS quarter's TVOL (e.g. FEB1996 -> quarter -1);
Quarter=intck('qtr','01apr96'd,recdats);
put _all_;
run;
proc sort data=test;
by ticker quarter;
run;
proc sort data=wtvol;
by ticker quarter;
run;
data test;
merge test(in=a) wtvol(in=b);
by ticker quarter;
frommerg=a;   /* 1 if the observation came from test */
fromwtvol=b;  /* 1 if the observation came from wtvol */
run;
data test;
set test;
* keep only observations present in both datasets, then drop the flags;
if frommerg=0 or fromwtvol=0 then delete;
drop frommerg fromwtvol;
run;
I created a quarter variable in my base dataset and merged the two sets on ticker and quarter.
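For reference, the quarter shift plus the TVOL/3 averaging from the question can also be written as a single PROC SQL step. This is a minimal sketch, not the code actually used above: it assumes the variable names from the examples (beta, ticker, month, quarter, tvol) and the 0-based month/quarter numbering from the quick summary, and it calls the monthly dataset work_monthly since WORK itself is the name of the SAS library:

proc sql;
  create table merged as
  select w.beta,
         t.tvol / 3 as tvol_monthly, /* quarterly volume averaged over its 3 months */
         w.ticker,
         w.month
  from work_monthly as w
  inner join tvol as t
    on w.ticker = t.ticker
    /* month 0 = JAN1996 and quarter 0 = JAN-MAR1996, so floor(month/3) is the
       current quarter; subtracting 1 applies the "use quarter i-1" rule */
    and t.quarter = floor(w.month/3) - 1;
quit;

The inner join also plays the role of the frommerg/fromwtvol filtering in the data-step version: only ticker-quarter combinations present in both datasets survive.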
Related
Oracle - determine and return the specific hour of data with the highest sum of the values
I think I can do this in a more roundabout way using arrays, scripting, etc., BUT: is it possible to sum up (aggregate) all the values for each "hour" of data in a database for a given field? Basically, I am trying to determine which hour in a day's worth of data had the highest sum, preferably without having to loop through 24 times for each day I want to look at.

For example, let's say I have a table called "table" that contains columns for times and values as follows:

  Time   Value
  00:00  1
  00:15  1
  00:30  2
  00:45  2
  01:00  1
  01:15  1
  01:30  1
  01:45  1

If I summed up by hand, I would get:

  Sum for 00 hour = 6
  Sum for 01 hour = 4

So in this example the 00 hour would be my "largest sum" hour. I'd like to end up returning simply which hour had the highest sum, and what that value was; the other hours don't matter in this case.

Can this be done in a single Oracle query, or does it need to be done outside the query with some scripting, working with the times and values separately? If not a single query, maybe just grab the sum for each hour and I can run multiple queries, one per hour, push each hour to an array, and take the max of that array? I know there is a SUM() function in Oracle, but how to tell it to "sum all the hours and just return the hour with the highest sum" escapes me. Hope all this makes sense. Thanks for any advice to make this easier.
The following query should do what you are looking for:

SELECT SUBSTR(time, 1, 2) AS HOUR,
       SUM(amount) AS TOTAL_AMOUNT
FROM test_data
GROUP BY SUBSTR(time, 1, 2)
ORDER BY TOTAL_AMOUNT DESC
FETCH FIRST ROW WITH TIES;

The query uses the SUM function, but grouped by the hour part of your time column. It then orders the results by the summed amounts, descending, returning only the maximum. Here is a DBFiddle showing the query in use (LINK)
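Note that FETCH FIRST ... WITH TIES requires Oracle 12c or later. On older versions the same result can be obtained with an analytic ranking wrapped around the aggregate; a sketch, reusing the test_data/time/amount names from the answer above:

SELECT hour, total_amount
FROM (
  SELECT SUBSTR(time, 1, 2) AS hour,
         SUM(amount) AS total_amount,
         -- rank hourly sums, highest first
         RANK() OVER (ORDER BY SUM(amount) DESC) AS rnk
  FROM test_data
  GROUP BY SUBSTR(time, 1, 2)
)
WHERE rnk = 1;

RANK keeps ties, matching the WITH TIES behaviour; use ROW_NUMBER instead if exactly one row is wanted.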
In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?
In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? I.e., consider a table with the following data:

+----------+----------+----------+
| Employee | Date     | Balance  |
+----------+----------+----------+
| John     | 20181029 | 1800.2   |
| John     | 20181105 | 2937.74  |
| John     | 20181106 | 3000     |
| John     | 20181110 | 1500     |
| John     | 20181119 | -755.5   |
| John     | 20181120 | -800     |
| John     | 20181121 | 1200     |
| John     | 20181122 | -400     |
| John     | 20181123 | -900     |
| John     | 20181202 | -1300    |
+----------+----------+----------+

If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days not shown have the same balance as the previous register. In the above data, John had 1800.2 between 20181101 and 20181104, for example.

Assuming that the table always has exactly one row for each date/balance, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using case to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data

The first step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance:

use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
  employee_id string,
  employee string,
  iso_date date,
  balance double
);

Insert sample data into daily_employee_balance:

insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);

Step 2: Dimension Table

You will need a dimension table holding a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is a normal industry standard, and you can probably download sample data for one from the internet.

use default;
drop table if exists dimension_date;
create external table dimension_date(
  date_id int,
  iso_date string,
  year string,
  month string,
  month_desc string,
  end_of_month_flg string
);

Insert some sample data for the entire month of Nov 2018:

insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');

Step 3: Fact Table

Create a fact table from the original table. In normal practice you ingest the data to HDFS/Hive, then process the raw data and create a table with historical data which you keep inserting into incrementally. You can look into data warehousing for the proper definitions, but I call this a fact table: f_employee_balance. It re-creates the original table with the missing dates and populates the missing balances with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balances
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY")
as
select q1.employee_id,
       q1.iso_date,
       nvl(last_value(r.balance, true)  --initial dates are populated with 0 balance
             over (partition by q1.employee_id
                   order by q1.iso_date
                   rows between unbounded preceding and current row), 0) as balance,
       month,
       year
from (
  select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
  from daily_employee_balance r, dimension_date d
) q1
left outer join daily_employee_balance r
  on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);

Step 4: Analytics

The query below will give you the true average by month:

select employee_id, monthly_avg, month, year
from (
  select employee_id,
         row_number() over (partition by employee_id, year, month) as row_num,
         avg(balance) over (partition by employee_id, year, month) as monthly_avg,
         month,
         year
  from f_employee_balance
) q1
where row_num = 1
order by year, month;

Step 5: Conclusion

You could have just combined steps 3 and 4; that would save you from creating the extra table. When you are in the big data world you don't worry much about spending extra disk space or development time: you can easily add another disk or node, and automate the process using workflows. For more information, look into data warehousing concepts and Hive analytical queries.
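As a sketch of that combination (untested, reusing the tables and column names defined above), steps 3 and 4 can be collapsed into one query with a CTE, so no intermediate table is materialized:

with filled as (
  -- step 3: fill the calendar with the last known balance per employee
  select q1.employee_id,
         q1.iso_date,
         nvl(last_value(r.balance, true)
               over (partition by q1.employee_id
                     order by q1.iso_date
                     rows between unbounded preceding and current row), 0) as balance,
         q1.month,
         q1.year
  from (
    select distinct r.employee_id, d.iso_date, d.month, d.year
    from daily_employee_balance r, dimension_date d
  ) q1
  left outer join daily_employee_balance r
    on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
)
-- step 4: ordinary aggregation over the filled series
select employee_id, year, month, avg(balance) as monthly_avg
from filled
group by employee_id, year, month
order by year, month;

The plain GROUP BY replaces the row_number()/where row_num = 1 trick from step 4, since here we aggregate instead of windowing.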
Hive Script - How to transform a table / find the average of certain records according to the values in one column?
I want to transform a Hive table by aggregating based on averages. However, I don't want the average value of an entire column; I want the average of the records in that column that have the same type in another column. Here's an example, easier than trying to explain.

TABLE I HAVE:

  Timestamp  CounterName  CounterValue  MaxCounterValue  MinCounterValue
  00:00      Counter1     3             3                1
  00:00      Counter2     4             5                2
  00:00      Counter3     1             4                1
  00:00      Counter4     6             6                1
  00:05      Counter1     3             5                2
  00:05      Counter2     2             2                2
  00:05      Counter3     4             5                4
  00:05      Counter4     6             6                5
  ...

TABLE I WANT:

  CounterName  AvgCounterValue  MaxCounterValue  MinCounterValue
  Counter1     3                5                1
  Counter2     3                5                2
  Counter3     2.5              5                1
  Counter4     6                6                1

So I have a list of counters, each of which has multiple records (one per 5-minute time period). Every time each counter is logged, it has a value, a max value during those 5 minutes, and a min value. I want to aggregate this huge table so that it has just one record for each counter, recording the overall average value for that counter across all the records in the table, plus the overall min/max value of the counter in the table.

The reason this is difficult is that all the documentation explains is how to aggregate by the average of a column in one table; I don't know how to split it up in groups. Here's the script I've started with:

FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT HighCounters.CounterName AS CounterName,
       HighCounters.CounterValue AS CounterValue
       HighCounters.MaxCounterValue AS MaxCounterValue,
       HighCounters.MinCounterValue AS MinCounterValue
GROUP BY HighCounters.CounterName;

And I don't know where to go from there... any ideas? Thanks!!
I think I solved my own problem:

FROM HighCounters INSERT OVERWRITE TABLE MdsHighCounters
SELECT HighCounters.CounterName AS CounterName,
       AVG(HighCounters.CounterValue) AS CounterValue,
       MAX(HighCounters.MaxCounterValue) AS MaxCounterValue,
       MIN(HighCounters.MinCounterValue) AS MinCounterValue
GROUP BY HighCounters.CounterName;

Does this look right to you?
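For readers less used to Hive's FROM-first syntax, the same statement in conventional SELECT-first form (a sketch, same table and column names as above) would be:

-- equivalent SELECT-first form of the query above
INSERT OVERWRITE TABLE MdsHighCounters
SELECT CounterName,
       AVG(CounterValue) AS CounterValue,
       MAX(MaxCounterValue) AS MaxCounterValue,
       MIN(MinCounterValue) AS MinCounterValue
FROM HighCounters
GROUP BY CounterName;

Both forms are equivalent; GROUP BY CounterName is exactly the "split it up in groups" the question was after.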
Postgres timeline simulator
I want to order search results by (age group, rank) and have age groups of 1 day, 1 week, 1 month, 6 months, etc. I know I can get the "days old" value with SELECT NOW()::DATE - created_at::DATE FROM blah, and I'm thinking of doing a CASE statement based on that, but am I barking up the right tree performance-wise? Is there a nicer way?
You can also create a separate table with the interval definitions and labels. However, this comes at the cost of an extra join to get the data.

create table distance (
  d_start int,
  d_end int,
  d_description varchar
);

insert into distance values
(1,7,'1 week'),
(8,30,'1 month'),
(31,180,'6 months'),
(181,365,'1 year'),
(366,999999,'more than one year');

with sample_data as (
  select * from generate_series('2013-01-01'::date,'2014-01-01'::date,'1 day') created_at
)
select created_at, d_description
from sample_data sd
join distance d
  on ((current_date - created_at::date) between d.d_start and d.d_end);
I use this function to update an INT column stored on the table for performance reasons, running an occasional update task. What's nice that way is that it's only necessary to run it against a small subset of the data once per hour (anything less than ~1 week old), and every 24 hours it can just be run against anything more than 1 week old (perhaps even a weekly task for even older stuff).

CREATE OR REPLACE FUNCTION age_group(_date timestamp) RETURNS int AS $$
DECLARE
  days_old int;
  age_group int;
BEGIN
  days_old := current_date - _date::DATE;
  age_group := CASE
                 WHEN days_old < 2  THEN 0
                 WHEN days_old < 8  THEN 1
                 WHEN days_old < 30 THEN 2
                 WHEN days_old < 90 THEN 3
                 ELSE 4
               END;
  RETURN age_group;
END;
$$ LANGUAGE plpgsql;
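To make the update-task pattern concrete, here is a hypothetical usage sketch; the table blah and the columns created_at and rank are illustrative names following the question:

ALTER TABLE blah ADD COLUMN age_group int;

-- hourly task: only recent rows can change buckets from day to day
UPDATE blah
SET age_group = age_group(created_at)
WHERE created_at >= now() - interval '7 days';

-- daily (or weekly) task: older rows change buckets rarely
UPDATE blah
SET age_group = age_group(created_at)
WHERE created_at < now() - interval '7 days';

-- search ordering then reads the precomputed column
SELECT *
FROM blah
ORDER BY age_group, rank;

An index on (age_group, rank) can then serve the ORDER BY directly, which is the performance point of precomputing the bucket.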
Eliminate pairs of observations, given that an observation can have more than one possible partner observation
In my current project we have had several occasions where we had to implement matching based on varying conditions. First, a more detailed description of the problem. We have a table test:

  key  value
  1    10
  1    -10
  1    10
  1    20
  1    -10
  1    10
  2    10
  2    -10

Now we want to apply a rule so that, inside a group (defined by the value of key), pairs with a sum of 0 are eliminated. The expected result would be:

  key  value
  1    10
  1    20

Sort order is not relevant.

The following code is an example of our solution. We want to eliminate the observations with my_id 2 and 7, and additionally two of the three observations with amount 10.

data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;

/* get all possible matches represented by pairs of my_id */
proc sql noprint;
  create table zwischen_erg as
  select a.my_id as a_id,
         b.my_id as b_id
  from test as a
  inner join test as b
    on (a.alias = b.alias)
  where a.amount = -b.amount;
quit;

/* select ids of matches to eliminate */
proc sort data=zwischen_erg;
  by a_id b_id;
run;

data zwischen_erg1;
  set zwischen_erg;
  by a_id;
  if first.a_id then tmp_id1 = 0;
  tmp_id1 + 1;
run;

proc sort data=zwischen_erg;
  by b_id a_id;
run;

data zwischen_erg2;
  set zwischen_erg;
  by b_id;
  if first.b_id then tmp_id2 = 0;
  tmp_id2 + 1;
run;

proc sql;
  create table delete_ids as
  select erg1.a_id as my_id
  from zwischen_erg1 as erg1
  left join zwischen_erg2 as erg2
    on (erg1.a_id = erg2.a_id and erg1.b_id = erg2.b_id)
  where tmp_id1 = tmp_id2;
quit;

/* use delete_ids as a filter */
proc sql noprint;
  create table erg as
  select a.*
  from test as a
  left join delete_ids as b
    on (a.my_id = b.my_id)
  where b.my_id = .;
quit;

The algorithm seems to work; at least nobody has found input data that caused an error. But nobody could explain to me why it works, and I don't understand in detail how it is working. So I have a couple of questions:

1. Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
2. If it does work correctly, how does the algorithm work in detail? Especially the part where tmp_id1 = tmp_id2.
3. Is there a better algorithm to eliminate corresponding pairs?

Thanks in advance and happy coding,
Michael
As an answer to your third question: the following approach seems simpler to me, and probably more performant (since I have no joins).

/* For every (absolute) value, find how many more positive than negative
   occurrences we have per key */
proc sql;
  create view V_INTERMEDIATE_VIEW as
  select key,
         abs(Value) as Value_abs,
         sum(sign(value)) as balance
  from INPUT_DATA
  group by key, Value_abs;
quit;

* The balance variable here means: how many times more often did we see the
  positive than the negative of this value, i.e. how many of either the
  positive or the negative we were not able to eliminate;

/* Now output */
data OUTPUT_DATA (keep=key Value);
  set V_INTERMEDIATE_VIEW;
  Value = sign(balance)*Value_abs; * put the correct sign back;
  do i=1 to abs(balance) by 1;
    output;
  end;
run;

If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.

data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
  set INPUT_DATA;
  value_abs = abs(value);
run;

proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
  by key value_abs; * we will encounter the negatives of each value and then the positives;
run;

data OUTPUT_DATA (keep=key value);
  set INTERMEDIATE_DATA;
  by key value_abs;
  retain balance 0;
  balance = sum(balance,sign(value));
  if last.value_abs then do;
    value = sign(balance)*value_abs; * set sign depending on what we have in excess;
    do i=1 to abs(balance) by 1;
      output;
    end;
    balance=0; * reset balance for next value_abs;
  end;
run;

NOTE: thanks to Joe for some useful performance suggestions.
I don't see any bugs after a quick read. But zwischen_erg could have a lot of unnecessary many-to-many matches, which would be inefficient. This seems to work (though it's not guaranteed), and might be more efficient. It's also shorter, so perhaps easier to see what's going on.

data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;

proc sort data=test;
  by alias amount;
run;

/* number the occurrences of each (alias, amount) combination */
data zwischen_erg;
  set test;
  by alias amount;
  if first.amount then occurrence = 0;
  occurrence + 1;
run;

/* keep a row only if there is no opposite-amount row in the same group
   with the same occurrence number to pair it with */
proc sql;
  create table zwischen as
  select a.my_id, a.alias, a.amount
  from zwischen_erg as a
  left join zwischen_erg as b
    on a.alias = b.alias
   and a.amount = (-1)*b.amount
   and a.occurrence = b.occurrence
  where b.my_id is missing;
quit;