SAS: creating benchmark period dummies (around event) - events

I want to do an event study and need some descriptives around the events I am testing with different window sizes.
Let's assume I have a data set with 10000 observations (stocks, dates, measures) and merge them with an announcement data set. Now, I want to get a dummy variable that is always 0 except for parameters given by me:
if date = announcement_date then;
window_dummy at t-60 to t-11 = 1
or
window_dummy at t-5 to t+5 = 1
I hope I described it appropriately and there is a way because there is no lead function or similar.
Best
M

From what I gather from this, you wish to have additional variable, which creates window (or binning) variable reaching x days to past and future;
First we generate some test data. SAS dates are just numbers so:
data begin;
do i = 21040 to 21080;
my_date= i;
output;
end;
drop i;
run;
Next we calculate the window:
%let act_date= 21060;
%let min_date= %eval(&act_date - 5); /*5 days to past*/
%let max_date= %eval(&act_date + 3); /*3 days to future*/
Then it's left to generate the dummy variable:
data refined;
set begin;
if &min_date <= my_date and
my_date <= &max_date then
dummy = 1;
else dummy = 0;
run;
Now, one can calculate this all in single datastep, but maybe this is more informative way.
Edit: you can specify windows as you please. In your case you could have:
min_date1= %eval(&act_date - 60);
max_date2= %eval(&act_date - 11);
....
Then just add them to the if clause with or operator.

Related

Google Spreadsheet Loop & IF statement: Loop through co-worker list and add them to a rotation schedule

EDIT - SOLVED
I have to make a rotation schedule for coworkers for the next year. Some coworkers have standard days off and I do not want to schedule them on those days.
This is the manual outcome I would like to get.
Example: Consultant A does not work on mondays, so I do not want Consultant A to be added to the schedule on a monday.
I then want consultant B to be added to the schedule as a fill-up. Consultant A would be next in line on a tuesday etc. Next would be consultant C but consultant C does not work on wednesdays. Therefore, we need to take consultant D for wednesday and consultant C on a thursday, and so on. When we are at the last consultant of the F column, it needs to start again at consultant A.
I have tried all kinds of formulas, like if statements and arrayformula. But there is no way that I know of to loop through the F column just with formulas.
I am not sure if this is at all clear what I want to achieve here, I am stuck 😄
I am using additionally an add-on to send the schedule to everyone's agenda, thats also the reason why i'd love to automate this, because it would help me SO much.
I did try myself on some coding, but I am no coder and I am not sure if it would be helpful at all to share my failure 😄 But this is what I've tried so far:
function Loop() {
var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
var EndRow = ss.getLastRow();
for (var i = 2; i <= EndRow; i++) {
var Day = ss.getRange(i,2).getValue();
var Consultants = ss.getRange(i,6).getValue();
var Off = ss.getRange(i,7).getValue ();
var Count = ss.getRange(i,8).getValue();
if(Day == Off){
ss.getRange(i, 3).setValue(Consultants)
}else{
ss.getRange(i, 3).setValue(Consultants)
}
}
}
EDIT:
I found a way without using apps scripts, costs me some more work manually and first tried it with a shorter team list.
The highlighted yellow cells are the cells in which the day off was identical to the work-day cell. So they got switched.
I did have to copy paste my input list of consultants but if this is the only manual way, its fine :)
Try this:
Code:
function myFunction() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sh = ss.getSheetByName("Sheet1");
var dayCol = sh.getRange("B2:B343").getValues().flat(); //get day column values and convert it to 1d array
var dayOffCol = sh.getRange("G2:G9").getValues().flat(); //get day off values and convert it to 1d array
var dayOffColCopy; //initialize copy
var consCol = sh.getRange("F2:F9").getValues().flat(); //get consultants values or column F
var consColCopy; //initialize copy
var tempArray = []; //storage of final value for column C
for(var i = 0; i < dayCol.length; i++){ //loop through dayCol values
var ctr = (i % consCol.length); //used modulo as counter. the value will return to 0 if the value of i is divisible to the length of consCol or in example 8
//The if statement below will help up reset the value of
//consColCopy and dayOffColCopy once the values are emptied because of the splice()
if(ctr == 0){
consColCopy = consCol.slice();
dayOffColCopy = dayOffCol.slice();
}
//the loop below will get the first non-matching values of dayCol and dayOffColCopy,
//the first non-matching values will be removed to the copy variables using splice()
//and insert it to tempArray using push()
for(var j = 0; j < dayOffColCopy.length; j++){ //loop through dayOffColCopy values
if(dayCol[i] != dayOffColCopy[j]){
tempArray.push(consColCopy.splice(j, 1));
dayOffColCopy.splice(j, 1);
break; //exit loop
}
}
}
sh.getRange(2, 3, tempArray.length, 1).setValues(tempArray); //set the values of temp array to column C
}
Example Data & Output:
Note: Make sure to use the cell that has data in your range and change the sheet name. I also added comments in my code to explain the process.
References:
Array.prototype.push()
Array.prototype.slice()
Array.prototype.splice()
Class Range

Sort a range or array based on two columns that contain the date and time

Currently I'm trying to create a Google Apps Script for Google Sheets which will allow adding weekly recurring events, batchwise, for upcoming events. My colleagues will then make minor changes to these added events (e.g. make date and time corrections, change the contact person, add materials neccessary for the event and so forth).
So far, I have written the following script:
function CopyWeeklyEventRows() {
var ss = SpreadsheetApp.getActiveSheet();
var repeatingWeeks = ss.getRange(5,1).getValue(); // gets how many weeks it should repeat
var startDate = ss.getRange(6, 1).getValue(); // gets the start date
var startWeekday = startDate.getDay(); // gives the weekday of the start date
var regWeek = ss.getRange(9, 2, 4, 7).getValues(); // gets the regular week data
var regWeekdays = new Array(regWeek.length); // creates an array to store the weekdays of the regWeek
var ArrayStartDate = new Array(startDate); // helps to store the We
for (var i = 0; i < regWeek.length; i++){ // calculates the difference between startWeekday and each regWeekdays
regWeekdays[i] = regWeek[i][1].getDay() - startWeekday;
Logger.log(regWeekdays[i]);
// Add 7 to move to the next week and avoid negative values
if (regWeekdays[i] < 0) {
regWeekdays[i] = regWeekdays[i] + 7;
}
// Add days according to difference between startWeekday and each regWeekdays
regWeek[i][0] = new Date(ArrayStartDate[0].getTime() + regWeekdays[i]*3600000*24);
}
// I'm struggling with this line. The array regWeek is not sorted:
//regWeek.sort([{ column: 1, ascending: true }]);
ss.getRange(ss.getLastRow() + 1, 2, 4, 7).setValues(regWeek); // copies weekly events after the last row
}
It allows to add one week of recurring events to the overview section of the spreadsheet based on a start date. If the start date is a Tuesday, the regular week is added starting from a Tuesday. However, the rows are not sorted according to the dates:
.
How can the rows be sorted by ascending date (followed by time) before adding them to the overview?
My search for similar questions revealed Google Script sort 2D Array by any column which is the closest hit I've found. The same error message is shown when running my script with the sort line. I don't understand the difference between Range and array yet which might help to solve the issue.
To give you a broader picture, here's what I'm currently working on:
I've noticed that the format will not necessarily remain when adding
new recurring events. So far I haven't found the rule and formatted by
hand in a second step.
A drawback is currently that the weekly recurring events section is
fixed. I've tried to find the last filled entry and use it to set the
range of regWeek, but got stuck.
Use the column A to exclude recurring events from the addition
process using a dropdown.
Allow my colleagues to add an event to the recurring events using a
dropdown (e.g. A26). This event should then be added with sorting to
the right day of the week and start time. The sorting will come in
handy.
Thanks in advance for your input regarding the sorting as well as suggestions on how to improve the code in general.
A demo version of the spreadsheet
UpdateV01:
Here the code lines which copy and sort (first by date, then by time)
ss.getRange(ss.getLastRow()+1,2,4,7).setValues(regWeek); // copies weekly events after the last row
ss.getRange(ss.getLastRow()-3,2,4,7).sort([{column: 2, ascending: true}, {column: 4, ascending: true}]); // sorts only the copied weekly events chronologically
As #tehhowch pointed out, this is slow. Better to sort BEFORE writing.
I will implement this method and post it here.
UpdateV02:
regWeek.sort(function (r1, r2) {
// sorts ascending on the third column, which is index 2
return r1[2] - r2[2];
});
regWeek.sort(function (r1, r2) {
// r1 and r2 are elements in the regWeek array, i.e.
// they are each a row array if regWeek is an array of arrays:
// Sort ascending on the first column, which is index 0:
// if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
return r1[0] - r2[0];
});
UpdateV03:
Here an attempt to repeat the recurring events over several weeks. Don't know yet how to include the push for the whole "week".
// Repeat week for "A5" times and add to start/end date
for (var j = 0; j < repeatingWeeks; j++){
for (var i = 0; i < numFilledRows; i++){
regWeekRepeated[i+j*6][0] = new Date(regWeek[i][0].getTime() + j*7*3600000*24); // <-This line leads to an error message
regWeekRepeated[i+j*6][3] = new Date(regWeek[i][3].getTime() + j*7*3600000*24);
}
}
My question was answered and I was able to make the code work as intended.
Given your comment - you want to sort the written chunk - you have two methods available. One is to sort written data after writing, by using the Spreadsheet service's Range#sort(sortObject) method. The other is to sort the data before writing, using the JavaScript Array#sort(sortFunction()) method.
Currently, your sort code //regWeek.sort([{ column: 1, ascending: true }]); is attempting to sort a JavaScript array, using the sorting object expected by the Spreadsheet service. Thus, you can simply chain this .sort(...) call to your write call, as Range#setValues() returns the same Range, allowing repeated Range method calling (e.g. to set values, then apply formatting, etc.).
This looks like:
ss.getRange(ss.getLastRow() + 1, 2, regWeek.length, regWeek[0].length)
.setValues(regWeek)
/* other "chainable" Range methods you want to apply to
the cells you just wrote to. */
.sort([{column: 1, ascending: true}, ...]);
Here I have updated the range you access to reference the data you are attempting to write - regWeek - so that it is always the correct size to hold the data. I've also visually broken apart the one-liner so you can better see the "chaining" that is happening between Spreadsheet service calls.
The other method - sorting before writing - will be faster, especially as the size and complexity of the sort increases. The idea behind sorting a range is you need to use a function that returns a negative value when the first index's value should come before the second's, a positive value when the first index's value should come after the second's, and a zero value if they are equivalent. This means a function that returns a boolean is NOT going to sort as one thinks, since false and 0 are equivalent in Javascript, while true and 1 are also equivalent.
Your sort looks like this, assuming regWeek is an array of arrays and you are sorting on numeric values (or at least values which will cast to numbers, like Dates).
regWeek.sort(function (r1, r2) {
// r1 and r2 are elements in the regWeek array, i.e.
// they are each a row array if regWeek is an array of arrays:
// Sort ascending on the first column, which is index 0:
// if r1[0] = 1, r2[0] = 2, then 1 - 2 is -1, so r1 sorts before r2
return r1[0] - r2[0];
});
I strongly recommend reviewing the Array#sort documentation.
You could sort the "Weekly Events" range before you set the regWeek variable. Then the range would be in the order you want before you process it. Or you could sort the whole "Overview" range after setting the data. Here's a quick function you can call to sort the range by multiple columns. You can of course tweak it to sort the "Weekly Events" range instead of the "Overview" range.
function sortRng() {
var ss = SpreadsheetApp.getActiveSheet();
var firstRow = 22; var firstCol = 1;
var numRows = ss.getLastRow() - firstRow + 1;
var numCols = ss.getLastColumn();
var overviewRng = ss.getRange(firstRow, firstCol, numRows, numCols);
Logger.log(overviewRng.getA1Notation());
overviewRng.sort([{column: 2, ascending: true}, {column: 4, ascending: true}]);
}
As for getting the number of filled rows in the Weekly Events section, you need to search a column that will always have data if any row has data (like the start date column b), loop through the values and the first time it finds a blank, return that number. That will give you the number of rows that it needs to copy. Warning: if you don't have at least one blank value in column B between the Weekly Events section and the Overview section, you will probably get unwanted results.
function getNumFilledRows() {
var ss = SpreadsheetApp.getActiveSheet();
var eventFirstRow = 9; var numFilledRows = 0;
var colToCheck = 'B';//the StartDate col which should always have data if the row is filled
var vals = ss.getRange(colToCheck + eventFirstRow + ":" + colToCheck).getValues();
for (i = 0; i < vals.length; i++) {
if (vals[i][0] == '') {
numFilledRows = i;
break;
}
}
Logger.log(numFilledRows);
return numFilledRows;
}
EDIT:
If you just want to sort the array in javascript before writing, and you want to sort by Start Date first, then by Time of day, you could make a temporary array, and add a column to each row that is date and time combined. array.sort() sorts dates alphabetically, so you would need to convert that date to an integer. Then you could sort the array by the new column, then delete the new column from each row. I included a function that does this below. It could be a lot more compact but I thought it might be more legible like this.
function sortDates() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var vals = ss.getActiveSheet().getRange('B22:H34').getDisplayValues(); //get display values because getValues returns time as weird date 1899 and wrong time.
var theDate = new Date(); var newArray = []; var theHour = ''; var theMinutes = '';
var theTime = '';
//Create a new array that inserts date and time as the first column in each row
vals.forEach(function(aRow) {
theTime = aRow[2];//hardcoded - assumes time is the third column that you grabbed
//get the hours (before colon) as a number
theHour = Number(theTime.substring(0,theTime.indexOf(':')));
//get the minutes(after colon) as a number
theMinutes = Number(theTime.substring(theTime.indexOf(':')+1));
theDate = new Date(aRow[0]);//hardcoded - assumes date is the first column you grabbed.
theDate.setHours(theHour);
theDate.setMinutes(theMinutes);
aRow.unshift(theDate.getTime()); //Add the date and time as integer to the first item in the aRow array for sorting purposes.
newArray.push(aRow);
});
//Sort the newArray based on the first item of each row (date and time as number)
newArray.sort((function(index){
return function(a, b){
return (a[index] === b[index] ? 0 : (a[index] < b[index] ? -1 : 1));
};})(0));
//Remove the first column of each row (date and time combined) that we added in the first step
newArray.forEach(function(aRow) {
aRow.shift();
});
Logger.log(newArray);
}

sas : sorting one column without changing the order of the others

I wonder if we can just order one column in sas and keep the same order for the other variables.
usually, we use
proc sort
with "by" but this change the order of all variables according to the variable used in "by".
Thank you for help
Create the sorted column as a new dataset and then merge it back onto the data.
proc sort data=have (keep=COLUMN) out=COLUMN ;
by COLUMN;
run;
data want ;
merge have COLUMN;
* no BY statement ;
run;
You can do this in a data step by using hash methods - e.g. to reverse the order of the name column in sashelp.class while keeping the other columns in the same order:
data class;
/*Set up an ordered hash object + iterator to hold the columns we want to sort*/
if 0 then set sashelp.class(keep = name);
declare hash h(ordered:'d');
rc = h.definekey('name','_n_');
rc = h.definedata('name');
rc = h.definedone();
declare hiter hi('h');
/*Populate the hash object, using _n_ as an extra key to prevent deduplication*/
do _n_ = 1 by 1 until(eof1);
set sashelp.class(keep = name) end = eof1;
rc = h.add();
end;
/*Read in the columns in the desired order using the hash iterator*/
do until(eof2);
set sashelp.class end = eof2;
rc = hi.next();
output;
drop rc;
end;
run;
This assumes that you have sufficient memory to hold the columns being sorted.
You can't do it with PROC SORT. You will need to split the data into sorted and non-sorted datasets, then merge then back together one-to-one without "by" statement using a data step. http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001318478.htm
Regards,
Vasilij

PIG- Aggregations based on multiple columns

My Input data set has 3 columns and schema looks like below:
ActivityDate, EventId, EventDate
Now, using pig i need to derive multiple variables like below in one output file:
1) All Event Ids after ActivityDate >= EventDate -30 days
2) All Event Ids after ActivityDate >= EventDate -60 days
3) All Event Ids after ActivityDate >= EventDate -90 days
I have more than 30 variables like this. If it is one variable, we can use simple FILTER to filter the data.
I am thinking about any UDF implementation which takes bag as input and returns count of Event IDs based on above criteria for each parameter.
What is the best way to aggregate the data on multiple columns in pig ?
I would suggest creating another file with all of your thresholds and cross joining with the file.
so you would have a file containing:
30
60
90
etc
read it like this:
grouping = load 'grouping.txt' using PigStorage(',') as (groups:double);
Then do:
data_with_grouping = cross data, grouping;
Then have this binary condition:
data_with_binary_condition = foreach data_with_grouping generate ActivityDate, EventId, EventDate, groups, (ActivityDate >= EventDate - groups ? 1 : 0) as binary_condition;
Now you will have one column with the threshold and one column with a binary variable that tells you whether the ID follows the condition or not.
you can do a filter out all of the zeros from the binary_condition and then group on the groups column:
data_with_binary_condition_filtered = filter data_with_binary_condition by (binary_condition != 0);
grouped_by_threshold = group data_with_binary_condition_filtered by groups;
count_of_IDS = foreach grouped_by_threshold generate group, COUNT(data_with_binary_condition.EventId);
I hope this works. Obviously, I didn't debug it for you since I don't have your files.
This code will take a tad more time to run, but it will produce the output you need without a UDF.
If I understand your question correctly, you want to divide the difference between EventDate and ActivityDate in 30 days blocks (e.g. 1 to 30, 31 to 60, 61 to 90 and so on) and then count the frequency of each block.
In this case, I would just rearrange the above equation to create the variable 'range' as below:
// assuming input contains 3 columns ActivityDate, EventId, EventDate
// dividing the difference between ED and AD by 30 and casting it to int, so that 1 block is represented by 1 integer.
input1 = FOREACH input GENERATE (int)((EventDate - ActivityDate) / 30) as range;
output1 = GROUP input1 BY range;
output2 = FOREACH output1 GENERATE group AS range, COUNT(range) as count;
Hope this helps.

SUM function in Pig script

I am a student learning how to use Pig script using the hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log and I am able to do perform several queries and use the count function... but no luck with the SUM function which I really need in one case. This code I used below:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C I get a result displaying many records indicating the amount of data received/sent for the filter applied. eg:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show that total amount of received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First make sure that the fields that you are summing up are of type int
Use - DESCRIBE A; to check the data type
After that, I think since you have used filter condition and then used group by on F1 -
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
So, while summing up you should use F1 instead of A -
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will understand what I am trying to say, there will be no 'A'
I guess this should solve the error. Finally, check the logic of what you want in the result I have not checked that. Hope this helps.
PS - I am also a student and new to PIG.
A lucky guess here, I'm new to Pig too :)
I'm not sure if SUM can be casted to chararray(that would explain the error), so make rcvd and sent type:int and then generate the 2 sums for grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
NOTE: More info here.
Hope I helped a little!
Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach F1 generate group,SUM(F1.rcvd), SUM(F1.sent);
dump C;

Resources