How Can I Round All Times Using SAS?

I have a small problem and would appreciate any help.
What I'm trying to do is round the time part to the nearest 30 minutes.
My question is: how can I do this rounding in SAS?
This is my code:
DATA sampledata;
INFORMAT TRD_EVENT_TM time10.;
FORMAT TRD_EVENT_TM TRD_TMR time14.;
INPUT TRD_EVENT_TM;
TRD_TMR = round(TRD_EVENT_TM, 1800);
DATALINES;
00:14:12
00:16:12
09:01:23
09:46:32
15:59:45
;
PROC PRINT; RUN;
But I want to round all the times in my data, not just these five. I am working with a large data set.
Thanks for your attention.

Assuming you are asking how to do this rounding on other data, not just the datalines in your example above, I suggest you separate the two tasks into two different data steps.
First, create your sample data (you can swap this for your main data later):
DATA sampledata;
infile datalines;
INPUT TRD_EVENT_TM hhmmss8.;
datalines;
00:14:12
00:16:12
09:01:23
09:46:32
15:59:45
;
RUN;
Then you perform the rounding of the time variables.
data test;
set sampledata;
format TRD_EVENT_TM TRD_TMR time.;
TRD_TMR = round(TRD_EVENT_TM, 1800);
run;
Hope this is the answer to the question you had.

data Sampledata_RT;
set Sampledata04;
TRD_EVENT_ROUNDED = intnx('minute30',TRD_EVENT_TM,1,'b');
TRD_EVENT_ROUFOR = put(TRD_EVENT_ROUNDED,hhmm.);
CountedVOLUME = TRD_PR*TRD_TUROVR;
run;
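Note that the snippet above and the ROUND approach do not necessarily give the same result. The following is a minimal sketch comparing them on two of the sample times; the data set name round_demo and the variables nearest, floor30 and next30 are made up for illustration.

```sas
/* Compare three ways of snapping a time to a 30-minute grid. */
data round_demo;
  input TRD_EVENT_TM time8.;
  /* nearest 30-minute mark (1800 seconds) */
  nearest = round(TRD_EVENT_TM, 1800);
  /* start of the current 30-minute interval (rounds down) */
  floor30 = intnx('minute30', TRD_EVENT_TM, 0, 'b');
  /* start of the next 30-minute interval (always moves forward) */
  next30  = intnx('minute30', TRD_EVENT_TM, 1, 'b');
  format TRD_EVENT_TM nearest floor30 next30 time8.;
  datalines;
09:01:23
09:46:32
;
run;

proc print data=round_demo;
run;
```

For 09:01:23 the nearest mark is 9:00:00 while the next interval starts at 9:30:00; for 09:46:32 both give 10:00:00. So an increment of 1 in INTNX is only appropriate if you want the upcoming half hour rather than the nearest one.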

Related

How to format time in SAS

I have a dataset with three columns : Start, Stop and Date
The values in my Start and Stop columns are time values.
I have the following two values in my Start and Stop columns:
24:49:00 and 25:16:00
Both are past the 24-hour mark.
I would like to convert those two values to the following:
24:49:00 to 00:49:00
and
25:16:00 to 01:16:00
How can I do this in both a SAS data step and proc sql?
Thank you !
Do you need to convert them? Use the TIMEPART() function.
start_day=datepart(start);
start_time=timepart(start);
format start_time tod8.;
Or do you just want to display them that way?
format start stop tod8.;
Subtract 24:00:00 from the Start/Stop time, like this:
data _null_;
start='25:16:14't;
point='24:00:00't;
_start=start-point;
put _start;
format _start time8.;
run;
SAS Time and DateTime values use seconds as their fundamental unit.
Thus you can use either modulus arithmetic or TIMEPART function to extract the less than 24 hour part of a > 24 hour time value.
data have;
start = '24:49:00't;
stop = '25:16:00't;
start_remainder = mod(start, '24:00't); * modulus arithmetic;
stop_remainder = mod(stop, '24:00't);
start_timepart = timepart(start); * TIMEPART function;
stop_timepart = timepart(stop);
format start: stop: time10.;
run;
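After this computation, do not expect start_remainder < stop_remainder to always be true: if only the stop value passes 24:00, its remainder wraps around and becomes smaller than the start remainder.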
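A small sketch of that wrap-around case, using made-up values where only the stop time is past 24:00 (the data set name wrap_demo and the flag still_ordered are illustrative):

```sas
data wrap_demo;
  start = '23:50:00't;
  stop  = '24:10:00't;
  start_remainder = mod(start, '24:00:00't);   /* stays at 23:50:00 */
  stop_remainder  = mod(stop,  '24:00:00't);   /* wraps to 10 minutes past midnight */
  /* 1 if the original ordering survived, 0 otherwise; 0 here */
  still_ordered = (start_remainder < stop_remainder);
  format start stop start_remainder stop_remainder time8.;
  put start_remainder= stop_remainder= still_ordered=;
run;
```

If you need to compare or subtract the converted values, keep the original values (or a day offset) alongside them.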

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can run some benchmarks. The two simplest ways I can think of are Proc SQL and Proc means. Here is the example in proc sql:
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
I think there are ways to make this run faster:
use sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to proc sql nor proc means, so if there are faster ways then I would love to know about it!!!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
modify randint;
INDEX CREATE id / nomiss;
run;
%sasbench();
SASFILE is only a benefit if the entire data set fits within the session's RAM limits and the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs or different techniques on the same SASFILE data set.
An index on id would help if the data was unsorted by id. When the data set is presorted by id, the data set metadata will have the sortedby flag set, which a procedure can use for its own internal optimization; however, there is no guarantee that it will. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
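As a rough sketch of the msglevel=i suggestion, the steps below build a small indexed data set and run a WHERE lookup against it. The data set name idx_demo and its contents are invented for the example; whether SAS actually picks the index depends on its own cost estimate, and the log will say which way it went.

```sas
/* Small data set with a simple index on id (hypothetical demo data). */
data work.idx_demo(index=(id));
  do id = 1 to 1000;
    val = id * 2;
    output;
  end;
run;

/* Ask for informational messages, including index usage notes. */
options msglevel=i;

data _null_;
  set work.idx_demo(where=(id = 42));
  put id= val=;
run;
```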
The fastest way is direct addressing, but it requires enough RAM to handle the largest id value as an array index:
array ids(2500000) _temporary_;
ids(id) + value;
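Spelled out as a full data step, direct addressing could look like the sketch below. It assumes ids are positive integers no larger than the array size; the data set names small and sum_direct and the tiny sample values are made up for illustration.

```sas
/* Tiny sample: ids 1..3, unsorted. */
data small;
  input id value;
  datalines;
2 10
1 5
2 7
1 1
3 4
;
run;

data sum_direct(keep=id total);
  /* one slot per possible id; temporary arrays are retained automatically */
  array totals(3) _temporary_;
  /* single pass: use id itself as the array index */
  do until (end);
    set small end=end;
    totals(id) + value;
  end;
  /* write one row per id that actually occurred */
  do id = 1 to dim(totals);
    total = totals(id);
    if not missing(total) then output;
  end;
  stop;
run;
```

With this input the totals are 6, 17 and 4 for ids 1, 2 and 3. No sorting or hashing is involved, which is where the speed comes from, at the cost of memory proportional to the largest id.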
The next fastest way is probably hand coded array based hashing:
search SAS conference proceedings for papers by Paul Dorfman
The next fastest hash way is probably the hash component object with key suminc.
DATA Step was edited to align with the comments
data demo_data;
do rownum = 1 to 1000;
id = ceil(100*ranuni(123)); * NOTE: 100 different groups, disordered;
value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration individual values integers from 1..1000;
output;
end;
run;
data _null_;
if 0 then set demo_data(keep=id value); %* prep pdv ;
length total 8; %* prep keysum variable ;
call missing (total); %* prevent warnings ;
declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys will be sorted ascending upon output ;
ids.defineKey('id');
*ids.defineData('id'); %* not having a defineData is an implicit way of adding only the keys as data, only data + keysum variables are .output;
ids.defineDone();
* read all records and touch each hash key in order to perform tacit total+value summation;
do until (end);
set demo_data end=end;
if ids.find() ne 0 then ids.add();
end;
ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
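Filling in the elided parts, a complete version of the sum-and-count variant might look like this sketch. The input data set grp_demo and the output name sum_count_over_id are hypothetical; the hash calls follow the fragment above.

```sas
/* Hypothetical sample data: three groups, unsorted. */
data grp_demo;
  input id value;
  datalines;
2 10
1 5
2 7
1 1
3 4
;
run;

data _null_;
  if 0 then set grp_demo(keep=id value);   /* prep pdv */
  length count total 8;
  declare hash ids (ordered:'a');
  ids.defineKey('id');
  ids.defineData('id', 'count', 'total');
  ids.defineDone();
  do until (end);
    set grp_demo end=end;
    /* find() loads count and total for this id; start at zero for a new id */
    if ids.find() ne 0 then do; count=0; total=0; end;
    count + 1;
    total + value;
    ids.replace();                         /* adds the key if it is not there yet */
  end;
  ids.output(dataset:'sum_count_over_id');
  stop;
run;
```

For this input the output has count=2 total=6 for id 1, count=2 total=17 for id 2, and count=1 total=4 for id 3.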
However, if value is known to always be a natural number, and the group size is known to be < 10^(group size limit), you could numerically encode the count by using a suminc of value + 10^(-group size limit) and numerically decode the count by processing the output data with count = (total - int(total)) * 10^(group size limit).
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
by id;
data sum_value_over_id_v2(keep=id total);
do until (last.id);
set foo;
by id;
total = sum(total, value);
end;
run;
You will likely find that I/O is the largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive. The case you show in the example is almost a worst case for it, since it has a huge number of class levels, so I think it may be creating a temporary table on disk. It's also multithreaded. Reduce the number of class levels to 2,500 instead of 2,500,000 and PROC MEANS comes out a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA are things that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted, then this is another solution:
data sum_value_over_id_v2(keep=id total);
set a.randint(keep=id val);
by id;
total + val;
if last.id then do;
output;
total = 0;
end;
drop val;
run;

Where is the syntax error within this SAS view code?

data work.temp work.error / view = work.temp;
infile rawdata;
input Xa Xb Xc;
if Xa=. then output work.errors;
else output work.temp;
run;
It says there's a syntax error in the DATA statement, but I can't find where ...
The error is a typo in the OUTPUT statement. You are trying to write observations to ERRORS but the data statement only defined ERROR.
It is a strange construct and not something I would recommend, but it looks like it will work. When you exercise the view TEMP it will also generate the dataset ERROR.
data x; set temp; run;
NOTE: The infile RAWDATA is:
Filename=...
NOTE: 2 records were read from the infile RAWDATA.
The minimum record length was 5.
The maximum record length was 5.
NOTE: View WORK.TEMP.VIEW used (Total process time):
real time 0.32 seconds
cpu time 0.01 seconds
NOTE: The data set WORK.ERROR has 1 observations and 3 variables.
NOTE: There were 1 observations read from the data set WORK.TEMP.
NOTE: The data set WORK.X has 1 observations and 3 variables.
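For reference, here is a self-contained sketch of the corrected step, with ERROR spelled the same way in both the DATA and OUTPUT statements. The temporary fileref and the two sample records are made up so that the step can be run as-is.

```sas
/* Hypothetical raw file: one good record and one with Xa missing. */
filename rawdata temp;
data _null_;
  file rawdata;
  put '1 2 3';
  put '. 5 6';
run;

/* Define the view; WORK.ERROR is produced whenever the view is read. */
data work.temp work.error / view=work.temp;
  infile rawdata;
  input Xa Xb Xc;
  if Xa = . then output work.error;
  else output work.temp;
run;

/* Exercising the view creates both X and ERROR. */
data x;
  set work.temp;
run;
```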

How to change format to a single cell in a SAS table

I need to change the format of a single cell in a SAS table. The column containing the cell has the format best12., but the cell itself holds a date, so for that cell I want to use YYMMDD10.
How can I do this?
Thanks in advance.
You can only associate a FORMAT with an entire column. If you want cells of mixed types to be formatted differently, you need a character column that you fill with PUT() function results.
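A minimal sketch of the character-column idea follows. It assumes you have some way of knowing which rows hold dates; here that is a made-up flag variable is_date, and the column name display is also just for illustration.

```sas
data want_char;
  input num is_date;
  length display $12;
  /* apply a different format per row and store the result as text */
  if is_date then display = put(num, yymmdd10.);
  else display = strip(put(num, best12.));
  datalines;
10 0
20 0
20514 1
30 0
;
run;

proc print data=want_char;
run;
```

The numeric column keeps its single format; only the character column shows the mixed presentation.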
To associate a different format with a column, use:
proc datasets;
modify data-set-name;
format variables new-format.;
run;
quit;
Here is an example of what you can do if the data allows. Let's say that the earliest date in your data is 1st Jan 2000, this is stored as the number 14,610 in SAS (the number of days since 1st Jan 1960). Therefore if no non-date values exceed this number then you can achieve your goal by formatting all values up to 14,610 as best12. and all values greater than this as yymmdd10.
proc format;
value dtfmt low - 14609 = [best12.]
14610 - high = [yymmdd10.]
;
run;
data want;
input num;
format num dtfmt.;
datalines;
10
20
20514
30
;
run;
You can apply SUBSTR() in an IF condition to check the first character and format your variable accordingly, using INPUT() or PUT().

How to write dates (MM/DD/YY) into a matrix (SAS)

I have following problem:
I need to write a begin and end date into a matrix, where the columns are the yearly quarters (1-4) and the rows are the years.
E.g.
Matrix:
Q1 Q2 Q3 Q4
2010
2011
Now the date 01.01.2010 should be put in the first element and the date 09.20.2011 in the sixth element.
Thanks in advance.
You first have to consider that SAS does not actually have date/time/datetime variables. It just uses numeric variables formatted as date/time/datetime. The actual value being:
days since 1/1/1960 for dates
seconds since 00:00 for times
seconds since 1/1/1960 00:00 for datetimes
SAS does not even distinguish between integer and float numeric types. So a date value can contain a fractional part.
What you do or can do with a SAS numeric variable is completely up to you, and mostly depends on the format you apply. You could mistakenly format a variable containing a date value with a datetime format... or even with a currency format... SAS won't notice or complain.
You also have to consider that SAS does not even actually have matrixes and arrays. It does provide a way to simulate their use to read and write to dataset variables.
That said, SAS does provide a whole lot of formats and informats that allow you to implement date and time manipulation.
Assuming you are coding within a data step, and assuming the "dates" are in dataset numeric variables, then the PUT function can extract the datepart you need to calculate row, column of the matrix element to write to, like so:
DATA table;
ARRAY dm{2,4} dm_r1c1-dm_r1c4 dm_r2c1-dm_r2c4;
beg_row = PUT(beg_date, YEAR4.)-2009;
end_row = PUT(end_date, YEAR4.)-2009;
beg_col = PUT(beg_date, QTR1.);
end_col = PUT(end_date, QTR1.);
dm{beg_row,beg_col} = beg_date;
dm{end_row,end_col} = end_date;
RUN;
... or if you are using a one-dimensional array:
DATA table;
ARRAY da{8} da_1-da_8;
beg_index = 4 * (PUT(beg_date, YEAR4.)-2010) + PUT(beg_date, QTR1.);
end_index = 4 * (PUT(end_date, YEAR4.)-2010) + PUT(end_date, QTR1.);
da{beg_index} = beg_date;
da{end_index} = end_date;
RUN;
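The snippets above assume beg_date and end_date already exist in the step. Below is a runnable sketch of the two-dimensional variant with the two dates from the question hard-coded as date literals. It swaps in the YEAR() and QTR() functions instead of PUT() so that the row and column come back as numbers without character-to-numeric conversion; that substitution is a choice made here, not part of the answer above.

```sas
data table;
  array dm{2,4} dm_r1c1-dm_r1c4 dm_r2c1-dm_r2c4;
  beg_date = '01JAN2010'd;
  end_date = '20SEP2011'd;
  /* row 1 = 2010, row 2 = 2011; column = quarter 1..4 */
  beg_row = year(beg_date) - 2009;
  beg_col = qtr(beg_date);
  end_row = year(end_date) - 2009;
  end_col = qtr(end_date);
  dm{beg_row, beg_col} = beg_date;
  dm{end_row, end_col} = end_date;
  /* show every date cell as MM/DD/YY */
  format beg_date end_date dm_: mmddyy8.;
run;

proc print data=table;
run;
```

With these values the begin date lands in dm_r1c1 and the end date, which falls in the third quarter of 2011, lands in dm_r2c3; all other cells stay missing.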

Resources