Where is the syntax error within this SAS view code?

data work.temp work.error / view = work.temp;
infile rawdata;
input Xa Xb Xc;
if Xa=. then output work.errors;
else output work.temp;
run;
It says there's a syntax error in the DATA statement, but I can't find where ...

The error is a typo in the OUTPUT statement: you are trying to write observations to ERRORS, but the DATA statement only defines ERROR.
Writing a second data set from a view is a strange construct and not something I would recommend, but once the name is fixed it does work. When you exercise the view TEMP, it also generates the data set ERROR.
67 data x; set temp; run;
NOTE: The infile RAWDATA is:
Filename=...
NOTE: 2 records were read from the infile RAWDATA.
The minimum record length was 5.
The maximum record length was 5.
NOTE: View WORK.TEMP.VIEW used (Total process time):
real time 0.32 seconds
cpu time 0.01 seconds
NOTE: The data set WORK.ERROR has 1 observations and 3 variables.
NOTE: There were 1 observations read from the data set WORK.TEMP.
NOTE: The data set WORK.X has 1 observations and 3 variables.

Related

How Can I Round All Time Using SAS?

I have a small problem and would appreciate any help.
What I'm trying to do is round the time part to the nearest 30 minutes.
My question is how to do this rounding in SAS.
This is my code:
DATA sampledata;
INFORMAT TRD_EVENT_TM time10.;
FORMAT TRD_EVENT_TM TRD_TMR time14.;
INPUT TRD_EVENT_TM;
TRD_TMR = round(TRD_EVENT_TM, 1800);
INFILE;
00:14:12
00:16:12
09:01:23
09:46:32
15:59:45
;
PROC PRINT; RUN;
But I want to round every time value, not just these five. I am working with a large data set.
Thanks for your attention.
Assuming you are asking how to do this rounding on other data, not just the datalines in your example, I suggest you split the work into two separate data steps.
First, create your sample data (you can swap this for your main data later):
DATA sampledata;
infile datalines;
INPUT TRD_EVENT_TM hhmmss8.;
datalines;
00:14:12
00:16:12
09:01:23
09:46:32
15:59:45
;
RUN;
Then you perform the rounding of the time variables.
data test;
set sampledata;
format TRD_EVENT_TM TRD_TMR time.;
TRD_TMR = round(TRD_EVENT_TM, 1800);
run;
Hope this is the answer to the question you had.
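For readers who want to check the arithmetic outside SAS, here is a minimal Python sketch of the same rounding. It assumes the time is held as seconds since midnight, which is how SAS stores time values. The helper names (`to_seconds`, `to_hhmmss`, `round_to_nearest`) are made up for this illustration, and exact half-way values are rounded up here.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert 'hh:mm:ss' to seconds since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def to_hhmmss(seconds: int) -> str:
    """Format seconds since midnight as 'hh:mm:ss'."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def round_to_nearest(seconds: int, unit: int = 1800) -> int:
    """Round to the nearest multiple of `unit` seconds (halves go up)."""
    return ((seconds + unit // 2) // unit) * unit


# The five sample times from the question.
samples = ["00:14:12", "00:16:12", "09:01:23", "09:46:32", "15:59:45"]
rounded = [to_hhmmss(round_to_nearest(to_seconds(t))) for t in samples]
print(rounded)
# ['00:00:00', '00:30:00', '09:00:00', '10:00:00', '16:00:00']
```

Because the rounding is applied per value, it works the same way on five rows or on the full data set.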
data Sampledata_RT;
set Sampledata04;
TRD_EVENT_ROUNDED = intnx('minute30',TRD_EVENT_TM,1,'b');
TRD_EVENT_ROUFOR = put(TRD_EVENT_ROUNDED,hhmm.);
CountedVOLUME = TRD_PR*TRD_TUROVR;
run;
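Note that the INTNX call above does not round to the nearest half hour: with an increment of 1 and alignment 'b', it moves each time to the beginning of the next 30-minute interval. The following Python sketch shows the difference, assuming times are held as seconds since midnight. The function names are hypothetical and exist only for this comparison.

```python
def next_interval_start(seconds: int, unit: int = 1800) -> int:
    """Start of the interval after the one containing `seconds`."""
    # Drop back to the start of the current interval, then advance one interval.
    return (seconds // unit) * unit + unit


def nearest_interval(seconds: int, unit: int = 1800) -> int:
    """Nearest multiple of `unit` seconds (halves go up)."""
    return ((seconds + unit // 2) // unit) * unit


t = 9 * 3600 + 1 * 60 + 23        # 09:01:23
print(next_interval_start(t))     # 34200, i.e. 09:30:00
print(nearest_interval(t))        # 32400, i.e. 09:00:00
```

If the goal is the nearest half hour rather than the next boundary, the ROUND-based approach is the closer match.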

Creating an average matrix from four individual matrices of same size in SAS / IML

I am using IML/SAS in SAS Enterprise Guide for the first time, and want to do the following:
1. Read some data sets into IML matrices
2. Average the matrices
3. Turn the resulting IML matrix back into a SAS data set
My input data sets look something like the following (this is dummy data - the actual sets are larger). The format of the input data sets is also the format I want from the output data sets.
data_set0: d_1 d_2 d_3
1 2 3
4 5 6
7 8 9
I proceed as follows:
proc iml;
/* set the names of the migration matrix columns */
varNames = {"d_1","d_2","d_3"};
/* 1. transform input data set into matrix */
USE data_set_0;
READ all var _ALL_ into data_set0_matrix[colname=varNames];
CLOSE data_set_0;
USE data_set_1;
READ all var _ALL_ into data_set1_matrix[colname=varNames];
CLOSE data_set_1;
USE data_set_2;
READ all var _ALL_ into data_set2_matrix[colname=varNames];
CLOSE data_set_2;
USE data_set_3;
READ all var _ALL_ into data_set3_matrix[colname=varNames];
CLOSE data_set_3;
/* 2. find the average matrix */
matrix_sum = (data_set0_matrix + data_set1_matrix +
data_set2_matrix + data_set3_matrix)/4;
/* 3. turn the resulting IML matrix back into a SAS data set */
create output_data from matrix_sum[colname=varNames];
append from matrix_sum;
close output_data;
quit;
I've been trying loads of stuff, but nothing seems to work for me. The error I currently get reads:
ERROR: Matrix matrix_sum has not been set to a value
What am I doing wrong? Thanks up front for the help.
The above code works. In the full version of this code (this is simplified for readability) I had misnamed one of my variables.
I'll leave the question up in case somebody else wants to use SAS / IML to find an average matrix.
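As a cross-check of the averaging step, here is a small Python sketch of an element-wise average over several matrices of the same shape. It uses plain nested lists. The function name `average_matrices` and the three shifted input matrices are invented for illustration, since only the first dummy matrix comes from the question.

```python
from typing import List

Matrix = List[List[float]]


def average_matrices(matrices: List[Matrix]) -> Matrix:
    """Element-wise mean of equally sized matrices."""
    if not matrices:
        raise ValueError("need at least one matrix")
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    for m in matrices:
        if len(m) != n_rows or any(len(row) != n_cols for row in m):
            raise ValueError("all matrices must have the same shape")
    return [
        [sum(m[r][c] for m in matrices) / len(matrices) for c in range(n_cols)]
        for r in range(n_rows)
    ]


base = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # dummy data from the question
# Hypothetical inputs: the base matrix shifted by 0, 1, 2 and 3.
inputs = [[[x + k for x in row] for row in base] for k in range(4)]
avg = average_matrices(inputs)
print(avg)
# [[2.5, 3.5, 4.5], [5.5, 6.5, 7.5], [8.5, 9.5, 10.5]]
```

The shape check plays the same role as making sure every input matrix was actually read before it is added.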

How to rank multiple variables in a large data set?

I have a data set of around 50 million records with around 30 variables (columns).
I need to rank the data set on each variable.
PROC RANK does not work because it requires a lot of memory for a data set this large.
To rank manually, I have to sort the data set on the relevant variable and then assign the rank with a formula. The problem is that this means sorting the data set 30 times, once per variable, which takes far too long to be feasible.
What alternatives can we use in this case?
You're in a tough spot without many options. If you sort and keep all 30 variables each time, that will significantly increase your processing time. If I were you, I'd keep only the variable you want to rank plus a sequence number, apply your formula, and then merge everything back together at the end. This requires looping over each variable in your data set. See the example below and check whether it reduces your processing time:
** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
SELECT DISTINCT(NAME)
INTO :VARS SEPARATED BY " "
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;
%PUT &VARS.;
** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH; SET SASHELP.FISH;
SEQ=_N_;
RUN;
** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET VAR = %SCAN(&VARS,&I.);
DATA FISH_&I.; SET FISH (KEEP=SEQ &VAR.);
RUN;
/* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
PROC SORT DATA = FISH_&I.;
BY &VAR.;
RUN;
DATA FISH1_&I.; SET FISH_&I.;
BY &VAR.;
RANK_&VAR = _N_;
RUN;
/* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
PROC SORT DATA = FISH1_&I.;
BY SEQ;
RUN;
%END;
%MEND;
%LOOP_OVER(&VARS.);
** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
MERGE FISH1_:;
BY SEQ;
DROP SEQ;
RUN;
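The idea behind the macro (keep only one variable plus a sequence number, sort, number the rows, then restore the original order) can be sketched in a few lines of Python. This is an illustration of the logic only, not a replacement for the SAS code. The function names and the small table are hypothetical.

```python
from typing import Dict, List, Sequence


def rank_column(values: Sequence[float]) -> List[int]:
    """Rank one column; ties get consecutive ranks in sorted order."""
    # Sort the sequence numbers by value (the PROC SORT BY &VAR step).
    order = sorted(range(len(values)), key=lambda seq: values[seq])
    ranks = [0] * len(values)
    # The position in sorted order is the rank (the RANK = _N_ step);
    # writing it back at index `seq` restores the original row order.
    for position, seq in enumerate(order, start=1):
        ranks[seq] = position
    return ranks


def rank_table(table: Dict[str, Sequence[float]]) -> Dict[str, List[int]]:
    """Rank every column independently, one column at a time."""
    return {f"RANK_{name}": rank_column(col) for name, col in table.items()}


# Made-up values, loosely modelled on two numeric columns.
table = {
    "Weight": [242.0, 290.0, 340.0, 200.0],
    "Height": [11.5, 12.5, 12.4, 12.7],
}
print(rank_table(table))
# {'RANK_Weight': [2, 3, 4, 1], 'RANK_Height': [1, 3, 2, 4]}
```

Only one column and its sequence numbers are in play at any time, which is the point of the approach.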
If you just need to rank into deciles / percentiles etc rather than a complete ranking from 1 to 50m across all 50m rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via proc summary, using qmethod=P2 and specifying a suitable qmarkers setting.
This approach uses the P-squared algorithm:
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
I am not sure whether it is a good idea, but you may want to use a hash object. The object is loaded into RAM. Assuming that you have 50 million numeric observations, you will need around (2 * 8 bytes) * 50 million = 800 MB of RAM, if I am not mistaken.
The code could look like this (using Foxer's macro to loop over the variables, a small helper macro to get the list of variables from a data set, and a small test data set with two variables):
%Macro GetVars(Dset) ;
%Local VarList ;
/* open dataset */
%Let FID = %SysFunc(Open(&Dset)) ;
/* If accessible, process contents of dataset */
%If &FID %Then %Do ;
%Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
%Let VarList= &VarList %SysFunc(VarName(&FID,&I));
%End ;
/* close dataset when complete */
%Let FID = %SysFunc(Close(&FID)) ;
%End ;
&VarList
%Mend ;
data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET var = %SCAN(&VARS,&I.);
data out&i.(keep=rank&i.);
if 0 then set dsn;
if _N_ =1 then
do;
dcl hash hh(ordered:'A');
dcl hiter hi('hh');
hh.definekey("&var.");
hh.definedata("&var.","rank&i.");
hh.definedone();
end;
/*Get unique combination variable and point in dataset*/
do while(not last);
set dsn end=last;
hh.ref();
end;
/*Assign ranks within hash object*/
rc=hi.first();
k = 1;
do while(rc=0);
rank&i.=k;
hh.replace();
k+1;
rc=hi.next();
end;
/*Output rank to new dataset in original order of observations*/
do while(not theend);
set dsn end=theend;
hh.find();
output;
end;
/*If data can be sorted according to the rank (with no duplicates) use:
hh.output("out&i.");
out&i. will then have variables &var. and rank&i.
However, the merging below may not be sensible anymore
as correspondence between variables is not preserved.
There will also be no duplicates in the dataset.
*/
run;
%END;
%MEND LOOP_OVER;
%LOOP_OVER(%GetVars(dsn));
/*Merge all rank datasets to one large*/
data all;
merge out:;
run;
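In the hash approach, each distinct value is stored once as a key and the ranks are assigned by walking the keys in ascending order, so tied values share a rank. A minimal Python sketch of that logic uses a dictionary in place of the hash object. The function name `dense_rank` is made up, and the data is the test data set from above.

```python
from typing import Dict, List, Sequence


def dense_rank(values: Sequence[float]) -> List[int]:
    """Rank values so that equal values share a rank (1, 2, 3, ...)."""
    # One entry per distinct value, numbered in ascending order
    # (the ordered hash plus the iterator loop).
    rank_of: Dict[float, int] = {
        value: k for k, value in enumerate(sorted(set(values)), start=1)
    }
    # Look every row up again in its original order (the final FIND loop).
    return [rank_of[value] for value in values]


var1 = [1, 1, 2, 2, 3, 4, 3]
var2 = [48, 8, 5, 965, 105, 105, 85]
print(dense_rank(var1))  # [1, 1, 2, 2, 3, 4, 3]
print(dense_rank(var2))  # [3, 2, 1, 6, 5, 5, 4]
```

Memory use grows with the number of distinct values rather than the number of rows, so a column with many repeats needs less memory than the estimate above suggests.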

How to give equations in Apache pig

I am trying to compute a sample size from this equation:
--counted gives the total row count in a file
samplecount = counted*(10/100);
How can I sample the data according to this?
--Load data
examples = LOAD '/home/sreeveni/myfiles/PE/USCensus1990New.csv' ;
--Group data
groupedByUser = group examples all;
--count no of lines in the file
counted = FOREACH groupedByUser generate COUNT(examples) ;
--sampling
sampled = SAMPLE examples counted*(10/100);
store sampled into '/home/sreeveni/myfiles/OUT/samplesout';
This gives the following error:
Invalid scalar projection: counted : A column needs to be projected
from a relation for it to be used as a scalar
Please advise. Am I doing anything wrong?
I guess SAMPLE works with a number in the range [0, 1], and in your case the expression exceeds that range. If you want just 10% of the data, pass 0.1 directly; if you need to compute the fraction in code, calculate the percentage in a FOREACH statement.
If you are trying to generate a sample of "examples" containing 10% of the total number of rows, all you have to do is:
SAMPLE examples 0.1;
See the Pig documentation for the SAMPLE operator.
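To see why SAMPLE takes a fraction rather than a row count, it helps to think of it as keeping each row independently with the given probability. The Python sketch below illustrates that idea under this assumption. The function `sample_rows` and the seed are hypothetical and are not part of Pig.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_rows(rows: Sequence[T], fraction: float,
                seed: Optional[int] = None) -> List[T]:
    """Keep each row with probability `fraction` (must lie in [0, 1])."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
sampled = sample_rows(rows, 0.1, seed=42)
# The size is close to 10% of the input but is not guaranteed to be exact.
print(len(sampled))
```

With this kind of sampling the total row count is never needed, which is why the COUNT step in the question can be dropped.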

How to create missing records within date-time range in pig latin

I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but actual data should be sorted based on time.
I would like to generate a set of records as input to a graph-plotting function that needs a continuous time series. I would like to fill in missing values, i.e. if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate an entry for "2013-07-09T19:18Z" with a predefined value.
My thoughts on doing this:
1. Use MIN and MAX to find the start and end date in the series
2. Write a UDF which takes min and max and returns a relation with the missing timestamps
3. Join the above 2 relations
I cannot get my head around how to implement this in Pig, though. I would appreciate any help.
Thanks!
Generate another file using a script (outside Pig) with all timestamps between MIN and MAX, including MIN and MAX. Load this as a second data set. Here is a sample that I built from your data set. Please note that I filled in only a few gaps, not all.
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
Do a COGROUP on the original data set and the generated data set above. Use a nested FOREACH ... GENERATE to write the output data set. If the first data set is empty for a timestamp, use the values from the second set to generate the output; otherwise use the first. Here is the code I used on these two data sets.
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
x = COUNT(Org_Set);
y = (x == 0? (Default_set.fl1, Default_set.fl2): (Org_Set.fl1, Org_Set.fl2));
GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
If you need further clarification or help, let me know.
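The gap-filling logic itself (build the full minute-by-minute series between the minimum and maximum timestamp, keep real records where they exist, and use defaults elsewhere) can be sketched outside Pig as follows. This is a Python illustration under the assumption of one-minute resolution. The function name `fill_missing` and the default values are placeholders.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

FMT = "%Y-%m-%dT%H:%MZ"  # timestamp layout used in the question
Record = Tuple[str, str, str]


def fill_missing(records: Sequence[Record],
                 default: Tuple[str, str] = ("d1", "d2"),
                 step: timedelta = timedelta(minutes=1)) -> List[Record]:
    """Return records sorted by time, with default rows in the gaps."""
    by_time = {datetime.strptime(ts, FMT): (ts, f1, f2) for ts, f1, f2 in records}
    current, end = min(by_time), max(by_time)
    filled: List[Record] = []
    while current <= end:
        # Prefer the original record; fall back to the default values.
        filled.append(by_time.get(current, (current.strftime(FMT), *default)))
        current += step
    return filled


records = [
    ("2013-07-09T16:17Z", "f1", "f2"),
    ("2013-07-09T16:14Z", "f1", "f2"),
    ("2013-07-09T16:16Z", "f1", "f2"),
    ("2013-07-09T16:18Z", "f1", "f2"),
]
for row in fill_missing(records):
    print(row)
# Only 16:15 is missing, so it is the only row carrying ('d1', 'd2').
```

The same function could be used to produce the second input file for the COGROUP, by writing out every generated timestamp with the default values.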
In addition to @Rags's answer, you could use the STREAM x THROUGH command and a simple awk script to generate the date range once you have the min and max dates. Something like the following (untested! You might need to put the awk script on a single line with semicolons between commands, or better, ship it as a script file):
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
split($1,s,"/")
split($2,e,"/")
st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
for (i=st;i<=et;i+=60*60*24) print strftime("%Y/%m/%d",i)
}'`;
