There is a DB2 table which has a 4-byte integer as its primary key. Now I have to double the rows of this table. One way I have is to manipulate the key value by unloading the table to a dataset and keeping all the other column details as is. This way I will be able to double the rows.
I am planning to multiply each primary key by '-1' so that I get another row with the same details except for a key with a negative value.
I haven't worked much on data manipulation. Can I use the SORT utility for this? If yes, then how?
Are there any references available?
Here is how I would do it. Maybe someone else on here has a better way to accomplish your task, but I would do this:
STEP 1 COPY
Using IEBGENER, copy the original data to a temp file. Keep in mind I had to guess at the record length and space required.
//COPY1 EXEC PGM=IEBGENER
//SYSUT1 DD DSN=XX.FILE.ORIGINAL,
// DISP=SHR
//SYSUT2 DD DSN=&&TEMPFILE,
// DISP=(NEW,PASS),UNIT=(SYSDA,1),
// LRECL=50,RECFM=FB,
// SPACE=(CYL,(25,10),RLSE)
//SYSPRINT DD SYSOUT=*
//SYSIN DD DUMMY
After that, write a DFSORT step that will change all of the key values. I assume that this key appears first in your file and that it is 4 characters, stored as zoned decimal.
STEP 2 THE MATH
Here we will take the temp file and write out a new file where the key = key * -1
//MULTI EXEC PGM=SORT
//SORTIN DD DSN=&&TEMPFILE,
// DISP=SHR
//SORTOUT DD DSN=XX.FILE.MULTI,
// DISP=(,CATLG,DELETE),UNIT=(SYSDA,1),
// LRECL=50,RECFM=FB,
// SPACE=(CYL,(25,10),RLSE)
//SYSOUT DD SYSOUT=*
//SYSIN DD *
SORT FIELDS=COPY
* NEGATE THE KEY; THE SIGN IS KEPT AS A ZONED-DECIMAL OVERPUNCH
OUTREC FIELDS=(1,4,ZD,MUL,-1,TO=ZD,LENGTH=4,5,46)
/*
Once that step is complete, you can use DFSORT again to sort the two files into one.
STEP 3 SORT
//SORT EXEC PGM=SORT
//SORTIN DD DSN=XX.FILE.ORIGINAL,
// DISP=SHR
// DD DSN=XX.FILE.MULTI,
// DISP=SHR
//SORTOUT DD DSN=XX.FILE.FINAL,
// DISP=(,CATLG,DELETE),UNIT=(SYSDA,1),
// LRECL=50,RECFM=FB,
// SPACE=(CYL,(25,10),RLSE)
//SYSOUT DD SYSOUT=*
//SYSIN DD *
SORT FIELDS=(1,4,ZD,A)
/*
I have a dataset with three columns: Start, Stop and Date.
The observations in my Start and Stop columns are time values.
I have the following two values in my Start and Stop columns:
24:49:00 and 25:16:00
Both are in an over-24-hours format.
I would like to convert those two values as follows:
24:49:00 to 00:49:00
and
25:16:00 to 01:16:00
How can I do this in both a SAS data step and PROC SQL?
Thank you!
Do you need to convert them? If your values are actually datetime values, use the DATEPART() and TIMEPART() functions.
start_day=datepart(start);
start_time=timepart(start);
format start_time tod8.;
Or do you just want to display them that way?
format start stop tod8.;
Subtract 24:00:00 from the Start/Stop time, like this:
data _null_;
start='25:16:14't;
point='24:00:00't;
_start=start-point;
put _start;
format _start time8.;
run;
SAS Time and DateTime values use seconds as their fundamental unit.
Thus you can use either modulus arithmetic or the TIMEPART function to extract the less-than-24-hours part of a time value greater than 24 hours.
data have;
start = '24:49:00't;
stop = '25:16:00't;
start_remainder = mod(start, '24:00't); * modulus arithmetic;
stop_remainder = mod(stop, '24:00't);
start_timepart = timepart(start); * TIMEPART function;
stop_timepart = timepart(stop);
format start: stop: time10.;
run;
After the computation, do not expect start_remainder to always be less than stop_remainder (a stop time past midnight wraps around to a small value).
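Since the question also asks for PROC SQL, the same mod() arithmetic carries over directly. A minimal sketch, reusing the have table and time literals from the data step above:
proc sql;
/* wrap anything past 24:00:00 back around to a time of day */
create table want as
select mod(start, '24:00:00't) as start format=time10.,
mod(stop, '24:00:00't) as stop format=time10.
from have;
quit;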
I have some MASM syntax code on Windows, like this:
stru_40DBA0 dd 0FFFFFFFEh ; GSCookieOffset ; SEH scope table for function 402B22
dd 0 ; GSCookieXOROffset
dd 0FFFFFFC0h ; EHCookieOffset
dd 0 ; EHCookieXOROffset
dd 0FFFFFFFEh ; ScopeRecord.EnclosingLevel
dd 0 ; ScopeRecord.FilterFunc
dd offset $LN19 ; ScopeRecord.HandlerFunc
.....
Foo proc near
....
$LN19:
....
MASM will generate errors at the offset $LN19 line, and I tried to modify it in this way:
PTR PROTO $LN29
Could anyone give me some help on how to declare this? Thank you!
There are really two things you need to do:
Ensure the label is public so it can be seen where you're using its offset.
Ensure the label has been declared where you're using its offset.
The easy way to do the latter is to define your structure after the label itself has been defined.
To make the label public, you can either declare it public explicitly, like:
public $LN19
...or where you've defined the label, you can use two colons instead of one:
$LN19::
I have over 3400 CSV files, with sizes varying between 10 KB and 3 MB. Each CSV file has the generic filename stockticker-Ret.csv, where stockticker is a stock ticker like AAPL, GOOG, YHOO, and so on, and contains stock returns for every minute of a given day. My SAS code starts by loading all the stock ticker names from the stockticker-Ret.csv file names into a SAS dataset. I loop over each ticker to load the appropriate .csv file into a SAS dataset called want, apply some data steps to want, and store the final want dataset for each ticker in a SAS dataset called global. As you can imagine, this process takes a long time. Is there a way to improve my DO LOOP code below to make this process go faster?
/* Record all the CSV file names in a SAS dataset, to extract the stock tickers */
data yfiles;
keep filename;
length fref $8 filename $80;
rc = filename(fref, 'F:\data\');
if rc = 0 then do;
did = dopen(fref);
rc = filename(fref);
end;
else do;
length msg $200.;
msg = sysmsg();
put msg=;
did = .;
end;
if did <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
dnum = dnum(did);
do i = 1 to dnum;
filename = dread(did, i);
/* If this entry is a file, then output. */
fid = mopen(did, filename);
if fid > 0 then output;
end;
rc = dclose(did);
run;
/*store in yfiles all the stock tickers*/
data yfiles(drop=filename1 rename=(filename1=stock));
set yfiles;
filename1=tranwrd(filename,'-Ret.csv','');
run;
proc sql noprint;
select stock into :name separated by '*' from work.yfiles;
%let count2 = &sqlobs;
quit;
*Create the template of the desired GLOBAL SAS dataset;
proc sql;
create table global
(stock char(8), time_gap num(5), avg_ret num(5));
quit;
proc sql;
insert into global
(stock, time_gap,avg_ret)
values('',0,0);
quit;
%macro y1;
%do i = 1 %to &count2;
%let j = %scan(&name,&i,*);
proc import out = want datafile="F:\data\&j-Ret.csv"
dbms=csv replace;
getnames = yes;
run;
data want;
set want; ....
....[Here I do 5 Datasteps on the WANT sasfile]
/*Store the want file in a global SAS dataset that will contain all the stock tickers from the want file*/
data global;
set global want;
run;
%end;
%mend y1;
%y1()
As you can see, the global SAS dataset expands with every want dataset that I store in it.
Assuming the files have a common layout, you should not import them with PROC IMPORT in a do loop. You should read them all in with one data step, i.e.:
data want;
length the_file $500;
infile "f:\data\*.csv" dlm=',' lrecl=32767 dsd truncover firstobs=2 filename=the_file;
input
myvar1 myvar2 myvar3 myvar4;
stock_ticker=scan(the_file,-1,'\'); *or whatever gets you the ticker name;
run;
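For instance, if the ticker is the part of the file name before the -Ret.csv suffix, one (hypothetical) refinement of that assignment is:
/* last path component, then the piece before the first '-' */
stock_ticker = scan(scan(the_file, -1, '\'), 1, '-');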
Now, if they don't have identical layouts, or there is some complexity to the readin, you may need a more complex input statement than that, but almost always you can achieve it this way. Do loops with lots of PROC IMPORTs will always be inefficient because of the overhead of the IMPORT.
If you don't want every .csv file in the folder (and can't write a mask for what you do want), or if you have a subset of layouts, you can use the FILEVAR option to read the files in from a common dataset. You could then branch into various input statements, perhaps, if needed.
data want;
set yfiles;
infile a filevar=filename;
if filename [some rule] then do;
input ... ;
end;
else if ... then do;
input ... ;
end;
run;
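To make that skeleton concrete, here is a self-contained sketch of the FILEVAR pattern. It assumes yfiles still holds the bare file names from the directory-reading step, and the column names time_gap and avg_ret are hypothetical stand-ins for whatever the CSVs actually contain:
data want;
set yfiles;
length path $200;
path = cats('F:\data\', filename); /* build the full path for this row */
/* A is just a placeholder fileref; FILEVAR= supplies the real file */
infile a filevar=path dlm=',' dsd truncover firstobs=2 end=done;
do while (not done); /* read every record of the current file */
input time_gap avg_ret;
stock = scan(filename, 1, '-'); /* ticker from the file name */
output;
end;
run;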
I want to use a macro do loop inside proc iml like so:
%Let Tab1=FirstTable;
%Let Tab2=SecondTable;
%Let Tab3=ThirdTable;
*&Tab1-3 have been initialised as sas datasets;
proc iml;
* This works;
use &Tab1;
read all into Mat3;
print Mat3;
* This doesn't work;
%Macro Define_mx;
%do i=1 %to 2;
use &Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
*The two matrixes have not been initialised;
print Mat1;
print Mat2;
quit;
In reality I will have to initialise something like 50 matrices, so a do loop is necessary.
I can't figure out why the loop can't see &Tab&i as a macro variable.
I also tried a workaround with a normal (non-macro) do loop, using substr to concatenate the variable names, but it didn't work either. What am I missing here?
Ok, so the macro should be:
%Macro Define_mx;
%do i=1 %to 2;
use &&Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
The second ampersand on Tab is necessary because without it the macro processor would try to interpret &Tab as a macro variable (which does not exist). Thus, when concatenating macro variable references to build the name of another macro variable, use &&.
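You can watch the two resolution passes with %put (assuming the %let statements from the question are in effect):
%let i = 1;
%put &&Tab&i; /* pass 1: && -> & and &i -> 1, leaving &Tab1;
                 pass 2: &Tab1 -> FirstTable */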
If you have SAS/IML 12.1 (released with 9.3m2), there is an even simpler way.
The USE statement supports dereferencing data set names, like this:
ds = "MyData";
use (ds);
Furthermore, as I show in my article on using the VALSET function, the SAS/IML language can dynamically create matrices named Mat1, Mat2, and so on.
You can combine these features to eliminate the macros entirely:
data a b c; /* create sample data sets */
x=1;y=2; output;
x=2;y=3; output;
run;
proc iml;
dsnames = {a b c}; /* names of data sets */
do i = 1 to ncol(dsnames);
use (dsnames[i]); /* open each data set */
read all into X;
close (dsnames[i]); /* close the current data set */
MatName = "Mat"+strip(char(i)); /* create Mat1, Mat2,... */
call valset(MatName, X); /* assign values from data set */
end;
show names;
quit;
I wonder if there is a way to unduplicate records WITHOUT sorting? Sometimes I want to keep the original order and just remove duplicated records.
Is it possible?
BTW, below are the methods I know for unduplicating records, both of which involve sorting in the end.
1.
proc sql;
create table yourdata_nodupe as
select distinct *
from abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
by var1 var2 var3 var4 var5;
run;
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.
data nodupes (drop=rc);
length Make $13;
declare hash found_keys();
found_keys.definekey('Make');
found_keys.definedone();
do while (not done);
set sashelp.cars end=done;
rc=found_keys.check();
if rc^=0 then do;
rc=found_keys.add();
output;
end;
end;
stop;
run;
proc print data=nodupes;run;
/* Give each record in the original dataset a row number */
data with_id ;
set mydata ;
_id = _n_ ;
run ;
/* Remove dupes */
proc sort data=with_id nodupkey ;
by var1 var2 var3 ;
run ;
/* Sort back into original order */
proc sort data=with_id ;
by _id ;
run ;
I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
Depending on the number of variables in your data set, the following might be practical. Note that it only removes records that duplicate the immediately preceding record:
data abc_nodup;
set abc;
retain _var1 _var2 _var3 _var4;
if _n_ eq 1 then output;
else do;
if (var1 eq _var1) and (var2 eq _var2) and
(var3 eq _var3) and (var4 eq _var4)
then delete;
else output;
end;
_var1 = var1;
_var2 = var2;
_var3 = var3;
_var4 = var4;
drop _var:;
run;
Please refer to Usage Note 37581, "How can I eliminate duplicate observations from a large data set without sorting?", at http://support.sas.com/kb/37/581.html. It shows how PROC SUMMARY can be used to remove duplicates more efficiently, without sorting.
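A minimal sketch of that approach, assuming hypothetical key variables var1 and var2 in table abc (PROC SUMMARY builds the CLASS groups in memory, so no sort step is needed):
proc summary data=abc nway;
class var1 var2; /* key variables */
id var3 var4; /* carries one value (the maximum) per group */
output out=nodup (drop=_type_ _freq_);
run;
Note the output comes back in CLASS-variable order, not the original record order.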
The two examples given in the original post are not identical.
distinct in proc sql only removes rows which are fully identical.
nodupkey in proc sort removes any row whose key variables are identical (even if the other variables are not). You need the option noduprecs to remove fully identical rows.
If you are only looking for records having common key variables, another solution would be to create a dataset with only the key variable(s), find out which ones are duplicated, and then apply a format to the original data to flag the duplicate records (see the sketch below). If more than one key variable is present in the dataset, you would need to create a new variable containing the concatenation of all the key variable values, converted to character if needed.
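A minimal sketch of that flagging idea, for a single character key variable hypothetically named key in table abc:
proc sql;
create table dupkeys as
select key
from abc
group by key
having count(*) > 1; /* keys that occur more than once */
quit;
data dupfmt;
set dupkeys end=last;
retain fmtname 'dupflag' type 'C';
start = key;
label = 'DUP';
output;
if last then do; /* catch-all range for every other key */
hlo = 'O';
label = 'UNIQUE';
output;
end;
run;
proc format cntlin=dupfmt;
run;
data flagged;
set abc;
dup_flag = put(key, $dupflag.);
run;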
This is the fastest way I can think of, and it requires no sorting; it does assume the data are already grouped by person_id and stay, which is what the sortedby= option asserts.
data output_data_name;
set input_data_name (
sortedby = person_id stay
keep =
person_id
stay
... more variables ...);
by person_id stay;
if first.stay then output;
run;
data output;
set yourdata;
by var notsorted;
if first.var then output;
run;
This will not sort the data; it keeps the first record in each run of equal var values, so it removes duplicates as long as they are adjacent.