Reading in dates using informats in SAS when raw data is messy - format

I am trying to read messy data into SAS using informats and am having problems. I have a column of data of the following form in a raw txt file, say:
RegDate
0
0
16/10/2002
20/11/2003
0
For RegDate, 0 = missing, otherwise the date is present. I would like to read this data into SAS, giving 'NA' for the zeros and the date for the date, and output into a dataset.
If all dates were present, I could use the code
data test;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&pathlocation" delimiter='09'x
MISSOVER DSD firstobs=2 ;
informat RegDate ddmmyy10. ;
format RegDate ddmmyy10. ;
input
RegDate;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
However I cannot read the above text file doing this as it does not take into account the zeros, as the informat is set to read in dates.
If using a proc import statement
proc import datafile="&pathlocation" out=test dbms=tab replace;
run;
it tries to use a best32. informat, as there is a zero in the first row. The dates cannot then be read in.
So I need to create a custom format of some sort. I can do this for a numeric informat alone or a character informat alone, or a picture informat (which is needed for the dates?). I cannot figure out how to combine multiple formats for one variable. I'm sure the solution is very simple however I cannot find it online so I apologise if this is obvious. Is there either a way to a) put some IF-THEN statement into the format so that it does different things depending on the input b) read the data in purely as text so that the formats need to be used.

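NA is text and is not a valid missing value in SAS; it is what R uses. To indicate that the value is missing for a numeric variable, SAS uses a period (.). Reading the data in with your code already sets the 0s to missing, because 0 is not a valid value for the ddmmyy10. informat (expect an invalid-data note in the log for each one). That is an appropriate read of the data.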
If you want NA you'll need to read or convert the data to text, but then your dates will be text and you'll be limited in what you can do with them, for example no date calculations.
If you really want you could display it that way using a nested format.
proc format;
value na_date_fmt
low-high = [ddmmyy10.]
. = "NA";
run;
data have;
infile cards dsd;
informat regDate ddmmyy10.;
format regDate ddmmyy10.;
format newDate na_date_fmt.;
input regdate;
newDate=regdate;
cards;
0
0
16/10/2002
20/11/2003
0
;
run;
proc print data=have;
run;

You can add an IF statement to the DATA step, like this:
data test;
infile "&pathlocation" delimiter='09'x
MISSOVER DSD firstobs=2 ;
informat RegDate ddmmyy10. ;
format RegDate ddmmyy10. ;
input
RegDate;
if RegDate = 0 then RegDate = .;
run;
The output is
RegDate
.
.
16/10/2002
20/11/2003
.
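For option (a) in the question, the branching can also be done at read time with a user-defined informat that nests ddmmyy10. for everything except the literal 0. This is only a minimal sketch: the informat name regdt is made up for illustration, and inline data lines stand in for the tab-delimited file.

```sas
proc format;
  /* regdt is a hypothetical name; default=10 matches the width of ddmmyy10. */
  invalue regdt (default=10)
    '0'   = .               /* a literal 0 in the raw data means missing */
    other = [ddmmyy10.];    /* everything else is read as a dd/mm/yyyy date */
run;

data test;
  infile cards dsd truncover;
  input RegDate :regdt.;    /* colon modifier: read up to the delimiter */
  format RegDate ddmmyy10.;
cards;
0
0
16/10/2002
20/11/2003
0
;
run;

proc print data=test;
run;
```

Because the 0 is matched explicitly, it should no longer count as invalid data for the date informat, so the log stays clean. RegDate is still a numeric date, so date arithmetic keeps working.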

Related

SAS Macros in Datalines

I have a two part question about creating datasets in SAS that calls upon macro variables
Part 1
I'm trying to create a dataset that has one character variable called variable with a length of 100, and 3 observations.
%let first_value=10;
%let second_value=20;
%let third_value=30;
data temp;
infile cards truncover;
input variable $100.;
cards;
First Value: &first_value
Second Value: &second_value
Third Value: &third_value
;
run;
My output dataset doesn't show the macro variables, just the exact text I entered in the datalines. I would love help on syntax of how to concatenate character input with a macro variable. Also I'm curious why sometimes you need a separate length statement for character variables before the input statement when other times you can just specify the length in the input statement like above.
Part 2
Next, I'm trying to create a dataset that has one observation with 4 variables, 3 of which are macro variables.
data temp2;
infile cards dlm=" ";
input variable $ first_var second_var third_var;
cards;
Observation 1 Filler &first_value &second_value &third_value
;
run;
The 4 spaces in the delimiter statement and between variables in the datalines are actually tabs in my code.
Thanks!
Your examples do not really seem to need macro variables.
But if you really need to resolve macro expressions in variable values, use the RESOLVE() function. RESOLVE() evaluates all macro code in the text, not just macro variable references like the ones in your example. So any macro function calls and calls to actual macros will also be resolved, and the generated text is returned as the result of the function.
newvar=resolve(oldvar);
So your examples become:
data temp;
infile cards truncover;
input variable $100.;
variable = resolve(variable);
cards;
First Value: &first_value
Second Value: &second_value
Third Value: &third_value
;
data temp2;
infile cards dlm="|" ;
input @;
_infile_=resolve(_infile_);
input variable :$100. first_var second_var third_var ;
cards;
Observation 1 Filler|&first_value|&second_value|&third_value
;
But be careful with the second one: for CARDS images the _INFILE_ variable is padded to a fixed multiple of 80 bytes, so if the resolved macro expressions make the string longer than the next 80-byte boundary you will lose the extra text.
511 %let xx=%sysfunc(repeat(----+----0,8));
512
513 data test;
514 infile cards truncover;
515 input @;
516 _infile_=resolve(_infile_);
517 input variable $100. ;
518 length=lengthn(variable);
519 put length= variable=;
520 cards;
length=5 variable=short
length=80 variable=long ----+----0----+----0----+----0----+----0----+----0----+----0----+----0----+
NOTE: The data set WORK.TEST has 2 observations and 2 variables.
So use input from an actual file instead. That way the limit is instead the 32,767 byte limit for a character variable.
%let xx=%sysfunc(repeat(----+----0,8));
options parmcards=text;
filename text temp;
parmcards;
short
long &xx
;
531
532
533 data test;
534 infile text truncover;
536 input @;
536 _infile_=resolve(_infile_);
537 input variable $100. ;
538 length=lengthn(variable);
539 put length= variable=;
540 run;
NOTE: The infile TEXT is:
Filename=C:\...\#LN00053,
RECFM=V,LRECL=32767,File Size (bytes)=17,
Last Modified=08Jul2022:23:42:10,
Create Time=08Jul2022:23:42:10
length=5 variable=short
length=95 variable=long ----+----0----+----0----+----0----+----0----+----0----+----0----+----0----+----0----+----0
NOTE: 2 records were read from the infile TEXT.
The minimum record length was 5.
The maximum record length was 8.
NOTE: The data set WORK.TEST has 2 observations and 2 variables.
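If the text does not have to come from data lines at all, a simpler route is to assign double-quoted string literals, since macro references inside double quotes are resolved while the step is compiled. A minimal sketch using the macro variables from the question:

```sas
%let first_value=10;
%let second_value=20;
%let third_value=30;

data temp;
  /* Without LENGTH, the first assignment would fix the length at 15
     and the longer strings below would be truncated */
  length variable $100;
  variable = "First Value: &first_value";   output;
  variable = "Second Value: &second_value"; output;
  variable = "Third Value: &third_value";   output;
run;
```

This also touches on the LENGTH part of the question: an informat such as $100. in an INPUT statement sets the length as a side effect, whereas an assignment takes the length from the first value it sees, so here it has to be declared up front.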

Check for length of a character string in SAS Proc Format

I would like to write a PROC FORMAT to check for errors in a variable that serves as a unique identifier. The variable is a character string of length 16, and it usually has a number of leading zeros, like so:
0000001234567890
I would like the PROC to output an error to the log if, for example, the variable is null or if the length of the string is different from 16. Can this be done in the same proc, without having to go through functions such as length()?
What I would like to obtain is something like:
proc format;
value $ id_error
' ' = _ERROR_
*length ne 16 = _ERROR_;
*other errors* = _ERROR_;
other = 'OK';
run;
Is something equivalent to the above possible to do with a single proc format?
Reeza's suggestion to use PROC FCMP is along the right path I think. You can't really check for length in a format without it.
This is covered in the SAS documentation. The basic structure is: write an FCMP function that takes a character value as input (for a character format) and returns a character value, then call that function with no arguments in the format. The input value for the format is passed in automatically.
In 9.3+:
data have;
length cardno $32;
input cardno;
datalines;
1234567890123456
0000153456789152
0000000000000000
1111111111111111
9999999999999999
0123456
01234567897456
0123154654564897987445
;;;;
run;
proc fcmp outlib=work.funcs.fmts;
function check16fmt(charval $) $;
length retval $16;
if length(charval) = 16 then retval='VALID VALUE';
else retval='_ERROR_';
return(retval);
endsub;
run;
options cmplib=work.funcs;
proc format;
value $chk16f
low-high = [check16fmt()];
quit;
data want;
set have;
format cardno $chk16f.;
run;

SAS format procedure, invalue statement, UPCASE option does not work

I need to create a SAS informat that will change all case versions of 'Male' and 'Female' to digits.
I found in the documentation that there is an UPCASE option that does the job: "converts all raw data values to uppercase before they are compared to the possible ranges. If you use UPCASE, then make sure the values or ranges you specify are in uppercase"
Unfortunately, after adding the UPCASE option none of the input values are read properly.
The SAS version is 9.2.
My code is below.
options fmtsearch=(WORK);
proc format lib=WORK;
invalue gender UPCASE
MALE = 1
FEMALE = 2
;run;
data _null_;
q='MALE';
x=input(q,gender.);
put q=;
put x=;
run;
The log is:
NOTE: Invalid argument to function INPUT at line 186 column 7.
q=MALE
x=.
q=MALE x=. _ERROR_=1 _N_=1
What is the proper usage of this option?
Very simple: put UPCASE inside parentheses, right after the informat name, i.e. invalue gender (UPCASE).
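A minimal sketch of the corrected definition, keeping the question's informat name. The quoted ranges and the extra test values are my additions for illustration:

```sas
options fmtsearch=(WORK);

proc format lib=WORK;
  /* Informat options go in parentheses after the informat name */
  invalue gender (upcase)
    'MALE'   = 1
    'FEMALE' = 2;
run;

data _null_;
  length q $6;                /* long enough for 'FEMALE' */
  do q = 'Male', 'FEMALE', 'female';
    x = input(q, gender.);    /* input is uppercased before matching */
    put q= x=;
  end;
run;
```

With the option in place, each spelling should map to its code regardless of case, and the "Invalid argument to function INPUT" note should no longer appear for these values.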

Rearrange character column using "PROC FORMAT" in SAS

I want to take the following data variable:
"Nebraska-Iowa"
"Washington-Arkansas"
"Illinois-Utah"
and transform it so that it orders the character groups around the hyphen to be in alphabetical order:
"Iowa-Nebraska"
"Arkansas-Washington"
"Illinois-Utah"
Is there an easy way to do this? I need to split the string around the hyphen, rearrange if necessary, and then paste it back together.
UPDATE
After playing with Matthew's answer, I have decided to generalize this for any number of states with the following dataset:
Nebraska-Iowa
Washington-Arkansas-Texas
Illinois-Utah
Colorado
Here is the code I am trying to build. What I am struggling with is building an array that I can loop through to pull out each word, and then pasting the words back together after sorting them. Please help!
/*Example dataset*/
data have;
format text $50.;
input text;
datalines;
Nebraska-Iowa
Washington-Arkansas-Texas
Illinois-Utah
Colorado
run;
/*Rearrange strings in dataset*/
data arrangestrings;
set have;
length result $50;
howmanyb = countc(text,'-');
howmany = howmanyb + 1;
array state[howmany] _character_;
do i=1 to howmany;
state[i] = scan(text, i, '-');
end;
call sortc(of state(*));
result = catx("-", state[*]);
keep result;
run;
I don't think you need to go to the trouble of defining a user-defined format for a task like this. The built-in scan function is your friend here:
data have;
format text $50.;
input text;
datalines;
Nebraska-Iowa
Washington-Arkansas
Illinois-Utah
run;
data want;
set have;
length word1 word2 result $50;
word1 = scan(text, 1, '-');
word2 = scan(text, 2, '-');
result = ifc(word1 <= word2, text, catx('-', word2, word1));
run;
proc print data=want;
run;
Check out the documentation on the built-in functions that I used (scan, ifc, catx) if you're not familiar with them:
http://support.sas.com/documentation/cdl/en/allprodslang/67244/HTML/default/viewer.htm#syntaxByType-function.htm

SAS CSV TO CHARACTER LEADING zeros

In SAS 9.3 I need to import a CSV file whose first column has leading zeros. I've researched this and just can't quite figure out how to format the statement. I have tried the following and played around with it. I know there is a z format that may work, but I'm not sure how to incorporate it.
data pharmacy;
infile "\\path\June 2013\test.csv"
dsd missover
/*lrecl=512 pad*/
;
input
Field1 $ 1-10
/* Field2 $*/
;
RUN;
Assuming your data is in the following format:
Field1, Field2
00001,1.2
00002,4.5
00010,189.2
00280546,0
0145605616,6
You were on the right lines with the Z. format.
If you want to keep Field1 as numeric, just read it as numeric: SAS will ignore the leading zeros, and you can then use z10. as the format for Field1 so that it displays with leading zeros. Alternatively, if you want to store Field1 as a character variable, that is easy too: read Field1 as numeric and convert it using put(Field1, z10.). Note that z10. pads every value to a fixed width of 10 digits, so the original number of leading zeros is not preserved (00001 becomes 0000000001).
DATA WORK.dummyImport;
INFILE '/<path>/dummyImport.csv' MISSOVER DSD FIRSTOBS=2 TERMSTR=CRLF;
INPUT
Field1
Field2 ;
FORMAT FIELD1 Z10.;
Field1_char=put(Field1, z10.);
RUN;
PROC PRINT DATA=WORK.DummyImport; RUN;
returns:
Field1 Field2 Field1_char
0000000001 1.2 0000000001
0000000002 4.5 0000000002
0000000010 189.2 0000000010
0000280546 0 0000280546
0145605616 6 0145605616
When you're importing a CSV, you definitely want to use the delimiters to your advantage. I find it unlikely that you would want to use column based input statements, e.g. Field1 $ 1-10. Have you tried something as simple as:
data pharmacy;
infile "\\path\June 2013\test.csv" dsd;
/* :$10. because a plain $ defaults to length 8 and would truncate 10-character values */
input Field1 :$10. Field2 $;
RUN;
Personally, I almost always take the easy way out and just use proc import.
