How to initialize a simple matrix in SAS? - matrix

I am new to SAS and have been using R most of the time. I am stuck with a simple and frustrating issue. All I want to do is to create a simple 3 X 3 matrix in SAS. But it throws an error. I need some help in understanding what's going on. The SAS documentation is not very helpful.
data matrixTest;
input Y $ X;
cards;
4 0
3 1
1 1
;
run;
/*Convert X to a categorical variable*/
data matrixTest;
set matrixTest;
if X = 0 then X = "0";
else X = "1";
run;
/*Get design matrix from the regression model*/
proc transreg data=matrixTest design;
model class(X/ zero=last);
output out=input_mcmc(drop=_: Int:);
run;
mX = {5 4 3, 4 0 4, 7 10 3};
And I get the following error when creating the matrix mX:
ERROR 180-322: Statement is not valid or it is used out of proper order.

Your error is that SAS is not a matrix language. SAS is more like a database language; the unit of operation is the dataset, analogous to a SQL table or a dataframe in R or Python.
SAS does have a matrix language built into the system, SAS/IML (interactive matrix language), but it's not part of base SAS and isn't really what you use in the context you're showing. The way you enter data as part of your program is how you did it in the first data step, with datalines.
Side note: You're also showing some R tendencies in the second data step; you cannot convert a variable's type that way. SAS has only 'numeric' and 'character', so you don't have 'categorical' data type anyway; just leave it as is.

Do not use the same data set name in the SET and DATA statements. This makes it hard to debug because you've destroyed your initial data set.
You cannot change types on the fly in SAS. If a variables i character it stays character.
If a variable is numeric, you assign values without quotes, quotes are used for character variables.
Your attempt to create a categorical variable doesn't make sense given the fact that it's already 0/1. Make sure your test data is reflective of your actual situation.
I'm not familiar with PROC TRANSREG so I cannot comment on that portion but those are the issues you're facing now.
As someone else mentioned, SAS is not a matrix language, it processes data line by line instead which means it can handle really, really large data sets because it doesn't have to load it into memory.
Your data set, matrixTest is essentially a data set and ready to go. You don't need to convert it to a matrix or 'initialize' it.
If you want a data set with those values then create that as a data set:
data mx;
input var1-var3;
cards;
5 4 3
4 0 4
7 10 3
;
run;

Related

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over 1 month. I want to reduce this data to have one sample for every hour. The problem is: Some of my runs have "NA" value, so I delete these rows. There is not exactly 60 points for every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour=unique(datehour,1)
index=[]
avedata=reshape([],0,length(alldata[1,:]))
for j in uniquedatehour
for i in 1:length(datehour)
if datehour[i]==j
index=vcat(index,i)
else
rows=alldata[index,:]
rows=convert(Array{Float64,2},rows)
avehour=mean(rows,1)
avedata=vcat(avedata,avehour)
index=[]
continue
end
end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. By wrapping it make sure to either pass data to your function as arguments or if data is in global scope it should be qualified with const;
Layer two: recommendations to your algorithm
Statement like [] creates an array of type Any which is slow, you should use type qualifier like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient, it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
Your code will produce incorrect results if I am not mistaken as it will not catch the last entry of uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows)
Line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not Matrix{Float64} it is better to convert it at the beginning with Matrix{Float64}(alldata);
You can change line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creation of index vector as it is enough that you remember start s and end e position of the range of the same values and then use range s:e to select rows you want.
If you correct those things please post your updated code and maybe I can help further as there is still room for improvement but requires a bit different algorithmic approach (but maybe you will prefer option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back but it is much simpler for me).

How to transform one column to much column on matrix using Fortran 90

I have one column (im = 160648) and row (jm = 1). I want to transform that to a matrix with sizes (im = 344) and (jm=467)
my program code is
program matrix
parameter (im=160648, jm=1)
dimension h(im,jm)
integer::h
open (1,file="Hasil.txt", status='old')
open (2,file="HasilNN.txt", status='unknown')
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
do i=1,im
write(2,33)(h(i,j),j=1,jm)
end do
33 format(1x, 344f10.6)
end program matrix
the error code that appears when read(1,*)(h(i,j)),j=1,jm)
the data type is floating data.
Your read loop is:
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
Shouldn't do i=1,jm be do i=1,im ?
This would imply there are "im" records (lines) in the formatted text file Hasil.txt, which your question suggests.
read(1,*)(h(i,j)),j=1,jm) implies each record (line of text) has "jm" values, which is 1 value per line. Is this what the file looks like ? (An unknown number of blank lines will be skipped with this read (lu,*) ... statement.)
You appear to be wanting to write this information to another file; HasilNN.txt using 33 format (1x, 344f10.6) which suggests 3441 characters per line, although your write statement will write only 1 value per line (as jm=1). This would be a very long line for a text file and probably difficult to manage outside the program. If you did wish to do this, you could achieve this with an implied do loop, such as:
write(2,33) ((h(i,j),j=1,jm),I=1,im)
A few comments:
using jm = 1 implies each row has only one value, which could be equivalently represented as a 1d vector "dimension h(im)", negating the need for j
File unit numbers 1 and 2 are typically reserved unit numbers for screen/keyboard. You would be better using units 11 and 12.
When devising this code, you need to address the record structure in the 2 files, as a simple vector could be used. You can control the line length with the format. A format of (1x,8f10.6) would create a record of 81 characters, which would be much easier to manage.
Format descriptor f10.6 also limits the range of values you can manage in the files. Values >= 1000 or <= -100 will overflow this format, while values smaller than 1.e-6 will be zero.
As #francescalus has noted, you have declared "h" as integer, but use a real format descriptor. This will produce an "Error : format-data mismatch" and has to be changed to what is expected in the file.
You should consider what you wish to achieve and adjust the code.

Hide Labels with No Data in SPSS

I just started using SPSS, there is a option of Select cases that I was trying in SPSS, and later on finding frequency based on that filter.
For Eg:
Suppose Q1 has 12 parts, Q1_1 Q1_2 Q1_3 Q1_4 Q1_5 Q1_6 Q1_7 Q1_8 Q1_9 Q1_10 Q1_11 Q1_12
I want to see data in these variables based on a condition that I used in select cases. Now when I try to see frequencies of these variables based on the filter, only 4 out of 12 satisfy has data.
Now my question is can I hide rest 8 and show only 4 with data on my output window.
It's not entirely clear what you are trying to describe however reading between the lines, I'm guessing you are trying to delete tables generated from FREQUENCIES which may happen to be empty (likely due to a filter applied but perhaps not necessarily either)
You could do this with SPSS Scripting but avoiding that, you may want to explore using CTABLES, which though may not be in the exact same format as FREQUENCY table output it will still none the less retrieve the same information.
Solution below. Assumes Python Integration with SPSS SELECT VARIABLES installed and of course the CTABLE add-on module.
/****** Simulate example data ******/.
input program.
loop #j = 1 to 100.
compute ID=#j.
vector Q(12).
loop #i = 1 to 12.
do if #j<51 and #i<9.
compute Q(#i) = $sysmis.
else.
compute Q(#i) = trunc(rv.uniform(1,5)).
end if.
end loop.
end case.
end loop.
end file.
end input program.
execute.
/************************************/.
/* frequencies without filtering applied */.
freq q1 to q12.
/* frequencies WITH filtering applied */.
/* Empty table here shoult be removed */.
temp.
select if (ID<51).
freq q1 to q12.
spssinc select variables macroname="!Qp" /properties pattern = "^Q\d+$"/options separator="+" order=file.
spssinc select variables macroname="!Qs" /properties pattern = "^Q\d+$"/options separator=" " order=file.
temp.
select if (ID<51).
ctables /table (!Qp)[c][count colpct]
/categories variables=!Qs empty=exclude.
Note if you had assess empty variables at a total level then there is a function in spssaux2 (spssaux2.FindEmptyVars) which could help you find the empty variables and then you could build the syntax to exclude these and so contain the variables with only valid responses and then run FREQUENCIES. But I don't think spssaux2.FindEmptyVars will honor any filtering.

Stata rnormal()

I want to generate a fixed random variable ~N(0,10) for every observation for future computation.
gen X=rnormal (0,10)
list X
Blank
How can I see what value of X is being generated?
You were probably using an empty dataset when you issued these commands. In that case you would first need to tell Stata how many observations your dataset contains. For that you need to use the set obs command, so something like:
. set seed 12345
. set obs 10
obs was 0, now 10
. gen x = rnormal(0,10)
. list, clean
x
1. -9.580833
2. -2.907274
3. 8.45202
4. 8.617108
5. -12.19151
6. 9.457337
7. 1.722469
8. -13.29949
9. -11.5291
10. 25.1646
Think of what would happen when you did not use set obs. In that case Stata would see gen x = rnormal(0,10) and think "ok, I need to create random draws from a normal distribution, but how many?". If you had a dataset open, then it would answer "as many as there are observations in the dataset". If you had no dataset open, then the answer would still be "as many as there are observations in the dataset", but that happens to be 0.
Edit:
If you just want one number you are best of using scalars and not variables. In Stata a scalar refers to a single number and a variable refers to a single column in your dataset. For scalars it is best to use temporary names, as they share the same namespace as variables but variables take precedence when it comes to abreviations, which can lead to unexpected behaviour. So you could do something like:
. tempname a
. scalar `a' = rnormal(0,10)
. di `a'
10.737423

Format statement with unknown columns

I am attempting to use fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width), two, I do not know how to define a variable number of array elements in the format statement, and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throws an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
the format close to what you want is f0.3, this will give no spaces and a fixed number of decimal places. I think if you want to also lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your write statement can be larger than the number of data values, so one (old school) approach is to put a big number there, eg 100000. Modern fortran does have some syntax to specify indefinite repeat, i'm sure someone will offer that up.
----edit
the unlimited repeat is as you might guess an asterisk..and is evideltly "brand new" in f2008
In order to make sure that no space occurs between the entries in your line, you can write them separately in character variables and then print them out using theadjustl() function in fortran:
program csv
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: nn = 3
real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
integer :: ii
character(30) :: buffer(nn+2), myformat
! Create format string with appropriate number of fields.
write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"
! You should execute the following lines in a loop for every line you want to output
write(buffer(1), "(F20.2)") 1.0_dp ! a_float
write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
do ii = 1, nn
write(buffer(2+ii), "(F20.3)") floatarray(ii)
end do
write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)
end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.

Resources