pull out R squared from fit model to table in JSL (JMP)

I'm trying to figure out how to use JSL to write some of the analysis of variance values to a table in JMP. My idea is to write a script that runs different types of models with different parameters and logs R^2 and RMSE to a table (maybe there is a better way to do this; I'm on my second day of JMP). Going through the documentation, it seems that different analyses have different ways of doing this, and I can't find one for "Fit Model". I will also need to know how to do this for a neural network, which I think I may have found the documentation for.

If you're doing something like screening variables to determine an optimized model, you're in the right place with the Fit Model platform. However, running Fit Model in a loop without human judgment in model selection, as you've suggested, isn't necessarily expedient.
So, at the risk of making JMP/JSL do something it's not really suited for, one way to achieve your general goal of grabbing values from the Fit Model platform output is to send the platform to a report, pull the data you want from that report, and then send it to a data table. From that data table, you can concatenate with another data table and you would have your log. That's the idea; here's an example for some dummy data "Ydata" and "Xdata":
thing = Fit Model(
    Y( :Ydata ),
    Effects( :Xdata ),
    Personality( Standard Least Squares ),
    Emphasis( Minimal Report ),
    Run(
        :Ydata << {Plot Actual by Predicted( 0 ),
        Plot Residual by Predicted( 0 ), Plot Effect Leverage( 0 )}
    )
);
thing_report = thing << report;
thing_report_dt_ref = thing_report["Summary of Fit"][1] << make into data table;
//alternatively
//thing_report_dt_ref = thing_report[TableBox(1)] << make into data table;
thing_report_dt_ref << Set Name("Choose_a_name_for_your_new_data_table");
You'd have to handle the looping part, but if you can do it once, you can do it N times.
Because JMP/JSL is stupid, you can alternatively call the "Summary of Fit" directly if you know its name in the tree structure. In my case, its name was "TableBox(1)". Do:
thing << show tree structure
To see where your data lives in the platform display box.
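Once the report exists, you can also pull individual numbers out of it and append them to a running log table yourself, rather than making a new data table each time. Here is a rough sketch of that idea; the row positions inside Summary of Fit are assumptions, so confirm them with show tree structure on your own report:
log_dt = New Table( "Model Log",
    New Column( "RSquare", Numeric ),
    New Column( "RMSE", Numeric )
);
sof = thing_report["Summary of Fit"][Number Col Box( 1 )];
rsq = sof << Get( 1 );  // RSquare is usually the first row of Summary of Fit
rmse = sof << Get( 3 ); // Root Mean Square Error is usually the third row
log_dt << Add Rows( 1 );
log_dt:RSquare[N Rows( log_dt )] = rsq;
log_dt:RMSE[N Rows( log_dt )] = rmse;
Repeating that inside your loop gives you one row of R^2/RMSE per model.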

Related

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over 1 month. I want to reduce this data to have one sample for every hour. The problem is: some of my rows have an "NA" value, so I delete these rows. There are not exactly 60 points for every hour; it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1, :]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index, :]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data stays in global scope, qualify it with const;
Layer two: recommendations to your algorithm
A statement like index=[] creates an array of type Any, which is slow; use a type qualifier like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values into an existing matrix than to do vcat on it;
If I am not mistaken, your code will produce incorrect results, as it will not catch the last entry of the uniquedatehour vector (assume it has only one element and check what happens: avedata will have zero rows);
The line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creating the index vector at all: it is enough to remember the start s and end e positions of the run of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a slightly different algorithmic approach (a rough sketch applying these points is below; or maybe you will prefer the option in layer three for simplicity).
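Here is a rough sketch that applies the points above. It assumes datehour is sorted and alldata is a Matrix{Float64}, and uses the same Julia 0.6-era syntax as the question; on Julia 1.x you would add using Statistics and write mean(x, dims = 1):
function average_by_hour(alldata::Matrix{Float64}, datehour::AbstractVector)
    uniquedatehour = unique(datehour)
    avedata = fill(NaN, length(uniquedatehour), size(alldata, 2))
    s = 1                          # start of the current run of equal datehour values
    for (j, h) in enumerate(uniquedatehour)
        e = s
        while e < length(datehour) && datehour[e + 1] == h
            e += 1                 # extend the run while the datehour value is unchanged
        end
        avedata[j, :] = vec(mean(view(alldata, s:e, :), 1))
        s = e + 1
    end
    return avedata
end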
Layer three: how I would do it
I would use the DataFrames package to handle this problem, like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

Is it possible to create an iris constraint based on cell_methods?

This just came up, I have an answer but I wanted to share it here...
"Is it possible to create an iris constraint based on cell_methods?"
I have a datafile which loads producing many cubes.
I would like to extract only those containing ensemble spread, which I can identify from their cell_methods, which are set to:
(CellMethod(method=u'standard_deviation', coord_names=(u'realization',), intervals=(), comments=()),)
Is there a way to filter the load so that I only read in the desired "ensemble spread" data?
You will need to use the "cube_func" approach for this.
http://scitools.org.uk/iris/docs/latest/iris/iris.html?highlight=constraint#iris.Constraint
So, something very roughly like ...
def cube_is_mean(cube):
    return any(cm.method == 'mean' for cm in cube.cell_methods)

means_constraint = iris.Constraint(cube_func=cube_is_mean)
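Adapted to the ensemble-spread case from the question, a rough sketch might look like this (the filename is a placeholder; the cell-method test mirrors the CellMethod shown above):
import iris

def cube_is_ensemble_spread(cube):
    return any(cm.method == 'standard_deviation' and 'realization' in cm.coord_names
               for cm in cube.cell_methods)

spread_constraint = iris.Constraint(cube_func=cube_is_ensemble_spread)
spread_cubes = iris.load('my_datafile.pp', spread_constraint)  # placeholder filename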

Logic to compare rows in pig

I need logic for the scenario below, which has to be implemented using Pig scripts. Can anyone please provide some ideas on how to do this?
The input contains a column groupName with some values like "others" and "unknown". These values need to be replaced by the value from the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown is to be replaced with sale0001 (from 125). 130,others needs to be replaced by sale0002 (from 129). 132,unknown, 133,unknown, and 134,others are to be replaced with casc0004 (from 131).
--Edit--
I tried the LEAD function in Pig, but it only compares n rows at a time, which cannot solve this completely.
Another approach is working, but I am looking for an optimized one:
Cogroup for the same data set (like Dataset and Dataset_self)
-Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
-Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then ( id, group) else (id_self,group)
-Foreach (group id){
ordered = order by id,diff,group;
limited = ordered limit 1;
generate limited ;
}
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD and LAG functions readily available which could be used to do exactly what you need.
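For illustration, an analytic-function version of this could look roughly like the following (Oracle-style syntax; the table name is an assumption). LAST_VALUE with IGNORE NULLS carries the most recent valid group name forward, which also handles consecutive unknown/others rows:
SELECT id,
       LAST_VALUE(
           CASE WHEN groupName NOT IN ('unknown', 'others') THEN groupName END
           IGNORE NULLS
       ) OVER (ORDER BY id) AS groupName
FROM   input_data  -- hypothetical table name
ORDER  BY id;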

Defining a flexible structure in Prolog

Well, I'm a bit new to Prolog, so my question is on Prolog pattern/logic.
I have a relationship called tablet. It has many parameters, such as name, operatingSystem, ramCapacity, etc. I have many objects/predicates of this relationship, like
tablet(
    name("tablet1"),
    operatingSystem("ios"),
    ramCapacity(1024),
    screen(
        type("IPS"),
        resolution(1024, 2048)
    )
).
tablet(
    name("tablet2"),
    operatingSystem("android"),
    ramCapacity(2048),
    screen(
        type("IPS"),
        resolution(1024, 2048),
        protected(yes)
    ),
    isSupported(yes)
).
And some other similar relationships, BUT with different numbers of parameters. Some attributes in different objects I do not need, OR I created some tablets and one day added one more field and started to use it in new tablets.
There are a few questions:
I need to use the most flexible structure possible in Prolog. Some of the tablets have attributes/inner predicates and some do not, but they are all tablets.
I need to access the data in the easiest way, for example find all tablets that have ramCapacity(1024), not including ones that do not have this attribute.
I need to change some attribute values in the easiest way. For example, a query to change ramCapacity to 2048 for the tablet that has name "tablet1".
If possible, it should be pretty to read in a text editor :)
Is this structure flexible? Should I use another one? Do I need additional rules to manipulate this structure? Is this structure easy to change with a query? (I keep this structure in a file.)
Since the number of attributes is not fixed and needs to be this flexible, consider representing these items as option lists, like this:
tablet([name=tablet1,
        operating_system=ios,
        ram_capacity=1024,
        screen=screen([type="IPS",
                       resolution=res(1024, 2048)])]).

tablet([name=tablet2,
        operating_system=android,
        ram_capacity=2048,
        screen=screen([type="IPS",
                       resolution=res(1024, 2048)]),
        is_supported=yes]).
You can easily query and arbitrarily extend such lists. Example:
?- tablet(Ts), member(name=tablet2, Ts).
Ts = [name=tablet2, operating_system=android, ram_capacity=2048, screen=screen([type="IPS", resolution=res(..., ...)]), is_supported=yes] ;
false.
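The other two operations from the question, finding tablets by an attribute and changing an attribute's value, become list operations as well. A rough sketch, assuming SWI-Prolog's library(lists) for selectchk/4 (the predicate names here are my own):
% Find the option list of any tablet with a given RAM capacity:
tablet_with_ram(Ram, Ts) :-
    tablet(Ts),
    member(ram_capacity=Ram, Ts).

% Build an updated option list Ts1 in which ram_capacity has a new value,
% for the tablet with the given name (the stored facts are not modified):
set_ram(Name, NewRam, Ts1) :-
    tablet(Ts),
    member(name=Name, Ts),
    selectchk(ram_capacity=_, Ts, ram_capacity=NewRam, Ts1).
For example, ?- tablet_with_ram(1024, Ts). finds tablet1, and ?- set_ram(tablet1, 2048, Ts1). yields the same list with ram_capacity=2048.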
Notice also the common Prolog naming_convention_to_use_underscores_for_readability instead of mixingCasesAndMakingEverythingExtremelyHardToRead.

How to best get the data out of a lookup table

I have a few product lines with products that have various features. I have a list of drawings made for each product, and below is a sample of how the product lines, products, and features are represented in the drawings.
I am trying to figure out: when given a particular product with all of its features (i.e. a single row from the table), what is the most efficient and most concise way to structure my code around the features to select the correct drawing?
I can always do something like
if ($product == "A"
&& $motor == true
&& $feet == 3
&& $outlet == true
&& $manual == false)
getDrawing("A_motor_3_outlet_no_manual.jpg");
and end up with as many flat if statements as there are rows in the table. But there must be a better way. I think it may have something to do with cardinality, or maybe with variability of the options in each of the properties of the product. The exact concept escapes me for some reason.
I have a feeling, for example, that it is better to check whether the product has a motor first, just because once you know that, it eliminates about half of all the options and narrows it down faster, i.e. in the main outer if block do something like this:
if ($motor == true)
{
    // 7 drawings left to check
}
else
{
    // 6 drawings left to consider
}
as opposed to something like this:
if ($feet == 1)
{
    // still 10 rows left to consider
}
else if ($feet == 2)
{
    // just 2 options left to check
}
else if ($feet == 3)
{
    // one option only left:
    getDrawing("B_motor_3_no_outlet_manual.jpg");
}
But maybe I am just overthinking this and just need a lookup table somehow with all the properties, maybe like
getDrawing($lookup[$product][$motor][$feet][$outlet][$manual]);
The question still remains: what order of properties is best to use for the lookup table, and does it matter?
Question:
Is there an "optimal" if-then-else block nesting order to decide for product properties, to minimize the total number of decisions the code has to make overall, or do I need to abandon that train of thought and just use a lookup table? Why or why not?
EDIT: By the way, this looks like a good candidate for a database, but I have only 48 rows total and can encode them directly into code. This will be read-only and not updated all that frequently, so I am thinking of using a multidimensional array to encode this data.
This is what databases are made for. You will be able to use a simple SELECT statement to find exactly what you need. Even if right now it's only one table with very few rows, that's the way I'd go.
If you really want to store everything in the script, you could go ahead and hash the properties of each product and use the hash value as the key of the lookup table. When given certain product properties, you can then just hash them and retrieve the respective drawing.
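A minimal sketch of that idea, using the property names and the getDrawing() helper from the question; the key format and the contents of $lookup are illustrative assumptions:
$key = implode('_', array(
    $product,
    $motor  ? 'motor'  : 'no_motor',
    $feet,
    $outlet ? 'outlet' : 'no_outlet',
    $manual ? 'manual' : 'no_manual',
));

$lookup = array(
    'A_motor_3_outlet_no_manual' => 'A_motor_3_outlet_no_manual.jpg',
    // ... one entry per row of the table (48 in total)
);

getDrawing(isset($lookup[$key]) ? $lookup[$key] : 'missing.jpg');
With this approach the order of the properties inside the key no longer matters for performance, since the whole key is resolved in a single array lookup.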
You could just encode your data into drawing filenames. Exactly the way you did with the first if statement. For example:
$filename = $product."_".($motor ? "motor" : "no_motor")."_".$feet."_".($outlet ? "outlet" : "no_outlet")."_".($manual ? "manual" : "no_manual").".jpg";
if (!file_exists($filename)) $filename = "missing.jpg";
This will keep all the trouble outside and simplify the code.
