How to export a SparkR model as PMML?

I'm trying to export a sparkR model as PMML.
The first approach was using the pmml library:
library(SparkR)
library(pmml)
sparkR.session()
data(iris)
df <- createDataFrame(iris)
model <- spark.kmeans(df, Sepal_Length ~ Sepal_Width, k = 4, initMode = "random")
model_pmml <- pmml(model)
The error:
Error in UseMethod("pmml"): no applicable method for 'pmml' applied to an object of class "KMeansModel"
Traceback:
1. pmml(model)
I also investigated whether the toPMML method available on Scala models could be used from SparkR. I found a question suggesting it may be possible with sparklyr, but not with SparkR.
Any ideas?

I have come to the conclusion that exporting a SparkR model is not supported. I have added a feature request for this: https://issues.apache.org/jira/browse/SPARK-21430. Please vote on the JIRA ticket if you are also looking for this functionality.

Related

Google's notebook on Vertex AI throwing the following error: type name google.VertexModel is different from expected: Model

I got this error when compiling my pipeline:
type name google.VertexModel is different from expected: Model
when running the following notebook by google: automl_tabular_classification_beans
I suppose that Kubeflow Pipelines v2 is not yet able to handle google.VertexModel as a type for component input. However, I've browsed around a bit and did not find any good clue or reference (the kfp documentation for v2 is not up to date..) to solve this issue. Hopefully someone here can give me a good pointer? I look forward to all of your ideas.
Cheers
google.VertexModel is defined here:
https://github.com/kubeflow/pipelines/blob/286a49547cce763c502592c822296aa60f50b3e8/components/google-cloud/google_cloud_pipeline_components/types/artifact_types.py#L20
Here is an example on how to define it:
https://github.com/kubeflow/pipelines/blob/286a49547cce763c502592c822296aa60f50b3e8/components/google-cloud/tests/types/artifact_types_test.py#L22
For example,
from google_cloud_pipeline_components.types import artifact_types
model = artifact_types.VertexModel(uri='YOUR_MODEL_URI_STRING')
Can you try specifying your model using the syntax above and let us know if this works for your code?
This was a breaking change with release 0.1.9. Here are some recommendations:
Pin your release to 0.1.7 and continue to use the Model type.
Use 0.1.9 and switch the output from Output[Model] to Output[Artifact] (see the sketch after this list).
Try the 0.2.0 release; documentation here.
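As a minimal sketch of the second recommendation, assuming the kfp v2 DSL is available (the component name and body here are hypothetical):

from kfp.v2.dsl import component, Output, Artifact

@component
def export_model(model: Output[Artifact]):  # was Output[Model] before 0.1.9
    # Write the model artifact to the path kfp provides (hypothetical body)
    with open(model.path, 'w') as f:
        f.write('model contents')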
Hope these suggestions work!

Why can't I make a table using gtsummary?

I'm quite new to R; I've just been using it for the past month or so for my master's thesis. I finally (hopefully) have figured out my model and now I'm trying to present it in a table. After some research, everyone seems to really love the gtsummary package, and I really want to learn to use it. But for some reason, the most basic function is not working for me.
Here's my lmer model:
res1 <- lmer(pH ~ Species * SedLayer + (+1|Site), data = data)
summary(res1)
Anores1 <- Anova(res1)
Anores1
Which gives me quite a lot of output. And here's the basic code I'm trying to use:
lmer(pH ~ SedLayer * Species + (+1|Site), data = dat)
#> Error in lmer(pH ~ SedLayer * Species + (+1 | Site), data = dat): could not find function "lmer"
Created on 2021-05-16 by the reprex package (v2.0.0)
And I keep getting the following error message:
Error in match.call() : ... used in a situation where it does not exist
I've got all the following packages installed and loaded:
library(stats)
library(rstatix)
library(car)
library(lme4)
library(lmerTest)
library("gtsummary")
library("gt")
library("broom.mixed")
library(tidyverse)
I'm sure I must be making a very silly mistake, but I cannot figure it out! Is my model somehow too weird to put in a table? Is there an additional package I should load? Or am I just writing it wrong?
Thanks for any help you can give me!

LightGBM 'Using categorical_feature in Dataset.' Warning?

From my reading of the LightGBM documentation, one is supposed to define categorical features in the Dataset method. So I have the following code:
cats=['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
However, I received the following warning message:
/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
Why did I get the warning message?
I presume that you get this warning in a call to lgb.train. This function also has a categorical_feature argument, and its default value is 'auto', which means taking categorical columns from a pandas.DataFrame (documentation). The warning, which is emitted at this line, indicates that, even though lgb.train requested that categorical features be identified automatically, LightGBM will use the features specified in the Dataset instead.
To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.
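A minimal sketch of both options, assuming X, y, and a params dict are already defined as in the question:

import lightgbm as lgb

cats = ['C1', 'C2']

# Option 1: pass the same categorical_feature list to both calls
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
booster = lgb.train(params, d_train, categorical_feature=cats)

# Option 2: leave the Dataset unspecified and declare the
# categoricals only in lgb.train
d_train = lgb.Dataset(X, label=y, categorical_feature=None)
booster = lgb.train(params, d_train, categorical_feature=cats)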
As user andrey-popov described, you can use lgb.train's categorical_feature parameter to get rid of this warning.
Below is a simple example with some code how you could do it:
# Define categorical features
cat_feats = ['item_id', 'dept_id', 'store_id',
             'cat_id', 'state_id', 'event_name_1',
             'event_type_1', 'event_name_2', 'event_type_2']
...
# Define the datasets with the categorical_feature parameter
train_data = lgb.Dataset(X.loc[train_idx],
                         Y.loc[train_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
valid_data = lgb.Dataset(X.loc[valid_idx],
                         Y.loc[valid_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)

# And train using the categorical_feature parameter
lgb.train(lgb_params,
          train_data,
          valid_sets=[valid_data],
          verbose_eval=20,
          categorical_feature=cat_feats,
          num_boost_round=1200)
This is less of an answer to the original OP and more of an answer to people who are using the sklearn API and encounter this issue. For those of you using the sklearn API, especially with one of the cross_val methods from sklearn, there are two solutions you could consider.
Sklearn API solution
A solution that worked for me was to cast categorical fields into the category datatype in pandas.
If you are using a pandas DataFrame, LightGBM should automatically treat those as categorical. From the documentation: "integer codes will be extracted from pandas categoricals in the Python-package".
It would make sense for this to be the sklearn-API equivalent of setting categoricals in the Dataset object.
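A minimal sketch of that approach (the frame and column names are hypothetical):

import pandas as pd
import lightgbm as lgb

# Hypothetical toy frame with two categorical columns
X = pd.DataFrame({'C1': list('ababcdcd'),
                  'C2': list('xxyyxxyy'),
                  'num': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Cast the categorical columns to pandas' 'category' dtype
for col in ['C1', 'C2']:
    X[col] = X[col].astype('category')

# The sklearn API picks the categoricals up from the dtype
model = lgb.LGBMClassifier(min_child_samples=1).fit(X, y)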
But keep in mind that LightGBM does not officially support virtually any of the non-core parameters for the sklearn API, and they say so explicitly:
**kwargs is not supported in sklearn, it may cause unexpected issues.
Adaptive Solution
The other, more sure-fire solution for using methods like cross_val_predict is to create your own wrapper class that implements the core Dataset/train workflow under the hood but exposes a fit/predict interface for the cv methods to latch onto. That way you get the full functionality of LightGBM with only a little bit of rolling your own code.
The below sketches out what this could look like.
import lightgbm as lgb

class LGBMSKLWrapper:
    def __init__(self, categorical_variables, params):
        self.categorical_variables = categorical_variables
        self.params = params
        self.model = None

    def fit(self, X, y):
        my_dataset = lgb.Dataset(X, y, categorical_feature=self.categorical_variables)
        self.model = lgb.train(params=self.params, train_set=my_dataset)
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        return self.model.predict(X)
The above lets you load up your parameters when you create the object, and then passes them on to train when the client calls fit.

H2O equivalent of allStringVecToCategorical() and score() api in H2O python

I am using the H2O GLRM model in my Scala code.
Now I am migrating the Scala code to Python.
However, I am not able to find the following equivalent methods in the H2O Python module:
1) allStringVecToCategorical() [Belongs to H2OFrameSupport Trait]
I use the API as follows in the code:
withLockAndUpdate(h2OFrameForImputation) {
  allStringVecToCategorical(_)
}
2) public Frame score(Frame fr) [Belongs to hex.Model]
I use the API as follows in the code:
glrmModel.score(h2OFrameForImputation)
Please let me know the equivalent method in H2O python module.
There isn't a direct equivalent Python API function for allStringVecToCategorical(_); the closest is .asfactor() (docs here), which you can use to convert a single column to type enum/categorical. To get the subset of columns with type string, you can use the method columns_by_type() (docs here).
For .score(), you can use the Python API's .predict() method (docs here).
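A minimal sketch in Python, assuming an H2O cluster is running and glrm_model is your trained GLRM estimator (the file path is hypothetical):

import h2o
h2o.init()

fr = h2o.import_file("your_data.csv")  # hypothetical path

# Rough equivalent of allStringVecToCategorical: convert every
# string column to a categorical (enum) column
for idx in fr.columns_by_type(coltype="string"):
    fr[int(idx)] = fr[int(idx)].asfactor()

# Rough equivalent of hex.Model#score(Frame)
preds = glrm_model.predict(fr)  # glrm_model: your trained GLRM model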

No signature of method: groovy.lang.MissingMethodException.makeKey()

I've installed titan-0.5.0-hadoop2 with HBase and Elasticsearch support.
I've loaded the graph with
g = TitanFactory.open('conf/titan-hbase-es.properties')
==>titangraph[hbase:[127.0.0.1]]
and then I loaded the test application:
GraphOfTheGodsFactory.load(g)
Now, when I try to create a new index key with:
g.makeKey('userId').dataType(String.class).indexed(Vertex.class).unique().make()
I get this error:
No signature of method: groovy.lang.MissingMethodException.makeKey() is applicable for argument types: () values: []
Possible solutions: every(), any()
Display stack trace? [yN]
Can someone help me with this?
When I list the indexed keys, I see this:
g.getIndexedKeys(Vertex.class)
==>reason
==>age
==>name
==>place
I'm not completely following what you are trying to do. It appears that you loaded the Graph of the Gods into g and then want to add userId as a new property to the schema. If that's right, then I think your syntax is wrong, given the Titan 0.5 API. The approach for managing the schema is very different from previous versions. Changes to the schema are performed through the ManagementSystem interface, an instance of which you can get through:
mgmt = g.getManagementSystem()
The syntax for adding a property then looks something like:
birthDate = mgmt.makePropertyKey('birthDate').dataType(Long.class).cardinality(Cardinality.SINGLE).make()
mgmt.commit()
Note that g.getIndexedKeys(Vertex.class) is not the appropriate way to get schema information either. You should use the ManagementSystem for that too.
Please see the documentation here for more information.
