After training a RandomForestRegressor inside a PipelineModel with the DataFrame-based spark.ml API (Spark 2.0),
I loaded the saved model into my real-time environment in order to serve predictions. Each request
is handled and transformed through the loaded PipelineModel, but in the process I have to convert the
single request vector into a one-row DataFrame using spark.createDataFrame, and all of this takes around 700 ms,
compared to about 2.5 ms when I use the RDD-based MLlib model's predict(vector).
Is there any way to use the new spark.ml API to predict a single vector without converting it to a DataFrame, or to do something else to speed things up?
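For reference, this is roughly what the per-request path looks like (simplified; pipelineModel, spark and the path are just placeholders for the objects in my serving code):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vector

val pipelineModel = PipelineModel.load("/path/to/saved/pipeline") // loaded once at startup

// per request: wrap the single feature vector in a one-row DataFrame and transform it
def predictOne(features: Vector): Double = {
  val requestDf = spark.createDataFrame(Seq(Tuple1(features))).toDF("features")
  pipelineModel.transform(requestDf).select("prediction").head().getDouble(0)
}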
The DataFrame-based org.apache.spark.ml.regression.RandomForestRegressionModel also takes a Vector as input, so I don't think you need to convert the vector to a DataFrame for every call.
Here is how I think your code should work:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.regression.RandomForestRegressionModel
import org.apache.spark.sql.DataFrame
import spark.implicits._ // for the Dataset encoder used by map below

// load the trained RF model
val rfModel = RandomForestRegressionModel.load("path")

// a DataFrame containing a column "feature" of type Vector
val predictionData: DataFrame = ???

predictionData.map { row =>
  val feature = row.getAs[Vector]("feature")
  val result = rfModel.predict(feature)
  (feature, result)
}
I want to measure the performance of a UDF on a large dataset. The Spark SQL is:
spark.sql("SELECT my_udf(value) as results FROM my_table")
The UDF returns an array. The issue I'm facing is how to make this execute without returning the data to the driver. I need an action, but anything returning the full dataset will crash the driver (e.g. collect), and the alternatives don't run the calculation for all rows (show / take(n)). So how can I trigger the calculation without returning all the data to the driver?
I think the closest you can get to running only your UDF for measuring timing would be something like below. The general idea is to use caching to try to remove data-loading time from your measurement, and then use a foreach that does nothing to make Spark run your UDF.
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for toDF and the $"col" syntax

val myFunc: String => Int = _.length
val myUdf = udf(myFunc)
val data = Seq("a", "aa", "aaa", "aaaa")
val df = sc.parallelize(data).toDF("text")
// Cache to remove data loading from measurements as much as possible
// Also, do a foreach no-op action to force the data to load and cache before our test
df.cache()
df.foreach(row => {})
// Run the test, grabbing before and after time
val start = System.nanoTime()
val udfDf = df.withColumn("udf_column", myUdf($"text"))
// Force spark to run your UDF and do nothing with the result so we don't include any writing time in our measurement
udfDf.rdd.foreach(row => {})
// Get the total elapsed time
val elapsedNs = System.nanoTime() - start
I am new to Pig, so bear with me. I have two data sources that share the same schema: a map of attributes. I know the records will overlap on a single identifiable attribute. For example:
Record A:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza"]}}
Record B:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Buffalo Wings"]}}
I want to merge the records on Name such that:
Merged:
{"Name":{"First":"Foo", "Last":"Bar"}, "FavoriteFoods":{["Oranges", "Pizza", "Buffalo Wings"]}}
UNION, UNION ONSCHEMA, and JOIN don't operate this way. Is there a method available to do this within Pig, or will it have to happen within a UDF?
Something like:
A = LOAD 'fileA.json' USING JsonLoader AS infoMap:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMap:map[];
merged = MERGE_ON infoMap#Name, A, B;
Pig by itself is quite limited when it comes to even slightly complex data transformations. I feel you will need two kinds of UDFs to achieve your task. The first UDF needs to accept a map and create a unique string representation of it, for example a hashed string representation of the map (let's call it getHashFromMap()). This string will be used to join the two relations. The second UDF would accept two maps and return a merged map (let's call it mergeMaps()). Your script would then look as follows:
A = LOAD 'fileA.json' USING JsonLoader AS infoMapA:map[];
B = LOAD 'fileB.json' USING JsonLoader AS infoMapB:map[];
A2 = FOREACH A GENERATE *, getHashFromMap(infoMapA#'Name') AS joinKey;
B2 = FOREACH B GENERATE *, getHashFromMap(infoMapB#'Name') AS joinKey;
AB = JOIN A2 BY joinKey, B2 BY joinKey;
merged = FOREACH AB GENERATE *, mergeMaps(infoMapA, infoMapB) AS mergedMap;
Here I assume that the attribute you want to merge on is a map. If that can vary, your first UDF will need to be more generic: its main purpose is to produce a unique string representation of the attribute so that the datasets can be joined on it.
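For illustration, here is a rough Java sketch of the second UDF (mergeMaps). The class name and the merge policy are just placeholders; in particular, bag-valued entries such as FavoriteFoods would need to be concatenated rather than overwritten, which is only noted in a comment here:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MergeMaps extends EvalFunc<Map<String, Object>> {
    @Override
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> left = (Map<String, Object>) input.get(0);
        @SuppressWarnings("unchecked")
        Map<String, Object> right = (Map<String, Object>) input.get(1);

        // naive merge: values from the second map win on key collisions
        Map<String, Object> merged = new HashMap<>(left);
        for (Map.Entry<String, Object> e : right.entrySet()) {
            // TODO: if both values are DataBags (e.g. FavoriteFoods), append the
            // right bag's tuples to the left bag instead of overwriting it
            merged.put(e.getKey(), e.getValue());
        }
        return merged;
    }
}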
My goal is to create a DRF model in H2O with the TRAIN, VALIDATION and TEST datasets I have, and to compute the prediction metrics (RMSE, R2, MSE, etc.) on the TEST dataset.
Below is the piece of code:
DRFParameters rfParms = (DRFParameters) algParameter;
rfParms._response_column = trainDataFrame._names[responseColumn(trainDataFrame)]; //The response column
rfParms._train = trainDataFrame._key;
//rfParms._valid = testDataFrame._key;
rfParms._nfolds = 5;
DRF job = new DRF(rfParms);
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
Here I don't know how to proceed with computing the metrics (R2, RMSE, MSE, MAE, etc.) after scoring.
Could you please help with H2O DRF modeling and computing these prediction metrics in Java?
Depending on whether your model is a regression, binomial or multinomial model you'll have to use one of ModelMetricsRegression.make(), ModelMetricsBinomial.make() or ModelMetricsMultinomial.make(). They have slightly different signatures - you can find them in our Java docs.
For the trainDataFrame you can get the metrics from your drf model: they are in drf._output._training_metrics (you might need to cast it to the appropriate type, since this field is a generic ModelMetrics). If you use your test dataset as a validation frame, you can get the metrics from drf._output._validation_metrics.
Edit:
DRFModel drf = job.trainModel().get(); // Train the model
Frame pred = drf.score(testDataFrame); //Score the test
ModelMetricsBinomial mm = ModelMetricsBinomial.make(pred.vec(2), testDataFrame.vec(rfParms._response_column));
double auc = mm.auc();
double rmse = mm.rmse();
double r2 = mm.r2();
// etc.
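Since the question is about a regression model, the regression variant would look roughly like the sketch below. It reuses drf, pred, testDataFrame and rfParms from above; the exact make() signature and the DistributionFamily package differ a bit between H2O versions, so treat this as an outline and check the Javadoc of your release:
// 1) Metrics H2O already computed on the training frame (and on the validation
//    frame, if rfParms._valid was set)
ModelMetricsRegression trainMM = (ModelMetricsRegression) drf._output._training_metrics;
double trainRmse = trainMM.rmse();

// 2) Metrics on the scored TEST frame. For a regression model the scored frame
//    has a single "predict" column, i.e. pred.vec(0). This assumes the
//    (predicted, actual, distribution) variant of make().
ModelMetricsRegression testMM = ModelMetricsRegression.make(
        pred.vec(0),
        testDataFrame.vec(rfParms._response_column),
        DistributionFamily.gaussian);
double rmse = testMM.rmse();
double mse = testMM.mse();
double r2 = testMM.r2();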
Does anybody know whether it is possible to save a trained model of Spark's Naive Bayes classifier (for example, to a file) and load it later when required?
Thank you.
I tried saving and loading the model. I was not able to recreate the model from the stored weights (I couldn't find the proper constructor). But the whole model is serializable, so you can store and load it as follows.
Store it:
import java.io.{FileOutputStream, ObjectOutputStream}

val fos = new FileOutputStream(<storage path>)
val oos = new ObjectOutputStream(fos)
oos.writeObject(model)
oos.close()
and load it back:
import java.io.{FileInputStream, ObjectInputStream}

val fis = new FileInputStream(<storage path>)
val ois = new ObjectInputStream(fis)
val newModel = ois.readObject().asInstanceOf[org.apache.spark.mllib.classification.NaiveBayesModel]
It worked for me.
It is discussed in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-td11953.html
You can use the built-in functions (Spark version 2.1.0). Use NaiveBayesModel#save in order to store the model and NaiveBayesModel#load in order to read a previously stored model.
The save method comes from Saveable and is implemented by a wide range of classification models. The load method is static on each classification model implementation (in Scala it lives on the companion object).
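A minimal example (the trainingData RDD and the path are placeholders):
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

// train as usual, then persist the model with the built-in method
val model = NaiveBayes.train(trainingData)   // trainingData: RDD[LabeledPoint]
model.save(sc, "hdfs:///models/naive-bayes") // placeholder path

// later, possibly in another application
val sameModel = NaiveBayesModel.load(sc, "hdfs:///models/naive-bayes")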
In Hadoop using Pig, I have a large number of fields in a few separate sources which I load, filter, project, group, run through a couple Java UDFs, join, project and store. (That's everyday life in Hadoop.) Some of the fields in the original load of data aren't used by the UDFs and aren't needed until the final store.
When is it better to pass unused fields through UDFs than to store and join them later?
A trivial toy example is a data source with columns name,weight,height and I ultimately want to store name,weight,heightSquared. My UDF is going to square the height for me. Which is better:
inputdata = LOAD 'data' AS (name, weight, height);
outputdata = FOREACH inputdata
GENERATE myudf.squareHeight(name,weight,height)
AS (name,weight,heightSquared);
STORE outputdata INTO 'output';
or
inputdata = LOAD 'data' AS (name, weight, height);
name_weight = FOREACH inputdata
GENERATE name,weight;
intdata1 = FOREACH inputdata
GENERATE myudf.squareHeight(name,height)
AS (iname,heightSquared);
intdata2 = JOIN intdata1 BY iname, name_weight BY name;
outputdata = FOREACH intdata2
GENERATE name,weight,heightSquared;
STORE outputdata INTO 'output';
In this case it looks pretty obvious: the first case is better. But the UDF does have to read, carry, and output the weight field. When you have 15 fields the UDF doesn't care about and one it does, is the first case still better?
If you have 15 fields the UDF doesn't care about, then don't send them to the UDF. In your example, there's no reason to write your UDF to take three fields if it's only going to use the third one. The best script for your example would be:
inputdata = LOAD 'data' AS (name, weight, height);
outputdata =
FOREACH inputdata
GENERATE
name,
weight,
myudf.squareHeight(height) AS heightSquared;
STORE outputdata INTO 'output';
So that addresses the UDF case. If you have a bunch of fields that you'll want to store, but you are not going to use them in any of the next several map-reduce cycles, you may wish to store them immediately and then join them back in. But that would be a matter of empirically testing which approach is faster for your specific case.