How can I set a start point for the ALS recommender in Spark MLlib? - apache-spark-mllib

I am using this code to get ALS recommendations:
SparkSession spark = SparkSession
    .builder()
    .appName("SomeAppName")
    .config("spark.master", "local")
    .getOrCreate();
JavaRDD<Rating> ratingsRDD = spark
    .read().textFile(args[0]).javaRDD()
    .map(Rating::parseRating);
Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
ALS als = new ALS()
    .setMaxIter(1)
    .setRegParam(0.01)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating");
ALSModel model = als.fit(ratings);
model.setColdStartStrategy("drop");
Dataset<Row> rowDataset = model.recommendForAllUsers(50);
I want to reuse the feature vectors for users and items from a previous run. How can I set these values as the initial point?

If you want to save time and skip the training step each time you run the app, you can simply save the trained model. With the Dataset-based (spark.ml) ALSModel used above, the writer takes just a path:
model.save("saved_model");
and you can load it back in another script with ALSModel.load("saved_model").
More info here:
https://spark.apache.org/docs/2.2.0/mllib-collaborative-filtering.html
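For completeness, here is a minimal Scala sketch of the same save/load round trip (paths and the input source are illustrative, not from the question; spark.ml's ALS does not appear to expose a parameter for supplying initial factor matrices, but the learned factors of a loaded model are available as DataFrames):
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AlsSaveLoad").master("local").getOrCreate()
// Illustrative ratings source with userId, movieId and rating columns.
val ratings = spark.read.json("ratings.json")

val als = new ALS()
  .setMaxIter(1)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(ratings)

// Persist the fitted model so later runs can skip als.fit(...).
model.write.overwrite().save("saved_model")

// In a later run: load the model and reuse it directly.
val reloaded = ALSModel.load("saved_model")
reloaded.setColdStartStrategy("drop")

// The learned factors are exposed as DataFrames (columns: id, features)
// if you want to inspect or export them.
reloaded.userFactors.show(5)
reloaded.itemFactors.show(5)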

Related

How to get the leader feature importances in H2O AutoML (PySparkling Water)

I am using a Spark standalone cluster and running H2O PySparkling in it.
I am unable to find the function for getting the leader model's feature importances. Please help.
Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pyspark import SparkFiles
from pysparkling.ml import H2OAutoML
spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
conf = H2OConf()
hc = H2OContext.getOrCreate(conf)
def xgb_automl_features_importance(data, target_metric):
    # Converting DataFrame to H2OFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    # Identify predictors and response
    y = target_metric
    aml = H2OAutoML(labelCol=y)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    aml.fit(sparkDF)
    print('-----------****************')
    print(aml.getLeaderboard().show(truncate=False))
The fit method on H2OAutoML returns the leader model. Each model in Sparkling Water has the method getFeatureImportances(), which returns a Spark data frame with the feature importances.
model=aml.fit(sparkDF)
model.getFeatureImportances().show()

Hive internal table takes block size larger than configured

My configuration for dfs.blocksize is 128M, and if I upload or create any file it takes a block size of 128M, which is fine. But when I create a Hive table, however small it may be, it takes a block size of 256M.
Can we set the block size of a table when it is created? I don't know how that is done.
UPDATE
I am using spark sql.
spark = SparkSession.builder()
    .appName("Java Spark SQL basic example")
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
    .config("mapred.input.dir.recursive", true)
    .config("hive.mapred.supports.subdirectories", true)
    .config("spark.sql.hive.thriftServer.singleSession", true)
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    //.master("local")
    .getOrCreate();
String query1 = String.format("INSERT INTO TABLE bm_top."+orc_table+" SELECT icode, store_code,division,from_unixtime(unix_timestamp(bill_date,'dd-MMM-yy'),'yyyy-MM-dd'), qty, bill_no, mrp FROM bm_top.temp_ext_table");
spark.sql(query1);
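For reference, Hadoop and Hive properties can be forwarded through the same builder. A minimal Scala sketch is below; the property names and sizes are assumptions to verify against your Hadoop/Hive versions, not a confirmed fix for the 256M behaviour:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Block size example")
  .enableHiveSupport()
  // spark.hadoop.* entries are copied into the Hadoop Configuration used when writing files;
  // 134217728 bytes = 128M.
  .config("spark.hadoop.dfs.blocksize", "134217728")
  // Assumption: if the table is stored as ORC, some Hive versions size blocks via
  // hive.exec.orc.default.block.size (commonly defaulting to 256M).
  .config("hive.exec.orc.default.block.size", "134217728")
  .getOrCreate()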

Can I create sequence file using spark dataframes?

I have a requirement in which I need to create a sequence file. Right now we have a custom API written on top of the Hadoop API, but since we are moving to Spark we have to achieve the same using Spark. Can this be achieved using Spark DataFrames?
AFAIK there is no API available directly on DataFrame for this, apart from the approach below.
Please try something like the following, which works on the RDD underlying the DataFrame and is inspired by SequenceFileRDDFunctions.scala and its method saveAsSequenceFile; a more direct DataFrame-based sketch follows after the links below. From the Scala docs:
"Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion."
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
  val conf = new SparkConf()
    .setAppName("HDFS writable test")
  val sc = new SparkContext(conf)
  val empty = sc.emptyRDD[Any].repartition(10)
  val data = empty.mapPartitions(Generator.generate).map { (NullWritable.get(), _) }
  val seq = new SequenceFileRDDFunctions(data)
  // seq.saveAsSequenceFile("/tmp/s1", None)
  seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
  sc.stop()
}
For further information, please see:
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file
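If you are starting from a DataFrame rather than a hand-built RDD, a more direct sketch is below (the input DataFrame, output path and key/value choices are illustrative): go through df.rdd and pair each Row with a Writable key so the implicit conversion to SequenceFileRDDFunctions applies.
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.spark.sql.SparkSession

object DataFrameToSequenceFile extends App {
  val spark = SparkSession.builder().appName("df-to-sequencefile").getOrCreate()

  // Illustrative DataFrame; replace with your Hive table or other source.
  val df = spark.range(0, 100).toDF("id")

  // Serialize each Row to text, pair it with a NullWritable key, and let the
  // implicit SequenceFileRDDFunctions conversion provide saveAsSequenceFile.
  df.rdd
    .map(row => (NullWritable.get(), new Text(row.mkString(","))))
    .saveAsSequenceFile("/tmp/df_seqfile")

  spark.stop()
}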

How to read a record from HBase then store into Spark RDD (Resilient Distributed Datasets); and read one RDD record then write into HBase?

So I want to write code to read a record from Hadoop HBase, store it into a Spark RDD (Resilient Distributed Dataset), then read one RDD record and write it into HBase. I have zero knowledge about either of the two, and I need to use the AWS cloud or a Hadoop virtual machine. Someone please guide me to start from scratch.
Please make use of the basic Scala code below, which reads data from HBase. You can similarly create a table and write data into HBase; a hedged write sketch follows the read example.
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
object HBaseApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "table1"
    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 100000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println("Number of Records found : " + hBaseRDD.count())
    sc.stop()
  }
}
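For the write direction, here is a minimal sketch using the HBase 1.x client API (it assumes a target table table1 with a column family cf already exists; the toy data is illustrative). It pairs row keys with Put objects and writes through TableOutputFormat via saveAsNewAPIHadoopDataset:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
object HBaseWriteApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseWriteApp").setMaster("local[2]"))
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set(TableOutputFormat.OUTPUT_TABLE, "table1")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    // Turn each (rowKey, value) pair into an HBase Put against column family "cf".
    val puts = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2"))).map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }
    // Write the RDD out through the Hadoop OutputFormat API.
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}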

How to save/load a trained model in H2O?

The user tutorial says
Navigate to Data > View All
Choose to filter by the model key
Hit Save Model
Input for path: /data/h2o-training/...
Hit Submit
The problem is that I do not have this menu (H2O 3.0.0.26, web interface).
I am, unfortunately, not familiar with the web interface but I can offer a workaround involving H2O in R. The functions
h2o.saveModel(object, dir = "", name = "", filename = "", force = FALSE)
and
h2o.loadModel(path, conn = h2o.getConnection())
should offer what you need. I will try to have a look at H2O Flow.
Update
I cannot find a way to explicitly save a model either. What you can do instead is save the 'Flow'. You could therefore upload/import your file, build the model, and then save/load that state :-)
When viewing the model in H2O Flow, you will see an 'Export' button as an action that can be taken against a model.
From there, you will be prompted to specify a path in the 'Export Model' dialog. Specify the path and hit the 'Export' button. That will save your model to disk.
I'm referring to H2O version 3.2.0.3
A working example that I've used recently while building a deep learning model in H2O version 2.8.6. The model was saved in HDFS. For the latest version you probably have to remove the classification=T switch and replace data with training_frame.
library(h2o)
h = h2o.init(ip="xx.xxx.xxx.xxx", port=54321, startH2O = F)
cTrain.h2o <- as.h2o(h,cTrain,key="c1")
cTest.h2o <- as.h2o(h,cTest,key="c2")
nh2oD<-h2o.deeplearning(x =c(1:12),y="tgt",data=cTrain.h2o,classification=F,activation="Tanh",
                        rate=0.001,rho=0.99,momentum_start=0.5,momentum_stable=0.99,input_dropout_ratio=0.2,
                        hidden=c(12,25,11,11),hidden_dropout_ratios=c(0.4,0.4,0.4,0.4),
                        epochs=150,variable_importances=T,seed=1234,reproducible = T,l1=1e-5,
                        key="dn")
hdfsdir<-"hdfs://xxxxxxxxxx/user/xxxxxx/xxxxx/models"
h2o.saveModel(nh2oD,hdfsdir,name="DLModel1",save_cv=T,force=T)
test=h2o.loadModel(h,path=paste0(hdfsdir,"/","DLModel1"))
This should be what you need:
library(h2o)
h2o.init()
path = system.file("extdata", "prostate.csv", package = "h2o")
h2o_df = h2o.importFile(path)
h2o_df$CAPSULE = as.factor(h2o_df$CAPSULE)
model = h2o.glm(y = "CAPSULE",
                x = c("AGE", "RACE", "PSA", "GLEASON"),
                training_frame = h2o_df,
                family = "binomial")
h2o.download_pojo(model)
http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-docs/index.html#POJO%20Quick%20Start
How to save models in H2O Flow:
go to "List All Models"
in the model details, you will find an "Export" option
enter the model name you want to save it as
import it back again
How to save a model trained in h2o-py:
# say "rf" is your H2ORandomForestEstimator object. To export it
>>> path = h2o.save_model(rf, force=True) # save_model() returns the path
>>> path
u'/home/user/rf'
# to import it back again (as a new object)
>>> rafo = h2o.load_model(path)
>>> rafo # prints model details
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: drf1
Model Summary:
######Prints model details...................
