How to get the leader model's feature importances in H2O AutoML (PySparkling Water)

I am using a Spark standalone cluster and running H2O PySparkling on it.
I am unable to find the function for getting the leader model's feature importances. Please help.
Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pysparkling.ml import H2OAutoML

spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
conf = H2OConf()
hc = H2OContext.getOrCreate(conf)

def xgb_automl_features_importance(data, target_metric):
    # Convert the pandas DataFrame to an H2OFrame, then to a Spark DataFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    # Identify the response column
    y = target_metric
    aml = H2OAutoML(labelCol=y)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    aml.fit(sparkDF)
    print('-----------****************')
    aml.getLeaderboard().show(truncate=False)

The fit method on H2OAutoML returns the leader model. Each model in Sparkling Water has a getFeatureImportances() method that returns a Spark DataFrame with the feature importances.
model = aml.fit(sparkDF)
model.getFeatureImportances().show()
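If you need the importances as a plain Python structure, for example to sort or plot them, you can collect the Spark DataFrame locally. A minimal sketch; the column names ("Variable", "Percentage") are an assumption and may differ between Sparkling Water versions, so check importances.columns first:
importances = model.getFeatureImportances()
# Collect locally; column names below are assumptions, verify with importances.columns
importances_pd = importances.toPandas()
top_features = importances_pd.sort_values("Percentage", ascending=False).head(10)
print(top_features[["Variable", "Percentage"]])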

Related

While deploying model to AKS PipelineModel.load throwing org.apache.hadoop.mapred.InvalidInputException

I am trying to deploy a model to AKS. I am using the AML SDK to register the model in the AML workspace. I am using the PipelineModel module to save the model, and I am trying to load it using PipelineModel.load. My entry script looks like the one below:
import os
import json
import pandas as pd
from azureml.core.model import Model
from pyspark.ml import PipelineModel
from mmlspark import ComputeModelStatistics

def init():
    import mmlspark  # this is needed to load the mmlspark libraries
    import logging
    # extract and load the model
    global model, model_path
    model_path = Model.get_model_path("{model_name}")
    print(model_path)
    print(os.stat(model_path))
    print(os.path.exists(model_path))
    #model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "{model_name}")
    logging.basicConfig(level=logging.DEBUG)
    #print(model_path)
    #with ZipFile(model_path, 'r') as f:
    #    f.extractall('model')
    model = PipelineModel.load(model_path)
    #model = PipelineModel.read().load(model_path)

def run(input_json):
    try:
        output_df = model.transform(pd.read_json(input_json))
        evaluator = ComputeModelStatistics().setScoredLabelsCol("prediction").setLabelCol("label").setEvaluationMetric("AUC")
        result = evaluator.transform(output_df)
        auc = result.select("AUC").collect()[0][0]
        result = auc
    except Exception as e:
        result = str(e)
    return json.dumps({"result": result})
It gives an error like the one below:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/var/azureml-app/azureml-models/lightgbm.model/2/lightgbm.model/metadata
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
os.path.exists returns True for the path fetched from Model.get_model_path.
Am I missing something here?
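For reference, PipelineModel.load expects the directory layout produced by PipelineModel.write().save() (including the metadata subfolder the error message refers to). A minimal sketch of the save-and-register round trip that would normally produce that layout; the paths and the workspace object ws are hypothetical, not taken from the question:
from pyspark.ml import PipelineModel
from azureml.core.model import Model

# Save the fitted pipeline; this writes a directory containing 'metadata' and 'stages'
pipeline_model.write().overwrite().save("lightgbm.model")  # hypothetical local path

# Register the saved *directory* (not a zipped copy) so that Model.get_model_path
# later points at a folder PipelineModel.load can read
Model.register(workspace=ws,                  # 'ws' is a hypothetical Workspace object
               model_path="lightgbm.model",   # the saved directory
               model_name="lightgbm.model")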

h2o automl max_runtime_secs not stopping execution

To Whom It May Concern,
The code below is being run in a Docker container based on Jupyter's data science notebook;
however, I've installed Java 8 and h2o (version 3.20.0.7), as well as exposed the necessary ports. The Docker container is being run on a system using Ubuntu 16.04 with 32 threads and over 300 GB of RAM.
h2o is using all the threads and 26.67 GB of memory. I'm attempting to classify text as either a 0 or a 1 using the code below.
However, despite setting max_runtime_secs to 900 (15 minutes), the code hadn't finished executing and was still tying up most of the machine's resources ~15 hours later. As a side note, it took about 20 minutes to parse df_train. Any thoughts on what's going wrong?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv('Data.csv')[['Text', 'Classification']]
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english')
x_train_vec = vectorizer.fit_transform(df['Text'])
y_train = df['Classification']
import h2o
from h2o.automl import H2OAutoML
h2o.init()
df_train = h2o.H2OFrame(x_train_vec.A, header=-1, column_names=vectorizer.get_feature_names())
df_labels = h2o.H2OFrame(y_train.reset_index()[['Classification']])
df_train = df_train.concat(df_labels)
x_train_cn = df_train.columns
y_train_cn = 'Classification'
x_train_cn.remove(y_train_cn)
df_train[y_train_cn] = df_train[y_train_cn].asfactor()
h2o_aml = H2OAutoML(max_runtime_secs = 900, exclude_algos = ["DeepLearning"])
h2o_aml.train(x = x_train_cn , y = y_train_cn, training_frame = df_train)
lb = h2o_aml.leaderboard
y_predict = h2o_aml.leader.predict(df_train.drop('Classification'))
print('accuracy: {}'.format(accuracy_score(y_pred=y_predict, y_true=y_train)))
print('precision: {}'.format(precision_score(y_pred=y_predict, y_true=y_train)))
print('recall: {}'.format(recall_score(y_pred=y_predict, y_true=y_train)))
print('f1: {}\n'.format(f1_score(y_pred=y_predict, y_true=y_train)))
This is a bug that has been fixed on master. If you want, you can try out the fix now on the nightly release; otherwise, it will be fixed in the next stable release of H2O, 3.22.
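Until you can upgrade, one possible stopgap is to cap the run with max_models in addition to max_runtime_secs, so that training stops even if the time limit is not honoured. A minimal sketch that reuses df_train, x_train_cn and y_train_cn from the question; the value 20 is an arbitrary example:
from h2o.automl import H2OAutoML

# max_models acts as a backstop in case max_runtime_secs is not honoured;
# 20 is an arbitrary example value, tune it to your time budget
h2o_aml = H2OAutoML(max_runtime_secs=900,
                    max_models=20,
                    exclude_algos=["DeepLearning"])
h2o_aml.train(x=x_train_cn, y=y_train_cn, training_frame=df_train)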

How to change the HDFS block size of a DataFrame in pyspark

This seems related to: How to change hdfs block size in pyspark?
I can successfully change the HDFS block size with rdd.saveAsTextFile,
but not with the corresponding DataFrame.write.parquet, so I am unable to save Parquet files with the desired block size.
I'm unsure whether this is a bug in the PySpark DataFrame writer or whether I did not set the configuration correctly.
The following is my test code:
##########
# init
##########
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import hdfs
from hdfs import InsecureClient
import os
import numpy as np
import pandas as pd
import logging
os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
block_size = 512 * 1024
conf = (SparkConf()
        .setAppName("myapp")
        .setMaster("spark://spark1:7077")
        .set('spark.cores.max', 20)
        .set("spark.executor.cores", 10)
        .set("spark.executor.memory", "10g")
        .set("spark.hadoop.dfs.blocksize", str(block_size))
        .set("spark.hadoop.dfs.block.size", str(block_size)))
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
##########
# main
##########
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 128MB-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
Found the answer from the following link:
https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html
I can successfully set the Parquet block size with spark.hadoop.parquet.block.size.
The following is the sample code:
# init
block_size = 512 * 1024
conf = (SparkConf()
        .setAppName("myapp")
        .setMaster("spark://spark1:7077")
        .set('spark.cores.max', 20)
        .set("spark.executor.cores", 10)
        .set("spark.executor.memory", "10g")
        .set('spark.hadoop.parquet.block.size', str(block_size))
        .set("spark.hadoop.dfs.blocksize", str(block_size))
        .set("spark.hadoop.dfs.block.size", str(block_size))
        .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072)))
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 512k-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using DataFrameWriter.csv, resulting 512k-block-size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv')
# save using DataFrameWriter.text, resulting 512k-block-size
df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
Hadoop and Spark are two independent tools, each with its own strategy. Spark and Parquet work with data partitions, and the HDFS block size is not meaningful to them. Let Spark write the data its own way, and then do what you want with it inside HDFS.
You can change the number of Parquet partitions with:
df_txt.repartition(6).write.format("parquet").save("hdfs://...")
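To verify which block size actually ended up on HDFS, you can inspect one of the written files with the hdfs client already used in the question. A minimal sketch; it assumes the output directory from the example above and picks an arbitrary file from the listing:
from hdfs import InsecureClient

client = InsecureClient('http://spark1:50070')
# List the files Spark wrote and check the block size reported for one of them;
# the exact part-file name varies, so take it from the listing.
files = client.list('/tmp/temp_with_df')
status = client.status('/tmp/temp_with_df/' + files[-1])
print(status['blockSize'])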

Stanford CoreNLP use case: PySpark script runs fine on a local node but very slowly in YARN cluster mode

I have tried all the solutions I could think of, but I am unable to run and scale this on the cluster, and I need to process 100 million records. The script runs very well on a local node as expected but fails to run on a Cloudera cluster on Amazon. Here is the sample data that works on a local node. As far as I can tell, the problem is that the two files used in the UDF are not getting distributed to the executors/containers/nodes, so the job just keeps running and processing is very slow. I have not been able to fix this code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
from pyspark import SparkContext, SparkFiles
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
from nltk.tag import StanfordNERTagger
import os

def stanford(text):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    # Files shipped with --files are resolved on each executor via SparkFiles
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(text.split())
    organizations = []
    organization = ""
    for t in output:
        word = t[0]   # the word
        tag = t[1]    # the current tag
        # If the current tag is ORGANIZATION, append the word to the running name
        if tag == "ORGANIZATION":
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final

stanford_classification = udf(stanford, StringType())

################### PySpark section ###############
# Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
# Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
# Create a new dataframe with the new column 'organizations'
df = df.withColumn("organizations", stanford_classification(df['affiliation_string']))
# Save the result
df.select('pmid', 'affiliation_string', 'organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
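One quick way to confirm whether the artifacts passed via --files are actually visible on the executors is to resolve them with SparkFiles.get inside a small Spark job. A minimal sketch under that assumption; the file names mirror the spark-submit line above:
from pyspark import SparkContext, SparkFiles
import os

sc = SparkContext.getOrCreate()

def check_files(_):
    # SparkFiles.get resolves the executor-local copy of each file shipped with --files
    ner_jar = SparkFiles.get("stanford-ner.jar")
    classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    return [(ner_jar, os.path.exists(ner_jar)),
            (classifier, os.path.exists(classifier))]

# Run the check on a few partitions and print what each executor sees
print(sc.parallelize(range(4), 4).mapPartitions(check_files).collect())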

Unable to submit a Spark Python script

I'm submitting the following Python script:
#!/usr/bin/python
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext as sc, SparkConf
data = sc.textFile("hdfs:/dataset/parkinsons.data")
and got this error:
data = sc.textFile("hdfs:/dataset/parkinsons.data")
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)
The import only binds the SparkContext class to the name sc; you must create a SparkContext instance first, for example:
from pyspark import SparkContext
sc = SparkContext(appName="TestApp")
data = sc.textFile("hdfs:/dataset/parkinsons.data")
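On Spark 2.x, the equivalent entry point is a SparkSession, from which the SparkContext is available. A minimal sketch:
from pyspark.sql import SparkSession

# Build (or reuse) a session; its sparkContext replaces a directly constructed SparkContext
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = spark.sparkContext.textFile("hdfs:/dataset/parkinsons.data")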
