To Whom It May Concern,
The code below is being run in a Docker container based on Jupyter's data science notebook;
however, I've installed Java 8 and h2o (version 3.20.0.7), as well as exposed the necessary ports. The Docker container is being run on a system using Ubuntu 16.04 with 32 threads and over 300 GB of RAM.
h2o is using all the threads and 26.67 GB of memory. I'm attempting to classify text as either a 0 or a 1 using the code below.
However, despite setting max_runtime_secs to 900 (15 minutes), the code hadn't finished executing and was still tying up most of the machine's resources ~15 hours later. As a side note, it took about 20 minutes to parse df_train. Any thoughts on what's going wrong?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv('Data.csv')[['Text', 'Classification']]
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english')
x_train_vec = vectorizer.fit_transform(df['Text'])
y_train = df['Classification']
import h2o
from h2o.automl import H2OAutoML
h2o.init()
df_train = h2o.H2OFrame(x_train_vec.A, header=-1, column_names=vectorizer.get_feature_names())
df_labels = h2o.H2OFrame(y_train.reset_index()[['Classification']])
df_train = df_train.concat(df_labels)
x_train_cn = df_train.columns
y_train_cn = 'Classification'
x_train_cn.remove(y_train_cn)
df_train[y_train_cn] = df_train[y_train_cn].asfactor()
h2o_aml = H2OAutoML(max_runtime_secs = 900, exclude_algos = ["DeepLearning"])
h2o_aml.train(x = x_train_cn , y = y_train_cn, training_frame = df_train)
lb = h2o_aml.leaderboard
y_predict = h2o_aml.leader.predict(df_train.drop('Classification'))
# convert the H2O prediction frame to a pandas Series before using the sklearn metrics
y_pred = y_predict['predict'].as_data_frame()['predict']
print('accuracy: {}'.format(accuracy_score(y_pred=y_pred, y_true=y_train)))
print('precision: {}'.format(precision_score(y_pred=y_pred, y_true=y_train)))
print('recall: {}'.format(recall_score(y_pred=y_pred, y_true=y_train)))
print('f1: {}\n'.format(f1_score(y_pred=y_pred, y_true=y_train)))
This is a bug that has been fixed on master. If you want, you can try out the fix now on the nightly release; otherwise, it will be included in the next stable release of H2O, 3.22.
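For reference, a quick way to confirm which build is actually running inside the container is to print the client version after h2o.init() (a minimal sketch, not part of the original answer; h2o.__version__ is the version attribute of the Python client):
import h2o

h2o.init()
# The fix described above lands in 3.22, so a version such as the 3.20.0.7
# mentioned in the question would still show the runaway max_runtime_secs behaviour.
print(h2o.__version__)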
Related
I am using a Spark standalone cluster and running H2O PySparkling on it.
I am unable to find the function for getting the leader model's feature importances. Please help.
Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pyspark import SparkFiles
from pysparkling.ml import H2OAutoML
spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
conf = H2OConf()
hc = H2OContext.getOrCreate(conf)
def xgb_automl_features_importance(data, target_metric):
    # Convert the DataFrame to an H2OFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    # Identify predictors and response
    y = target_metric
    aml = H2OAutoML(labelCol=y)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    aml.fit(sparkDF)
    print('-----------****************')
    print(aml.getLeaderboard().show(truncate=False))
The fit method of H2OAutoML returns the leader model. Every model in Sparkling Water has a getFeatureImportances() method that returns a Spark DataFrame with the feature importances.
model=aml.fit(sparkDF)
model.getFeatureImportances().show()
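For completeness, here is a minimal sketch of how this could slot into the helper function from the question (assuming the same hc, data and target_metric as above; it only uses the calls already shown in the question and answer):
def xgb_automl_features_importance(data, target_metric):
    # Convert the input data to an H2OFrame and then to a Spark DataFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    aml = H2OAutoML(labelCol=target_metric)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    model = aml.fit(sparkDF)               # fit() returns the leader model
    return model.getFeatureImportances()   # Spark DataFrame with the importances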
I would like to use multiprocessing to launch multiple training instances on a CUDA device. Since the data is common between the processes, I want to avoid copying the data for every process. I'm using Python 3.8's SharedMemory from the multiprocessing module to achieve this, following this SO example.
I can allocate a memory block using SharedMemory and create as many processes as I'd like with constant memory (RAM) usage. However, when I try to send tensors to CUDA, the memory scales linearly with the number of processes. It appears as if, when c.to(device) is called, the base data is copied for every process.
Does anyone know why this is happening? Any ideas to mitigate this issue?
Here is the sample code I'm using:
import numpy as np
from multiprocessing import shared_memory, get_context
import time
import torch
import copy
dim = 10000
batch_size = 10
sleep_time = 2
npe = 1 # number of parallel executions
# cuda
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = "cpu"
device = torch.device(dev)
def step(i, shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_arr = np.ndarray((dim, dim), dtype=np.float32, buffer=existing_shm.buf)
    b = np_arr[i * batch_size: (i + 1) * batch_size, :]
    b = torch.Tensor(b)
    # This is just to explicitly copy the tensor so that it has nothing to do
    # with the shared memory block
    c = copy.deepcopy(b)
    # If tensor c is sent to the cuda device, then RAM scales linearly
    # with the number of parallel executions.
    # If c is not sent to cuda device, memory consumption is constant.
    c = c.to(device)
    time.sleep(sleep_time)
    existing_shm.close()
def create_shared_block():
    a = np.random.random((dim, dim)).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='sha')
    np_arr = np.ndarray(a.shape, dtype=np.float32, buffer=shm.buf)
    np_arr[:] = a[:]
    return shm, np_arr
if __name__ == '__main__':
    # create shared memory block
    shm, np_arr = create_shared_block()
    # create list of inputs to be executed in parallel
    inp = [[x, 'sha'] for x in range(npe)]
    print(inp)
    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)
    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)
        pool.starmap(step, inp)
        time.sleep(sleep_time)
    shm.close()
    shm.unlink()
I am using Python 3.7.3 on macOS in an Anaconda environment. TensorFlow (1.14.0), Matplotlib (3.1.0) and other modules are installed there and everything has worked fine. I wrote the following code and ran it.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
def add_layer(inputs, inputs_size, outputs_size, activation_function=None):
    with tf.name_scope('layer'):
        with tf.name_scope('weight'):
            Weights = tf.Variable(tf.random.normal([inputs_size, outputs_size]))
        with tf.name_scope('biase'):
            biases = tf.Variable(tf.zeros([1, outputs_size]) + 0.1)
        with tf.name_scope('wx_plus_b'):
            Wx_plus_b = tf.matmul(inputs, Weights) + biases
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs
'''
multiple lines omitted here
'''
writer = tf.compat.v1.summary.FileWriter("logs/",sess.graph)
I can see a local file named
"events.out.tfevents.1561289962.Botaos-MacBook-Pro.local"
generated in the "logs/" folder. I opened a terminal and cd'd into that folder with the Anaconda environment activated. Then I typed
"python -m tensorboard.main --logdir=‘logs/‘ --host localhost --port 6006"
and got the response
TensorBoard 1.14.0 at http://localhost:6006/ (Press CTRL+C to quit)
Then, no matter whether I use Safari or Chrome to open "http://localhost:6006/", nothing is shown except "No dashboards are active for the current data set."
Actually I also tried other commands such as
python -m tensorboard.main --logd logs --host localhost --port 6006
but there's no difference.
The original code is as follows:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
def add_layer(inputs, inputs_size, outputs_size, activation_function=None):
    with tf.name_scope('layer'):
        with tf.name_scope('weight'):
            Weights = tf.Variable(tf.random.normal([inputs_size, outputs_size]))
        with tf.name_scope('biase'):
            biases = tf.Variable(tf.zeros([1, outputs_size]) + 0.1)
        with tf.name_scope('wx_plus_b'):
            Wx_plus_b = tf.matmul(inputs, Weights) + biases
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs
x_data = np.linspace(-1,1,300,dtype = np.float32)[:,np.newaxis]
noise = np.random.normal(0,0.05,x_data.shape).astype(np.float32)
y_data = np.square(x_data) - 0.5 + noise
with tf.name_scope('inputs'):
    xs = tf.compat.v1.placeholder(tf.float32, [None, 1], name='x_in')
    ys = tf.compat.v1.placeholder(tf.float32, [None, 1], name='y_in')
l1 = add_layer(xs, 1, 10, tf.nn.relu)
prediction = add_layer(l1, 10, 1, None)
with tf.name_scope('loss'):
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(prediction - ys), reduction_indices=[1]))  # no need to do tf.sum() as in link. #tf.reduce_mean()
with tf.name_scope('train'):
    train_step = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)
sess = tf.compat.v1.Session()
writer = tf.compat.v1.summary.FileWriter("logs/",sess.graph)
sess.run(tf.compat.v1.global_variables_initializer())
I think the problem is that you are pointing TensorBoard at the logs directory while you are already inside the logs directory.
Try executing: tensorboard --logdir logs
from the directory containing the logs dir.
I am using image processing code in Python OpenCV, and that process takes a lot of time for, say, 30 images. I tried to process these images in parallel using multiprocessing. The multiprocessing part works well on the CPU, but I want to use it on the GPU (CUDA).
I use torch.multiprocessing for running tasks in parallel, so I am using torch.device('cuda') on our class to run the whole thing on that particular device. When I run the code, it shows the device being used as "cuda", but it is not using any GPU processing.
import cv2
import numpy as np
import torch
import torch.nn as nn
from torch.multiprocessing import Process, Pool, Manager, set_start_method
import sys
import os
class RoadShoulderWidth(nn.Module):
    def __init__(self):
        super(RoadShoulderWidth, self).__init__()
        pass

    # Want to run the method below in parallel for 30 images.
    @staticmethod
    def get_dim(image, road_shoulder_width_list):
        ..... code

    def get_road_shoulder_width(self, _root_dir, _img_path_list):
        manager = Manager()
        road_shoulder_width_list = manager.list()
        processes = []
        for img_path in _img_path_list[:30]:
            img = cv2.imread(_root_dir + '/' + img_path)
            img = img[72 * 5:72 * 6, 0:1280]
            # Do work
            p = Process(target=self.get_dim, args=(img, road_shoulder_width_list))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        return road_shoulder_width_list
Use the set of code below to run your class:
if __name__ == '__main__':
    root_dir = '/home/nikhil_m/r'
    img_path_list = os.listdir(root_dir)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('Using device:', device)
    dataloader_kwargs = {'pin_memory': True}
    set_start_method('fork')
    obj = RoadShoulderWidth().to(device)
    val = obj.get_road_shoulder_width(str(root_dir), img_path_list)
    print(val)
    print(torch.cuda.is_available())
Can anybody suggest how to fix this?
Your class RoadShoulderWidth is an nn.Module subclass, which lets you use .to(device). This only means that all other nn.Module objects or nn.Parameters that are members of your RoadShoulderWidth object are moved to the device. Since, as in your example, there are none, nothing happens.
In general, PyTorch does not move code to the GPU, it moves data. If all the data of a PyTorch operation is on the GPU (e.g. a + b, with a and b on the GPU), then the operation is executed on the GPU. You can move the data with a.to(device), given a is a torch.Tensor object.
PyTorch can only execute its own operations on the GPU. It is not able to execute OpenCV code on the GPU.
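To illustrate the point with a minimal sketch (not part of the original answer, and assuming a CUDA-capable machine): it is the tensors that get moved, and an operation only runs on the GPU when its operands already live there.
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(1000, 1000)   # created on the CPU
b = torch.randn(1000, 1000)

a = a.to(device)              # data (not code) is moved to the GPU
b = b.to(device)

c = a + b                     # both operands are on the GPU, so the add runs there
print(c.device)               # e.g. cuda:0

# OpenCV calls such as cv2.imread() still run on the CPU regardless of the above.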
I tried debugging all the possible solutions, but I am unable to run and scale this on the cluster, as I need to process 100 million records. This script runs very well on a local node as expected, but it fails to run on the Cloudera Amazon cluster. Here is the sample data that works on a local node. As I see it, the problem is that the two files I use in the UDF are not getting distributed to the executors/containers/nodes, so the job just keeps running and processing is very slow. I have been unable to fix this code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
import os
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles
from pyspark.sql import Row, HiveContext, SQLContext
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# StanfordNERTagger is used inside the udf below but was missing from the original imports
from nltk.tag import StanfordNERTagger
def stanford(str):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(str.split())
    organizations = []
    organization = ""
    for t in output:
        # The word
        word = t[0]
        # The current tag
        tag = t[1]
        # print(word, tag)
        # If the current tag is ORGANIZATION, append the current word to the running string
        if (tag == "ORGANIZATION"):
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final
stanford_lassification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_lassification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")