RobustScaler in PySpark - apache-spark-mllib

I would like to use a RobustScaler for preprocessing data. In sklearn it can be found in sklearn.preprocessing.RobustScaler. However, I am using pyspark, so I tried to import it with:
from pyspark.ml.feature import RobustScaler
However, I receive the following error:
ImportError: cannot import name 'RobustScaler' from 'pyspark.ml.feature'
As pault pointed out, RobustScaler is implemented only in pyspark 3. I am trying to implement it as:
import numpy as np
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.ml import Pipeline
from sklearn.preprocessing import RobustScaler

class PySpark_RobustScaler(Pipeline):
    def __init__(self):
        pass

    def fit(self, df):
        return self

    def transform(self, df):
        self._df = df
        for col_name in self._df.columns:
            # approximate quartiles of the column (relativeError=0.0 gives exact quantiles)
            q1, q2, q3 = self._df.approxQuantile(col_name, [0.25, 0.5, 0.75], 0.00)
            # centre on the median and scale by the interquartile range
            self._df = self._df.withColumn(col_name, 2.0*(sf.col(col_name)-q2)/(q3-q1))
        return self._df

arr = np.array(
    [[ 1., -2.,  2.],
     [-2.,  1.,  3.],
     [ 4.,  1., -2.]]
)
rdd1 = sc.parallelize(arr)
rdd2 = rdd1.map(lambda x: [int(i) for i in x])
df_sprk = rdd2.toDF(["A", "B", "C"])
df_pd = pd.DataFrame(arr, columns=list('ABC'))

PySpark_RobustScaler().fit(df_sprk).transform(df_sprk).show()
print(RobustScaler().fit(df_pd).transform(df_pd))
However, I have found that to obtain the same result as sklearn I have to multiply the result by 2. Furthermore, I am worried that if a column has many values close to zero, the interquartile range q3 - q1 could become too small, causing the result to diverge and produce null values.
Does anyone have any suggestions on how to improve this?

This feature has been released in recent pyspark versions.
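For completeness, here is a minimal sketch of the built-in scaler, assuming PySpark >= 3.0 is available; the column names "features" and "scaled" are just illustrative. Note that it operates on a single vector column, so the numeric columns are assembled first:
from pyspark.ml.feature import RobustScaler, VectorAssembler

# Assemble the numeric columns of df_sprk (defined in the question) into one vector column.
assembler = VectorAssembler(inputCols=["A", "B", "C"], outputCol="features")
assembled = assembler.transform(df_sprk)

# Centre on the median and scale by the interquartile range, as sklearn's RobustScaler does.
scaler = RobustScaler(inputCol="features", outputCol="scaled",
                      withCentering=True, withScaling=True,
                      lower=0.25, upper=0.75)
scaler.fit(assembled).transform(assembled).select("scaled").show(truncate=False)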

Related

Why spatial transformer network (STN) is not working on image

I am trying to pass a single-image array through an STN using code from https://github.com/kevinzakka/spatial-transformer-network :
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from stn import spatial_transformer_network as transformer

def STNfn(x):
    print(x.shape)
    B, W, H, C = x.shape

    # identity transform
    initial = np.array([[1., 0, 0], [0, 1., 0]])
    initial = initial.astype('float32').flatten()

    # localization network
    n_fc = 6
    W_fc1 = tf.Variable(tf.zeros([H*W*C, n_fc]), name='W_fc1')
    b_fc1 = tf.Variable(initial_value=initial, name='b_fc1')
    h_fc1 = tf.matmul(tf.zeros([B, H*W*C]), W_fc1) + b_fc1

    # spatial transformer layer
    h_trans = transformer(x, h_fc1)
    return h_trans

fname = 'testimage.jpg'
img = plt.imread(fname)
img = STNfn(np.array([img]))
However, I am getting the following error:
TypeError: Input 'y' of 'Mul' Op has type uint8
that does not match type float32 of argument 'x'.
I have tried to replace float32 with np.uint8, but it does not help.
Where is the problem and how can it be solved?
Maybe n_fc = 6 has to be a float32? I'm not familiar with Python; in Java it would be 6.0f for a float, while just 6 is an integer.
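For what it's worth, here is a hedged sketch of another thing to try, assuming the type mismatch comes from the uint8 image array returned by plt.imread for JPEGs rather than from n_fc: cast the image to float32 before handing it to the network.
import numpy as np
import matplotlib.pyplot as plt

# Assumption: the uint8 image array is what triggers the Mul type error,
# so cast it to float32 (and optionally rescale to [0, 1]) beforehand.
fname = 'testimage.jpg'
img = plt.imread(fname).astype(np.float32) / 255.0
img = STNfn(np.array([img]))  # STNfn as defined in the question above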

Data reshaping in sklearn (Linear regression)

input code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv('test.csv')
data.head()

data['Density'] = data['Flow [Veh/h]'] / data['Speed [km/h]']
data = data.replace(np.nan, 1)

X = data['Density']
y = data['Speed [km/h]']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)  # HERE I GOT AN ERROR
The error message reads: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You can try changing your variable X as follows:
X = data['Density'].values.reshape((-1, 1))
I had faced the same error, where my feature set had only one variable. The above change solved the issue for me.
Try using double brackets when selecting the column, so it stays a DataFrame:
X = data[['Density']]
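For reference, here is a small sketch (with made-up data) of the shapes involved; scikit-learn's fit expects a 2-D feature matrix, which is what both fixes above produce:
import numpy as np
import pandas as pd

# Hypothetical data, just to illustrate the shapes behind the error.
data = pd.DataFrame({'Density': [1.0, 2.0, 3.0], 'Speed [km/h]': [60.0, 50.0, 40.0]})

print(data['Density'].shape)                        # (3,)   1-D Series    -> rejected by fit()
print(data['Density'].values.reshape(-1, 1).shape)  # (3, 1) 2-D array     -> accepted
print(data[['Density']].shape)                      # (3, 1) 2-D DataFrame -> accepted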

PyTorch custom dataset dataloader returns strings (of keys) not tensors

I am trying to load my own dataset using a custom DataLoader that reads in images and labels and converts them to PyTorch tensors. However, when the DataLoader is iterated, x and y are returned as the strings "image" and "labels" rather than the actual values or tensors.
print(self.train_loader) # shows a Tensor object
tic = time.time()
with tqdm(total=self.num_train) as pbar:
    for i, (x, y) in enumerate(self.train_loader): # x and y are returned as string (where it fails)
        if self.use_gpu:
            x, y = x.cuda(), y.cuda()
        x, y = Variable(x), Variable(y)
This is what dataloader.py looks like:
from __future__ import print_function, division #ds
import numpy as np
from utils import plot_images
import os #ds
import pandas as pd #ds
from skimage import io, transform #ds
import torch
from torchvision import datasets
from torch.utils.data import Dataset, DataLoader #ds
from torchvision import transforms
from torchvision import utils #ds
from torch.utils.data.sampler import SubsetRandomSampler


class CDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        img_name = os.path.join(self.root_dir,
                                self.frame.iloc[idx, 0]+'.jpg')
        image = io.imread(img_name)
        # image = image.transpose((2, 0, 1))
        labels = np.array(self.frame.iloc[idx, 1])#.as_matrix() #ds
        #landmarks = landmarks.astype('float').reshape(-1, 2)
        #print(image.shape)
        #print(img_name,labels)
        sample = {'image': image, 'labels': labels}
        if self.transform:
            sample = self.transform(sample)
        return sample


class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""
    def __call__(self, sample):
        image, labels = sample['image'], sample['labels']
        #print(image)
        #print(labels)
        # swap color axis because
        # numpy image: H x W x C
        # torch image: C X H X W
        image = image.transpose((2, 0, 1))
        #print(image.shape)
        #print((torch.from_numpy(image)))
        #print((torch.from_numpy(labels)))
        return {'image': torch.from_numpy(image),
                'labels': torch.from_numpy(labels)}
def get_train_valid_loader(data_dir,
                           batch_size,
                           random_seed,
                           #valid_size=0.1, #ds
                           #shuffle=True,
                           show_sample=False,
                           num_workers=4,
                           pin_memory=False):
    """
    Utility function for loading and returning train and valid
    multi-process iterators over the MNIST dataset. A sample
    9x9 grid of the images can be optionally displayed.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Args
    ----
    - data_dir: path directory to the dataset.
    - batch_size: how many samples per batch to load.
    - random_seed: fix seed for reproducibility.
    - #ds valid_size: percentage split of the training set used for
      the validation set. Should be a float in the range [0, 1].
      In the paper, this number is set to 0.1.
    - shuffle: whether to shuffle the train/validation indices.
    - show_sample: plot 9x9 sample grid of the dataset.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - train_loader: training set iterator.
    - valid_loader: validation set iterator.
    """
    #ds
    #error_msg = "[!] valid_size should be in the range [0, 1]."
    #assert ((valid_size >= 0) and (valid_size <= 1)), error_msg
    #ds

    # define transforms
    #normalize = transforms.Normalize((0.1307,), (0.3081,))
    trans = transforms.Compose([
        ToTensor(), #normalize,
    ])

    # load train dataset
    #train_dataset = datasets.MNIST(
    #    data_dir, train=True, download=True, transform=trans
    #)
    train_dataset = CDataset(csv_file='/home/Desktop/6June17/util/train.csv',
                             root_dir='/home/caffe/data/images/', transform=trans)

    # load validation dataset
    #valid_dataset = datasets.MNIST( #ds
    #    data_dir, train=True, download=True, transform=trans #ds
    #)
    valid_dataset = CDataset(csv_file='/home/Desktop/6June17/util/eval.csv',
                             root_dir='/home/caffe/data/images/', transform=trans)

    num_train = len(train_dataset)
    train_indices = list(range(num_train))
    #ds split = int(np.floor(valid_size * num_train))
    num_valid = len(valid_dataset) #ds
    valid_indices = list(range(num_valid)) #ds

    #if shuffle:
    #    np.random.seed(random_seed)
    #    np.random.shuffle(indices)

    #ds train_idx, valid_idx = indices[split:], indices[:split]
    train_idx = train_indices #ds
    valid_idx = valid_indices #ds

    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    print(train_loader)

    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )

    # visualize some images
    if show_sample:
        sample_loader = torch.utils.data.DataLoader(
            dataset, batch_size=9, #shuffle=shuffle,
            num_workers=num_workers, pin_memory=pin_memory
        )
        data_iter = iter(sample_loader)
        images, labels = data_iter.next()
        X = images.numpy()
        X = np.transpose(X, [0, 2, 3, 1])
        plot_images(X, labels)

    return (train_loader, valid_loader)
def get_test_loader(data_dir,
                    batch_size,
                    num_workers=4,
                    pin_memory=False):
    """
    Utility function for loading and returning a multi-process
    test iterator over the MNIST dataset.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Args
    ----
    - data_dir: path directory to the dataset.
    - batch_size: how many samples per batch to load.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - data_loader: test set iterator.
    """
    # define transforms
    #normalize = transforms.Normalize((0.1307,), (0.3081,))
    trans = transforms.Compose([
        ToTensor(), #normalize,
    ])

    # load dataset
    #dataset = datasets.MNIST(
    #    data_dir, train=False, download=True, transform=trans
    #)
    test_dataset = CDataset(csv_file='/home/Desktop/6June17/util/test.csv',
                            root_dir='/home/caffe/data/images/', transform=trans)

    test_loader = torch.utils.data.DataLoader(
        test_dataset, batch_size=batch_size, shuffle=False,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    return test_loader


#for i_batch, sample_batched in enumerate(dataloader):
#    print(i_batch, sample_batched['image'].size(),
#          sample_batched['landmarks'].size())
#
#    # observe 4th batch and stop.
#    if i_batch == 3:
#        plt.figure()
#        show_landmarks_batch(sample_batched)
#        plt.axis('off')
#        plt.ioff()
#        plt.show()
#        break
A minimal working sample is difficult to post here, but essentially I am trying to modify this project, http://torch.ch/blog/2015/09/21/rmva.html, which works smoothly with MNIST; I just want to run it on my own dataset with the custom dataloader.py shown above.
It instantiates a Dataloader like this:
in trainer.py:
if config.is_train:
    self.train_loader = data_loader[0]
    self.valid_loader = data_loader[1]
    self.num_train = len(self.train_loader.sampler.indices)
    self.num_valid = len(self.valid_loader.sampler.indices)
-> run from main.py:
if config.is_train:
    data_loader = get_train_valid_loader(
        config.data_dir, config.batch_size,
        config.random_seed, #config.valid_size,
        #config.shuffle,
        config.show_sample, **kwargs
    )
You are not properly using python's enumerate(). (x, y) are currently assigned the 2 keys of your batch dictionary i.e. the strings "image" and "labels". This should solve your problem:
for i, batch in enumerate(self.train_loader):
    x, y = batch["image"], batch["labels"]
    # ...
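Alternatively, if you want to keep the (x, y) unpacking in the training loop, here is a sketch of a __getitem__ that returns a tuple instead of a dict; this variant is an assumption on my part, not part of the original project:
# Hypothetical variant of CDataset.__getitem__ returning a tuple, so that
# `for i, (x, y) in enumerate(loader)` yields tensors directly.
def __getitem__(self, idx):
    img_name = os.path.join(self.root_dir, self.frame.iloc[idx, 0] + '.jpg')
    image = io.imread(img_name)
    labels = np.array(self.frame.iloc[idx, 1])
    sample = {'image': image, 'labels': labels}
    if self.transform:
        sample = self.transform(sample)
    return sample['image'], sample['labels']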

pandas series sort_index() not working with kind='mergesort'

I needed stable index sorting for DataFrames when I ran into this problem:
In cases where a DataFrame becomes a Series (when only a single column matches the selection), the kind argument raises an error. See example:
import pandas as pd
df_a = pd.Series(range(10))
df_b = pd.Series(range(100, 110))
df = pd.concat([df_a, df_b])
df.sort_index(kind='mergesort')
with the following error:
----> 6 df.sort_index(kind='mergesort')
TypeError: sort_index() got an unexpected keyword argument 'kind'
With DataFrames (when more than one column is selected), mergesort works fine.
EDIT:
When selecting a single column from a DataFrame for example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame(np.array(range(25)).reshape(5,5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5,5))
df = pd.concat([df_a, df_b])
the following returns an error:
df[0].sort_index(kind='mergesort')
...since the selection is cast to a pandas Series and, as pointed out, the pandas.Series.sort_index documentation contains a bug.
However,
df[[0]].sort_index(kind='mergesort')
works alright, since its type continues to be a DataFrame.
pandas.Series.sort_index() has no kind parameter.
Here is the definition of this function for pandas 0.18.1 (file: ./pandas/core/series.py):
# line 1729
@Appender(generic._shared_docs['sort_index'] % _shared_doc_kwargs)
def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
               sort_remaining=True):
    axis = self._get_axis_number(axis)
    index = self.index

    if level is not None:
        new_index, indexer = index.sortlevel(level, ascending=ascending,
                                             sort_remaining=sort_remaining)
    elif isinstance(index, MultiIndex):
        from pandas.core.groupby import _lexsort_indexer
        indexer = _lexsort_indexer(index.labels, orders=ascending)
        indexer = com._ensure_platform_int(indexer)
        new_index = index.take(indexer)
    else:
        new_index, indexer = index.sort_values(return_indexer=True,
                                               ascending=ascending)

    new_values = self._values.take(indexer)
    result = self._constructor(new_values, index=new_index)

    if inplace:
        self._update_inplace(result)
    else:
        return result.__finalize__(self)
And in file ./pandas/core/generic.py, line 39:
_shared_doc_kwargs = dict(axes='keywords for axes', klass='NDFrame',
                          axes_single_arg='int or labels for object',
                          args_transpose='axes to permute (int or label for'
                                         ' object)')
So most probably it's a bug in the pandas documentation...
Your df is a Series, not a DataFrame.
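As a workaround (a sketch, under the assumption that you only need the stable sort on a single column), you can route the call through a one-column DataFrame so the DataFrame version of sort_index, which does accept kind, is used, and then take the column back out:
# Sort via the DataFrame method, then go back to a Series.
sorted_col = df[[0]].sort_index(kind='mergesort')[0]

# Or, for a standalone Series s:
# sorted_s = s.to_frame().sort_index(kind='mergesort').iloc[:, 0]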

Using a complex likelihood in PyMC3

pymc.__version__ = '3.0'
theano.__version__ = '0.6.0.dev-RELEASE'
I'm trying to use PyMC3 with a complex likelihood function:
First question: Is this possible?
Here's my attempt using Thomas Wiecki's post as a guide:
import numpy as np
import theano as th
import pymc as pm
import scipy as sp

# Actual data I'm trying to fit
x = np.array([52.08, 58.44, 60.0, 65.0, 65.10, 66.0, 70.0, 87.5, 110.0, 126.0])
y = np.array([0.522, 0.659, 0.462, 0.720, 0.609, 0.696, 0.667, 0.870, 0.889, 0.919])
yerr = np.array([0.104, 0.071, 0.138, 0.035, 0.102, 0.096, 0.136, 0.031, 0.024, 0.035])

th.config.compute_test_value = 'off'
a = th.tensor.dscalar('a')

with pm.Model() as model:
    # Priors
    alpha = pm.Normal('alpha', mu=0.3, sd=5)
    sig_alpha = pm.Normal('sig_alpha', mu=0.03, sd=5)
    t_double = pm.Normal('t_double', mu=4, sd=20)
    t_delay = pm.Normal('t_delay', mu=21, sd=20)
    nu = pm.Uniform('nu', lower=0, upper=20)

    # Some functions needed for calculation of the y estimator
    def T(eqd):
        doses = np.array([52.08, 58.44, 60.0, 65.0, 65.10,
                          66.0, 70.0, 87.5, 110.0, 126.0])
        tmt_times = np.array([29, 29, 43, 29, 36, 48, 22, 11, 7, 8])
        return np.interp(eqd, doses, tmt_times)

    def TCP(a):
        time = T(x)
        BCP = pm.exp(-1E7*pm.exp(-alpha*x*1.2 + 0.69315/t_delay(time-t_double)))
        return pm.prod(BCP)

    def normpdf(a, alpha, sig_alpha):
        return 1./(sig_alpha*pm.sqrt(2.*np.pi))*pm.exp(-pm.sqr(a-alpha)/(2*pm.sqr(sig_alpha)))

    def normcdf(a, alpha, sig_alpha):
        return 1./2.*(1+pm.erf((a-alpha)/(sig_alpha*pm.sqrt(2))))

    def integrand(a):
        return normpdf(a, alpha, sig_alpha)/(1.-normcdf(0, alpha, sig_alpha))*TCP(a)

    func = th.function([a, alpha, sig_alpha, t_double, t_delay], integrand(a))
    y_est = sp.integrate.quad(func(a, alpha, sig_alpha, t_double, t_delay), 0, np.inf)[0]

    likelihood = pm.T('TCP', mu=y_est, nu=nu, observed=y_tcp)

    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=True)
which produces the following message regarding the expression for y_est:
TypeError: ('Bad input argument to theano function with name ":42" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
I've overcome various other hurdles to get this far, and this is where I'm stuck. So, provided the answer to my first question is 'yes', then am I on the right track? Any guidance would be helpful!
N.B. Here is a similar question I found, and another.
Disclaimer: I'm very new at this. My only previous experience is successfully reproducing the linear regression example in Thomas' post. I've also successfully run the Theano test suite, so I know it works.
Yes, it's possible to build a model with a complex or arbitrary likelihood, though that doesn't seem to be what you're doing here. It looks like you have a complex transformation of one variable into another: the integration step.
Your particular exception arises because integrate.quad expects a numpy array, not a pymc Variable. If you want to do quad within pymc, you'll have to write a custom Theano Op (with a derivative) for it.
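For the integration part, here is a rough sketch of what such a wrapper Op could look like; the class name and the toy integrand are illustrative only, and no grad method is implemented, so a gradient-based sampler like NUTS would still not work with it (a non-gradient step method such as Metropolis would be needed):
import numpy as np
import theano
import theano.tensor as tt
from scipy import integrate, stats

class QuadOp(theano.Op):
    """Hypothetical Op wrapping scipy.integrate.quad so its result can be
    used inside a Theano/PyMC3 graph. No gradient is defined."""

    def make_node(self, alpha, sig_alpha):
        alpha = tt.as_tensor_variable(alpha)
        sig_alpha = tt.as_tensor_variable(sig_alpha)
        return theano.Apply(self, [alpha, sig_alpha], [tt.dscalar()])

    def perform(self, node, inputs, output_storage):
        alpha, sig_alpha = inputs
        # Toy integrand: a normal pdf integrated over [0, inf).
        val, _ = integrate.quad(
            lambda a: stats.norm.pdf(a, loc=alpha, scale=sig_alpha),
            0.0, np.inf)
        output_storage[0][0] = np.asarray(val)

# Usage inside a model (hypothetical):
# y_est = QuadOp()(alpha, sig_alpha)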
