statsmodels Error Message: "ValueError: v must be > 1 when p >= .9"

I am trying to perform a multiple sample comparison and Tukey HSD using the statsmodels module, but I keep getting this error message: "ValueError: v must be > 1 when p >= .9". I have tried looking this up on the internet for a possible solution, but to no avail. Could anyone familiar with this module help me decipher what I am doing wrong to trigger this error? I use Python 2.7.x and Spyder. Below is a sample of my data and my print statement. Thanks!
import numpy as np
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,MultiComparison)
###--- Here are the data I am using:
data1 = np.array([ 1, 1, 1, 1, 976, 24, 1, 1, 15, 15780])
data2 = np.array(['lau15', 'gr17', 'fri26', 'bays29', 'dantzig4', 'KAT38','HARV50', 'HARV10', 'HARV20', 'HARV41'], dtype='|S8')
####--- Here's my print statement code:
print pairwise_tukeyhsd(data1, data2, alpha=0.05)

It seems you have to provide more than a single observation per group for the test to work.
Minimal example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = [1, 2, 3]
groups = ['a', 'b', 'c']

print("1st try:")
try:
    print(pairwise_tukeyhsd(data, groups, alpha=0.05))
except ValueError as ve:
    print("whoops!", ve)

data.append(2)
groups.append('a')

print("2nd try:")
try:
    print(pairwise_tukeyhsd(data, groups, alpha=0.05))
except ValueError as ve:
    print("whoops!", ve)
Output:
1st try:
/home/user/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
**kwargs)
/home/user/.local/lib/python3.7/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
whoops! v must be > 1 when p >= .9
2nd try:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------
a b 0.5 0.1 -16.045 17.045 False
a c 1.5 0.1 -15.045 18.045 False
b c 1.0 0.1 -18.1046 20.1046 False
----------------------------------------------------
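Applied to the original question: each group name in data2 appears exactly once, so every group has a single observation and the within-group variance is undefined. Here is a hypothetical sketch of what the OP's setup would need to look like; the replicate values below are made up purely for illustration, not real measurements:
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# several (invented) observations per group instead of one
values = np.array([1, 2, 1, 976, 990, 950, 15, 12, 18])
groups = np.array(['lau15', 'lau15', 'lau15',
                   'dantzig4', 'dantzig4', 'dantzig4',
                   'HARV20', 'HARV20', 'HARV20'])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))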

Related

How to solve the error '[not a vector]'

I ran this code to find the norms of some fundamental units of a biquadratic number field, but I faced the following problem:
\\ lis4 must be initialised beforehand, e.g. lis4 = [];
for (q = 5, 200,
  for (p = q + 1, 200,
    if (isprime(p) == 1 && isprime(q) == 1,
      k1 = bnfinit(y^2 - 2*p, 1);
      k2 = bnfinit(y^2 - q, 1);
      k3 = bnfinit(y^2 - 2*p*q, 1);
      ep1 = k1[8][5][1];
      ep2 = k2[8][5][1];
      ep3 = k3[8][5][1];
      normep1 = nfeltnorm(k1, ep1);
      normep2 = nfeltnorm(k2, ep2);
      normep3 = nfeltnorm(k3, ep3);
      li = [[q, p], [normep1, normep2, normep3]];
      lis4 = concat(lis4, [li]))))
It works for small p and q. However, when I run it for p and q greater than 150, it gives the error below.
At first I didn't use the flag = 1 for bnfinit, but after adding it I still get the same error.
Please do not use indexing like ...[8][5][1] to get the fundamental units (FU); it seems that bnfinit omits the FU matrix for some p and q. Instead, use the member function fu to retrieve the FU. An example:
> [q, p] = [23, 109];
> k = bnfinit(y^2 - 2*p*q, 1);
> k[8][5]
[;]
> k[8][5][1] \\ you will get the error here trying to index the empty matrix.
...
incorrect type in _[_] OCcompo1 [not a vector] (t_MAT).
> k.fu
[Mod(-355285121749346859670064114879166870*y - 25157598731408198132266996072608016699, y^2 - 5014)]
> norm(k.fu[1])
1

Fitting Lightgbm distributed with lgb.train hangs

I'm trying to learn how to use lightgbm distributed.
I wrote a simple hello-world kind of code: I take the iris dataset with 150 rows and split it into train (100 rows) and test (50 rows). The train and test sets are then each further split into two parts, and each part is fed to one of two machines with the appropriate rank.
The problem I see is that lgb.train hangs.
Here is the code:
import argparse
import logging
import socket

import lightgbm as lgb
import pandas as pd
from sklearn import datasets

print('lightgbm', lgb.__version__)

HOST = socket.gethostname()
ip_address = socket.gethostbyname(HOST)
print("IP=", ip_address)

# looks like lightgbm operates only with ip addresses
IPS = ['10.121.22.166', '10.121.22.83']
assert ip_address in IPS

logger = logging.getLogger(__name__)

pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 10000)
pd.set_option('max_colwidth', 100)
pd.set_option('precision', 5)


def read_train_data(rank):
    iris = datasets.load_iris()
    iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

    partition = rank
    assert partition < 2

    separate = 100
    train_df = iris_df.iloc[:separate]
    test_df = iris_df.iloc[separate:]

    separate_train = 60
    separate_test = 30
    if partition == 0:
        train_df = train_df.iloc[:separate_train]
        test_df = test_df.iloc[:separate_test]
    else:
        train_df = train_df.iloc[separate_train:]
        test_df = test_df.iloc[separate_test:]

    def get_lgb_dataset(df):
        target_column = df.columns[-1]
        columns = df.columns[:-1]
        assert target_column not in columns
        print('Target column', target_column)
        x = df[columns]
        y = df[target_column]
        print(x)
        ds = lgb.Dataset(free_raw_data=False, data=x, label=y, params={
            "enable_bundle": False
        })
        ds.construct()
        return ds

    dtrain = get_lgb_dataset(train_df)
    dtest = get_lgb_dataset(test_df)
    return dtrain, dtest


def train(args):
    port0 = 56456
    rank = IPS.index(ip_address)
    print("Rank=", rank, HOST)
    print("RR", rank)
    dtrain, dtest = read_train_data(rank=rank)

    params = {'boosting_type': 'gbdt',
              'class_weight': None,
              'colsample_bytree': 1.0,
              'importance_type': 'split',
              'learning_rate': 0.1,
              'max_depth': 2,
              'min_child_samples': 20,
              'min_child_weight': 0.001,
              'min_split_gain': 0.0,
              'n_estimators': 1,
              'num_leaves': 31,
              'objective': 'regression',
              'metric': 'rmse',
              'random_state': None,
              'reg_alpha': 0.0,
              'reg_lambda': 0.0,
              'silent': False,
              'subsample': 1.0,
              'subsample_for_bin': 200000,
              'subsample_freq': 0,
              'tree_learner': 'data_parallel',
              'num_threads': 48,
              'machines': ','.join([f'{machine}:{port0}' for machine in IPS]),
              'local_listen_port': port0,
              'time_out': 120,
              'num_machines': len(IPS)
              }
    print(params)
    logging.info("starting to train lgb at node with rank %d", rank)
    evals_result = {}
    if args.scikit == 1:
        print("Using scikit learn")
        bst = lgb.sklearn.LGBMRegressor(**params)
        bst.fit(
            dtrain.data,
            dtrain.label,
            eval_set=[(dtest.data, dtest.label)],
        )
    else:
        print("Using regular LGB")
        bst = lgb.train(params,
                        dtrain,
                        valid_sets=[dtest],
                        evals_result=evals_result)
    print(evals_result)
    logging.info("finish xgboost training at node with rank %d", rank)
    return bst


def main(args):
    logging.info("starting the train job")
    model = train(args)
    pd.set_option('display.max_rows', 500)
    print("OUT", model.__class__)
    try:
        print(model.trees_to_dataframe())
    except Exception:
        print(model.booster_.trees_to_dataframe())


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--scikit',
        help='scikit',
        default=0,
        type=int,
    )
    main(parser.parse_args())
I can run it with the scikit fit interface on the two machines with: python simple_distributed_lgb_test.py --scikit 1
It produces a reasonable result.
However, when I use --scikit 0 (which uses lgb.train), fitting just hangs on both nodes. Last messages before it hangs:
[LightGBM] [Info] Total Bins 22
[LightGBM] [Info] Number of data points in the train set: 40, number of used features: 2
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Start training from score 0.873750
Is that a bug or expected behavior? dask.py in lightgbm does use the scikit-learn fit interface.
I use an overnight master version, 3.2.1.99 (commit 5b7a6f3e7150aeb704d1dd2b852d246af3e913a3 from Jul 12, to be exact).
UPDATE 1
I'm trying to dig into the code. So far I see a few things:
The scikit fit interface appears to have an extra synchronization step before fitting the first tree; lgb.train doesn't have it. I don't know yet where it comes from (I see some Network::Allreduce operations).
With the scikit interface the workers are synchronized: each worker knows the correct sizes of the blocks to send and receive during the reducescatter operations. For example, on the first allreduce, worker1 sends 208 blocks and receives 368 blocks of data (in Linkers::SendRecv), while worker2 is reversed, sending 368 and receiving 208, so the allreduce completes fine.
With lgb.train, on the contrary, the workers are not synchronized: each worker has its own numbers for send and receive blocks during the reducescatter at the first DataParallelTreeLearner::FindBestSplits encounter, and they don't match. Worker1 sends 208 and wants to receive 400; worker2 sends 192 and wants to receive 176. So the worker that wants to receive more just hangs, and the other worker eventually hangs too.
Possibly it has something to do with lgb.Dataset; that thing may need to have the same bins or something. I tried to force that with the forcedbins_filename parameter, but it doesn't seem to help with lgb.train.
UPDATE 2
Success. If I remove the following line from the example:
ds.construct()
everything works. So I guess we can't call construct() on the Dataset when using distributed training.
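For reference, here is a minimal sketch of the corrected helper from the script above (same names as before); the comment about bin synchronization is my reading of the symptoms in UPDATE 1, not something confirmed by the docs:
import lightgbm as lgb

def get_lgb_dataset(df):
    """Build the Dataset lazily; `df` carries the target in its last column."""
    target_column = df.columns[-1]
    x = df[df.columns[:-1]]
    y = df[target_column]
    # No ds.construct() here: constructing eagerly seems to fix per-worker
    # bin boundaries, so leave construction to lgb.train(), which runs it
    # after the distributed params (machines, ranks) take effect.
    return lgb.Dataset(free_raw_data=False, data=x, label=y,
                       params={"enable_bundle": False})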

RobustScaler in PySpark

I would like to use a RobustScaler for preprocessing data. In sklearn it can be found in sklearn.preprocessing.RobustScaler. However, I am using pyspark, so I tried to import it with:
from pyspark.ml.feature import RobustScaler
However, I receive the following error:
ImportError: cannot import name 'RobustScaler' from 'pyspark.ml.feature'
As pault pointed out, RobustScaler is implemented only in pyspark 3. I am trying to implement it as:
import numpy as np
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.ml import Pipeline
from sklearn.preprocessing import RobustScaler

class PySpark_RobustScaler(Pipeline):
    def __init__(self):
        pass

    def fit(self, df):
        return self

    def transform(self, df):
        self._df = df
        for col_name in self._df.columns:
            q1, q2, q3 = self._df.approxQuantile(col_name, [0.25, 0.5, 0.75], 0.00)
            self._df = self._df.withColumn(col_name, 2.0 * (sf.col(col_name) - q2) / (q3 - q1))
        return self._df

arr = np.array(
    [[ 1., -2.,  2.],
     [-2.,  1.,  3.],
     [ 4.,  1., -2.]]
)
rdd1 = sc.parallelize(arr)  # assumes an active SparkContext `sc`
rdd2 = rdd1.map(lambda x: [int(i) for i in x])
df_sprk = rdd2.toDF(["A", "B", "C"])
df_pd = pd.DataFrame(arr, columns=list('ABC'))

PySpark_RobustScaler().fit(df_sprk).transform(df_sprk).show()
print(RobustScaler().fit(df_pd).transform(df_pd))
However, I have found that to obtain the same result as sklearn I have to multiply the result by 2 (hence the factor 2.0 in transform above). Furthermore, I am worried that if a column has many values close to zero, the interquartile range q3 - q1 could become too small and make the result diverge, creating null values.
Does anyone have any suggestions on how to improve it?
This feature has been released in recent pyspark versions (3.0 and above).
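A minimal sketch of the built-in scaler, assuming pyspark >= 3.0 and an active SparkSession named spark; like the other pyspark.ml scalers it expects the features assembled into a single vector column:
from pyspark.ml.feature import RobustScaler
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([1.0, -2.0, 2.0]),),
     (Vectors.dense([-2.0, 1.0, 3.0]),),
     (Vectors.dense([4.0, 1.0, -2.0]),)],
    ["features"])

# Center on the median and scale by the IQR (quantiles 0.25 to 0.75).
scaler = RobustScaler(inputCol="features", outputCol="scaled",
                      withCentering=True, withScaling=True)
scaler.fit(df).transform(df).show(truncate=False)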

Lua Random number generator always produces the same number

I have looked up several tutorials on how to generate random numbers with Lua, and each said to use math.random(), so I did. However, every time I use it I get the same number, always the lowest possible one, even after rewriting the code. I even included a random seed based on the OS time. Code below.
require "math"
math.randomseed(os.time())
num = math.random(0,10)
print(num)
I'm using the random function like this:
math.randomseed(os.time())
-- calling math.random() a few times first discards the poorly mixed
-- initial values; the chain of `and`s returns the last call's result
num = math.random() and math.random() and math.random() and math.random(0, 10)
This works fine. Another option would be to improve the built-in random function, as described here.
This might help! I had to use these functions to write a class that generates Nano IDs. I basically took the fractional digits of os.clock() and passed them to math.randomseed().
NanoId = {
    validCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-",
    generate = function (size, validChars)
        local response = ""
        -- reseed with the fractional digits of os.clock()
        local ms = string.match(tostring(os.clock()), "%d%.(%d+)")
        math.randomseed(tonumber(ms))
        if size > 0 and string.len(validChars) > 0 then
            for i = 1, size do
                local num = math.random(string.len(validChars))
                response = response .. string.sub(validChars, num, num)
            end
        end
        return response
    end
}

function NanoId:Generate()
    return self.generate(21, self.validCharacters)
end

-- Runtime Testing
for i = 1, 10 do
    print(NanoId:Generate())
end
--[[
Output:
>>> p2r2-WqwvzvoIljKa6qDH
>>> pMoxTET2BrIjYUVXNMDNH
>>> w-nN7J0RVDdN6-R9iv4i-
>>> cfRMzXB4jZmc3quWEkAxj
>>> aFeYCA2kgOx-s4UN02s0s
>>> xegA--_EjEmcDk3Q1zh7K
>>> 6dkVRaNpW4cMwzCPDL3zt
>>> R2Fct5Up5OwnHeExDnqZI
>>> JwnlLZcp8kml-MHUEFAgm
>>> xPr5dULuv48UMaSTzdW5J
]]

volemont/insights:chart.EquityCurve.R: a bug in graphing peaks of cumulative return?

I came across a function for graphing the cumulative return of a strategy and the peaks of that return in a great example of combining shiny and quantstrat, thanks to Simon Otziger. The source code is here. The code works fine most of the time, but for some data it won't graph the peaks properly.
The code below is simplified, but the key logic is unchanged. I ran it with three sets of data (cumPNL1, cumPNL2, cumPNL3) copied from three example strategies; the first set causes the code to fail to graph the peaks properly.
I ran the following code with cumPNL1, cumPNL2, and cumPNL3 separately. With both cumPNL2 and cumPNL3 the code produces the cumulative return line and peak points successfully. With cumPNL1, however, it only produces the line; the peaks end up in the wrong positions.
I noticed that peakIndex based on cumPNL2 or cumPNL3 has its first value equal to TRUE, so when I modify the code by adding the line peakIndex[1] <- TRUE, cumPNL1 works fine as well.
Though it now works with the modified code, I have no idea why it behaves like this. Could anyone have a look? Thanks.
cumPNL1 <- c(-193,-345,-406,-472,-562,-543,-450,-460,-544,-659,-581,-342,-384,276,-858,-257.99)
cumPNL2 <- c(35.64,4.95,-2.97,-6.93,11.88,-19.8,-26.73,-39.6,-49.5,-50.49,-51.48,-48.51,-50.49,-55.44,143.55,770.22,745.47,691.02,847.44,1141.47,1007.82,1392.93,1855.26,1863.18,2536.38,2778.93,2811.6,2859.12,2417.58)
cumPNL3 <- c(35.64,4.95,-2.97,-6.93,11.88,-19.8,-26.73,-39.6,-49.5,-50.49,-51.48,-48.51,-50.49,-55.44,143.55,770.22,745.47,691.02,847.44,1141.47,1007.82,1392.93,1855.26,1863.18,2536.38,2778.93,2811.6,2859.12,2417.58)
peakIndex <- c(cumPNL3[1] > 0, diff(cummax(cumPNL3)) > 0)
# peakIndex[1] <- TRUE
dev.new()
plot(cumPNL3, type='n', xlab="index of trades", ylab="returns in cash", main="cumulative returns and peaks")
grid()
lines(cumPNL3)
points(cbind(1 : length(cumPNL3), cumPNL3)[peakIndex, ],
       pch = 19, col = 'green', cex = 0.6)
legend(x = 'bottomright', inset = 0.1,
       legend = c('Net Profit', 'Peaks'),
       lty = c(1, NA), pch = c(NA, 19),
       col = c('black', 'green'))
cumPNL1 has a single peak, so R drops the dimension and reduces the indexed result from a numeric matrix to a numeric vector of length 2. The points function then plots the two vector values on the y-axis against the x-axis indices 1 and 2:
peakIndex1 <- c(cumPNL1[1] > 0, diff(cummax(cumPNL1)) > 0)
peakIndex3 <- c(cumPNL3[1] > 0, diff(cummax(cumPNL3)) > 0)
str(cbind(1 : length(cumPNL1), cumPNL1)[peakIndex1,])
str(cbind(1 : length(cumPNL3), cumPNL3)[peakIndex3,])
Output:
> str(cbind(1 : length(cumPNL1), cumPNL1)[peakIndex1,])
 Named num [1:2] 14 276
 - attr(*, "names")= chr [1:2] "" "cumPNL1"
> str(cbind(1 : length(cumPNL3), cumPNL3)[peakIndex3,])
 num [1:12, 1:2] 1 15 16 19 20 22 23 24 25 26 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "" "cumPNL3"
Usually setting drop = FALSE preserves the matrix, e.g., str(cbind(1 : length(cumPNL3), cumPNL3)[peakIndex3, drop = FALSE]), which somehow does not work in this case (presumably because the empty column subscript is missing; with it, as in [peakIndex1, , drop = FALSE], the matrix shape is kept). However, changing the points line to the following fixes the problem:
points(seq_along(cumPNL3)[peakIndex], cumPNL3[peakIndex], pch = 19,
col = 'green', cex = 0.6)
Thanks for reporting the issue. I will push the fix to GitHub tomorrow.
