Poor h2o GBM Classification Performance in a balanced binomial response - h2o

In a fairly balanced binomial classification response problem, I am observing unusual level of error in h2o.gbm classification for determining class 0, on train set itself. It is from a competition which is over, so interest is only towards understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 147857 234035 0.612830 =234035/381892
1 44782 271661 0.141517 =44782/316443
Totals 192639 505696 0.399260 =278817/698335
Any expert suggestions to treat the data and reduce the error is welcome.
Following approaches are tried and error is not found decreasing.
Approach 1: Selecting top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting the negative normalized variable as zero and possitive as 1.
#Data Definition
# Variable Definition
#Independent Variables
# ID Unique ID for each observation
# Timestamp Unique value representing one day
# Stock_ID Unique ID representing one stock
# Volume Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range Normalized values of true range for given stock ID
# Average_True_Range Normalized values of average true range for given stock ID
# Positive_Directional_Movement Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create Empty Response SalePrice
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
h2o.shutdown(prompt=F)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 = h2o.gbm( x=features,
y = y,
training_frame =train.hex,
seed=1234
)
h2o.performance(gbmF_model_1)

You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.

Related

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" will be the original value and I have the CV so I can calculate the SD. NORM.INV then provides the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but would like to know if it is valid (i.e., will each row have a different random number between 0 and 1 as the probability? and is this formula equivalent to NORM.INV?
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!

Include only complete groups in panel regression using Stata

I have a panel set of data but not all individuals are present for all periods. I see when I run my xtreg that there are between 1-4 observations per group with a mean of 1.9. I'd like to only include those with 4 observations. Is there any way I can do this easily?
I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is conditioned to groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by #NickCox, if you don't mind losing observations you can drop or keep (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count missings, so you may need to account for that.
On the other hand, you might want to think about why you want to discard that data in your analysis.

How can one analyze the greatest percentage gain (burst) of numbers in sequence in an array?

There are algorithms for detecting the maximum subarray within an array (both contiguous and non-continguous). Most of them are based around having both negative and positive numbers, though. How is it done with positive numbers only?
I have an array of values of a stock over a consequtive range of time (let's say, the array contains values for all consecutive months).
[15.42, 16.42, 17.36, 16.22, 14.72, 13.95, 14.73, 13.76, 12.88, 13.51, 12.67, 11.11, 10.04, 10.38, 10.14, 7.72, 7.46, 9.41, 11.39, 9.7, 12.67, 18.42, 18.44, 18.03, 17.48, 19.6, 19.57, 18.48, 17.36, 18.03, 18.1, 19.07, 21.02, 20.77, 19.92, 18.71, 20.29, 22.36, 22.38, 22.39, 22.94, 23.5, 21.66, 22.06, 21.07, 19.86, 19.49, 18.79, 18.16, 17.24, 17.74, 18.41, 17.56, 17.24, 16.04, 16.05, 15.4, 15.77, 15.68, 16.29, 15.23, 14.51, 14.05, 13.28, 13.49, 13.12, 14.33, 13.67, 13.13, 12.45, 12.48, 11.58, 11.52, 11.2, 10.46, 12.24, 11.62, 11.43, 10.96, 10.63, 10.19, 10.03, 9.7, 9.64, 9.16, 8.96, 8.49, 8.16, 8.0, 7.86, 8.08, 8.02, 7.67, 8.07, 8.37, 8.35, 8.82, 8.58, 8.47, 8.42, 7.92, 7.77, 7.79, 7.6, 7.18, 7.44, 7.74, 7.47, 7.63, 7.21, 7.06, 6.9, 6.84, 6.96, 6.93, 6.49, 6.38, 6.69, 6.49, 6.76]
I need an algorithm to determine for each element the single time period where it had the biggest percentage gain. This could be a time period of 1 month, some span of several months, or the entire array (e.g., 120 months), depending on the stock. I then want to output the burst, in terms of percentage gain, as well as the return (change in price over the original price; so the peak price vs the starting price in the period).
I've combined the max subarray type algorithms, but realized that this problem is a bit different; the array has no negative numbers, so those algorithms just report the entire array as the period and the sum of all elements as the gain.
The algorithms I mentioned are located here and here, with the latter being based on the Master Theorem. Hope this helps.
I'm coding in Ruby but pseudocode would be welcome, too.
I think you went the wrong way ...
I'm not familiar with ruby but let us build the algorithm in pseudocode using your own words :
I've got an array that contains the values of a stock over a range of
time (let's say, for this example, each element is the value of the
stock in a month; the array contains values for all consecutive
months).
We'll name this array StockValues, its length is given by length(StockValues), assume it is 1 based (first item is retrieved with StockValues[1])
I need an algorithm to analyze the array, and determine for each
element the single time period where it had the biggest percentage
gain in price.
You want to know for a given index i at which index j with j>i we have a maximum gain in percent i.e. when gain=100*StockValues[j]/StockValues[i]-100 is maximum.
I then want to output the burst, in terms of percentage gain, as well
as the return(change in price over the original price; so the peak price
vs the starting price in the period).
You want to retrieve the two values burst=gain=100*StockValues[j]/StockValues[i]-100 and return=StockValues[j]-StickValues[i]
The first step will be to loop thru the array and for each element do a second loop to find when the gain is maximum, when we find a maximum we save the values you want in another array named Result (let us assume this array is initialized with invalid values, like burst=-1 which means no gain over any period can be found)
for i=1 to length(StockValues)-1 do
max_gain=0
for j=i+1 to length(StockValues) do
gain=100*StockValues[j]/StockValues[i]-100
if gain>max_gain then
gain=max_gain
Result[i].burst=gain
Result[i].return=StockValues[j]-StockValues[i]
Result[i].start=i
Result[i].end=j
Result[i].period_length=j-i+1
Result[i].start_price=StockValues[i]
Result[i].end_price=StockValues[j]
end if
end for
end for
Note that this algorithm gives the smallest period, if you replace gain>max_gain with gain>=max_gain you'll get the longest period in the case there are more than one period with the same gain value. Only positive or null gains are listed, if there is no gain at all, Result will contain the invalid value. Only period>1 are listed, if period of 1 are accepted then the worst gain possible would be 0%, and you would have to modify the loops i goes to length(StockValues) and j starts at i
This doesn't really sound like several days of work :p unless I'm missing something.
# returns array of percentage gain per period
def percentage_gain(array)
initial = array[0]
after = 0
percentage_gain = []
1.upto(array.size-1).each do |i|
after = array[i]
percentage_gain << (after - initial)/initial*100
initial = after
end
percentage_gain
end
# returns array of amount gain $ per period
def amount_gain(array)
initial = array[0]
after = 0
amount_gain = []
1.upto(array.size-1).each do |i|
after = array[i]
percentage_gain << (after - initial)
initial = after
end
amount_gain
end
# returns the maximum amount gain found in the array
def max_amount_gain(array)
amount_gain(array).max
end
# returns the maximum percentage gain found in the array
def max_percentage_gain(array)
percentage_gain(array).max
end
# returns the maximum potential gain you could've made by shortselling constantly.
# i am basically adding up the amount gained when you would've hit profit.
# on days the stock loses value, i don't add them.
def max_potential_amount_gain(array)
initial = array[0]
after = 0
max_potential_gain = 0
1.upto(array.size-1).each do |i|
after = array[i]
if after - initial > 0
max_potential_gain += after - initial
end
initial = after
end
amount_gain
end
array = [15.42, 16.42, 17.36, 16.22, 14.72, 13.95, 14.73, 13.76, 12.88, 13.51, 12.67, 11.11, 10.04, 10.38, 10.14, 7.72, 7.46, 9.41, 11.39, 9.7, 12.67, 18.42, 18.44, 18.03, 17.48, 19.6, 19.57, 18.48, 17.36, 18.03, 18.1, 19.07, 21.02, 20.77, 19.92, 18.71, 20.29, 22.36, 22.38, 22.39, 22.94, 23.5, 21.66, 22.06, 21.07, 19.86, 19.49, 18.79, 18.16, 17.24, 17.74, 18.41, 17.56, 17.24, 16.04, 16.05, 15.4, 15.77, 15.68, 16.29, 15.23, 14.51, 14.05, 13.28, 13.49, 13.12, 14.33, 13.67, 13.13, 12.45, 12.48, 11.58, 11.52, 11.2, 10.46, 12.24, 11.62, 11.43, 10.96, 10.63, 10.19, 10.03, 9.7, 9.64, 9.16, 8.96, 8.49, 8.16, 8.0, 7.86, 8.08, 8.02, 7.67, 8.07, 8.37, 8.35, 8.82, 8.58, 8.47, 8.42, 7.92, 7.77, 7.79, 7.6, 7.18, 7.44, 7.74, 7.47, 7.63, 7.21, 7.06, 6.9, 6.84, 6.96, 6.93, 6.49, 6.38, 6.69, 6.49, 6.76]

calculate standard deviation of daily data within a year

I have a question,
In Matlab, I have a vector of 20 years of daily data (X) and a vector of the relevant dates (DATES). In order to find the mean value of the daily data per year, I use the following script:
A = fints(DATES,X); %convert to financial time series
B = toannual(A,'CalcMethod', 'SimpAvg'); %calculate average value per year
C = fts2mat(B); %Convert fts object to vector
C is a vector of 20 values. showing the average value of the daily data for each of the 20 years. So far, so good.. Now I am trying to do the same thing but instead of calculating mean values annually, i need to calculate std annually but it seems there is not such an option with function "toannual".
Any ideas on how to do this?
THANK YOU IN ADVANCE
I'm assuming that X is the financial information and it is an even distribution across each year. You'll have to modify this if that isn't the case. Just to clarify, by even distribution, I mean that if there are 20 years and X has 200 values, each year has 10 values to it.
You should be able to do something like this:
num_years = length(C);
span_size = length(X)/num_years;
for n = 0:num_years-1
std_dev(n+1,1) = std(X(1+(n*span_size):(n+1)*span_size));
end
The idea is that you simply pass the date for the given year (the day to day values) into matlab's standard deviation function. That will return the std-dev for that year. std_dev should be a column vector that correlates 1:1 with your C vector of yearly averages.
unique_Dates = unique(DATES) %This should return a vector of 20 elements since you have 20 years.
std_dev = zeros(size(unique_Dates)); %Just pre allocating the standard deviation vector.
for n = 1:length(unique_Dates)
std_dev(n) = std(X(DATES==unique_Dates(n)));
end
Now this is assuming that your DATES matrix is passable to the unique function and that it will return the expected list of dates. If you have the dates in a numeric form I know this will work, I'm just concerned about the dates being in a string form.
In the event they are in a string form you can look at using regexp to parse the information and replace matching dates with a numeric identifier and use the above code. Or you can take the basic theory behind this and adapt it to what works best for you!

Calculating IRR in ruby

Can anyone help me with a method that calculates the IRR of a series of stock trades?
Let's say the scenario is:
$10,000 of stock #1 purchased 1/1 and sold 1/7 for $11,000 (+10%)
$20,000 of stock #2 purchased 1/1 and sold 1/20 for $21,000 (+5%)
$15,000 of stock #3 purchased on 1/5 and sold 1/18 for $14,000 (-6.7%)
This should be helpful: http://www.rubyquiz.com/quiz156.html
But I couldn't figure out how to adapt any of the solutions since they assume the period of each return is over a consistent period (1 year).
I finally found exactly what I was looking for: http://rubydoc.info/gems/finance/1.1.0/Finance/Cashflow
gem install finance
To solve the scenario I posted originally:
include Finance
trans = []
trans << Transaction.new( -10000, date: Time.new(2012,1,1) )
trans << Transaction.new( 11000, date: Time.new(2012,1,7) )
trans << Transaction.new( -20000, date: Time.new(2012,1,1) )
trans << Transaction.new( 21000, date: Time.new(2012,1,20) )
trans << Transaction.new( -15000, date: Time.new(2012,1,5) )
trans << Transaction.new( 14000, date: Time.new(2012,1,18) )
trans.xirr.apr.to_f.round(2)
I also found this simple method: https://gist.github.com/1364990
However, it gave me some trouble. I tried a half dozen different test cases and one of them would raise an exception that I was never able to debug. But the xirr() method in this Finance gem worked for every test case I could throw at it.
For an investment that has an initial value and final value, as is the case with your example data that includes purchase price, sell price and a holding period, you only need to find holding period yield.
Holding period yield is calculated by subtracting 1 from holding period return
HPY = HPR - 1
HPR = final value/initial value
HPY = 11,000/10,000 - 1 = 1.1 - 1 = 0.10 = 10%
HPY = 21,000/20,000 - 1 = 1.05 - 1 = 0.05 = 5%
HPY = 14,000/15,000 - 1 = 0.9333 - 1 = -0.0667 = -6.7%
This article explains holding period return and yield
You can also annualize the holding period return and holding period yield using following formula
AHPR = HPR^(1/n)
AHPY = AHPR - 1
The above formulas only apply if you have a single period return as is the case with your example stock purchase and sale.
Yet if you had multiple returns, for example, you purchased a stock A on 1/1 for 100 and it's closing price over the next week climbed and fell to 98, 103, 101, 100, 99, 104
Then you will have to look beyond what HPR and HPY for multiple returns. In this case you can calculate ARR and GRR. Try out these online calculators for arithmetic rate of return and geometric rate of return.
But then if you had a date schedule for your investments then none of these would apply. You would then have to resort to finding IRR for irregular cash flows. IRR is the internal rate of return for periodic cash flows. For irregular cash flows such as for stock trade, the term XIRR is used. XIRR is an Excel function that calculates internal rate of return for irregular cash flows. To find XIRR you would need a series of cash flows and a date schedule for the cash flows.
Finance.ThinkAndDone.com explains IRR in much more detail than the articles you cited on RubyQuiz and Wiki. The IRR article on Think & Done explains IRR calculation with Newton Raphson method and Secant method using either the NPV equation set to 0 or the profitability index equation set to 1. The site also provides online IRR and XIRR calculators
I don't know anything about finance, but it makes sense to me that if you want to know the rate of return over 6 months, it should be the rate which equals the yearly rate when compounded twice. If you want to know the rate for 3 months, it should be the rate which equals the yearly rate when compounded 4 times, etc. This implies that converting from a yearly return rate to a rate for an arbitrary period is closely related to calculating roots. If you express the yearly return rate as a proportion of the original amount (i.e. express 20% return as 1.2, 100% return as 2.0, etc), then you can get the 6-month return rate by taking the square root of that number.
Ruby has a very handy way to calculate all kinds of complex roots: the exponentiation operator, **.
n ** 0.5 # square root
n ** (1.0/3.0) # 3rd root
...and so on.
So I think you should be able to convert a yearly rate of return to one for an arbitrary period by:
yearly_return ** (days.to_f / 365)
Likewise to convert a daily, weekly, or monthly rate or return to a yearly rate:
yearly_return = daily_return ** 365
yearly_return = weekly_return ** 52
yearly_return = monthly_return ** 12
...and so on.
As far as I can see (from reading the Wikipedia article), the IRR calculation is not actually dependent on the time period used. If you give a series of yearly cash flows as input, you get a yearly rate. If you give a series of daily cash flows as input, you get a daily rate, and so on.
I suggest you use one of the solutions you linked to to calculate IRR for daily or weekly cash flows (whatever is convenient), and convert that to a yearly rate using exponentiation. You will have to add 1 to the output of the irr() method (so that 10% return will be 1.1 rather than 0.1, etc).
Using the daily cash flows for the example you gave, you could do this to get daily IRR:
irr([-30000,0,0,0,-15000,0,11000,0,0,0,0,0,0,0,0,0,0,14000,0,21000])
You can use the Exonio library:
https://github.com/Noverde/exonio
and use it like this:
Exonio.irr([-100, 39, 59, 55, 20]) # ==> 0.28095
I believe that the main problem in order to be able to understand your scenario is the lack of a cash flow for each of the stocks, which is an essential ingredient for computing any type of IRR, without these, none of the formulas can be used. If you clarify this I can help you solve your problem
Heberto del Rio
There is new gem 'finance_math' that solves this problem very easy
https://github.com/kolosek/finance_math

Resources