Weighting a dataset - algorithm

I'm looking for possible algorithms/techniques for assigning weights to a data set. I have a set of, let us call them, IDs which need to be weighted based on different criteria. To keep the example simple, the starting weight (value) of every record is 1.
For example, take the following data set:
<table><tbody><tr><th>ID</th><th>Age</th><th>Postalcode</th><th>Status Insurance A</th><th>Marital Status</th><th>Number of children</th></tr><tr><td>1</td><td>30</td><td>10000</td><td>ON HOLD</td><td>SINGLE</td><td>0</td></tr><tr><td>2</td><td>35</td><td>15000</td><td>ACTIVE</td><td>DIVORCED</td><td>2</td></tr><tr><td>3</td><td>36</td><td>15000</td><td>ACTIVE</td><td>MARRIED</td><td>2</td></tr></tbody></table>
Alongside the data set we have a set of rules, each of which adds to the weight when it applies:
ages (>= 10 and < 20): +0.5
ages (>= 20 and < 30): +0.7
ages (>= 30 and < 99): +0.85
marital status : (married) : +0.50
marital status : (divorced) : +0.45
marital status : (single) : +0.35
status insurance (ACTIVE) : +0.05
status insurance (ON HOLD) : +0.1
Now we need to apply these rules to the data set, so for example the record with ID=1 would have a weight of
1 + 0.85 (age >= 30) + 0.1 (status: ON HOLD) + 0.35 (marital status : SINGLE) = 2.3
This is just a small example; the real set of rules is more complex, can be manipulated by the user, and the weights can be changed.
I could of course implement this the naive way: write some hard-coded logic that goes over the data set record by record and tests whether each rule applies.
I was wondering, however, whether there are better, more generic solutions to this kind of problem.
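One generic approach is to keep the rules as data rather than as hard-coded branches, so that users can add, remove, or re-weight them without touching the scoring loop. The sketch below is only an illustration in Python: the names Rule, in_range, equals, and weigh are made up for this example, and it assumes each age band includes its lower bound, so that age 30 earns +0.85 as in the worked example for ID=1.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    field: str                        # record key the rule inspects
    matches: Callable[[object], bool] # predicate applied to that field's value
    weight: float                     # amount added when the predicate holds

def in_range(low, high):
    """Predicate for low <= value < high (assumed inclusive lower bound)."""
    return lambda value: low <= value < high

def equals(expected):
    """Predicate for an exact match."""
    return lambda value: value == expected

# The rule list is plain data, so it could be loaded from a file or a UI.
RULES = [
    Rule("age", in_range(10, 20), 0.5),
    Rule("age", in_range(20, 30), 0.7),
    Rule("age", in_range(30, 99), 0.85),
    Rule("marital_status", equals("MARRIED"), 0.50),
    Rule("marital_status", equals("DIVORCED"), 0.45),
    Rule("marital_status", equals("SINGLE"), 0.35),
    Rule("insurance_status", equals("ACTIVE"), 0.05),
    Rule("insurance_status", equals("ON HOLD"), 0.1),
]

def weigh(record, rules, base=1.0):
    """Start from the base weight and add every rule that matches."""
    return base + sum(r.weight for r in rules if r.matches(record[r.field]))

records = [
    {"id": 1, "age": 30, "insurance_status": "ON HOLD", "marital_status": "SINGLE"},
    {"id": 2, "age": 35, "insurance_status": "ACTIVE", "marital_status": "DIVORCED"},
    {"id": 3, "age": 36, "insurance_status": "ACTIVE", "marital_status": "MARRIED"},
]
weights = {rec["id"]: round(weigh(rec, RULES), 2) for rec in records}
```

Because every rule is just a field, a predicate, and a weight, changing a weight or adding a criterion means editing the list, not the loop.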

Related

QuickSight add subset of fields

Total AWS QuickSight newbie here. I'm trying to import some cost data in CSV form into QuickSight and add some calculated fields.
The data I have is of the form:
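<table><tbody><tr><th>Type</th><th>Units Consumed</th></tr><tr><td>A</td><td>2</td></tr><tr><td>B</td><td>3</td></tr><tr><td>A</td><td>1</td></tr><tr><td>B</td><td>5</td></tr></tbody></table>
... and so on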
Unit Cost ($) is not part of the dataset and is something like
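<table><tbody><tr><th>Unit Cost</th><th>Amount ($)</th></tr><tr><td>Unit Cost (A)</td><td>1</td></tr><tr><td>Unit Cost (B)</td><td>2</td></tr></tbody></table>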
I would like to compute (either as part of the dataset or as part of an analysis visual, maybe) the total costs for A and B as separate line items. Something like
Total Cost (A) = Sum(Units Consumed where Type = A) * Unit Cost (A)
Total Cost (B) = Sum(Units Consumed where Type = B) * Unit Cost (B)
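To make the intended arithmetic concrete, here is a small Python sketch (not QuickSight syntax) that sums the units per type and multiplies by the unit cost. It uses only the four sample rows shown above and assumes the quantity being summed is Units Consumed; the function name total_cost_by_type is made up for this illustration.

```python
from collections import defaultdict

# The four sample rows from the question: (Type, Units Consumed)
rows = [("A", 2), ("B", 3), ("A", 1), ("B", 5)]

# Unit costs in dollars, which are not part of the dataset itself
unit_cost = {"A": 1, "B": 2}

def total_cost_by_type(rows, unit_cost):
    """Sum the units for each type, then multiply by that type's unit cost."""
    units = defaultdict(int)
    for type_name, consumed in rows:
        units[type_name] += consumed
    return {type_name: units[type_name] * unit_cost[type_name] for type_name in units}

totals = total_cost_by_type(rows, unit_cost)
# For these rows: A = (2 + 1) * 1 = 3 and B = (3 + 5) * 2 = 16
```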
Here are the things I've tried which don't work:
sumOver({Units Consumed}, Type='A')
sumIf({Units Consumed}, Type='A')
To break it down and test smaller parts, I added a calculated field which simply does
sum({Units Consumed})
But it just adds a column to the dataset with every field as "Undefined".
How can I achieve what I'm trying to do?
I tried to replicate the code
sumIf({Units Consumed}, Type='A')
and it worked. Could you check whether Units Consumed is an integer column?
How to change column type

ggpredict : confidence interval for negative binomial models

I used the following code to model count data :
ModActi <- glmmTMB(Median ~ H_veg + D_veg + Landscape + JulianDay +
                     H_veg:D_veg + (1 | Site),
                   data = MyDataActi, family = nbinom2)
I then used the ggpredict function of the ggeffects package to plot the predicted values of my model for the categorical variable "Landscape":
pr1 <- ggpredict(ModActi, "Landscape")
plot(pr1)
I obtain this graph.
As you can see, the lower confidence limits are negative, as if the function had calculated them for a normal distribution.
From the ggpredict help page, it is not clear to me whether there is a way to calculate confidence intervals for a negative binomial distribution (as specified in the model).
EDIT: if I use glmer with a Poisson family, the confidence intervals are correct.
My supervisor found a nice solution by recalculating the standard errors in the predict table :
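pr1 <- ggpredict(ModActi, "Landscape")
# Move the predictions to the log scale, where the interval is symmetric
Ynontransform <- log(pr1$predicted)
# Half-width of the interval on the log scale, taken from the upper limit
SEnontransform <- log(pr1$conf.high) - Ynontransform
# Mirror that half-width below the prediction and back-transform
ConfLow <- exp(Ynontransform - SEnontransform)
pr1$conf.low <- ConfLow
plot(pr1)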
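The idea behind this workaround is that the interval is symmetric on the log scale, so the lower limit can be rebuilt by mirroring the upper limit around the log of the prediction. A minimal Python sketch of that arithmetic follows; the function name and the numbers 4.0 and 8.0 are purely illustrative and are not taken from the model above.

```python
import math

def mirrored_lower_limit(predicted, conf_high):
    """Rebuild the lower limit by mirroring the upper limit on the log scale."""
    log_pred = math.log(predicted)
    half_width = math.log(conf_high) - log_pred  # distance to the upper limit
    return math.exp(log_pred - half_width)       # same distance below, back-transformed

# Hypothetical values: a prediction of 4.0 with an upper limit of 8.0
lower = mirrored_lower_limit(4.0, 8.0)
```

Algebraically this is predicted squared divided by conf_high, so the result stays positive whenever both inputs are positive.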
This happened because glmmTMB only returned predictions on the response scale, and these were not back-transformed. glmmTMB has since been updated on CRAN, and I have also revised ggeffects. You can try out the current dev version at https://github.com/strengejacke/ggeffects, which now computes the CI properly (after updating glmmTMB to version 0.2.1).

Poor h2o GBM Classification Performance in a balanced binomial response

In a fairly balanced binomial classification problem, I am observing an unusually high error rate for class 0 with h2o.gbm, even on the training set itself. The data is from a competition that is over, so my interest is only in understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
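<table><tbody><tr><th>Actual</th><th>0</th><th>1</th><th>Error</th><th>Rate</th></tr><tr><td>0</td><td>147857</td><td>234035</td><td>0.612830</td><td>=234035/381892</td></tr><tr><td>1</td><td>44782</td><td>271661</td><td>0.141517</td><td>=44782/316443</td></tr><tr><td>Totals</td><td>192639</td><td>505696</td><td>0.399260</td><td>=278817/698335</td></tr></tbody></table>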
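For readers who want to check how the Error column is derived, the following Python sketch recomputes the per-class and overall error rates from the four counts in the matrix. The helper names are made up for this illustration; only the counts come from the output above.

```python
# confusion[actual][predicted] = count, copied from the matrix above
confusion = {
    0: {0: 147857, 1: 234035},
    1: {0: 44782, 1: 271661},
}

def class_error(confusion, actual):
    """Share of rows of one actual class that were predicted as another class."""
    row = confusion[actual]
    wrong = sum(count for predicted, count in row.items() if predicted != actual)
    return wrong / sum(row.values())

def overall_error(confusion):
    """Share of all rows whose predicted class differs from the actual class."""
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(count
                for actual, row in confusion.items()
                for predicted, count in row.items()
                if predicted != actual)
    return wrong / total

error_class_0 = class_error(confusion, 0)  # 234035 / 381892
error_class_1 = class_error(confusion, 1)  # 44782 / 316443
error_total = overall_error(confusion)     # 278817 / 698335
```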
Any expert suggestions for treating the data and reducing the error are welcome.
I have tried the following approaches, and neither reduced the error.
Approach 1: Selecting the top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting negative normalized values to 0 and positive ones to 1.
#Data Definition
# Variable: Definition
#Independent Variables
# ID: Unique ID for each observation
# Timestamp: Unique value representing one day
# Stock_ID: Unique ID representing one stock
# Volume: Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average: Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average: Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average: Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average: Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range: Normalized values of true range for given stock ID
# Average_True_Range: Normalized values of average true range for given stock ID
# Positive_Directional_Movement: Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement: Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome: Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create empty response column Outcome
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
#Shut down any H2O cluster that is already running before re-initializing.
h2o.shutdown(prompt = FALSE)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 <- h2o.gbm(x = features,
                        y = y,
                        training_frame = train.hex,
                        seed = 1234)
h2o.performance(gbmF_model_1)
You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.
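To illustrate what a random grid search does, as opposed to trying every combination, here is a plain Python sketch that draws a limited number of distinct hyperparameter combinations from a discrete grid. It does not call H2O at all: sample_random_grid is a made-up helper, and the value lists are arbitrary examples. In practice each sampled combination would be passed to a GBM training run and compared on validation data.

```python
import itertools
import random

def sample_random_grid(hyper_params, max_models, seed=1234):
    """Draw up to max_models distinct combinations from a discrete grid."""
    names = sorted(hyper_params)
    all_combos = list(itertools.product(*(hyper_params[n] for n in names)))
    rng = random.Random(seed)  # seeded so the draw is reproducible
    chosen = rng.sample(all_combos, min(max_models, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]

# Example grid; the candidate values are illustrative, not recommendations.
hyper_params = {
    "ntrees": [50, 100, 200],
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.7, 0.8, 1.0],
}

# 108 combinations exist in total; only 10 are sampled for training.
candidates = sample_random_grid(hyper_params, max_models=10)
```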

SPSS Ranking Data In One Column

I'm still new to SPSS. I have data for the following:
Cereals Vegetables Fruit Meat Dairy Fat Sugar Pulses
I have also computed a total from these variables with this formula:
Total FCS = (Cereals*2)+(Vegetables)+(Fruits)+(Meat*4)+(Dairy*4)+(Sugar*0.5)+(Pulses*3)
Now I want to rank the Total FCS values in one column in order to make a graph from it, as follows:
Rank as:
<28 Poor
>28.5 - <42 Borderline
>42.5 Acceptable
What should I do?
I would use a DO IF statement to assign the ranks. Example below.
DO IF FCS < 28.
COMPUTE RankFCS = 1.
ELSE IF FCS <= 42.5.
COMPUTE RankFCS = 2.
ELSE.
COMPUTE RankFCS = 3.
END IF.
VALUE LABELS RankFCS
1 'Poor'
2 'Borderline'
3 'Acceptable'.
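For comparison outside SPSS, the same scoring and banding can be sketched in Python. This mirrors the cut-offs used in the DO IF syntax above (below 28 is Poor, up to and including 42.5 is Borderline, anything higher is Acceptable) and the weights from the Total FCS formula in the question; the function names and the sample values are made up for this illustration.

```python
# Weights taken from the Total FCS formula in the question
FCS_WEIGHTS = {
    "Cereals": 2, "Vegetables": 1, "Fruits": 1, "Meat": 4,
    "Dairy": 4, "Sugar": 0.5, "Pulses": 3,
}

def total_fcs(row):
    """Weighted sum of the food-group values for one case."""
    return sum(weight * row[group] for group, weight in FCS_WEIGHTS.items())

def rank_fcs(score):
    """Band a score using the same cut-offs as the DO IF example."""
    if score < 28:
        return "Poor"
    if score <= 42.5:
        return "Borderline"
    return "Acceptable"

# Hypothetical case: every food group has the value 2
example_row = {group: 2 for group in FCS_WEIGHTS}
example_score = total_fcs(example_row)  # 2 * 15.5 = 31.0
example_rank = rank_fcs(example_score)
```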
SPSS has a command called RECODE that you can use to create this rank variable. RECODE has two options:
1). Recode into same variables
2). Recode into different variables.
I am using the 2nd option, as you need to create a new rank variable.
* Width 10 so that 'Borderline' and 'Acceptable' are not truncated.
STRING RankFCS (A10).
RECODE FCS (Lowest thru 28='Poor') (28.5 thru 42='Borderline')
(42.5 thru Highest='Acceptable')
INTO RankFCS.
EXECUTE.

Calculating IRR in ruby

Can anyone help me with a method that calculates the IRR of a series of stock trades?
Let's say the scenario is:
$10,000 of stock #1 purchased 1/1 and sold 1/7 for $11,000 (+10%)
$20,000 of stock #2 purchased 1/1 and sold 1/20 for $21,000 (+5%)
$15,000 of stock #3 purchased on 1/5 and sold 1/18 for $14,000 (-6.7%)
This should be helpful: http://www.rubyquiz.com/quiz156.html
But I couldn't figure out how to adapt any of the solutions since they assume the period of each return is over a consistent period (1 year).
I finally found exactly what I was looking for: http://rubydoc.info/gems/finance/1.1.0/Finance/Cashflow
gem install finance
To solve the scenario I posted originally:
include Finance
trans = []
trans << Transaction.new( -10000, date: Time.new(2012,1,1) )
trans << Transaction.new( 11000, date: Time.new(2012,1,7) )
trans << Transaction.new( -20000, date: Time.new(2012,1,1) )
trans << Transaction.new( 21000, date: Time.new(2012,1,20) )
trans << Transaction.new( -15000, date: Time.new(2012,1,5) )
trans << Transaction.new( 14000, date: Time.new(2012,1,18) )
trans.xirr.apr.to_f.round(2)
I also found this simple method: https://gist.github.com/1364990
However, it gave me some trouble. I tried a half dozen different test cases and one of them would raise an exception that I was never able to debug. But the xirr() method in this Finance gem worked for every test case I could throw at it.
For an investment that has an initial value and final value, as is the case with your example data that includes purchase price, sell price and a holding period, you only need to find holding period yield.
Holding period yield is calculated by subtracting 1 from holding period return
HPY = HPR - 1
HPR = final value/initial value
HPY = 11,000/10,000 - 1 = 1.1 - 1 = 0.10 = 10%
HPY = 21,000/20,000 - 1 = 1.05 - 1 = 0.05 = 5%
HPY = 14,000/15,000 - 1 = 0.9333 - 1 = -0.0667 = -6.7%
This article explains holding period return and yield
You can also annualize the holding period return and holding period yield using the following formulas, where n is the length of the holding period in years:
AHPR = HPR^(1/n)
AHPY = AHPR - 1
The above formulas only apply if you have a single period return as is the case with your example stock purchase and sale.
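As a quick check of the formulas above, here is a small Python sketch that computes HPR, HPY, and the annualized return. The function names are made up for this illustration, n is assumed to be the holding period expressed in years, and the two-year example at the end is hypothetical.

```python
def holding_period_return(initial_value, final_value):
    """HPR = final value / initial value."""
    return final_value / initial_value

def holding_period_yield(initial_value, final_value):
    """HPY = HPR - 1."""
    return holding_period_return(initial_value, final_value) - 1

def annualized_hpr(hpr, years):
    """AHPR = HPR^(1/n), with n assumed to be the holding period in years."""
    return hpr ** (1 / years)

# The three trades from the question: (purchase value, sale value)
trades = [(10000, 11000), (20000, 21000), (15000, 14000)]
yields = [holding_period_yield(bought, sold) for bought, sold in trades]

# Hypothetical example: a 21% total return earned over two years
two_year_ahpy = annualized_hpr(1.21, 2) - 1
```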
However, if you had multiple returns (for example, you purchased stock A on 1/1 for 100 and its closing price over the next week climbed and fell to 98, 103, 101, 100, 99, 104), then you would have to look beyond HPR and HPY. In this case you can calculate ARR and GRR, the arithmetic and geometric rates of return. Try out these online calculators for arithmetic rate of return and geometric rate of return.
But then if you had a date schedule for your investments then none of these would apply. You would then have to resort to finding IRR for irregular cash flows. IRR is the internal rate of return for periodic cash flows. For irregular cash flows such as for stock trade, the term XIRR is used. XIRR is an Excel function that calculates internal rate of return for irregular cash flows. To find XIRR you would need a series of cash flows and a date schedule for the cash flows.
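To show what "a series of cash flows and a date schedule" means in practice, here is a rough Python sketch of an XIRR-style calculation for the three trades in the question. It finds the annual rate at which the date-discounted cash flows sum to zero, using plain bisection for simplicity. That is a choice made for this illustration and is not claimed to be how Excel or any particular library implements it. The function names, the 365-day year, the year 2012, and the search bracket are all assumptions.

```python
from datetime import date

def xnpv(rate, cash_flows):
    """Net present value of dated cash flows at an annual rate (365-day year)."""
    start = min(day for day, _ in cash_flows)
    return sum(amount / (1.0 + rate) ** ((day - start).days / 365.0)
               for day, amount in cash_flows)

def xirr(cash_flows, low=-0.99, high=100.0, iterations=200):
    """Annual rate where xnpv is zero, found by bisection inside [low, high]."""
    f_low = xnpv(low, cash_flows)
    if f_low * xnpv(high, cash_flows) > 0:
        raise ValueError("no sign change in the bracket; widen low/high")
    for _ in range(iterations):
        mid = (low + high) / 2.0
        f_mid = xnpv(mid, cash_flows)
        if f_low * f_mid <= 0:
            high = mid               # the root lies in the lower half
        else:
            low, f_low = mid, f_mid  # the root lies in the upper half
    return (low + high) / 2.0

# Purchases are negative cash flows, sales are positive ones.
trades = [
    (date(2012, 1, 1), -10000), (date(2012, 1, 7), 11000),
    (date(2012, 1, 1), -20000), (date(2012, 1, 20), 21000),
    (date(2012, 1, 5), -15000), (date(2012, 1, 18), 14000),
]
annual_rate = xirr(trades)
```

Because the holding periods are only a few weeks long, the annualized rate comes out very large; that is a property of annualizing short periods rather than a flaw in the method.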
Finance.ThinkAndDone.com explains IRR in much more detail than the articles you cited on RubyQuiz and Wiki. The IRR article on Think & Done explains the IRR calculation with the Newton-Raphson method and the secant method, using either the NPV equation set to 0 or the profitability index equation set to 1. The site also provides online IRR and XIRR calculators.
I don't know anything about finance, but it makes sense to me that if you want to know the rate of return over 6 months, it should be the rate which equals the yearly rate when compounded twice. If you want to know the rate for 3 months, it should be the rate which equals the yearly rate when compounded 4 times, etc. This implies that converting from a yearly return rate to a rate for an arbitrary period is closely related to calculating roots. If you express the yearly return rate as a proportion of the original amount (i.e. express 20% return as 1.2, 100% return as 2.0, etc), then you can get the 6-month return rate by taking the square root of that number.
Ruby has a very handy way to calculate all kinds of complex roots: the exponentiation operator, **.
n ** 0.5 # square root
n ** (1.0/3.0) # 3rd root
...and so on.
So I think you should be able to convert a yearly rate of return to one for an arbitrary period by:
yearly_return ** (days.to_f / 365)
Likewise, to convert a daily, weekly, or monthly rate of return to a yearly rate:
yearly_return = daily_return ** 365
yearly_return = weekly_return ** 52
yearly_return = monthly_return ** 12
...and so on.
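The conversions above can be checked numerically. The Python sketch below follows the same convention as the text: returns are expressed as growth factors (1.2 for a 20% return), not as percentages. The function names and sample figures are made up for this illustration.

```python
def factor_for_period(yearly_factor, days):
    """Growth factor over `days` days that compounds to the yearly factor."""
    return yearly_factor ** (days / 365)

def yearly_from_periodic(period_factor, periods_per_year):
    """Yearly growth factor obtained by compounding a per-period factor."""
    return period_factor ** periods_per_year

# A 21% yearly return corresponds to a 10% return over half a year ...
half_year = factor_for_period(1.21, 182.5)

# ... and compounding that half-year factor twice gives the yearly factor back.
back_to_year = yearly_from_periodic(half_year, 2)
```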
As far as I can see (from reading the Wikipedia article), the IRR calculation is not actually dependent on the time period used. If you give a series of yearly cash flows as input, you get a yearly rate. If you give a series of daily cash flows as input, you get a daily rate, and so on.
I suggest you use one of the solutions you linked to to calculate IRR for daily or weekly cash flows (whatever is convenient), and convert that to a yearly rate using exponentiation. You will have to add 1 to the output of the irr() method (so that 10% return will be 1.1 rather than 0.1, etc).
Using the daily cash flows for the example you gave, you could do this to get daily IRR:
irr([-30000,0,0,0,-15000,0,11000,0,0,0,0,0,0,0,0,0,0,14000,0,21000])
You can use the Exonio library:
https://github.com/Noverde/exonio
and use it like this:
Exonio.irr([-100, 39, 59, 55, 20]) # ==> 0.28095
I believe the main obstacle to understanding your scenario is the lack of a cash flow for each of the stocks, which is an essential ingredient for computing any type of IRR; without these, none of the formulas can be used. If you clarify this, I can help you solve your problem.
Heberto del Rio
There is a new gem, 'finance_math', that solves this problem very easily:
https://github.com/kolosek/finance_math
