Doing BigDecimal accounting with Ruby. When can I convert to float? - ruby

I basically want to know the best way to convert my BigDecimal numbers to a readable format (float, perhaps) for the purpose of displaying them to the client.
To figure out liabilities, I compute contributions minus distributions for each address.
For example, if a person contributes 2 units to an address and that same address distributes 1 unit back to the person, then there is still a 1 unit liability. That's what's going on below. These numbers are all units of cryptocurrency.
Here's another example:
Say address1 and address2 contribute 2 coins each to address3, and address3 distributes 1.0 coins to address1 and 0.5 coins to address2, then address3 has a 1.0 coin liability to address1 and a 1.5 coin liability to address2.
Here is the actual data, using BigDecimal:
contributions =
{"1444"=>#<BigDecimal:7f915c08f030,'0.502569E2',18(36)>,
"alice"=>#<BigDecimal:7f915c084018,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915c0a4430,'0.87161E1',18(36)>,
"address2"=>#<BigDecimal:7f915c0943f0,'0.84811E1',18(36)>,
"address3"=>#<BigDecimal:7f915c0a43e0,'0.385E0',9(18)>,
"address6"=>#<BigDecimal:7f915c09ebe8,'0.1E1',9(18)>,
"address7"=>#<BigDecimal:7f915c09eb98,'0.1E1',9(18)>,
"address8"=>#<BigDecimal:7f915c09d428,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915c09d3d8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915c0a7540,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915c0af8a8,'0.392E1',18(36)>,
"address12"=>#<BigDecimal:7f915c0a4980,'0.14E1',18(36)>,
"address13"=>#<BigDecimal:7f915c0af858,'0.2133333333 3333333333 33333334E1',36(54)>,
"address14"=>#<BigDecimal:7f915c0a54c0,'0.3533333333 3333333333 33333334E1',36(45)>,
"address15"=>#<BigDecimal:7f915c0a66e0,'0.1533333333 3333333333 33333334E1',36(36)>,
"sdfds"=>#<BigDecimal:7f915c0a6118,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915c0a6028,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915c0ae688,'0.3E0',9(18)>,
"address21"=>#<BigDecimal:7f915c0ae638,'0.3E0',9(18)>,
"address23"=>#<BigDecimal:7f915c0ae070,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915c0adf80,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915c0ad328,'0.1E0',9(18)>,
"add2"=>#<BigDecimal:7f915c0ad2d8,'0.1E0',9(18)>,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
and the distributions:
distributions =
{"1444"=>#<BigDecimal:7f915a9068f0,'0.502569E2',18(63)>,
"alice"=>#<BigDecimal:7f915a8f44e8,'0.211E1',18(27)>,
"address1"=>#<BigDecimal:7f915a906800,'0.87161E1',18(54)>,
"address2"=>#<BigDecimal:7f915a906710,'0.84811E1',18(54)>,
"address3"=>#<BigDecimal:7f915a906620,'0.385E0',9(36)>,
"address6"=>#<BigDecimal:7f915a8fdea8,'0.1E1',9(27)>,
"address7"=>#<BigDecimal:7f915a8fddb8,'0.1E1',9(27)>,
"address8"=>#<BigDecimal:7f915a8fd5e8,'0.15E1',18(18)>,
"address9"=>#<BigDecimal:7f915a8fd4f8,'0.15E1',18(18)>,
"address10"=>#<BigDecimal:7f915a8fc9b8,'0.132E1',18(36)>,
"address11"=>#<BigDecimal:7f915a9071b0,'0.3920000000 0000003E1',27(45)>,
"address12"=>#<BigDecimal:7f915a907660,'0.1400000000 0000001E1',27(36)>,
"address13"=>#<BigDecimal:7f915a9070c0,'0.2133333333 3333337E1',27(45)>,
"address14"=>#<BigDecimal:7f915a906530,'0.3533333333 3333333333 33333334E1',36(54)>,
"address15"=>#<BigDecimal:7f915a8fc148,'0.1533333333 3333334E1',27(27)>,
"sdfds"=>#<BigDecimal:7f915a907f98,'0.1E0',9(27)>,
"sf"=>#<BigDecimal:7f915a907e08,'0.1E0',9(27)>,
"address20"=>#<BigDecimal:7f915a906ad0,'0.3000000000 0000003E0',18(27)>,
"address21"=>#<BigDecimal:7f915a9069e0,'0.3000000000 0000003E0',18(27)>,
"address23"=>#<BigDecimal:7f915a9063c8,'0.1E0',9(27)>,
"address22"=>#<BigDecimal:7f915a906238,'0.1E0',9(27)>,
"add1"=>#<BigDecimal:7f915a9060a8,'0.5E-1',9(27)>,
"add2"=>#<BigDecimal:7f915a905f18,'0.1E0',9(27)>}
Ideally, I want my liabilities to look like this:
{"add1"=>0.05,
"addx"=>0.1,
"addy"=>0.1}
But they look like this:
{"1444"=>0.0,
"alice"=>0.0,
"address1"=>0.0,
"address2"=>0.0,
"address3"=>0.0,
"address6"=>0.0,
"address7"=>0.0,
"address8"=>0.0,
"address9"=>0.0,
"address10"=>0.0,
"address11"=>-3.0e-16,
"address12"=>-1.0e-16,
"address13"=>-3.66666666666e-16,
"address14"=>0.0,
"address15"=>-6.6666666666e-17,
"sdfds"=>0.0,
"sf"=>0.0,
"address20"=>-3.0e-17,
"address21"=>-3.0e-17,
"address23"=>0.0,
"address22"=>0.0,
"add1"=>0.05,
"add2"=>0.0,
"addx"=>#<BigDecimal:7f915c0acd10,'0.1E0',9(27)>,
"addy"=>#<BigDecimal:7f915c0acc20,'0.1E0',9(27)>}
I don't want to include -3.66666666666e-16 because that's essentially 0, and even Ruby treats it that way: running -3.66666666666e-16 > 0 returns false.
This is what I have. Is there a better way? The code below calculates the liabilities by subtracting distributions from contributions and selecting only the liabilities that are greater than 0.0. That makes sense to me, and it excludes one-time grants of coins (there must be a matching contribution for there to be a liability). Then I convert everything to floats so it's readable. Does this look right?
liab = @contributions.merge(@distributions) do |key, con, dis|
  (con - dis)
end.select { |addr, amount| amount > 0.0 && @contributions.keys.include?(addr) }
liab.merge(liab) do |k, old, new|
  k = new.to_f
end
I want the amount returned in float format, not the big decimal object. Is what I'm doing okay? Will I keep accuracy if I convert to float at the end?
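One possible approach, as a minimal sketch rather than a definitive answer: keep everything in BigDecimal, round each difference to a fixed number of decimal places so that noise such as -3.0e-16 collapses to zero, and only turn the amounts into strings at display time. The sample data, the PRECISION constant (8 places), and the variable names here are assumptions for illustration. Values like 0.3920000000 0000003E1 in the distributions suggest that a float was involved somewhere upstream, which is where accuracy is most likely being lost.

```ruby
require 'bigdecimal'

# Hypothetical sample data; amounts are built from strings so that no
# float noise enters at construction time.
contributions = {
  'add1'      => BigDecimal('0.1'),
  'add2'      => BigDecimal('0.1'),
  'addx'      => BigDecimal('0.1'),
  'address11' => BigDecimal('3.92')
}
distributions = {
  'add1'       => BigDecimal('0.05'),
  'add2'       => BigDecimal('0.1'),
  'address11'  => BigDecimal('3.9200000000000003'), # carries float-style noise
  'grant_only' => BigDecimal('1')                   # no matching contribution
}

PRECISION = 8 # assumed number of decimal places worth keeping

# Iterate over contributions only, so one-time grants are excluded.
liabilities = contributions.each_with_object({}) do |(addr, con), out|
  dis = distributions.fetch(addr, BigDecimal('0'))
  amount = (con - dis).round(PRECISION) # still a BigDecimal
  out[addr] = amount if amount > 0
end

# Convert to plain decimal strings only for display.
display = liabilities.transform_values { |amount| amount.to_s('F') }
puts display
```

Formatting with to_s('F') instead of to_f keeps the exact decimal digits; to_f is usually fine for display only, as long as no further arithmetic is done on the result.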

Related

Time Conversion from str to float (00:54:50) -> (54.8) for example

I am trying to convert my time watched in a Netflix show to a float so I can total it up. I cannot figure out how to convert it. I have tried many ways, including:
temp['Minutes'] = temp['Duration'].apply(lambda x: float(x))
Error: ValueError: could not convert string to float: '00:54:45'
''' 2022-05-18 05:21:42 00:54:45 NaN Ozark: Season 4: Mud (Episode 13)
NaN Amazon FTVET31DOVI2020 Smart TV 00:54:50 00:54:50 US (United
States) Wednesday 2022-05-18
'''
I have pulled the day of week and Day out but I would like to plot it just for fun and think the minutes would be the best to add up over time.
Do it like this:
var = '00:54:45'
var_array = var.split(':')
# Avoid naming the result `float`, which would shadow the built-in float().
minutes = float(var_array[0]) * 60 + float(var_array[1]) + float(var_array[2]) / 60
print(minutes)
Output: 54.75 (from here you can round the result, for example with round(minutes, 1))

Ruby splitting a record into multiple records based on contents of a field

Record layout contains two fields:
Requisition
Test Names
Example record:
R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA), 9 Lactoferrin, 3 Anti-gliadin IgA, 10 H. pylori Panel, 6 Fecal Fat, 11 Antibiotic Resistance Panel, 2 C. difficile Tox A/ Tox B, 5 Elastase, 7 Fecal Occult Blood, 12 Shigella"
The current Ruby code snippet used in the LIMS (Lab Info Management System) is this:
subj.get_value('Tests').join(', ')
What I need to be able to do in the Ruby code snippet is create a new record off each comma-separated value in the second field.
NOTE:
The number of values in the 'Test Names' field varies from 1 to 20, or more.
There can be hundreds of Requisition records.
Final result would be:
R00000001,"4 Calprotectin"
R00000001,"1 Luminex xTAG"
R00000001,"8 H. pylori stool antigen (IgA)"
R00000001,"9 Lactoferrin"
R00000001,"3 Anti-gliadin IgA"
R00000001,"10 H. pylori Panel"
R00000001,"6 Fecal Fat"
R00000001,"11 Antibiotic Resistance Panel"
R00000001,"2 C. difficile Tox A/ Tox B"
R00000001,"5 Elastase"
R00000001,"7 Fecal Occult Blood"
R00000001,"12 Shigella"
If your data is reliably a string like the one shown in your example, here's a method:
data = subj.get_value('Tests').join(', ') # assuming this gives your string object
def split_data(data)
  arr = data.gsub('"','').split(',')
  # arr[0] is the requisition; emit one record per remaining value
  arr.map {|l| "#{arr[0]},\"#{l.strip}\""}[1..-1]
end
puts split_data(data)
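As a self-contained sketch of the same idea, assuming each record arrives as a single string in the form shown in the question (requisition, a comma, then the quoted list of test names): split_record is a hypothetical helper name, and the sample record is shortened from the question's example. It assumes no individual test name contains a comma.

```ruby
# Hypothetical helper: turns one 'requisition,"test, test, ..."' record
# into one record per test name.
def split_record(record)
  # Split only on the first comma so the test list stays in one piece.
  requisition, tests = record.split(',', 2)
  tests.delete('"').split(',').map do |name|
    %(#{requisition},"#{name.strip}")
  end
end

record = 'R00000001,"4 Calprotectin, 1 Luminex xTAG, 8 H. pylori stool antigen (IgA)"'
puts split_record(record)
```

Splitting with a limit of 2 keeps the requisition separate from the test list, so the number of test names can vary freely.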

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
My understanding was that when the feature count is 2 (int), two features should be considered when creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble import IsolationForest

def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state, verbose=verbosity)
    model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
                                  max_depth=max_depth,
                                  sample_weight=sample_weight)
# however, after _fit, decision_function is called using X (the whole sample),
# without taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))
then:
# when decision_function is called with X unmodified,
# it calls the base estimator's (decision tree) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")
    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")
    # so, this check fails because X is the original X, not with the max_features applied
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))
    return X
So, I am not sure how you can handle this. Maybe work out the fraction (float) that corresponds to just the two features you need, though I am not sure it will work as expected.
Note: I am using scikit-learn v0.18.
Edit: as @Vivek Kumar commented, this is a known issue, and upgrading to 0.20 should do the trick.

Pymc and binomials: How to fit 7 binomials to data

I have this problem: I have a cohort of individuals grouped into 5 age groups. Initially all of them are susceptible; then they develop disease, and finally they have cancer. I have the age-group distribution of the susceptible individuals and of the cancer carriers. Between the susceptible state and cancer they pass through 7 stages, all with the same transition rate.
I'm trying to create a model that simulates each transition as a binomial draw and fits the data I have.
I tried something, but when I analyse the traces, nothing works.
You can see the code below.
Where am I going wrong?
Thanks for any help.
from pylab import *
import pymc  # needed because the code below refers to pymc.Binomial, pymc.Model and pymc.MCMC
from pymc import *
from pymc.Matplot import plot as plt
#susceptible_data = array([647,1814,8838,9949,1920])
susceptible_data = array([130,398,1415,1303,206])
infected_data_100000 = array([0,197,302,776,927])
infected_data = array([0,7,38,90,17])
prior_values=np.zeros(len(infected_data))
for i in range(0,len(infected_data)):
    prior_values[i]=infected_data[i]/susceptible_data[i]
# stochastic priors
beta1 = Uniform('beta1', 0., 1.)
lambda_0_temp=susceptible_data[0]
lambda_0_1=pymc.Binomial("lambda_0_1",lambda_0_temp,pow(beta1,1))
lambda_0_2=pymc.Binomial("lambda_0_2",lambda_0_1.value,pow(beta1,1))
lambda_0_3=pymc.Binomial("lambda_0_3",lambda_0_2.value,pow(beta1,1))
lambda_0_4=pymc.Binomial("lambda_0_4",lambda_0_3.value,pow(beta1,1))
lambda_0_5=pymc.Binomial("lambda_0_5",lambda_0_4.value,pow(beta1,1))
lambda_0_6=pymc.Binomial("lambda_0_6",lambda_0_5.value,pow(beta1,1))
lambda_0_7=pymc.Binomial("lambda_0_7",n=lambda_0_6.value,p=pow(beta1,1),value=infected_data[0],observed=True)
lambda_1_temp=susceptible_data[1]
lambda_1_1=pymc.Binomial("lambda_1_1",lambda_1_temp,pow(beta1,1))
lambda_1_2=pymc.Binomial("lambda_1_2",lambda_1_1.value,pow(beta1,1))
lambda_1_3=pymc.Binomial("lambda_1_3",lambda_1_2.value,pow(beta1,1))
lambda_1_4=pymc.Binomial("lambda_1_4",lambda_1_3.value,pow(beta1,1))
lambda_1_5=pymc.Binomial("lambda_1_5",lambda_1_4.value,pow(beta1,1))
lambda_1_6=pymc.Binomial("lambda_1_6",lambda_1_5.value,pow(beta1,1))
lambda_1_7=pymc.Binomial("lambda_1_7",n=lambda_1_6.value,p=pow(beta1,1),value=infected_data[1],observed=True)
lambda_2_temp=susceptible_data[2]
lambda_2_1=pymc.Binomial("lambda_2_1",lambda_2_temp,pow(beta1,1))
lambda_2_2=pymc.Binomial("lambda_2_2",lambda_2_1.value,pow(beta1,1))
lambda_2_3=pymc.Binomial("lambda_2_3",lambda_2_2.value,pow(beta1,1))
lambda_2_4=pymc.Binomial("lambda_2_4",lambda_2_3.value,pow(beta1,1))
lambda_2_5=pymc.Binomial("lambda_2_5",lambda_2_4.value,pow(beta1,1))
lambda_2_6=pymc.Binomial("lambda_2_6",lambda_2_5.value,pow(beta1,1))
lambda_2_7=pymc.Binomial("lambda_2_7",n=lambda_2_6.value,p=pow(beta1,1),value=infected_data[2],observed=True)
lambda_3_temp=susceptible_data[3]
lambda_3_1=pymc.Binomial("lambda_3_1",lambda_3_temp,pow(beta1,1))
lambda_3_2=pymc.Binomial("lambda_3_2",lambda_3_1.value,pow(beta1,1))
lambda_3_3=pymc.Binomial("lambda_3_3",lambda_3_2.value,pow(beta1,1))
lambda_3_4=pymc.Binomial("lambda_3_4",lambda_3_3.value,pow(beta1,1))
lambda_3_5=pymc.Binomial("lambda_3_5",lambda_3_4.value,pow(beta1,1))
lambda_3_6=pymc.Binomial("lambda_3_6",lambda_3_5.value,pow(beta1,1))
lambda_3_7=pymc.Binomial("lambda_3_7",n=lambda_3_6.value,p=pow(beta1,1),value=infected_data[3],observed=True)
lambda_4_temp=susceptible_data[4]
lambda_4_1=pymc.Binomial("lambda_4_1",lambda_4_temp,pow(beta1,1))
lambda_4_2=pymc.Binomial("lambda_4_2",lambda_4_1.value,pow(beta1,1))
lambda_4_3=pymc.Binomial("lambda_4_3",lambda_4_2.value,pow(beta1,1))
lambda_4_4=pymc.Binomial("lambda_4_4",lambda_4_3.value,pow(beta1,1))
lambda_4_5=pymc.Binomial("lambda_4_5",lambda_4_4.value,pow(beta1,1))
lambda_4_6=pymc.Binomial("lambda_4_6",lambda_4_5.value,pow(beta1,1))
lambda_4_7=pymc.Binomial("lambda_4_7",n=lambda_4_6.value,p=pow(beta1,1),value=infected_data[4],observed=True)
model=pymc.Model([lambda_0_7,lambda_1_7,lambda_2_7,lambda_3_7,lambda_4_7,beta1])
mcmc =pymc.MCMC(model)
mcmc.sample(iter=100000, burn=50000, thin=10, verbose=1)
lambda_0_samples=mcmc.trace('lambda_0_7')[:]
lambda_1_samples=mcmc.trace('lambda_1_7')[:]
lambda_2_samples=mcmc.trace('lambda_2_7')[:]
lambda_3_samples=mcmc.trace('lambda_3_7')[:]
lambda_4_samples=mcmc.trace('lambda_4_7')[:]
beta1_samples=mcmc.trace('beta1')[:]
What you have implemented above only associates data with the 7th distribution in each set; the others are seemingly-redundant hierarchies on the binomial probability. I would think you want data informing each stage. I'm not sure there is information to inform what the values of p should be at each stage, based on what is provided.

How to normalize an image using Octave?

In the paper describing the Viola-Jones object detection framework (Robust Real-Time Face Detection by Viola and Jones), it is said:
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions.
My question is "How to implement image normalization in Octave?"
I'm NOT looking for the specific implementation that Viola & Jones used, but a similar one that produces almost the same output. I've been following a lot of Haar-training tutorials (trying to detect a hand) but have not yet been able to produce a good detector (xml).
I've tried contacting the authors, but have had no response yet.
I already answered how to do it, as general guidelines, in this thread.
Here is how to do method 1 (normalizing to zero mean and unit standard deviation) in Octave, demonstrated on a random matrix A; it can of course be applied to any matrix, which is how the picture is represented:
>>A = rand(5,5)
A =
0.078558 0.856690 0.077673 0.038482 0.125593
0.272183 0.091885 0.495691 0.313981 0.198931
0.287203 0.779104 0.301254 0.118286 0.252514
0.508187 0.893055 0.797877 0.668184 0.402121
0.319055 0.245784 0.324384 0.519099 0.352954
>>s = std(A(:))
s = 0.25628
>>u = mean(A(:))
u = 0.37275
>>A_norm = (A - u) / s
A_norm =
-1.147939 1.888350 -1.151395 -1.304320 -0.964411
-0.392411 -1.095939 0.479722 -0.229316 -0.678241
-0.333804 1.585607 -0.278976 -0.992922 -0.469159
0.528481 2.030247 1.658861 1.152795 0.114610
-0.209517 -0.495419 -0.188723 0.571062 -0.077241
In the above you use:
To get the standard deviation of the matrix: s = std(A(:))
To get the mean value of the matrix: u = mean(A(:))
Then apply the formula A'[i][j] = (A[i][j] - u)/s in its vectorized form: A_norm = (A - u) / s
Normalizing it with vector normalization is also simple:
>>abs = sqrt((A(:))' * (A(:)))
abs = 2.2472
>>A_norm = A / abs
A_norm =
0.034959 0.381229 0.034565 0.017124 0.055889
0.121122 0.040889 0.220583 0.139722 0.088525
0.127806 0.346703 0.134059 0.052637 0.112369
0.226144 0.397411 0.355057 0.297343 0.178945
0.141980 0.109375 0.144351 0.231000 0.157065
In the above:
abs is the absolute value of the vector (its length), which is calculated with a vectorized multiplication (A(:)' * A(:) is actually sum(A[i][j]^2)).
Then we use it to normalize the vector so that it has length 1.
