LightGBM usage of init_score results in no boosting - lightgbm

It seems that if lightgbm.train is used with an initial score (init_score), it cannot improve on this score.
Here is a simple example:
params = {"learning_rate": 0.1,"metric": "binary_logloss","objective": "binary",
"boosting_type": "gbdt","num_iterations": 5, "num_leaves": 2 ** 2,
"max_depth": 2, "num_threads": 1, "verbose": 0, "min_data_in_leaf": 1}
x = pd.DataFrame([[1, 0.1, 0.3], [1, 0.1, 0.3], [1, 0.1, 0.3],
[0, 0.9, 0.3], [0, 0.9, 0.3], [0, 0.9, 0.3]], columns=["a", "b", "prob"])
y = pd.Series([0, 1, 0, 0, 1, 0])
d_train = lgb.Dataset(x, label=y)
model = lgb.train(params, d_train)
y_pred_default = model.predict(x, raw_score=False)
In the case above, no init_score is used. The predictions are correct:
y_pred_default = [0.33333333, ..., 0.33333333]
d_train = lgb.Dataset(x, label=y, init_score=scipy.special.logit(x["prob"]))
model = lgb.train(params, d_train)
y_pred_raw = model.predict(x, raw_score=True)
Here, we take column "prob" from x as our initial guess (obtained, say, from some other model). We apply the logit and use it as the initial score. However, the model does not improve, and the boosting always returns 0: y_pred_raw = [0, 0, 0, 0, 0, 0]
y_pred_raw_with_init = scipy.special.logit(x["prob"]) + y_pred_raw
y_pred = scipy.special.expit(y_pred_raw_with_init)
The snippet above shows what I believe is the correct way to translate the initial scores plus the boosted raw scores back into probabilities. Since the boosting contribution is zero, y_pred yields [0.3, ..., 0.3], which is just our initial probability.
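For reference, a quick back-of-the-envelope check (my own addition, not LightGBM output) that there really is room to boost: the label mean is 1/3 while the initial probability is 0.3, so a working booster should shift the raw score by roughly logit(1/3) - logit(0.3) per row:
import scipy.special

# Expected per-row raw-score shift if boosting worked; it is nonzero,
# so y_pred_raw = [0, ..., 0] indicates no boosting happened at all.
expected_shift = scipy.special.logit(1 / 3) - scipy.special.logit(0.3)
print(expected_shift)  # ~0.154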

Related

CountMultiplicativePairs in Python using optimized way

The complete problem is given below. I wrote Python code for it and want to know its complexity, and whether it can be optimised further. Solutions are available in C#, but their logic is quite complex.
http://www.whatsjs.com/2018/01/codility-countmultiplicativepairs.html
Here is the solution to the problem:
How to find pairs with product greater than sum
Below is the code I wrote in Python. Is there another way to do this, or has someone solved this problem in Python? The C# code explained above doesn't have a proper explanation.
def solution(A, B):
    """
    Count the number of pairs (x, y) such that x * y >= x + y.
    """
    M = 1000 * 1000
    max_count = 1000 * 1000 * 1000
    zero = count = 0
    if len(A) <= 1:
        return "Length of array A should be greater than 1"
    if len(B) <= 1:
        return "Length of array B should be greater than 1"
    if len(A) != len(B):
        return "Length of both arrays should be equal"
    C = [0] * len(A)
    for (i, elem) in enumerate(A):
        C[i] = float(A[i]) + float(B[i] / M)
    for (i, elem) in enumerate(C):
        if elem == 0:
            zero += 1
        if elem > 0 and elem <= 1:
            pass  # values in (0, 1] can never form a qualifying pair
        if elem > 1:
            for j in range(i + 1, len(C)):
                if round(C[i] * C[j], 2) >= C[i] + C[j]:
                    count += 1
    zero_pairs = int(zero * (zero - 1) / 2)
    count += zero_pairs
    return min(count, max_count)

# print(solution([0, 1, 2, 2, 3, 5], [500000, 500000, 0, 0, 0, 20000]))
print(solution([1, 1, 1, 2, 2, 3, 5, 6], [200000, 250000, 500000, 0, 0, 0, 0, 0]))
# print(solution([0, 0, 2, 2], [0, 0, 0, 0]))
# print(solution([1, 3], [500000, 10000]))
# print(solution([1, 3], [400000, 500000]))
# print(solution([0, 0, 0, 0], [0, 0, 0, 0]))
# print(solution([0, 0, 0, 0], [1, 1, 1, 1]))
I wanted a more optimised way to solve this, as the current complexity is O(n^2).
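One possible improvement, sketched below under the assumption that all combined values are non-negative (as in the Codility task): for x > 1 the condition x * y >= x + y rearranges to y >= x / (x - 1), values in (0, 1] can never form a qualifying pair, and zeros only pair with zeros. Sorting the values greater than 1 then allows a binary search per element, giving O(n log n) overall. The function name is mine, and the threshold is rounded to 2 decimals to mirror the original's rounding tolerance:

from bisect import bisect_left

def count_pairs_fast(A, B, max_count=1000 * 1000 * 1000):
    M = 1000 * 1000
    C = [a + b / M for a, b in zip(A, B)]
    zeros = sum(1 for c in C if c == 0)
    count = zeros * (zeros - 1) // 2     # (0, 0) pairs always qualify
    big = sorted(c for c in C if c > 1)  # only values > 1 can form other pairs
    for i, x in enumerate(big):
        # x * y >= x + y  <=>  y >= x / (x - 1)  when x > 1
        threshold = round(x / (x - 1), 2)
        j = bisect_left(big, threshold, i + 1)
        count += len(big) - j
    return min(count, max_count)

print(count_pairs_fast([1, 1, 1, 2, 2, 3, 5, 6], [200000, 250000, 500000, 0, 0, 0, 0, 0]))  # 16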

pyqtgraph colorbar matching to data values

I am a new PyQt user and not quite familiar with it.
I have a dataset with min = -50000 and max = 100000. I need to plot it using pyqtgraph. However, the colorbar I am using does not match the actual values of the data.
I should add that the data is normalized before plotting:
Data = Data / np.max(Data)
Here is the code I use to plot it:
stops = [0, 0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
self.colors = 255 * array(
    [[1, 1, 1, 0.7], [1, 1, 1, 0.2], [1, 1, 1, 1], [0, 0, 0, 0.3], [1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0.5, 1], [1, 0.89, 0.03, 1],
     [1, 1, 0, 1], [1, 0.165, 0, 1], [1, 0, 0, 1], [0.128, 0, 0, 1], [1, 1, 1, 1]])
colormap = pg.ColorMap(stops, self.colors)
lut = colormap.getLookupTable(-0.1, 1.0, 512)
self.frog_preview_draw["data"] = pg.ImageItem()
self.frog_preview_draw["data"].setLookupTable(lut)
What I expect is to have more colors around (-0.1, 0.1) and fewer colors for larger values, because I would like to see the noise in my data.
Thank you

Is there a common name for a function that maps by an index?

Names such as map, filter or sum are generally understood by every reasonably good programmer.
I wonder whether the following function f also has such a standard name:
def f(data, idx): return [data[i] for i in idx]
Example usages:
r = f(['world', '!', 'hello'], [2, 0, 1, 1, 1])
piecePrice = [100, 50, 20, 180]
pieceIdx = [0, 2, 3, 0, 0]
totalPrice = sum(f(piecePrice, pieceIdx))
I started with map, but map is generally understood as a function that applies a function to each element of a list.
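For what it's worth, NumPy spells this operation as integer-array ("fancy") indexing, which is one place a conventional name might come from (a sketch for comparison, not a claim about the standard name):

import numpy as np

data = np.array(['world', '!', 'hello'])
idx = [2, 0, 1, 1, 1]
r = data[idx]  # array(['hello', 'world', '!', '!', '!'], dtype='<U5')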

Daru Ruby Gem - How do I transform a categorical variable into a binary one

I have the following Daru Data Frame with a categorical variable called search_term:
home,search_term,bought
0,php,1
0,java,1
1,php,1
...
I want to convert it to a Daru Data Frame with binary columns, something like:
home,php,java,bought
0,1,0,1
0,0,1,1
1,1,0,1
...
I can't find a way to achieve this. I know it's possible in Python's Pandas, but I want to use Ruby with the Daru gem.
Thanks.
According to a blog post written by Yoshoku, the author of the Rumale machine learning library, you can do it like this:
train_df['IsFemale'] = train_df['Sex'].map { |v| v == 'female' ? 1 : 0 }
Rumale's label encoder is also useful for categorical variables.
require 'rumale'
encoder = Rumale::Preprocessing::LabelEncoder.new
labels = Numo::Int32[1, 8, 8, 15, 0]
encoded_labels = encoder.fit_transform(labels)
# Numo::Int32#shape=[5]
# [1, 2, 2, 3, 0]
There is also Rumale::Preprocessing::OneHotEncoder:
encoder = Rumale::Preprocessing::OneHotEncoder.new
labels = Numo::Int32[0, 0, 2, 3, 2, 1]
one_hot_vectors = encoder.fit_transform(labels)
# > pp one_hot_vectors
# Numo::DFloat#shape=[6, 4]
# [[1, 0, 0, 0],
# [1, 0, 0, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# [0, 0, 1, 0],
# [0, 1, 0, 0]]
Note that converting between Daru::Vector and Numo::NArray requires to_a:
encoder = Rumale::Preprocessing::LabelEncoder.new
train_df['Embarked'] = encoder.fit_transform(train_df['Embarked'].to_a).to_a
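For comparison, the transformation the question describes is a one-liner in pandas, which the question mentions (a sketch using the sample columns above; column order may differ):

import pandas as pd

df = pd.DataFrame({"home": [0, 0, 1],
                   "search_term": ["php", "java", "php"],
                   "bought": [1, 1, 1]})
# One binary column per category, named after the category itself.
binary = pd.get_dummies(df, columns=["search_term"], prefix="", prefix_sep="")
print(binary)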

How can I find the next value? [closed]

Given an array of 0s and 1s, e.g. array[] = {0, 1, 0, 0, 0, 1, ...}, how can I predict the next value with the best possible accuracy?
What kind of methods are best suited for this kind of task?
The prediction method would depend on the interpretation of the data.
However, it looks like in this particular case we can make some general assumptions that might justify the use of certain machine learning techniques:
Values are generated one after another in chronological order
Values depend on some (possibly non-observable) external state. If the state repeats itself, so do the values.
This is a pretty common scenario in many machine learning contexts. One example is the prediction of stock prices based on history.
Now, to build the predictive model you'll need to define the training data set. Assume our model looks at the last k values. If k=1, we might end up with something similar to a Markov chain model.
Our training data set will consist of k-dimensional data points together with their respective dependent values. For example, suppose k=3 and we have the following input data
0,0,1,1,0,1,0,1,1,1,1,0,1,0,0,1...
We'll have the following training data:
(0,0,1) -> 1
(0,1,1) -> 0
(1,1,0) -> 1
(1,0,1) -> 0
(0,1,0) -> 1
(1,0,1) -> 1
(0,1,1) -> 1
(1,1,1) -> 1
(1,1,1) -> 0
(1,1,0) -> 1
(1,0,1) -> 0
(0,1,0) -> 0
(1,0,0) -> 1
Now, let's say you want to predict the next value in the sequence. The last 3 values are 0,0,1, so the model must predict the value of the function at (0,0,1), based on the training data.
A popular and relatively simple approach would be to use a multivariate linear regression on a k-dimensional data space. Alternatively, consider using a neural network if linear regression underfits the training data set.
You might need to try out different values of k and test against your validation set.
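A minimal sketch of building this k-window training set (the function name is my own):

def make_training_data(bits, k=3):
    """Turn a 0/1 sequence into (window, next_value) training pairs."""
    X = [bits[i:i + k] for i in range(len(bits) - k)]
    y = [bits[i + k] for i in range(len(bits) - k)]
    return X, y

bits = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1]
X, y = make_training_data(bits, k=3)
# X[0] == [0, 0, 1], y[0] == 1 -- matches the first row of the table above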
You could use a maximum likelihood estimator for the Bernoulli distribution. In essence you would:
look at all observed values and estimate parameter p
then use p to determine the next value
In Python this could look like this:
#!/usr/bin/env python
from __future__ import division

signal = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0]

def maximum_likelihood(s, last=None):
    """
    The maximum likelihood estimator selects the parameter value which gives
    the observed data the largest possible probability.
    http://mathworld.wolfram.com/MaximumLikelihood.html
    If `last` is given, only use the last `last` values.
    """
    if not last:
        return sum(s) / len(s)
    return sum(s[-last:]) / last  # s[-last:], not s[:-last]: take the last `last` values

if __name__ == '__main__':
    hits = []
    print('p\tpredicted\tcorrect\tsignal')
    print('-\t---------\t-------\t------')
    for i in range(1, len(signal) - 1):
        p = maximum_likelihood(signal[:i])  # p = maximum_likelihood(signal[:i], last=2)
        prediction = int(p >= 0.5)
        hits.append(prediction == signal[i])
        print('%0.3f\t%s\t\t%s\t%s' % (
            p, prediction, prediction == signal[i], signal[:i]))
    print('accuracy: %0.3f' % (sum(hits) / len(hits)))
The output would look like this:
# p predicted correct signal
# - --------- ------- ------
# 1.000 1 False [1]
# 0.500 1 True [1, 0]
# 0.667 1 True [1, 0, 1]
# 0.750 1 False [1, 0, 1, 1]
# 0.600 1 False [1, 0, 1, 1, 0]
# 0.500 1 True [1, 0, 1, 1, 0, 0]
# 0.571 1 False [1, 0, 1, 1, 0, 0, 1]
# 0.500 1 True [1, 0, 1, 1, 0, 0, 1, 0]
# 0.556 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1]
# 0.600 1 False [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
# 0.545 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
# 0.583 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
# 0.615 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
# 0.643 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
# 0.667 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
# 0.688 1 False [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
# 0.647 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
# 0.667 1 False [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
# 0.632 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0]
# 0.650 1 True [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1]
# accuracy: 0.650
You could vary the window size for performance reasons or to favor recent events.
In the example above, if we estimated the next value by looking only at the last 3 observed values, we could increase the accuracy to 0.7.
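Concretely, that variant only changes the estimator call in the loop above:
p = maximum_likelihood(signal[:i], last=3)  # sliding window of the last 3 values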
Update: Inspired by Narek's answer I added a logistic regression classifier example to the gist.
You could estimate the probabilities of 0s and 1s, partition the interval [0, 1] into corresponding ranges, and then draw a random number between 0 and 1 to make the prediction.
If these are series of numbers that are generated after some reset event, and the next numbers are somehow related to the previous ones, you could create a tree (a binary tree with two branches at each node, in your case) and feed such historical series in from the root, adjusting a weight (say, a count) on each branch you follow.
You could divide such counts by the number of series entered before using them, or keep a count on each node as well, incremented before choosing a branch; that way the root node holds the total number of series entered.
Then, as you feed it a new sequence, you can see which branch is "hotter" (this would make a nice heatmap/tree visualization, by the way), especially if the sequence is long enough. That is, assuming the order of items in the sequence plays a role in what comes next.
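A minimal sketch of such a counting structure, using a dictionary of prefixes in place of explicit tree nodes (all names are mine):

from collections import defaultdict

class CountingTree:
    """Counts how often each 0/1 prefix has been observed."""
    def __init__(self):
        self.counts = defaultdict(int)

    def feed(self, series):
        # Walk the series from the root, incrementing every prefix on the path.
        for i in range(len(series)):
            self.counts[tuple(series[:i + 1])] += 1

    def predict(self, prefix):
        # Follow the "hotter" branch after the observed prefix.
        return int(self.counts[tuple(prefix) + (1,)] >= self.counts[tuple(prefix) + (0,)])

tree = CountingTree()
tree.feed([0, 1, 0, 0])
tree.feed([0, 1, 1, 0])
tree.feed([0, 1, 0, 1])
print(tree.predict([0, 1]))  # 0: the (0, 1, 0) branch was seen twice, (0, 1, 1) once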
