Curve fit does not return expected result - curve-fitting

I need some help with my code for fitting a curve to some data.
I have the following data:
'''
x_data=[0.0, 0.006702200711821348, 0.012673613376102217, 0.01805805116486128, 0.02296065262674275, 0.027460615301376282,
0.03161908492177514, 0.03548425629114566, 0.03909479074665314, 0.06168416627459879, 0.06395092768264225,
0.0952415360565632, 0.0964823380829502, 0.11590819258911032, 0.11676250975220677, 0.18973251809768016,
0.1899603458289615, 0.2585011532435637, 0.2586068948029052, 0.40046782450999047, 0.40067753715444315]
y_data=[0.005278154532534359, 0.004670803439961002, 0.004188802888597246, 0.003796976494876385, 0.003472183813732432,
0.0031985782141146, 0.002964943046115825, 0.0027631157936632137, 0.0025870148284089897, 0.001713418196416643,
0.0016440241050665323, 0.0009291243501697267, 0.0009083385934116964, 0.0006374601714823219, 0.0006276132323039056,
0.00016900738921547616, 0.00016834735819595378, 7.829234957755694e-05, 7.828353274888779e-05, 0.00015519569743801753,
0.00015533437619227267]
'''
I know that the data can be fitted using the following mathematical model:
'''
def model(x, a, b, c):
    return (a*b)/(b*x + 1) + 3*c*x**2
'''
I am trying to calibrate the coefficients a, b, c of the model so that I obtain the following result (the calibrated model is in red and the data sample is in blue):
[figure: calibrated model (red) against the data sample (blue)]
My code to achieve the result shown in the picture above is:
'''
import numpy as np
from scipy.optimize import curve_fit
popt, _pcov = curve_fit(model, x_data, y_data,maxfev = 100000)
x_sample=np.linspace(0,0.5,1000)
y_sample=model(x_sample,*popt)
'''
If I plot the predicted data based on the fitted coefficients (in green), I get this result:
[figure: prediction from the fitted coefficients (green)]
For some reason I get coefficients that produce a result I know is wrong. Does anyone know how to solve this issue?

Your model y = (a*b)/(b*x+1) + 3*c*x**2 does not appear really satisfactory. Judging by the shape of the data, an exponential term seems a better choice than the hyperbolic term. That is why the proposed model is:
y=A * exp(B * x) + C * x**2
The method to compute approximations of the parameters A, B, C is shown below:
Details of the numerical calculation:
Note:
The parabolic term appears underrepresented. This is because there are not enough points at large x compared to the many points at small x.
The method used above is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The method is not iterative and does not need initial "guessed" values. The accuracy is not good when there are few points, because of the numerical integration (computation of the Sk).
If necessary, this can be improved by post-processing with a non-linear regression that starts from the approximate parameter values above.
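The refinement step mentioned above can be sketched with scipy.optimize.curve_fit by passing starting values through p0. This is only an illustration under stated assumptions: the data are synthetic (generated from the proposed model with made-up parameters), and the starting values are placeholders, not the approximations produced by the integral-equation method.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_model(x, A, B, C):
    # Proposed model: exponential term plus a parabolic term
    return A * np.exp(B * x) + C * x**2


# Synthetic, noise-free data from made-up parameters (assumption, illustration only)
true_params = (0.005, -20.0, 0.001)
x_demo = np.linspace(0.0, 0.4, 21)
y_demo = exp_model(x_demo, *true_params)

# Rough starting values standing in for the non-iterative approximations
p0 = (0.004, -15.0, 0.0)
popt, pcov = curve_fit(exp_model, x_demo, y_demo, p0=p0)
print(popt)
```

With real data, the same call would take x_data and y_data from the question; how well it converges then depends on how close p0 is to a sensible solution.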

An even better model is made of two exponentials :

Related

SARIMAX model in PyMC3

I would like to write down the following SARIMAX(2,0,0)(2,0,0,12) model in PyMC3 to perform Bayesian estimation of its coefficients, but I cannot figure out how to start with the seasonal part.
Has anyone tried something like this?
import arviz as az
import pymc3 as pm
with pm.Model() as ar2:
    theta = pm.Normal("theta", 0.0, 1.0, shape=2)
    sigma = pm.HalfNormal("sigma", 3)
    likelihood = pm.AR("y", theta, sigma=sigma, observed=data)
    trace = pm.sample(
        1000,
        tune=2000,
        random_seed=13,
    )
    idata = az.from_pymc3(trace)
Although it would be best (e.g. best performance) if you can get an answer that uses PyMC3 exclusively, in case that does not exist yet, there is an alternative way to do this that uses the SARIMAX model in Statsmodels in combination with PyMC3.
There are too many details to repeat a full answer here, but basically you wrap the log-likelihood and gradient methods associated with a Statsmodels SARIMAX model. Here is a link to an example Jupyter notebook that shows how to do this:
https://www.statsmodels.org/stable/examples/notebooks/generated/statespace_sarimax_pymc3.html
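As a side note on the seasonal part of the question: in a multiplicative seasonal AR model, the non-seasonal and seasonal lag polynomials are multiplied together, so a (2,0,0)(2,0,0,12) model is equivalent to a long AR model whose coefficients are constrained products of four parameters. The NumPy sketch below only illustrates that expansion. The function name and the coefficient values are hypothetical, and wiring the result into a PyMC3 likelihood is not shown.

```python
import numpy as np


def expand_seasonal_ar(phi, seasonal_phi, period):
    """Expand (1 - phi_1 L - ...)(1 - Phi_1 L^s - ...) into plain AR coefficients."""
    # Lag-polynomial coefficients in increasing lag order
    nonseasonal = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))
    seasonal = np.zeros(period * len(seasonal_phi) + 1)
    seasonal[0] = 1.0
    for i, coef in enumerate(seasonal_phi, start=1):
        seasonal[i * period] = -coef
    # Multiplying polynomials is a convolution of their coefficients
    full = np.convolve(nonseasonal, seasonal)
    # y_t = sum_k a_k * y_{t-k} + e_t, with a_k = -(coefficient of L^k)
    return -full[1:]


# Hypothetical parameter values, for illustration only
ar_coefs = expand_seasonal_ar([0.5, 0.2], [0.3, 0.1], period=12)
nonzero_lags = {lag: float(c) for lag, c in enumerate(ar_coefs, start=1) if c != 0.0}
print(nonzero_lags)
```

Only lags 1, 2, 12, 13, 14, 24, 25 and 26 carry non-zero coefficients, and the cross terms (for example lag 13) are minus the product of a non-seasonal and a seasonal coefficient.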
I'm not sure if you still need it, but, expanding on cfulton's answer, here is how to fix the error in the statsmodels example (https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_pymc3.html, cell 8):
with pm.Model():
    # Priors
    arL1 = pm.Uniform('ar.L1', -0.99, 0.99)
    maL1 = pm.Uniform('ma.L1', -0.99, 0.99)
    sigma2 = pm.InverseGamma('sigma2', 2, 4)
    # convert variables to tensor vectors
    # # this is wrong:
    # theta = tt.as_tensor_variable([arL1, maL1, sigma2])
    # # this is correct:
    theta = tt.as_tensor_variable([arL1, maL1, sigma2], 'v')
    # use a DensityDist (use a lambda function to "call" the Op)
    # # this is wrong:
    # pm.DensityDist('likelihood', lambda v: loglike(v), observed={'v': theta})
    # # this is correct:
    pm.DensityDist('likelihood', lambda v: loglike(v), observed=theta)
    # Draw samples
    trace = pm.sample(ndraws, tune=nburn, discard_tuned_samples=True, cores=4)
I'm no pymc3/theano expert, but I think the error means that Theano has failed to associate the tensor's name with the values. If you define the name along with the values right at the beginning, it works.
I know it's not a direct answer to your question. Nevertheless, I hope it helps.

Reduce the output layer size from XLTransformers

I'm running the following using the huggingface implementation:
t1 = "My example sentence is really great."
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
encoded_input = tokenizer(t1, return_tensors='pt', add_space_before_punct_symbol=True)
output = model(**encoded_input)
tmp = output[0].detach().numpy()
print(tmp.shape)
>>> (1, 7, 267735)
With the goal of getting output embeddings that I'll use downstream.
The last dimension is /substantially/ larger than I expected, and it looks like it is the size of the entire vocab_size rather than a reduction based on the ECL from the paper (which potentially I am misinterpreting).
What argument would I provide the model to reduce this layer size to a smaller dimensional space, something more like the basic BERT at 400 or 768 and still obtain good performance based on the pretrained embeddings?
That's because you used ...LMHeadModel, which predicts the next token. You can use TransfoXLModel.from_pretrained("transfo-xl-wt103") instead, then output[0] is the last hidden state which has the shape (batch_size, sequence_length, hidden_size).
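To see why the last dimension changes, here is a toy NumPy sketch. It is not the Hugging Face implementation, and the sizes are made up and far smaller than the real model. Conceptually, a language-model head maps each hidden state to one score per vocabulary entry, so dropping the head leaves the hidden states themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, sequence_length = 1, 7
hidden_size, vocab_size = 16, 50  # toy sizes (assumption), not the real model's

# What a base model returns: one hidden vector per token
hidden_states = rng.normal(size=(batch_size, sequence_length, hidden_size))

# What a language-model head adds: a projection onto the vocabulary
lm_head_weights = rng.normal(size=(hidden_size, vocab_size))
logits = hidden_states @ lm_head_weights

print(hidden_states.shape)  # (1, 7, 16)
print(logits.shape)         # (1, 7, 50)
```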

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the position in the array is the topic index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ in order to get the steps to match up.
It also shouldn't be hard to use the steps in the gensim code as guidance for writing parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
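As a rough illustration of why a single matrix product is not enough, here is a simplified NumPy sketch of the kind of iterative variational update that LDA inference performs for one document. It is loosely modeled on the idea behind inference() and is not gensim's code. In particular, it assumes a fixed starting point and takes a small made-up topic-word matrix directly as input, so its numbers would not match gensim's output exactly.

```python
import numpy as np
from scipy.special import digamma


def infer_topic_distribution(word_ids, word_counts, exp_elog_beta, alpha,
                             iterations=50, threshold=1e-3, epsilon=1e-12):
    """Simplified per-document variational update (sketch, not gensim's code)."""
    beta_d = exp_elog_beta[:, word_ids]           # (num_topics, num_unique_words)
    counts = np.asarray(word_counts, dtype=float)
    gamma = np.ones(exp_elog_beta.shape[0])       # fixed start (assumption)
    for _ in range(iterations):
        last_gamma = gamma
        # Expected log topic weights under the current Dirichlet parameters
        exp_elog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        # Per-word normaliser, then the updated Dirichlet parameters
        phinorm = exp_elog_theta @ beta_d + epsilon
        gamma = alpha + exp_elog_theta * (beta_d @ (counts / phinorm))
        if np.mean(np.abs(gamma - last_gamma)) < threshold:
            break
    return gamma / gamma.sum()


# Made-up model: 2 topics over 4 words (illustration only)
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],
                       [0.05, 0.05, 0.45, 0.45]])
alpha = np.array([0.1, 0.1])

# Document containing word 0 three times and word 1 twice
doc_topics = infer_topic_distribution([0, 1], [3, 2], topic_word, alpha)
print(doc_topics)
```

Each pass uses only array products, element-wise operations and the digamma function, so the same loop could be ported to JavaScript once the needed model state is exported.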

ggpredict : confidence interval for negative binomial models

I used the following code to model count data :
ModActi<-glmmTMB(Median ~ H_veg + D_veg + Landscape + JulianDay +
H_veg:D_veg + (1 | Site),
data=MyDataActi, family=nbinom2)
I then used the ggpredict function of the ggeffects package to plot the predicted values of my model for the categorical variable "Landscape":
pr1 <- ggpredict(ModActi, "Landscape")
plot(pr1)
I obtain this graph.
As you can see, the lower confidence limits are negative, as if the function calculated them assuming a normal distribution.
From the help page of ggpredict, it is not clear to me whether there is a way to calculate confidence intervals for a negative binomial distribution (as specified in the model).
EDIT: if I use glmer with a Poisson family, the confidence intervals are correct.
My supervisor found a nice solution by recalculating the standard errors in the predict table :
pr1 <- ggpredict(ModActi, "Landscape")
Ynontransform=log(pr1$predicted)
SEnontransform=log(pr1$conf.high)-Ynontransform
ConfLow=exp(Ynontransform-SEnontransform)
pr1$conf.low=ConfLow
plot(pr1)
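The recalculation above amounts to making the interval symmetric on the log scale and then exponentiating, which keeps the lower limit positive. Here is a small Python sketch of the same arithmetic, with made-up numbers standing in for the predicted and conf.high columns. Like the workaround, it assumes the upper limit is already trustworthy.

```python
import numpy as np

# Hypothetical values standing in for pr1$predicted and pr1$conf.high
predicted = np.array([2.0, 5.0, 0.8])
conf_high = np.array([6.0, 9.0, 3.0])

# Work on the log scale (the link scale of a negative binomial model)
log_pred = np.log(predicted)
half_width = np.log(conf_high) - log_pred   # called SEnontransform in the R code

# Mirror the half-width below the prediction, then back-transform
conf_low = np.exp(log_pred - half_width)
print(conf_low)
```

Algebraically, this lower limit equals predicted**2 / conf_high, so it can never be negative.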
This happened because glmmTMB only returned predictions on the response scale, and these were not back-transformed. glmmTMB has now been updated on CRAN, and I have also revised ggeffects. You can try the current dev version at https://github.com/strengejacke/ggeffects, which now computes the CI properly (after updating glmmTMB to version 0.2.1).

What causes "Jacobian matrix" to be singular in SAS?

I have a simple SAS (version 9.2) program as follows,
proc model;
cdf('normal',log(V/100)+1)=0.5;
bounds V>0;
solve V/solveprint;
run;
It fails with an error saying that the Jacobian matrix is singular:
The Newton method Jacobian matrix of partial derivatives of the
equations with respect to the variables to be solved is singular.
What is the possible cause of this error?
Update: I have simplified the problem a bit. When the equation is changed to "cdf('normal', X)=0.5", it works without error.
Update 2: the bounds statement is now V>0, but the error is still there.
What input data set are you passing to proc model? For example, this code works consistently:
data a;
v=100;
run;
proc model data=a;
cdf('normal',log(V/100)+1) = 0.5;
bounds V>0;
solve V / solveprint;
run;
quit;
And gives a solution of V=36.78794
But changing the input data somewhat (see below) will consistently give a singular Jacobian matrix error.
data a;
v=0.00001;
run;
proc model data=a;
cdf('normal',log(V/100)+1) = 0.5;
bounds V>0;
solve V / solveprint;
run;
quit;
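One plausible reading of the difference between the two runs (an assumption about the cause, not something checked against SAS internals) is that, with a single equation and a single unknown, the Jacobian in the error message is just the derivative of the left-hand side with respect to V. The Python sketch below evaluates that derivative at the two input values of V used above, and also computes the closed-form solution.

```python
import math


def normal_pdf(z):
    # Standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def residual_derivative(v):
    # d/dV of cdf('normal', log(V/100) + 1) is pdf(z) / V, with z = log(V/100) + 1
    z = math.log(v / 100.0) + 1.0
    return normal_pdf(z) / v


# cdf(z) = 0.5 exactly when z = 0, i.e. log(V/100) = -1, so V = 100 / e
v_solution = 100.0 * math.exp(-1.0)
print(v_solution)

print(residual_derivative(100.0))     # clearly non-zero
print(residual_derivative(0.00001))   # numerically indistinguishable from zero
```

At V=0.00001 the derivative is so small that a Newton step has nothing to work with, which is consistent with the singular Jacobian message.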
You are asking SAS to solve an equation that has no solution. You are asking for the value of V>1000 that makes this equation true. But there is no such value, because log(1000/100)+1 is about 3.3, and the CDF of a Normal random variable with mean 0 and standard deviation 1 evaluated at 3.3 is 0.9995. Any larger value of V will just move the function closer to 1, not toward 0.5, so there is no answer to your question.
By telling you that the matrix of partial derivatives is singular, SAS is just using fancy math speak for "your function doesn't have a solution". (Really what it's saying is, "I've turned your question into an equivalent maximization problem, and that problem doesn't have a maximum, so I can't help you.")

Resources