I am working on the Cole-Cole model, which describes how the permittivity varies with frequency and is given by:

ε*(ω) = ε_∞ + (ε_s − ε_∞) / (1 + (jωτ)^(1−α)) − jσ/(ωε_0)

where ε_∞ is the high-frequency permittivity,
ε_s is the static permittivity, ε_s > ε_∞,
ε_0 is the vacuum permittivity, 8.854e-12 F/m,
ω = 2πf,
τ is the relaxation time,
α is the distribution parameter, 0 ≤ α ≤ 1,
σ is the conductivity (can be a constant)
I tried one of the examples given here https://lmfit.github.io/lmfit-py/examples/example_complex_resonator_model.html but was not able to follow along.
Currently I have frequency vs. real-permittivity data. I want to fit the model to my measured data using lmfit and extract the parameter values from the fit. Basically, I want to fit the real part of the permittivity to the model.
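A minimal sketch of how this could be set up with lmfit's Model interface, assuming the equation above; the starting values and bounds are placeholders, not values from real data. Note that the conductivity term −jσ/(ωε_0) is purely imaginary, so it drops out of the real part entirely:

import numpy as np
from lmfit import Model

def cole_cole_real(f, eps_inf, eps_s, tau, alpha):
    # Complex Cole-Cole permittivity; the conductivity term is omitted
    # because it contributes only to the imaginary part.
    omega = 2 * np.pi * f
    eps = eps_inf + (eps_s - eps_inf) / (1 + (1j * omega * tau) ** (1 - alpha))
    return eps.real

model = Model(cole_cole_real)
params = model.make_params(eps_inf=5.0, eps_s=50.0, tau=1e-9, alpha=0.1)
params["alpha"].set(min=0, max=1)
params["tau"].set(min=0)

# f_data, eps_real_data = your measured frequency / real-permittivity arrays
# result = model.fit(eps_real_data, params, f=f_data)
# print(result.fit_report())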
I am dealing with a discrete time system with a sampling time of 300 s.
My question is how to express the state equation or output equation, like
x(k+1)=A*x(k)+B*u(k)
y(k)=C*x(k)
where x(k) is the state and y(k) is the output. I have all the values of the A, B, and C matrices.
I found some information about discrete time systems on this webpage: https://apmonitor.com/wiki/index.php/Apps/DiscreteStateSpace
I want to know whether there is another way to express the state equations other than
x,y,u = m.state_space(A,B,C,D=None,discrete=True)
The discrete state space model is the preferred way to pose your model. You could also convert your equations to a discrete time series form or to a continuous state space form; these are all equivalent. Another way to write your model is to use IMODE=2 (algebraic equations), but this is much more complicated. Here is an example of MIMO identification where we estimate ARX parameters with IMODE=2. I recommend the m.state_space model, used with IMODE>=4.
Here is a pendulum state space model example.
and a flight control state space model.
These both use continuous state space models but the methods are similar to what is needed for your application.
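As a sketch of how the discrete form fits together, here is a minimal simulation assuming a toy 2-state, 1-input, 1-output system; the matrices are placeholders for your own A, B, C, and NODES=2 is my understanding of what the discrete form expects:

import numpy as np
from gekko import GEKKO

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])

m = GEKKO(remote=False)
x, y, u = m.state_space(A, B, C, D=None, discrete=True)

m.time = np.arange(0, 3001, 300)   # points spaced by the 300 s sample time
u[0].value = np.ones(len(m.time))  # step input
m.options.IMODE = 4                # dynamic simulation
m.options.NODES = 2                # discrete models expect NODES=2
m.solve(disp=False)

print(y[0].value)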
There are multiple sources, but they explain things at a bit too high a level for me to actually understand.
Here is my understanding of how this model works:
We feed information forward from the prior layer's nodes using weight * value. We do NOT use the sigmoid function here. This is because any hidden layer would force the value to be POSITIVE if we used the sigmoid function there, and if it is always positive, then subsequent values can never be less than 0.5.
When we have fed forward to the output, we then use the sigmoid function on the output.
So in total, we only use the sigmoid function on the output layer values.
I will try to include a hopefully not terrible diagram
https://imgur.com/a/4EzkpH5
I have tested this with my own code, and evidently the sigmoid function should not be applied to every value and weight, but I am unsure whether it is just the sum of weight * value.
So basically you have a set of features for your model. These features are independent variables which are responsible for producing the output. So the features are the inputs and the predicted values are the outputs. This is indeed a function.
It is easy to understand neural networks if we study them in terms of functions.
First, multiply the feature vector by the vector of weights; that is, compute the dot product of the two vectors.
The dot product is a scalar if you have a single node (neuron). Apply the sigmoid function to this scalar. The output is the final prediction.
The whole model could be expressed as a single composite function like,
y = sigmoid( dot( w , x ) )
Understanding backpropagation (gradient descent) for neural networks also becomes more intuitive if we treat them as functions.
In the above function,
sigmoid : applies sigmoid activation function to the argument.
dot : returns the dot product of two vectors.
Also, use vector notation as far as possible. It saves you from the confusion related to summations.
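As a concrete toy version of the composite function above (the numbers are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.4, -0.2, 0.1])  # weights
x = np.array([1.0, 2.0, 3.0])   # features
y = sigmoid(np.dot(w, x))       # single-neuron prediction
print(y)                        # ~0.574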
Hope it helps.
Activation functions serve an important role in neural network models: given an appropriate choice of activation function, they grant the network the ability to model non-linear datasets.
The example illustrated in the figure you posted (rendered below) will be limited to modelling linear problems where the output value is between 0 and 1 (the range of the sigmoid function). However, the model would support non-linear datasets if the sigmoid were applied to the two nodes in the middle. StackOverflow is not the place to discuss the theoretical foundation of why this works; instead, I recommend some light reading like this ebook: Neural Networks and Deep Learning (no affiliation).
As a side note: the final, output layer of a network is sometimes instantiated as a simple sum, or a ReLU. This widens the range of the network's output.
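A quick numerical check of the linearity point: hidden layers without an activation collapse into a single linear map, so no non-linearity is gained (toy numbers):

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))        # input -> hidden
W2 = rng.normal(size=(2, 4))        # hidden -> output
x = rng.normal(size=3)

y = W2 @ (W1 @ x)                   # two "layers" with no activation
y_collapsed = (W2 @ W1) @ x         # one equivalent linear layer
print(np.allclose(y, y_collapsed))  # True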
I want to predict a value. I have one time series, as well as a bunch of other time series that may be useful for augmenting the prediction.
Someone is arguing with me that finding the correlation between two non-stationary time series is the same as finding the correlation after making both stationary by some sort of differencing. Their logic is that a state space model doesn't care.
Isn't the whole idea of regression to exploit correlations to predict values? Doesn't a correlation have to exist for the model to explain variance in the data rather than increase the variance of the predictions? Also, I am 100% convinced that finding the correlation between two non-stationary time series without doing anything is wrong: you'll end up with correlations to time and not between the variables themselves.
Any input is helpful. Thanks.
It depends on the models you're employing later on. You say that there has to exist a correlation or else the variance in the predictions will increase. That might hold for some models. Rather, I'd recommend going for models that have some model selection built in.
Think of LASSO, for example, which gives sparse coefficient vectors. Or think of a model that allows you to calculate variable importance and base your decisions on that outcome.
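For instance, a rough sketch with scikit-learn's Lasso on differenced toy series (all names and numbers are made up); the near-zero coefficients effectively deselect the unhelpful series:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)).cumsum(axis=0)  # toy non-stationary candidate series
target = X[:, 0] + rng.normal(size=200)       # depends on the first series only

dX = np.diff(X, axis=0)  # first differences of the predictors
dy = np.diff(target)     # first differences of the target

model = Lasso(alpha=0.1).fit(dX, dy)
print(model.coef_)       # near-zero entries drop the unhelpful series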
Second, let's do some math:
Correlation (original) = E[X(t)·Y(t)]
Correlation (differenced) = E[(X(t) − X(t−1))·(Y(t) − Y(t−1))] = E[X(t)Y(t)] − E[X(t)Y(t−1)] − E[X(t−1)Y(t)] + E[X(t−1)Y(t−1)]
If you assume that each time series is uncorrelated with the other's previous sample, the cross terms vanish and this reduces to
= E[X(t)Y(t)] + E[X(t−1)Y(t−1)]
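A quick simulation of why the raw correlation misleads for non-stationary series: two independent random walks routinely show a large sample correlation, which differencing removes (toy example):

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1000).cumsum()  # independent random walk
y = rng.normal(size=1000).cumsum()  # another, unrelated random walk

print(np.corrcoef(x, y)[0, 1])                    # often far from 0 (spurious)
print(np.corrcoef(np.diff(x), np.diff(y))[0, 1])  # near 0 after differencing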
So I have a very challenging MCMC run I would like to do in PyMC, which I have used several times before for much simpler analyses. However, my newest challenge requires me to combine many different Gaussian processes in a very specific way, and I don't know enough about Gaussian processes in general, or about how they are implemented in PyMC, to engineer the code I need.
Here is the problem I am trying to tackle:
The data I have is five time series (we'll call them A(t), B(t), C(t), D(t), and E(t)), each measurement of which has Gaussian/normal uncertainties. Each of these can be modeled as the product of one series-specific efficiency function and one underlying function shared between all five time series, so A(t) = a(t) * f(t), B(t) = b(t) * f(t), C(t) = c(t) * f(t), etc. I need to measure the posterior for f(t), or more specifically, the posterior of the integral of f(t) dt over a domain.
So I have read over some documentation about implementing Gaussian processes in PyMC, but there are a few additional wrinkles with my efficiency functions that need to be addressed before I can start coding up my model. Mainly:
1) I have no strong prior about the shape of the efficiency functions a(t), b(t), etc. As long as they vary smoothly, no shape is strongly forbidden.
2) These efficiency functions are physically bound to lie between 0 and 1 at all times. So while I have no prior on the shape of each curve, it has to fall between these bounds. I do have some prior about its typical value, but since I need to marginalize over it, I can't put too many other constraints on it.
Has anyone out there tackled a similar type of problem before, and what might be the most elegant way to guarantee that my efficiency priors are respected in this complex MCMC run? I simply don't know enough about Gaussian processes/covariance functions to know how to force these constraints on the data.
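One common way to get a smooth function bounded to (0, 1) is to place the GP on an unconstrained latent function and squash it through an inverse-logit warp. A minimal sketch for a single efficiency function, assuming a recent PyMC version; the priors and names here are illustrative guesses, not a vetted model:

import numpy as np
import pymc as pm

t = np.linspace(0, 10, 50)[:, None]  # observation times (GP inputs must be 2-D)

with pm.Model() as model:
    ls = pm.Gamma("ls", alpha=2, beta=1)            # lengthscale prior -> smoothness
    cov = pm.gp.cov.ExpQuad(1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov)
    g = gp.prior("g", X=t)                          # unconstrained latent function
    a = pm.Deterministic("a", pm.math.invlogit(g))  # efficiency squashed into (0, 1)
    # ...multiply by a similar GP for f(t) and add the likelihood terms here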
I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts cluster datasets such as mine using their 'expert knowledge', but I want to automate this clustering process.
Most clustering algorithms use a distance (Euclidean, Mahalanobis, and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types; i.e., we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are normally distributed.
So it became clear to me that algorithms like K-means will not produce good results.
At the moment I am trying to work with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I have faced another obstacle: the results of the clustering are really hard to interpret, if not impossible. As a result, I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like decision trees for classification.
So, my question is:
Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (that is an ultimate goal of the whole project), because the field of study is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I can get an understandable description of the clusters.
Decision Tree
As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. But unfortunately, interpreting these rules is almost as hard as interpreting the raw clustering output. First, only the first few levels of rules from the root node make any sense; the closer to the leaves, the less sense they make. Secondly, these rules don't match any expert knowledge.
So I came to the conclusion that the clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result, we would get a decision tree with clusters as leaves and a description of the clusters out of the box. But I haven't tried to do so, and I am not sure about the accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. it doesn't need to satisfy the triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a distance function other than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions; say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
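A rough sketch of such a composed dissimilarity; the function name and the harmonic-mean combination are just one possible reading of the suggestion above:

import numpy as np

def mixed_distance(a_num, a_cat, b_num, b_cat, scale=1.0):
    # Scaled Euclidean distance on the numeric part
    d_num = np.linalg.norm(np.asarray(a_num) - np.asarray(b_num)) / scale
    # Fraction of mismatching categorical attributes
    d_cat = sum(p != q for p, q in zip(a_cat, b_cat)) / len(a_cat)
    if d_num + d_cat == 0:
        return 0.0
    return 2 * d_num * d_cat / (d_num + d_cat)  # harmonic mean of the two

print(mixed_distance([1.0, 2.0], ["glass"], [1.5, 2.5], ["steel"]))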
Once you have a decent similarity function, a whole bunch of algorithms become available to you, e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you; they have a tutorial on writing custom distance functions.
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. Of course, you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. The one really nice feature of decision trees is that they are somewhat human-understandable. But just as a support vector machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it actually works with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
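A sketch of that post-processing idea with scikit-learn, on toy numeric data (the feature names are arbitrary):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 4))                                 # toy data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # any clustering works here

# Fit a shallow tree to the cluster labels, then read it as a set of rules
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree, feature_names=[f"f{i}" for i in range(4)]))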
There was a related publication last year. It is a bit obscure and experimental (presented at a workshop at ECML-PKDD), and it requires the data set to have quite extensive ground truth in the form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to, for example, say "this cluster is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well; you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (workshops are for work-in-progress research). You might be able to use this by just using your features as ground truth. It should then detect whether a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to this type of clustering problem is to define a statistical model that captures the relevant characteristics of your data. Cluster assignments can then be derived with a mixture model (as in the Gaussian mixture model) by finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph).
There are a number of frameworks available for running Monte Carlo inference on statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N;          // number of data points
  int<lower=0> C;          // number of categories
  array[N] real x;         // normally distributed component data
  array[N] int y;          // categorical component data
}
parameters {
  real<lower=0, upper=1> theta;  // mixture probability
  array[2] real mu;              // means for the normal component
  array[2] simplex[C] phi;       // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta = log(theta);
  real log_one_minus_theta = log(1.0 - theta);
  array[2] vector[C] log_phi;
  vector[C] alpha;
  for (c in 1:C)
    alpha[c] = 0.5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k, c] = log(phi[k, c]);
}
model {
  theta ~ uniform(0, 1);  // equivalently, ~ beta(1, 1)
  for (k in 1:2) {
    mu[k] ~ normal(0, 10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    target += log_sum_exp(log_theta + normal_lpdf(x[n] | mu[1], 1) + log_phi[1, y[n]],
                          log_one_minus_theta + normal_lpdf(x[n] | mu[2], 1) + log_phi[2, y[n]]);
  }
}
I compiled and ran the Stan model, and used the parameters from the final sample to compute the probability of each data point under each mixture component. I then assigned each data point to the mixture component (cluster) with the higher probability to recover the cluster assignments.
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogeneous, non-Euclidean data vectors such as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
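As an illustration, a minimal sketch using SciPy's agglomerative clustering on a precomputed mixed-type dissimilarity matrix; the distance function here is a toy stand-in for whatever attribute ordering you settle on:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def toy_mixed_distance(a, b):
    # numeric difference plus a penalty for a categorical mismatch
    return abs(a[0] - b[0]) + (a[1] != b[1])

data = [(1.0, "glass"), (1.1, "glass"), (5.0, "steel"), (5.2, "steel")]
n = len(data)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = toy_mixed_distance(data[i], data[j])

Z = linkage(squareform(D), method="average")   # agglomerative, average linkage
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]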