Math function with three variables (correlation) - algorithm

I want to analyse some data in order to program a pricing algorithm.
The following data are available:
I need a function/correlation factor of the three variables/dimensions which shows how the median (price) changes as the three dimensions (pers_capacity, number of bedrooms, number of bathrooms) grow.
e.g. Y(#pers_capacity,bedroom,bathroom) = ..
note:
- the screenshot below does not show all of the available data (just a part of it)
- median => price per night
- yellow => #bathroom
e.g. for 2 persons, 2 bedrooms and 1 bathroom, the median price is $187 per night
Do you have any ideas how I can calculate the correlation/equation (f(..) = ...) in order to get a reliable factor?
Kind regards

One typical approach would be formulating this as a linear model. Given three variables x, y and z which explain your observed values v, you assume v ≈ ax + by + cz + d and try to find a, b, c and d which match this as closely as possible, minimizing the squared error. This is called a linear least squares approximation. You can also refer to this Math SE post for one example of a specific linear least squares approximation.
If your dataset is sufficiently large, you may consider more complicated formulas. Things like
v ≈ a1*x^2 + a2*y^2 + a3*z^2 + a4*x*y + a5*x*z + a6*y*z + a7*x + a8*y + a9*z + a10
The above is non-linear in the variables but still linear in the coefficients a_i, so it is still a linear least squares problem.
Or you could apply transformations to your variables, e.g.
v ≈ a1*x + a2*y + a3*z + a4*exp(x) + a5*exp(y) + a6*exp(z) + a7
Looking at the residual errors (i.e. difference between predicted and observed values) in any of these may indicate terms worth adding.
Personally I'd try all this in R, since computing linear models is just one line in that language, and visualizing data is fairly easy as well.
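To make this concrete, here is a minimal Python/NumPy sketch of the linear least squares fit (the data values are invented for illustration; in R the equivalent would be the one-liner lm(price ~ capacity + bedrooms + bathrooms)):

import numpy as np

# Hypothetical rows: (pers_capacity, bedrooms, bathrooms) and the observed median price
X = np.array([[2, 1, 1], [2, 2, 1], [4, 2, 1], [4, 2, 2], [6, 3, 2]], dtype=float)
v = np.array([120.0, 187.0, 210.0, 240.0, 300.0])

# Design matrix [x, y, z, 1] for the model v ≈ a*x + b*y + c*z + d
A = np.column_stack([X, np.ones(len(X))])

# Least squares solution minimizing the squared error ||A @ coef - v||^2
coef, residuals, rank, _ = np.linalg.lstsq(A, v, rcond=None)
a, b, c, d = coef
print(f"price ≈ {a:.2f}*capacity + {b:.2f}*bedrooms + {c:.2f}*bathrooms + {d:.2f}")

The quadratic and exp-transformed variants above fit the same way: just add columns such as x*y or exp(x) to the design matrix.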

Related

Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attribute

I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help further understanding it; if anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer's age, a real number with a maximum of 90 years and a minimum of 15 years)
− Cr ("credit rating"), an ordinal attribute with values 'very good', 'good', 'medium', 'poor', and 'very poor'
− Av_bal (average account balance, a real number with mean 7000 and standard deviation 4000)
Using the L1 distance function, compute the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function, compute the distance between the above-mentioned 2 customers.
My attempt with L1:
d(c1,c2) = (c1.cr - c2.cr)/4 + (c1.avg.bal - c2.avg.bal/4000) * (c1.age - mean.age/std.age) - (c2.age - mean.age/std.age)
The question, as it is, leaves some room for interpretation, mainly because similarity is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize the values so that they are roughly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution but only the range of the values, you will want to normalize them to [0, 1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal values you want to find a mapping to ordinal values if you want to use Lp-norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In your case you can transform the credit rating like this:
cr = {0 if 'very good', 1 if 'good', 2 if 'medium', 3 if 'poor', 4 if 'very poor'}
Afterwards, you can apply the same normalization as for age:
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation:
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr| + |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)^2 + (c1.norm_cr - c2.norm_cr)^2 + (c1.norm_av_bal - c2.norm_av_bal)^2)
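To make the computation concrete, here is a small Python sketch of the normalization and both distances for the two customers from the question (the function names are mine, not part of the assignment):

import math

def normalize(age, cr, av_bal):
    # 'very good' -> 0 ... 'very poor' -> 4, then scaled to [0, 1]
    cr_map = {'very good': 0, 'good': 1, 'medium': 2, 'poor': 3, 'very poor': 4}
    return ((age - 15) / (90 - 15), cr_map[cr] / 4, (av_bal - 7000) / 4000)

def l1(c1, c2):
    return sum(abs(a - b) for a, b in zip(c1, c2))

def l2(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

c1 = normalize(55, 'good', 7000)   # (0.533..., 0.25, 0.0)
c2 = normalize(25, 'poor', 1000)   # (0.133..., 0.75, -1.5)
print(l1(c1, c2))                  # 0.4 + 0.5 + 1.5 = 2.4
print(l2(c1, c2))                  # sqrt(0.16 + 0.25 + 2.25) ≈ 1.63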

Find equation based on known x and answers

So basically I have something like this
[always 8 numbers]
5-->10
2-->4
9-->18
7-->14
I know four x values and the answers for those four x. I need to find an equation that fits all of those x and their answers. I know there is an infinite number of possible equations, but I would like to find the shortest ones if possible.
For this example
x*2 or x+x fits the best
Of course something like x*3-x and an infinite number of other equations also work, but they are not the most optimal ones like x*2.
Any ideas, theories or algorithms that solve a similar problem?
Using the numbers you provided:
5-->10
2-->4
9-->18
7-->14
You want to find a, b, c and d that solve the system defined by:
ax^3 + bx^2 + cx + d = f(x)
So, in your case it is:
125a + 25b + 5c + d = 10
8a + 4b + 2c + d = 4
729a + 81b + 9c + d = 18
343a + 49b + 7c + d = 14
If you solve the system you'll find that (a,b,c,d) must be (0, 0, 2, 0). So, the minimum polynomial is 2x.
I made a website some time ago that solves this:
http://juanlopes.net/actually42/#5%2010%202%204%209%2018%207%2014/true/true
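For reference, a minimal NumPy sketch of setting up and solving that 4x4 system (the variable names are illustrative):

import numpy as np

xs = np.array([5.0, 2.0, 9.0, 7.0])
ys = np.array([10.0, 4.0, 18.0, 14.0])

# Vandermonde matrix with columns x^3, x^2, x, 1 for a*x^3 + b*x^2 + c*x + d
A = np.vander(xs, 4)
a, b, c, d = np.linalg.solve(A, ys)
print(a, b, c, d)   # -> 0.0, 0.0, 2.0, 0.0 (up to rounding), i.e. f(x) = 2x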
If your goal is to fit the data to a polynomial function, i.e. something like:
f(x) = a_0 + a_1*x + a_2*x^2 + ... + a_n*x^n where each a_i is a real (or complex) number,
then there is some theory available as to when it is possible to put all those points on a single curve. What you can do is pick a degree (the highest power of x) and then write down a system of equations and solve the system (or try to solve it). For example, if the degree is 2, then your data become:
10 = a_0 + a_1*5 + a_2*5^2
4 = a_0 + a_1*2 + a_2*2^2
etc
If you are able to solve the system, then great. If not, you need a larger degree. Solving the system can be done (built in) in many languages via matrix multiplication.
You may want to start out by asking: can my data all fit on a polynomial of degree 1? If yes, done. If not, does it fit on a degree 2 polynomial? If yes, done. If not, degree 3, etc.
Be careful though, because in general you may have data that you cannot fit "exactly" to a polynomial (or any function, for that matter). If you just want a low-degree polynomial that is very close, then you want to look into polynomial regression (which will give you a best-fit polynomial), see: http://en.wikipedia.org/wiki/Polynomial_regression
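As a sketch of that escalate-the-degree idea, the following Python snippet (the tolerance and names are my choices, not prescribed) tries increasing degrees until the fit is exact:

import numpy as np

def lowest_degree_fit(xs, ys, tol=1e-9):
    # try degree 0, 1, 2, ... until the polynomial reproduces every point
    for deg in range(len(xs)):
        coeffs = np.polyfit(xs, ys, deg)
        if np.max(np.abs(np.polyval(coeffs, xs) - ys)) < tol:
            return coeffs          # highest power first
    return coeffs                  # degree n-1 always interpolates n distinct points

print(lowest_degree_fit([5, 2, 9, 7], [10, 4, 18, 14]))   # ≈ [2.0, 0.0], i.e. 2x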

Gaussian Mixture Model - Matlab training for parameters

I am running a speech enhancement algorithm based on a Gaussian Mixture Model. The problem is that the estimation algorithm underflows during the training process.
I am trying to calculate the PDF of a log-spectrum frame X given a Gaussian cluster, which is a product of the PDFs of each frequency component X_k (the FFT is done for k = 1..256).
What I get is a product of 256 factors exp(-v(k)) such that v(k) >= 0.
Here is a snippet of the MATLAB calculation:
N - number of frames; M - number of mixtures; c_i - weight of each mixture;
gamma(n,i) = c_i*f(X_n|I = i)
for i = 1:N
    % replicate frame i so it can be compared against all M mixtures at once
    rep_DataMat(:,:,i) = repmat(DataMat(:,i), 1, M);
    % per-mixture Gaussian density of every frequency component
    gamma_exp(:,:) = (1./sqrt(2*pi*sigmaSqr_curr)).*exp((-1)*((rep_DataMat(:,:,i) - mue_curr).^2)./(2*sigmaSqr_curr));
    % product over the 256 components -- this is where the underflow happens
    gamma_curr(i,:) = c_curr.*(prod(10*gamma_exp(:,:),1));
    alpha_curr(i,:) = gamma_curr(i,:)./sum(gamma_curr(i,:));
end
The product quickly goes to zero because K = 256 and the factors are smaller than one. Is there a way I can calculate this without causing an underflow (like logsum or similar)?
You can perform the computations in the log domain.
The conversion of products into sums is straightforward.
Sums, on the other hand, can be converted with something such as logsumexp.
This works using the formula:
log(a + b) = log(exp(log(a)) + exp(log(b)))
= log(exp(loga) + exp(logb))
Where loga and logb are the respective representation of a and b in the log domain.
The basic idea is then to factor out the exponential with the largest argument (e.g. loga for the sake of illustration):
log(exp(loga)+exp(logb)) = log(exp(loga)*(1+exp(logb-loga)))
= loga + log(1+exp(logb-loga))
Note that the same idea applies if you have more than 2 terms to add.
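Here is a minimal NumPy sketch of the idea applied to the mixture responsibilities (the shapes and names are my assumptions based on the question, not the original MATLAB variables):

import numpy as np

def logsumexp(log_terms):
    # log(sum(exp(t))) computed stably by factoring out the largest exponent
    m = np.max(log_terms)
    return m + np.log(np.sum(np.exp(log_terms - m)))

def log_responsibilities(x, c, mu, sigma_sq):
    # x: (K,) one frame; c: (M,) mixture weights; mu, sigma_sq: (K, M)
    # log of the per-component Gaussian densities; the product over K becomes a sum
    log_pdf = -0.5*np.log(2*np.pi*sigma_sq) - (x[:, None] - mu)**2 / (2*sigma_sq)
    log_gamma = np.log(c) + log_pdf.sum(axis=0)    # (M,): large negatives, no underflow
    return log_gamma - logsumexp(log_gamma)        # log of alpha = gamma / sum(gamma)

SciPy users can replace the helper with scipy.special.logsumexp.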

fast multiplications

When I compute the series 1+x+x^2+x^3+..., I would prefer to do it like this: (1+x)(1+x^2)(1+x^4)... (which is a sort of repeated squaring), so that the number of multiplications can be significantly reduced.
Now I want to compute the series 1+x/1!+(x^2)/2!+(x^3)/3!+... How can I use similar techniques to reduce the number of multiplications?
Any suggestions are warmly welcome!
The optimization method you refer to is probably Horner's method:
a + bx + cx^2 + dx^3 = ((c + dx)x + b)x + a
The alternating product A*(1-x)(1+x^2)(1+x^4)(1+x^8)..., on the other hand, is useful for approximating the division A/(1+x), where x is small.
The Taylor series Σ x^n/n! for exp(x) converges rather slowly; other approximations are better suited to getting accurate values. If there is a trick to compute it with fewer multiplications, it is to iterate with a temporary value:
sum = 1.0; temp = x; k = 1;    // temp holds x^i, k holds i!
// the sum after the first iteration is (1 + x), i.e. 1 + x^1/1!
for (i = 1; i <= N; i++) { sum = sum + temp/k; temp = temp*x; k = k*(i+1); }
// or, with Horner's rule, from the highest term down:
prod = 1.0; for (i = N; i > 0; i--) prod = prod * x/(double)i + 1.0;
Accumulating the factorial and dividing only once should increase accuracy a bit -- in a real-life situation it may be advisable either to combine the update into temp = temp*x/(i+1) in order to be able to iterate much further, or to use a lookup table for the constants a_n = 1/n!, as one typically needs just a few terms (4 or 5 terms for sin/cos).
As it turns out, Horner's rule doesn't play much of a role in the transformation of the geometric series Σ x^n into product form. To calculate the exponential, other powerful techniques have to be applied -- typically range reduction and rational (Padé) or polynomial (Chebyshev) approximations and such.
Converting comment to an answer:
Note that for the first series, there is an exact equivalence:
1+x+x^2+x^3+...+x^n = (1-x^(n+1))/(1-x)
Using it, you can compute it much, much faster.
The second one is the series expansion of e^x; you might want to use the standard math library functions pow(e, x) or exp(x) instead.
Regarding your approach to the first series: don't you think that using 1 + x(1 + x(1 + x(1 + x)...)) would be a better approach? A similar approach can be applied to the second series: 1 + x/1 (1 + x/2 (1 + x/3 (1 + x/4 (...))))
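Pulling the answers together, a small Python sketch of both suggestions (the closed form for the geometric series and the nested evaluation for the exp series; purely illustrative):

def geometric_sum(x, n):
    # 1 + x + x^2 + ... + x^n via the closed form: one power, one division
    return (1 - x**(n + 1)) / (1 - x) if x != 1 else n + 1

def exp_series(x, n):
    # 1 + x/1! + ... + x^n/n! evaluated as 1 + x/1*(1 + x/2*(1 + x/3*(...)))
    acc = 1.0
    for i in range(n, 0, -1):
        acc = acc * x / i + 1.0
    return acc

print(geometric_sum(0.5, 7))   # 1.9921875, equals (1+0.5)(1+0.25)(1+0.0625)
print(exp_series(1.0, 12))     # 2.718281828... ≈ e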

Algorithm for calculating the sum-of-squares distance of a rolling window from a given line function

Given a line function y = a*x + b (a and b are previously known constants), it is easy to calculate the sum-of-squares distance between the line and a window of samples (1, Y1), (2, Y2), ..., (n, Yn) (where Y1 is the oldest sample and Yn is the newest):
sum((Yx - (a*x + b))^2 for x in 1,...,n)
I need a fast algorithm for calculating this value for a rolling window (of length n) - I cannot rescan all the samples in the window every time a new sample arrives.
Obviously, some state should be saved and updated for every new sample that enters the window and every old sample leaves the window.
Notice that when a sample leaves the window, the indices of the remaining samples change as well: every Yx becomes Y(x-1). Therefore, when a sample leaves the window, every other sample in the window contributes a different value to the new sum: (Yx - (a*(x-1) + b))^2 instead of (Yx - (a*x + b))^2.
Is there a known algorithm for calculating this? If not, can you think of one? (It is ok to have some mistakes due to first-order linear approximations).
Won't a straightforward approach do the trick?...
By 'straightforward' I mean maintaining a queue of samples. Once a new sample arrives, you would:
pop the oldest sample from the queue
subtract its distance from your sum
append the new sample to the queue
calculate its distance and add it to your sum
As for time, everything here is O(1) if the queue is implemented as a linked list or something similar. You would want to store the distance with each sample in the queue, too, so you calculate it only once. The memory usage is thus 3 floats per sample, i.e. O(n).
If you expand the term (Yx - (a*x + b))^2, the terms break into three parts:
1. Terms of only a, x and b. These produce some constant when summed over n and can be ignored.
2. Terms of only Yx and b. These can be handled in the style of a boxcar integrator as @Xion described.
3. One term of -2*Yx*a*x. The -2*a is a constant, so ignore that part. Consider the partial sum S = Y1*1 + Y2*2 + Y3*3 + ... + Yn*n. Given Y1 and a running sum R = Y1 + Y2 + ... + Yn, you can form S - R, which eliminates Y1*1 and reduces each of the other terms by one index, leaving you with Y2*1 + Y3*2 + ... + Yn*(n-1). Now update the running sum R as in (2) by subtracting off Y1 and adding Y(n+1). Add the new Yn*n term to S.
Now just add up all those partial terms.
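Here is one way the whole scheme might look in Python; the class name and interface are my own, and the update uses the S - R shift described in (3):

from collections import deque

class RollingSSD:
    # Sum of squared distances of a sliding window from the line y = a*x + b.
    # Maintains R = sum(Yx), S = sum(x*Yx), Q = sum(Yx^2) so each update is O(1).
    def __init__(self, a, b, samples):
        self.a, self.b = a, b
        self.win = deque(samples)
        n = len(samples)
        self.R = sum(samples)
        self.S = sum(x * y for x, y in zip(range(1, n + 1), samples))
        self.Q = sum(y * y for y in samples)
        self.C = sum((a * x + b)**2 for x in range(1, n + 1))   # constant part, see (1)

    def slide(self, y_new):
        n = len(self.win)
        y_old = self.win.popleft()
        self.S -= self.R                      # drops Y1*1 and shifts every index down by one
        self.R += y_new - y_old               # the boxcar update, see (2)
        self.S += y_new * n                   # the new sample enters at index n
        self.Q += y_new * y_new - y_old * y_old
        self.win.append(y_new)

    def value(self):
        # sum (Yx - (a*x + b))^2 = Q - 2a*S - 2b*R + C
        return self.Q - 2*self.a*self.S - 2*self.b*self.R + self.C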
