Which PageRank formula should I use? - pagerank

I see a lot of people using PR(A) = 1 - d + d * \sum PR(E)/L(E) as the PageRank formula. But it's really PR(A) = (1 - d)/N + d * \sum PR(E)/L(E). I know it won't make much of a difference, since if PR(A) > PR(B) under one formula, the same holds under the other. Also, in Larry Page's paper on PageRank he said that all PageRanks added together should equal 1.

If you want all values to sum to one, you should use the (1-d)/N + d * sum(...) version.
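To see the difference between the two versions, here is a minimal sketch that iterates both on a made-up three-page graph. The graph, the function name `pagerank` and the `base` parameter are assumptions for illustration only, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, base, d=0.85, iterations=200):
    """Iterate PR(p) = base + d * sum(PR(q) / L(q)) over pages q linking to p."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: base for page in links}
        for page, targets in links.items():
            # Each page passes a damped, equal share of its rank to its targets.
            share = d * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85
n = len(links)

normalized = pagerank(links, (1 - d) / n, d)   # (1 - d)/N + d * sum(...)
unnormalized = pagerank(links, 1 - d, d)       # 1 - d + d * sum(...)

print(sum(normalized.values()))    # close to 1
print(sum(unnormalized.values()))  # close to N
```

On a graph like this the two results differ only by the constant factor N, so the ordering of the pages is the same either way.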

Related

Problog - probabilistic graph example

I am going through the following example of a ProbLog probabilistic graph.
I tried to compute the probability of a path from 1 to 5. Here are my manual computations:
0.6*0.4+0.1*0.3*0.8 = 0.264
However, Problog returns P(path(1,5)) = 0.25824
Am I computing it correctly?
No, you can't just add up the probabilities of the different paths. To see that, just assume that both paths from 1 to 5 had a probability of 0.7 each. You would get a probability of 1.4, which is clearly wrong because a probability can never exceed 1.
The way to calculate the probability of at least one of two independent events A and B is to compute the probability that neither is true and then take the complement of that event. (The two paths here share no edges, so they are independent.)
P(1->2->5) = 0.24
P(1->3->4->5) = 0.024
P(either) = 1 - (1 - 0.24) * (1 - 0.024)
= 1 - 0.74176
= 0.25824
Sorry for probably bad terminology, my statistics knowledge is a bit rusty.
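The calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not of how ProbLog works internally. The helper names are made up, and the formula relies on the two paths sharing no edges.

```python
from math import prod


def path_probability(edge_probabilities):
    """A path exists only if all of its (independent) edges exist."""
    return prod(edge_probabilities)


def any_path_probability(path_probabilities):
    """Probability that at least one of several independent paths exists."""
    return 1.0 - prod(1.0 - p for p in path_probabilities)


p_1_2_5 = path_probability([0.6, 0.4])          # 0.24
p_1_3_4_5 = path_probability([0.1, 0.3, 0.8])   # 0.024
p_either = any_path_probability([p_1_2_5, p_1_3_4_5])

print(round(p_either, 5))  # 0.25824
```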

Bresenham’s Circle Algorithm

https://www.geeksforgeeks.org/bresenhams-circle-drawing-algorithm/
I was looking at Bresenham's algorithm, which I'm trying to use to make an MS Paint style application. I've implemented it in Python and it works. However, I was not sure HOW it works. I understood all of the algorithm except for the decision parameter. Specifically, why does it have to be d = 3 - (2 * r), d = d + (4*x) + 6 or d = d + 4 * (x - y) + 10? Is anyone familiar with the algorithm, or does anyone understand the math behind how these were derived? I understood the theory behind the algorithm for lines, but I'm having a hard time understanding the circle drawing.
If you just drew pixel (x,y), then the next pixel to be drawn is either (x+1,y) or (x+1,y-1)
The actual condition used to determine which one to choose is approximately which one is closest to the ideal circle. Specifically, (x+1,y-1) is chosen if (x+1)² + y² - r² > r² - (x+1)² - (y-1)², i.e. if (x+1,y) lies further outside the circle than (x+1,y-1) lies inside it.
Collecting like terms, this simplifies to 2(x+1)² + y² + (y-1)² - 2r² > 0
Expanding gives 2x² + 2y² - 2r² + 4x - 2y + 3 > 0
That expression on the left is d. Initially, x=0 and y=r, so most of those terms are zero or cancel out and we have d = 3 - 2y = 3 - 2r
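The other expressions you ask about indicate how d changes after you pick the next pixel: moving to (x+1,y-1) adds 4(x - y) + 10 to d, and moving to (x+1,y) adds 4x + 6. The two differences are worked out here, in that order: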
http://www.wolframalpha.com/input/?i=simplify+(2(x%2B2)%C2%B2+%2B+(y-1)%C2%B2+%2B+(y-2)%C2%B2+-+2r%C2%B2)+-+(2(x%2B1)%C2%B2+%2B+y%C2%B2+%2B+(y-1)%C2%B2+-+2r%C2%B2)
http://www.wolframalpha.com/input/?i=simplify+(2(x%2B2)%C2%B2+%2B+y%C2%B2+%2B+(y-1)%C2%B2+-+2r%C2%B2)+-+(2(x%2B1)%C2%B2+%2B+y%C2%B2+%2B+(y-1)%C2%B2+-+2r%C2%B2)
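Putting the pieces together, one octant of the circle can be generated as in the sketch below. The function name and the `x <= y` loop bound are my own choices, and the updates use the values of x and y from before the step, as in the derivation above. Other write-ups may order the steps differently.

```python
def circle_octant(r):
    """Return the pixels of the octant from (0, r) towards the line x = y."""
    x, y = 0, r
    d = 3 - 2 * r  # value of 2(x+1)^2 + y^2 + (y-1)^2 - 2r^2 at (0, r)
    points = []
    while x <= y:
        points.append((x, y))
        if d > 0:
            # (x+1, y-1) is closer to the ideal circle: step diagonally.
            d += 4 * (x - y) + 10
            y -= 1
        else:
            # (x+1, y) is closer: keep y.
            d += 4 * x + 6
        x += 1
    return points


print(circle_octant(5))  # [(0, 5), (1, 5), (2, 5), (3, 4)]
```

The remaining seven octants follow by mirroring each point, e.g. (x, y) to (y, x), (-x, y) and so on.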

Math function with three variables (correlation)

I want to analyse some data in order to program a pricing algorithm.
The following data are available:
I need a function/correlation factor of the three variables/dimensions which shows how the median (price) changes as the three dimensions (pers_capacity, number of bedrooms, number of bathrooms) grow.
e.g. Y(#pers_capacity,bedroom,bathroom) = ..
note:
- the screenshot below does not show all the data (just a part of it)
- median => price per night
- yellow => #bathroom
e.g. for 2 persons, 2 bedrooms and 1 bathroom, the median price is $187 per night
Do you have any ideas how I can calculate the correlation/equation (f(..)=...) in order to get a reliable factor?
Kind regards
One typical approach would be formulating this as a linear model. Given three variables x, y and z which explain your observed values v, you assume v ≈ ax + by + cz + d and try to find a, b, c and d which match this as closely as possible, minimizing the squared error. This is called a linear least squares approximation. You can also refer to this Math SE post for one example of a specific linear least squares approximation.
If your dataset is sufficiently large, you may consider more complicated formulas. Things like
v ≈ a₁x² + a₂y² + a₃z² + a₄xy + a₅xz + a₆yz + a₇x + a₈y + a₉z + a₁₀
The above is non-linear in the variables but still linear in the coefficients aᵢ, so it's still a linear least squares problem.
Or you could apply transformations to your variables, e.g.
v ≈ a₁x + a₂y + a₃z + a₄exp(x) + a₅exp(y) + a₆exp(z) + a₇
Looking at the residual errors (i.e. difference between predicted and observed values) in any of these may indicate terms worth adding.
Personally I'd try all this in R, since computing linear models is just one line in that language, and visualizing data is fairly easy as well.
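For readers without R, here is a minimal pure-Python sketch of the simplest model, v ≈ ax + by + cz + d, fitted through the normal equations. The function names are made up, and the data are synthetic (generated from v = 20x + 30y + 15z + 50), not taken from the question. In practice a library routine would be the better choice.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_linear_model(rows, values):
    """Least-squares fit of v = a*x + b*y + c*z + d; returns [a, b, c, d]."""
    design = [list(r) + [1.0] for r in rows]  # extra column for the constant d
    p = len(design[0])
    ata = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    atv = [sum(r[i] * v for r, v in zip(design, values)) for i in range(p)]
    return solve_linear_system(ata, atv)


# Synthetic (persons, bedrooms, bathrooms) rows and prices, for illustration only.
rows = [(2, 1, 1), (2, 2, 1), (4, 2, 1), (4, 2, 2), (6, 3, 2), (8, 4, 3)]
prices = [135, 165, 205, 220, 290, 375]

a, b, c, d = fit_linear_model(rows, prices)
print(a, b, c, d)  # approximately 20, 30, 15, 50
```

The quadratic and exp(...) variants only change how each row of `design` is built. The solving step stays the same because the model remains linear in the coefficients.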

What is the most numerically precise method for dividing sums or differences?

Consider the operation (a-b)/(c-d), where a, b, c and d are floating-point numbers (namely, double type in C++). Both (a-b) and (c-d) are (sum-correction) pairs, as in the Kahan summation algorithm. Briefly, what is special about these (sum-correction) pairs is that sum holds a value that is large relative to what is in correction. More precisely, correction contains what didn't fit in sum during summation due to numerical limitations (53 bits of mantissa in double type).
What is the numerically most precise way to calculate (a-b)/(c-d), given this property of the numbers?
Bonus question: it would be better to get the result also as a (sum-correction) pair, as in the Kahan summation algorithm. That is, to find (e-f)=(a-b)/(c-d), rather than just e=(a-b)/(c-d).
The div2 algorithm of Dekker (1971) is a good approach.
It requires a mul12(p,q) algorithm which exactly computes a pair (u, v) such that u+v = p*q. Dekker uses a method known as Veltkamp splitting, but if you have access to an fma function, then a much simpler method is
u = p*q
v = fma(p,q,-u)
The actual division then looks like this (I've had to change some of the signs since Dekker uses additive pairs instead of subtractive):
r = a/c
u,v = mul12(r,c)
s = (a - u - v - b + r*d)/c
Then the sum r+s is an accurate approximation to (a-b)/(c-d).
UPDATE: The subtraction and addition are assumed to be left-associative, i.e.
s = ((((a-u)-v)-b)+r*d)/c
This works because if we let rr be the error in the computation of r (i.e. r + rr = a/c exactly), then since u+v = r*c exactly, we have that rr*c = a-u-v exactly, so therefore (a-u-v-b)/c gives a fairly good approximation to the correction term of (a-b)/c.
The final r*d arises due to the following:
(a-b)/(c-d) = (a-b)/c * c/(c-d) = (a-b)/c *(1 + d/(c-d))
= [a-b + (a-b)/(c-d) * d]/c
Now r is also a fairly good initial approximation to (a-b)/(c-d) so we substitute that inside the [...], so we find that (a-u-v-b+r*d)/c is a good approximation to the correction term of (a-b)/(c-d)
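Here is a small Python sketch of the steps above, useful for checking the idea against exact rational arithmetic. The names `mul12` and `div2` follow the answer. Because `math.fma` is only available in recent Python versions, the product error is computed here with `fractions.Fraction` as a stand-in for the fma step. The sample values are made up.

```python
from fractions import Fraction


def mul12(p, q):
    """Return (u, v) with u + v == p * q exactly (barring under/overflow)."""
    u = p * q
    # Stand-in for v = fma(p, q, -u): the rounding error of the product.
    v = float(Fraction(p) * Fraction(q) - Fraction(u))
    return u, v


def div2(a, b, c, d):
    """Approximate (a - b) / (c - d) as the additive pair r + s."""
    r = a / c
    u, v = mul12(r, c)
    # Left-associative evaluation, as in the UPDATE above.
    s = ((((a - u) - v) - b) + r * d) / c
    return r, s


# Made-up (sum, correction) pairs: the corrections are below half an ulp.
a, b = 1.0, 1e-17
c, d = 3.0, 1e-17

r, s = div2(a, b, c, d)
exact = (Fraction(a) - Fraction(b)) / (Fraction(c) - Fraction(d))
naive = (a - b) / (c - d)

error_pair = abs(Fraction(r) + Fraction(s) - exact)
error_naive = abs(Fraction(naive) - exact)
print(float(error_pair), float(error_naive))
```

For the subtractive form (e-f) asked for in the bonus question, take e = r and f = -s.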
For tiny corrections (b much smaller than a, d much smaller than c), maybe think of the first-order expansion
(a - b) / (c - d) = a/c (1 - b/a) / (1 - d/c) ~ a/c (1 - b/a + d/c)

Algorithm for multidimensional optimization / root-finding / something

I have five values, A, B, C, D and E.
Given the constraint A + B + C + D + E = 1, and five functions F(A), F(B), F(C), F(D), F(E), I need to solve for A through E such that F(A) = F(B) = F(C) = F(D) = F(E).
What's the best algorithm/approach to use for this? I don't care if I have to write it myself, I would just like to know where to look.
EDIT: These are nonlinear functions. Beyond that, they can't be characterized. Some of them may eventually be interpolated from a table of data.
There is no general answer to this question. A solver finding the solution to any equation does not exist. As Lance Roberts already says, you have to know more about the functions. Just a few examples
If the functions are twice differentiable, and you can compute the first derivative, you might try a variant of Newton-Raphson
Have a look at the Lagrange Multiplier Method for implementing the constraint.
If the function F is continuous (which it probably is, if it is an interpolant), you could also try the Bisection Method, which is a lot like binary search.
Before you can solve the problem, you really need to know more about the function you're studying.
As others have already posted, we do need some more information on the functions. However, given that, we can still try to solve the following relaxation with a standard non-linear programming toolbox.
min k
st.
A + B + C + D + E = 1
F1(A) - k = 0
F2(B) - k = 0
F3(C) -k = 0
F4(D) - k = 0
F5(E) -k = 0
Now we can solve this in any manner we wish, such as a penalty method
min k + mu * sum_i (Fi(x_i) - k)^2
st
A+B+C+D+E = 1
or a straightforward SQP or interior-point method.
More details and I can help advise as to a good method.
The functions are all monotonically increasing with their argument. Beyond that, they can't be characterized. The approach that worked turned out to be:
1) Start with A = B = C = D = E = 1/5
2) Compute F1(A) through F5(E) and take their average (their sum divided by 5). Then recalculate A through E such that each function equals that average.
3) Rescale the new A through E so that they all sum to 1, and recompute F1 through F5.
4) Repeat until satisfied.
It converges surprisingly fast - just a few iterations. Of course, each iteration requires 5 root finds for step 2.
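The four steps can be sketched as follows. This is an illustration under stated assumptions, not the original poster's code. The five example functions are made up, all are increasing with F(0) = 0, and the root finds of step 2 use plain bisection on the assumed bracket [0, 10].

```python
def invert_increasing(f, target, lo=0.0, hi=10.0, steps=100):
    """Bisection: find x in [lo, hi] with f(x) = target, for increasing f."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equalize(functions, iterations=50):
    """Find xs summing to 1 such that all functions[i](xs[i]) are equal."""
    n = len(functions)
    xs = [1.0 / n] * n                                    # step 1
    for _ in range(iterations):                           # step 4
        average = sum(f(x) for f, x in zip(functions, xs)) / n
        xs = [invert_increasing(f, average) for f in functions]  # step 2
        total = sum(xs)
        xs = [x / total for x in xs]                      # step 3
    return xs


# Hypothetical monotonically increasing functions F1..F5.
functions = [
    lambda x: x,
    lambda x: 2 * x,
    lambda x: x * x + x,
    lambda x: 3 * x,
    lambda x: x ** 3 + x,
]

xs = equalize(functions)
values = [f(x) for f, x in zip(functions, xs)]
print(xs)
print(values)  # all five values should agree closely
```

A fixed iteration count is used for simplicity. Stopping once the spread of `values` drops below a tolerance would match "repeat until satisfied" more closely.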
One solution of the equations
A + B + C + D + E = 1
F(A) = F(B) = F(C) = F(D) = F(E)
is to take A, B, C, D and E all equal to 1/5. Not sure though whether that is what you want ...
Added after John's comment (thanks!)
Assuming the second equation should read F1(A) = F2(B) = F3(C) = F4(D) = F5(E), I'd use the Newton-Raphson method (see Martijn's answer). You can eliminate one variable by setting E = 1 - A - B - C - D. At every step of the iteration you need to solve a 4x4 system. The biggest problem is probably where to start the iteration. One possibility is to start at a random point, do some iterations, and if you're not getting anywhere, pick another random point and start again.
Keep in mind that if you really don't know anything about the function then there need not be a solution.
ALGENCAN (part of TANGO) is really nice. There are Python bindings, too.
http://www.ime.usp.br/~egbirgin/tango/codes.php - " general nonlinear programming that does not use matrix manipulations at all and, so, is able to solve extremely large problems with moderate computer time. The general algorithm is of Augmented Lagrangian type ... "
http://pypi.python.org/pypi/TANGO%20Project%20-%20ALGENCAN/1.0
Google OPTIF9 or ALLUNC. We use these for general optimization.
You could use a standard search technique, as the others mentioned. There are a few optimizations you could make use of while doing the search.
First of all, you only need to solve for A, B, C and D, because E = 1 - (A+B+C+D).
Second, since F(A) = F(B) = F(C) = F(D), you can search over A alone. Once you have F(A), you can solve for B, C and D if the functions can be inverted. If they cannot, you have to keep searching each variable, but the range is now limited because A+B+C+D <= 1.
If your search is discrete and finite, the above optimizations should work reasonably well.
I would try Particle Swarm Optimization first. It is very easy to implement and tweak. See the Wiki page for it.
