Filling multiple missing data with EM algorithm - algorithm

I'm studying with this ppt. Starting from page 22, it shows how a missing value can be filled in with its most likely value using the EM algorithm. I managed to understand this, but I started wondering how I could fill in 2 missing values. If the 2 missing values were both in field B, I can see how I would calculate them. But what if a value is missing in both the A and the B field? The calculation in the ppt assumes that the data in A is fully known, but in this case it's not... Can someone explain a little bit?

If you want missing values on both A and B, you need some additional hidden variables.
To be more precise:
Assume that you have 4 hidden variables, H1, H2, A' and B' taking values in {0, 1} which generates your observations (A, B) as follows:
A = A' if H1=0, A = 'H' otherwise
B = B' if H2=0, B = 'H' otherwise
and assume that (A', B') is independent from (H1, H2). Therefore, your model is parametrized by the joint distribution of (A', B') and the joint distribution of (H1, H2).
Now to learn the model, you can just run EM as you did before, the only difference is that your hidden variable H is now extended by A', B', H1 and H2. Once your model is learnt, you can fill the missing pairs of observations by the most likely pair (given the distribution of the model).
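As a rough illustration of the idea above, here is a minimal Python sketch. It assumes, as in the answer, that the missingness indicators (H1, H2) are independent of (A', B'); under that assumption they can be estimated separately by simple counting, so the sketch only runs EM on the joint distribution of (A', B'). The names em_joint and fill, and the use of None as the "missing" marker, are my own choices for illustration.

```python
from itertools import product

def em_joint(records, iterations=50):
    """Estimate P(A', B') for binary A', B'; either field of a record may be None."""
    cells = list(product((0, 1), repeat=2))
    p = {c: 0.25 for c in cells}  # start from the uniform distribution
    for _ in range(iterations):
        counts = {c: 0.0 for c in cells}
        for a, b in records:
            # E-step: spread the record over the cells compatible with what was observed
            match = [c for c in cells
                     if (a is None or c[0] == a) and (b is None or c[1] == b)]
            z = sum(p[c] for c in match)
            for c in match:
                counts[c] += p[c] / z
        # M-step: renormalise the expected counts
        total = sum(counts.values())
        p = {c: counts[c] / total for c in cells}
    return p

def fill(record, p):
    """Replace missing fields by the most likely compatible pair under p."""
    a, b = record
    match = [c for c in p
             if (a is None or c[0] == a) and (b is None or c[1] == b)]
    return max(match, key=lambda c: p[c])

records = [(1, 1), (1, 1), (1, 1), (0, 0), (0, 1),
           (1, None), (None, 1), (None, None)]
p = em_joint(records)
completed = [fill(r, p) for r in records]
```

A record with both fields missing carries no information about (A', B'), so it is simply filled with the overall most likely pair.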

Resorting and holding a vector

So this is probably a very specific problem and I am not sure if it is even solvable, but here we go:
I have a vector with 6 indices which each are variables. These variables get calculated separately. What I want now is that the order of the indices changes at a specific time and stays like that. But the actual value of the indices needs to keep being calculated. Maybe explaining it in my Modelica code helps with understanding.
I have a vector with six indices, made up out of 6 variables, let's name them A to F. A to F are each calculated in a different way which is (probably) not relevant here so I'm simply writing [...] for that here. They behave independently of each other.
Real Vector[6];
Real A;
Real B;
Real C;
Real D;
Real E;
Real F;
equation
A = 3*x;
B = 5-x +7/x ...;
C = [and so on]
D = [...]
E = [...];
F = [...];
Initially, the Vector is sorted like this:
Vector = {A, B, C, D, E, F};
But I want the order of the indices to be resorted via some if-clauses every 100 seconds (starting at time=0), which I do like this:
when sample(0,100) then
Vector = {if xyz then A,
elseif xyz then B ....}
end when;
Again, the specific way in which I resort the indices (probably) doesn't matter because it definitely works.
My problem is: while it does resort my Vector every 100 seconds and holds this new order/sequence (which is exactly what I need), it of course also holds the values that A to F had at that time. This means I get constant values between each time step.
What I need is for the new order to hold while the values of A to F keep being calculated.
I also tried using if instead of when like
if time <100 then Vector = {A, B, C, D, E, F}
elseif time >= 100 and time < 200 then Vector = {if xyz then A, elseif xyz then B ....(see above)}
else ...;
end if;
Problem here: it does resort my Vector while still calculating A to F, but it seems to resort the vector all the time, instead of only once every 100 seconds and then holding that order until the next 100 seconds are over (the resorting depends on other calculated values in the model, which are constantly changing).
My model is very large, so it's tricky to share all the parts that weave into this part of my work, which is the reason I had to simplify my explanations as much as possible. I hope someone can help me with this.
I'm still relatively new at this and have been mostly teaching myself for the last few months so maybe I'm simply not aware of an easy obvious solution here. Or what I need is simply not doable in Modelica.
Thank you!
Not 100% sure I got the question correctly, but would the graph below show what you need?
...with v being the original vector and vs being the continuously computed, but sorted (ascending every 100s) version of v.
This is the respective code:
model VectorSorting "Computes 'vs' every 100s from 'v' with ascending order"
Real A, B, C, D, E, F; // Some variables computed in equations below
Real v[6]; // vector for A...F
Real vs[6]; // sorted version of 'v'
Integer i[6](start=1:6, fixed=true); // indexes of vector
Real d[6](start=zeros(6), fixed=true); // dummy variable
equation
A = time+200;
B = time-150;
C = 3*time-333;
D = 0.5*time+75;
E = -250;
F = 750;
v = {A, B, C, D, E, F};
vs = v[i];
when sample(0, 100) then
(d, i) = Modelica.Math.Vectors.sort(v);
end when;
annotation (experiment(StopTime=500), uses(Modelica(version="4.0.0")));
end VectorSorting;
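The key point in the model above is that only the index vector i is held between sample events, while vs = v[i] is evaluated continuously. The same idea can be sketched outside Modelica; the following Python snippet is only an illustration with made-up function names (signals, simulate), and it approximates the sample event by re-sorting at the first time step that falls into a new 100 s slot.

```python
def signals(t):
    """The six example signals A..F from the model, evaluated at time t."""
    return [t + 200, t - 150, 3 * t - 333, 0.5 * t + 75, -250, 750]

def simulate(times, period=100):
    """Yield (t, vs): the order is refreshed once per period, the values at every step."""
    order = list(range(6))
    last_slot = None
    for t in times:
        v = signals(t)
        slot = t // period
        if slot != last_slot:
            # "when sample(...)": store only the permutation, not the values
            order = sorted(range(len(v)), key=lambda k: v[k])
            last_slot = slot
        yield t, [v[k] for k in order]  # "vs = v[i]": always uses current values

result = dict(simulate(range(0, 201, 50)))
```

At t = 50 the entries of vs have moved on with time but keep the order fixed at t = 0, so vs is no longer necessarily ascending until the next re-sort at t = 100.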

Unable to generate a non-linear model fit to single variable data

I am trying to model the Kerr effect with experimental data, and the relationship between the independent variable voltage applied(U) and light intensity on crossed polarizers (L) is L = a * sin(b*U^2), where a and b are independent constants to be determined.
data = {{300, 0.014336918}, {350, 0.023297491}, {400,
0.053763441}, {450, 0.098566308}, {500, 0.172043011}, {550,
0.23297491}, {600, 0.336917563}, {650, 0.336917563}, {700,
0.403225806}, {750, 0.448028674}, {800, 0.480286738}, {850,
0.485663082}, {900, 0.487455197}, {950, 0.476702509}, {970,
0.465949821}, {985, 0.435483871}, {995, 0.415770609}}
nlm = NonlinearModelFit[data, a*Sin (b*(x^2)), {a, b}, x]
However, I get the following error:
NonlinearModelFit::nrlnum: ...
is not a list of real numbers with dimensions {17} at {a,b} = {1.,1.}.
I'm new to programming in this language but I have no idea what I am doing wrong. Is there any way to structure my data so that this function actually works?
After reading through the documentation, I realized that NonlinearModelFit uses a local, iterative approximation method, and since the parameter b is very small, Mathematica was unable to compute its value. Linearising the function and substituting back into the original equation solved my problem.
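For comparison, here is a hedged sketch in Python (not Mathematica) of another way to avoid depending on a starting value for b. For a fixed b the model L = a*sin(b*U^2) is linear in a, so a has a closed-form least-squares value and only b needs to be scanned. The function name and the scan range for b are my own assumptions; the range was chosen so that b*U^2 stays within roughly one half-period over the measured voltages.

```python
import math

data = [(300, 0.014336918), (350, 0.023297491), (400, 0.053763441),
        (450, 0.098566308), (500, 0.172043011), (550, 0.23297491),
        (600, 0.336917563), (650, 0.336917563), (700, 0.403225806),
        (750, 0.448028674), (800, 0.480286738), (850, 0.485663082),
        (900, 0.487455197), (950, 0.476702509), (970, 0.465949821),
        (985, 0.435483871), (995, 0.415770609)]

def fit_a_sin_bu2(points, b_min=1e-7, b_max=4e-6, steps=4000):
    """Scan b on a grid; for each b use the closed-form least-squares a."""
    best = None
    for k in range(steps + 1):
        b = b_min + (b_max - b_min) * k / steps
        s = [math.sin(b * u * u) for u, _ in points]
        denom = sum(x * x for x in s)
        if denom == 0:
            continue
        a = sum(x * y for x, (_, y) in zip(s, points)) / denom
        sse = sum((y - a * x) ** 2 for x, (_, y) in zip(s, points))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]

a, b = fit_a_sin_bu2(data)
print(a, b)
```

The resulting pair could also serve as a starting point for a local fitting routine.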

How to implement dp?

I recently have encountered a question where I have been given two types of person: Left-handed and Right-handed (writing style). They all are to be seated in a class row and to avoid any disturbance their hands must not collide i.e. pattern may be like LR or LL or RR.
I tried using recursion (taking two branches of L and R for every seat) but number of computations would be very high even for a row size of 100.
Somehow I need to implement DP to reduce the computations. Please suggest.
EDIT:
Actually there is a matrix (like a classroom) in which three types (L R B) of people can be seated with no collisions of hand. I have to generate the maximum number of people that can be seated. Suppose I have a 2x2 matrix and to fill it with Left, Right and Both handed type of persons L=0, R=1 ,B =3 are given. So one valid arrangement would be Row0: B R and row1: B - where - means blank seat.
Actually, the fact that you have a matrix doesn't make a difference to the solution: you can transform it to an array without losing generality, because each state depends on its left or right neighbour, not on what is above or below. A bottom-up approach: each state consists of the remaining counts L, B and R, the index of the seat you want to fill, and the person to its left. Now we can fill the table from right to left. The answer is dp[index=0][L][B][R][left_person=' '].
recursive [index][L][B][R][left_person] :
    if left_person = ' ' or 'L':
        LVal = recursive[index+1][L-1][B][R]['L']
    if left_person = ' ' or 'L' or 'R'
        RVal = recursive[index+1][L][B][R-1]['R']
    if left_person = ' ' or 'L':
        BVal = recursive[index+1][L][B-1][R]['B']
    NVal = recursive[index+1][L][B][R][' ']
    max(LVal, RVal, BVal, NVal) -> dp[index][L][B][R][left_person]
Of course this is not complete and I'm just giving you the general idea. You should add some details like the base case and checking whether any person of that kind is left before assigning one, and some other details.
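A possible way to fill in those details, as a minimal Python sketch using memoized recursion. It rests on assumptions that go beyond the answer: a collision only happens when a writing elbow on the right of one person meets a writing elbow on the left of the next, so R is allowed after B (as in the question's "B R" example, which differs from the pseudocode above); the first seat of every row has no left neighbour; and each seated person adds 1 to the result. The function name max_seated is hypothetical.

```python
from functools import lru_cache

def max_seated(rows, cols, n_left, n_right, n_both):
    """Maximum number of people seated in a rows x cols room without collisions."""
    total = rows * cols

    @lru_cache(maxsize=None)
    def best(index, l, r, b, prev):
        if index == total:          # base case: no seats left
            return 0
        if index % cols == 0:
            prev = ' '              # a new row starts: nobody sits to the left
        # Option 1: leave this seat empty
        result = best(index + 1, l, r, b, ' ')
        # L and B use their left elbow, so the left neighbour must not use the right one
        left_side_free = prev in (' ', 'L')
        if l > 0 and left_side_free:
            result = max(result, 1 + best(index + 1, l - 1, r, b, 'L'))
        if r > 0:                   # R never collides with the left neighbour
            result = max(result, 1 + best(index + 1, l, r - 1, b, 'R'))
        if b > 0 and left_side_free:
            result = max(result, 1 + best(index + 1, l, r, b - 1, 'B'))
        return result

    return best(0, n_left, n_right, n_both, ' ')

print(max_seated(2, 2, 0, 1, 3))  # the 2x2 example from the question
```

The number of states is (seats) x (L+1) x (R+1) x (B+1) x 4, so this stays small for classroom-sized inputs; for very long rows an iterative table would avoid deep recursion.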

Algorithm to separate items of the same type

I have a list of elements, each one identified with a type, I need to reorder the list to maximize the minimum distance between elements of the same type.
The set is small (10 to 30 items), so performance is not really important.
There's no limit about the quantity of items per type or quantity of types, the data can be considered random.
For example, if I have a list of:
5 items of A
3 items of B
2 items of C
2 items of D
1 item of E
1 item of F
I would like to produce something like:
A, B, C, A, D, F, B, A, E, C, A, D, B, A
A has at least 2 items between occurrences
B has at least 4 items between occurrences
C has 6 items between occurrences
D has 6 items between occurrences
Is there an algorithm to achieve this?
-Update-
After exchanging some comments, I came to a definition of a secondary goal:
main goal: maximize the minimum distance between elements of the same type, considering only the type(s) with less distance.
secondary goal: maximize the minimum distance between elements on every type. IE: if a combination increases the minimum distance of a certain type without decreasing other, then choose it.
-Update 2-
About the answers.
There were a lot of useful answers, although none is a solution for both goals, especially the second one, which is tricky.
Some thoughts about the answers:
PengOne: Sounds good, although it doesn't provide a concrete implementation and does not always lead to the best result according to the second goal.
Evgeny Kluev: Provides a concrete implementation to the main goal, but it doesn't lead to the best result according to the secondary goal.
tobias_k: I liked the random approach, it doesn't always lead to the best result, but it's a good approximation and cost effective.
I tried a combination of Evgeny Kluev, backtracking, and tobias_k formula, but it needed too much time to get the result.
Finally, at least for my problem, I considered tobias_k to be the most adequate algorithm, for its simplicity and good results in a timely fashion. Probably, it could be improved using Simulated annealing.
First, you don't have a well-defined optimization problem yet. If you want to maximize the minimum distance between two items of the same type, that's well defined. If you want to maximize the minimum distance between two A's and between two B's and ... and between two Z's, then that's not well defined. How would you compare two solutions:
A's are at least 4 apart, B's at least 4 apart, and C's at least 2 apart
A's at least 3 apart, B's at least 3 apart, and C's at least 4 apart
You need a well-defined measure of "good" (or, more accurately, "better"). I'll assume for now that the measure is: maximize the minimum distance between any two of the same item.
Here's an algorithm that achieves a minimum distance of ceiling(N/n(A)) where N is the total number of items and n(A) is the number of items of instance A, assuming that A is the most numerous.
Order the item types A1, A2, ... , Ak where n(Ai) >= n(A{i+1}).
Initialize the list L to be empty.
For j from k to 1, distribute items of type Aj as uniformly as possible in L.
Example: Given the distribution in the question, the algorithm produces:
F
E, F
D, E, D, F
D, C, E, D, C, F
B, D, C, E, B, D, C, F, B
A, B, D, A, C, E, A, B, D, A, C, F, A, B
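The answer leaves "as uniformly as possible" open. One possible reading, sketched in Python: when adding c items of a type to a list of m items, put them at positions ceil(i*(m+c)/c) for i = 0..c-1 and let the existing items fill the gaps in their old order. This placement rule and the helper names are my own assumptions, so the intermediate lists differ slightly from the example above, and the sketch makes no claim about always reaching the stated bound.

```python
from collections import Counter

def spread_evenly(existing, item, count):
    """Insert `count` copies of `item` at evenly spaced positions."""
    total = len(existing) + count
    slots = {-(-i * total // count) for i in range(count)}  # ceil(i*total/count)
    rest = iter(existing)
    return [item if pos in slots else next(rest) for pos in range(total)]

def distribute(items):
    """Process types from least to most numerous, as in the steps above."""
    result = []
    for item, count in reversed(Counter(items).most_common()):
        result = spread_evenly(result, item, count)
    return result

def min_same_type_distance(seq):
    """Smallest index difference between two equal items."""
    last, best = {}, len(seq)
    for pos, x in enumerate(seq):
        if x in last:
            best = min(best, pos - last[x])
        last[x] = pos
    return best

layout = distribute(list("AAAAABBBCCDDEF"))
print(layout, min_same_type_distance(layout))
```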
This sounded like an interesting problem, so I just gave it a try. Here's my super-simplistic randomized approach, done in Python:
import random

def optimize(items, quality_function, stop=1000):
    # Randomly swap two items and keep the swap only if the quality improves;
    # give up after `stop` consecutive swaps without improvement.
    no_improvement = 0
    best = 0
    while no_improvement < stop:
        i = random.randint(0, len(items)-1)
        j = random.randint(0, len(items)-1)
        copy = items[::]
        copy[i], copy[j] = copy[j], copy[i]
        q = quality_function(copy)
        if q > best:
            items, best = copy, q
            no_improvement = 0
        else:
            no_improvement += 1
    return items
As already discussed in the comments, the really tricky part is the quality function, passed as a parameter to the optimizer. After some trying I came up with one that almost always yields optimal results. Thanks to pmoleri for pointing out how to make this a whole lot more efficient.
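def quality_maxmindist(items):
    # Sum the reciprocal gaps between consecutive items of the same type;
    # a smaller sum (larger gaps) gives a higher quality.
    s = 0
    for item in set(items):
        indcs = [i for i in range(len(items)) if items[i] == item]
        if len(indcs) > 1:
            s += sum(1./(indcs[i+1] - indcs[i]) for i in range(len(indcs)-1))
    # Guard against division by zero when no type occurs more than once
    return 1./s if s else float('inf')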
And here is one random result:
>>> print(optimize(items, quality_maxmindist))
['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'F', 'C', 'A', 'D', 'B', 'A']
Note that, passing another quality function, the same optimizer could be used for different list-rearrangement tasks, e.g. as a (rather silly) randomized sorter.
Here is an algorithm that only maximizes the minimum distance between elements of the same type and does nothing beyond that. The following list is used as an example:
AAAAA BBBBB CCCC DDDD EEEE FFF GG
Sort element sets by number of elements of each type in descending order. Actually only largest sets (A & B) should be placed to the head of the list as well as those element sets that have one element less (C & D & E). Other sets may be unsorted.
Reserve R last positions in the array for one element from each of the largest sets, divide the remaining array evenly between the S-1 remaining elements of the largest sets. This gives optimal distance: K = (N - R) / (S - 1). Represent target array as a 2D matrix with K columns and L = N / K full rows (and possibly one partial row with N % K elements). For example sets we have R = 2, S = 5, N = 27, K = 6, L = 4.
If matrix has S - 1 full rows, fill first R columns of this matrix with elements of the largest sets (A & B), otherwise sequentially fill all columns, starting from last one.
For our example this gives:
AB....
AB....
AB....
AB....
AB.
If we try to fill the remaining columns with other sets in the same order, there is a problem:
ABCDE.
ABCDE.
ABCDE.
ABCE..
ABD
The last 'E' is only 5 positions apart from the first 'E'.
Sequentially fill all columns, starting from last one.
For our example this gives:
ABFEDC
ABFEDC
ABFEDC
ABGEDC
ABG
Returning to linear array we have:
ABFEDCABFEDCABFEDCABGEDCABG
Here is an attempt to use simulated annealing for this problem (C sources): http://ideone.com/OGkkc.
I believe you could see your problem like a bunch of particles that physically repel each other. You could iterate to a 'stable' situation.
Basic pseudo-code:
force( x, y ) = 1/distance(x,y) if x.type==y.type
                0 otherwise
nextposition( x, force ) = coined?(x) => same
else => x + force
notconverged(row,newrow) = // simplistically
row!=newrow
row=[a,b,a,b,b,b,a,e];
newrow=nextposition(row);
while( notconverged(row,newrow) )
newrow=nextposition(row);
I don't know if it converges, but it's an idea :)
I'm sure there may be a more efficient solution, but here is one possibility for you:
First, note that it is very easy to find an ordering which produces a minimum-distance-between-items-of-same-type of 1. Just use any random ordering, and the MDBIOST will be at least 1, if not more.
So, start off with the assumption that the MDBIOST will be 2. Do a recursive search of the space of possible orderings, based on the assumption that MDBIOST will be 2. There are a number of conditions you can use to prune branches from this search. Terminate the search if you find an ordering which works.
If you found one that works, try again, under the assumption that MDBIOST will be 3. Then 4... and so on, until the search fails.
UPDATE: It would actually be better to start with a high number, because that will constrain the possible choices more. Then gradually reduce the number, until you find an ordering which works.
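A minimal Python sketch of this search, following the UPDATE (start high, then decrease). The function names and the particular pruning rule are my own: a branch is abandoned as soon as some type can no longer fit its remaining copies, spaced k apart, into the positions that are left.

```python
from collections import Counter

def arrange(items, k):
    """Backtracking search for an ordering with equal items at least k apart."""
    remaining, last, out, n = Counter(items), {}, [], len(items)

    def feasible(pos):
        # Prune: every type must still be able to fit its remaining copies
        for t, c in remaining.items():
            if c and max(pos, last.get(t, -k) + k) + (c - 1) * k > n - 1:
                return False
        return True

    def place(pos):
        if pos == n:
            return True
        if not feasible(pos):
            return False
        for t in list(remaining):
            if remaining[t] and pos - last.get(t, -k) >= k:
                prev = last.get(t)
                remaining[t] -= 1
                last[t] = pos
                out.append(t)
                if place(pos + 1):
                    return True
                # Undo the choice and try the next type
                out.pop()
                remaining[t] += 1
                if prev is None:
                    del last[t]
                else:
                    last[t] = prev
        return False

    return out if place(0) else None

def best_arrangement(items):
    """Try large distances first and return the first (k, ordering) that works."""
    for k in range(len(items), 0, -1):
        found = arrange(items, k)
        if found is not None:
            return k, found

k, ordering = best_arrangement(list("AAAAABBBCCDDEF"))
print(k, ordering)
```

Since the search is exhaustive, it can become slow for unlucky inputs, but for 10 to 30 items the pruning should usually keep it manageable.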
Here's another approach.
If every item must be kept at least k places from every other item of the same type, then write down items from left to right, keeping track of the number of items left of each type. At each point put down an item with the largest number left that you can legally put down.
This will work for N items if there are no more than ceil(N / k) items of the same type, as it will preserve this property: after putting down k items we have k fewer items, and we have put down at least one of each type that started with ceil(N / k) items of that type.
Given a clutch of mixed items you could work out the largest k you can support and then lay out the items to solve for this k.
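The greedy rule can be sketched in Python as follows. The names greedy_layout and largest_k are hypothetical, ties between equally numerous types are broken by first appearance, and instead of deriving the largest k from a formula the sketch simply tries k from high to low and keeps the first one for which the greedy pass does not get stuck.

```python
from collections import Counter

def greedy_layout(items, k):
    """Place items left to right, always taking the legal type with most copies left."""
    remaining = Counter(items)
    last = {}
    out = []
    for pos in range(len(items)):
        legal = [t for t in remaining
                 if remaining[t] > 0 and pos - last.get(t, -k) >= k]
        if not legal:
            return None  # stuck: no type may be placed here
        t = max(legal, key=lambda x: remaining[x])
        remaining[t] -= 1
        last[t] = pos
        out.append(t)
    return out

def largest_k(items):
    """Largest k for which the greedy pass succeeds, with the resulting layout."""
    for k in range(len(items), 0, -1):
        layout = greedy_layout(items, k)
        if layout is not None:
            return k, layout

k_found, layout_found = largest_k(list("AAAAABBBCCDDEF"))
print(k_found, "".join(layout_found))
```

For the distribution in the question this finds a layout with equal items at least 3 positions apart.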

Algorithm/Data Structure for finding combinations of minimum values easily

I have a symmetric matrix like shown in the image attached below.
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time e.g for this example I need to find min values with respect to row A which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can anyone point me in the right direction to finding an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, that this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which is allowed for big Omega.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your applications, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily help you anything beyond precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset T, when we're computing f(T) we will already have computed all f(S) for S strictly contained in T. There are several ways that you can make use of this, but I think the easiest might be to use two such subsets: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1) bit number where the ith bit is 1 whenever the (i+1)th letter is in that set (so 0010110, which has bits 1, 2, and 4 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 element, in which case you'll want to do nothing, or 2 elements, in which case you just copy the value from the matrix.)
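Put together, the scheme could look like the following Python sketch. The function name is made up, row and column 0 of the matrix are taken to be "A", and t1 and t2 are simply chosen as the two lowest set bits of the mask. The example matrix uses the values from the question.

```python
def min_over_subsets(m):
    """f[mask] = minimum matrix entry over all pairs in {A} plus the letters in mask.

    Bit i of mask stands for letter i+1; letter 0 ("A") is always included.
    """
    n = len(m)
    f = [None] * (1 << (n - 1))      # f[0] stays None: the set {A} has no pair
    for mask in range(1, 1 << (n - 1)):
        low = mask & -mask           # lowest set bit -> t1
        rest = mask ^ low
        t1 = low.bit_length()        # bit i has bit_length i+1 = letter index
        if rest == 0:
            f[mask] = m[0][t1]       # two-element set {A, t1}: copy from matrix
        else:
            second = rest & -rest    # second-lowest set bit -> t2
            t2 = second.bit_length()
            # f(T) = min(f(T without t1), f(T without t2), m[t1][t2])
            f[mask] = min(f[mask ^ low], f[mask ^ second], m[t1][t2])
    return f

# Rows/columns in the order A, B, C, D
matrix = [[0, 6, 8, 4],
          [6, 0, 6, 4],
          [8, 6, 0, 2],
          [4, 4, 2, 0]]
f = min_over_subsets(matrix)
print(f[0b111])  # A.B.C.D
```

Because removing a bit always gives a smaller number, iterating the masks in numerical order guarantees that both looked-up values already exist.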
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min (CS), i.e. finding the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum value in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in your map then you already know the solution for that problem, so you just have to find the solution for the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, and so you just have to solve the problem for
X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.
