Time series alignment with continuous and categorical data - algorithm

I have pairs of series of timed measurements that I want to align: annotators judge and mark ambiguous events in a signal, and I want to optimally match their times and events across annotators. The input is four columns: the event onset times and labels for one annotator, and the times and labels for the other annotator. For example (as rows):
annotator_1_times: .34, .39, .50, .68, .88
annotator_1_label: A, X, Q, L, Z
annotator_2_times: .33, .41, .67, .90
annotator_2_label: A, X, L, X
Annotators don't necessarily have the same number of events when they
interpret the same signal. Annotators in general are expected to have
similar but nonidentical labels, and similar but nonidentical times.
How this should be done depends on some sort of cost function --
something that will decide how "bad" it is for a time to be off by a
certain amount, and for two labels to disagree.
A desirable output in my example case:
annotator_1_times: .34, .39, .50, .68, .88
annotator_1_label: A, X, Q, L, Z
annotator_2_times: .33, .41, [], .67, .90 <- note gap inserted
annotator_2_label: A, X, [], L, X
Stuff I would then do post hoc:
time_mismatch_dif: .01, .02, XX, .01, .02 <- for computing agreement
label_mismatches_: 0, 0, ADD, 0, SUBST <- for computing agreement
The hard part is to know where to insert the gaps.
If need be, I can do just the numerical alignments and separately just
the label alignments and then merge them somehow. I know there are
character alignment algorithms (e.g. in genetics) and there must be
time-series alignment algorithms.
Any suggestions welcome.

Your problem is very similar to the Levenshtein distance problem, and you can adapt the same algorithm.
First, define your cost function.
Then run dynamic programming on a quadratic table: for each i and j, calculate ans[i][j], the 'cost of alignment' of the first i events from the first annotator and the first j events from the second. Each cell can be reached in three ways:
either you align i and j, and ans[i][j] becomes ans[i-1][j-1] + costAlignment(a[i], b[j]);
or you align i with a gap, and ans[i][j] becomes ans[i-1][j] + costGap(a[i]);
or you align j with a gap, and ans[i][j] becomes ans[i][j-1] + costGap(b[j]).
You should choose the minimum of the three options.
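A minimal Python sketch of this recurrence, with a traceback to recover where the gaps go (all names and the example costs are illustrative, not fixed by the question):

def align(a, b, cost_alignment, cost_gap):
    # a, b: lists of (time, label) events
    # returns (total_cost, pairs); pairs holds index pairs (i, j), with None marking a gap
    n, m = len(a), len(b)
    INF = float("inf")
    ans = [[INF] * (m + 1) for _ in range(n + 1)]
    ans[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                ans[i][j] = min(ans[i][j], ans[i-1][j-1] + cost_alignment(a[i-1], b[j-1]))
            if i > 0:
                ans[i][j] = min(ans[i][j], ans[i-1][j] + cost_gap(a[i-1]))
            if j > 0:
                ans[i][j] = min(ans[i][j], ans[i][j-1] + cost_gap(b[j-1]))
    # walk back from ans[n][m] to recover the alignment
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ans[i][j] == ans[i-1][j-1] + cost_alignment(a[i-1], b[j-1]):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and ans[i][j] == ans[i-1][j] + cost_gap(a[i-1]):
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    pairs.reverse()
    return ans[n][m], pairs

With, say, cost_alignment = time difference plus a fixed penalty for differing labels (e.g. 0.1) and a constant gap cost chosen below the cost of a bad pairing (e.g. 0.2), this yields the gap opposite the unmatched Q event in the question's example.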

Related

Haskell - Grouping specific nearest neighbours in a cartesian grid

I'm after some direction in this puzzle where I need to group specific nearest neighbours together.
My input data is:
myList :: Map (Int, Int) Int
myList =
  fromList
    [((-2,-2),0),((-2,-1),0),((-2,0),2),((-2,1),0),((-2,2),0)
    ,((-1,-2),1),((-1,-1),3),((-1,0),0),((-1,1),0),((-1,2),1)
    ,((0,-2),0),((0,-1),0),((0,0),0),((0,1),0),((0,2),0)
    ,((1,-2),0),((1,-1),0),((1,0),0),((1,1),2),((1,2),1)
    ,((2,-2),0),((2,-1),2),((2,0),0),((2,1),0),((2,2),0)]
which is a data representation of a 5 x 5 grid (brown = land, blue = water).
I'm using (Int, Int) as XY coordinates, because the way the list had to be generated (thus its ordering) was in a spiral on a cartesian coordinate grid with (0,0) as the origin. The remaining Int is the size of the population: 0 being water, 1..9 being land.
Because of the ordering of my Map, I've been struggling to find a way to traverse my data and return the 4 groups of land items that are grouped by connected proximity (including diagonals), so I'm looking for a result like below:
[ [(-1, 2)]
, [(1, 2),(1, 1)]
, [(-2, 0),(-1,-1),(-1,-2)]
, [(2, -1)]]
I've researched and tried various algorithms like BFS and flood fill, but my input data never fits their structural requirements, or my understanding of the subject doesn't allow me to convert them to using coordinates.
Is there a way I can run an algorithm directly on this data, or should I be looking in another direction?
I'm sorry there are no code examples of what I have so far, but I've not even been able to create anything remotely useful.
I recommend using a union-find data structure. Loop over all positions; if it is land, mark it equivalent to any positions immediately NE, N, NW, or W of it that are also land. (It will automatically get marked equivalent to any land that exists E, SW, S, or SE of it when you visit that other land. The critical property of the set D={NE, N, NW, W} is that if you mirror all the directions in D to get M, then M∪D contains every direction; any other set D with this property will do fine, too.) The equivalence classes returned by the data structure at the end of this process will be your connected land chunks.
If n is the total number of positions, this process is O(n*log n); the log n component comes from the Map lookups needed to determine if a neighbor is land or water.
You should consider making the Map sparse if you can -- storing only the key-value pairs corresponding to land and skipping the water keys -- to graduate to O(m*log m), where m is the number of land positions rather than the total number of positions. If you cannot (because you must remember the difference between water and non-existent positions, say), you could consider switching to an array as your backing store to graduate to O(n*a(n)), where a is the inverse Ackermann function; the whole shebang would then be basically as close to O(n) as it is possible to get without actually being O(n).
Whether O(m*log m) or O(n*a n) is preferable when both are an option is a matter for empirical exploration on some data sets that you believe represent your typical use case.
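Here is a rough Python sketch of this scheme (the dict-based grid and all names are mine, not from the question), storing only the land keys as suggested:

def find(parent, x):
    # find with path compression
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def islands(grid):
    # grid: dict mapping (x, y) -> population; 0 is water
    land = {p for p, v in grid.items() if v > 0}
    parent = {p: p for p in land}             # union-find over land only
    D = [(1, 1), (0, 1), (-1, 1), (-1, 0)]    # NE, N, NW, W (y pointing up)
    for (x, y) in land:
        for dx, dy in D:
            q = (x + dx, y + dy)
            if q in land:                      # mark equivalent
                parent[find(parent, (x, y))] = find(parent, q)
    groups = {}
    for p in land:
        groups.setdefault(find(parent, p), []).append(p)
    return list(groups.values())

Run on the question's grid converted to a dict, this returns the four expected groups (in some order).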
I ended up going with this solution by Chris Penner via the FP slack channel. It uses the union-find algorithm (I've added comments to the code to help a little):
-- | Take a Map of land coordinates and return a list of grouped land items forming islands,
-- | using the union-find algorithm
findIslands :: M.Map Coordinate Coordinate -> IO [[Coordinate]]
findIslands land = do
  -- create fresh point map
  pointMap <- traverse U.fresh land
  -- traverse each point, unioning it with its neighbours
  void . flip M.traverseWithKey pointMap $ \(x, y) point ->
    for_ (catMaybes (flip M.lookup pointMap <$> [(x + 1, y), (x, y + 1), (x + 1, y + 1), (x - 1, y + 1)]))
      $ \neighbourPoint ->
          U.union point neighbourPoint
  -- map each point to its representative and that representative's descriptor
  withUnionKey :: (M.Map Coordinate Coordinate) <- for pointMap (U.repr >=> U.descriptor)
  -- swap coordinates around
  let unionKeyToCoord :: [(Coordinate, Coordinate)] = (swap <$> M.toList withUnionKey)
      -- combine coordinates sharing a representative to create islands
      results :: M.Map Coordinate [Coordinate] = M.fromListWith (<>) (fmap (:[]) <$> unionKeyToCoord)
  -- return just the elements from the Map
  return (M.elems results)

convertTolandGrid :: [Coordinate] -> M.Map Coordinate Coordinate
convertTolandGrid = M.fromList . fmap (id &&& id)

How do I randomly equalize unequal values?

Say I have multiple unequal values a, b, c, d, e. Is it possible to turn these unequal values into equal values just by using random number generation?
Example: a=100, b=140, c=200, d=2, e=1000. I want the algorithm to randomly target these values such that the largest value is targeted most often and the smallest value is left alone for the most part.
Areas where I've run into problems: if I just use non-unique random number generation, then value e will end up going under the other values. If I use unique number generation, then the ratio between the values doesn't change even if their absolute values do. I've tried using sets where a certain range of numbers has to be hit a certain number of times before the value changes. I haven't tried using a mix of unique/non-unique random numbers yet.
I want the ratio between the values to gradually approach 1 as the algorithm runs.
Another way to think about the problem: say these values a, b, c, d, e, are all equal. If we randomly choose one, each is as likely to be chosen as any other. After we choose one, we add 1 to that value. Then we run this process again. This time, the value that was picked last time is 1-larger than any other value so it's more likely to be picked than any one other value. This creates a snowball effect where the value picked first is likely to keep getting picked and achieve runaway growth. I'm looking for the opposite of this algorithm where we start after these originally-equal values have diverged and we bring them back to the originally-equal state.
I think this process is impossible because of entropy and the inherent one-way nature of existence.
Well, there is a technique called Inverse Weights, where you sample items with probability inversely proportional to their previous appearances. Each time we sample a, b, c, d or e, we update their appearance counts and recalculate the probabilities. Simple Python code below; I sample the numbers [0...4] as a, b, c, d, e and start with the appearance counts you listed. After 100,000 samples they look to be equidistributed:
import numpy as np

n = np.array([100, 140, 200, 2, 1000])
for k in range(1, 100000):
    p = 1.0 / n                    # make probabilities inverse to weights
    p /= np.sum(p)                 # normalization
    a = np.random.choice(5, p=p)   # sample a number in the range [0...5)
    n[a] += 1                      # update weights
print(n)
Output
[20260 20194 20290 20305 20392]

Prolog: How to break a piece of chocolate into its pieces

I've got the following task to do:
Given a rectangular chocolate bar consisting of m x n small rectangles, the goal is to break it into its parts. At each step you can pick only one piece and break it either along any of its vertical lines or along any of its horizontal lines. How should you break the chocolate bar using the minimum number of steps?
I know you need exactly m x n - 1 steps to break the chocolate bar, but I'm asked to do it "the CS way:"
Define a predicate which selects the minimum number of steps among all alternative possibilities to break the chocolate bar into pieces. Construct a structure on an additional argument position, which tells you where and how to break the bar and what to do with the resulting two pieces.
My thoughts: after breaking the piece of chocolate once, you have the choice of breaking it either on its vertical or its horizontal lines. So this is my code, but it doesn't work:
break_chocolate(Horizontal, Vertical, Minimum) :-
    break_horizontal(Horizontal, Vertical, Min1),
    break_vertical(Horizontal, Vertical, Min2),
    Minimum is min(Min1, Min2).

break_horizontal(0, 0, _).
break_vertical(0, 0, _).

break_horizontal(0, V, Min) :-
    V > 0,
    break_horizontal(0, V, Min).

break_horizontal(H, V, Min) :-
    H1 is H - 1,
    Min1 is Min + 1,
    break_vertical(H1, V, Min1).

break_horizontal(H, V, Min) :-
    H1 is H - 1,
    Min1 is Min + 1,
    break_vertical(H1, V, Min).

break_vertical(H, V, Min) :-
    V1 is V - 1,
    Min1 is Min + 1,
    break_horizontal(H, V1, Min1).

break_vertical(H, V, Min) :-
    V1 is V - 1,
    Min1 is Min + 1,
    break_vertical(H, V1, Min1).

break_vertical(H, 0, Min) :-
    H > 0,
    break_horizontal(H, 0, Min).
Could anyone help me with this one?
This is not a complete answer, but should push you towards the right direction:
First an observation: every time you cut a chocolate bar, you end up with exactly one more piece than you had before. So, actually, there is no "minimal" number of breaks: you start with 1 piece (the whole bar) and end up with m * n pieces, so you always make exactly m * n - 1 breaks. So either you have misunderstood your problem statement, or somehow misrepresented it in your question.
Second: once you break into two pieces, you will have to break each of the two in the same way that you have broken the previous one. One way to program that would be with a recursive call. I don't see this in your program, as it stands.
Third: so do you want to report the breaks that you make? How are you going to do this?
Whether you program in Prolog, C, or JavaScript, understanding your problem is a prerequisite to finding a solution.
Here are some additional hints for representing and solving your problem.
Each break separates one piece into two pieces (see Boris' second hint). You can, therefore, think of the collection of breaks as a binary tree of breaks which has the following characteristics:
The root node of the tree has the value M-N (the bar is M x N in dimension)
Suppose X-Y represents the value of any node in the tree representing a single X by Y piece that is not the smallest piece, 1-1. Since the two children of the node represent the piece of dimension X by Y being broken along one dimension, the children of X-Y either have the values A-Y and B-Y where A + B = X, or the values X-A and X-B where A + B = Y.
All of the leaf nodes of the tree have the value 1-1 (the smallest possible piece)
Each node of a binary tree consists of the node value, the left sub tree, and the right sub tree. An empty sub tree would have the value nil (or some other suitably chosen atom). A common representation of a tree would be something like, btree(X-Y, LeftSubTree, RightSubTree) (the term X-Y being the value of the top node of the tree, which in this problem, would be the dimensions of the piece in question). Using this scheme, the smallest piece of candy would be, btree(1-1, nil, nil), for example. A set of breaks for a 2 x 1 candy bar would be, btree(2-1, btree(1-1, nil, nil), btree(1-1, nil, nil)).
You can use the CLPFD library to constrain C #= A + B, A #> 0, B #> 0 and, to eliminate symmetrical cases, A #=< B.
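To make the tree-of-breaks representation concrete, here is a rough sketch in Python rather than Prolog (the helper names and the particular split positions are mine; any splits with A + B = X or A + B = Y would do):

def break_tree(m, n):
    # one possible break tree for an m x n bar;
    # a node is (m, n, left, right), and a 1 x 1 leaf has no children
    if m == 1 and n == 1:
        return (1, 1, None, None)
    if m > 1:                       # break along the first dimension
        a = m // 2
        return (m, n, break_tree(a, n), break_tree(m - a, n))
    b = n // 2                      # otherwise break along the second
    return (m, n, break_tree(m, b), break_tree(m, n - b))

def count_breaks(tree):
    m, n, left, right = tree
    return 0 if left is None else 1 + count_breaks(left) + count_breaks(right)

assert count_breaks(break_tree(4, 4)) == 15   # always m*n - 1

Every such tree has m*n leaves and hence m*n - 1 internal nodes, which matches Boris' observation about the number of breaks being fixed.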
As an algorithm (I'm not familiar with Prolog), I can't find any variation in the number of breaks. I've tried a 4x4 and been unable to come up with an answer other than 15 (either above or below); I've tried a 5x2 and been unable to come up with an answer other than 9.
On this basis, I would suggest the simplest possible coding method:
while there is more than one column:
    snap off the left-most column
    while this column has more than one square:
        snap off the top square
while the remaining column has more than one square:
    snap off the top square
Depending on the situation, you may wish to change one or more of: (left, column)<->(top, row), left->right, top->bottom.
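A direct Python transcription of that pseudocode (a sketch; the step labels are made up), which also makes the m*n - 1 total visible:

def snap_steps(m, n):
    # m rows by n columns; returns the snaps in order
    steps = []
    for _ in range(n - 1):                          # while there is more than one column
        steps.append("snap off the left-most column")
        steps += ["snap off the top square"] * (m - 1)
    steps += ["snap off the top square"] * (m - 1)  # the remaining column
    return steps

assert len(snap_steps(5, 2)) == 9   # matches the 5x2 count above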

Construct a full rank matrix by adding vectors from the standard basis

I have a nxn singular matrix. I want to add k rows (which must be from the standard basis e1, e2, ..., en) to this matrix such that the new (n+k)xn matrix is full column rank. The number of added rows k must be minimum and they can be added in any order (not just e1, e2 ,..., it can be e4, e10, e1, ...) as long as k is minimum.
Does anybody know a simple way to do this? Any help is appreciated.
You can achieve this by doing a QR decomposition with column pivoting, then taking the transpose of the last n-rank(A) columns of the permutation matrix.
In MATLAB, this is achieved with the qr function (see the MATLAB documentation for qr):
r = rank(A);
[Q, R, E] = qr(A);
newA = [A; transpose(E(:, r+1:end))];
Each row of transpose(E(:, r+1:end)) will be a member of the standard basis, the rank of newA will be n, and n - rank(A) is also the minimal number of standard basis vectors you need to add.
Here is how this works:
QR decomposition with column pivoting is a standard procedure to decompose a matrix A into products:
A*E==Q*R
where Q is an orthogonal matrix if A is real, or an unitary matrix if A is complex; R is upper triangular matrix, and E is a permutation matrix.
In short, the permutations are chosen so that each diagonal element is larger than the off-diagonal elements in the same row, and so that the sizes of the diagonal elements are non-increasing. A more detailed description can be found on the netlib QR factorization page.
Since Q and E are both orthogonal (or unitary) matrices, the rank of R is the same as the rank of A. To bring up the rank of A, we just need to find ways to increase the rank of R; and this is much more straight forward thanks to the structure of R as the result of pivoting and the fact that it is upper-triangular.
Now, with the requirement placed on the pivoting procedure, if any diagonal element of R is 0, its entire row has to be 0. The n-rank(A) rows of 0s at the bottom of R are responsible for the nullity. If we replaced the lower right corner with an identity matrix, the new matrix would be full rank. Well, we cannot really do the replacement, but we can append such rows to the bottom of R and form a new matrix that has the same full rank:
B == [ 0 I ]  =>  newR = [ R ; B ]
Here the dimensionality of I is the nullity of A and that of R.
It is readily seen that rank(newR)=n. Then we can also define a new unitary Q matrix by expanding its dimensionality in a trivial manner:
newQ=[Q 0 ; 0 I]
With that, our new rank n matrix can be obtained as
newA = newQ*newR*transpose(E) = [Q*R ; B]*transpose(E) = [A ; B*transpose(E)]
Note that B is [0 I] and E is a permutation matrix, so B*transpose(E) is simply the transpose
of the last n-rank(A) columns of E, and thus a set of rows made of standard basis vectors, and that's just what you wanted!
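For reference, here is a rough NumPy/SciPy sketch of the same recipe (my own translation; scipy.linalg.qr with pivoting=True returns the permutation as an index vector rather than a matrix):

import numpy as np
from scipy.linalg import qr

def add_basis_rows(A):
    # append the minimal standard basis rows needed for full column rank
    n = A.shape[1]
    r = np.linalg.matrix_rank(A)
    Q, R, perm = qr(A, pivoting=True)   # A[:, perm] == Q @ R
    rows = np.eye(n)[perm[r:]]          # e_i for the columns pivoted to the back
    return np.vstack([A, rows])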
Is n very large? The simplest solution, without using any math, would be to try adding each e_i and seeing if the rank increases. If it does, keep e_i. Proceed until finished.
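A sketch of that brute-force idea in NumPy (my own; it performs up to n rank computations, so it only makes sense for modest n):

import numpy as np

def greedy_basis_rows(A):
    # try each e_i in turn; keep it if it raises the rank
    n = A.shape[1]
    M = A.copy()
    r = np.linalg.matrix_rank(M)
    for i in range(n):
        if r == n:
            break                  # already full column rank
        e = np.zeros((1, n))
        e[0, i] = 1.0
        r2 = np.linalg.matrix_rank(np.vstack([M, e]))
        if r2 > r:
            M, r = np.vstack([M, e]), r2
    return M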
I like @Xiaolei Zhu's solution because it's elegant, but another way to go (that's even more computationally efficient) is:
Determine whether any rows, indexed by i, of your matrix A are all zero. If so, then the corresponding e_i must be concatenated.
After that process, you can simply concatenate any subset of size n - rank(A) of the columns of the identity matrix that you didn't add in step 1.
Rows/columns from the identity matrix can be added in any order; in general they do not need to be added in the usual order e1, e2, ... to make the matrix full rank.

Algorithm to separate items of the same type

I have a list of elements, each one identified with a type, and I need to reorder the list to maximize the minimum distance between elements of the same type.
The set is small (10 to 30 items), so performance is not really important.
There's no limit about the quantity of items per type or quantity of types, the data can be considered random.
For example, if I have a list of:
5 items of A
3 items of B
2 items of C
2 items of D
1 item of E
1 item of F
I would like to produce something like:
A, B, C, A, D, F, B, A, E, C, A, D, B, A
A has at least 2 items between occurrences
B has at least 4 items between occurrences
C has 6 items between occurrences
D has 6 items between occurrences
Is there an algorithm to achieve this?
-Update-
After exchanging some comments, I came to a definition of a secondary goal:
main goal: maximize the minimum distance between elements of the same type, considering only the type(s) with the least distance.
secondary goal: maximize the minimum distance for every type. I.e.: if a rearrangement increases the minimum distance of a certain type without decreasing another's, then choose it.
-Update 2-
About the answers.
There were a lot of useful answers, although none is a solution for both goals, especially the second one, which is tricky.
Some thoughts about the answers:
PengOne: sounds good, although it doesn't provide a concrete implementation, and doesn't always lead to the best result according to the second goal.
Evgeny Kluev: provides a concrete implementation of the main goal, but it doesn't lead to the best result according to the secondary goal.
tobias_k: I liked the random approach; it doesn't always lead to the best result, but it's a good approximation and cost effective.
I tried a combination of Evgeny Kluev's algorithm, backtracking, and tobias_k's formula, but it needed too much time to get the result.
Finally, at least for my problem, I considered tobias_k's to be the most adequate algorithm, for its simplicity and good results in a timely fashion. It could probably be improved using simulated annealing.
First, you don't have a well-defined optimization problem yet. If you want to maximize the minimum distance between two items of the same type, that's well defined. If you want to maximize the minimum distance between two A's and between two B's and ... and between two Z's, then that's not well defined. How would you compare two solutions:
A's are at least 4 apart, B's at least 4 apart, and C's at least 2 apart
A's at least 3 apart, B's at least 3 apart, and C's at least 4 apart
You need a well-defined measure of "good" (or, more accurately, "better"). I'll assume for now that the measure is: maximize the minimum distance between any two of the same item.
Here's an algorithm that achieves a minimum distance of ceiling(N/n(A)) where N is the total number of items and n(A) is the number of items of instance A, assuming that A is the most numerous.
Order the item types A1, A2, ... , Ak where n(Ai) >= n(A{i+1}).
Initialize the list L to be empty.
For j from k to 1, distribute items of type Aj as uniformly as possible in L.
Example: Given the distribution in the question, the algorithm produces:
F
E, F
D, E, D, F
D, C, E, D, C, F
B, D, C, E, B, D, C, F, B
A, B, D, A, C, E, A, B, D, A, C, F, A, B
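A rough Python sketch of this construction (my own; "as uniformly as possible" admits several tie-breaking choices, so the exact output can differ from the hand-built rows above):

from collections import Counter

def separate(items):
    counts = Counter(items)
    order = sorted(counts, key=counts.get)             # least numerous type first
    out = []
    for t in order:
        k = counts[t]
        n = len(out) + k
        slots = {round(i * n / k) for i in range(k)}   # evenly spaced positions
        merged, rest = [], iter(out)
        for pos in range(n):
            merged.append(t if pos in slots else next(rest))
        out = merged
    return out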
This sounded like an interesting problem, so I just gave it a try. Here's my super-simplistic randomized approach, done in Python:
import random

def optimize(items, quality_function, stop=1000):
    no_improvement = 0
    best = 0
    while no_improvement < stop:
        i = random.randint(0, len(items) - 1)
        j = random.randint(0, len(items) - 1)
        copy = items[::]
        copy[i], copy[j] = copy[j], copy[i]
        q = quality_function(copy)
        if q > best:
            items, best = copy, q
            no_improvement = 0
        else:
            no_improvement += 1
    return items
As already discussed in the comments, the really tricky part is the quality function, passed as a parameter to the optimizer. After some trying I came up with one that almost always yields optimal results. Thanks to pmoleri for pointing out how to make this a whole lot more efficient.
def quality_maxmindist(items):
    s = 0
    for item in set(items):
        indcs = [i for i in range(len(items)) if items[i] == item]
        if len(indcs) > 1:
            s += sum(1. / (indcs[i+1] - indcs[i]) for i in range(len(indcs) - 1))
    return 1. / s
And here is a random result:
>>> print(optimize(items, quality_maxmindist))
['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'F', 'C', 'A', 'D', 'B', 'A']
Note that, passing another quality function, the same optimizer could be used for different list-rearrangement tasks, e.g. as a (rather silly) randomized sorter.
Here is an algorithm that only maximizes the minimum distance between elements of the same type and does nothing beyond that. The following list is used as an example:
AAAAA BBBBB CCCC DDDD EEEE FFF GG
Sort the element sets by the number of elements of each type, in descending order. Actually only the largest sets (A & B) need to be placed at the head of the list, as well as those element sets that have one element less (C & D & E). Other sets may be left unsorted.
Reserve the R last positions in the array for one element from each of the largest sets, and divide the remaining array evenly between the S-1 remaining elements of the largest sets. This gives the optimal distance: K = (N - R) / (S - 1). Represent the target array as a 2D matrix with K columns and L = N / K full rows (and possibly one partial row with N % K elements). For the example sets we have R = 2, S = 5, N = 27, K = 6, L = 4.
If the matrix has S - 1 full rows, fill the first R columns of this matrix with elements of the largest sets (A & B); otherwise sequentially fill all columns, starting from the last one.
For our example this gives:
AB....
AB....
AB....
AB....
AB.
If we try to fill the remaining columns with other sets in the same order, there is a problem:
ABCDE.
ABCDE.
ABCDE.
ABCE..
ABD
The last 'E' is only 5 positions apart from the first 'E'.
Instead, sequentially fill all columns, starting from the last one.
For our example this gives:
ABFEDC
ABFEDC
ABFEDC
ABGEDC
ABG
Returning to linear array we have:
ABFEDCABFEDCABFEDCABGEDCABG
Here is an attempt to use simulated annealing for this problem (C sources): http://ideone.com/OGkkc.
I believe you could see your problem as a bunch of particles that physically repel each other. You could iterate to a 'stable' situation.
Basic pseudo-code:
force(x, y)            = 0 if x.type == y.type
                         1/distance(x, y) otherwise

nextposition(x, force) = coined?(x) => same
                         else       => x + force

notconverged(row, newrow) =  // simplistically
    row != newrow

row = [a, b, a, b, b, b, a, e];
newrow = nextposition(row);
while (notconverged(row, newrow))
    row = newrow;
    newrow = nextposition(row);
I don't know if it converges, but it's an idea :)
I'm sure there may be a more efficient solution, but here is one possibility for you:
First, note that it is very easy to find an ordering which produces a minimum-distance-between-items-of-same-type of 1. Just use any random ordering, and the MDBIOST will be at least 1, if not more.
So, start off with the assumption that the MDBIOST will be 2. Do a recursive search of the space of possible orderings, based on the assumption that MDBIOST will be 2. There are a number of conditions you can use to prune branches from this search. Terminate the search if you find an ordering which works.
If you found one that works, try again, under the assumption that MDBIOST will be 3. Then 4... and so on, until the search fails.
UPDATE: It would actually be better to start with a high number, because that will constrain the possible choices more. Then gradually reduce the number, until you find an ordering which works.
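A small Python sketch of this search (my own; the only pruning shown is refusing to place an item within k of its previous occurrence, which is exactly the assumption under test):

from collections import Counter

def order_with_gap(items, k):
    # backtracking: build an ordering in which equal items' positions
    # differ by at least k, or return None if the assumption fails
    counts = Counter(items)
    n = len(items)
    out, last = [], {}
    def rec(pos):
        if pos == n:
            return True
        for t in list(counts):
            if counts[t] > 0 and pos - last.get(t, -k) >= k:
                counts[t] -= 1; prev = last.get(t)
                last[t] = pos; out.append(t)
                if rec(pos + 1):
                    return True
                counts[t] += 1; out.pop()
                if prev is None: last.pop(t)
                else: last[t] = prev
        return False
    return out if rec(0) else None

Per the update above, start with a high k and decrease it until order_with_gap succeeds.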
Here's another approach.
If every item must be kept at least k places from every other item of the same type, then write down items from left to right, keeping track of the number of items left of each type. At each point put down an item with the largest number left that you can legally put down.
This will work for N items if there are no more than ceil(N / k) items of the same type, as it preserves this property: after putting down k items we have k fewer items, and we have put down at least one of each type that started with ceil(N / k) items of that type.
Given a clutch of mixed items you could work out the largest k you can support and then lay out the items to solve for this k.
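A quick Python sketch of this greedy rule (names are mine; it returns None when the chosen k turns out to be infeasible, so you can step k down until it succeeds):

from collections import Counter

def greedy_layout(items, k):
    # always place the type with the most items left whose previous
    # occurrence is at least k positions back
    left = Counter(items)
    last, out = {}, []
    for pos in range(len(items)):
        legal = [t for t in left if left[t] > 0 and pos - last.get(t, -k) >= k]
        if not legal:
            return None
        t = max(legal, key=lambda u: left[u])
        out.append(t); left[t] -= 1; last[t] = pos
    return out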
