Getting rid of data dependency - parallel-processing

I have this code:
for (i = 1; i < size; i++)   // i starts at 1 so that d[i-1] stays in bounds
{
    d[i] = d[i-1] + v[i];
}
When I do parallel processing for this loop, I have a data dependency and the initiation interval becomes 2.
Meaning I have:
initiation interval: 2
| load v[i-1] | load d[i-2] | add       | store d[i-1] |
|             |             | load v[i] | load d[i-1]  | add | store d[i] |
I do not want to stall in between.
initiation interval: 1
| load v[i-1] | load d[i-2] | add         | store d[i-1] |
|             | load v[i]   | load d[i-1] | add          | store d[i] |
This is not possible since d[i-1] is not stored yet.
How do we make the initiation interval 1 by changing the code?

You cannot reduce that gap: each d[i] needs the freshly computed d[i-1], so the add of one iteration cannot start until the add of the previous iteration has finished.
Also, this (loop unrolling) is not the most efficient way of parallel processing for this kind of loop. Your loop is a prefix-sum operation, and there are fast parallel algorithms and implementations available for prefix sums - see, for example, this question.
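As a rough illustration (not taken from the answer above), here is a minimal two-pass, blocked prefix-sum sketch in C; the function name and the block parameter are made up for this example. Each block is summed independently in pass one and patched with its running offset in pass two, so neither inner loop carries the d[i-1] dependency from one outer-loop iteration to the next.

#include <stddef.h>

/* Sketch of a blocked (two-pass) inclusive prefix sum of v into d.
 * Pass 1: each block computes its local prefix sums independently.
 * Pass 2: the running total of all previous blocks is added back.
 * Both per-block loops are free of the cross-block d[i-1] dependency. */
void prefix_sum_blocked(const int *v, int *d, size_t size, size_t block)
{
    size_t nblocks = (size + block - 1) / block;   /* block must be > 0 */

    /* Pass 1: local prefix sums inside each block. */
    for (size_t b = 0; b < nblocks; b++) {
        size_t lo = b * block;
        size_t hi = lo + block < size ? lo + block : size;
        int sum = 0;
        for (size_t i = lo; i < hi; i++) {
            sum += v[i];
            d[i] = sum;
        }
    }

    /* Pass 2: add the total of everything before this block. */
    int offset = 0;
    for (size_t b = 0; b < nblocks; b++) {
        size_t lo = b * block;
        size_t hi = lo + block < size ? lo + block : size;
        int block_total = d[hi - 1];
        for (size_t i = lo; i < hi; i++)
            d[i] += offset;
        offset += block_total;
    }
}

The blocks in pass 1 (and the inner loop of pass 2) are independent of each other, which is what lets a compiler or HLS tool pipeline them with an initiation interval of 1; a work-efficient parallel scan follows the same idea.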

Related

Linked list loop detection

How can we prove that moving the fast and slow pointers (from the beginning) by one step each makes their meeting point the loop node? I mean, I can't understand what guarantees that the meeting node is the loop node (i.e. the node where the cycle starts).
I am clear on tortoise-hare loop detection; basically I am talking about detecting the node where the cycle starts, after the loop has been detected.
It's a very simple proof really. First, you prove that the slow pointer will match the fast pointer after at most n + k steps, where n is the number of links to the start of the cycle and k is the length of the cycle. Then you prove that they will match again after exactly k further steps.
The point where they meet can be anywhere in the cycle.
Before trying to prove this formally, you should first look at an example so you can get a more intuitive understanding of, and can visualize, what is going on. Suppose you have the following linked list, in which the 3 (at index 3) points back to the 1 (at index 1):
[0| ]->[1| ]->[2| ]->[3| ]--+
        ^                   |
        |                   |
        |                   |
        +-------------------+
Walking through the logical progression, you can observe the following when incrementing slow by one position and fast by two:
slow = index 0; fast = index 0
slow = index 1; fast = index 2
slow = index 2; fast = index 1
slow = index 3; fast = index 3 (loop exists)
Hope this helps!
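To connect the proof to code, here is a minimal C sketch of the two phases (the node type and function name are assumptions for illustration). Phase one is the tortoise/hare detection above; phase two restarts one pointer at the head and advances both one step at a time, and the k-step argument is exactly why they meet at the node where the cycle starts.

#include <stddef.h>

/* Assumed node type for illustration. */
struct node { struct node *next; };

/* Returns the node where the cycle begins, or NULL if there is no cycle. */
struct node *find_cycle_start(struct node *head)
{
    struct node *slow = head, *fast = head;

    /* Phase 1: tortoise/hare meeting point somewhere inside the cycle. */
    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            /* Phase 2: restart one pointer at the head; moving both by
             * one step makes them meet exactly at the cycle's start.   */
            slow = head;
            while (slow != fast) {
                slow = slow->next;
                fast = fast->next;
            }
            return slow;
        }
    }
    return NULL; /* no cycle */
}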

Find max value 2d array N*N with fewer comparisons

I want to find the maximum value in a two-dimensional N*N array in C with fewer comparisons. I can do it simply with an O(N^2) algorithm, but I think that is too slow.
So I thought about another way: I loop once and search by row and column at the same time, trying to reduce the complexity (I guess O(2(n-1))). You can see in this picture what I'm trying to do.
I use the same loop to check the contents of the columns and the rows.
What I want to know is: is there anything faster, like sorting the 2D array with O(N log N) complexity? Assume the values are unsorted.
If the 2d array of M x M elements is not sorted in any way, then you're not going to do better than O(M^2).
Keep in mind that the matrix has M^2 elements, so sorting them will have complexity of O(M^2 log M^2), since most decent sorts are O(N log N) and here N = M^2.
Divide it up into [no, of cores] chunks. Get max. of each chunk in parallel. Pick the bones out of the results.
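As a sketch of that chunking suggestion (assuming OpenMP is available; reduction(max:...) needs OpenMP 3.1 or later, and the function name is made up for this example), each thread takes a chunk of the matrix and the per-chunk maxima are combined at the end:

#include <limits.h>

/* Sketch: chunks handled by threads, maxima combined by an OpenMP reduction.
 * Call as max_parallel(N, matrix) for an int matrix[N][N].                  */
int max_parallel(int n, const int a[n][n])
{
    int best = INT_MIN;

    #pragma omp parallel for reduction(max:best) collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (a[i][j] > best)
                best = a[i][j];

    return best;
}

Every element is still visited once, so this only divides the constant factor by the number of cores; the O(N^2) bound itself does not change.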
You could probably just cast the array to a 1D array and iterate over the flattened pointer...
I'll explain:
As you probably know, a 2D Array in the memory is stored in a flat state. The Array char c[4][2] looks like this:
| c[0][0] | | c[0][1] | | c[1][0] | | c[1][1] | | c[2][0] | ...
| Byte 1 | | Byte 2 | | Byte 3 | | Byte 4 | | Byte 5 | ...
In this example, c[1][1] == ((char*)c)[3].
For this reason, when all members are of the same type, it's possible to safely cast a 2D array to a 1D array, i.e.
int my_array[20][20];

for (int i = 0; i < 400; i++) {
    ((int *)my_array)[i] = i;
}
// my_array[19][0] == 380
As dbush points out (upvote his answer), if your matrix is M x M elements, then O(M^2) is the best you're going to get, and flattening the array this way simply saves you from copying the memory over before any operations.
EDIT
Someone asked why casting the array to a 1D array might be better.
The idea is to avoid a nested inner loop, making the optimizer's work easier. It is more likely that the compiler will unroll the loop if it's only a single dimension loop and the array's size is fixed.
dbush certainly has the right answer in terms of complexity.
It should also be noted that if you want "faster" in terms of actual run time (not just complexity), you need to consider caching. Going down the rows and columns in parallel is very bad for data locality, and you will incur a cache miss when you iterate down a column if your data has relatively large rows. You have to touch every element at least once in order to find the max, and it would be fastest to touch them in a "row major" ordering.

Similarity measurement of time-sequenced data with different length

Consider the following data.
Groundtruth     | Dataset1        | Dataset2        | Dataset3
Datapoints|Time | Datapoints|Time | Datapoints|Time | Datapoints|Time
A         |0    | a         |0    | a         |0    | a         |0
B         |10   | b         |5    | b         |5    | b         |13
C         |15   | c         |12   | c         |12   | c         |21
D         |25   | d         |22   | d         |14   | d         |30
E         |30   | e         |30   | e         |17   |
                |                 | f         |27   |
                |                 | g         |30   |
Visualized like this (as in number of - between each identifier):
Time ->
Groundtruth: A|----------|B|-----|C|----------|D|-----|E
Dataset1: a|-----|b|-------|c|----------|d|--------|e
Dataset2: a|-----|b|-------|c|--|d|---|e|----------|f|---|g
Dataset3: a|-------------|b|--------|c|---------|d
My goal is to compare the datasets with the groundtruth. I want to create a function that generates a similarity measure between one of the datasets and the groundtruth, in order to evaluate how good my segmentation algorithm is. Obviously I would like the segmentation algorithm to produce the same number of datapoints (segments) as the groundtruth, but as the datasets illustrate this is not guaranteed, and the number of datapoints is not known ahead of time.
I've already created a Jaccard index to generate a basic evaluation score. But I am now looking into an evaluation method that punishes the abundance/absence of datapoints as well as limits the distance to a correct datapoint. That is, b doesn't have to match B, it just has to be close to a correct datapoint.
I've tried to look into a dynamic programming method where I introduced a penalty for removing or adding a datapoint as well as a distance penalty to move to the closest datapoint. I'm struggling though, due to:
1. I need to limit each datapoint to one correct datapoint
2. Figure out which datapoint to delete if needed
3. General lack of understanding in how to implement DP algorithms
Anyone have ideas how to do this? If dynamic programming is the way to go, I'd love some link recommendation as well as some pointers in how to go about it.
Basically, you can modify the DP for Levenshtein edit distance to compute distances for your problem. The Levenshtein DP amounts to finding shortest paths in an acyclic directed graph that looks like this
*-*-*-*-*
|\|\|\|\|
*-*-*-*-*
|\|\|\|\|
*-*-*-*-*
where the arcs are oriented left-to-right and top-to-bottom. The DAG has rows numbered 0 to m and columns numbered 0 to n, where m is the length of the first sequence, and n is the length of the second. Lists of instructions for changing the first sequence into the second correspond one-to-one (cost and all) to paths from the upper left to the lower right. The arc from (i, j) to (i + 1, j) corresponds to the instruction of deleting the ith element from the first sequence. The arc from (i, j) to (i, j + 1) corresponds to the instruction of adding the jth element from the second sequence. The arc from (i, j) to (i + 1, j + 1) corresponds to modifying the ith element of the first sequence to become the jth element of the second sequence.
All you have to do to get a quadratic-time algorithm for your problem is to define the cost of (i) adding a datapoint (ii) deleting a datapoint (iii) modifying a datapoint to become another datapoint and then compute shortest paths on the DAG in one of the ways described by Wikipedia.
(As an aside, this algorithm assumes that it is never profitable to make modifications that "cross over" one another. Under a fairly mild assumption about the modification costs, this assumption is superfluous. If you're interested in more details, see this answer of mine: Approximate matching of two lists of events (with duration) .)
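A minimal C sketch of that DP, under stated assumptions: the alignment_cost name, the ADD_COST/DEL_COST penalties and the use of the absolute time difference as the modification cost are illustrative choices, not part of the answer above.

#include <stdlib.h>

#define ADD_COST 10.0   /* hypothetical penalty for an unmatched dataset point     */
#define DEL_COST 10.0   /* hypothetical penalty for an unmatched groundtruth point */

/* g[0..m-1]: groundtruth times, s[0..n-1]: dataset times (both ascending).
 * Returns the total alignment cost - lower means more similar.            */
double alignment_cost(const double *g, int m, const double *s, int n)
{
    double *dp = malloc((size_t)(m + 1) * (size_t)(n + 1) * sizeof *dp);
#define DP(i, j) dp[(i) * (n + 1) + (j)]

    for (int i = 0; i <= m; i++) {
        for (int j = 0; j <= n; j++) {
            if (i == 0 && j == 0) { DP(i, j) = 0.0; continue; }
            double best = 1e300;
            if (i > 0 && DP(i - 1, j) + DEL_COST < best)   /* groundtruth point unmatched */
                best = DP(i - 1, j) + DEL_COST;
            if (j > 0 && DP(i, j - 1) + ADD_COST < best)   /* dataset point unmatched     */
                best = DP(i, j - 1) + ADD_COST;
            if (i > 0 && j > 0) {                          /* match: pay the time gap     */
                double gap = g[i - 1] - s[j - 1];
                if (gap < 0) gap = -gap;
                if (DP(i - 1, j - 1) + gap < best)
                    best = DP(i - 1, j - 1) + gap;
            }
            DP(i, j) = best;
        }
    }

    double result = DP(m, n);
    free(dp);
    return result;
#undef DP
}

The three candidates inside the loop correspond to the three kinds of arcs: skipping a groundtruth point, leaving a dataset point unmatched, and matching a pair while paying for how far apart in time they are.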

How can I better understand the one-comparison-per-iteration binary search?

What is the point of the one-comparison-per-iteration binary search? And can you explain how it works?
There are two reasons to binary search with one comparison per iteration. The
less important is performance. Detecting an exact match early using two
comparisons per iteration saves an average of one iteration of the loop, whereas
(assuming comparisons involve significant work) binary searching with one
comparison per iteration almost halves the work done per iteration.
Binary searching an array of integers, it probably makes little difference
either way. Even with a fairly expensive comparison, asymptotically the
performance is the same, and the half-rather-than-minus-one probably isn't worth
pursuing in most cases. Besides, expensive comparisons are often coded as functions that return negative, zero or positive for <, == or >, so you can get both comparisons for pretty much the price of one anyway.
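For contrast, a minimal C sketch of the conventional early-exit search (two logical comparisons per iteration), assuming a hypothetical three-way cmp(a, b) helper that returns negative, zero or positive; the rest of this answer drops the early exit in favour of one predicate test per iteration.

/* Early-exit binary search: returns an index of an item equal to goal,
 * or -1 if there is none.  cmp is a three-way comparison as described
 * above (negative, zero or positive).                                  */
int find_any_equal(const int *item, int count, int goal,
                   int (*cmp)(int a, int b))
{
    int lowerbound = 0, upperbound = count;   /* half-open [lower, upper) */

    while (upperbound > lowerbound) {
        int testpos = lowerbound + ((upperbound - lowerbound) / 2);
        int c = cmp(item[testpos], goal);
        if (c == 0)
            return testpos;              /* exact match - stop early */
        if (c < 0)
            lowerbound = testpos + 1;    /* item too small           */
        else
            upperbound = testpos;        /* item too big             */
    }
    return -1;                           /* no equal item            */
}

Note that this only finds some equal match, not the first or last one, which is the limitation the rest of the answer is about.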
The important reason to do binary searches with one comparison per iteration is
because you can get more useful results than just some-equal-match. The main
searches you can do are...
First key > goal
First key >= goal
First key == goal
Last key < goal
Last key <= goal
Last key == goal
These all reduce to the same basic algorithm. Understanding this well enough
that you can code all the variants easily isn't that difficult, but I've not
really seen a good explanation - only pseudocode and mathematical proofs. This
is my attempt at an explanation.
There are games where the idea is to get as close as possible to a target
without overshooting. Change that to "undershooting", and that's what "Find
First >" does. Consider the ranges at some stage during the search...
                | lower bound        goal                  | upper bound
+---------------+-----------------------------------------+--------------
|    Illegal    |                      better       worse  |
+---------------+-----------------------------------------+--------------
The range between the current upper and lower bound still needs to be searched.
Our goal is (normally) in there somewhere, but we don't yet know where. The
interesting point about items above the upper bound is that they are legal in
the sense that they are greater than the goal. We can say that the item just
above the current upper bound is our best-so-far solution. We can even say this
at the very start, even though there is probably no item at that position - in a
sense, if there is no valid in-range solution, the best solution that hasn't
been disproved is just past the upper bound.
At each iteration, we pick an item to compare between the upper and lower bound.
For binary search, that's a rounded half-way item. For binary tree search, it's
dictated by the structure of the tree. The principle is the same either way.
As we are searching for an item greater than our goal, we compare the test item
using Item [testpos] > goal. If the result is false, the test item is not greater
than the goal - we have undershot - so we keep our existing best-so-far solution,
and adjust our lower bound upwards. If the result is true, we have found a new
best-so-far solution, so we adjust the upper bound down to reflect that.
Either way, we never want to compare that test item again, so we adjust our
bound to eliminate (only just) the test item from the range to search. Being
careless with this usually results in infinite loops.
Normally, half-open ranges are used - an inclusive lower bound and an exclusive
upper bound. Using this system, the item at the upper bound index is not in the
search range (at least not now), but it is the best-so-far solution. When you
move the lower bound up, you move it to testpos+1 (to exclude the item you just
tested from the range). When you move the upper bound down, you move it to
testpos (the upper bound is exclusive anyway).
if (item[testpos] > goal)
{
    //  new best-so-far
    upperbound = testpos;
}
else
{
    lowerbound = testpos + 1;
}
When the range between the lower and upper bounds is empty (using half-open,
when both have the same index), your result is your most recent best-so-far
solution, just above your upper bound (ie at the upper bound index for
half-open).
So the full algorithm is...
while (upperbound > lowerbound)
{
    testpos = lowerbound + ((upperbound - lowerbound) / 2);

    if (item[testpos] > goal)
    {
        //  new best-so-far
        upperbound = testpos;
    }
    else
    {
        lowerbound = testpos + 1;
    }
}
To change from first key > goal to first key >= goal, you literally switch
the comparison operator in the if line. The relational operator and the goal could be replaced by a single parameter - a predicate function that returns true if (and only if) its parameter is on the greater-than side of the goal.
That gives you "first >" and "first >=". To get "first ==", use "first >=" and
add an equality check after the loop exits.
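For example, a minimal C sketch of that "first ==" construction (the array, count and function name are illustrative):

/* Sketch: "first key == goal" built on the "first key >= goal" search.
 * Returns the index of the first equal item, or -1 if there is none.   */
int first_equal(const int *item, int count, int goal)
{
    int lowerbound = 0, upperbound = count;   /* half-open [lower, upper) */

    while (upperbound > lowerbound) {
        int testpos = lowerbound + ((upperbound - lowerbound) / 2);
        if (item[testpos] >= goal)            /* ">" switched to ">=" */
            upperbound = testpos;             /* new best-so-far      */
        else
            lowerbound = testpos + 1;
    }

    /* best-so-far sits at the upper bound; check range and equality. */
    if (upperbound < count && item[upperbound] == goal)
        return upperbound;
    return -1;
}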
For "last <" etc, the principle is the same as above, but the range is
reflected. This just means you swap over the bound-adjustments (but not the
comment) as well as changing the operator. But before doing that, consider the following...
a > b == !(a <= b)
a >= b == !(a < b)
Also...
position (last key < goal) = position (first key >= goal) - 1
position (last key <= goal) = position (first key > goal ) - 1
When we move our bounds during the search, both sides are being moved towards the goal until they meet at the goal. And there is a special item just below the lower bound, just as there is just above the upper bound...
while (upperbound > lowerbound)
{
    testpos = lowerbound + ((upperbound - lowerbound) / 2);

    if (item[testpos] > goal)
    {
        //  new best-so-far for first key > goal at [upperbound]
        upperbound = testpos;
    }
    else
    {
        //  new best-so-far for last key <= goal at [lowerbound - 1]
        lowerbound = testpos + 1;
    }
}
So in a way, we have two complementary searches running at once. When the upperbound and lowerbound meet, we have a useful search result on each side of that single boundary.
For all cases, there's the chance that an original "imaginary" out-of-bounds
best-so-far position was your final result (there was no match in the
search range). This needs to be checked before doing a final == check for the
first == and last == cases. It might be useful behaviour, as well - e.g. if
you're searching for the position to insert your goal item, adding it after the
end of your existing items is the right thing to do if all the existing items
are smaller than your goal item.
A couple of notes on the selection of the testpos...
testpos = lowerbound + ((upperbound-lowerbound) / 2);
First off, this will never overflow, unlike the more obvious ((lowerbound +
upperbound)/2). It also works with pointers as well as integer
indexes.
Second, the division is assumed to round down. Rounding down for non-negatives
is OK (all you can be sure of in C) as the difference is always non-negative
anyway.
This is one aspect that may need care if you use non-half-open
ranges, though - make sure the test position is inside the search range, and not just outside (on one of the already-found best-so-far positions).
Finally, in a binary tree search, the moving of bounds is implicit and the
choice of testpos is built into the structure of the tree (which may be
unbalanced), yet the same principles apply for what the search is doing. In this
case, we choose our child node to shrink the implicit ranges. For first match
cases, either we've found a new smaller best match (go to the lower child in hopes of finding an even smaller and better one) or we've undershot (go to the higher child in hopes of recovering). Again, the four main cases can be handled by switching the comparison operator.
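A hedged sketch of that idea on a plain binary search tree in C (the tnode type and function name are invented here): the best-so-far pointer plays the role of the upper bound, and each step descends to one child or the other exactly as the array version moves one bound.

/* Illustrative tree node type. */
struct tnode { int key; struct tnode *left, *right; };

/* First node with key >= goal, or NULL if every key is smaller. */
struct tnode *first_at_least(struct tnode *root, int goal)
{
    struct tnode *best = NULL;           /* best-so-far, like the upper bound */

    while (root) {
        if (root->key >= goal) {
            best = root;                 /* new best-so-far, try to do better */
            root = root->left;
        } else {
            root = root->right;          /* undershot, recover on the right   */
        }
    }
    return best;
}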
BTW - there are more possible operators to use for that template parameter. Consider an array sorted by year then month. Maybe you want to find the first item for a particular year. To do this, write a comparison function that compares the year and ignores the month - the goal compares as equal if the year is equal, but the goal value may be a different type to the key that doesn't even have a month value to compare. I think of this as a "partial key comparison", and plug that into your binary search template and you get what I think of as a "partial key search".
EDIT The paragraph below used to say "31 Dec 1999 to be equal to 1 Feb 2000". That wouldn't work unless the whole range in between was also considered equal. The point is that all three parts of the begin- and end-of-range dates differ, so you're not dealing with a "partial" key, but the keys considered equivalent for the search must form a contiguous block in the container, which will normally imply a contiguous block in the ordered set of possible keys.
It's not strictly just "partial" keys, either. Your custom comparison might consider 31 Dec 1999 to be equal to 1 Jan 2000, yet all other dates different. The point is the custom comparison must agree with the original key about the ordering, but it might not be so picky about considering all different values different - it can treat a range of keys as an "equivalence class".
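As an illustration of such a predicate-driven search (the record type, the year field and the function names are invented for this sketch), the only change from the loops above is that the comparison ignores the month:

/* Illustrative record type: the array is sorted by year, then month. */
struct record { int year; int month; };

/* Predicate for "first record with year >= goal_year": compares the year
 * only, so every month inside the goal year counts as "equal".           */
static int year_at_least(const struct record *r, int goal_year)
{
    return r->year >= goal_year;
}

/* Same "first key >= goal" loop as before, with the predicate plugged in.
 * Returns the index of the first record in goal_year (or where that year
 * would be inserted if it is absent).                                     */
int first_in_year(const struct record *items, int count, int goal_year)
{
    int lowerbound = 0, upperbound = count;

    while (upperbound > lowerbound) {
        int testpos = lowerbound + ((upperbound - lowerbound) / 2);
        if (year_at_least(&items[testpos], goal_year))
            upperbound = testpos;     /* new best-so-far */
        else
            lowerbound = testpos + 1;
    }
    return upperbound;
}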
An extra note about bounds that I really should have included before, but I may not have thought about it this way at the time.
One way of thinking about bounds is that they aren't item indexes at all. A bound is the boundary line between two items, so you can number the boundary lines as easily as you can number the items...
|     |     |     |     |     |     |     |     |
| +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ |
| |0| | |1| | |2| | |3| | |4| | |5| | |6| | |7| |
| +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ |
|     |     |     |     |     |     |     |     |
0     1     2     3     4     5     6     7     8
Obviously the numbering of bounds is related to the numbering of the items. As long as you number your bounds left-to-right and the same way you number your items (in this case starting from zero) the result is effectively the same as the common half-open convention.
It would be possible to select a middle bound to bisect the range precisely into two, but that's not what a binary search does. For binary search, you select an item to test - not a bound. That item will be tested in this iteration and must never be tested again, so it's excluded from both subranges.
|     |     |     |     |     |     |     |     |
| +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ |
| |0| | |1| | |2| | |3| | |4| | |5| | |6| | |7| |
| +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ | +-+ |
|     |     |     |     |     |     |     |     |
0     1     2     3     4     5     6     7     8
                           ^
|<-------------------------|------------------->|
                           |
|<--------------------->|  |  |<--------------->|
        low range            i       hi range
So the testpos and testpos+1 in the algorithm are the two cases of translating the item index into the bound index. Of course, if the two bounds are equal, there are no items in that range to choose, so the loop cannot continue, and the only possible result is that one bound value.
The ranges shown above are just the ranges still to be searched - the gap we intend to close between the proven-lower and proven-higher ranges.
In this model, the binary search is searching for the boundary between two ordered kinds of values - those classed as "lower" and those classed as "higher". The predicate test classifies one item. There is no "equal" class - equal-to-key values are part of the higher class (for x[i] >= key) or the lower class (for x[i] > key).

Fast min on span

Given a list of arrays and lots of setup time I need to quickly find the smallest value in some sub-span of each array. In concept:
import std.algorithm : min;

class SpanThing
{
    int[][] Data;

    this(int[][] data) /// must be rectangular
    {
        Data = data;
        //// process, can take a while
    }

    int[] MinsBruteForce(int from, int to)
    {
        int[] result = new int[Data.length];
        foreach (i, int[] dat; Data)
        {
            result[i] = int.max;
            foreach (int v; dat[from .. to])
                result[i] = min(result[i], v);
        }
        return result;
    }

    int[] MinsSmart(int from, int to)
    {
        // same as MinsBruteForce but faster
        return MinsBruteForce(from, to); // placeholder
    }
}
My current thinking on how to do this would be to build a binary tree over the data where each node contains the min in the related span. That way finding the min in the span for one row would consist of finding the min of only the tree nodes that compose it. This set would be the same for each row so could be computed once.
Does anyone see any issues with this idea or know of better ways?
To clarify, the tree I'm talking about would be set up such that the root node would contain the min value for the entire row and, for each node, its left child would have the min value for the left half of the parent's span, and likewise for the right.
0 5 6 2 7 9 4 1 7 2 8 4 2
------------------------------------------------
| 5 | 6| | 7 | 9 | | 1 | 7 | 2 | 8 | 4 | 2
0 | 5 | 2 | 7 | 4 | 1 | 2 | 2
0 | 2 | 1 | 2
0 | 1
0
This tree can be mapped to an array and defined in such a way that the segment bounds can be computed resulting in fast look-up.
The case I'm optimizing for is where I have a fixed input set and lots of up-front time, but then need to do many fast tests on a variety of spans.
Your proposed solution appears to give the answer using linear space overhead, linear setup time, and logarithmic time for queries. If you are willing to pay quadratic space (i.e., compute all intervals in advance) you can get answers in constant time. Your logarithmic scheme is almost certain to be preferred.
It wouldn't surprise me if it were possible to do better, but I'd be shocked if there were a simple data structure that could do better---and in practice, logarithmic time is almost always plenty fast enough. Go for it.
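A hedged sketch in C of the per-row tree the question describes (the flat-array layout, the struct and the function names are illustrative choices, not a definitive implementation): leaves sit at positions len..2*len-1, node i holds the min of its children 2*i and 2*i+1, and a query climbs from both ends of the span, so each lookup is O(log len).

#include <stdlib.h>

/* Illustrative per-row min tree: node[len + i] = row[i], and every
 * internal node i holds the min of its children 2*i and 2*i + 1.
 * Error handling and freeing are omitted in this sketch.             */
struct min_tree { int *node; int len; };

struct min_tree build_min_tree(const int *row, int len)
{
    struct min_tree t;
    t.len = len;
    t.node = malloc(2 * (size_t)len * sizeof *t.node);
    for (int i = 0; i < len; i++)
        t.node[len + i] = row[i];
    for (int i = len - 1; i > 0; i--)
        t.node[i] = t.node[2 * i] < t.node[2 * i + 1]
                  ? t.node[2 * i] : t.node[2 * i + 1];
    return t;
}

/* Min of row[from .. to), assuming 0 <= from < to <= len. */
int query_min(const struct min_tree *t, int from, int to)
{
    int best = t->node[t->len + from];
    for (int l = t->len + from, r = t->len + to; l < r; l /= 2, r /= 2) {
        if (l & 1) { if (t->node[l] < best) best = t->node[l]; l++; }
        if (r & 1) { r--; if (t->node[r] < best) best = t->node[r]; }
    }
    return best;
}

You would build one such tree per row during the slow setup phase; every MinsSmart(from, to) call then does the same O(log len) walk in each row, and the set of nodes visited is identical across rows, as the question anticipates.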
Your described approach sounds like you're trying to do some sort of memoization or caching, but that's only going to help you if you're checking the same spans or nested spans repeatedly.
The general case for min([0..n]) is going to be O(n), which is what you've got already.
Your code seems to care more about the actual numbers in the array than their order in the array. If you're going to be checking these spans repeatedly, is it possible to just sort the data first, which could be a single O(n log n) operation followed by a bunch of O(1) ops? Most languages have some sort of built in sorting algorithm in their standard libraries.
It's not clear how we can represent the hierarchy of intervals efficiently using the tree approach you've described. There are many ways to divide an interval --- do we have to consider every possibility?
Would a simple approach like this suffice: Suppose data is an N x M array. I would create a M x M x N array where entry (i,j,k) gives the "min" of data(k,i:j). The array entries will be populated on demand:
int[] getMins(int from, int to)
{
    assert to >= from;
    if (mins[from][to] == null)
    {
        // populate all entries (from, j, :) for j >= from
        for (int j = from; j < M; j++)
            mins[from][j] = new int[N];

        for (int k = 0; k < N; k++)
        {
            int t = array[k][from];
            for (int j = from; j < M; j++)
            {
                if (array[k][j] < t)
                    t = array[k][j];
                mins[from][j][k] = t;
            }
        }
    }
    return mins[from][to];
}
