Drawing sum of values across multiple overlapping dynamic intervals - algorithm

I have an array of intervals [a,b] (where [a,b] is the set of all x such that a <= x <= b). Each of these intervals has a value associated with it (think of it as the cost of something across that interval). Intervals can overlap. Intervals are dynamic (they can be added, removed, translated, and resized). The value associated with any interval can also change.
I need to create a graph containing the sum of all such values across the interval [start, end], defined as the smallest interval containing all of the intervals. To do so I need an ordered list of the points on the real line where the summed value changes, along with the values it changes between. This list must be easy and quick to update when an interval in the original array changes.
Side notes: assume not very large number of intervals (hundreds?).
Any suggestions on data structures / algorithms to do this effectively?

An interval tree can support these operations.
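For the sizes mentioned (hundreds of intervals), a simpler structure may suffice: keep a difference map of breakpoints, where an interval with a given value contributes +value at its start and -value at its end; the running prefix sum over the sorted breakpoints is exactly the ordered list of "where the sum changes and what it changes to". A minimal Python sketch (half-open intervals [a, b) assumed for simplicity; the class name and API are illustrative only):

```python
from collections import defaultdict

class IntervalSum:
    """Maintains weighted intervals [a, b) via a difference map:
    each interval contributes +value at a and -value at b.  The
    sorted prefix sums of the map give the step function of the
    total weight."""

    def __init__(self):
        self.delta = defaultdict(float)  # breakpoint -> net change

    def add(self, a, b, value):
        self.delta[a] += value
        self.delta[b] -= value

    def remove(self, a, b, value):
        # removing an interval is adding it with the negated value
        self.add(a, b, -value)

    def profile(self):
        """Return [(x, total)] pairs: total weight on [x, next x)."""
        total, out = 0.0, []
        for x in sorted(self.delta):
            total += self.delta[x]
            out.append((x, total))
        return out
```

Any update (move, resize, value change) is a remove of the old interval plus an add of the new one, each O(1); producing the full ordered breakpoint list is O(K log K) in the number of breakpoints, which is cheap for hundreds of intervals.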

Related

How do I combine two properties (each with opposite impact) of data points to filter out the best data point?

This question is more logical than programming. My dataset has several data points (think of the dataset as an array and the data points as its elements). Each data point is defined by two properties. For example, if x is one of the data points in the dataset X, then a and b are the properties of x. The larger the value of a (which ranges from 0 to 1; think of it as a probability), the better the chance that x is selected. Conversely, the larger the value of b (any number greater than 1), the worse the chance that x is selected. From the data points in X, I need to select the one with the maximum a value and minimum b value. Note that a single data point may not satisfy both conditions at once: x may have the largest a value but not the smallest b value, and vice versa. Hence, I want to combine a and b into another meaningful weight value that helps me filter out the right data point from X.
Is there any mathematical solution to my problem?
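One common family of answers is scalarization: map each point to a single score that increases with a and decreases with b, then take the maximum. A hypothetical example in Python (the score a * (1/b)^w and the weight w are illustrative assumptions, not the only choice; normalizing b into [0, 1] first is equally valid):

```python
def best_point(points, weight=1.0):
    """points: list of (a, b) pairs, a in [0, 1] (higher is better),
    b >= 1 (lower is better).  Scores each point by a * (1/b)**weight,
    so a larger weight penalizes large b more strongly, and returns
    the point with the maximum score."""
    def score(p):
        a, b = p
        return a * (1.0 / b) ** weight
    return max(points, key=score)
```

With points = [(0.9, 10), (0.8, 2), (0.5, 1)] and the default weight, the scores are 0.09, 0.4 and 0.5, so (0.5, 1) wins even though its a is smallest; tune the weight to trade the two properties off differently.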

Algorithm for selecting the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, distinct, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is: each just contained a table (array) of its objects. I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent sets as stacks (i.e. singly linked lists) whose bottom parts can be shared across different sets. Equivalently, it can be defined as a tree in which each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows the most common subsets of objects to serve as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm, written as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    trgObj = the most frequent object from sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;
        else
            newSets.Insert(set);
    }

    BuildSubChains(sets, pParent);
    BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written recursively, but naive recursion should not actually be used, because the splitting is unbalanced at each step, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of loop + recursion, where the recursion is always invoked on the smaller part.
So the idea is: each time, select the most common object and create a "subset" node which inherits from its parent subset; all the sets that include this object (together with all the objects selected for them so far) should then be based on this subset.
Now I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute a histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and keep only the sets that contain (or don't contain) it, we would derive the sorted histogram of the remaining sets. But then I realized this is not trivial, because we remove many sets, each containing many objects.
Of course we can find the most frequent object directly each time, i.e. in O(N*M). But this also looks inferior: in a degenerate case, where an object exists in either almost all or almost none of the sets, we may need to repeat this O(N) times. On the other hand, for those specific cases an in-place adjustment of the sorted histogram may be the preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram, rather than only the count. Actually I use fairly sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that adjusting the histogram after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeros. I thought that in cases where the split is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only the existing elements is a really good idea.
So we remove the sets of the smaller family, adjust the "big" histogram, and build the "small" one. Now I need some clarification about how to keep the big histogram sorted.
One idea, which I thought of first, was an immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in that set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbor until order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we just do a "micro-bubble" sort.
However, when removing a large number of objects, it seems better to remove all the objects first and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" holds a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in two linked lists simultaneously: in the list of a histogram element and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M here is the number of elements currently affected; this is smaller than N when only a few elements are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of the sets by T. The solution I present works in time O(T log T log N).
For clarity, I denote the initial sets with set and the set of these sets with family.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements present in the sets at the moment, sorted by frequency. It may be something like an std::set of pairs (frequency, value), perhaps with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. Maintaining it, however, is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element and one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram on the run. Now you have a histogram for both families, built in time O(min(T', T'') log N), where the log N comes from the operations on std::set.
At first glance this seems to take quadratic time. However, it is faster. Look at any single element: every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there will be O(T log T) histogram operations in total.
There might be a better solution if I knew the total size of the sets. However, no solution can be faster than O(T), and this one is only logarithmically slower.
One more improvement: if you store in the histogram not only the elements and their frequencies, but also the sets that contain each element (simply another std::set per element), you'll be able to efficiently select all the sets that contain the most frequent element.
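For reference, the greedy factorization itself can be sketched in a few lines. This Python version uses the naive rebuild-the-histogram-every-step baseline (the O(N*M) approach mentioned in the question), not the small-to-large refinement above; it returns each set's chain of objects, most shared first:

```python
from collections import Counter

def build_chains(sets):
    """Greedy factorization sketch: repeatedly pick the most frequent
    object in the current family of sets, split the family into the
    sets that contain it and those that do not, and recurse on both
    halves.  Returns, for each input set, its chain of objects ordered
    most-shared first.  The histogram is rebuilt from scratch at every
    step: the naive baseline, not the small-to-large variant."""
    chains = {i: [] for i in range(len(sets))}

    def rec(indices, remaining):
        indices = [i for i in indices if remaining[i]]
        if not indices:
            return
        # histogram of all objects still present in this family
        hist = Counter()
        for i in indices:
            hist.update(remaining[i])
        obj, _ = hist.most_common(1)[0]
        with_obj = [i for i in indices if obj in remaining[i]]
        without = [i for i in indices if obj not in remaining[i]]
        for i in with_obj:
            chains[i].append(obj)
            remaining[i].discard(obj)
        rec(without, remaining)   # sets lacking obj keep their parent
        rec(with_obj, remaining)  # sets with obj hang under the new node

    rec(list(range(len(sets))), {i: set(s) for i, s in enumerate(sets)})
    return [chains[i] for i in range(len(sets))]
```

Sets that share their most frequent objects end up with a common chain prefix, which is exactly the shared bottom part of the stacks.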

Algorithm to assign best value between points based on distance

I am having trouble figuring out an algorithm to assign values to different points on a diagram based on the distance between the points.
Essentially, I am given a diagram with a block and a dynamic amount of points. It should look something like this:
I am then given a list of values to assign to each point. Here are the rules and info:
I know the Lat,Long values for each point and the central block. In other words, I can get the direct distance from every object to another.
The list of values may be shorter than the total number of points. In this case, values can be repeated multiple times.
In the case where values must be repeated, the duplicate values should be as far away as possible from one another.
Here is an example using a value list of {1,2}:
In reality, this is a very simple example. In truth, there may be thousands of points.
First, find out how many values you need to repeat. In your example you have 2 values and 5 points, so you need 2 repetitions of the 2 values, giving 2x2 = 4 positions [call this pNum] (use different pairs as much as possible so that they are far apart from each other).
Calculate a distance array, then find the pNum largest values in that array; in other words, find the greatest 4 values in the array in your example.
Assign the repeated values to the points found to be farthest apart, and assign the rest of the points based on the distance values in the array.
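One way to realize "duplicates as far apart as possible" is greedy farthest-point ordering: repeatedly pick the point whose distance to the nearest already-picked point is largest, then hand out the values round-robin in that order. A Python sketch (plain Euclidean distance is assumed as a stand-in for great-circle distance on lat/long; a heuristic, not an optimal assignment):

```python
import math

def assign_values(points, values):
    """Greedy farthest-point ordering: repeatedly pick the unassigned
    point whose distance to the nearest already-picked point is
    largest, then assign the value list round-robin in that order,
    so copies of the same value tend to land far apart.
    points: list of (x, y) pairs; values: list of values to assign."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    order = [0]  # start from an arbitrary point
    rest = set(range(1, len(points)))
    while rest:
        nxt = max(rest, key=lambda i: min(dist(points[i], points[j])
                                          for j in order))
        order.append(nxt)
        rest.remove(nxt)

    # round-robin the value list over the farthest-point order
    return {idx: values[rank % len(values)] for rank, idx in enumerate(order)}
```

The first few picks are the mutually farthest points, so repeated values tend to spread out; for thousands of points, the naive O(n^2) loop above would need a spatial index to stay fast.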

Insert Interval into a disjoint set of intervals

Given sorted disjoint intervals (p, q), where p is the start time and q is the end time, you will be given one input interval. Insert it in the right place and return the resulting sorted disjoint intervals.
Eg: (1,4);(5,7);(8,10);(13,18)
Input interval – (3,7)
Result : (1,7);(8,10);(13,18)
Input Interval – (1,3)
Result: (1,4);(5,7);(8,10);(13,18)
Input interval – (11,12)
Result: (1,4);(5,7);(8,10);(11,12);(13,18)
There is a related question, "Inserting an interval in a sorted list of disjoint intervals", but there is no efficient answer there.
Your question and examples imply non-overlapping intervals. In that case you can just perform a binary search (whether the comparison is done by start time or end time does not matter for non-overlapping intervals) and insert the new interval at the position found, if it is not already present.
UPDATE
I missed the merging that occurs in the first example. A bad case is inserting a large interval into a long list of short intervals, where the large interval overlaps many short ones. To avoid a linear search for all intervals that have to be merged, one could perform two binary searches: one from the left comparing by start time, and one from the right comparing by end time.
Now it is trivial to decide whether the interval is already present, must be inserted, or must be merged with the intervals between the positions found by the two searches. While this is not very complex, it is probably quite prone to off-by-one errors and requires some testing.
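A sketch of the two-binary-search idea in Python (closed intervals as in the examples, so an interval touching another's endpoint merges with it; the function name is illustrative):

```python
import bisect

def insert_interval(intervals, new):
    """intervals: sorted list of disjoint (start, end) pairs.
    One bisect over the end points finds the first interval that
    could overlap the new one; one bisect over the start points
    finds the first interval strictly to its right.  Everything
    between the two positions is merged in a single splice."""
    s, e = new
    ends = [b for _, b in intervals]
    starts = [a for a, _ in intervals]
    lo = bisect.bisect_left(ends, s)      # first interval with end >= s
    hi = bisect.bisect_right(starts, e)   # first interval with start > e
    if lo < hi:                           # overlaps intervals[lo:hi]: merge
        s = min(s, intervals[lo][0])
        e = max(e, intervals[hi - 1][1])
    return intervals[:lo] + [(s, e)] + intervals[hi:]
```

On the examples above, inserting (3,7) into (1,4);(5,7);(8,10);(13,18) yields (1,7);(8,10);(13,18), and (11,12) is spliced in without merging. Building the starts/ends lists each call costs O(n); for repeated use they would be maintained incrementally.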

Algorithm for partitioning 1-dimensional space

I have two sets of intervals that correspond to the same 1-dimensional (linear) space. Here is a rough visual; in reality there are many more intervals and they are much more spread out, but this gives the basic idea.
Each of these intervals contains information, and I am writing a program to compare the information in one set of intervals (the red) to the information contained in the other set (the blue).
Here is my problem. I would like to partition the space into n chunks such that there is roughly an equal amount of comparison work to be done in each chunk (the amount of work depends on the number of intervals in that portion of the space). Also, the partition should not split any red or blue interval across two chunks.
So the input is two sets of intervals, and the desired output is a partition of the space such that
the intervals are (roughly) equally distributed across each element of the partition
no interval overlaps with multiple partition elements
Can anyone suggest an approach or an algorithm for doing this?
Define a "word" to be a maximal interval in which every point belongs either to a red interval or a blue interval. No chunk can end in the middle of a word, and every union of consecutive words is a potential chunk. Now apply a minimum raggedness word-wrap algorithm to the words, where the length of a word is defined to be the number of intervals it contains (line = chunk).
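This answer can be sketched in two steps: merge the union of both interval sets into maximal covered runs ("words", each weighted by how many intervals it swallowed), then run a minimum-raggedness DP that splits the word sequence into n chunks, minimizing the squared deviation of each chunk's weight from the ideal. A Python sketch (function names are illustrative):

```python
def make_words(red, blue):
    """Merge the union of all intervals into maximal covered runs
    ("words").  Returns (start, end, count) triples, where count is
    the number of intervals the word contains."""
    words = []
    for a, b in sorted(red + blue):
        if words and a <= words[-1][1]:        # overlaps previous word
            s, e, c = words[-1]
            words[-1] = (s, max(e, b), c + 1)
        else:
            words.append((a, b, 1))
    return words

def partition(words, n):
    """Minimum-raggedness word wrap: split the word sequence into n
    chunks minimizing the sum of squared deviations of chunk weight
    (total interval count) from the ideal weight."""
    w = [c for _, _, c in words]
    m, ideal = len(w), sum(w) / n
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    INF = float("inf")
    # best[k][i]: cost of splitting the first i words into k chunks
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    cut = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for k in range(1, n + 1):
        for i in range(1, m + 1):
            for j in range(k - 1, i):
                cost = best[k - 1][j] + (prefix[i] - prefix[j] - ideal) ** 2
                if cost < best[k][i]:
                    best[k][i], cut[k][i] = cost, j
    # recover the chunk boundaries by walking the cut table backwards
    chunks, i = [], m
    for k in range(n, 0, -1):
        j = cut[k][i]
        chunks.append(words[j:i])
        i = j
    return chunks[::-1]
```

partition(make_words(red, blue), n) returns the n chunks; no red or blue interval straddles a chunk boundary, because chunks are unions of whole words.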
