Let's say that you are given n sorted arrays of numbers and you need to pick one number from each array such that the minimum distance between the n chosen elements is maximized.
Example:
arrays:
[0, 500]
[100, 350]
[200]
2<=n<=10 and every array could have ~10^3-10^4 elements.
In this example the optimal solution that maximizes the minimum distance is to pick the numbers 500, 350, 200 or 0, 200, 350, where the min distance is 150, which is the maximum possible over every combination.
I am looking for an algorithm to solve this. I know that I could binary search the max min distance, but I can't see how to decide whether there is a solution with min distance of at least d, in order for the binary search to work. I am thinking maybe dynamic programming could help, but I haven't managed to find a solution with DP.
Of course generating all combinations of n elements is not efficient. I have already tried backtracking, but it is slow since it tries every combination.
n ≤ 10 suggests that we can take an exponential dependence on n. Here's an O(2^n · m · n)-time algorithm where m is the total size of the arrays.
The dynamic programming approach I have in mind is, for each subset of
arrays, calculate all of the pairs (maximum number, minimum distance) on
the efficient frontier, where we have to choose one number from each of
the arrays in the subset. By efficient frontier I mean that if we have
two pairs (a, b) ≠ (c, d) with a ≤ c and b ≥ d, then (c, d) is not on
the efficient frontier. We'll want to keep these frontiers sorted for
fast merges.
The base case with the empty subset is easy: there's one pair, (minimum
distance = ∞, maximum number = −∞).
For every nonempty subset of arrays in some order that extends the
inclusion order, we compute a frontier for each array in the subset,
representing the subset of solutions where that array contributes the
maximum number. Then we merge these frontiers. (Naively this costs us
another factor of log n, which maybe isn't worth the hassle to avoid
given that n ≤ 10, but we can avoid it by merging the arrays once at the
beginning to enable future merges to use bucketing.)
To construct a new frontier from a subset of arrays and another array
also involves a merge. We initialize an iterator at the start of the
frontier (i.e., least maximum number) and an iterator at the start of
the array (i.e., least number). While neither iterator is past the end,
emit a candidate pair (min(minimum distance, array number − maximum number), array number). Then, if that min equalled the frontier pair's minimum distance, advance the frontier iterator, and if it equalled array number − maximum number, advance the array iterator (on a tie, advance both).
Cull the candidate pairs to leave only the efficient frontier. There is
an elegant way to do this in code that is more trouble to explain.
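To make the structure concrete, here is a minimal Python sketch of this DP (names such as max_min_distance are mine). It keeps, for every subset of arrays, the efficient frontier of (maximum number, minimum distance) pairs, but it uses a naive extend-then-cull step per subset instead of the linear-time merge described above, so it does not achieve the stated bound:

def max_min_distance(arrays):
    """Subset DP over efficient frontiers.

    frontier[mask] holds the undominated (maximum number, minimum distance)
    pairs for choosing one number from each array in `mask`.
    """
    n = len(arrays)
    frontier = {0: [(float('-inf'), float('inf'))]}
    for mask in range(1, 1 << n):
        candidates = []
        for i in range(n):
            if not mask & (1 << i):
                continue
            # Solutions in which array i contributes the maximum number.
            for (m, d) in frontier[mask ^ (1 << i)]:
                for a in arrays[i]:
                    if a >= m:
                        candidates.append((a, min(d, a - m)))
        # Cull to the efficient frontier: sort by maximum number and keep a pair
        # only if its minimum distance beats everything with a smaller maximum.
        candidates.sort(key=lambda p: (p[0], -p[1]))
        front, best = [], float('-inf')
        for (m, d) in candidates:
            if d > best:
                front.append((m, d))
                best = d
        frontier[mask] = front
    return max(d for (_, d) in frontier[(1 << n) - 1])

print(max_min_distance([[0, 500], [100, 350], [200]]))  # expected 150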
I am going to give an algorithm that, for a given distance d, outputs whether it is possible to make a selection where the distance between any pair of chosen numbers is at least d. Then you can binary-search the maximum d for which the algorithm outputs "YES" in order to find the answer to your problem.
Assume the minimum distance d is given. Here is the algorithm:
for every permutation p of size n do:
last := -infinity
ok := true
for p_i in p do:
x := take the smallest element greater than or equal to last+d in the p_i^th array (can be done efficiently with binary search).
if no such x was found; then
ok = false
break
end
last = x
done
if ok; then
return "YES"
end
done
return "NO"
So, we brute-force the order of arrays. Then, for every possible order, we use a greedy method to choose elements from each array, following the order. For example, take the example you gave:
arrays:
[0, 500]
[100, 350]
[200]
and assume d = 150. For the permutation 1 3 2, we first take 0 from the 1st array, then we find the smallest element in the 3rd array that is greater than or equal to 0+150 (it is 200), then we find the smallest element in the 2nd array which is greater than or equal to 200+150 (it is 350). Since we could find an element from every array, the algorithm outputs "YES". But for d = 200 for instance, the algorithm would output "NO" because none of the possible orderings would result in a successful selection.
The complexity of the above algorithm is O(n! * n * log(m)) where m is the maximum number of elements in an array. I believe it would be sufficient, since n is very small. (For m = 10^4, 10! * 10 * 13 ~ 5*10^8, which can be computed in under a second on a modern CPU.)
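A runnable Python version of the check plus the outer binary search could look like this (the function names are mine; the arrays are assumed to be sorted, as in the question):

from bisect import bisect_left
from itertools import permutations

def feasible(arrays, d):
    """Is there an order of the arrays for which the greedy choice succeeds with gap >= d?"""
    for order in permutations(range(len(arrays))):
        last = float('-inf')
        for i in order:
            j = bisect_left(arrays[i], last + d)   # smallest element >= last + d
            if j == len(arrays[i]):
                break
            last = arrays[i][j]
        else:
            return True                            # every array yielded an element
    return False

def max_min_distance(arrays):
    """Binary search for the largest d with feasible(arrays, d) == True."""
    lo = 0
    hi = max(max(a) for a in arrays) - min(min(a) for a in arrays)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(arrays, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

print(max_min_distance([[0, 500], [100, 350], [200]]))  # 150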
Let's look at an example with optimal choices, x (horizontal arrays A, B, C, D):
A x
B b x b
C x c
D d x
Our recurrence based on range could be: let f(low, excluded) represent the maximum closest distance between two chosen elements (from arrays 1 to n) of the subset without elements in excluded, where low is the lowest chosen element. Then:
(1)
f(low, excluded) when |excluded| = n-1:
max(low)
for low in the only permitted array
(2)
f(low, excluded):
max(
min(
a - low,
f(a, excluded')
)
)
for a ≥ low, a not in excluded'
where excluded' = excluded ∪ {low's array}
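Before the refinements below, a direct memoized Python transcription of recurrence (2) may help make the state concrete. There is no pruning and no decorated tree here, and the base case is simply +∞ once every array has been used (in place of (1)), so it is only a correctness sketch; the outer loop tries every element as the fixed lowest element, and the names are mine:

from functools import lru_cache

def max_min_distance(arrays):
    """Maximize the minimum distance, picking one number from each array."""
    n = len(arrays)
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def f(low, used):
        # `low` is the value of the lowest (most recently fixed) chosen element,
        # `used` is the bitmask of arrays already used (the excluded' set).
        if used == full:
            return float('inf')        # nothing left to choose: no further gap to bound
        best = float('-inf')
        for i in range(n):
            if used & (1 << i):
                continue
            for a in arrays[i]:
                if a >= low:
                    best = max(best, min(a - low, f(a, used | (1 << i))))
        return best

    # Try every element of every array as the fixed lowest element.
    return max(f(v, 1 << i) for i in range(n) for v in arrays[i])

print(max_min_distance([[0, 500], [100, 350], [200]]))  # expected 150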
We can limit a. For one thing the maximum we can achieve is
(3)
m = (highest - low) / (n - |excluded| - 1)
which means a need not go higher than low + m.
Secondly, we can store results for all f(a, excluded'), keyed by excluded' (we have 2^10 possible keys), each in a decorated binary tree ordered by a. The decoration will be the highest result achievable in the right subtree, meaning we can find the max for all f(v, excluded'), v ≥ a in logarithmic time.
The latter establishes a dominance relationship, and clearly we are interested in both a larger a and a larger f(a, excluded') so as to maximise the min function in (2). Picking an a in the middle, we can use a binary search. If we have:
a - low < max f(v, excluded') over v ≥ a
where that max is the lookup for a in the decorated tree,
then we look to the right, since the lookup indicates there's a better answer on the right, where a - low is also larger.
And if we have:
a - low ≥ max f(v, excluded') over v ≥ a
then we record this candidate and look to the left since, to the right, the answer is capped by that max, while a - low cannot decrease.
In order to conduct the binary search on the range [low, low + m] (see (3)), rather than merging and labelling all the arrays at the outset, we can keep them separate and compare the candidates closest to mid from each array we are currently permitted to choose a from. (The trees hold the mixed results, keyed by subset.) (The flow of this part is not completely clear to me.)
Worst case with this method, given that n = C is constant seems to be
O(C * array_length * 2^C * C * log(array_length) * log(C * array_length))
C * array_length is the iteration on low
Each low can be paired with 2^C inclusions
C * log(array_length) is the separated binary-search
And log(C * array_length) is the tree lookup
Simplifying:
= O(array_length * log^2(array_length))
although in practice, there could be many dead-end branches that exit early where a full selection wouldn't be possible.
In case it wasn't clear, the iteration is over a fixed lowest element in the selection. In other words, we want the best f(low, excluded) over all different lows (and excludeds). For a bottom-up computation, we would iterate from the highest value down so that our results for a are already stored as we iterate.
I'm having trouble with the following problem.
Given an N x S grid and m segments parallel to the horizontal axis (each given as a tuple (x', x'', y)), answer Q online queries of the form (x', x''). The answer to such a query is the smallest y (if there is one) such that we can place a segment (x', x'', y). All segments are non-overlapping, but the beginning of one segment can be the end of another, i.e. segments (x', x'', y) and (x'', x''', y) are allowed. "Being able to place a segment" means that a segment (x', x'', y) could exist without violating the stated rules; the segment isn't actually placed (the board with the initial segments isn't modified), we only state that there could be one.
Constraints
1 ≤ Q, S, m ≤ 10^5
1 ≤ N ≤ 10^9
Time: 1s
Memory: 256 Mb
Here is an example from the link below. Input segments are (2, 5, 1), (1, 2, 2), (4, 5, 2), (2, 3, 3), (3, 4, 3).
Answer for queries
1) (1, 2) → 1
2) (2, 3) → 2
3) (3, 4) → 2
4) (4, 5) → 3
5) (1, 3) → can't place a segment
A visualized answer for the third query (blue segment):
I don't quite understand how to approach the problem. It is supposed to be solved with persistent segment tree, but I am still unable to come up with something.
Could you help me please.
This is not my homework. The source problem can be found here http://informatics.mccme.ru/mod/statements/view3.php?chapterid=111614 . There's no English version of the statement available, and the test case presents the input data in a different way, so don't mind the source.
Here is an O(N log N) time solution.
Preliminaries (a good tutorial available here): segment tree, persistent segment tree.
Part 0. Original problem statement
I briefly describe the original problem statement as later I'm going to speak in its terms rather than in terms of abstract segments.
There is a train with S seats (S <= 10^5). It is known that seat s_i is occupied from time l_i to time r_i (there are no more than 10^5 such constraints, or passengers). Then we have to answer 10^5 queries of the kind "find the lowest-numbered seat which is free from time l_i to time r_i, or say that there is none". All queries must be answered online, that is, you have to answer the previous query before seeing the next.
Throughout the text I denote by N the number of seats, the number of passengers, and the number of queries, assuming they are all of the same order of magnitude. You can do a more accurate analysis if needed.
Part 1. Answering a single query
Let's answer a query [L, R] assuming that there are no occupied places after time R. For each seat we maintain the last time when it is occupied. Call it last(S). Now the answer for the query is minimum S such that last(S) <= L. If we build a segment tree on seats then we'll be able to answer this query in O(log^2 N) time: binary search the value of S, check if range minimum on segment [0, S] is at most L.
However, it might not be enough to get Accepted. We need O(log N). Recall that each node of a segment tree stores the minimum over its range. We start at the root. If the minimum there is > L then there is no available seat for such a query. Otherwise the minimum in the left child or in the right child (or both) is <= L. If the left child qualifies we descend into it, otherwise into the right child, and repeat until we reach a leaf. This leaf corresponds to the minimum seat with last(S) <= L.
Part 2. Solving the problem
We maintain a persistent tree on seats, storing last(S) for each seat (same as in the previous part). Let's process initial passengers one by one sorted by their left endpoint in increasing order. For a passenger (s_i, l_i, r_i) we update the segment tree at position s_i with value r_i. The tree is persistent, so we store the new copy somewhere.
To answer a query [L, R], find the latest version of the segment tree in which only passengers with left endpoint < R have been applied. With a binary search over the versions this takes O(log N) time.
In that version of the segment tree only passengers with left endpoint < R are considered (in fact, exactly those passengers are). So we can use the algorithm from Part 1 to answer the query on this version.
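A compact Python sketch of Parts 1 and 2 together might look as follows: a persistent range-minimum tree over seats storing last(S), the O(log N) descent, and version selection by binary search over left endpoints. The names (Node, update, first_free, build, answer_query) are mine, seats are 0-indexed, and the update takes the max with the old value so that several passengers on one seat are handled safely:

from bisect import bisect_left

NEG_INF = float('-inf')

class Node:
    __slots__ = ('mn', 'left', 'right')
    def __init__(self, mn=NEG_INF, left=None, right=None):
        self.mn, self.left, self.right = mn, left, right

def update(node, lo, hi, pos, value):
    """Return a new version in which last(pos) becomes max(old, value); O(log S) new nodes."""
    if lo == hi:
        old = node.mn if node else NEG_INF
        return Node(max(old, value))
    mid = (lo + hi) // 2
    left = node.left if node else None
    right = node.right if node else None
    if pos <= mid:
        left = update(left, lo, mid, pos, value)
    else:
        right = update(right, mid + 1, hi, pos, value)
    lmn = left.mn if left else NEG_INF      # a missing child means "all those seats are free"
    rmn = right.mn if right else NEG_INF
    return Node(min(lmn, rmn), left, right)

def first_free(node, lo, hi, L):
    """Smallest seat s in [lo, hi] with last(s) <= L in this version, else None."""
    if node is None:
        return lo                           # untouched range: every seat is free
    if node.mn > L:
        return None
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    res = first_free(node.left, lo, mid, L) # prefer the left half: smaller seat numbers
    return res if res is not None else first_free(node.right, mid + 1, hi, L)

def build(S, passengers):
    """passengers: list of (seat, l, r). Returns (sorted left endpoints, tree versions)."""
    passengers = sorted(passengers, key=lambda p: p[1])
    lefts, versions, root = [], [None], None
    for seat, l, r in passengers:
        root = update(root, 0, S - 1, seat, r)
        lefts.append(l)
        versions.append(root)
    return lefts, versions

def answer_query(S, lefts, versions, L, R):
    """Lowest free seat for the time interval [L, R], or None."""
    k = bisect_left(lefts, R)               # passengers with left endpoint < R
    return first_free(versions[k], 0, S - 1, L)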
Statement :
Input : list<x',x'',Y>
Query Input : (X',X'')
Output : Ymin
Constraints :
1 ≤ Q, S, m ≤ 10^5
1 ≤ N ≤ 10^9
Time: 1s
Memory: 256 Mb
Answer:
Data structure methods you can use:
1. Brute force: directly iterate through the list and perform the check.
2. Sort: sort the list on Y [lowest to highest] and then iterate through it.
Note: sorting a large list will be time-consuming.
Sort on Y
Ymin = -1 //Maintain Ymin
for i[0 : len(input)] : //Iterate through tuples
if Ymin != -1 && Y(i-1) != Yi : return Ymin // end case
else if x' > X'' : Ymin = Yi //its on right of query tuple
else if x'<X' && (x' + length <= X') : Ymin = Yi //on left of query tuple
else next
3. Hashmap: Map<Y, list< tuple<x', length> > > to store the list of lines for each Y, then iterate through them to get the least Y.
Note: building the map takes additional time.
Iterate through list and build a Map
Iterate through Map keys :
Iterate through list of tuples, for each tuple :
if x' > X'': Continue //its on right of query tuple
else if x'<X' && (x' + length <= X') : return Y //on left of query tuple
else next Y
4. Matrix: you can build a matrix with 1 for occupied points and 0 for empty ones.
Note: building the matrix takes additional time, and iterating through the matrix is time-consuming, so this is not useful.
Example :
0 1 1 1 0 0
1 1 0 1 0 0
0 1 1 1 1 0
As I am not very proficient in various optimization/tree algorithms, I am seeking help.
Problem Description:
Assume a large sequence of sorted nodes is given, with each node representing an integer value L. L increases with each node and no two nodes have the same L.
The goal now is to find the best combination of nodes, where the difference between the L-values of subsequent nodes is closest to a given integer value M(L) that changes over L.
Example:
So, in the beginning I would have L = 50 and M = 100. The next nodes have L = 70,140,159,240,310.
First, the value of 159 seems to be closest to L+M = 150, so it is chosen as the right value.
However, in the next step, M=100 is still given and we notice that L+M = 259, which is far away from 240.
If we now go back and choose the node with L=140 instead, which then is followed by 240, the overall match between the M values and the L-differences is stronger. The algorithm should be able to find back to the optimal path, even if a mistake was made along the way.
Some additional information:
1) the start node is not necessarily part of the best combination/path, but if required, one could first develop an algorithm, which chooses the best starter candidate.
2) the optimal combination of nodes is following the sorted sequence and not "jumping back" -> so 1,3,5,7 is possible but not 1,3,5,2,7.
3) in the end, the differences between the L values of chosen nodes should in the mean squared sense be closest to the M values
Every help is much appreciated!
If I understand your question correctly, you could use Dijkstra's algorithm:
https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
http://www.mathworks.com/matlabcentral/fileexchange/20025-dijkstra-s-minimum-cost-path-algorithm
For that you have to know the neighbours of every node and create an adjacency matrix. With the implementation of Dijkstra's algorithm linked above you can specify edge weights. You could specify your edge weight so that it is L of the accessed node + M. So for every node combination you have the L of the new node + M. In that way the algorithm should find the optimum path between your nodes.
To get all edge combinations you can use MATLAB's graph functions:
http://se.mathworks.com/help/matlab/ref/graph.html
If I understand your problem correctly you need an undirected graph.
You can access all edges with the command
G.Edges after you have created the graph.
I know it's not the perfect answer but I hope it helps!
P.S. Just watch out: Dijkstra's algorithm can only handle non-negative edge weights.
Suppose we are given a number M and a list of n numbers, L[1], ..., L[n], and we want to find a subsequence of at least q of the latter numbers that minimises the sum of squared errors (SSE) with respect to M, where the SSE of a list of k positions x[1], ..., x[k] with respect to M is given by
SSE(M, x[1], ..., x[k]) = sum((L[x[i]]-L[x[i-1]]-M)^2) over all 2 <= i <= k,
with the SSE of a list of 0 or 1 positions defined to be 0.
(I'm introducing the parameter q and associated constraint on the subsequence length here because without it, there always exists a subsequence of length exactly 2 that achieves the minimum possible SSE -- and I'm guessing that such a short sequence isn't helpful to you.)
This problem can be solved in O(qn^2) time and O(qn) space using dynamic programming.
Define f(i, j) to be the minimum sum of squared errors achievable under the following constraints:
The number at position i is selected, and is the rightmost selected position. (Here, i = 0 implies that no positions are selected.)
We require that at least j (instead of q) of these first i numbers are selected.
Also define g(i, j) to be the minimum of f(k, j) over all 0 <= k <= i. Thus g(n, q) will be the minimum sum of squared errors achievable on the entire original problem. For efficient (O(1)) calculation of g(i, j), note that
g(i>0, j>0) = min(g(i-1, j), f(i, j))
g(0, 0) = 0
g(0, j>0) = infinity
To calculate f(i, j), note that if i > 0 then any solution must be formed by appending the ith position to some solution Y that selects at least j-1 positions and whose rightmost selected position is to the left of i -- i.e. whose rightmost selected position is k, for some k < i. The total SSE of this solution to the (i, j) subproblem will be whatever the SSE of Y was, plus a fixed term of (L[i]-L[k]-M)^2 -- so to minimise this total SSE, it suffices to minimise the SSE of Y. But we can compute that minimum: it is g(k, j-1).
Since this holds for any 0 <= k < i, it suffices to try all such values of k, and take the one that gives the lowest total SSE:
f(i>=j, j>=2) = min of (g(k, j-1) + (L[i]-L[k]-M)^2) over all 0 <= k < i
f(i>=j, j<2) = 0 # If we only need 0 or 1 position, SSE is 0
f(i, j>i) = infinity # Can't choose > i positions if the rightmost chosen position is i
With the above recurrences and base cases, we can compute g(n, q), the minimum possible sum of squared errors for the entire problem. By memoising values of f(i, j) and g(i, j), the time to compute all needed values of f(i, j) is O(qn^2), since there are at most (n+1)*(q+1) possible distinct combinations of input parameters (i, j), and computing a particular value of f(i, j) requires at most (n+1) iterations of the loop that chooses values of k, each iteration of which takes O(1) time outside of recursive subcalls. Storing solution values of f(i, j) requires at most (n+1)*(q+1), or O(qn), space, and likewise for g(i, j). As established above, g(i, j) can be computed in O(1) time when all needed values of f(x, y) have been computed, so g(n, q) can be computed in the same time complexity.
To actually reconstruct a solution corresponding to this minimum SSE, you can trace back through the computed values of f(i, j) in reverse order, each time looking for a value of k that achieves a minimum value in the recurrence (there may in general be many such values of k), setting i to this value of k, and continuing on until i=0. This is a standard dynamic programming technique.
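A direct Python transcription of these recurrences (bottom-up, 1-indexed; min_sse is my name) might look like this; the final line just runs it on the node values from the question with an arbitrarily chosen q:

def min_sse(L, M, q):
    """g(n, q): minimum sum of squared errors over subsequences selecting >= q
    positions, computed exactly as in the recurrences above in O(q*n^2) time."""
    n = len(L)
    INF = float('inf')
    L = [None] + list(L)                      # 1-indexed positions, L[0] unused
    f = [[INF] * (q + 1) for _ in range(n + 1)]
    g = [[INF] * (q + 1) for _ in range(n + 1)]
    g[0][0] = 0
    for i in range(1, n + 1):
        for j in range(q + 1):
            if j <= 1:
                f[i][j] = 0                   # selecting just position i costs nothing
            elif j <= i:
                best = INF
                for k in range(1, i):         # k = rightmost selected position before i
                    if g[k][j - 1] < INF:
                        best = min(best, g[k][j - 1] + (L[i] - L[k] - M) ** 2)
                f[i][j] = best
            # j > i: stays at infinity
        for j in range(q + 1):
            g[i][j] = min(g[i - 1][j], f[i][j])
    return g[n][q]

# The node values from the question, M = 100, and an arbitrary q = 4:
print(min_sse([50, 70, 140, 159, 240, 310], 100, 4))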
I now answer my own post with my current implementation, in order to structure my post and include images. Unfortunately, the code does not do what it should do. Imagine L, M and q given as in the images below. With the calcf and calcg functions I calculated the F and G matrices, where F(i+1,j+1) is the calculated and stored f(i,j), and G(i+1,j+1) likewise comes from g(i,j). The SSE of the optimal combination should be G(N+1,q+1), but the result is wrong. If anyone can find the mistake, that would be much appreciated.
G and F Matrix of given problem in the workspace. G and F are created by calculating g(N,q) via calcg(L,N,q,M).
calcf and calcg functions
I have a set of 2D points, and I want to be able to make the following query with arguments x_min and n: what are the n points with largest y which have x > x_min?
To rephrase in Ruby:
class PointsThing
  def initialize(points)
    @points = points
  end

  def query(x_min, n)
    @points.select { |point| point.x > x_min }.sort_by { |point| -point.y }.take(n)
  end
end
Ideally, my class would also support an insert and delete operation.
I can't think of a data structure for this which would enable the query to run in less than O(|#points|) time. Does anyone know one?
Sort the points by x descending. For each point in order, insert it into a purely functional red-black tree ordered by y descending. Keep all of the intermediate trees in an array.
To look up a particular x_min, use binary search to find the intermediate tree where exactly the points with x > x_min have been inserted. Traverse this tree to find the first n points.
The preprocessing cost is O(p log p) in time and space, where p is the number of points. The query time is O(log p + n), where n is the number of points to be returned in the query.
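Here is a Python sketch of this idea. For brevity it uses an unbalanced persistent BST keyed by y descending instead of a red-black tree, so the worst-case depth (and hence the bounds) can degrade; a balanced persistent tree (red-black, treap, weight-balanced) restores the stated O(log p + n) query time. Points are assumed to be (x, y) tuples, and all names except PointsThing are mine:

from bisect import bisect_left

class Node:
    __slots__ = ('y', 'point', 'left', 'right')
    def __init__(self, y, point, left=None, right=None):
        self.y, self.point, self.left, self.right = y, point, left, right

def insert(node, y, point):
    """Persistent insert, ordered by y descending; only the search path is copied."""
    if node is None:
        return Node(y, point)
    if y >= node.y:
        return Node(node.y, node.point, insert(node.left, y, point), node.right)
    return Node(node.y, node.point, node.left, insert(node.right, y, point))

class PointsThing:
    def __init__(self, points):
        # Sort by x descending; versions[i] holds the first i points (largest x first).
        self.points = sorted(points, key=lambda p: -p[0])
        self.neg_xs = [-p[0] for p in self.points]       # ascending, for bisect
        self.versions = [None]
        tree = None
        for (x, y) in self.points:
            tree = insert(tree, y, (x, y))
            self.versions.append(tree)

    def query(self, x_min, n):
        cnt = bisect_left(self.neg_xs, -x_min)           # how many points have x > x_min
        out = []
        def walk(node):                                  # in-order = y descending
            if node is None or len(out) >= n:
                return
            walk(node.left)
            if len(out) < n:
                out.append(node.point)
            walk(node.right)
        walk(self.versions[cnt])
        return out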
If your data are not sorted, then you have no choice but to check every point, since you cannot know whether there exists another point with y greater than all those seen so far and with x > x_min. In short: you can't know if another point should be included if you don't check them all.
In that case, I would assume that it would be impossible to check in sublinear time as you ask for, since you have to check them all. Best case for searching all would be linear.
If your data are sorted, then your best case will be constant time (all n points are those with the greatest y), and worst case would be linear (all n points are those with least y). Average case would be closer to constant I think if your x and x_min are both roughly random within a specific range.
If you want this to scale (that is, you could have large values of n), you will want to keep your resultant set sorted as well since you will need to check new potential points against it and to drop the lowest value when you insert (if size > n). Using a tree, this can be log time.
So, to do the entire thing, the worst case is for unsorted points, in which case you're looking at n log(n) time. Sorted points are better, in which case you're looking at an average case of log(n) time (again, assuming roughly randomly distributed values for x and x_min), which yes is sub-linear.
In case it isn't at first obvious why sorted points will have constant time to search through, I will go over that here quickly.
If the n points with the greatest y values all had x > x_min (the best case) then you are just grabbing what you need off the top, so that case is obvious.
For the average case, assuming roughly randomly distributed x and x_min, the odds that x > x_min are basically half. For any two random numbers a and b, a > b is just as likely to be true as b > a. This is the same thing with x and x_min; x > x_min is equally as likely to be true as x_min > x, meaning 0.5 probability. This means that, for your points, on average every second point checked will meet your x > x_min requirement, so on average you will check 2n points to find the n highest points that meet your criteria. So the best case was c time, average is 2c which is still constant.
Note, however, that for values of n approaching the size of the set this hides the fact that you are going through the entire set, essentially bringing it right back up to linear time. So my assertion that it is constant time does not hold true if you assume random values of n within the range of the size of your set.
If this is not a purely academic question and is prompted by some actual need, then it depends on the situation.
(edit)
I just realized that my constant-time assertions were assuming a data structure where you have direct access to the highest value and can go sequentially to lower values. If the data structure that those are provided to you in does not fit that description, then obviously that will not be the case.
Some precomputation would help in this case.
First partition the set of points taking x_min as pivot element.
Then, for the set of points lying to the right of x_min, build a max-heap based on the y coordinates.
Now run your query as: Perform n extract_max operations on the built max_heap.
The running time of your query would be log X + log(X-1) + ... + log(X-(n-1)):
log X: for the first extract-max operation.
log(X-1): for the second extract-max operation, and so on.
X: size of the original max-heap.
Even in the worst case when n << X, the time taken would be O(n log X).
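In Python terms, a per-query version of this idea (top_n_by_y is my name) could look like the following; note that the partition and heap construction are themselves linear in the number of points to the right of x_min and would have to be redone for a different x_min:

import heapq

def top_n_by_y(points, x_min, n):
    """Partition by x_min, build a max-heap on y, then pop n times.

    Building the heap is linear in X = number of points with x > x_min,
    and each pop is O(log X), so one query costs O(X + n log X).
    """
    right = [(-y, x, y) for (x, y) in points if x > x_min]   # max-heap via negated y
    heapq.heapify(right)
    out = []
    for _ in range(min(n, len(right))):
        _, x, y = heapq.heappop(right)
        out.append((x, y))
    return out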
Notation
Let P be the set of points.
Let top_y(n, x_min) describe the query to collect the n points from P with the largest y-coordinates among those with x-coordinate greater than or equal to x_min.
Let x_0 be the minimum of the x coordinates in your point set. Partition the x axis to the right of x_0 into a set of left-hand closed, right-hand open intervals I_i, determined by the set of x coordinates of the point set P, such that min(I_i) is the i-th smallest x coordinate from P. Define the coordinate rank r(x) of x as the index of the interval that x is an element of, or 0 if x < x_0.
Note that r(x) can be computed in O(log #({I_i})) using a binary search tree.
Simple Solution
Sort your point set by decreasing y-coordinates and save this array A in time O(#P log #P) and space O(#P).
Process each query top_y(n, x_min) by traversing this array in order, skipping over items A_i with A_i.x < x_min, and counting all other entries until the counter reaches n or you are at the end of A. This processing takes O(1) extra space, and O(n) time apart from the items skipped (up to O(#P) in the worst case).
Note that this may already be sufficient: queries top_y(n_0, a_0) with a_0 < min { p.x | p \in P } and n_0 = c * #P, c = const, require step 1 anyway, and for n << #P and 'infrequent' queries any further optimization may not be worth the effort.
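A quick Python sketch of the simple solution (top_y is my name; A is the y-descending array from step 1):

def top_y(A, n, x_min):
    """A is the point array sorted by decreasing y. Return the first n entries with x >= x_min."""
    out = []
    for (x, y) in A:
        if x >= x_min:
            out.append((x, y))
            if len(out) == n:
                break
    return out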
Observation
Consider the sequences s_i, s_(i+1) of points with x-coordinates greater than or equal to min(I_i), min(I_(i+1)) respectively, ordered by decreasing y-coordinate. s_(i+1) is a strict subsequence of s_i.
If p_1 \in s_(i+1) and p_2.x >= p_1.x then p_2 \in s_(i+1).
Refined Solution
A refined data structure allows for O(n) + O(log #P) query processing time.
Annotate the array A from the simple solution with a 'successor dispatch' for precisely those elements A_i with A_(i+1).x < A_i.x; This dispatch data would consist of an array disp:[r(A_(i+1).x) + 1 .. r(A_i.x)] of A-indexes of the next element in A whose x-coordinate ranks at least as high as the index into disp. The given dispatch indices suffice for processing the query, since ...
... disp[j] = disp[r(A_(i+1).x) + 1] for each j <= r(A_(i+1).x).
... for any x_min with r(x_min) > r(A_i.x), the algorithm wouldn't be here
The proper index to access disp is r(x_min) which remains constant throughout a query and thus takes O(log #P) to compute once per query while the index selection itself is O(1) at each A element.
disp can be precomputed. No two disp entries across all disp arrays are identical (Proof skipped, but it's easy [;-)] to see given the construction). Therefore the construction of disp arrays can be performed stack-based in a single sweep through the point set sorted in A. As there are #P entries, the disp structure takes O(#P) space and O(#P) time to construct, being dominated by space and time requirements for y-sorting. So in a certain sense, this structure comes for free.
Time requirements for query top_y(n,x_min)
Computing r(x_min): O(log #P);
Passage through A: O(n);
I was trying to split a binary-tree into k similar-sized parts (by removing k-1 edges). Is there any efficient algorithm for this problem? Or is it NP-hard? Any pointers to papers, problem definitions, etc?
-- One reasonable metric for evaluating the quality of partitioning could be the size gap between the largest and smallest partition; another metric could be making the smallest partition having as many vertices as possible.
I can suggest a pretty fast solution for the metric of making the smallest part have as many vertices as possible.
Let's suppose we guess the size S of the smallest part and want to check if it's correct.
First I want to make a few statements:
If the total size of the tree is bigger than S, there is at least one subtree which is bigger than S while all subtrees of that subtree are smaller. (It's enough to check its two child subtrees.)
If there is some way to split the tree where the size of the smallest part is >= S, and we have a subtree T all of whose subtrees are smaller than S, then we can guarantee that no edges inside T are deleted. (Any such deletion would create a part smaller than S.)
If there is some way to split the tree where the size of the smallest part is >= S, and we have some subtree T whose size is >= S, which has no deleted edges inside but is not itself one of the parts, then we can split the tree in another way where subtree T is one of the parts and all parts are still no smaller than S. (Just move the extra vertices from the original part to any other part; that other part will not become smaller.)
So here is an algorithm to check whether we can split the tree into k parts, each no smaller than S.
Find all suitable vertices (roots of subtrees of size >= S whose child subtrees both have size < S) and add them to a list. You can start from the root and move through the vertices while their subtrees are bigger than S.
While the list is not empty and the number of parts is less than k, take a vertex from the list and cut it off the tree. Then update the subtree sizes of its ancestors and add any of them that becomes suitable to the list.
You don't even need to update all the ancestors, only up to the first one whose new subtree size is still bigger than S; the ancestors above it can't become suitable yet and can be updated later.
You may need to reconstruct the tree afterwards to restore the original subtree sizes assigned to the vertices.
Now we can use the bisection method. We can take the upper bound Smax = n/k, and the lower bound can be derived from the equation (2*Smin - 1)*(k - 1) + Smin = n: it guarantees that if we cut off k-1 subtrees of size at most 2*Smin - 1 each (two child subtrees of size Smin - 1 plus the root), we will still have a part of size Smin left. This gives Smin = (n + k - 1)/(2*k - 1).
And now we can check S = (Smax + Smin)/2
If we manage to construct a partition using the method above, then S is smaller than or equal to its largest possible value; moreover, the smallest part in the constructed partition may be bigger than S, and we can set the new lower bound to that value instead of S. If we fail, S is bigger than possible.
The time complexity of one check is k multiplied by the number of ancestor nodes updated each time; for a well-balanced tree the number of updated nodes is constant (using the trick explained earlier, we do not update all ancestors), and it is still no bigger than n/k in the worst case of an extremely unbalanced tree. Searching for suitable vertices has very similar behaviour (all vertices passed while searching will be updated later).
The difference between n/k and (n + k - 1)/(2*k - 1) is proportional to n/k.
So we have time complexity O(k * log(n/k)) in the best case if we have precalculated subtree sizes, O(n) if subtree sizes are not precalculated, and O(n * log(n/k)) in the worst case.
This method may lead to a situation where the last part is comparatively big, but I suppose once you've got the suggested method you can figure out some improvements to minimise that.
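A compact Python version of this approach might look as follows. Instead of maintaining an explicit list of suitable vertices it uses an equivalent bottom-up greedy (cut a subtree off as soon as its not-yet-cut size reaches S), and it binary-searches S starting from 1 rather than from the (n + k - 1)/(2*k - 1) bound; all names are mine:

def can_split(children, root, k, S):
    """Can the tree be split into k parts, each of size >= S, by deleting k-1 edges?

    Bottom-up greedy in the spirit of the procedure above: as soon as the
    not-yet-cut size of a subtree reaches S, cut it off (making at most k-1 cuts).
    children[v] lists the children of v in the rooted tree; the DFS is recursive,
    so very deep trees would need an iterative rewrite.
    """
    cuts = 0
    def dfs(v):
        nonlocal cuts
        size = 1
        for c in children[v]:
            size += dfs(c)
        if v != root and size >= S and cuts < k - 1:
            cuts += 1
            return 0                     # this subtree becomes its own part
        return size
    rest = dfs(root)
    return cuts == k - 1 and rest >= S

def max_smallest_part(children, root, n, k):
    """Binary search for the largest S such that a split into k parts, each >= S, exists."""
    lo, hi = 1, n // k                   # the smallest part can never exceed n/k
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_split(children, root, k, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# A path 0-1-2-3-4-5 rooted at 0, split into k = 2 parts: best smallest part is 3.
children = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5], 5: []}
print(max_smallest_part(children, 0, 6, 2))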
Here is a polynomial deterministic solution:
Let's assume that the tree is rooted and there are two fixed values: MIN and MAX - minimum and maximum allowed size of one component.
Then one can use dynamic programming to check if there is a partition such that each component size is between MIN and MAX:
Let's assume f(node, cuts_count, current_count) is true if and only if there is a way to make exactly cuts_count cuts in node's subtree so that current_count vertices remain connected to the node and every completely separated component has size between MIN and MAX.
The base case for the leaves is: f(leaf, 1, 0) (cut the edge from the parent to the leaf) is true if and only if MIN <= 1 <= MAX; f(leaf, 0, 1) (do not cut it) is always true. (f is false for all other values of cuts_count and current_count.)
To compute f for a node (not a leaf), one can use the following algorithm:
//Combine all possible children states.
for cuts_left in 0..k
  for cuts_right in 0..k
    for cnt_left in 0..left_subtree_size
      for cnt_right in 0..right_subtree_size
        if f(left_child, cuts_left, cnt_left) is true and
           f(right_child, cuts_right, cnt_right) is true then
          f(node, cuts_left + cuts_right, cnt_left + cnt_right + 1) = true
//Cut an edge from this node to its parent.
for cuts in 0..k-1
  for cnt in 0..node's_subtree_size
    if f(node, cuts, cnt) is true and MIN <= cnt <= MAX:
      f(node, cuts + 1, 0) = true
What this pseudocode does is combine all possible states of the node's children to compute all reachable states for this node (the first group of for loops), and then produce the rest of the reachable states by cutting the edge between this node and its parent (the second group of for loops). (A state means a (node, cuts_count, current_count) tuple; I call it reachable if f(state) is true.)
That covers a node with two children; the case with one child can be processed in a similar manner.
Finally, if f(root, k, 0) is true then it is possible to find a partition which satisfies the size condition, and it is not possible otherwise. We need to "pretend" that we made k cuts here because we also cut an imaginary edge from the root to its parent (this edge and this parent don't actually exist) when we computed f for the root (to avoid a corner case).
The space complexity of this algorithm (for fixed MIN and MAX) is O(n^2 * k) (n is the number of nodes), and the time complexity is O(k^2 * n^2). It might seem that the complexity is actually O(k^2 * n^3), but it is not so, because the product of the numbers of vertices in the left and right subtrees of a node is exactly the number of pairs of nodes whose least common ancestor is this node. But the total number of pairs of nodes is O(n^2) (and each pair has only one least common ancestor). Thus, the sum of the products of left and right subtree sizes over all nodes is O(n^2).
One can simply try all possible MIN and MAX values and choose the best, but it can be done faster. The key observation is that if there is a solution for MIN and MAX, there is always a solution for MIN and MAX + 1. Thus, one can iterate over all possible values of MIN(n / k different values) and apply binary search to find the smallest MAX which gives a valid solution(log n iterations). So the overall time complexity is O(n^2 * k^2 * n / k * log n) = O(n^3 * k * log n). However, if you want to maximize MIN(not to minimize the difference between MAX and MIN), you can simply use this algorithm and ignore MAX value everywhere(by setting its value to n). Then no binary search over MAX would be required, but one would be able to binary search over MIN instead and obtain an O(n^2 * k^2 * log n) solution.
To reconstruct the partition itself, one can start from f(root, k, 0) and apply the steps we used to compute f, but this time in opposite direction(from root to leaves). It is also possible to save the information about how to get the value of each state(what children's states were combined or what was the state before the edge was cut)(and update it appropriately during the initial computation of f) and then reconstruct the partition using this data(if my explanation of this step seems not very clear, reading an article on dynamic programming and reconstructing the answer might help).
So, there is a polynomial solution for this problem on a binary tree(even though it is NP-hard for an arbitrary graph).
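For completeness, here is a small Python sketch of this DP, holding the reachable (cuts_count, current_count) pairs of each node in a set; it handles any number of children, and the final check (k, 0) in dfs(root) corresponds to f(root, k, 0) with the imaginary root edge cut. All names are mine:

def can_partition(children, root, k, MIN, MAX):
    """True iff the rooted tree splits into k components, each of size in [MIN, MAX]."""
    def dfs(v):
        # Reachable (cuts_count, vertices-still-connected-to-v) pairs for v's subtree.
        states = {(0, 1)}                             # v alone, nothing cut yet
        for c in children[v]:
            child = dfs(c)
            states = {(cu + cc, cnt + ccnt)
                      for (cu, cnt) in states
                      for (cc, ccnt) in child
                      if cu + cc <= k}
        # Optionally cut the edge to the parent, finalizing v's component.
        cut = {(cu + 1, 0) for (cu, cnt) in states
               if MIN <= cnt <= MAX and cu + 1 <= k}
        return states | cut
    return (k, 0) in dfs(root)

# A path 0-1-2-3 split into two components with sizes in [1, 3]:
children = {0: [1], 1: [2], 2: [3], 3: []}
print(can_partition(children, 0, 2, 1, 3))            # True (sizes 1 + 3 or 2 + 2)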