I have a list of objects, each Item having a cost and a set of resources associated with it (see below). I'm looking for a way to select a subset from this list based on the combined cost, where each resource is contained at most once (not every resource has to be included, though). The way the subset's combined cost is calculated should be exchangeable (e.g. max, min, avg). If two subsets have the same combined cost, the subset with more items is selected.
Item | cost resources [1..3]
================================
P1 | 0.5 B
P2 | 4 A B C
P3 | 1.5 A B
P4 | 2 C
P5 | 2 A
This would allow for these combinations:
Variant | Items sum
==========================
V1 | P1 P4 P5 4.5
V2 | P2 4
V3 | P3 P4 3.5
For a maximum selection V1 would be selected. The number of items can be anywhere between 1 and a few dozen; the same is true for the number of resources.
My brute force approach would just sum up the cost of all possible combinations and select the max/min one, but I assume there is a much more efficient way to do this. I'm coding in Java 8 but I'm fine with pseudocode or Matlab.
I found some questions which appeared to be similar (i.e. (1), (2), (3)) but I couldn't quite transfer them to my problem, so forgive me if you think this is a duplicate :/
Thanks in advance!
~
Clarification
A friend of mine was confused about what kinds of sets I want. No matter how I select my subset in the end, I always want to generate subsets with as many items in them as possible. If I have added P3 to my subset and can add P4 without creating a conflict (that is, a resource is used twice within the subset) then I want P3+P4, not just P3.
Clarification2
"Variants don't have to contain all resources" means that if it's impossible to add an item to fill in a missing resource slot without creating a conflict (because all items with the missing resource also have another resource already present) then the subset is complete.
This problem is NP-hard; even without the resources factor, you are dealing with the knapsack problem.
If you can transform your costs to relatively small integers, you may be able to modify the Dynamic Programming solution of Knapsack by adding one more dimension per allocated resource, with a formula similar to the following (showing the concept; make sure all edge cases work, or modify if needed):
D(_,_,2,_,_) = D(_,_,_,2,_) = D(_,_,_,_,2) = -Infinity
D(x,_,_,_,_) = -Infinity for x < 0
D(x,0,_,_,_) = 0 //this stop clause is "weaker" than the ones above - it applies only if they don't.
D(x,i,r1,r2,r3) = max{1+ D(x-cost[i],i-1,r1+res1[i],r2+res2[i],r3+res3[i]) , D(x,i-1,r1,r2,r3)}
Where cost is the array of costs, and res1,res2,res3,... are binary arrays of the resources needed by each item.
Complexity will be O(W*n*2^#resources), where W is the (scaled) cost budget and n the number of items.
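To make the recurrence concrete, here is a minimal Java sketch of the memoized DP (my own illustration, not tested against all edge cases): it assumes costs are already scaled to small non-negative integers, each item's resources are packed into a bitmask res[i], and the item index and mask are small enough to pack into the memo key.
import java.util.HashMap;
import java.util.Map;

class ResourceKnapsack
{
    private final int[] cost; // scaled integer cost per item
    private final int[] res;  // resource bitmask per item (bit r set = item uses resource r)
    private final Map<Long, Integer> memo = new HashMap<>();

    ResourceKnapsack(int[] cost, int[] res)
    {
        this.cost = cost;
        this.res = res;
    }

    // D(budget, i, mask): max number of items chosen from items 0..i-1, with
    // `budget` cost left and the resources in `mask` already taken.
    int best(int budget, int i, int mask)
    {
        if (budget < 0)
            return Integer.MIN_VALUE / 2; // -Infinity, halved so the +1 below can't overflow
        if (i == 0)
            return 0;
        long key = ((long) budget << 32) | ((long) i << 16) | mask; // assumes i, mask < 2^16
        Integer cached = memo.get(key);
        if (cached != null)
            return cached;
        int result = best(budget, i - 1, mask); // skip item i-1
        if ((mask & res[i - 1]) == 0)           // take it only if no resource is used twice
            result = Math.max(result, 1 + best(budget - cost[i - 1], i - 1, mask | res[i - 1]));
        memo.put(key, result);
        return result;
    }
}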
After giving my problem some more thought I came up with a solution I am quite proud of. This solution:
- will find all possible complete variants, that is, variants where no additional item can be added without causing a conflict
- will also find a few non-complete variants. I can live with that.
- can select the final variant by any means you want.
- works with non-integer item values.
I realized that this is indeed not a variant of the knapsack problem, as the items have a value but no weight associated with them (or, you could interpret it as a variant of the multi-dimensional knapsack problem, but with all weights equal). The code uses some lambda expressions; if you don't use Java 8 you'll have to replace those.
public class BenefitSelector<T extends IConflicting>
{
public ArrayList<T> select(ArrayList<T> proposals, Function<T, Double> valueFunction)
{
if (proposals.isEmpty())
return null;
ArrayList<ArrayList<T>> variants = findVariants(proposals);
double value = 0;
ArrayList<T> selected = null;
for (ArrayList<T> v : variants)
{
double x = 0;
for (T p : v)
x += valueFunction.apply(p);
if (x > value)
{
value = x;
selected = v;
}
}
return selected;
}
private ArrayList<ArrayList<T>> findVariants(ArrayList<T> list)
{
ArrayList<ArrayList<T>> ret = new ArrayList<>();
Conflict c = findConflicts(list);
if (c == null)
ret.add(list);
else
{
ret.addAll(findVariants(c.v1));
ret.addAll(findVariants(c.v2));
}
return ret;
}
private Conflict findConflicts(ArrayList<T> list)
{
// Sort conflicts by the number of items remaining in the first list
TreeSet<Conflict> ret = new TreeSet<>((c1, c2) -> Integer.compare(c1.v1.size(), c2.v1.size()));
for (T p : list)
{
ArrayList<T> conflicting = new ArrayList<>();
for (T p2 : list)
if (p != p2 && p.isConflicting(p2))
conflicting.add(p2);
// If conflicts are found create subsets by
// - v1: removing p
// - v2: removing all objects offended by p
if (!conflicting.isEmpty())
{
Conflict c = new Conflict(p);
c.v1.addAll(list);
c.v1.remove(p);
c.v2.addAll(list);
c.v2.removeAll(conflicting);
ret.add(c);
}
}
// Return only the conflict with the highest number of elements in v1 remaining.
// The algorithm seems to behave in such a way that it is sufficient to only
// descend into this one conflict. As the root list contains all items and we use
// the remainder of objects there should be no way to miss an item.
return ret.isEmpty() ? null
: ret.last();
}
private class Conflict
{
/** contains all items from the superset minus the offending object */
private final ArrayList<T> v1 = new ArrayList<>();
/** contains all items from the superset minus all offended objects */
private final ArrayList<T> v2 = new ArrayList<>();
// Not used right now but useful for debugging
private final T offender;
private Conflict(T offender)
{
this.offender = offender;
}
}
}
Tested with variants of the following setup:
public static void main(String[] args)
{
BenefitSelector<Scavenger> sel = new BenefitSelector<>();
ArrayList<Scavenger> proposals = new ArrayList<>();
proposals.add(new Scavenger("P1", new Resource[] {Resource.B}, 0.5));
proposals.add(new Scavenger("P2", new Resource[] {Resource.A, Resource.B, Resource.C}, 4));
proposals.add(new Scavenger("P3", new Resource[] {Resource.C}, 2));
proposals.add(new Scavenger("P4", new Resource[] {Resource.A, Resource.B}, 1.5));
proposals.add(new Scavenger("P5", new Resource[] {Resource.A}, 2));
proposals.add(new Scavenger("P6", new Resource[] {Resource.C, Resource.D}, 3));
proposals.add(new Scavenger("P7", new Resource[] {Resource.D}, 1));
ArrayList<Scavenger> result = sel.select(proposals, (p) -> p.value);
System.out.println(result);
}
private static class Scavenger implements IConflicting
{
private final String name;
private final Resource[] resources;
private final double value;
private Scavenger(String name, Resource[] resources, double value)
{
this.name = name;
this.resources = resources;
this.value = value;
}
@Override
public boolean isConflicting(IConflicting other)
{
return !Collections.disjoint(Arrays.asList(resources), Arrays.asList(((Scavenger) other).resources));
}
@Override
public String toString()
{
return name;
}
}
This results in [P1(B), P5(A), P6(CD)] with a combined value of 5.5, which is higher than that of any other combination (e.g. [P2(ABC), P7(D)]=5). As variants aren't lost until one is selected, dealing with equal variants is easy as well.
Related
Is it possible to express the following logic more succinctly using Java 8 stream constructs:
public static Set<Pair> findSummingPairsLookAhead(int[] data, int sum){
Set<Pair> collected = new HashSet<>();
Set<Integer> lookaheads = new HashSet<>();
for(int i = 0; i < data.length; i++) {
int elem = data[i];
if(lookaheads.contains(elem)) {
collected.add(new Pair(elem, sum - elem));
}
lookaheads.add(sum - elem);
}
return collected;
}
Something to the effect of Arrays.stream(data).forEach(...).
Thanks in advance.
An algorithm that involves mutating a state during iteration is not well-suited for streams. However, it is often possible to rethink an algorithm in terms of bulk operations that do not explicitly mutate any intermediate state.
In your case, the task is to collect a set of Pair(x, sum - x) where sum - x appears before x in the list. So, we can first build a map of numbers to the index of their first occurrence in the list and then use that map to filter the list and build the set of pairs:
Map<Integer, Integer> firstIdx = IntStream.range(0, data.length)
.boxed()
.collect(toMap(i -> data[i], i -> i, (a, b) -> a));
Set<Pair> result = IntStream.range(0, data.length)
.filter(i -> firstIdx.containsKey(sum - data[i]))
.filter(i -> firstIdx.get(sum - data[i]) < i)
.mapToObj(i -> new Pair(data[i], sum - data[i]))
.collect(toSet());
You can shorten the two filters by either using && or getOrDefault if you find that clearer.
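For example, with getOrDefault the two filters collapse into one (a sketch: when the key is absent, the default i makes the i < i comparison fail, so the element is dropped, which is equivalent to the containsKey check):
Set<Pair> result = IntStream.range(0, data.length)
    .filter(i -> firstIdx.getOrDefault(sum - data[i], i) < i)
    .mapToObj(i -> new Pair(data[i], sum - data[i]))
    .collect(toSet());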
It's worth mentioning that your imperative-style implementation is probably the most effective way to express your expectations. But if you really want to implement the same logic using the Java 8 Stream API, you can consider utilizing the .reduce() method, e.g.
import org.apache.commons.lang3.tuple.Pair;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
final class SummingPairsLookAheadExample {
public static void main(String[] args) {
final int[] data = new int[]{1,2,3,4,5,6};
final int sum = 8;
final Set<Pair> pairs = Arrays.stream(data)
.boxed()
.parallel()
.reduce(
Pair.of(Collections.synchronizedSet(new HashSet<Pair>()), Collections.synchronizedSet(new HashSet<Integer>())),
(pair,el) -> doSumming(pair, el, sum),
(a,b) -> a
).getLeft();
System.out.println(pairs);
}
synchronized private static Pair<Set<Pair>, Set<Integer>> doSumming(Pair<Set<Pair>, Set<Integer>> pair, int el, int sum) {
if (pair.getRight().contains(el)) {
pair.getLeft().add(Pair.of(el, sum - el));
}
pair.getRight().add(sum - el);
return pair;
}
}
Output
[(5,3), (6,2)]
The first parameter of the .reduce() method is the accumulator's initial value. This object will be passed to each iteration step. In our case we use a pair of Set<Pair> (the expected result) and Set<Integer> (same as the variable lookaheads in your example). The second parameter is a lambda (a BiFunction) that does the logic (extracted to a separate private method to make the code more compact). And the last one is the combiner, a binary operator. It's pretty verbose, but it does not rely on any side effects. @Eugene pointed out that my previous example had issues with parallel execution, so I've updated this example to be safe in parallel execution as well. If you don't run it in parallel you can simply remove the synchronized keyword from the helper method and use regular sets instead of synchronized ones as initial values for the accumulator.
Are you trying to get the unique Pairs whose sum equals the specified sum?
Arrays.stream(data).boxed()
.collect(Collectors.groupingBy(i -> i <= sum / 2 ? i : sum - i, toList())).values().stream()
.filter(e -> e.size() > 1 && (e.get(0) * 2 == sum || e.stream().anyMatch(i -> i == sum - e.get(0))))
.map(e -> Pair.of(sum - e.get(0), e.get(0)))
.collect(toList());
A list of unique pairs is returned. You can change it to a set with toSet() if you want.
What you have in place is fine (and the java-8 gods are happy). The main problem is that you are relying on side-effects and streams are not very happy about it - they even mention it explicitly in the documentation.
Well I can think of this (I've replaced Pair with SimpleEntry so that I could compile)
public static Set<AbstractMap.SimpleEntry<Integer, Integer>> findSummingPairsLookAhead2(int[] data, int sum) {
Set<Integer> lookaheads = Collections.synchronizedSet(new HashSet<>());
return Arrays.stream(data)
.boxed()
.map(x -> {
lookaheads.add(sum - x);
return x;
})
.filter(lookaheads::contains)
.collect(Collectors.mapping(
x -> new AbstractMap.SimpleEntry<Integer, Integer>(x, sum - x),
Collectors.toSet()));
}
But we are still breaking the side-effects property of map - in a safe way, but still bad. Think about people that will come after you and look at this code; they might find it at least weird.
If you don't ever plan to run this in parallel, you could drop the Collections.synchronizedSet - but do that at your own risk.
I hope someone is able to help me with what is, at least to me, quite a tricky algorithm.
The Problem
I have a List (1 <= size <= 5, but size unknown until run-time) of Lists (1 <= size <= 2) that I need to combine. Here is an example of what I am looking at:-
ListOfLists = { {1}, {2,3}, {2,3}, {4}, {2,3} }
So, there are 2 stages to what I need to do:-
(1). I need to combine the inner lists in such a way that any combination has exactly ONE item from each list, that is, the possible combinations in the result set here would be:-
1,2,2,4,2
1,2,2,4,3
1,2,3,4,2
1,2,3,4,3
1,3,2,4,2
1,3,2,4,3
1,3,3,4,2
1,3,3,4,3
The Cartesian Product takes care of this, so stage 1 is done.....now, here comes the twist which I can't figure out - at least I can't figure out a LINQ way of doing it (I am still a LINQ noob).
(2). I now need to filter out any duplicate results from this Cartesian Product. A duplicate in this case constitutes any line in the result set with the same quantity of each distinct list element as another line, that is,
1,2,2,4,3 is the "same" as 1,3,2,4,2
because each distinct item within the first list occurs the same number of times in both lists (1 occurs once in each list, 2 appears twice in each list, ....
The final result set should therefore look like this...
1,2,2,4,2
1,2,2,4,3
--
1,2,3,4,3
--
--
--
1,3,3,4,3
Another example is the worst-case scenario (from a combination point of view) where the ListOfLists is {{2,3}, {2,3}, {2,3}, {2,3}, {2,3}}, i.e. a list containing inner lists of the maximum size - in this case there would obviously be 32 results in the Cartesian Product result-set, but the pruned result-set that I am trying to get at would just be:-
2,2,2,2,2
2,2,2,2,3 <-- all other results with four 2's and one 3 (in any order) are suppressed
2,2,2,3,3 <-- all other results with three 2's and two 3's are suppressed, etc
2,2,3,3,3
2,3,3,3,3
3,3,3,3,3
To any mathematically-minded folks out there - I hope you can help. I have actually got a working solution to part 2, but it is a total hack and computationally intensive, and I am looking for guidance in finding a more elegant and efficient LINQ solution to the issue of pruning.
Thanks for reading.
pip
Some resources used so far (to get the Cartesian Product)
computing-a-cartesian-product-with-linq
c-permutation-of-an-array-of-arraylists
msdn
UPDATE - The Solution
Apologies for not posting this sooner...see below
You should implement your own IEqualityComparer<IEnumerable<int>> and then use that in Distinct().
The choice of hash code in the IEqualityComparer depends on your actual data, but I think something like this should be adequate if your actual data resemble those in your examples:
class UnorderedSequenceComparer : IEqualityComparer<IEnumerable<int>>
{
public bool Equals(IEnumerable<int> x, IEnumerable<int> y)
{
return x.OrderBy(i => i).SequenceEqual(y.OrderBy(i => i));
}
public int GetHashCode(IEnumerable<int> obj)
{
return obj.Sum(i => i * i);
}
}
The important part is that GetHashCode() should be O(N); sorting would be too slow.
void Main()
{
var query = from a in new int[] { 1 }
from b in new int[] { 2, 3 }
from c in new int[] { 2, 3 }
from d in new int[] { 4 }
from e in new int[] { 2, 3 }
select new int[] { a, b, c, d, e };
query.Distinct(new ArrayComparer());
//.Dump();
}
public class ArrayComparer : IEqualityComparer<int[]>
{
public bool Equals(int[] x, int[] y)
{
if (x == null || y == null)
return false;
return x.OrderBy(i => i).SequenceEqual<int>(y.OrderBy(i => i));
}
public int GetHashCode(int[] obj)
{
if ( obj == null || obj.Length == 0)
return 0;
var hashcode = obj[0];
for (int i = 1; i < obj.Length; i++)
{
hashcode ^= obj[i];
}
return hashcode;
}
}
The finalised solution to the whole combining of multisets, then pruning the result-sets to remove duplicates problem ended up in a helper class as a static method. It takes svick's much-appreciated answer and injects the IEqualityComparer dependency into the existing CartesianProduct answer I found at Eric Lippert's blog here (I'd recommend reading his post, as it explains the iterations in his thinking and why the LINQ implementation is the best).
static IEnumerable<IEnumerable<T>> CartesianProduct<T>(IEnumerable<IEnumerable<T>> sequences,
IEqualityComparer<IEnumerable<T>> sequenceComparer)
{
IEnumerable<IEnumerable<T>> emptyProduct = new[] { Enumerable.Empty<T>() };
var resultsSet = sequences.Aggregate(emptyProduct, (accumulator, sequence) => from accseq in accumulator
from item in sequence
select accseq.Concat(new[] { item }));
if (sequenceComparer != null)
return resultsSet.Distinct(sequenceComparer);
else
return resultsSet;
}
A machine is taking measurements and giving me discrete numbers continuously like so:
1 2 5 7 8 10 11 12 13 14 18
Let us say these measurements can be off by 2 points and a measurement is generated every 5 seconds. I want to ignore the measurements that may potentially be the same.
For example, consecutive readings of 2 and 3 could be the same because the margin of error is 2. So how do I partition the data such that I get only distinct measurements? I would also want to handle the situation in which the measurements are continuously increasing, like so:
1 2 3 4 5 6 7 8 9 10
In this case, if we keep ignoring consecutive numbers with a difference of less than 2, then we might lose actual measurements.
Is there a class of algorithms for this? How would you solve this?
Just drop any number that comes 'in range of' the previous (kept) one. It should simply work.
For your increasing example:
1 is kept, 2 is dropped because it is in range of 1, 3 is dropped because it is in range of 1, then 4 is kept, 5 and 6 are dropped in range of 4, then 7 is kept, etc., so you still keep the increasing trend if it's big enough (which is what you want, right?)
For the original example, you'd get 1,5,8,11,14,18 as a result.
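A minimal sketch of that rule in Java (my own illustration, not tested beyond the examples here; keep(values, 2) reproduces the 1, 5, 8, 11, 14, 18 result above):
import java.util.ArrayList;
import java.util.List;

// Keep a measurement only if it differs from the last kept one by more than `delta`.
static List<Integer> keep(int[] measurements, int delta)
{
    List<Integer> kept = new ArrayList<>();
    for (int m : measurements)
    {
        if (kept.isEmpty() || Math.abs(m - kept.get(kept.size() - 1)) > delta)
            kept.add(m);
    }
    return kept;
}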
In some lines of work, the standard way to deal with problems of this nature is by using the Kalman filter.
To quote Wikipedia:
Its [the Kalman filter's] purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The filter itself is very easy to implement, but does require calibration.
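For illustration, a one-dimensional version fits in a few lines. This is a sketch in Java (kept in Java for consistency with the other examples in this thread); processNoise and measurementNoise are assumed calibration constants, which is exactly the calibration mentioned above.
// Scalar Kalman filter for a roughly constant signal observed with noise.
class ScalarKalman
{
    private double estimate;        // current state estimate
    private double errorCovariance; // uncertainty of the estimate
    private final double processNoise;
    private final double measurementNoise;

    ScalarKalman(double initialEstimate, double processNoise, double measurementNoise)
    {
        this.estimate = initialEstimate;
        this.errorCovariance = 1.0;
        this.processNoise = processNoise;
        this.measurementNoise = measurementNoise;
    }

    double update(double measurement)
    {
        // Predict: the state is assumed constant, so only the uncertainty grows.
        errorCovariance += processNoise;
        // Correct: blend prediction and measurement, weighted by the Kalman gain.
        double gain = errorCovariance / (errorCovariance + measurementNoise);
        estimate += gain * (measurement - estimate);
        errorCovariance *= (1 - gain);
        return estimate;
    }
}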
I would have two queues:
- Temporary Queue
- Final Queue/List
Your first value would go into the temporary queue and into the final list. As new values come in, check to see if the new value is within the deadband of the last value in the list. If it is, then add it to the temporary queue. If not, then add it to the final list.
If your temporary queue starts to increase in size before you get a new value outside of the deadband, then once you are outside of the deadband do a check to see if the values were monotonically increasing or decreasing the whole time. If they were always increasing or decreasing, then add the contents of the queue to the final list; otherwise just add the single new value to the final list. This is the general gist of it.
Here is some code I whipped up quickly that implements a class to do what I described above:
public class MeasurementsFilter
{
private Queue<int> tempQueue = new Queue<int>();
private List<int> finalList = new List<int>();
private int deadband;
public MeasurementsFilter(int deadband)
{
this.deadband = deadband;
}
public void Reset()
{
finalList.Clear();
tempQueue.Clear();
}
public int[] FinalValues()
{
return finalList.ToArray();
}
public void AddNewValue(int value)
{
// if we are just starting then the first value always goes in the list and queue
if (tempQueue.Count == 0)
{
tempQueue.Enqueue(value);
finalList.Add(value);
}
else
{
// if the new value is within the deadband of the last value added to the final list
// then enqueue the value and wait
if ((tempQueue.Peek() - deadband <= value) && (value <= tempQueue.Peek() + deadband))
{
tempQueue.Enqueue(value);
}
// else the new value is outside of the deadband of the last value added to the final list
else
{
tempQueue.Enqueue(value);
if (QueueIsAlwaysIncreasingOrAlwaysDecreasing())
{
//dequeue first item (we already added it to the list before, but we need it for comparison purposes)
int currentItem = tempQueue.Dequeue();
while (tempQueue.Count > 0)
{
// if we are not seeing two in a row of the same (i.e. they are not duplicates of each other)
// then add the newest value to the final list
if (currentItem != tempQueue.Peek())
{
currentItem = tempQueue.Dequeue();
finalList.Add(currentItem);
}
// otherwise if we are seeing two in a row (i.e. duplicates)
// then discard the value and loop to the next value
else
{
currentItem = tempQueue.Dequeue();
}
}
// add the last item from the final list back into the queue for future deadband comparisons
tempQueue.Enqueue(finalList[finalList.Count - 1]);
}
else
{
// clear the queue and add the new value to the list and as the starting point of the queue
// for future deadband comparisons
tempQueue.Clear();
tempQueue.Enqueue(value);
finalList.Add(value);
}
}
}
}
private bool QueueIsAlwaysIncreasingOrAlwaysDecreasing()
{
List<int> queueList = new List<int>(tempQueue);
bool alwaysIncreasing = true;
bool alwaysDecreasing = true;
int tempIncreasing = int.MinValue;
int tempDecreasing = int.MaxValue;
int i = 0;
while ((alwaysIncreasing || alwaysDecreasing) && (i < queueList.Count))
{
if (queueList[i] >= tempIncreasing)
tempIncreasing = queueList[i];
else
alwaysIncreasing = false;
if (queueList[i] <= tempDecreasing)
tempDecreasing = queueList[i];
else
alwaysDecreasing = false;
i++;
}
return (alwaysIncreasing || alwaysDecreasing);
}
}
Here is some test code that you can throw into a Winform Load event or button click:
int[] values = new int[] { 1, 2, 2, 1, 4, 8, 3, 2, 1, 0, 6 };
MeasurementsFilter filter = new MeasurementsFilter(2);
for (int i = 0; i < values.Length; i++)
{
filter.AddNewValue(values[i]);
}
int[] finalValues = filter.FinalValues();
StringBuilder printValues = new StringBuilder();
for (int i = 0; i < finalValues.Length; i++)
{
printValues.Append(finalValues[i]);
printValues.Append(" ");
}
MessageBox.Show("The final values are: " + printValues);
Given two points P, Q and a delta, I defined the equivalence relation ~=, where P ~= Q if EuclideanDistance(P,Q) <= delta. Now, given a set S of n points, in the example S = (A, B, C, D, E, F) and n = 6 (the fact that the points are actually endpoints of segments is negligible), is there an algorithm with better than O(n^2) average-case complexity to find a partition of the set (the representative element of the subsets is unimportant)?
Attempts to find theoretical definitions of this problem have been unsuccessful so far: k-means clustering, nearest neighbor search and others seem to me to be different problems. The picture shows what I need to do in my application.
Any hint? Thanks
EDIT: while the actual problem (clustering near points given some kind of invariant) should be solvable in better than O(n^2) in the average case, there's a serious flaw in my problem definition: ~= is not an equivalence relation, for the simple fact that it doesn't respect the transitive property. I think this is the main reason this problem is not easy to solve and needs advanced techniques. I will post my actual solution very soon: it should work when near points all satisfy ~= as defined. It can fail when points that are poles apart don't respect the relation but are in relation with the center of gravity of the clustered points. It works well with my input data space; it may not with yours. Does anyone know a full formal treatment of this problem (with solution)?
One way to restate the problem is as follows: given a set of n 2D points, for each point p find the set of points that are contained within the circle of radius delta centred at p.
A naive linear search gives the O(n^2) algorithm you allude to.
It seems to me that this is the best one can do in the worst case. When all points in the set are contained within a circle of diameter <= delta, each of n queries would have to return O(n) points, giving an O(n^2) overall complexity.
However, one should be able to do better on more reasonable datasets.
Take a look at this (esp. the section on space partitioning) and KD-trees. The latter should give you a sub-O(n^2) algorithm in reasonable cases.
There might be a different way of looking at the problem, one that would give better complexity; I can't think of anything off the top of my head.
Definitely a problem for Quadtree.
You could also try sorting on each coordinate and playing with these two lists (sorting is n*log(n), and you only have to check the points that satisfy dx <= delta && dy <= delta). Also, you could put them in a sorted list with two levels of pointers: one for traversal on OX and another on OY.
For each point, calculate its distance D(n) from the origin; this is an O(n) operation.
Use an O(n^2) algorithm to find matches where D(a-b) < delta, skipping pairs where D(a)-D(b) > delta.
The result, on average, should be better than O(n^2) due to the (hopefully large) number of pairs skipped.
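In Java, the idea might look like this (a sketch of mine, not the answerer's code): sorting by distance from the origin first makes the skip cheap, and the reverse triangle inequality ||a| - |b|| <= |a - b| guarantees no valid pair is missed.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Returns index pairs (i, j) of points at Euclidean distance <= delta.
static List<int[]> nearPairs(double[][] pts, double delta)
{
    int n = pts.length;
    Integer[] order = new Integer[n];
    double[] norm = new double[n];
    for (int i = 0; i < n; i++)
    {
        order[i] = i;
        norm[i] = Math.hypot(pts[i][0], pts[i][1]); // distance from the origin
    }
    Arrays.sort(order, (a, b) -> Double.compare(norm[a], norm[b]));
    List<int[]> pairs = new ArrayList<>();
    for (int i = 0; i < n; i++)
    {
        // Only points whose norms are within delta can be within delta of each other.
        for (int j = i + 1; j < n && norm[order[j]] - norm[order[i]] <= delta; j++)
        {
            double dx = pts[order[i]][0] - pts[order[j]][0];
            double dy = pts[order[i]][1] - pts[order[j]][1];
            if (Math.hypot(dx, dy) <= delta)
                pairs.add(new int[] { order[i], order[j] });
        }
    }
    return pairs;
}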
This is a C# KdTree implementation that should solve the "find all neighbors of a point P within a delta" problem. It makes heavy use of functional programming techniques (yes, I love Python). It's tested, but I still have doubts about my understanding of _TreeFindNearest(). The code (or pseudocode) to solve the problem "partition a set of n points given a ~= relation in better than O(n^2) in the average case" is posted in another answer.
/*
Stripped C# 2.0 port of ``kdtree'', a library for working with kd-trees.
Copyright (C) 2007-2009 John Tsiombikas <nuclear@siggraph.org>
Copyright (C) 2010 Francesco Pretto <ceztko@gmail.com>
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
*/
using System;
using System.Collections.Generic;
using System.Text;
namespace ITR.Data.NET
{
public class KdTree<T>
{
#region Fields
private Node _Root;
private int _Count;
private int _Dimension;
private CoordinateGetter<T>[] _GetCoordinate;
#endregion // Fields
#region Constructors
public KdTree(params CoordinateGetter<T>[] coordinateGetters)
{
_Dimension = coordinateGetters.Length;
_GetCoordinate = coordinateGetters;
}
#endregion // Constructors
#region Public methods
public void Insert(T location)
{
_TreeInsert(ref _Root, 0, location);
_Count++;
}
public void InsertAll(IEnumerable<T> locations)
{
foreach (T location in locations)
Insert(location);
}
public IEnumerable<T> FindNeighborsRange(T location, double range)
{
return _TreeFindNeighborsRange(_Root, 0, location, range);
}
#endregion // Public methods
#region Tree traversal
private void _TreeInsert(ref Node current, int currentPlane, T location)
{
if (current == null)
{
current = new Node(location);
return;
}
int nextPlane = (currentPlane + 1) % _Dimension;
if (_GetCoordinate[currentPlane](location) <
_GetCoordinate[currentPlane](current.Location))
_TreeInsert(ref current._Left, nextPlane, location);
else
_TreeInsert(ref current._Right, nextPlane, location);
}
private IEnumerable<T> _TreeFindNeighborsRange(Node current, int currentPlane,
T referenceLocation, double range)
{
if (current == null)
yield break;
double squaredDistance = 0;
for (int it = 0; it < _Dimension; it++)
{
double referenceCoordinate = _GetCoordinate[it](referenceLocation);
double currentCoordinate = _GetCoordinate[it](current.Location);
squaredDistance +=
(referenceCoordinate - currentCoordinate)
* (referenceCoordinate - currentCoordinate);
}
if (squaredDistance <= range * range)
yield return current.Location;
double coordinateRelativeDistance =
_GetCoordinate[currentPlane](referenceLocation)
- _GetCoordinate[currentPlane](current.Location);
Direction nextDirection = coordinateRelativeDistance <= 0.0
? Direction.LEFT : Direction.RIGHT;
int nextPlane = (currentPlane + 1) % _Dimension;
IEnumerable<T> subTreeNeighbors =
_TreeFindNeighborsRange(current[nextDirection], nextPlane,
referenceLocation, range);
foreach (T location in subTreeNeighbors)
yield return location;
if (Math.Abs(coordinateRelativeDistance) <= range)
{
subTreeNeighbors =
_TreeFindNeighborsRange(current.GetOtherChild(nextDirection),
nextPlane, referenceLocation, range);
foreach (T location in subTreeNeighbors)
yield return location;
}
}
#endregion // Tree traversal
#region Node class
public class Node
{
#region Fields
private T _Location;
internal Node _Left;
internal Node _Right;
#endregion // Fields
#region Constructors
internal Node(T nodeValue)
{
_Location = nodeValue;
_Left = null;
_Right = null;
}
#endregion // Contructors
#region Children Indexers
public Node this[Direction direction]
{
get { return direction == Direction.LEFT ? _Left : _Right; }
}
public Node GetOtherChild(Direction direction)
{
return direction == Direction.LEFT ? _Right : _Left;
}
#endregion // Children Indexers
#region Properties
public T Location
{
get { return _Location; }
}
public Node Left
{
get { return _Left; }
}
public Node Right
{
get { return _Right; }
}
#endregion // Properties
}
#endregion // Node class
#region Properties
public int Count
{
get { return _Count; }
set { _Count = value; }
}
public Node Root
{
get { return _Root; }
set { _Root = value; }
}
#endregion // Properties
}
#region Enums, delegates
public enum Direction
{
LEFT = 0,
RIGHT
}
public delegate double CoordinateGetter<T>(T location);
#endregion // Enums, delegates
}
The following C# method, together with the KdTree class and the Join() (enumerates all collections passed as arguments) and Shuffled() (returns a shuffled version of the passed collection) helper methods, solves the problem of my question. There may be some flawed cases (read the EDITs in the question) when referenceVectors are the same vectors as vectorsToRelocate, as they are in my problem.
public static Dictionary<Vector2D, Vector2D> FindRelocationMap(
IEnumerable<Vector2D> referenceVectors,
IEnumerable<Vector2D> vectorsToRelocate)
{
Dictionary<Vector2D, Vector2D> ret = new Dictionary<Vector2D, Vector2D>();
// Preliminary filling
IEnumerable<Vector2D> allVectors =
Utils.Join(referenceVectors, vectorsToRelocate);
foreach (Vector2D vector in allVectors)
ret[vector] = vector;
KdTree<Vector2D> kdTree = new KdTree<Vector2D>(
delegate(Vector2D vector) { return vector.X; },
delegate(Vector2D vector) { return vector.Y; });
kdTree.InsertAll(Utils.Shuffled(ret.Keys));
HashSet<Vector2D> relocatedVectors = new HashSet<Vector2D>();
foreach (Vector2D vector in referenceVectors)
{
if (relocatedVectors.Contains(vector))
continue;
relocatedVectors.Add(vector);
IEnumerable<Vector2D> neighbors =
kdTree.FindNeighborsRange(vector, Tolerances.EUCLID_DIST_TOLERANCE);
foreach (Vector2D neighbor in neighbors)
{
ret[neighbor] = vector;
relocatedVectors.Add(neighbor);
}
}
return ret;
}
What is the complexity of the algorithm that is used to find the smallest snippet that contains all the search key words?
As stated, the problem is solved by a rather simple algorithm:
Just look through the input text sequentially from the very beginning and check each word: whether it is in the search key or not. If the word is in the key, add it to the end of the structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by a position at which it was found in the text.
The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add the new word to the end of The Current Block and the above Property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block.
Normalization is a potentially iterative process, since once you remove the very first word from the block, the new first word might also violate The Property, so you'll have to remove it as well. And so on.
So, basically The Current Block is a FIFO sequence: the new words arrive at the right end, and get removed by normalization process from the left end.
All you have to do to solve the problem is look through the text, maintain The Current Block, normalizing it when necessary so that it satisfies The Property. The shortest block with all the keywords in it you ever build is the answer to the problem.
For example, consider the text
CxxxAxxxBxxAxxCxBAxxxC
with keywords A, B and C. Looking through the text you'll build the following sequence of blocks
C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)
The best block we built has length 4, which is the answer in this case
CxxxAxxxBxxAxx CxBA xxxC
The exact complexity of this algorithm depends on the input, since it dictates how many iterations the normalization process will make, but ignoring the normalization the complexity would trivially be O(N * log M), where N is the number of words in the text and M is the number of keywords, and O(log M) is the complexity of checking whether the current word belongs to the keyword set.
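To make the block bookkeeping concrete, here is a compact Java sketch of the same idea (my own rendering, assuming the text is already split into a word array; it returns the length of the best block, or -1 if some keyword never occurs):
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

static int smallestSnippet(String[] words, Set<String> keys)
{
    Deque<Integer> block = new ArrayDeque<>();     // positions of keywords, in text order
    Map<String, Integer> counts = new HashMap<>(); // keyword occurrences inside the block
    int best = Integer.MAX_VALUE;
    for (int i = 0; i < words.length; i++)
    {
        if (!keys.contains(words[i]))
            continue;
        block.addLast(i);
        counts.merge(words[i], 1, Integer::sum);
        // Normalization: the first word of the block must occur exactly once.
        while (counts.get(words[block.peekFirst()]) > 1)
            counts.merge(words[block.removeFirst()], -1, Integer::sum);
        if (counts.size() == keys.size())          // all keywords are in the block
            best = Math.min(best, block.peekLast() - block.peekFirst() + 1);
    }
    return best == Integer.MAX_VALUE ? -1 : best;
}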
Now, having said that, I have to admit that I suspect that this might not be what you need. Since you mentioned Google in the caption, it might be that the statement of the problem you gave in your post is not complete. Maybe in your case the text is indexed? (With indexing the above algorithm is still applicable, just becomes more efficient). Maybe there's some tricky database that describes the text and allows for a more efficient solution (like without looking through the entire text)? I can only guess and you are not saying...
I think the solution proposed by AndreyT assumes no duplicates exist in the keywords/search terms. Also, the current block can get as big as the text itself if the text contains lots of duplicate keywords.
For example:
Text: 'ABBBBBBBBBB'
Keyword text: 'AB'
Current Block: 'ABBBBBBBBBB'
Anyway, I have implemented it in C# and did some basic testing; it would be nice to get some feedback on whether it works or not :)
static string FindMinWindow(string text, string searchTerms)
{
Dictionary<char, bool> searchIndex = new Dictionary<char, bool>();
foreach (var item in searchTerms)
{
searchIndex.Add(item, false);
}
Queue<Tuple<char, int>> currentBlock = new Queue<Tuple<char, int>>();
int noOfMatches = 0;
int minLength = Int32.MaxValue;
int startIndex = 0;
for(int i = 0; i < text.Length; i++)
{
char item = text[i];
if (searchIndex.ContainsKey(item))
{
if (!searchIndex[item])
{
noOfMatches++;
}
searchIndex[item] = true;
var newEntry = new Tuple<char, int> ( item, i );
currentBlock.Enqueue(newEntry);
// Normalization step.
while (currentBlock.Count(o => o.Item1.Equals(currentBlock.First().Item1)) > 1)
{
currentBlock.Dequeue();
}
// Figuring out minimum length.
if (noOfMatches == searchTerms.Length)
{
var length = currentBlock.Last().Item2 - currentBlock.First().Item2 + 1;
if (length < minLength)
{
startIndex = currentBlock.First().Item2;
minLength = length;
}
}
}
}
return noOfMatches == searchTerms.Length ? text.Substring(startIndex, minLength) : String.Empty;
}
This is an interesting question.
To restate it more formally:
Given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.
I'll start with a brute-force solution in hopes of inspiring others to beat it.
Note that set membership can be done in constant time, after one pass through the set. See this question.
Also note that this assumes all the elements of S are in fact in L, otherwise it will just return the sublist from 1 to n.
best = (1,n)
For i from 1 to n-k:
    Create/reset a hash found[] mapping each element of S to False, and let counter = 0.
    For j from i to n or until counter == k:
        If L[j] is in S and not found[L[j]] then counter++ and let found[L[j]] = True.
    If counter == k and j-i < best[2]-best[1] then let best = (i,j).
Time complexity is O((n+k)(n-k)). I.e., n^2-ish.
Here's a solution using Java 8.
static Map.Entry<Integer, Integer> documentSearch(Collection<String> document, Collection<String> query) {
Queue<KeywordIndexPair> queue = new ArrayDeque<>(query.size());
HashSet<String> words = new HashSet<>();
query.stream()
.forEach(words::add);
AtomicInteger idx = new AtomicInteger();
IndexPair interval = new IndexPair(0, Integer.MAX_VALUE);
AtomicInteger size = new AtomicInteger();
document.stream()
.map(w -> new KeywordIndexPair(w, idx.getAndIncrement()))
.filter(pair -> words.contains(pair.word)) // Queue.contains is O(n) so we trade space for efficiency
.forEach(pair -> {
// only the first and last elements are useful to the algorithm, so we don't bother removing
// an element from any other index. note that removing an element using equality
// from an ArrayDeque is O(n)
KeywordIndexPair first = queue.peek();
if (pair.equals(first)) {
queue.remove();
}
queue.add(pair);
first = queue.peek();
int diff = pair.index - first.index;
if (size.incrementAndGet() == words.size() && diff < interval.interval()) {
interval.begin = first.index;
interval.end = pair.index;
size.set(0);
}
});
return new AbstractMap.SimpleImmutableEntry<>(interval.begin, interval.end);
}
There are 2 static nested classes, KeywordIndexPair and IndexPair, the implementation of which should be apparent from the names; one plausible shape for them is sketched below. Using a smarter programming language that supports tuples, those classes wouldn't be necessary.
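For reference, here is a hypothetical sketch of those two classes (not the answerer's code; documentSearch compares an incoming pair against the queue head with equals, so equality is inferred to be by word):
// Word paired with the text position it was found at; equality by word.
static class KeywordIndexPair
{
    final String word;
    final int index;

    KeywordIndexPair(String word, int index)
    {
        this.word = word;
        this.index = index;
    }

    @Override
    public boolean equals(Object o)
    {
        return o instanceof KeywordIndexPair && ((KeywordIndexPair) o).word.equals(word);
    }

    @Override
    public int hashCode()
    {
        return word.hashCode();
    }
}

// Mutable begin/end interval over document indices.
static class IndexPair
{
    int begin, end;

    IndexPair(int begin, int end)
    {
        this.begin = begin;
        this.end = end;
    }

    int interval()
    {
        return end - begin;
    }
}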
Test:
Document: apple, banana, apple, apple, dog, cat, apple, dog, banana, apple, cat, dog
Query: banana, cat
Interval: 8, 10
For all the words, maintain the min and max index in case there is more than one entry; if not, both min and max index will be the same.
import edu.princeton.cs.algs4.ST;
public class DicMN {
ST<String, Words> st = new ST<>();
public class Words {
int min;
int max;
public Words(int index) {
min = index;
max = index;
}
}
public int findMinInterval(String[] sw) {
int begin = Integer.MAX_VALUE;
int end = Integer.MIN_VALUE;
for (int i = 0; i < sw.length; i++) {
if (st.contains(sw[i])) {
Words w = st.get(sw[i]);
begin = Math.min(begin, w.min);
end = Math.max(end, w.max);
}
}
if (begin != Integer.MAX_VALUE) {
return (end - begin) + 1;
}
return 0;
}
public void put(String[] dw) {
for (int i = 0; i < dw.length; i++) {
if (!st.contains(dw[i])) {
st.put(dw[i], new Words(i));
}
else {
Words w = st.get(dw[i]);
w.min = Math.min(w.min, i);
w.max = Math.max(w.max, i);
}
}
}
public static void main(String[] args) {
DicMN dic = new DicMN();
String[] arr1 = { "one", "two", "three", "four", "five", "six", "seven", "eight" };
dic.put(arr1);
String[] arr2 = { "two", "five" };
System.out.print("Interval:" + dic.findMinInterval(arr2));
}
}