Algorithm to find if two sets of sets of numbers are isomorphic or not (under permutation)

Given two systems, each consisting of a set of sets of numbers, I would like to know whether they are isomorphic under permutation.
For example
{{1,2,3,4,5},{2,4,5,6,7},{2,3,4,6,7}} is a system of 3 sets of 5 numbers.
{{1,2,3,4,6},{2,3,5,6,7},{2,3,4,8,9}} is another system of 3 sets of 5 numbers. I want to check if these systems are isomorphic.
They are not: the first system uses the numbers { 1,2,3,4,5,6,7 }, while the second one uses the numbers { 1,2,3,4,5,6,7,8,9 }.
Here is another example.
{{1,2,3}, {1,2,4}, {3,4,5}} and {{1,2,4}, {1,3,5}, {2,3,5}}. Those two systems of 3 sets of 3 numbers are isomorphic.
If I apply the permutation (5 3 1 2 4), where 1 becomes 5, 2 becomes 3, etc., the first set becomes {5,3,1}, the second becomes {5,3,2}, and the third becomes {1,2,4}. So the system transformed by this permutation is {{5,3,1},{5,3,2},{1,2,4}}, which can equivalently be rewritten as {{1,2,4},{1,3,5},{2,3,5}} since I am not interested in order. This is the second system, so the answer is yes.
Currently, on the first example, I apply all 9! permutations of {1,2,3,...,9} to the first system and check whether I can obtain the second one. This gives me an answer, but very slowly.
Is there a cleverer algorithm?
(I only want the answer, yes or no. I am not interested in getting a permutation that transforms the first system into the second one.)
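For reference, a minimal sketch of that brute force (class and method names are mine, for illustration only): it enumerates every bijection between the two ground sets and compares the transformed system with the target.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the brute-force check: try all bijections
// between the ground sets and compare the mapped system with the target.
public class BruteForceIso {

    public static boolean isomorphic(Set<Set<Integer>> a, Set<Set<Integer>> b) {
        List<Integer> groundA = new ArrayList<>(flatten(a));
        List<Integer> groundB = new ArrayList<>(flatten(b));
        if (groundA.size() != groundB.size() || a.size() != b.size()) {
            return false;
        }
        return permute(groundB, 0, groundA, a, b);
    }

    // Generate all orderings of groundB in place; each ordering defines
    // the bijection groundA.get(k) -> groundB.get(k).
    private static boolean permute(List<Integer> groundB, int i, List<Integer> groundA,
                                   Set<Set<Integer>> a, Set<Set<Integer>> b) {
        if (i == groundB.size()) {
            Map<Integer, Integer> p = new HashMap<>();
            for (int k = 0; k < groundA.size(); k++) {
                p.put(groundA.get(k), groundB.get(k));
            }
            Set<Set<Integer>> mapped = new HashSet<>();
            for (Set<Integer> s : a) {
                Set<Integer> t = new HashSet<>();
                for (Integer e : s) {
                    t.add(p.get(e));
                }
                mapped.add(t);
            }
            return mapped.equals(b);
        }
        for (int j = i; j < groundB.size(); j++) {
            Collections.swap(groundB, i, j);
            if (permute(groundB, i + 1, groundA, a, b)) {
                return true;
            }
            Collections.swap(groundB, i, j);
        }
        return false;
    }

    private static Set<Integer> flatten(Set<Set<Integer>> sets) {
        Set<Integer> all = new HashSet<>();
        for (Set<Integer> s : sets) {
            all.addAll(s);
        }
        return all;
    }
}

With k distinct numbers this tries k! bijections, which is exactly why it becomes infeasible much beyond k = 9 or 10.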

As pointed out in the comments, this might correspond to graph-theoretic problems whose complexity, and the algorithms that can be employed to tackle them, are still under investigation.
However, the complexity always refers to some input size. And here, it is not clear what your input size is. As an example: I think that the most appropriate algorithm might depend on whether you are going to scale up...
the number of numbers (1...9 in your example) or
the number of sets in each system (3, in your example) or
the size of the sets in each system (5, in your example)
Using your current approach, scaling the number of numbers would not be feasible, because you can't enumerate all permutations for numbers much larger than 9 due to the factorial running time. But if your intention was to check the isomorphism of systems containing 1000 sets, an algorithm that was polynomial in the number of sets (if such an algorithm existed) might still be slower in practice.
Here, I'd like to sketch an approach that I tried. I did not perform a detailed complexity analysis (which might be pointless if no polynomial-time solution exists at all; proving or disproving that can't be the subject of an answer here).
The basic idea is as follows:
Initially, you compute the valid "domains" for each input number. These are the possible values that each number may be mapped to under a candidate permutation. If the given numbers are 1, 2 and 3, then the domains initially could be
1 -> { 1, 2, 3 }
2 -> { 1, 2, 3 }
3 -> { 1, 2, 3 }
But for the given sets, one can already derive some information that allows reducing the domains. For example: any number that appears in n of the first system's sets must be mapped to a number that appears in n of the second system's sets.
Imagine that the given sets are
{{1,2},{1,3}}
{{3,1},{3,2}}
Then the domains would only be
1 -> { 3 }
2 -> { 1, 2 }
3 -> { 1, 2 }
because the 1 appears twice in the first sets, and the only value that appears twice in the second sets is the 3.
After the initial domains are computed, one can perform a backtracking over the possible assignments (permutations) of the numbers. The backtracking can roughly be done as

for (each number n that has no permutation value assigned) {
    assign a permutation value (from the current domain of n) to n
    update the domains of all other numbers
    if the domains are no longer valid, then backtrack
    if the solution was found, then return it
}
(The idea is somewhat "inspired" by the Arc Consistency 3 algorithm, although, technically, the problems are not directly related.)
During the backtracking, one can employ different pruning criteria. That is, one can think of various tricks in order to quickly check whether a certain assignment (a partial permutation) and the domains that are implied by this assignment are "valid" or not.
The obvious (necessary) criterion for an assignment to be valid is that none of the domains may be empty. More generally: a set that occurs as the domain of k different numbers must contain at least k elements, or no injective assignment can exist. When you find out that the domains are
1 -> { 4 }
2 -> { 2,3 }
3 -> { 2,3 }
4 -> { 2,3 }
then there can no longer be a valid solution, because the three numbers 2, 3 and 4 would all have to receive distinct values from the two-element set {2,3}, and the algorithm may backtrack.
Of course, backtracking tends to have exponential complexity in the input size. But it might be that there simply exists no efficient algorithm for this problem. In that case, the pruning that may be employed during the backtracking at least helps to reduce the running time for certain cases (or for small input sizes in general) compared to a brute-force exhaustive search.
Here is an implementation of my experiments, in Java. It is not particularly elegant, but shows that the approach basically works: it quickly finds a solution if one exists, and (for the given input sizes) does not take long to detect when there is no solution.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

public class SetSetIsomorphisms
{
    public static void main(String[] args)
    {
        Map<Integer, Integer> p = new LinkedHashMap<Integer, Integer>();
        p.put(0, 3);
        p.put(1, 4);
        p.put(2, 8);
        p.put(3, 2);
        p.put(4, 1);
        p.put(5, 5);
        p.put(6, 0);
        p.put(7, 9);
        p.put(8, 7);
        p.put(9, 6);

        Set<Set<Integer>> sets0 = new LinkedHashSet<Set<Integer>>();
        sets0.add(new LinkedHashSet<Integer>(Arrays.asList(1,2,3,4,5)));
        sets0.add(new LinkedHashSet<Integer>(Arrays.asList(2,4,5,6,7)));
        sets0.add(new LinkedHashSet<Integer>(Arrays.asList(0,8,3,9,7)));

        Set<Set<Integer>> sets1 = new LinkedHashSet<Set<Integer>>();
        for (Set<Integer> set0 : sets0)
        {
            sets1.add(applyMapping(set0, p));
        }

        // Uncomment these lines for a case where NO permutation is found
        //sets1.remove(sets1.iterator().next());
        //sets1.add(new LinkedHashSet<Integer>(Arrays.asList(4,8,2,3,5)));

        System.out.println("Initially valid? "+
            areIsomorphic(sets0, sets1, p));

        boolean areIsomorphic = areIsomorphic(sets0, sets1);
        System.out.println("Result: "+areIsomorphic);
    }

    private static <T> boolean areIsomorphic(
        Set<Set<T>> sets0, Set<Set<T>> sets1)
    {
        System.out.println("sets0:");
        for (Set<T> set0 : sets0)
        {
            System.out.println(" "+set0);
        }
        System.out.println("sets1:");
        for (Set<T> set1 : sets1)
        {
            System.out.println(" "+set1);
        }

        Set<T> all0 = flatten(sets0);
        Set<T> all1 = flatten(sets1);
        System.out.println("All elements");
        System.out.println(" "+all0);
        System.out.println(" "+all1);
        if (all0.size() != all1.size())
        {
            System.out.println("Different number of elements");
            return false;
        }

        Map<T, Set<T>> domains = computeInitialDomains(sets0, sets1);
        System.out.println("Domains initially:");
        print(domains, "");

        Map<T, T> assignment = new LinkedHashMap<T, T>();
        return compute(assignment, domains, sets0, sets1, "");
    }

    private static <T> Map<T, Set<T>> computeInitialDomains(
        Set<Set<T>> sets0, Set<Set<T>> sets1)
    {
        Set<T> all0 = flatten(sets0);
        Set<T> all1 = flatten(sets1);
        Map<T, Set<T>> domains = new LinkedHashMap<T, Set<T>>();
        for (T e0 : all0)
        {
            Set<T> domain0 = new LinkedHashSet<T>();
            for (T e1 : all1)
            {
                if (isFeasible(e0, sets0, e1, sets1))
                {
                    domain0.add(e1);
                }
            }
            domains.put(e0, domain0);
        }
        return domains;
    }

    private static <T> boolean isFeasible(
        T e0, Set<Set<T>> sets0,
        T e1, Set<Set<T>> sets1)
    {
        int c0 = countContaining(sets0, e0);
        int c1 = countContaining(sets1, e1);
        return c0 == c1;
    }

    private static <T> int countContaining(Set<Set<T>> sets, T value)
    {
        int count = 0;
        for (Set<T> set : sets)
        {
            if (set.contains(value))
            {
                count++;
            }
        }
        return count;
    }

    private static <T> boolean compute(
        Map<T, T> assignment, Map<T, Set<T>> domains,
        Set<Set<T>> sets0, Set<Set<T>> sets1, String indent)
    {
        if (!validCounts(domains.values()))
        {
            System.out.println(indent+"There are too many domains "
                + "with too few elements");
            print(domains, indent);
            return false;
        }
        if (assignment.keySet().equals(domains.keySet()))
        {
            System.out.println(indent+"Found assignment: "+assignment);
            return true;
        }

        List<Entry<T, Set<T>>> entryList =
            new ArrayList<Map.Entry<T,Set<T>>>(domains.entrySet());
        Collections.sort(entryList, new Comparator<Map.Entry<T,Set<T>>>()
        {
            @Override
            public int compare(Entry<T, Set<T>> e0, Entry<T, Set<T>> e1)
            {
                return Integer.compare(
                    e0.getValue().size(),
                    e1.getValue().size());
            }
        });

        for (Entry<T, Set<T>> entry : entryList)
        {
            T key = entry.getKey();
            if (assignment.containsKey(key))
            {
                continue;
            }
            Set<T> domain = entry.getValue();
            for (T value : domain)
            {
                Map<T, Set<T>> newDomains = copy(domains);
                removeFromOthers(newDomains, key, value);
                assignment.put(key, value);
                newDomains.get(key).clear();
                newDomains.get(key).add(value);

                System.out.println(indent+"Using "+assignment);

                Set<Set<T>> setsContainingKey =
                    computeSetsContainingValue(sets0, key);
                Set<Set<T>> setsContainingValue =
                    computeSetsContainingValue(sets1, value);

                Set<T> keyElements = flatten(setsContainingKey);
                Set<T> valueElements = flatten(setsContainingValue);

                for (T otherKey : keyElements)
                {
                    Set<T> otherValues = newDomains.get(otherKey);
                    otherValues.retainAll(valueElements);
                }

                System.out.println(indent+"Domains when "+assignment);
                print(newDomains, indent);

                boolean done = compute(assignment, newDomains,
                    sets0, sets1, indent+" ");
                if (done)
                {
                    return true;
                }
                assignment.remove(key);
            }
        }
        return false;
    }

    private static boolean validCounts(
        Collection<? extends Collection<?>> collections)
    {
        Map<Collection<?>, Integer> counts =
            new LinkedHashMap<Collection<?>, Integer>();
        for (Collection<?> c : collections)
        {
            Integer count = counts.get(c);
            if (count == null)
            {
                count = 0;
            }
            counts.put(c, count+1);
        }
        for (Entry<Collection<?>, Integer> entry : counts.entrySet())
        {
            Collection<?> c = entry.getKey();
            Integer count = entry.getValue();
            if (count > c.size())
            {
                return false;
            }
        }
        return true;
    }

    private static <K, V> Map<K, Set<V>> copy(Map<K, Set<V>> map)
    {
        Map<K, Set<V>> copy = new LinkedHashMap<K, Set<V>>();
        for (Entry<K, Set<V>> entry : map.entrySet())
        {
            K k = entry.getKey();
            Set<V> values = entry.getValue();
            copy.put(k, new LinkedHashSet<V>(values));
        }
        return copy;
    }

    private static <T> Set<Set<T>> computeSetsContainingValue(
        Iterable<? extends Set<T>> sets, T value)
    {
        Set<Set<T>> containing = new LinkedHashSet<Set<T>>();
        for (Set<T> set : sets)
        {
            if (set.contains(value))
            {
                containing.add(set);
            }
        }
        return containing;
    }

    private static <T> void removeFromOthers(
        Map<T, Set<T>> map, T key, T value)
    {
        for (Entry<T, Set<T>> entry : map.entrySet())
        {
            if (!entry.getKey().equals(key))
            {
                Set<T> values = entry.getValue();
                values.remove(value);
            }
        }
    }

    private static <T> Set<T> flatten(
        Iterable<? extends Collection<? extends T>> collections)
    {
        Set<T> set = new LinkedHashSet<T>();
        for (Collection<? extends T> c : collections)
        {
            set.addAll(c);
        }
        return set;
    }

    private static <T> Set<T> applyMapping(
        Set<T> set, Map<T, T> map)
    {
        Set<T> result = new LinkedHashSet<T>();
        for (T e : set)
        {
            result.add(map.get(e));
        }
        return result;
    }

    private static <T> boolean areIsomorphic(
        Set<Set<T>> sets0, Set<Set<T>> sets1, Map<T, T> p)
    {
        for (Set<T> set0 : sets0)
        {
            Set<T> set1 = applyMapping(set0, p);
            if (!sets1.contains(set1))
            {
                return false;
            }
        }
        return true;
    }

    private static void print(Map<?, ?> map, String indent)
    {
        for (Entry<?, ?> entry : map.entrySet())
        {
            System.out.println(indent+entry.getKey()+": "+entry.getValue());
        }
    }
}

I believe your problem is equivalent to the Graph Isomorphism problem (GI). Your set of sets can be modelled as a (bipartite) graph: nodes on the left represent the base values of your system (e.g., 1, 2, 3, ..., 7), while nodes on the right represent the sets (e.g., {1,2,3,4,6} or {2,3,5,6,7}). Draw an edge connecting a node on the left with a node on the right if the number is an element of the set; in my example, 1 is connected only to {1,2,3,4,6}, while 2 is connected to both {1,2,3,4,6} and {2,3,5,6,7}. In general, each number is connected to all sets that contain it, and each set is connected to all numbers it contains.
Any bipartite graph can be realized in this manner. Conversely, GI can be reduced to solving GI on bipartite graphs. (Any graph can be made into a bipartite graph by replacing each edge with two new edges and a new vertex. Isomorphism in the resulting bipartite graphs is equivalent to isomorphism in the original graphs.)
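For illustration, a minimal sketch of this bipartite encoding (the method name and adjacency representation are my own choice). Note that when feeding such graphs to an isomorphism tool, the two sides should be kept distinguishable (e.g., via vertex colorings in nauty), since a number vertex must map to a number vertex and a set vertex to a set vertex:

import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class BipartiteEncoding {

    // Encode a system of sets as a bipartite graph: one vertex per number,
    // one vertex per set, and an edge whenever the number lies in the set.
    public static Map<Object, Set<Object>> encode(Set<Set<Integer>> system) {
        Map<Object, Set<Object>> adjacency = new LinkedHashMap<>();
        for (Set<Integer> set : system) {
            for (Integer number : set) {
                adjacency.computeIfAbsent(number, k -> new LinkedHashSet<>()).add(set);
                adjacency.computeIfAbsent(set, k -> new LinkedHashSet<>()).add(number);
            }
        }
        return adjacency;
    }
}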
GI is in NP, but it is not known whether it is NP-complete. In practice, GI can be solved quickly for graphs with hundreds of vertices, e.g., with NAUTY.

Related

How to find if the distance between destination and source is less than 5?

I am trying to write a method that returns true if the distance between two nodes in a graph is less than 5. I tried to adapt the minimum-distances (BFS) algorithm, as shown:
class Movie { // this is the node in the graph
    String name;
    List<Movie> movies;
}

private static boolean isGoodMovies(Movie origin, Movie destination) {
    Queue<Movie> nextToVisit = new LinkedList<>();
    Set<Movie> visited = new HashSet<>();
    HashMap<Movie, Integer> distances = new HashMap<>();
    nextToVisit.add(origin);
    distances.put(origin, 0);
    while (!nextToVisit.isEmpty()) {
        Movie visitedNode = nextToVisit.remove();
        if (visitedNode.equals(destination)) { break; } // stop once the destination is dequeued
        if (visited.contains(visitedNode)) { continue; } // skip nodes seen before
        visited.add(visitedNode);
        for (Movie movie : visitedNode.movies) {
            nextToVisit.add(movie);
            // only the first (shortest) distance counts in a BFS
            distances.putIfAbsent(movie, distances.get(visitedNode) + 1);
        }
    }
    Integer d = distances.get(destination); // null if the destination is unreachable
    return d != null && d < 5;
}
By modifying the minimum-distances algorithm, I return the boolean based on the computed distance to the destination node. I want to optimize it so that I do not use a HashMap or any collection, just a simple distance variable. Do you think that is possible?
You could use recursion here if the number of movies isn't huge (you can get a StackOverflowError if the number of nested method invocations exceeds the maximum stack depth). That way you don't need any collection except for a HashSet of visited nodes, as shown below:
private static boolean isGoodMovies(Movie origin, Movie destination) {
    Set<Movie> visited = new HashSet<>();
    visited.add(origin);
    return isGoodMovies(origin, destination, visited, 0);
}

private static boolean isGoodMovies(Movie current, Movie destination,
                                    Set<Movie> visited, int depth) {
    if (depth >= 5) {
        return false;
    }
    if (destination.equals(current)) {
        return true;
    }
    boolean isGood = false;
    for (Movie child : current.movies) {
        if (!visited.contains(child)) {
            visited.add(child);
            isGood |= isGoodMovies(child, destination, visited, depth + 1);
        }
    }
    return isGood;
}
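For completeness, a small usage sketch (this Movie wiring is made up for illustration):

// Hypothetical example: A - B - C, so the distance from A to C is 2 (< 5).
Movie a = new Movie(); a.name = "A"; a.movies = new ArrayList<>();
Movie b = new Movie(); b.name = "B"; b.movies = new ArrayList<>();
Movie c = new Movie(); c.name = "C"; c.movies = new ArrayList<>();
a.movies.add(b);
b.movies.add(c);
System.out.println(isGoodMovies(a, c)); // prints true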

Randomised Path on graph - set length, no crossing, no dead ends

I am working on a game with an 8-wide, 5-high grid. I have a 'snake' feature which needs to enter the grid and "walk" around for a set distance (20, for example). There are certain restrictions on the movement of the snake:
It needs to go over the predetermined number of blocks (20)
It cannot go over itself or double back (no dead ends)
Currently I am using a randomised depth-first search, but I have found that it occasionally goes back over itself (crosses its own path), and I am not sure this is the best way to go about it.
Options considered: I have looked at using A*, but am struggling to figure out a good way to do it without a predetermined goal and with the conditions above. I have also considered adding a heuristic to favour blocks that are not on the outside of the grid, but I am not sure either of these will solve the issue at hand.
Any help is appreciated and I can add more detail or code if necessary:
public List<GridNode> RandomizedDepthFirst(int distance, GridNode startNode)
{
    Stack<GridNode> frontier = new Stack<GridNode>();
    frontier.Push(startNode);

    List<GridNode> visited = new List<GridNode>();
    visited.Add(startNode);

    while (frontier.Count > 0 && visited.Count < distance)
    {
        GridNode current = frontier.Pop();
        if (current.nodeState != GridNode.NodeState.VISITED)
        {
            current.nodeState = GridNode.NodeState.VISITED;
            GridNode[] vals = current.FindNeighbours().ToArray();
            List<GridNode> neighbours = new List<GridNode>();
            foreach (GridNode g in vals.OrderBy(x => XMLReader.NextInt(0,0)))
            {
                neighbours.Add(g);
            }
            foreach (GridNode g in neighbours)
            {
                frontier.Push(g);
            }
            if (!visited.Contains(current))
            {
                visited.Add(current);
            }
        }
    }
    return visited;
}
An easy way to account for backtracking is a recursive DFS search.
Consider the example graph constructed in main() below (nodes 0 to 8, where nodes 6, 7 and 8 form a dead-end branch).
Here is a Java implementation of a DFS search, removing nodes from the path when backtracking (note the comments):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Stack;

public class Graph {

    //all graph nodes
    private Node[] nodes;

    public Graph(int numberOfNodes) {
        nodes = new Node[numberOfNodes];
        //construct nodes
        for (int i = 0; i < numberOfNodes; i++) {
            nodes[i] = new Node(i);
        }
    }

    // add edge from a to b
    public Graph addEdge(int from, int to) {
        nodes[from].addNeighbor(nodes[to]);
        //unless unidirectional: if a is connected to b
        //then b should be connected to a
        nodes[to].addNeighbor(nodes[from]);
        return this; //makes it convenient to add multiple edges
    }

    //returns a list of path size of pathLength.
    //if path not found: returns an empty list
    public List<Node> dfs(int pathLength, int startNode) {
        List<Node> path = new ArrayList<>(); //a list to hold all nodes in path
        Stack<Node> frontier = new Stack<>();
        frontier.push(nodes[startNode]);
        dfs(pathLength, frontier, path);
        return path;
    }

    private boolean dfs(int pathLength, Stack<Node> frontier, List<Node> path) {
        if (frontier.size() < 1) {
            return false; //stack is empty, no path found
        }
        Node current = frontier.pop();
        current.setVisited(true);
        path.add(current);
        if (path.size() == pathLength) {
            return true; //path size of pathLength found
        }
        System.out.println("testing node "+ current); //for testing
        Collections.shuffle(current.getNeighbors()); //shuffle list of neighbours
        for (Node node : current.getNeighbors()) {
            if (!node.isVisited()) {
                frontier.push(node);
                if (dfs(pathLength, frontier, path)) { //if solution found
                    return true; //return true. continue otherwise
                }
            }
        }
        //if all neighbours tested and no solution found, current node
        //is not part of the path
        path.remove(current); //remove it
        current.setVisited(false); //this accounts for loops: you may get to this node
                                   //from another edge
        return false;
    }

    public static void main(String[] args) {
        Graph graph = new Graph(9); //make graph
        graph.addEdge(0, 4) //add edges
             .addEdge(0, 1)
             .addEdge(1, 2)
             .addEdge(1, 4)
             .addEdge(4, 3)
             .addEdge(2, 3)
             .addEdge(2, 5)
             .addEdge(3, 5)
             .addEdge(1, 6)
             .addEdge(6, 7)
             .addEdge(7, 8);
        //print path with length of 6, starting with node 1
        System.out.println(graph.dfs(6,1));
    }
}

class Node {

    private int id;
    private boolean isVisited;
    private List<Node> neighbors;

    Node(int id) {
        this.id = id;
        isVisited = false;
        neighbors = new ArrayList<>();
    }

    List<Node> getNeighbors() {
        return neighbors;
    }

    void addNeighbor(Node node) {
        neighbors.add(node);
    }

    boolean isVisited() {
        return isVisited;
    }

    void setVisited(boolean isVisited) {
        this.isVisited = isVisited;
    }

    @Override
    public String toString() { return String.valueOf(id); } //convenience
}
Output:
testing node 1
testing node 6
testing node 7
testing node 8
testing node 2
testing node 5
testing node 3
testing node 4
[1, 2, 5, 3, 4, 0]
Note that nodes 6, 7 and 8, which form a dead end, are tested but not included in the final path.
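Applied to the question's 8-wide, 5-high grid, one could build the grid as a graph and ask for a 20-node path; a rough sketch (the row-major node numbering is my own):

// Hypothetical adaptation: node id = row * 8 + col, edges between 4-neighbours.
Graph grid = new Graph(40);
for (int row = 0; row < 5; row++) {
    for (int col = 0; col < 8; col++) {
        int id = row * 8 + col;
        if (col + 1 < 8) grid.addEdge(id, id + 1); // right neighbour
        if (row + 1 < 5) grid.addEdge(id, id + 8); // neighbour below
    }
}
System.out.println(grid.dfs(20, 0)); // a random 20-cell snake starting top-left

Because visited nodes are never revisited, the resulting path cannot cross itself, and the backtracking removes dead ends.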

Algorithms: how to deal with multiple criteria

Find two rectangles A[i] and A[j] in an array A of n rectangles such that A[i].width > A[j].width and A[i].length - A[j].length is the longest.
Is there a way to reduce the complexity to O(n log n)? I can't find a way to get an O(log n) search for the second rectangle. Sorting doesn't seem to help here, because the two criteria can be completely opposed to each other. Maybe I'm just going about it wrong? Please point me down the right path. Thank you.
Note: this is from a homework assignment, with a different object and 2 criteria instead of 3, but the context is the same.
Since this is homework, here is a high-level answer, with the implementation left as a problem for the OP:
Sort the elements of the array ascending by width. Then scan the array downwards, from the largest width to the smallest, keeping track of the highest length encountered so far among rectangles of strictly larger width; for each rectangle, subtract its length from that highest length. Keep track of the greatest difference encountered so far (and the corresponding i and j). Rectangles of equal width need a little care: only fold a rectangle's length into the running maximum once all rectangles of the same width have been processed. When done you will have the greatest length difference A[i].length-A[j].length where A[i].width > A[j].width.
Analysis: sorting the elements takes O(n log n); all other steps take O(n).
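For illustration, a hedged sketch of that scan (the method name is mine; Rectangle is the class from the code below):

import java.util.Arrays;
import java.util.Comparator;

// Sketch of the O(n log n) scan described above. Equal widths are handled by
// deferring updates of the running maximum until the whole group is scanned.
public static int maxLengthDiff(Rectangle[] rects) {
    Arrays.sort(rects, Comparator.comparingInt(Rectangle::getBreadth));
    int best = Integer.MIN_VALUE;   // best A[i].length - A[j].length found so far
    int maxLen = Integer.MIN_VALUE; // max length among strictly larger widths
    int i = rects.length - 1;
    while (i >= 0) {
        int j = i;
        // compare a group of equal widths against the maximum from larger widths
        while (j >= 0 && rects[j].getBreadth() == rects[i].getBreadth()) {
            if (maxLen != Integer.MIN_VALUE) {
                best = Math.max(best, maxLen - rects[j].getLength());
            }
            j--;
        }
        // now fold this group's lengths into the running maximum
        for (int k = i; k > j; k--) {
            maxLen = Math.max(maxLen, rects[k].getLength());
        }
        i = j;
    }
    return best; // Integer.MIN_VALUE means no valid pair exists
}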
Here is some Java code to achieve the same (note that, as written, it uses a straightforward O(n²) double loop rather than the linear scan described above):
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class RequiredRectangle {

    public static void main(String[] args) {
        // test the method
        int n = 10;
        Rectangle[] input = new Rectangle[n];
        Random r = new Random();
        for (int i = 0; i < n; i++) {
            input[i] = new Rectangle(r.nextInt(100)+1, r.nextInt(100)+1);
        }
        System.out.println("INPUT:: "+Arrays.deepToString(input));
        Rectangle[] output = getMaxLengthDiffAndGreaterBreadthPair(input);
        System.out.println("OUTPUT:: "+Arrays.deepToString(output));
    }

    public static Rectangle[] getMaxLengthDiffAndGreaterBreadthPair(Rectangle[] input) {
        Rectangle[] output = new Rectangle[2];
        Arrays.sort(input, new Comparator<Rectangle>() {
            public int compare(Rectangle rc1, Rectangle rc2) {
                return rc1.getBreadth() - rc2.getBreadth();
            }
        });
        int len = 0;
        Rectangle obj1, obj2;
        for (int i = 0; i < input.length; i++) {
            obj2 = input[i];
            for (int j = i+1; j < input.length; j++) {
                obj1 = input[j];
                int temp = obj1.getLength() - obj2.getLength();
                if (temp > len && obj1.getBreadth() > obj2.getBreadth()) {
                    len = temp;
                    output[0] = obj1;
                    output[1] = obj2;
                }
            }
        }
        return output;
    }
}

class Rectangle {

    private int length;
    private int breadth;

    public Rectangle(int length, int breadth) {
        this.length = length;
        this.breadth = breadth;
    }

    public int getLength() {
        return length;
    }

    public int getBreadth() {
        return breadth;
    }

    @Override
    public boolean equals(Object obj) {
        Rectangle rect = (Rectangle) obj;
        return this.length == rect.length && this.breadth == rect.breadth;
    }

    @Override
    public String toString() {
        return "["+length+","+breadth+"]";
    }
}
Sample Output:
INPUT:: [[8,19], [68,29], [92,14], [1,27], [87,24], [57,42], [45,5], [66,27], [45,28], [29,11]]
OUTPUT:: [[87,24], [8,19]]

Subset sum with positive and negative integers

I have to implement a variation of the subset sum problem. My input will be positive and negative decimals, and I also need to know the subset itself; unfortunately, knowing that one exists is not enough.
I've tried the algorithms found on Wikipedia, but I can't make them work with negative numbers, and I also can't find a way to obtain the subset if it exists.
Could anyone point me to some pseudo-code, documentation, or an implementation of this algorithm?
I wrote the code in Java; it checks all the possibilities:
import java.util.*;

public class StackOverFlow {

    public static <T> Set<Set<T>> powerSet(Set<T> originalSet) {
        Set<Set<T>> sets = new HashSet<Set<T>>();
        if (originalSet.isEmpty()) {
            sets.add(new HashSet<T>());
            return sets;
        }
        List<T> list = new ArrayList<T>(originalSet);
        T head = list.get(0);
        Set<T> rest = new HashSet<T>(list.subList(1, list.size()));
        for (Set<T> set : powerSet(rest)) {
            Set<T> newSet = new HashSet<T>();
            newSet.add(head);
            newSet.addAll(set);
            sets.add(newSet);
            sets.add(set);
        }
        return sets;
    }

    public static int sumSet(Set<Integer> set) {
        int sum = 0;
        for (Integer s : set) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        Set<Integer> mySet = new HashSet<Integer>();
        mySet.add(-1);
        mySet.add(2);
        mySet.add(3);
        int mySum = 4;
        for (Set<Integer> s : powerSet(mySet)) {
            if (mySum == sumSet(s))
                System.out.println(s + " = " + sumSet(s));
        }
    }
}
I hope it helps
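For the record, the standard pseudo-polynomial dynamic program also handles negative numbers if you shift all reachable sums by the sum of the negative values, and it can reconstruct one subset. A hedged sketch (class and method names are mine; decimal inputs can first be scaled by a power of ten to make them integers):

import java.util.ArrayList;
import java.util.List;

public class SubsetSumWithNegatives {

    // Returns one subset of 'nums' summing to 'target', or null if none exists.
    public static List<Integer> find(int[] nums, int target) {
        int negSum = 0, posSum = 0;
        for (int x : nums) {
            if (x < 0) negSum += x; else posSum += x;
        }
        if (target < negSum || target > posSum) return null;
        int width = posSum - negSum + 1; // all reachable sums, shifted by -negSum
        // reach[i][s] == true iff sum (s + negSum) is reachable using nums[0..i-1]
        boolean[][] reach = new boolean[nums.length + 1][width];
        reach[0][-negSum] = true; // the empty subset has sum 0
        for (int i = 0; i < nums.length; i++) {
            for (int s = 0; s < width; s++) {
                if (!reach[i][s]) continue;
                reach[i + 1][s] = true; // skip nums[i]
                int t = s + nums[i];
                if (t >= 0 && t < width) reach[i + 1][t] = true; // take nums[i]
            }
        }
        int s = target - negSum;
        if (!reach[nums.length][s]) return null;
        // Walk backwards to reconstruct which elements were taken.
        List<Integer> subset = new ArrayList<>();
        for (int i = nums.length; i > 0; i--) {
            if (reach[i - 1][s]) continue; // nums[i-1] can be skipped
            subset.add(nums[i - 1]); // otherwise nums[i-1] must have been taken
            s -= nums[i - 1];
        }
        return subset;
    }

    public static void main(String[] args) {
        System.out.println(find(new int[] { -1, 2, 3 }, 4)); // prints [3, 2, -1]
    }
}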

Bloom filter design

I wanted to know where I can find an implementation of the Bloom filter, with some explanation about the choice of the hash functions.
Additionally I have the following questions:
1) The Bloom filter is known to have false positives. Is it possible to reduce them by using two filters, one for used elements and one for non-used elements (assuming the set is finite and known a priori), and comparing the two?
2) Are there other similar algorithms in the CS literature?
My intuition is that you'll get a better reduction in false positives by using the additional space that the anti-filter would have occupied to just expand the positive filter.
As for resources, the papers referenced for March 8 from my course syllabus would be useful.
A Java implementation of the Bloom filter can be found here. In case you cannot view it, the code is pasted below (the original comments, in Chinese, are translated to English).
import java.util.BitSet;

public class BloomFilter
{
    /* Initially allocate 2^25 bits for the BitSet */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; these should generally be primes */
    private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* The hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter()
    {
        for (int i = 0; i < seeds.length; i++)
        {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Mark a string in the bit set
    public void add(String value)
    {
        for (SimpleHash f : func)
        {
            bits.set(f.hash(value), true);
        }
    }

    // Check whether a string has already been marked in the bit set
    public boolean contains(String value)
    {
        if (value == null)
        {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func)
        {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* The hash function class */
    public static class SimpleHash
    {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed)
        {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function: a simple weighted-sum hash
        public int hash(String value)
        {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++)
            {
                result = seed * result + value.charAt(i);
            }
            return (cap - 1) & result;
        }
    }
}
In terms of designing a Bloom filter, the number of hash functions your filter needs can be determined as described in the Wikipedia article about Bloom filters: see the section "Probability of false positives". It explains how the number of hash functions influences the probability of false positives and gives you the formula to determine k from the desired expected probability of false positives.
Quote from the Wikipedia article:
Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is k = (m/n) ln 2.
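In code, these sizing formulas (m = -n·ln(p)/(ln 2)² for the bit count and k = (m/n)·ln 2 for the number of hash functions) look like this; a small helper of my own, for illustration:

// Hypothetical sizing helpers based on the standard Bloom filter formulas.
static int optimalBits(long n, double p) {
    return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

static int optimalHashFunctions(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
// Example: n = 1_000_000 elements, p = 0.01 -> m ≈ 9,585,059 bits and k ≈ 7.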
It's very easy to implement a Bloom filter using Java 8 features: you just need a long[] to store the bits and a few hash functions, which you can represent with ToIntFunction<T>. I made a brief write-up on doing this from scratch.
The part to be careful about is selecting the right bit from the array.
import java.util.List;
import java.util.function.ToIntFunction;

public class BloomFilter<T> {

    private final long[] array;
    private final int size;
    private final List<ToIntFunction<T>> hashFunctions;

    public BloomFilter(long[] array, int logicalSize, List<ToIntFunction<T>> hashFunctions) {
        this.array = array;
        this.size = logicalSize;
        this.hashFunctions = hashFunctions;
    }

    public void add(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            array[hash >>> 6] |= 1L << hash;
        }
    }

    public boolean mightContain(T value) {
        for (ToIntFunction<T> function : hashFunctions) {
            int hash = mapHash(function.applyAsInt(value));
            if ((array[hash >>> 6] & (1L << hash)) == 0) {
                return false;
            }
        }
        return true;
    }

    private int mapHash(int hash) {
        return hash & (size - 1);
    }

    public static <T> Builder<T> builder() {
        return new Builder<>();
    }

    public static class Builder<T> {

        private int size;
        private List<ToIntFunction<T>> hashFunctions;

        public Builder<T> withSize(int size) {
            this.size = size;
            return this;
        }

        public Builder<T> withHashFunctions(List<ToIntFunction<T>> hashFunctions) {
            this.hashFunctions = hashFunctions;
            return this;
        }

        public BloomFilter<T> build() {
            return new BloomFilter<>(new long[size >>> 6], size, hashFunctions);
        }
    }
}
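A usage sketch (the hash functions here are arbitrary examples; note that mapHash masks with size - 1, so the logical size should be a power of two, and at least 64 so that the backing array is non-empty):

// Hypothetical usage of the builder above.
List<ToIntFunction<String>> hashes = Arrays.asList(
        String::hashCode,                 // the built-in hash
        s -> s.hashCode() * 0x9E3779B9    // a cheap, illustrative second hash
);
BloomFilter<String> filter = BloomFilter.<String>builder()
        .withSize(1 << 20) // 2^20 logical bits -> long[16384]
        .withHashFunctions(hashes)
        .build();
filter.add("hello");
System.out.println(filter.mightContain("hello")); // true
System.out.println(filter.mightContain("world")); // false (with high probability)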
I think we should look at the application of Bloom filters, and the secret is in the name: it is a filter, not a data structure. It is mostly used for saving resources by checking whether items are not part of a set. If you wanted to reduce false positives to zero, you would have to insert all the items that are not part of the set, since a well-designed Bloom filter has no false negatives; but such a filter would be gigantic and impractical, and you might as well just store the items in a skip list :) I wrote a simple tutorial on Bloom filters if anyone is interested.
http://techeffigy.wordpress.com/2014/06/05/bloom-filter-tutorial/
