Subset sum with positive and negative integers - algorithm

I have to implement a variation of the subset sum problem. My input will contain positive and negative decimals, and I also need to know the subset itself; unfortunately, knowing that one exists is not enough.
I've tried the algorithms found on Wikipedia, but I can't make them work with negative numbers, and I can't find a way to obtain the subset if it exists.
Could anyone point me to some pseudo-code, documentation, or an implementation of this algorithm?

I wrote the code in Java.
It checks all the possibilities (every subset in the power set), so it takes exponential time:
import java.util.*;
public class StackOverFlow {
public static <T> Set<Set<T>> powerSet(Set<T> originalSet) {
Set<Set<T>> sets = new HashSet<Set<T>>();
if (originalSet.isEmpty()) {
sets.add(new HashSet<T>());
return sets;
}
List<T> list = new ArrayList<T>(originalSet);
T head = list.get(0);
Set<T> rest = new HashSet<T>(list.subList(1, list.size()));
for (Set<T> set : powerSet(rest)) {
Set<T> newSet = new HashSet<T>();
newSet.add(head);
newSet.addAll(set);
sets.add(newSet);
sets.add(set);
}
return sets;
}
public static int sumSet(Set<Integer> set){
int sum =0;
for (Integer s : set) {
sum += s;
}
return sum;
}
public static void main(String[] args) {
Set<Integer> mySet = new HashSet<Integer>();
mySet.add(-1);
mySet.add(2);
mySet.add(3);
int mySum = 4;
for (Set<Integer> s : powerSet(mySet)) {
if(mySum == sumSet(s))
System.out.println(s + " = " + sumSet(s));
}
}
}
I hope it helps
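As an alternative to enumerating the whole power set, here is a minimal sketch that tracks the set of reachable sums and reconstructs one matching subset. It is not the code from the answer above; the class name `SubsetSumSketch` and the method `findSubset` are made up for illustration, and it assumes integer inputs (decimals would first have to be scaled to integers). Negative values need no special handling because sums are stored in a hash map rather than an array. In the worst case the number of distinct sums is still exponential.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubsetSumSketch {
    // Returns the indices of one subset of values that sums to target,
    // or null if no such subset exists. Hypothetical helper for illustration.
    static List<Integer> findSubset(int[] values, long target) {
        // Maps each reachable sum to the index of the value that first produced it.
        // The empty sum 0 is marked with -1.
        Map<Long, Integer> firstIndex = new HashMap<>();
        firstIndex.put(0L, -1);
        for (int i = 0; i < values.length; i++) {
            // Snapshot the known sums so that values[i] is used at most once.
            List<Long> known = new ArrayList<>(firstIndex.keySet());
            for (long sum : known) {
                firstIndex.putIfAbsent(sum + values[i], i);
            }
        }
        if (!firstIndex.containsKey(target)) {
            return null;
        }
        // Walk back: the sum before values[i] was added was reachable
        // using only earlier indices, so the index strictly decreases.
        List<Integer> indices = new ArrayList<>();
        long sum = target;
        for (int i = firstIndex.get(sum); i >= 0; i = firstIndex.get(sum)) {
            indices.add(i);
            sum -= values[i];
        }
        return indices;
    }

    public static void main(String[] args) {
        int[] values = {-1, 2, 3};
        System.out.println(findSubset(values, 4)); // indices of the chosen values
        System.out.println(findSubset(values, 7)); // null: no subset sums to 7
    }
}
```

With the input from the answer above (-1, 2, 3 and target 4), the returned indices select all three values.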

Related

Generating all the elements of a power set

The power set is simply the set of all subsets of a given set.
It includes every subset, including the empty set.
It is well known that the power set has 2^N elements, where N is the number of elements in the original set.
To build the power set, the following approach can be used:
Create a loop that iterates over all integers from 0 to 2^N-1.
Take the binary representation of each integer.
Each binary representation is a string of N bits (for smaller numbers, add leading zeros).
Each bit indicates whether the corresponding set member is included in the current subset.
import java.util.NoSuchElementException;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
public class PowerSet<E> implements Iterator<Set<E>>, Iterable<Set<E>> {
private final E[] ary;
private final int subsets;
private int i;
public PowerSet(Set<E> set) {
ary = (E[])set.toArray();
subsets = (int)Math.pow(2, ary.length) - 1;
}
public Iterator<Set<E>> iterator() {
return this;
}
@Override
public void remove() {
throw new UnsupportedOperationException("Cannot remove()!");
}
@Override
public boolean hasNext() {
return i++ < subsets;
}
@Override
public Set<E> next() {
if (!hasNext()) {
throw new NoSuchElementException();
}
Set<E> subset = new TreeSet<E>();
BitSet bitSet = BitSet.valueOf(new long[] { i });
if (bitSet.cardinality() == 0) {
return subset;
}
for (int e = bitSet.nextSetBit(0); e != -1; e = bitSet.nextSetBit(e + 1)) {
subset.add(ary[e]);
}
return subset;
}
// Unit Test
public static void main(String[] args) {
Set<Integer> numbers = new TreeSet<Integer>();
for (int i = 1; i < 4; i++) {
numbers.add(i);
}
PowerSet<Integer> pSet = new PowerSet<Integer>(numbers);
for (Set<Integer> subset : pSet) {
System.out.println(subset);
}
}
}
The output I am getting is:
[2]
[3]
[2, 3]
java.util.NoSuchElementException
at PowerSet.next(PowerSet.java:47)
at PowerSet.next(PowerSet.java:20)
at PowerSet.main(PowerSet.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at edu.rice.cs.drjava.model.compiler.JavacCompiler.runCommand(JavacCompiler.java:272)
So, the problems are:
I am not getting all the elements (debugging shows me next is called only for even i's).
The exception should not have been thrown.
The problem is in your hasNext(). You have i++ < subsets there, so every call to hasNext() advances i. Since hasNext() is called once from next() and once more by the for-each loop for (Set<Integer> subset : pSet), i is incremented by 2 on each iteration. You can see this since
for (Set<Integer> subset : pSet) {
}
is actually equivalent to:
Iterator<Set<Integer>> it = pSet.iterator();
while (it.hasNext()) {
Set<Integer> subset = it.next();
}
Also note that
if (bitSet.cardinality() == 0) {
return subset;
}
is redundant. Try instead:
@Override
public boolean hasNext() {
return i <= subsets;
}
@Override
public Set<E> next() {
if (!hasNext()) {
throw new NoSuchElementException();
}
Set<E> subset = new TreeSet<E>();
BitSet bitSet = BitSet.valueOf(new long[] { i });
for (int e = bitSet.nextSetBit(0); e != -1; e = bitSet.nextSetBit(e + 1)) {
subset.add(ary[e]);
}
i++;
return subset;
}
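For comparison, the bit-per-element enumeration described in the question can also be written as a plain loop, without an iterator. This is only a sketch: the class name `BitmaskPowerSet` is made up for illustration, and it assumes fewer than 31 elements so that the mask fits in an `int`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BitmaskPowerSet {
    // Builds all 2^n subsets of items; bit k of the mask decides
    // whether items.get(k) is part of the current subset.
    static <T> List<List<T>> powerSet(List<T> items) {
        int n = items.size(); // assumed to be below 31
        List<List<T>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<T> subset = new ArrayList<>();
            for (int bit = 0; bit < n; bit++) {
                if ((mask & (1 << bit)) != 0) {
                    subset.add(items.get(bit));
                }
            }
            result.add(subset);
        }
        return result;
    }

    public static void main(String[] args) {
        for (List<Integer> subset : powerSet(Arrays.asList(1, 2, 3))) {
            System.out.println(subset);
        }
    }
}
```

Because the loop counter is only advanced in one place, the empty set (mask 0) and the full set (mask 2^n - 1) are both produced exactly once.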

algorithms how to deal with multiple criteria

Find 2 rectangles A[i] and A[j] in an array A of n rectangles such that A[i].width > A[j].width and A[i].length - A[j].length is as large as possible.
Is there a way to reduce the complexity to O(n log n)? I can't find a way to get an O(log n) search for the second rectangle. Sorting doesn't seem to help here, because the 2 criteria can be completely opposite to each other. Maybe I'm just going at it wrong? Please direct me to the right path. Thank you.
Note: Homework assignment using different object and using 2 criteria instead of 3, but the context is the same.
Since this is homework, here is a high-level answer, with the implementation left as a problem for the OP:
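Sort the elements of the array in ascending order by width. Then scan the array from the widest element down to the narrowest, subtracting the current length from the highest length encountered so far. Keep track of the greatest difference encountered so far (and the corresponding i and j). Take care with elements of equal width, since the condition requires A[i].width to be strictly greater. When done, you will have the greatest length difference A[i].length - A[j].length where A[i].width > A[j].width.
Analysis: sorting the elements takes O(n log n); all other steps take O(n).
Here is some Java code that finds such a pair (it uses breadth for width). Note that after sorting it compares every pair with nested loops, so it runs in O(n^2) rather than O(n log n):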
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;
public class RequiredRectangle {
public static void main(String[] args) {
// test the method
int n=10;
Rectangle[] input = new Rectangle[n];
Random r = new Random();
for(int i=0;i<n;i++){
input[i] = new Rectangle(r.nextInt(100)+1,r.nextInt(100)+1);
}
System.out.println("INPUT:: "+Arrays.deepToString(input));
Rectangle[] output = getMaxLengthDiffAndGreaterBreadthPair(input);
System.out.println("OUTPUT:: "+Arrays.deepToString(output));
}
public static Rectangle[] getMaxLengthDiffAndGreaterBreadthPair(Rectangle[] input){
Rectangle[] output = new Rectangle[2];
Arrays.sort(input, new Comparator<Rectangle>() {
public int compare(Rectangle rc1,Rectangle rc2){
return rc1.getBreadth()-rc2.getBreadth();
}
});
int len=0;
Rectangle obj1,obj2;
for(int i=0;i<input.length;i++){
obj2=input[i];
for(int j=i+1;j<input.length;j++){
obj1=input[j];
int temp=obj1.getLength() - obj2.getLength();
if(temp>len && obj1.getBreadth() > obj2.getBreadth()){
len=temp;
output[0]=obj1;
output[1]=obj2;
}
}
}
return output;
}
}
class Rectangle{
private int length;
private int breadth;
public int getLength(){
return length;
}
public int getBreadth(){
return breadth;
}
public Rectangle(int length,int breadth){
this.length=length;
this.breadth=breadth;
}
@Override
public boolean equals(Object obj){
Rectangle rect = (Rectangle)obj;
if(this.length==rect.length && this.breadth==rect.breadth)
return true;
return false;
}
@Override
public String toString(){
return "["+length+","+breadth+"]";
}
}
Sample Output:
INPUT:: [[8,19], [68,29], [92,14], [1,27], [87,24], [57,42], [45,5], [66,27], [45,28], [29,11]]
OUTPUT:: [[87,24], [8,19]]
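The sort-and-scan approach described earlier in this thread can be sketched as follows; unlike a comparison of all pairs, it makes a single pass after sorting, so the sort dominates at O(n log n). This is a hedged sketch, not the original poster's solution: the class `MaxLengthDiff`, the method `bestPair`, and the `{width, length}` array layout are assumptions made for illustration. Rectangles of equal width are processed as a group so that the width comparison stays strict.

```java
import java.util.Arrays;
import java.util.Comparator;

class MaxLengthDiff {
    // Each rectangle is {width, length}. Returns {wider, narrower} maximizing
    // wider.length - narrower.length with wider.width > narrower.width,
    // or null if all widths are equal (or fewer than two rectangles).
    static int[][] bestPair(int[][] rects) {
        int[][] sorted = rects.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] r) -> r[0]));
        int[][] best = null;
        long bestDiff = Long.MIN_VALUE;
        int[] longestWider = null; // longest rectangle among strictly wider ones
        int end = sorted.length;
        while (end > 0) {
            // Find the group [start, end) of rectangles sharing one width.
            int start = end - 1;
            while (start > 0 && sorted[start - 1][0] == sorted[end - 1][0]) {
                start--;
            }
            // Compare the group only against strictly wider rectangles.
            if (longestWider != null) {
                for (int k = start; k < end; k++) {
                    long diff = (long) longestWider[1] - sorted[k][1];
                    if (diff > bestDiff) {
                        bestDiff = diff;
                        best = new int[][] {longestWider, sorted[k]};
                    }
                }
            }
            // Only now may the group itself serve as the wider rectangle.
            for (int k = start; k < end; k++) {
                if (longestWider == null || sorted[k][1] > longestWider[1]) {
                    longestWider = sorted[k];
                }
            }
            end = start;
        }
        return best;
    }

    public static void main(String[] args) {
        // The sample input above, rewritten as {width (breadth), length}.
        int[][] input = {{19, 8}, {29, 68}, {14, 92}, {27, 1}, {24, 87},
                         {42, 57}, {5, 45}, {27, 66}, {28, 45}, {11, 29}};
        System.out.println(Arrays.deepToString(bestPair(input)));
    }
}
```

On the sample input it selects the same pair as the nested-loop version: the rectangle with length 87 and breadth 24, and the one with length 8 and breadth 19.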

Big O complexities of my Huffman Algorithm

Can someone please tell me the space and time complexities, in Big O notation, of this Huffman code, with a little explanation? It would be very much appreciated. Please also mention the Big O of each method separately; that would be great. Thanks.
package HuffmanProject;
import java.util.*;
class MyHCode {
public static void main(String[] args) {
String test = "My name is Zaryab Ali";
int[] FreqArray = new int[256];
for (char c : test.toCharArray()) {
FreqArray[c]++;
}
MyHTree tree = ImplementTree(FreqArray);
System.out.println("CHARACTER\tFREQUENCY\tBINARY EQUIVALEENT CODE");
PrintMyHCode(tree, new StringBuffer());
}
public static MyHTree ImplementTree(int[] FreqArray) {
PriorityQueue<MyHTree> trees = new PriorityQueue<MyHTree>();
for (int i = 0; i < FreqArray.length; i++) {
if (FreqArray[i] > 0) {
trees.offer(new MyHLeaf(FreqArray[i], (char) i));
}
}
while (trees.size() > 1) {
MyHTree FChild = trees.poll();
MyHTree SChild = trees.poll();
trees.offer(new MyHNode(FChild, SChild));
}
return trees.poll();
}
public static void PrintMyHCode(MyHTree tree, StringBuffer prefix) {
if (tree instanceof MyHLeaf) {
MyHLeaf leaf = (MyHLeaf) tree;
System.out.println(leaf.CharValue + "\t\t" + leaf.frequency + "\t\t" + prefix);
}
else if (tree instanceof MyHNode) {
MyHNode node = (MyHNode) tree;
prefix.append('0');
PrintMyHCode(node.left, prefix);
prefix.deleteCharAt(prefix.length() - 1);
prefix.append('1');
PrintMyHCode(node.right, prefix);
prefix.deleteCharAt(prefix.length() - 1);
}
}
}
abstract class MyHTree implements Comparable<MyHTree> {
public int frequency;
public MyHTree(int f) {
frequency = f;
}
public int compareTo(MyHTree tree) {
return frequency - tree.frequency;
}
}
class MyHLeaf extends MyHTree {
public char CharValue;
public MyHLeaf(int f, char v) {
super(f);
CharValue = v;
}
}
class MyHNode extends MyHTree {
public MyHTree left, right;
public MyHNode(MyHTree l, MyHTree r) {
super(l.frequency + r.frequency);
left = l;
right = r;
}
}
The PrintMyHCode() method recursively walks the left and right subtrees until it reaches a leaf node. If there are n nodes in the tree, it visits each node once, so the traversal is O(n) (plus the cost of printing each code).
The ImplementTree() method adds the non-zero entries of the array to a priority queue and then repeatedly polls the two lowest-frequency trees and merges them.
If there are n symbols with a non-zero frequency:
1. The for loop performs n offer() calls on the priority queue, each costing O(log n), so it is O(n log n) (plus a scan of the 256-entry array).
2. The while loop runs n-1 times, because each iteration replaces two trees with one. Each iteration performs two poll() calls and one offer() call, each O(log n), so the loop is O(n log n).
Hence, the total time complexity of the ImplementTree() method in Big O notation is O(n log n).
I hope this answer works for you.

Algorithm to find if two sets of sets of numbers are isomorphic or not (under permutation)

Given two systems, each consisting of a set of sets of numbers, I would like to know if they are isomorphic under permutation.
For example
{{1,2,3,4,5},{2,4,5,6,7},{2,3,4,6,7}} is a system of 3 sets of 5 numbers.
{{1,2,3,4,6},{2,3,5,6,7},{2,3,4,8,9}} is another system of 3 sets of 5 numbers. I want to check if these systems are isomorphic.
They are not. The first system uses the numbers {1,2,3,4,5,6,7}, while the second one uses the numbers {1,2,3,4,5,6,7,8,9}.
Here is another example.
{{1,2,3}, {1,2,4}, {3,4,5}} and {{1,2,4}, {1,3,5}, {2,3,5}}. Those two systems of 3 sets of 3 numbers are isomorphic.
If I use permutation (5 3 1 2 4) where 1 becomes 5, 2 becomes 3, etc. The first set becomes {5,3,1}. The second becomes {5,3,2}. The third one becomes {1,2,4}. So the transformed system by this permutation is {{5,3,1},{5,3,2},{1,2,4}} that is equivalently rewritten to {{1,2,4},{1,3,5},{2,3,5}} as I am not interested in order. This is the second system, so the answer is yes.
Currently, on the first example, I apply all 9! permutations of {1,2,3,...,9}
to the first system and check if I can get the second one. It gives me an answer, but very slowly.
Is there a clever algorithm?
(I only want the answer, yes or no. I am not interested in getting a permutation that transforms the first system into the second one.)
As pointed out in the comments, this might correspond to graph-theoretic problems that are still under investigation regarding the complexity and the algorithms that can be employed to tackle them.
However, the complexity always refers to some input size. And here, it is not clear what your input size is. As an example: I think that the most appropriate algorithm might depend on whether you are going to scale up...
the number of numbers (1...9 in your example) or
the number of sets in each set (3, in your example) or
the size of the sets in the sets (5, in your example)
Using your current approach, scaling the number of numbers would not be feasible, because you can't compute all permutations for numbers much larger than 9 due to the exponential running time. But if your intention was to check the isomorphy of sets containing 1000 sets, an algorithm that was polynomial in the number of sets (if such an algorithm existed) might still be slower in practice.
Here, I'd like to sketch an approach that I tried. I did not perform a detailed complexity analysis (which might be pointless if there exists no polynomial-time solution at all, and proving or disproving that can't be the subject of an answer here).
The basic idea is as follows:
Initially, you compute the valid "domains" for each input number. These are possible values that each number may be mapped to, based on the permutation. If the given numbers are 1,2 and 3, then the domains initially could be
1 -> { 1, 2, 3 }
2 -> { 1, 2, 3 }
3 -> { 1, 2, 3 }
But for the given sets, one can already derive some information that allows reducing the domains. For example: Any number that appears n times in the first sets must be mapped to a number that appears n times in the second sets.
Imagine that the given sets are
{{1,2},{1,3}}
{{3,1},{3,2}}
Then the domains would only be
1 -> { 3 }
2 -> { 1, 2 }
3 -> { 1, 2 }
because the 1 appears twice in the first sets, and the only value that appears twice in the second sets is the 3.
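The occurrence-count observation also gives a cheap necessary condition that can be checked before any search: if the sorted lists of occurrence counts differ, no permutation can exist. The following is a minimal sketch of that pre-check only; the class `OccurrenceProfile` and its method names are made up for illustration, and a `true` result does not prove that the systems are isomorphic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OccurrenceProfile {
    // Sorted list of how often each distinct element occurs across all sets.
    static <T> List<Integer> profile(Collection<? extends Collection<T>> sets) {
        Map<T, Integer> counts = new HashMap<>();
        for (Collection<T> set : sets) {
            for (T element : set) {
                counts.merge(element, 1, Integer::sum);
            }
        }
        List<Integer> sortedCounts = new ArrayList<>(counts.values());
        Collections.sort(sortedCounts);
        return sortedCounts;
    }

    // false means "certainly not isomorphic"; true means "not ruled out yet".
    static <T> boolean mayBeIsomorphic(Collection<? extends Collection<T>> a,
                                       Collection<? extends Collection<T>> b) {
        return profile(a).equals(profile(b));
    }

    public static void main(String[] args) {
        List<List<Integer>> first = Arrays.asList(
            Arrays.asList(1, 2, 3), Arrays.asList(1, 2, 4), Arrays.asList(3, 4, 5));
        List<List<Integer>> second = Arrays.asList(
            Arrays.asList(1, 2, 4), Arrays.asList(1, 3, 5), Arrays.asList(2, 3, 5));
        System.out.println(mayBeIsomorphic(first, second)); // true
    }
}
```

For the first example in the question, the two systems use 7 and 9 distinct numbers respectively, so the profiles have different lengths and the check rejects the pair immediately.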
After the initial domains are computed, one can perform a backtracking of the possible assignments (permutations) of the numbers. The backtracking can roughly be done as
for (each number n that has no permutation value assigned) {
assign a permutation value (from the current domain of n) to n
update the domains of all other numbers
if the domains are no longer valid, then backtrack
if the solution was found, then return it
}
(The idea is somehow "inspired" by the Arc Consistency 3 Algorithm, although technically, the problems are not directly related)
During the backtracking, one can employ different pruning criteria. That is, one can think of various tricks in order to quickly check whether a certain assignment (a partial permutation) and the domains that are implied by this assignment are "valid" or not.
The obvious (necessary) criterion for an assignment to be valid is that none of the domains may be empty. More generally: Each domain may not appear more often than the number of elements that it contains. When you find out that the domains are
1 -> { 4 }
2 -> { 2,3 }
3 -> { 2,3 }
4 -> { 2,3 }
then there can no longer be a valid solution, and the algorithm may track back.
Of course, backtracking tends to have exponential complexity in the input size. But it might be that there simply exists no efficient algorithm for this problem. In that case, the pruning that may be employed during the backtracking may at least help to reduce the running time for certain cases (or for small input sizes in general) compared to a brute-force exhaustive search.
Here is an implementation of my experiments, in Java. This is not particularly elegant, but shows that it basically works: It quickly finds a solution if there exists one, and (for the given input sizes) does not take long to detect when there is no solution.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
public class SetSetIsomorphisms
{
public static void main(String[] args)
{
Map<Integer, Integer> p = new LinkedHashMap<Integer, Integer>();
p.put(0, 3);
p.put(1, 4);
p.put(2, 8);
p.put(3, 2);
p.put(4, 1);
p.put(5, 5);
p.put(6, 0);
p.put(7, 9);
p.put(8, 7);
p.put(9, 6);
Set<Set<Integer>> sets0 = new LinkedHashSet<Set<Integer>>();
sets0.add(new LinkedHashSet<Integer>(Arrays.asList(1,2,3,4,5)));
sets0.add(new LinkedHashSet<Integer>(Arrays.asList(2,4,5,6,7)));
sets0.add(new LinkedHashSet<Integer>(Arrays.asList(0,8,3,9,7)));
Set<Set<Integer>> sets1 = new LinkedHashSet<Set<Integer>>();
for (Set<Integer> set0 : sets0)
{
sets1.add(applyMapping(set0, p));
}
// Uncomment these lines for a case where NO permutation is found
//sets1.remove(sets1.iterator().next());
//sets1.add(new LinkedHashSet<Integer>(Arrays.asList(4,8,2,3,5)));
System.out.println("Initially valid? "+
areIsomorphic(sets0, sets1, p));
boolean areIsomorphic = areIsomorphic(sets0, sets1);
System.out.println("Result: "+areIsomorphic);
}
private static <T> boolean areIsomorphic(
Set<Set<T>> sets0, Set<Set<T>> sets1)
{
System.out.println("sets0:");
for (Set<T> set0 : sets0)
{
System.out.println(" "+set0);
}
System.out.println("sets1:");
for (Set<T> set1 : sets1)
{
System.out.println(" "+set1);
}
Set<T> all0 = flatten(sets0);
Set<T> all1 = flatten(sets1);
System.out.println("All elements");
System.out.println(" "+all0);
System.out.println(" "+all1);
if (all0.size() != all1.size())
{
System.out.println("Different number of elements");
return false;
}
Map<T, Set<T>> domains = computeInitialDomains(sets0, sets1);
System.out.println("Domains initially:");
print(domains, "");
Map<T, T> assignment = new LinkedHashMap<T, T>();
return compute(assignment, domains, sets0, sets1, "");
}
private static <T> Map<T, Set<T>> computeInitialDomains(
Set<Set<T>> sets0, Set<Set<T>> sets1)
{
Set<T> all0 = flatten(sets0);
Set<T> all1 = flatten(sets1);
Map<T, Set<T>> domains = new LinkedHashMap<T, Set<T>>();
for (T e0 : all0)
{
Set<T> domain0 = new LinkedHashSet<T>();
for (T e1 : all1)
{
if (isFeasible(e0, sets0, e1, sets1))
{
domain0.add(e1);
}
}
domains.put(e0, domain0);
}
return domains;
}
private static <T> boolean isFeasible(
T e0, Set<Set<T>> sets0,
T e1, Set<Set<T>> sets1)
{
int c0 = countContaining(sets0, e0);
int c1 = countContaining(sets1, e1);
return c0 == c1;
}
private static <T> int countContaining(Set<Set<T>> sets, T value)
{
int count = 0;
for (Set<T> set : sets)
{
if (set.contains(value))
{
count++;
}
}
return count;
}
private static <T> boolean compute(
Map<T, T> assignment, Map<T, Set<T>> domains,
Set<Set<T>> sets0, Set<Set<T>> sets1, String indent)
{
if (!validCounts(domains.values()))
{
System.out.println(indent+"There are too many domains "
+ "with too few elements");
print(domains, indent);
return false;
}
if (assignment.keySet().equals(domains.keySet()))
{
System.out.println(indent+"Found assignment: "+assignment);
return true;
}
List<Entry<T, Set<T>>> entryList =
new ArrayList<Map.Entry<T,Set<T>>>(domains.entrySet());
Collections.sort(entryList, new Comparator<Map.Entry<T,Set<T>>>()
{
@Override
public int compare(Entry<T, Set<T>> e0, Entry<T, Set<T>> e1)
{
return Integer.compare(
e0.getValue().size(),
e1.getValue().size());
}
});
for (Entry<T, Set<T>> entry : entryList)
{
T key = entry.getKey();
if (assignment.containsKey(key))
{
continue;
}
Set<T> domain = entry.getValue();
for (T value : domain)
{
Map<T, Set<T>> newDomains = copy(domains);
removeFromOthers(newDomains, key, value);
assignment.put(key, value);
newDomains.get(key).clear();
newDomains.get(key).add(value);
System.out.println(indent+"Using "+assignment);
Set<Set<T>> setsContainingKey =
computeSetsContainingValue(sets0, key);
Set<Set<T>> setsContainingValue =
computeSetsContainingValue(sets1, value);
Set<T> keyElements = flatten(setsContainingKey);
Set<T> valueElements = flatten(setsContainingValue);
for (T otherKey : keyElements)
{
Set<T> otherValues = newDomains.get(otherKey);
otherValues.retainAll(valueElements);
}
System.out.println(indent+"Domains when "+assignment);
print(newDomains, indent);
boolean done = compute(assignment, newDomains,
sets0, sets1, indent+" ");
if (done)
{
return true;
}
assignment.remove(key);
}
}
return false;
}
private static boolean validCounts(
Collection<? extends Collection<?>> collections)
{
Map<Collection<?>, Integer> counts =
new LinkedHashMap<Collection<?>, Integer>();
for (Collection<?> c : collections)
{
Integer count = counts.get(c);
if (count == null)
{
count = 0;
}
counts.put(c, count+1);
}
for (Entry<Collection<?>, Integer> entry : counts.entrySet())
{
Collection<?> c = entry.getKey();
Integer count = entry.getValue();
if (count > c.size())
{
return false;
}
}
return true;
}
private static <K, V> Map<K, Set<V>> copy(Map<K, Set<V>> map)
{
Map<K, Set<V>> copy = new LinkedHashMap<K, Set<V>>();
for (Entry<K, Set<V>> entry : map.entrySet())
{
K k = entry.getKey();
Set<V> values = entry.getValue();
copy.put(k, new LinkedHashSet<V>(values));
}
return copy;
}
private static <T> Set<Set<T>> computeSetsContainingValue(
Iterable<? extends Set<T>> sets, T value)
{
Set<Set<T>> containing = new LinkedHashSet<Set<T>>();
for (Set<T> set : sets)
{
if (set.contains(value))
{
containing.add(set);
}
}
return containing;
}
private static <T> void removeFromOthers(
Map<T, Set<T>> map, T key, T value)
{
for (Entry<T, Set<T>> entry : map.entrySet())
{
if (!entry.getKey().equals(key))
{
Set<T> values = entry.getValue();
values.remove(value);
}
}
}
private static <T> Set<T> flatten(
Iterable<? extends Collection<? extends T>> collections)
{
Set<T> set = new LinkedHashSet<T>();
for (Collection<? extends T> c : collections)
{
set.addAll(c);
}
return set;
}
private static <T> Set<T> applyMapping(
Set<T> set, Map<T, T> map)
{
Set<T> result = new LinkedHashSet<T>();
for (T e : set)
{
result.add(map.get(e));
}
return result;
}
private static <T> boolean areIsomorphic(
Set<Set<T>> sets0, Set<Set<T>> sets1, Map<T, T> p)
{
for (Set<T> set0 : sets0)
{
Set<T> set1 = applyMapping(set0, p);
if (!sets1.contains(set1))
{
return false;
}
}
return true;
}
private static void print(Map<?, ?> map, String indent)
{
for (Entry<?, ?> entry : map.entrySet())
{
System.out.println(indent+entry.getKey()+": "+entry.getValue());
}
}
}
I believe your problem is equivalent to the Graph Isomorphism problem (GI). Your set of sets can be modelled as a (bipartite) graph, with nodes on the left representing the base values of your sets (e.g., 1, 2, 3, ... 7), while nodes on the right represent the sets themselves (e.g., {1,2,3,4,6} or {2,3,5,6,7}). Draw an edge connecting a node on the left with a node on the right if the number is an element of the set; in my example, 1 is connected only to {1,2,3,4,6} while 2 is connected to both {1,2,3,4,6} and {2,3,5,6,7}. 1 is connected to all sets which contain it; {1,2,3,4,6} is connected to all numbers contained in it.
Any bipartite graph can be realized in this manner. Conversely, GI can be reduced to solving GI on bipartite graphs. (Any graph can be made into a bipartite graph by replacing each edge with two new edges and a new vertex. Isomorphism in the resulting bipartite graphs is equivalent to isomorphism in the original graphs.)
GI is in NP, but it is not known whether it is NP-complete. In practice, GI can be solved quickly for graphs with hundreds of vertices using tools such as nauty.
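The bipartite construction described in this answer can be sketched as an adjacency map. This only builds the graph; it does not test isomorphism, which would be left to a dedicated tool. The class `IncidenceGraph` and the node labels `num:` and `set#` are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IncidenceGraph {
    // Builds an undirected bipartite graph: one "num:<value>" node per number,
    // one "set#<index>" node per set, and an edge whenever the set contains the number.
    static Map<String, Set<String>> build(List<Set<Integer>> system) {
        Map<String, Set<String>> adjacency = new TreeMap<>();
        for (int i = 0; i < system.size(); i++) {
            String setNode = "set#" + i;
            adjacency.computeIfAbsent(setNode, k -> new TreeSet<>());
            for (int number : system.get(i)) {
                String numberNode = "num:" + number;
                adjacency.get(setNode).add(numberNode);
                adjacency.computeIfAbsent(numberNode, k -> new TreeSet<>()).add(setNode);
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        List<Set<Integer>> system = Arrays.asList(
            new LinkedHashSet<>(Arrays.asList(1, 2, 3, 4, 6)),
            new LinkedHashSet<>(Arrays.asList(2, 3, 5, 6, 7)));
        Map<String, Set<String>> graph = build(system);
        System.out.println(graph.get("num:1")); // [set#0]
        System.out.println(graph.get("num:2")); // [set#0, set#1]
    }
}
```

In the two-set example, the node for 1 has a single neighbour and the node for 2 has two, matching the description above.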

Lossless hierarchical run length encoding

I want to summarize, rather than compress, in a manner similar to run-length encoding, but in a nested sense.
For instance, I want ABCBCABCBCDEEF to become (2A(2BC))D(2E)F.
I am not concerned about which option is picked between two equally good nestings. E.g.
ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which have the same compressed length despite having different structures.
However, I do want the choice to be the MOST greedy one. For instance:
ABCDABCDCDCDCD would pick (2ABCD)(3CD), of length six in original symbols, which is less than ABCDAB(4CD), which is of length 8 in original symbols.
In terms of background, I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying: symbol A times 3 occurrences, followed by symbols XYZ for 20 occurrences, etc. This can then be displayed visually in a nested sense.
Ideas are welcome.
I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB. You would have to decide for yourself whether single-symbol sequences like that should be encoded as a repeated symbol or not, or whether a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
1. For each position in the sequence, try to find the longest match (actually, it doesn't; it takes the first 2+ match it finds. I left the rest as an exercise for you, since I have to leave my computer for a few hours now).
2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object.
3. If it can't find a repeating sequence, it spits out the single symbol at that location.
4. It then skips what it encoded, and continues from #1.
Anyway, here's the code:
void Main()
{
string[] examples = new[]
{
"ABCBCABCBCDEEF",
"ABBABBABBABA",
"ABCDABCDCDCDCD",
};
foreach (string example in examples)
{
StringBuilder sb = new StringBuilder();
foreach (var r in Encode(example))
sb.Append(r.ToString());
Debug.WriteLine(example + " = " + sb.ToString());
}
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
return Encode<T>(values, EqualityComparer<T>.Default);
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
List<T> sequence = new List<T>(values);
int index = 0;
while (index < sequence.Count)
{
var bestSequence = FindBestSequence<T>(sequence, index, comparer);
if (bestSequence == null || bestSequence.Length < 1)
throw new InvalidOperationException("Unable to find sequence at position " + index);
yield return bestSequence;
index += bestSequence.Length;
}
}
private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
int sequenceLength = 1;
while (startIndex + sequenceLength * 2 <= sequence.Count)
{
if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
{
bool atLeast2Repeats = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
{
atLeast2Repeats = false;
break;
}
}
if (atLeast2Repeats)
{
int count = 2;
while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
{
bool anotherRepeat = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
{
anotherRepeat = false;
break;
}
}
if (anotherRepeat)
count++;
else
break;
}
List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
return new SequenceRepeat<T>(count, repeatedSequence);
}
}
sequenceLength++;
}
// fall back, we could not find anything that repeated at all
return new SingleSymbol<T>(sequence[startIndex]);
}
public abstract class Repeat<T>
{
public int Count { get; private set; }
protected Repeat(int count)
{
Count = count;
}
public abstract int Length
{
get;
}
}
public class SingleSymbol<T> : Repeat<T>
{
public T Value { get; private set; }
public SingleSymbol(T value)
: base(1)
{
Value = value;
}
public override string ToString()
{
return string.Format("{0}", Value);
}
public override int Length
{
get
{
return Count;
}
}
}
public class SequenceRepeat<T> : Repeat<T>
{
public Repeat<T>[] Values { get; private set; }
public SequenceRepeat(int count, Repeat<T>[] values)
: base(count)
{
Values = values;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
}
public override int Length
{
get
{
int oneLength = 0;
foreach (var value in Values)
oneLength += value.Length;
return Count * oneLength;
}
}
}
public class GroupRepeat<T> : Repeat<T>
{
public Repeat<T> Group { get; private set; }
public GroupRepeat(int count, Repeat<T> group)
: base(count)
{
Group = group;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, Group);
}
public override int Length
{
get
{
return Count * Group.Length;
}
}
}
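The same first-match strategy can be sketched much more compactly for plain strings. The following Java version is an illustrative port of the idea, not a translation of the classes above: the class `NestedRunLength` and the method `encode` are made-up names, and it assumes the input contains no digits or parentheses, since those would make the output ambiguous. Like the code above, it takes the shortest unit that repeats at least twice rather than searching for the best one.

```java
class NestedRunLength {
    // Encodes s by repeatedly taking, at the current position, the shortest
    // unit that is immediately repeated, and encoding that unit recursively.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int unit = 0;
            int count = 1;
            for (int len = 1; i + 2 * len <= s.length(); len++) {
                if (s.regionMatches(i, s, i + len, len)) {
                    unit = len;
                    count = 2;
                    // Extend the run while another full copy of the unit follows.
                    while (i + (count + 1) * len <= s.length()
                            && s.regionMatches(i, s, i + count * len, len)) {
                        count++;
                    }
                    break;
                }
            }
            if (unit == 0) {
                // Nothing repeats here: emit the single symbol.
                out.append(s.charAt(i));
                i++;
            } else {
                out.append('(').append(count)
                   .append(encode(s.substring(i, i + unit)))
                   .append(')');
                i += unit * count;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String example : new String[] {"ABCBCABCBCDEEF", "ABBABBABBABA", "ABCDABCDCDCDCD"}) {
            System.out.println(example + " = " + encode(example));
        }
    }
}
```

For the three example strings it produces the same encodings as listed at the top of this answer, including A(2B) for the inner ABB.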
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context free grammar which generates (only) the string, except in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABABCD
s->ABtt
t->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The problem of the smallest grammar is known to be hard, and is a well-studied problem. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.

Resources