Longest string that can be made out of a list of strings - algorithm

I'm looking for an efficient algorithm that gives me the longest string that can be made out of a list of strings. More precisely:
Given a file containing a large number of strings, find the longest string in the file that is a concatenation of one or more of the other strings.
Note: The answer string must itself belong to the list of strings in the file.
Example input:
The
he
There
After
ThereAfter
Output:
ThereAfter

Sort the list by descending length, so that the longest string comes first. A comparison sort such as quicksort does this in O(n log n) comparisons on average.
Then iterate over the strings in the list, starting from the left.
For each string S, iterate over the elements s to its right. If s is a substring of S, remove s from S. Continue to the right until S is empty, which means that S is made of items of the list.
public static class ListCompare implements Comparator<String> {
public int compare(String s1, String s2) {
if (s1.length() < s2.length())
return 1;
else if (s1.length() > s2.length())
return -1;
else
return 0;
}
}
public static String longestSurString(String[] ss) {
Arrays.sort(ss, new ListCompare ());
for (String S: ss) {
String b = S; // keep the original value; S itself is consumed below
for (String s: ss) {
if (!s.equals(b) && S.contains(s)) {
S = S.replace(s, "");
}
}
if (S.length() == 0)
return b;
}
return null;
}

Let us number the strings S1, S2, ..., Sn.
If I understand the problem statement correctly, then for Si to be a candidate for an answer, it must be equal to the concatenation of some
Sj_1, Sj_2, ..., Sj_k where forall x in 1..k: i != j_x. That is, Si must be the concatenation of strings taken from a subset of the list that Si is not a member of.
Given that, add all the strings to a trie. This will find all the prefix pairs, that is all (Si, Sj_1) from the above. Removing the Sj_1 prefix from Si renders a new string T, that must either be equal to Sj_k, or can be similarly reduced by searching for Sj_2 in the trie.
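The prefix-stripping idea above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the answer's own code: a hash set stands in for the trie to keep it short, and the names `longest_compound`, `splits` and `is_compound` are made up for illustration. A word counts as a compound only when it splits into at least two listed words.

```python
from functools import lru_cache


def longest_compound(words):
    """Return the longest word built from two or more listed words, or None."""
    word_set = set(words)

    @lru_cache(maxsize=None)
    def splits(text):
        # True when `text` is a concatenation of one or more listed words.
        if text in word_set:
            return True
        return any(
            text[:i] in word_set and splits(text[i:])
            for i in range(1, len(text))
        )

    def is_compound(word):
        # The first piece is a proper prefix, so a word never matches itself.
        return any(
            word[:i] in word_set and splits(word[i:])
            for i in range(1, len(word))
        )

    # Try the longest candidates first and stop at the first hit.
    for word in sorted(word_set, key=len, reverse=True):
        if is_compound(word):
            return word
    return None


print(longest_compound(["The", "he", "There", "After", "ThereAfter"]))  # ThereAfter
```

Memoizing `splits` means each distinct suffix is examined only once, which is what keeps the repeated prefix removal from blowing up on inputs with many overlapping words.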

Best solution, using Trie and HashMap data structures:
First store all the strings in the trie. Then, taking the longer words first, check whether each word can be split into prefixes that are themselves stored words.
import java.util.HashMap;
public class TrieNode {
private HashMap<Character, TrieNode> children;
private boolean isWord;
public TrieNode() {
children = new HashMap<>();
isWord = false;
}
public void add(String s) {
HashMap<Character,TrieNode> temp_child = children;
for(int i=0; i<s.length(); i++){
char c = s.charAt(i);
TrieNode trieNode;
if(temp_child.containsKey(c)){
trieNode = temp_child.get(c);
}else{
trieNode = new TrieNode();
temp_child.put(c, trieNode);
}
temp_child = trieNode.children;
//set leaf node
if(i==s.length()-1)
trieNode.isWord = true;
}
}
public boolean isWord(String s) {
TrieNode trieNode = searchNode(s);
if(trieNode != null && trieNode.isWord)
return true;
else
return false;
}
public TrieNode searchNode(String s){
HashMap<Character, TrieNode> temp_child = children;
TrieNode trieNode = null;
for(int i=0; i<s.length(); i++){
char c = s.charAt(i);
if(temp_child.containsKey(c)){
trieNode = temp_child.get(c);
temp_child = trieNode.children;
}else{
return null;
}
}
return trieNode;
}
}
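The class above only stores and looks up words; the step that actually finds the answer is not shown. One way it might look, sketched in Python with a dictionary-based trie, is given below. The names `build_trie`, `can_build` and `longest_compound_word` are hypothetical, and the search has no memoization, so it is a sketch for moderate inputs rather than a tuned implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_word = False   # True when a stored word ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def can_build(root, word, start, pieces):
    """True if word[start:] splits into stored words, with >= 2 pieces overall."""
    if start == len(word):
        return pieces >= 2
    node = root
    for i in range(start, len(word)):
        node = node.children.get(word[i])
        if node is None:
            return False       # no stored word continues along this path
        # A stored word ends here: try to build the remainder from the root.
        if node.is_word and can_build(root, word, i + 1, pieces + 1):
            return True
    return False


def longest_compound_word(words):
    root = build_trie(words)
    for word in sorted(words, key=len, reverse=True):
        if can_build(root, word, 0, 0):
            return word
    return None


print(longest_compound_word(["The", "he", "There", "After", "ThereAfter"]))  # ThereAfter
```

Walking the trie once per starting position yields every stored word that is a prefix of the remaining text, which is the "check prefix" step the answer describes.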

Related

BASH brace expansion algorithm

I am stuck on this algorithmic question:
Design an algorithm that parses an expression like this:
"((a,b,cy)n,m)" should give:
an - bn - cyn - m
The expression can nest, therefore:
"((a,b)o(m,n)p,b)" parses to:
aomp - aonp - bomp - bonp - b.
I thought of using stacks, but it is too complicated.
thanks.
You can parse it with a Recursive Descent Parser.
Let's say the comma separated strings are components, so for an expression ((a, b, cy)n, m), (a, b, cy)n and m are two components. a, b and cy are also components. So this is a recursive definition.
For a component (a, b, cy)n, let's say (a, b, cy) and n are two component parts of the component. Component parts will later be combined to produce final result (i.e., an - bn - cyn).
Let's say an expression is comma separated components, for example, (a, cy)n, m is an expression. It has two components (a, cy)n and m, and the component (a, cy)n has two component parts (a, cy) and n, and component part (a, cy) is a brace expression containing a nested expression: a, cy, which also has two components a and cy.
With these definitions (you might use other terms), we can write down the grammar for your expression:
expression = component, component, ...
component = component_part component_part ...
component_part = letters | (expression)
One line is one grammar rule. The first line means an expression is a list of comma separated components. The second line means a component can be constructed with one or more component parts. The third line means a component part can be either a continuous sequence of letters or a nested expression inside a pair of braces.
Then you can use a Recursive Descent Parser to solve your problem with the above grammar.
We will define one method/function for each grammar rule. So basically we will have three methods ParseExpression, ParseComponent, ParseComponentPart.
Algorithm
As I stated above, an expression is a list of comma separated components, so our ParseExpression method simply calls ParseComponent and then checks whether the next char is a comma, like this (I'm using C#; I think you can easily convert it to other languages):
private List<string> ParseExpression()
{
var result = new List<string>();
while (!Eof())
{
// Parsing a component will produce a list of strings,
// they are added to the final string list
var items = ParseComponent();
result.AddRange(items);
// If next char is ',' simply skip it and parse next component
if (Peek() == ',')
{
// Skip comma
ReadNextChar();
}
else
{
break;
}
}
return result;
}
You can see that, when we are parsing an expression, we recursively call ParseComponent (it will then recursively call ParseComponentPart). It's a top-down approach, that's why it's called Recursive Descent Parsing.
ParseComponent is similar, like this:
private List<string> ParseComponent()
{
List<string> leftItems = null;
while (!Eof())
{
// Parse a component part will produce a list of strings (rightItems)
// We need to combine already parsed string list (leftItems) in this component
// with the newly parsed 'rightItems'
var rightItems = ParseComponentPart();
if (rightItems == null)
{
// No more parts, return current result (leftItems) to the caller
break;
}
if (leftItems == null)
{
leftItems = rightItems;
}
else
{
leftItems = Combine(leftItems, rightItems);
}
}
return leftItems;
}
The Combine method simply combines two string lists:
// Combine two lists of strings and return the combined string list
private List<string> Combine(List<string> leftItems, List<string> rightItems)
{
var result = new List<string>();
foreach (var leftItem in leftItems)
{
foreach (var rightItem in rightItems)
{
result.Add(leftItem + rightItem);
}
}
return result;
}
Next is ParseComponentPart:
private List<string> ParseComponentPart()
{
var nextChar = Peek();
if (nextChar == '(')
{
// Skip '('
ReadNextChar();
// Recursively parse the inner expression
var items = ParseExpression();
// Skip ')'
ReadNextChar();
return items;
}
else if (char.IsLetter(nextChar))
{
var letters = ReadLetters();
return new List<string> { letters };
}
else
{
// Fail to parse a part, it means a component is ended
return null;
}
}
Full Source Code (C#)
The other parts are mostly helper methods, full C# source code is listed below:
using System;
using System.Collections.Generic;
using System.Text;
namespace Examples
{
public class BashBraceParser
{
private string _expression;
private int _nextCharIndex;
/// <summary>
/// Parse the specified BASH brace expression and return the result string list.
/// </summary>
public IList<string> Parse(string expression)
{
_expression = expression;
_nextCharIndex = 0;
return ParseExpression();
}
private List<string> ParseExpression()
{
// ** This part is already posted above **
}
private List<string> ParseComponent()
{
// ** This part is already posted above **
}
private List<string> ParseComponentPart()
{
// ** This part is already posted above **
}
// Combine two lists of strings and return the combined string list
private List<string> Combine(List<string> leftItems, List<string> rightItems)
{
// ** This part is already posted above **
}
// Peek next char without moving the cursor
private char Peek()
{
if (Eof())
{
return '\0';
}
return _expression[_nextCharIndex];
}
// Read next char and move the cursor to next char
private char ReadNextChar()
{
return _expression[_nextCharIndex++];
}
private void UnreadChar()
{
_nextCharIndex--;
}
// Check if the whole expression string is scanned.
private bool Eof()
{
return _nextCharIndex == _expression.Length;
}
// Read a continuous sequence of letters.
private string ReadLetters()
{
if (!char.IsLetter(Peek()))
{
return null;
}
var str = new StringBuilder();
while (!Eof())
{
var ch = ReadNextChar();
if (char.IsLetter(ch))
{
str.Append(ch);
}
else
{
UnreadChar();
break;
}
}
return str.ToString();
}
}
}
Use The Code
var parser = new BashBraceParser();
var result = parser.Parse("((a,b)o(m,n)p,b)");
var output = String.Join(" - ", result);
// Result: aomp - aonp - bomp - bonp - b
Console.WriteLine(output);
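For comparison, the same three grammar rules can be sketched compactly in Python. This is an illustrative port, not part of the original answer: the function name `expand` and the error messages are assumptions, and unlike the C# version it treats an empty component as the empty string.

```python
def expand(expression):
    """Expand a nested expression such as '((a,b)o(m,n)p,b)' into a list of strings."""
    pos = 0

    def peek():
        return expression[pos] if pos < len(expression) else ""

    def parse_expression():
        # expression = component, component, ...
        nonlocal pos
        items = parse_component()
        while peek() == ",":
            pos += 1
            items += parse_component()
        return items

    def parse_component():
        # component = component_part component_part ...
        items = [""]
        while True:
            part = parse_component_part()
            if part is None:
                return items
            # Cross product: every string so far followed by every new alternative.
            items = [left + right for left in items for right in part]

    def parse_component_part():
        # component_part = letters | (expression)
        nonlocal pos
        if peek() == "(":
            pos += 1
            items = parse_expression()
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            return items
        start = pos
        while peek().isalpha():
            pos += 1
        return [expression[start:pos]] if pos > start else None

    result = parse_expression()
    if pos != len(expression):
        raise ValueError("unexpected character at position %d" % pos)
    return result


print(" - ".join(expand("((a,b)o(m,n)p,b)")))  # aomp - aonp - bomp - bonp - b
```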
import java.util.ArrayList;
public class BASHBraceExpansion {
public static ArrayList<StringBuilder> parse_bash(String expression, WrapperInt p) {
ArrayList<StringBuilder> elements = new ArrayList<StringBuilder>();
ArrayList<StringBuilder> result = new ArrayList<StringBuilder>();
elements.add(new StringBuilder(""));
while(p.index < expression.length())
{
if (expression.charAt(p.index) == '(')
{
p.advance();
ArrayList<StringBuilder> temp = parse_bash(expression, p);
ArrayList<StringBuilder> newElements = new ArrayList<StringBuilder>();
for(StringBuilder e : elements)
{
for(StringBuilder t : temp)
{
StringBuilder s = new StringBuilder(e);
newElements.add(s.append(t));
}
}
System.out.println("elements :");
elements = newElements;
}
else if (expression.charAt(p.index) == ',')
{
result.addAll(elements);
elements.clear();
elements.add(new StringBuilder(""));
p.advance();
}
else if (expression.charAt(p.index) == ')')
{
p.advance();
result.addAll(elements);
return result;
}
else
{
for(StringBuilder sb : elements)
{
sb.append(expression.charAt(p.index));
}
p.advance();
}
}
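// Flush the last component so that input without outer parentheses (e.g. "a,b") is not lost
result.addAll(elements);
return result;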
}
public static void print(ArrayList<StringBuilder> list)
{
for(StringBuilder s : list)
{
System.out.print(s + " * ");
}
System.out.println();
}
public static void main(String[] args) {
WrapperInt p = new WrapperInt();
ArrayList<StringBuilder> list = parse_bash("((a,b)o(m,n)p,b)", p);
//ArrayList<StringBuilder> list = parse_bash("(a,b)", p);
WrapperInt q = new WrapperInt();
ArrayList<StringBuilder> list1 = parse_bash("((a,b,cy)n,m)", q);
ArrayList<StringBuilder> list2 = parse_bash("((a,b)dr(f,g)(k,m),L(p,q))", new WrapperInt());
System.out.println("*****RESULT : ******");
print(list);
print(list1);
print(list2);
}
}
public class WrapperInt {
public WrapperInt() {
index = 0;
}
public int advance()
{
index ++;
return index;
}
public int index;
}
// aomp - aonp - bomp - bonp - b.

Subset sum with positive and negative integers

I have to implement a variation of the subset sum problem. My input will contain positive and negative decimals, and I also need to know the subset itself; unfortunately, knowing that one exists is not enough.
I've tried the algorithms found on Wikipedia, but I can't make them work with negative numbers, and I can't find a way to obtain the subset if it exists.
Could anyone point me to some pseudo-code, documentation or an implementation of this algorithm?
I wrote the code in Java.
It checks all the possibilities by enumerating the power set, so the running time is exponential in the number of inputs.
import java.util.*;
public class StackOverFlow {
public static <T> Set<Set<T>> powerSet(Set<T> originalSet) {
Set<Set<T>> sets = new HashSet<Set<T>>();
if (originalSet.isEmpty()) {
sets.add(new HashSet<T>());
return sets;
}
List<T> list = new ArrayList<T>(originalSet);
T head = list.get(0);
Set<T> rest = new HashSet<T>(list.subList(1, list.size()));
for (Set<T> set : powerSet(rest)) {
Set<T> newSet = new HashSet<T>();
newSet.add(head);
newSet.addAll(set);
sets.add(newSet);
sets.add(set);
}
return sets;
}
public static int sumSet(Set<Integer> set){
int sum =0;
for (Integer s : set) {
sum += s;
}
return sum;
}
public static void main(String[] args) {
Set<Integer> mySet = new HashSet<Integer>();
mySet.add(-1);
mySet.add(2);
mySet.add(3);
int mySum = 4;
for (Set<Integer> s : powerSet(mySet)) {
if(mySum == sumSet(s))
System.out.println(s + " = " + sumSet(s));
}
}
}
I hope it helps
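A less exhaustive alternative is to track only the distinct sums that can be reached, remembering one subset for each. The Python sketch below is an assumption-laden illustration rather than a reference implementation: `find_subset` is a made-up name, inputs are converted through `Decimal(str(...))` so that decimal values compare exactly, and only non-empty subsets are considered. Negative numbers need no special handling because sums are dictionary keys, not array indices.

```python
from decimal import Decimal


def find_subset(numbers, target):
    """Return one non-empty subset (as a list) that sums to target, or None."""
    target = Decimal(str(target))
    reachable = {}   # reachable[s] = one list of input values whose sum is s
    for raw in numbers:
        value = Decimal(str(raw))
        # Collect new sums separately so the current number is used at most once per subset.
        additions = {value: [raw]}
        for total, subset in reachable.items():
            additions.setdefault(total + value, subset + [raw])
        for total, subset in additions.items():
            reachable.setdefault(total, subset)
        if target in reachable:
            return reachable[target]
    return None


print(find_subset([-1, 2, 3], 4))  # [-1, 2, 3]
```

The number of distinct sums can still grow exponentially in the worst case, but when many subsets share the same sum this does far less work than enumerating the whole power set.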

Is it possible to design a tree where nodes have infinitely many children?

How can I design a tree with lots of (an infinite number of) branches?
Which data structure should we use to store the child nodes?
You can't actually store infinitely many children, since that won't fit into memory. However, you can store unboundedly many children - that is, you can make trees where each node can have any number of children with no fixed upper bound.
There are a few standard ways to do this. You could have each tree node store a list of all of its children (perhaps as a dynamic array or a linked list), which is often done with tries. For example, in C++, you might have something like this:
struct Node {
/* ... Data for the node goes here ... */
std::vector<Node*> children;
};
Alternatively, you could use the left-child/right-sibling representation, which represents a multiway tree as a binary tree. This is often used in priority queues like binomial heaps. For example:
struct Node {
/* ... data for the node ... */
Node* firstChild;
Node* nextSibling;
};
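To show how the left-child/right-sibling layout is used in practice, here is a small Python sketch. The method names `add_child` and `children` are illustrative assumptions; the point is that two references per node are enough to hold any number of children.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.first_child = None    # head of this node's child list
        self.next_sibling = None   # next node in the parent's child list

    def add_child(self, child):
        # Prepend in O(1); the newest child becomes the first one.
        child.next_sibling = self.first_child
        self.first_child = child
        return child

    def children(self):
        # Follow the sibling chain to visit every child.
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


root = Node("root")
for name in "abc":
    root.add_child(Node(name))
print([child.data for child in root.children()])  # ['c', 'b', 'a']
```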
Hope this helps!
Yes! You can create a structure where children are materialized on demand (i.e. "lazy children"). In this case, the number of children can easily be functionally infinite.
Haskell is great for creating "functionally infinite" data structures, but since I don't know a whit of Haskell, here's a Python example instead:
class InfiniteTreeNode:
    ''' abstract base class for a tree node that has effectively infinite children '''
    def __init__(self, data):
        self.data = data
    def getChild(self, n):
        raise NotImplementedError
class PrimeSumNode(InfiniteTreeNode):
    def getChild(self, n):
        prime = getNthPrime(n) # hypothetical function to get the nth prime number (counting from 0)
        return PrimeSumNode(self.data + prime)
prime_root = PrimeSumNode(0)
print(prime_root.getChild(3).getChild(4).data) # would print 18: the 4th prime is 7 and the 5th prime is 11
Now, if you were to do a search of PrimeSumNode down to a depth of 2, you could find all the numbers that are sums of two primes (and if you can prove that this contains all even integers, you can win a big mathematical prize!).
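A self-contained variant of the lazy-children idea, with the prime lookup filled in, might look like the sketch below. The helper `nth_prime` uses simple trial division and, like the `children` generator, is an assumption added for illustration; indices count from zero.

```python
from itertools import count, islice


def nth_prime(n):
    """Return the n-th prime, counting from zero (nth_prime(0) == 2)."""
    def primes():
        found = []
        for candidate in count(2):
            # A candidate is prime if no smaller prime divides it.
            if all(candidate % p for p in found):
                found.append(candidate)
                yield candidate
    return next(islice(primes(), n, None))


class PrimeSumNode:
    def __init__(self, data):
        self.data = data

    def get_child(self, n):
        # The child is created only when asked for; no child list is ever stored.
        return PrimeSumNode(self.data + nth_prime(n))

    def children(self):
        # An endless generator over all children; consume it with islice or a break.
        return (self.get_child(n) for n in count())


root = PrimeSumNode(0)
print(root.get_child(3).get_child(4).data)            # 18
print([c.data for c in islice(root.children(), 5)])   # [2, 3, 5, 7, 11]
```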
Something like this
class Node {
public String name;
Node n[];
}
Add nodes like so
public Node[] add_subnode(Node n[]) {
for (int i=0; i<n.length; i++) {
n[i] = new Node();
p("\n Enter name: ");
n[i].name = sc.next();
p("\n How many children for "+n[i].name+"?");
int children = sc.nextInt();
if (children > 0) {
Node x[] = new Node[children];
n[i].n = add_subnode(x);
}
}
return n;
}
Full working code:
import java.util.Scanner;
class People {
private Scanner sc;
public People(Scanner sc) {
this.sc = sc;
}
public void main_thing() {
Node head = new Node();
head.name = "Head";
p("\n How many nodes do you want to add to Head: ");
int nodes = sc.nextInt();
head.n = new Node[nodes];
Node[] n = add_subnode(head.n);
print_nodes(head.n);
}
public Node[] add_subnode(Node n[]) {
for (int i=0; i<n.length; i++) {
n[i] = new Node();
p("\n Enter name: ");
n[i].name = sc.next();
p("\n How many children for "+n[i].name+"?");
int children = sc.nextInt();
if (children > 0) {
Node x[] = new Node[children];
n[i].n = add_subnode(x);
}
}
return n;
}
public void print_nodes(Node n[]) {
if (n!=null && n.length > 0) {
for (int i=0; i<n.length; i++) {
p("\n "+n[i].name);
print_nodes(n[i].n);
}
}
}
public static void p(String msg) {
System.out.print(msg);
}
}
class Node {
public String name;
Node n[];
}
I recommend you to use a Node class with a left child Node and right child Node and a parent Node.
public class Node<T>
{
Node<T> parent;
Node<T> leftChild;
Node<T> rightChild;
T value;
Node(T val)
{
value = val;
leftChild = new Node<T>();
leftChild.parent = this;
rightChild = new Node<T>();
rightChild.parent = this;
}
// Empty placeholder leaf, used for the two children created above
Node()
{
}
You can set grand father and uncle and sibling like this.
Node<T> grandParent()
{
if(this.parent != null && this.parent.parent != null)
{
return this.parent.parent;
}
else
return null;
}
Node<T> uncle()
{
if(this.grandParent() != null)
{
if(this.parent == this.grandParent().rightChild)
{
return this.grandParent().leftChild;
}
else
{
return this.grandParent().rightChild;
}
}
else
return null;
}
Node<T> sibling()
{
if(this.parent != null)
{
if(this == this.parent.rightChild)
{
return this.parent.leftChild;
}
else
{
return this.parent.rightChild;
}
}
else
return null;
}
}
And it is impossible to have infinitely many children unless you have infinite memory.
good luck !
Hope this will help you.

How to walk binary abstract syntax tree to generate infix notation with minimally correct parentheses

I am being passed a binary AST representing a math formula. Each internal node is an operator and leaf nodes are the operands. I need to walk the tree and output the formula in infix notation. This is pretty easy to do by walking the tree with a recursive algorithm such as the Print() method shows below. The problem with the Print() method is that the order of operations is lost when converting to infix because no parentheses are generated.
I wrote the PrintWithParens() method, which outputs a correct infix formula; however, it adds extraneous parentheses. You can see that in three of the four cases of my main method it adds parentheses when none are necessary.
I have been racking my brain trying to figure out what the correct algorithm for PrintWithMinimalParens() should be. I'm sure there must be an algorithm that outputs parentheses only when they are necessary to group terms, but I have been unable to implement it correctly. I think I need to look at the precedence of the operators in the tree below the current node, but the algorithm I have there now doesn't work (see the last 2 cases in my main method: no parentheses are needed, but my logic adds them).
public class Test {
static abstract class Node {
Node left;
Node right;
String text;
abstract void Print();
abstract void PrintWithParens();
abstract void PrintWithMinimalParens();
int precedence()
{
return 0;
}
}
enum Operator {
PLUS(1,"+"),
MINUS(1, "-"),
MULTIPLY(2, "*"),
DIVIDE(2, "/"),
POW(3, "^")
;
private final int precedence;
private final String text;
private Operator(int precedence, String text)
{
this.precedence = precedence;
this.text = text;
}
@Override
public String toString() {
return text;
}
public int getPrecedence() {
return precedence;
}
}
static class OperatorNode extends Node {
private final Operator op;
OperatorNode(Operator op)
{
this.op = op;
}
@Override
void Print() {
left.Print();
System.out.print(op);
right.Print();
}
@Override
void PrintWithParens() {
System.out.print("(");
left.PrintWithParens();
System.out.print(op);
right.PrintWithParens();
System.out.print(")");
}
@Override
void PrintWithMinimalParens() {
boolean needParens =
(left.precedence() != 0 && left.precedence() < this.op.precedence)
||
(right.precedence() != 0 && right.precedence() < this.op.precedence);
if(needParens)
System.out.print("(");
left.PrintWithMinimalParens();
System.out.print(op);
right.PrintWithMinimalParens();
if(needParens)
System.out.print(")");
}
@Override
int precedence() {
return op.getPrecedence();
}
}
static class TextNode extends Node {
TextNode(String text)
{
this.text = text;
}
@Override
void Print() {
System.out.print(text);
}
@Override
void PrintWithParens() {
System.out.print(text);
}
@Override
void PrintWithMinimalParens() {
System.out.print(text);
}
}
private static void printExpressions(Node rootNode) {
System.out.print("Print() : ");
rootNode.Print();
System.out.println();
System.out.print("PrintWithParens() : ");
rootNode.PrintWithParens();
System.out.println();
System.out.print("PrintWithMinimalParens() : ");
rootNode.PrintWithMinimalParens();
System.out.println();
System.out.println();
}
public static void main(String[] args)
{
System.out.println("Desired: 1+2+3+4");
Node rootNode = new OperatorNode(Operator.PLUS);
rootNode.left = new TextNode("1");
rootNode.right = new OperatorNode(Operator.PLUS);
rootNode.right.left = new TextNode("2");
rootNode.right.right = new OperatorNode(Operator.PLUS);
rootNode.right.right.left = new TextNode("3");
rootNode.right.right.right = new TextNode("4");
printExpressions(rootNode);
System.out.println("Desired: 1+2*3+4");
rootNode = new OperatorNode(Operator.PLUS);
rootNode.left = new TextNode("1");
rootNode.right = new OperatorNode(Operator.PLUS);
rootNode.right.left = new OperatorNode(Operator.MULTIPLY);
rootNode.right.left.left = new TextNode("2");
rootNode.right.left.right = new TextNode("3");
rootNode.right.right = new TextNode("4");
printExpressions(rootNode);
System.out.println("Desired: 1+2*(3+4)");
rootNode = new OperatorNode(Operator.PLUS);
rootNode.left = new TextNode("1");
rootNode.right = new OperatorNode(Operator.MULTIPLY);
rootNode.right.left = new TextNode("2");
rootNode.right.right = new OperatorNode(Operator.PLUS);
rootNode.right.right.left = new TextNode("3");
rootNode.right.right.right = new TextNode("4");
printExpressions(rootNode);
System.out.println("Desired: 1+2^8*3+4");
rootNode = new OperatorNode(Operator.PLUS);
rootNode.left = new TextNode("1");
rootNode.right = new OperatorNode(Operator.MULTIPLY);
rootNode.right.left = new OperatorNode(Operator.POW);
rootNode.right.left.left = new TextNode("2");
rootNode.right.left.right = new TextNode("8");
rootNode.right.right = new OperatorNode(Operator.PLUS);
rootNode.right.right.left = new TextNode("3");
rootNode.right.right.right = new TextNode("4");
printExpressions(rootNode);
}
}
Output:
Desired: 1+2+3+4
Print() : 1+2+3+4
PrintWithParens() : (1+(2+(3+4)))
PrintWithMinimalParens() : 1+2+3+4
Desired: 1+2*3+4
Print() : 1+2*3+4
PrintWithParens() : (1+((2*3)+4))
PrintWithMinimalParens() : 1+2*3+4
Desired: 1+2*(3+4)
Print() : 1+2*3+4
PrintWithParens() : (1+(2*(3+4)))
PrintWithMinimalParens() : 1+(2*3+4)
Desired: 1+2^8*3+4
Print() : 1+2^8*3+4
PrintWithParens() : (1+((2^8)*(3+4)))
PrintWithMinimalParens() : 1+(2^8*3+4)
Is it possible to implement the PrintWithMinimalParens() that I want? Does the fact that order is implicit in the tree make doing what I want impossible?
In your code you are comparing each operator with its children to see if you need parentheses around it. But you should actually be comparing it with its parent. Here are some rules that can determine if parentheses can be omitted:
1. You never need parentheses around the operator at the root of the AST.
2. If operator A is the child of operator B, and A has a higher precedence than B, the parentheses around A can be omitted.
3. If a left-associative operator A is the left child of a left-associative operator B with the same precedence, the parentheses around A can be omitted. A left-associative operator is one for which x A y A z is parsed as (x A y) A z.
4. If a right-associative operator A is the right child of a right-associative operator B with the same precedence, the parentheses around A can be omitted. A right-associative operator is one for which x A y A z is parsed as x A (y A z).
5. If you can assume that an operator A is associative, i.e. that (x A y) A z = x A (y A z) for all x,y,z, and A is the child of the same operator A, you can choose to omit parentheses around the child A. In this case, reparsing the expression will yield a different AST that gives the same result when evaluated.
Note that for your first example, the desired result is only correct if you can assume that + is associative (which is true when dealing with normal numbers) and implement rule #5. This is because your input tree is built in a right-associative fashion, while operator + is normally left-associative.
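The five rules can be sketched in Python as a check made from the parent's side. This is a minimal illustration under assumptions that are not in the question: the tree is a nested tuple `(op, left, right)` with string leaves, `^` is taken to be right-associative, and `+` and `*` are taken to be associative. Whenever none of the rules permits omission, the sketch keeps the parentheses.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}
RIGHT_ASSOCIATIVE = {"^"}    # assumption: everything else is left-associative
ASSOCIATIVE = {"+", "*"}     # assumption: (x op y) op z == x op (y op z)


def to_infix(node):
    """node is either a string leaf or a tuple (op, left, right)."""
    if isinstance(node, str):
        return node
    op, left, right = node
    # Rule 1: the root itself is never wrapped; only children may be.
    return child_text(left, op, False) + op + child_text(right, op, True)


def needs_parens(child_op, parent_op, is_right):
    child_prec, parent_prec = PRECEDENCE[child_op], PRECEDENCE[parent_op]
    if child_prec != parent_prec:
        return child_prec < parent_prec     # rule 2
    if child_op == parent_op and child_op in ASSOCIATIVE:
        return False                        # rule 5
    child_right = child_op in RIGHT_ASSOCIATIVE
    parent_right = parent_op in RIGHT_ASSOCIATIVE
    if child_right and parent_right:
        return not is_right                 # rule 4
    if not child_right and not parent_right:
        return is_right                     # rule 3
    return True                             # mixed associativity: keep parentheses


def child_text(child, parent_op, is_right):
    text = to_infix(child)
    if isinstance(child, str) or not needs_parens(child[0], parent_op, is_right):
        return text
    return "(" + text + ")"


print(to_infix(("+", "1", ("*", "2", ("+", "3", "4")))))  # 1+2*(3+4)
```

Because the decision is made per child, a low-precedence subtree on one side no longer forces parentheses around the whole parent expression.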
You're enclosing an entire expression in parentheses if either the left or the right child has a lower-precedence operator even if one of them is a higher- or equal-precedence operator.
I think you need to separate your boolean needParens into distinct cases for the left and right children. Something like this (untested):
void PrintWithMinimalParens() {
boolean needLeftChildParens =
(left.precedence() != 0 && left.precedence() < this.op.precedence);
boolean needRightChildParens =
(right.precedence() != 0 && right.precedence() < this.op.precedence);
if(needLeftChildParens)
System.out.print("(");
left.PrintWithMinimalParens();
if(needLeftChildParens)
System.out.print(")");
System.out.print(op);
if(needRightChildParens)
System.out.print("(");
right.PrintWithMinimalParens();
if(needRightChildParens)
System.out.print(")");
}
Also, I don't think your last example is correct. Looking at your tree I think it should be:
1+2^8*(3+4)

Lossless hierarchical run length encoding

I want to summarize rather than compress in a similar manner to run length encoding but in a nested sense.
For instance, I want : ABCBCABCBCDEEF to become: (2A(2BC))D(2E)F
I am not concerned about which option is picked between two equally good nestings, e.g.
ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA, which are of the same compressed length despite having different structures.
However I do want the choice to be MOST greedy. For instance:
ABCDABCDCDCDCD would pick (2ABCD)(3CD) - of length six in original symbols which is less than ABCDAB(4CD) which is length 8 in original symbols.
In terms of background, I have some repeating patterns that I want to summarize so that the data is more digestible. I don't want to disrupt the logical order of the data, as it is important, but I do want to summarize it by saying: symbol A times 3 occurrences, followed by symbols XYZ for 20 occurrences, etc. This can then be displayed visually in a nested sense.
Ideas welcome.
I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.
You can paste the following code into LINQPad and run it, and it should produce the following output:
ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)
As you can see, the middle example encoded ABB as A(2B) instead of ABB, you would have to make that judgment yourself, if single-symbol sequences like that should be encoded as a repeated symbol or not, or if a specific threshold (like 3 or more) should be used.
Basically, the code runs like this:
1. For each position in the sequence, try to find the longest match (actually, it doesn't; it takes the first 2+ match it finds. I left the rest as an exercise for you since I have to leave my computer for a few hours now)
2. It then tries to encode that sequence, the one that repeats, recursively, and spits out an X*seq type of object
3. If it can't find a repeating sequence, it spits out the single symbol at that location
4. It then skips what it encoded, and continues from #1
Anyway, here's the code:
void Main()
{
string[] examples = new[]
{
"ABCBCABCBCDEEF",
"ABBABBABBABA",
"ABCDABCDCDCDCD",
};
foreach (string example in examples)
{
StringBuilder sb = new StringBuilder();
foreach (var r in Encode(example))
sb.Append(r.ToString());
Debug.WriteLine(example + " = " + sb.ToString());
}
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
return Encode<T>(values, EqualityComparer<T>.Default);
}
public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
List<T> sequence = new List<T>(values);
int index = 0;
while (index < sequence.Count)
{
var bestSequence = FindBestSequence<T>(sequence, index, comparer);
if (bestSequence == null || bestSequence.Length < 1)
throw new InvalidOperationException("Unable to find sequence at position " + index);
yield return bestSequence;
index += bestSequence.Length;
}
}
private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
int sequenceLength = 1;
while (startIndex + sequenceLength * 2 <= sequence.Count)
{
if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
{
bool atLeast2Repeats = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
{
atLeast2Repeats = false;
break;
}
}
if (atLeast2Repeats)
{
int count = 2;
while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
{
bool anotherRepeat = true;
for (int index = 0; index < sequenceLength; index++)
{
if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
{
anotherRepeat = false;
break;
}
}
if (anotherRepeat)
count++;
else
break;
}
List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
return new SequenceRepeat<T>(count, repeatedSequence);
}
}
sequenceLength++;
}
// fall back, we could not find anything that repeated at all
return new SingleSymbol<T>(sequence[startIndex]);
}
public abstract class Repeat<T>
{
public int Count { get; private set; }
protected Repeat(int count)
{
Count = count;
}
public abstract int Length
{
get;
}
}
public class SingleSymbol<T> : Repeat<T>
{
public T Value { get; private set; }
public SingleSymbol(T value)
: base(1)
{
Value = value;
}
public override string ToString()
{
return string.Format("{0}", Value);
}
public override int Length
{
get
{
return Count;
}
}
}
public class SequenceRepeat<T> : Repeat<T>
{
public Repeat<T>[] Values { get; private set; }
public SequenceRepeat(int count, Repeat<T>[] values)
: base(count)
{
Values = values;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
}
public override int Length
{
get
{
int oneLength = 0;
foreach (var value in Values)
oneLength += value.Length;
return Count * oneLength;
}
}
}
public class GroupRepeat<T> : Repeat<T>
{
public Repeat<T> Group { get; private set; }
public GroupRepeat(int count, Repeat<T> group)
: base(count)
{
Group = group;
}
public override string ToString()
{
return string.Format("({0}{1})", Count, Group);
}
public override int Length
{
get
{
return Count * Group.Length;
}
}
}
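The same greedy procedure fits in a few lines of Python when the input is a plain string. This sketch mirrors the behaviour described above (take the shortest unit that repeats at the current position, then encode that unit recursively); the names `encode` and `first_repeat` are illustrative, and like the C# version it does not search for the longest or best match.

```python
def encode(seq):
    """Greedy nested run-length encoding of a string, e.g. 'EE' -> '(2E)'."""
    out = []
    i = 0
    while i < len(seq):
        unit_len, repeats = first_repeat(seq, i)
        if repeats >= 2:
            unit = seq[i:i + unit_len]
            # Encode the repeating unit itself recursively to get the nesting.
            out.append("(%d%s)" % (repeats, encode(unit)))
        else:
            out.append(seq[i])
        i += unit_len * repeats
    return "".join(out)


def first_repeat(seq, start):
    """Return (unit_len, repeats) for the shortest unit repeating at start, else (1, 1)."""
    unit_len = 1
    while start + 2 * unit_len <= len(seq):
        unit = seq[start:start + unit_len]
        repeats = 1
        # Count how many back-to-back copies of the unit follow.
        while seq[start + repeats * unit_len:start + (repeats + 1) * unit_len] == unit:
            repeats += 1
        if repeats >= 2:
            return unit_len, repeats
        unit_len += 1
    return 1, 1


for example in ["ABCBCABCBCDEEF", "ABBABBABBABA", "ABCDABCDCDCDCD"]:
    print(example, "=", encode(example))
```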
Looking at the problem theoretically, it seems similar to the problem of finding the smallest context free grammar which generates (only) the string, except in this case the non-terminals can only be used in direct sequence after each other, so e.g.
ABCBCABCBCDEEF
s->ttDuuF
t->Avv
v->BC
u->E
ABABCDABCD
s->ABtt
t->ABCD
Of course, this depends on how you define "smallest", but if you count terminals on the right side of rules, it should be the same as the "length in original symbols" after doing the nested run-length encoding.
The problem of the smallest grammar is known to be hard, and is a well-studied problem. I don't know how much the "direct sequence" part adds to or subtracts from the complexity.
