Creating an anagram with a string of characters using Backtracking - algorithm

I was looking at this solution here
Finding anagrams for a given word
Also if it helps here is the interview question
An anagram is a rearrangement of the letters in a given string into a sequence
of dictionary words, like Steven Skiena into Vainest Knees. Propose an algorithm
to construct all the anagrams of a given string.
What is the time complexity of this? Is this a valid backtracking algoirthm or is this more taking advantage of space to create a solution?
I am looking at the solution from Daniel, and that looks like it may take up a lot of space with a tree?
My real question though is, if I was asked this on an interview, what time of explanation do I give? A pseudo code answer or a real program?

A real solution probably takes too long to write in an interview. Here is how you can implement the back-tracking part. It's just pseudo-code:
function getAnagrams(str) {
helper('', str.replace(' ', '').toLowerCase().sort());
}
function helper(prefix, str) {
foreach (word in dictionary) {
if (word.isShuffleOf(str)) {
solution.add(prefix == '' ? word : prefix + ' ' + word);
} else if (str.contains(word)) {
helper(prefix == '' ? word : prefix + ' ' + word, str.subtract(word));
}
}
}

Related

Shortest word in a dictionary containing all letters

I was asked this question in an interview.
Given an array of characters, find the shortest word in a dictionary that contains all the characters. Also, propose an implementation for the dictionary that would optimize this function call.
for e.g. char[] chars = { 'R' , 'C' }. The result should be the word "CAR".
I could not come up with anything that would run reasonably quickly. I thought of pre-processing the dictionary by building a hash table to retrieve all words of a particular length. Then I could only think of retrieving all words in the increasing order of length and checking if the required characters were present in any of those ( maybe by using a bitmask . )
This is a common software interview question, and its solution is this: sort the dictionary itself by length and sort each value alphabetically. When given the characters, sort them and find the needed letters.
First sort the dictionary in ascending order of length.
For each letter, construct a bit map of the locations in the dictionary of the words containing that letter. Each bit map will be long, but there will not be many.
For each search, take the intersection of the bitmaps for the letters in the array. The first one bit in the result will be at the index corresponding to the location in the dictionary of the shortest word containing all the letters.
The other answers are better, but I realized this is entirely precomputable.
For each word
sort the letters and remove duplicates
The sequence of letters can be viewed as a bitmask, A=0bit, B=1bit...Z=26bit. Set the bits of a mask A according to the letters in this word.
For each combination of set bits in the mask A, make a subset mask B
If there is already a word associated with this mask B
and this word is shorter, replace the associated word with this one
otherwise try next B
If there is no word associated with mask B
Associate this word with mask B.
This would take a huge amount of setup time, and the subsequent association storage would be in the vicinity of 1.7GB, but you'd be able to find the shortest word containing a superset of the letters in O(1) time guaranteed.
The obvious preprocessing is to sort all words in the dictionary by their length and alphabetical re-ordering: "word" under "dorw", for example. Then you can use general search algorithms (e.g., regex) to search for the letters you need. An efficient (DFA) search requires only one pass over the dictionary in the worst case, and much less if the first match is short.
Here is a solution in C#:
using System.Collections.Generic;
using System.Linq;
public class ShortestWordFinder
{
public ShortestWordFinder(IEnumerable<string> dictionary)
{
this.dictionary = dictionary;
}
public string ShortestWordContaining(IEnumerable<char> chars)
{
var wordsContaining = dictionary.Where(s =>
{
foreach (var c in chars)
{
if (!s.Contains(c))
{
return false;
}
s = s.Remove(s.IndexOf(c), 1);
}
return true;
}).ToList();
if (!wordsContaining.Any())
{
return null;
}
var minLength = wordsContaining.Min(word => word.Length);
return wordsContaining.First(word => word.Length == minLength);
}
private readonly IEnumerable<string> dictionary;
}
Simple test:
using System.Diagnostics;
using Xunit;
public class ShortestWordFinderTests
{
[Fact]
public void Works()
{
var words = new[] { "dog", "moose", "gargoyle" };
var finder = new ShortestWordFinder(words);
Trace.WriteLine(finder.ShortestWordContaining("o"));
Trace.WriteLine(finder.ShortestWordContaining("oo"));
Trace.WriteLine(finder.ShortestWordContaining("oy"));
Trace.WriteLine(finder.ShortestWordContaining("eyg"));
Trace.WriteLine(finder.ShortestWordContaining("go"));
Assert.Null(finder.ShortestWordContaining("ooo"));
}
}
Pre processing
a. Sort words into alphabetic char arrays. Retain mapping from sorted to original word
b. Split dictionary by word length as you suggest
c. Sort entries in each word length set alphabetically
On function call
Sort char array alphabetically
Start with group of same length as array
Loop through entries testing for characters until first letter of entry lexicographically greater than first in your char array then break. If match then return original word (see a above for mapping)
Back to 2 for next longest word group
Interesting extensions. Multiple words might map to same entry in a. Which one (s) should you return...

Determine if one string is a prefix of another

I have written down a simple function that determines if str1 is a prefix of str2. It's a very simple function, that looks like this (in JS):
function isPrefix(str1, str2) // determine if str1 is a prefix of a candidate string
{
if(str2.length < str1.length) // candidate string can't be smaller than prefix string
return false;
var i = 0;
while(str1.charAt(i) == str2.charAt(i) && i <= str1.length)
i++;
if(i < str1.length) // i terminated => str 1 is smaller than str 2
return false;
return true;
}
As you can see, it loops through the entire length of the prefix string to gauge if it is a prefix of the candidate string. This means it's complexity is O(N), which isn't bad but this becomes a problem when I have a huge data set to consider looping through to determine which strings have the prefix string as a part of the prefix. This makes the complexity multiple like O(M*N) where M is the total number of strings in a given data set. Not good.
I explored the Internet a bit to determine that the best answer would be a Patricia/Radix trie. Where strings are stored as prefixes. Even then, when I attempt to insert/look-up a string, there will be a considerable overhead in string matching if I use the aforementioned prefix gauging function.
Say I had a prefix string 'rom' and a set of candidate words
var dataset =["random","rapid","romance","romania","rome","rose"];
that would like this in a radix trie :
r
/ \
a o
/ \ / \
ndom pid se m
/ \
an e
/ \
ia ce
This means, for every node, I will be using the prefix match function, to determine which node has a value that matches the prefix string at the index. Somehow, this solution still seems arduous and does not sit too well with me. Is there something better or anyway I can improve the core prefix matching function ?
Looks like you've got two different problems.
One is to determine if a string is contained as a prefix in another string. For this I would suggest using a function already implemented in the language's string library. In JavaScript you could do this
if (str2.indexOf(str1) === 0) {
// string str1 is a prefix of str2
}
See documentation for String.indexOf here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf
For the other problem, in a bunch of strings, find out which ones have a given string as a prefix, building a data structure like a Trie or the one you mention seems like the way to go, if you want fast look-ups.
Check out this thread on stackoverflow - How to check if a string "StartsWith" another string? . Mark Byers solution seems to be very efficient. Also for Java there are built in String functions "endsWith" and "startsWith" - http://docs.oracle.com/javase/tutorial/java/data/comparestrings.html

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
'''
Applies simple memoization to a function
'''
cache = {}
def closure(*args):
if args in cache:
v = cache[args]
else:
v = func(*args)
cache[args] = v
return v
return closure
def segment(text, words):
'''
Return the first match that is the segmentation of 'text' into words
'''
#memo
def _segment(text):
if text in words: return text
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix = _segment(suffix)
if prefix in words and segmented_suffix:
return '%s %s' % (prefix, segmented_suffix)
return None
return _segment(text)
def segment_all(text, words):
'''
Return a full list of matches that are the segmentation of 'text' into words
'''
#memo
def _segment(text):
matches = []
if text in words:
matches.append(text)
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix_matches = _segment(suffix)
if prefix in words and len(segmented_suffix_matches):
for match in segmented_suffix_matches:
matches.append('%s %s' % (prefix, match))
return matches
return _segment(text)
if __name__ == "__main__":
string = 'cargocultscience'
words = set('car cargo go cult science'.split())
print segment(string, words)
# >>> car go cult science
print segment_all(string, words)
# >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
// If you walked off the trie, this path fails.
if trieNode is null, return.
// If this trie node is a word, consider what happens if you split
// the word here.
if trieNode.isWord:
// If there is no input left, you're done and have a partition.
if lettersLeft is empty, output wordBreaks + wordSoFar and return
// Otherwise, try splitting here.
PartitinWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)
// Otherwise, consume the next letter and continue:
PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
wordBreaks, trieNode.child[lettersLeft[0])
In the pathologically worst case this will list all partitions of the string, which can t exponentially long. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, and is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit of the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
import java.util.*;
class Position {
int indexTest,no;
Position(int indexTest,int no)
{
this.indexTest=indexTest;
this.no=no;
} } class RandomWordCombo {
static boolean isCombo(String[] dict,String test)
{
HashMap<String,ArrayList<String>> dic=new HashMap<String,ArrayList<String>>();
Stack<Position> pos=new Stack<Position>();
for(String each:dict)
{
if(dic.containsKey(""+each.charAt(0)))
{
//System.out.println("=========it is here");
ArrayList<String> temp=dic.get(""+each.charAt(0));
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
else
{
ArrayList<String> temp=new ArrayList<String>();
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
}
Iterator it = dic.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println("key: "+pair.getKey());
for(String str:(ArrayList<String>)pair.getValue())
{
System.out.print(str);
}
}
pos.push(new Position(0,0));
while(!pos.isEmpty())
{
Position position=pos.pop();
System.out.println("position index: "+position.indexTest+" no: "+position.no);
if(dic.containsKey(""+test.charAt(position.indexTest)))
{
ArrayList<String> strings=dic.get(""+test.charAt(position.indexTest));
if(strings.size()>1&&position.no<strings.size()-1)
pos.push(new Position(position.indexTest,position.no+1));
String str=strings.get(position.no);
if(position.indexTest+str.length()==test.length())
return true;
pos.push(new Position(position.indexTest+str.length(),0));
}
}
return false;
}
public static void main(String[] st)
{
String[] dic={"world","hello","super","hell"};
System.out.println("is 'hellworld' a combo: "+isCombo(dic,"superman"));
} }
I have done similar problem. This solution gives true or false if given string is combination of dictionary words. It can be easily converted to get space-separated string. Its average complexity is O(n), where n: no of dictionary words in given string.

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Coefficient Theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
public class Solution {
// should contain the list of all words, or you can use any other data structure (e.g. a Trie)
private HashSet<String> dictionary;
public String parse(String s) {
return parse(s, new HashMap<String, String>());
}
public String parse(String s, HashMap<String, String> map) {
if (map.containsKey(s)) {
return map.get(s);
}
if (dictionary.contains(s)) {
return s;
}
for (int left = 1; left < s.length(); left++) {
String leftSub = s.substring(0, left);
if (!dictionary.contains(leftSub)) {
continue;
}
String rightSub = s.substring(left);
String rightParsed = parse(rightSub, map);
if (rightParsed != null) {
String parsed = leftSub + " " + rightParsed;
map.put(s, parsed);
return parsed;
}
}
map.put(s, null);
return null;
}
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
for(x = 0; x < validwords.length; x++){
if(mainword.startswith(validwords[x])) {
segments.push(validwords[x]);
mainword = mainword.remove(v);
x = 0;
}
}
/**
* remove the first character if any of valid words do not match, then start again
* you may need to add the first character to the result if you want to
*/
mainword = mainword.substring(1);
}
string result = segments.join(" ");

How can I detect common substrings in a list of strings

Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
EntireS[1,2]
J27[Red,Green]P[1,2]
JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem
There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.
Edited to add:
Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.
I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.
To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.
Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).
In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.
Great question! The steps to a solution are:
tokenizing input
using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
serialization of the data structure to a regular expression or similar syntax
I've implemented this approach in regroup.py. Here's an example:
$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))
Something like that might work.
Build a trie that represents all your strings.
In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".
A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]
If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.
If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.
I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.
try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you.
https://github.com/noprompt/frak
Hope it helps.
There are many many approaches to string similarity. I would suggest taking a look at this open-source library that implements a lot of metrics like Levenshtein distance.
http://sourceforge.net/projects/simmetrics/
You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.
There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler.
A nice explanation for finding longest common prefix can be found at
http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/
So my proposed "pythonese" pseudo-code is something like this (refer to the link for an implementation of find_lcp:
def count_groups(items):
sorted_list = sorted(items)
prefix = sorted_list[0]
ctr = 0
groups = {}
saved_common_prefix = ""
for i in range(1, sorted_list):
common_prefix = find_lcp(prefix, sorted_list[i])
if len(common_prefix) > 0: #we are still in the same group of items
ctr += 1
saved_common_prefix = common_prefix
prefix = common_prefix
else: # we must have switched to a new group
groups[saved_common_prefix] = ctr
ctr = 0
saved_common_prefix = ""
prefix = sorted_list[i]
return groups
For this particular example of strings to keep it extremely simple consider using simple word/digit -separation.
A non-digit sequence apparently can begin with capital letter (Entire). After breaking all strings into groups of sequences, something like
[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]
Then start grouping by groups, you can fairly soon see that prefix "Entire" is a common for some group and that all subgroups have S as headgroup, so only variable for those is 1,2.
For J27 case you can see that J27 is only leaf, but that it then branches at Red and Green.
So somekind of List<Pair<list, string>> -structure (composite pattern if I recall correctly).
import java.util.*;
class StringProblem
{
public List<String> subString(String name)
{
List<String> list=new ArrayList<String>();
for(int i=0;i<=name.length();i++)
{
for(int j=i+1;j<=name.length();j++)
{
String s=name.substring(i,j);
list.add(s);
}
}
return list;
}
public String commonString(List<String> list1,List<String> list2,List<String> list3)
{
list2.retainAll(list1);
list3.retainAll(list2);
Iterator it=list3.iterator();
String s="";
int length=0;
System.out.println(list3); // 1 1 2 3 1 2 1
while(it.hasNext())
{
if((s=it.next().toString()).length()>length)
{
length=s.length();
}
}
return s;
}
public static void main(String args[])
{
Scanner sc=new Scanner(System.in);
System.out.println("Enter the String1:");
String name1=sc.nextLine();
System.out.println("Enter the String2:");
String name2=sc.nextLine();
System.out.println("Enter the String3:");
String name3=sc.nextLine();
// String name1="salman";
// String name2="manmohan";
// String name3="rahman";
StringProblem sp=new StringProblem();
List<String> list1=new ArrayList<String>();
list1=sp.subString(name1);
List<String> list2=new ArrayList<String>();
list2=sp.subString(name2);
List<String> list3=new ArrayList<String>();
list3=sp.subString(name3);
sp.commonString(list1,list2,list3);
System.out.println(" "+sp.commonString(list1,list2,list3));
}
}

Resources