I'm doing a problem from a competition that says: "concatenate the words to generate the lexicographically lowest possible string."
Take for example this string: jibw ji jp bw jibw
The actual output turns out to be: bw jibw jibw ji jp
When I do sorting on this, I get: bw ji jibw jibw jp.
Does this mean that this is not sorting? If it is sorting, does "lexicographic" sorting take into consideration pushing the shorter strings to the back or something?
I've been doing some reading on lexicographical order and I don't see any scenarios in which this is used; do you have any?
It seems that what you're looking for is a better understanding of the question, so let me just make it clear. The usual sorting on strings is lexicographic sorting. If you sort the strings [jibw, ji, jp, bw, jibw] into lexicographic order, the sorted sequence is [bw, ji, jibw, jibw, jp], which is what you got. So your problem is not with understanding the word "lexicographic"; you already understand it correctly.
Your problem is that you're misreading the question. The question doesn't ask you to sort the strings in lexicographic order. (If it did, the answer you got by sorting would be correct.) Instead, it asks you to produce one string, obtained by concatenating the input strings in some order (i.e., making one string without spaces), so that the resulting single string is lexicographically minimal.
To illustrate the difference, consider the string you get by concatenating the sorted sequence, and the answer string:
bwjijibwjibwjp //Your answer
bwjibwjibwjijp //The correct answer
Now when you compare these two strings — note that you're just comparing two 14-character strings, not two sequences of strings — you can see that the correct answer is indeed lexicographically smaller than your answer: your answer starts with "bwjij", while the correct answer starts with "bwjib", and "bwjib" comes before "bwjij" in lexicographic order.
Hope you understand the question now. It is not a sorting question at all. (That is, it is not a problem of sorting the input strings. You could sort all possible strings obtained by permuting and concatenating the input strings; this is one way of solving the problem if the number of input strings is small.)
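As a sanity check, the brute-force search over all orderings mentioned above takes only a few lines of Python (fine here, since there are only five input strings):

```python
from itertools import permutations

words = ["jibw", "ji", "jp", "bw", "jibw"]
# Try every ordering, concatenate, and keep the lexicographically smallest result.
best = min("".join(p) for p in permutations(words))
print(best)  # bwjibwjibwjijp
```

This matches the correct answer above, but it is factorial in the number of strings, so it only works for tiny inputs.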
You can convert this into a trivial sorting problem by comparing word1 + word2 against word2 + word1. In Python:
def cmp_concatenate(word1, word2):
    # negative if word1 should come first, positive if word2 should
    c1 = word1 + word2
    c2 = word2 + word1
    if c1 < c2:
        return -1
    elif c1 > c2:
        return 1
    else:
        return 0
Using this comparison function with the standard sort solves the problem.
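In Python 3 a cmp-style function has to be wrapped with functools.cmp_to_key; a self-contained check on the example input (the comparator is written inline here):

```python
from functools import cmp_to_key

def compare(a, b):
    # order two words by which of the two concatenations is smaller
    return -1 if a + b < b + a else (1 if a + b > b + a else 0)

words = ["jibw", "ji", "jp", "bw", "jibw"]
words.sort(key=cmp_to_key(compare))
print(words)           # ['bw', 'jibw', 'jibw', 'ji', 'jp']
print("".join(words))  # bwjibwjibwjijp
```

This reproduces the expected output of the original problem.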
I've been using F# in this Facebook Hacker Cup and learned quite a bit in this competition. Since F# documentation on the web is still sparse, I think I might as well share a bit here.
This problem asks you to sort a list of strings based on a customized comparison method. Here is my code snippet in F#.
let comparer (string1:string) (string2:string) =
    // ordinal comparison, to avoid culture-specific ordering
    String.CompareOrdinal(string1 + string2, string2 + string1)
// Assume words is an array of strings that you read from the input
// Do inplace sorting there
Array.sortInPlaceWith comparer words
// result contains the string for output
let result = Array.fold (+) "" words
//Use this block of code to print all permutations of the characters of an array; it can also be adapted in many ways.
#include <stdio.h>

/* Recursively generate and print all permutations of a[] by swapping. */
void combo(int ctr, int n, char a[])
{
    int i;
    char tmp;
    if (ctr == n)
    {
        for (i = 0; i < n; i++)
            printf("%c", a[i]);
        printf("\n");
        return;
    }
    for (i = ctr; i < n; i++)
    {
        tmp = a[ctr]; a[ctr] = a[i]; a[i] = tmp;   /* choose a[i] for position ctr */
        combo(ctr + 1, n, a);
        tmp = a[ctr]; a[ctr] = a[i]; a[i] = tmp;   /* undo the swap */
    }
}

int main(void)
{
    char a[] = {'a', 'b', 'c'};
    combo(0, 3, a);
    return 0;
}
The example you posted shows that mere sorting would not generate the lexicographically lowest string.
For the given problem, you would need to apply some additional trick to determine which string should come before which (as of now, I can't think of the exact method).
The actual output does not violate the condition for the lexicographically lowest word.
The sort command on Linux also does lexicographic sorting and generates the output in the order bw ji jibw jibw jp.
Check what happened here:
If you just apply a lexicographic sort you'll get bw ji jibw jibw jp
but if you analyze token by token you'll find that "bwjibw" (bw, jibw) is lexicographically lower than "bwjijibw" (bw, ji, jibw). That's why the answer is bw jibw jibw ji jp: first you should append bwjibwjibw, and after that you can concatenate ji and jp to get the lowest string.
A simple trick involving only sorting, which would work for this problem since the max string length is specified, would be to pad all strings up to the max length with the first letter of the string. Then you sort the padded strings, but output the original unpadded ones. For example, for string length 2 and inputs b and ba, you would sort bb and ba, which gives you ba and bb, and hence you should output bab.
Prasun's trick works if you instead pad with a special "placeholder" character that is weighted to be greater than "z" in the string sort function. The result would give you the order of the lowest lexicographic combination.
The contest is over, so I am posting a possible solution; it is not the most efficient, but it is one way of doing it.
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
using namespace std;

int main()
{
    ofstream myfile("output.txt");
    int numTestCases;
    cin >> numTestCases;
    for (int t = 0; t < numTestCases; t++)
    {
        int numStrings;
        cin >> numStrings;
        string* words = new string[numStrings];
        for (int i = 0; i < numStrings; i++)
        {
            cin >> words[i];
        }
        // Start from the sorted order, then walk through every permutation,
        // keeping the lexicographically smallest concatenation.
        sort(words, words + numStrings);
        string best;
        do
        {
            string candidate;
            for (int i = 0; i < numStrings; i++)
                candidate += words[i];
            if (best.empty() || candidate < best)
                best = candidate;
        } while (next_permutation(words, words + numStrings));
        cout << best << endl;
        myfile << best << endl;
        delete[] words;
    }
    return 0;
}
I am using algorithms from the STL in C++; the next_permutation function generates the permutations in lexicographic order, so walking through all of them and keeping the smallest concatenation works for small inputs.
This question was asked to me in a recent Amazon technical interview. It goes as follows:
Given a string, e.g. "where am i", and a dictionary of valid words, you have to list all valid distinct permutations of the string. A valid string consists of words which exist in the dictionary. For example, "we are him" and "whim aree" are valid strings, assuming the words (whim, aree) are part of the dictionary. Also, the condition is that a mere rearrangement of the words is not a valid string, i.e. "i am where" is not a valid combination.
The task is to find all possible such strings in the optimum way.
As you have said, spaces don't count, so the input can be viewed as just a list of chars. The output is a permutation of words, so an obvious way to do it is to find all valid words and then permute them.
Now the problem becomes dividing a list of chars into subsets which each form a word; you can find some answers here, and the following is my version of solving this sub-problem.
If the dictionary is not large, we can iterate over the dictionary to:
- find the min_len/max_len of words, to estimate how many words we may have, i.e. how deep we recurse;
- convert each word into a character-count map to accelerate the search;
- filter out the words which contain an impossible char (i.e. a char our input doesn't have);
- if a word is a subset of our input, recurse to find the next word.
The following is pseudocode:
int maxDepth = input.length / min_len;

void findWord(List<Map<Character, Integer>> filteredDict, Map<Character, Integer> input,
              List<Map<Character, Integer>> subsets, int level) {
    if (level < maxDepth) {
        for (Map<Character, Integer> word : filteredDict) {
            if (subset(input, word)) {
                subsets.add(word);
                findWord(filteredDict, removeSubset(input, word), subsets, level + 1);
                subsets.remove(subsets.size() - 1); // backtrack before trying the next word
            }
        }
    }
}
And then you can permute the words in a recursive function easily.
Technically speaking, this solution can be O(n**d), where n is the dictionary size and d is the max depth. But if the input is not large and complex, we can still solve it in feasible time.
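A rough Python sketch of the subset search described above (the function name find_word_sets and the tiny dictionary are illustrative, not from the original post; the "no mere rearrangement" refinement is left out):

```python
from collections import Counter

def find_word_sets(dictionary, phrase):
    # Find every multiset of dictionary words whose combined letters
    # exactly equal the letters of the input (spaces ignored).
    target = Counter(phrase.replace(" ", ""))
    # pre-filter: drop words containing a letter the input doesn't have enough of
    words = [w for w in dictionary if not (Counter(w) - target)]
    results = []

    def search(remaining, chosen, start):
        if not remaining:            # all letters used up: a valid word set
            results.append(list(chosen))
            return
        for idx in range(start, len(words)):
            wc = Counter(words[idx])
            if not (wc - remaining):  # word fits in the remaining letters
                search(remaining - wc, chosen + [words[idx]], idx)

    search(target, [], 0)
    return results

print(find_word_sets(["tea", "eat", "a", "te"], "tea"))
# [['tea'], ['eat'], ['a', 'te']]
```

Each returned word set can then be fed to a permutation step to produce the output strings.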
Say there is a word set and I would like to cluster the words based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a's and two b's.
To make it efficient, a naive way I can think of is to convert each word into a char-count series; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. That way I can build a dictionary and group words with the same key.
Two issues here: first I have to sort the chars to build the key; second, the string key looks awkward and performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers: A->2, B->3, C->5, etc. This will give "abba" == "aabb" == 36 (but a different letter-to-prime-number mapping may be better).
See my answer here.
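A minimal sketch of the prime-product idea, assuming a lowercase a-z alphabet (the letter-to-prime table is simply the first 26 primes):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_hash(word):
    # product of one prime per letter; anagrams multiply to the same value
    h = 1
    for ch in word:
        h *= PRIMES[ord(ch) - ord('a')]
    return h

print(prime_hash("abba"))  # 36
print(prime_hash("aabb"))  # 36
```

Python's arbitrary-precision integers make this safe; in a fixed-width language the product can overflow for long words, so a modulus (plus an exact check on collisions) would be needed.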
Since you are going to sort words, I assume all characters' ASCII values are in the range 0-255. Then you can do a counting sort over the words.
The counting sort is going to take time proportional to the size of the input word. Reconstruction of the string obtained from the counting sort will take O(wordLen). You cannot make this step less than O(wordLen), because you will have to iterate over the string at least once, i.e. O(wordLen): there is no predefined order, and you cannot make any assumptions about a word without iterating through all of its characters. Traditional (i.e. comparison-based) sorting implementations give you O(n * lg n), but non-comparison ones give you O(n).
Iterate over all the words of the list and sort them using our counting sort. Keep a map from each sorted word to the list of known words it maps to. Adding an element to a list takes constant time, so overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];
        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        // It is at most O(wordLen + 256)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char) i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }
        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
    }
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter; a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7-letter alphabet (abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is that the number of ones in a row represents the count of that letter, and the total number of zeroes before that run of ones identifies which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
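The (x + y)-bit construction above can be sketched like this: for each letter of the alphabet in order, emit a run of 1s for that letter's count, closed by a single 0.

```python
def anagram_bits(word, alphabet="abcdefg"):
    # one run of 1s per letter's count, each run terminated by a single 0
    counts = {c: 0 for c in alphabet}
    for ch in word:
        counts[ch] += 1
    return "".join("1" * counts[c] + "0" for c in alphabet)

print(anagram_bits("fade"))  # 10001010100
print(anagram_bits("dada"))  # 11000110000
```

With x alphabet letters and a word of length y this yields exactly x zeros and y ones, i.e. x + y bits; words shorter than the maximum length produce shorter strings, which could be zero-padded if a fixed width is needed.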
I would do this in two steps: first sort all your words according to their length and work on each subset separately (this is to avoid lots of overlaps later).
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc., for example) and add up all the values for each word, thereby assigning each word to an integer. Then you can sort the words according to this integer value, which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold where, if you have too many words in one bucket, you repeat the previous step with another hash (just assigning different numbers to the letters). Unless someone has looked at your code and designed a word list to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple of char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described: one that has no possible overlaps.
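A quick illustration of the letter-value sum and the kind of collision mentioned above ("bad" and "cac" both sum to 7 with a = 1, b = 2, ...):

```python
def letter_sum(word):
    # a = 1, b = 2, ... summed over the word; cheap, but collides easily
    return sum(ord(ch) - ord('a') + 1 for ch in word)

print(letter_sum("bad"))  # 7
print(letter_sum("cac"))  # 7
```

So this sum only narrows the buckets; exact comparison (or a stronger hash) is still needed inside each bucket.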
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions form a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new word goes into, or find the bucket an existing word is in.
Count the frequency of characters in each of the strings, then build a hash table based on the frequency table. So, for example, for the strings aczda and aacdz we get 20110000000000000000000001. Using the hash table we can partition all these strings into buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)
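A sketch of the 26-bit mask described above; note that, as mentioned, it keys on the character *set*, so words with the same letters but different counts land in the same bucket:

```python
def charset_mask(word):
    # set bit (c - 'a') for every distinct lowercase letter in the word
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

print(charset_mask("tea") == charset_mask("eat"))  # True
print(charset_mask("abba") == charset_mask("ab"))  # True (same char set, different counts)
```

If exact multiset (anagram) clustering is required, a per-letter count vector or sorted-word key must be used instead.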
I am given a string and I need to find all possible letter combinations of this string. What is the best way I can achieve this?
example:
abc
result:
abc
acb
bca
bac
cab
cba
I have nothing so far. I am not asking for code; I am just asking for the best way to do it. An algorithm? Pseudocode? Maybe a discussion?
You can sort it and then use std::next_permutation.
take a look at the example: http://www.cplusplus.com/reference/algorithm/next_permutation/
Do you want combinations or permutations? For example, if your string is "abbc" do you want to see "bbac" once or twice?
If you actually want permutations you can use std::next_permutation and it'll take care of all the work for you.
If you want the combinations (order-independent), you can use a combination-finding algorithm such as that found either here or here. Alternatively, you can use this (a Java implementation of a combination generator), with an example demonstrating what you want.
Alternatively, if you want what you have listed in your post (the permutations), then you can (for C++) use std::next_permutation, found in <algorithm>. You can find more information on std::next_permutation here.
Hope this helps. :)
In C++, std::next_permutation:
std::string s = "abc";
do
{
    std::cout << s << std::endl;
} while (std::next_permutation(s.begin(), s.end()));
Copied from an old Wikipedia article:
For every number k, with 0 ≤ k < n!, the following algorithm generates a unique permutation of the sequence s.
function permutation(k, s) {
    for j = 2 to length(s) {
        swap s[(k mod j) + 1] with s[j];  // note that our array is indexed starting at 1
        k := k / j;                       // integer division cuts off the remainder
    }
    return s;
}
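A 0-indexed Python transcription of the algorithm above (the original pseudocode is 1-indexed):

```python
def permutation(k, s):
    # maps each k in [0, n!) to a distinct permutation of s
    s = list(s)
    for j in range(2, len(s) + 1):
        s[k % j], s[j - 1] = s[j - 1], s[k % j]  # swap positions (k mod j) and j-1
        k //= j  # integer division cuts off the remainder
    return "".join(s)

print(sorted(permutation(k, "abc") for k in range(6)))
# ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
```

Note that the permutations do not come out in lexicographic order; the guarantee is only that distinct values of k give distinct permutations.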
There is a homework assignment I should do and I need help. I should write a program to find the first substring of length k that is repeated in the string at least twice.
For example, in the string "banana" there are two repeated substrings of length 2: "an" and "na". In this case, the answer is "an" because it appears sooner than "na".
Note that the simple O(n^2) algorithm is not useful, since there is a time limit on the execution time of the program, so I guess it should run in linear time.
There is a hint too: Use Hash table.
I don't want the code. I just want you to give me a clue because I have no idea how to do this using a hash table. Should I use a specific data structure too?
Iterate over the character indexes of the string (0, 1, 2, ...) up to and including the index of the second-from-last character (i.e. up to strlen(str) - 2). For each iteration, do the following...
Extract the 2-char substring starting at the character index.
Check whether your hashtable contains the 2-char substring. If it does, you've got your answer.
Otherwise, insert the 2-char substring into the hashtable.
This is easily modifiable to cope with substrings of length k.
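The steps above, generalized to substrings of length k, look roughly like this:

```python
def first_repeated(s, k):
    # scan left to right; the first window already seen is the answer
    seen = set()
    for i in range(len(s) - k + 1):
        sub = s[i:i + k]
        if sub in seen:
            return sub
        seen.add(sub)
    return None  # no length-k substring repeats

print(first_repeated("banana", 2))  # an
```

Note that slicing out each window costs O(k), so this is O(n * k) overall; a rolling hash is what brings it down to linear time.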
Combine Will A's algorithm with a rolling hash to get a linear-time algorithm.
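A sketch of that combination: a polynomial rolling hash makes each window update O(1), and an exact comparison on hash hits guards against collisions (the base and modulus here are arbitrary choices):

```python
def first_repeated_rolling(s, k):
    base, mod = 256, (1 << 61) - 1  # arbitrary but reasonable choices
    if len(s) < k:
        return None
    power = pow(base, k - 1, mod)
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    seen = {h: [0]}  # hash -> starting indices with that hash
    for i in range(1, len(s) - k + 1):
        # slide the window: drop s[i-1], append s[i+k-1]
        h = ((h - ord(s[i - 1]) * power) * base + ord(s[i + k - 1])) % mod
        if h in seen:
            for j in seen[h]:
                if s[j:j + k] == s[i:i + k]:  # verify, since hashes can collide
                    return s[i:i + k]
            seen[h].append(i)
        else:
            seen[h] = [i]
    return None

print(first_repeated_rolling("banana", 2))  # an
```

Expected time is O(n); the verification step only costs extra when hashes collide.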
You can use a LinkedHashMap, which iterates in insertion order:
public static String findRepeated(String s, int k) {
    Map<String, Integer> map = new LinkedHashMap<String, Integer>();
    for (int i = 0; i <= s.length() - k; i++) {
        String temp = s.substring(i, i + k);
        if (!map.containsKey(temp)) {
            map.put(temp, 1);
        } else {
            map.put(temp, map.get(temp) + 1);
        }
    }
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        if (entry.getValue() > 1) {
            return entry.getKey();
        }
    }
    return "no such value";
}
I'm looking for an efficient algorithm to do string tiling. Basically, you are given a list of strings, say BCD, CDE, ABC, A, and the resulting tiled string should be ABCDE, because BCD aligns with CDE yielding BCDE, which is then aligned with ABC yielding the final ABCDE.
Currently, I'm using a slightly naïve algorithm that works as follows. Starting with a random pair of strings, say BCD and CDE, I use the following (in Java):
public static String tile(String first, String second) {
    for (int i = 0; i < first.length() || i < second.length(); i++) {
        // "right" tile (e.g., "BCD" and "CDE")
        String firstTile = first.substring(i);
        // "left" tile (e.g., "CDE" and "BCD")
        String secondTile = second.substring(i);
        if (second.contains(firstTile)) {
            return first.substring(0, i) + second;
        } else if (first.contains(secondTile)) {
            return second.substring(0, i) + first;
        }
    }
    return EMPTY;
}
System.out.println(tile("CDE", "ABCDEF")); // ABCDEF
System.out.println(tile("BCD", "CDE")); // BCDE
System.out.println(tile("CDE", "ABC")); // ABCDE
System.out.println(tile("ABC", tile("BCX", "XYZ"))); // ABCXYZ
Although this works, it's not very efficient, as it iterates over the same characters over and over again.
So, does anybody know a better (more efficient) algorithm to do this? This problem is similar to a DNA sequence alignment problem, so any advice from someone in this field (and others, of course) is very much welcome. Also note that I'm not looking for an alignment but a tiling, because I require a full overlap of one of the strings over the other.
I'm currently looking for an adaptation of the Rabin-Karp algorithm, in order to improve the asymptotic complexity of the algorithm, but I'd like to hear some advice before delving any further into this matter.
Thanks in advance.
For situations where there is ambiguity (e.g., {ABC, CBA}, which could result in ABCBA or CBABC), any tiling can be returned. However, this situation seldom occurs, because I'm tiling words, e.g. {This is, is me} => {This is me}, which are manipulated so that the aforementioned algorithm works.
Similar question: Efficient Algorithm for String Concatenation with Overlap
Order the strings by the first character, then length (smallest to largest), and then apply the adaptation of KMP found in this question about concatenating overlapping strings.
I think this should work for the tiling of two strings, and be more efficient than your current implementation using substring and contains. Conceptually, I loop across the characters in the 'left' string and compare them to a character in the 'right' string. If the two characters match, I move to the next character in the right string. Depending on which string's end is reached first, and on whether the last compared characters match, one of the possible tiling cases is identified.
I haven't thought of anything to improve the time complexity of tiling more than two strings. As a small note for multiple strings: the algorithm below is easily extended to checking the tiling of a single 'left' string against multiple 'right' strings at once, which might save some extra looping over the strings if you're trying to find out whether to do ("ABC", "BCX", "XYZ") or ("ABC", "XYZ", "BCX") by just trying all the possibilities.
string Tile(string a, string b)
{
    // Try both orderings of a and b,
    // since TileLeftToRight is not commutative.
    string ab = TileLeftToRight(a, b);
    if (ab != "")
        return ab;
    return TileLeftToRight(b, a);
    // Alternatively you could return whichever
    // of the two results is longest, for cases
    // like ("ABC", "BCABC").
}

string TileLeftToRight(string left, string right)
{
    int i = 0;
    int j = 0;
    while (true)
    {
        if (left[i] != right[j])
        {
            // Mismatch: restart the comparison one position
            // after the point where this match attempt began.
            i = i - j + 1;
            j = 0;
            if (i >= left.Length)
                return "";
        }
        else
        {
            i++;
            j++;
            if (i >= left.Length)
                return left + right.Substring(j);
            if (j >= right.Length)
                return left;
        }
    }
}
If Open Source code is acceptable, then you should check the genome benchmarks in Stanford's STAMP benchmark suite: it does pretty much exactly what you're looking for. Starting with a bunch of strings ("genes"), it looks for the shortest string that incorporates all the genes. So for example if you have ATGC and GCAA, it'll find ATGCAA. There's nothing about the algorithm that limits it to a 4-character alphabet, so this should be able to help you.
The first thing to ask is what you want the tiling of {CDB, CDA} to be: there is no single tiling.
Interesting problem. You need some kind of backtracking. For example, if you have:
ABC, BCD, DBC
Combining DBC with BCD results in:
ABC, DBCD
Which is not solvable. But combining ABC with BCD results in:
ABCD, DBC
Which can be combined to:
ABCDBC.
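A small Python sketch of this backtracking idea (overlap_merge and tile_all are illustrative names; this is exponential in the worst case, so it only suits small inputs):

```python
def overlap_merge(a, b):
    # merge b onto the end of a when a suffix of a equals a prefix of b,
    # or when one string already contains the other
    if b in a:
        return a
    if a in b:
        return b
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None  # no overlap: this pairing is a dead end

def tile_all(strings):
    # try every ordered pair of merges; backtrack when a branch dead-ends
    if len(strings) == 1:
        return strings[0]
    for i in range(len(strings)):
        for j in range(len(strings)):
            if i == j:
                continue
            merged = overlap_merge(strings[i], strings[j])
            if merged is not None:
                rest = [s for idx, s in enumerate(strings) if idx not in (i, j)]
                result = tile_all(rest + [merged])
                if result is not None:
                    return result
    return None

print(tile_all(["ABC", "BCD", "DBC"]))  # ABCDBC
```

On the example above, merging DBC with BCD first would dead-end, but the recursion backs out of that branch and finds the ABC + BCD ordering instead.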