Filter only digit sequences containing a given set of digits - performance

I have a large list of digit strings like this one. The individual strings are relatively short (say less than 50 digits).
data = [
I need to find an efficient data structure (speed first, memory second) and algorithm that returns only those strings that are composed of a given set of digits.
Example results:
filter(data, [0,3,4]) = ['300303334']
filter(data, [0,1,2,3,4,5]) = ['300303334', '53210234']
The data list will usually fit into memory.

For each digit, precompute a postings list of the indices of the strings that don't contain that digit.
postings = [[] for _ in range(10)]
for i, d in enumerate(data):
    for j in range(10):
        digit = str(j)
        if digit not in d:
            postings[j].append(i)
Now, to find all strings that contain, for example, only the digits [1, 3, 5], you can intersect the postings lists for the other digits (i.e. 0, 2, 4, 6, 7, 8, 9).
def intersect_postings(p0, p1):
    # Both inputs are sorted iterators of indices; stop when either runs out.
    try:
        i0, i1 = next(p0), next(p1)
        while True:
            if i0 == i1:
                yield i0
                i0, i1 = next(p0), next(p1)
            elif i0 < i1:
                i0 = next(p0)
            else:
                i1 = next(p1)
    except StopIteration:
        return

def find_all(digits):
    p = None
    for d in range(10):
        if d not in digits:
            if p is None:
                p = iter(postings[d])
            else:
                p = intersect_postings(p, iter(postings[d]))
    return (data[i] for i in p) if p is not None else iter(data)

print(list(find_all([0, 3, 4])))
print(list(find_all([0, 1, 2, 3, 4, 5])))

The set of digits in a string can be encoded as a 10-bit number, one bit per digit. There are 2^10, or 1,024, possible values.
So create a dictionary that uses an integer for a key and a list of strings for the value.
Calculate the value for each string and add that string to the list of strings for that value.
General idea:
Dictionary Lookup;
for each (string in list)
    value = 0;
    for each character in string
        set bit N in value, where N is the character (0-9)
    Lookup[value] += string // adds string to list for this value in dictionary
Then, to get a list of the strings that match your criteria, just compute the value and do a direct dictionary lookup.
So if the user asks for strings that contain only 3, 5, and 7:
value = (1 << 3) | (1 << 5) | (1 << 7);
list = Lookup[value];
Note that, as Matt pointed out in a comment below, this will only return strings that contain all three digits. So, for example, it wouldn't return 37. That seems like a fatal flaw to me.
If the number of symbols you have to deal with is very large, then the number of possible combinations becomes too large for this solution to be practical.
With a large number of symbols, I'd recommend an inverted index as suggested in the comments, combined with a secondary filter that removes the strings that contain extraneous digits.
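As a rough illustration of that last suggestion, here is a minimal Python sketch of an inverted index followed by a secondary filter. The function names (build_index, strings_using_only) and the sample strings are made up for this example, and the candidate step (union of the postings for the allowed symbols) is only one possible way to use the index.

```python
from collections import defaultdict

def build_index(strings):
    """Map each symbol to the set of positions of the strings containing it."""
    index = defaultdict(set)
    for pos, s in enumerate(strings):
        for symbol in set(s):
            index[symbol].add(pos)
    return index

def strings_using_only(strings, index, allowed):
    """Return the strings whose symbols all belong to `allowed`."""
    allowed = set(allowed)
    # Candidate step: every string containing at least one allowed symbol.
    candidates = set()
    for symbol in allowed:
        candidates |= index.get(symbol, set())
    # Secondary filter: drop candidates that also contain extraneous symbols.
    return [strings[pos] for pos in sorted(candidates)
            if set(strings[pos]) <= allowed]

# Illustrative sample data (hypothetical, not the asker's full list).
sample = ["300303334", "53210234", "123456789", "5374576807063874"]
index = build_index(sample)
print(strings_using_only(sample, index, "034"))     # ['300303334']
print(strings_using_only(sample, index, "012345"))  # ['300303334', '53210234']
```

The index only narrows the candidates; the subset test in the last step is what guarantees the result contains no extraneous symbols.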

Consider a function f which constructs a bitmask for each string with bit i set if digit i is in the string.
For example,
f('0') = 0b0000000001
f('00') = 0b0000000001
f('1') = 0b0000000010
f('1100') = 0b0000000011
Then I suggest storing a list of strings for each bitmask.
For example,
Bitmask 0b0000000001 -> ['0','00']
Once you have prepared this data structure (which is the same size as your original list), you can then easily access all the strings for a particular filter by accessing all lists where the bitmask is a subset of the digits in your filter.
So for your example of filter [0,3,4] you would return the lists from:
Strings containing just 0
Strings containing just 3
Strings containing just 4
Strings containing 0 and 3
Strings containing 0 and 4
Strings containing 3 and 4
Strings containing 0 and 3 and 4
Example Python Code
from collections import defaultdict
import itertools
raw_data = [
def preprocess(raw_data):
    data = defaultdict(list)
    for s in raw_data:
        bitmask = 0
        for digit in s:
            bitmask |= 1<<int(digit)
        data[bitmask].append(s)
    return data

def filter(data,mask):
    # Visit every non-empty subset of the allowed digits.
    for r in range(len(mask)):
        for m in itertools.combinations(mask,r+1):
            bitmask = sum(1<<digit for digit in m)
            for s in data[bitmask]:
                yield s

data = preprocess(raw_data)
for a in filter(data, [0,1,2,3,4,5]):
    print(a)
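An alternative way to visit every non-empty subset of the filter is the standard submask trick, sub = (sub - 1) & full, which steps through the subsets of a bitmask without itertools. The sketch below is only an illustration; the names digit_mask, build_buckets and strings_within, and the sample strings, are assumptions for this example.

```python
from collections import defaultdict

def digit_mask(s):
    """Bit i is set when digit i occurs in the string."""
    mask = 0
    for ch in s:
        mask |= 1 << int(ch)
    return mask

def build_buckets(strings):
    """Group the strings by the bitmask of the digits they contain."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[digit_mask(s)].append(s)
    return buckets

def strings_within(buckets, allowed_digits):
    """Yield every string whose digits all belong to allowed_digits."""
    full = 0
    for d in allowed_digits:
        full |= 1 << d
    sub = full
    while sub:
        # .get avoids creating empty entries in the defaultdict
        for s in buckets.get(sub, ()):
            yield s
        sub = (sub - 1) & full  # next smaller submask of `full`

# Illustrative sample data (hypothetical).
sample = ["300303334", "53210234", "123456789"]
buckets = build_buckets(sample)
print(sorted(strings_within(buckets, [0, 3, 4])))           # ['300303334']
print(sorted(strings_within(buckets, [0, 1, 2, 3, 4, 5])))  # ['300303334', '53210234']
```

With 10 digits there are at most 1,023 non-empty submasks per query, so the lookup cost is bounded regardless of how many strings are stored.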

Just for kicks, I have coded up Jim's lovely algorithm and the Perl is here if anyone wants to play with it. Please do not accept this as an answer or anything, pass all credit to Jim:
use strict;
use warnings;
my $Debug=1;
my $Nwords=1000;
my ($word,$N,$value,$i,$j,$k);
my (@dictionary,%Lookup);
# Generate "words" with random number of characters 5-30
print "DEBUG: Generating $Nwords word dictionary\n" if $Debug;
$j = rand(25) + 5; # length of this word
$word = $word . int(rand(10));
print "$word\n" if $Debug;
# Add some obvious test cases
$dictionary[++$i]="0" x 50;
$dictionary[++$i]="1" x 50;
$dictionary[++$i]="2" x 50;
$dictionary[++$i]="3" x 50;
$dictionary[++$i]="4" x 50;
$dictionary[++$i]="5" x 50;
$dictionary[++$i]="6" x 50;
$dictionary[++$i]="7" x 50;
$dictionary[++$i]="8" x 50;
$dictionary[++$i]="9" x 50;
# Encode words
for $word (@dictionary){
$value |= 1 << $N;
print "DEBUG: $word encoded as $value\n" if $Debug;
# Do lookups
print "Enter permitted digits, separated with commas: ";
my $line=<STDIN>;
my @digits=split(",",$line);
for my $d (@digits){
   $value |= 1<<$d;
}
print "Value: $value\n";
print join(", ",@{$Lookup{$value}}),"\n\n" if defined $Lookup{$value};

I like Jim Mischel's approach. It has pretty efficient look up and bounded memory usage. Code in C follows:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <readline/readline.h>
#include <readline/history.h>
enum {
zero = '0',
nine = '9',
numbers = nine - zero + 1,
masks = 1 << numbers,
typedef uint16_t mask;
struct list {
char *s;
struct list *next;
typedef struct list list_cell;
typedef struct list *list;
static inline int is_digit(char c) { return c >= zero && c <= nine; }
static inline mask char2mask(char c) { return 1 << (c - zero); }
static inline mask add_char2mask(mask m, char c) {
return m | (is_digit(c) ? char2mask(c) : 0);
static inline int is_set(mask m, mask n) { return (m & n) != 0; }
static inline int is_set_char(mask m, char c) { return is_set(m, char2mask(c)); }
static inline int is_submask(mask sub, mask m) { return (sub & m) == sub; }
static inline char *sprint_mask(char buf[11], mask m) {
char *s = buf;
char i;
for(i = zero; i <= nine; i++)
if(is_set_char(m, i)) *s++ = i;
*s = 0;
return buf;
static inline mask get_mask(char *s) {
mask m=0;
for(; *s; s++)
m = add_char2mask(m, *s);
return m;
static inline int is_empty(list l) { return !l; }
static inline list insert(list *l, char *s) {
list cell = (list)malloc(sizeof(list_cell));
cell->s = s;
cell->next = *l;
return *l = cell;
static void *foreach(void *f(char *, void *), list l, void *init) {
for(; !is_empty(l); l = l->next)
init = f(l->s, init);
return init;
struct printer_state {
int first;
FILE *f;
static void *prin_list_member(char *s, void *data) {
struct printer_state *st = (struct printer_state *)data;
if(st->first) {
fputs(", ", st->f);
} else
st->first = 1;
fputs(s, st->f);
return data;
static void print_list(list l) {
struct printer_state st = {.first = 0, .f = stdout};
foreach(prin_list_member, l, (void *)&st);
static list *init_lu(void) { return (list *)calloc(sizeof(list), masks); }
static list *insert2lu(list lu[masks], char *s) {
mask i, m = get_mask(s);
if(m) // skip string without any number
for(i = m; i < masks; i++)
if(is_submask(m, i))
insert(lu+i, s);
return lu;
int usage(const char *name) {
fprintf(stderr, "Usage: %s filename\n", name);
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
static inline void chomp(char *s) { if( (s = strchr(s, '\n')) ) *s = '\0'; }
list *load_file(FILE *f) {
char *line = NULL;
size_t len = 0;
ssize_t read;
list *lu = init_lu();
for(; (read = getline(&line, &len, f)) != -1; line = NULL) {
insert2lu(lu, line);
return lu;
void read_reqs(list *lu) {
char *line;
char buf[11];
for(; (line = readline("> ")); free(line))
if(*line) {
mask m = get_mask(line);
printf("mask: %s\nstrings: ", sprint_mask(buf, m));
int main(int argc, const char* argv[] ) {
const char *name = argv[0];
FILE *f;
list *lu;
if(argc != 2) return usage(name);
f = fopen(argv[1], "r");
if(!f) handle_error("open");
lu = load_file(f);
To compile use
gcc -o digitfilter digitfilter.c -lreadline
And test run:
$ cat data.txt
$ ./digitfilter data.txt
> 034
mask: 034
strings: 300303334
> 0,1,2,3,4,5
mask: 012345
strings: 53210234, 300303334
> 0345678
mask: 0345678
strings: 5374576807063874, 300303334

Put the digits of each value into a set, e.g. '300303334' = {3, 0, 4}.
Since the length of each data item is bounded by a constant (50), you can do this in O(1) time per item using a Java HashSet. The overall complexity of this phase adds up to O(n).
For each filter set, use containsAll() of HashSet to check whether each data item's digit set is a subset of your filter. This takes O(n) per filter.
Overall this takes O(m*n), where n is the number of data items and m is the number of filters.
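The same idea can be sketched in Python, where the subset test is the <= operator on sets (the counterpart of Java's containsAll). The helper names and sample strings below are assumptions for illustration only.

```python
def digit_set(s):
    """Set of distinct digit characters in s, e.g. '300303334' -> {'3', '0', '4'}."""
    return frozenset(s)

def filter_items(items, allowed_digits):
    """Keep the items whose digit sets are subsets of the filter."""
    allowed = frozenset(str(d) for d in allowed_digits)
    return [s for s, digits in items if digits <= allowed]

# Illustrative sample data (hypothetical).
sample = ["300303334", "53210234", "123456789"]
items = [(s, digit_set(s)) for s in sample]     # O(n) preprocessing phase
print(filter_items(items, [0, 3, 4]))           # ['300303334']
print(filter_items(items, [0, 1, 2, 3, 4, 5]))  # ['300303334', '53210234']
```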


Algorithm to print all permutations with repetition of numbers

I have designed an algorithm to print all the permutations of a string's characters with repetition. But the algorithm has a flaw: it works only if the characters of the string are unique.
Can someone help me extend the algorithm to the case where the characters of the string may not be unique?
My code so far :
#include <iostream>
#include <cstring>
#include <cstdlib>
using namespace std;
void _perm(char *arr, char *result, int index)
{
    static int count = 1;
    if (index == strlen(arr)) {
        cout << count++ << ". " << result << endl;
        return;
    }
    for (int i = 0; i < strlen(arr); i++) {
        result[index] = arr[i];
        _perm(arr, result, index + 1);
    }
}
int compare(const void *a, const void *b)
{
    return (*(char*)a - *(char*)b);
}
void perm(char *arr)
{
    int n = strlen(arr);
    if (n == 0)
        return;
    qsort(arr, n, sizeof(char), compare);
    char *data = new char[n + 1]();  // zero-filled, with room for the terminator
    _perm(arr, data, 0);
    delete[] data;
}
int main()
{
    char arr[] = "BACD";
    perm(arr);
    return 0;
}
I am printing the output strings in lexicographically sorted way.
I am referring to the example.3 from this page.
Your code doesn't print permutations, but four draws from the string pool with repetition. It will produce 4^4 == 256 combinations, one of which is "AAAA".
The code Karnuakar linked to will give you permutations of a string, but without distinguishing between the multiple occurrences of certain letters. You need some means to prevent recursing with the same letter in each recursion step. In C++, this can be done with a set.
The example code below uses a typical C string, but uses the terminating '\0' to detect the end. The C-string functions from <cstring> are not needed. The output will not be sorted unless the original string was sorted.
#include <iostream>
#include <algorithm>
#include <set>
using namespace std;

void perm(char *str, int index = 0)
{
    std::set<char> used;   // letters already placed at this position
    char *p = str + index;
    char *q = p;

    if (*p == '\0') {
        std::cout << str << std::endl;
        return;
    }

    while (*q) {
        if (used.find(*q) == used.end()) {
            std::swap(*p, *q);
            perm(str, index + 1);
            std::swap(*p, *q);
            used.insert(*q);
        }
        q++;
    }
}

int main()
{
    char arr[] = "AAABB";
    perm(arr);
    return 0;
}
This will produce 5! == 120 permutations for "ABCDE", but only 5! / (2! 3!) == 10 unique permutations for "AAABB". It will also create the 1260 permutations from the linked exercise.
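For comparison, the same "remember which letters were already tried at this position" idea can be sketched in Python. The function name unique_permutations is an assumption for this example; it mirrors the swap-and-recurse structure described above rather than any library routine.

```python
def unique_permutations(chars, index=0):
    """Yield each distinct ordering of the list `chars` exactly once."""
    if index == len(chars):
        yield "".join(chars)
        return
    used = set()  # letters already placed at position `index`
    for i in range(index, len(chars)):
        if chars[i] in used:
            continue
        used.add(chars[i])
        chars[index], chars[i] = chars[i], chars[index]
        for p in unique_permutations(chars, index + 1):
            yield p
        chars[index], chars[i] = chars[i], chars[index]  # undo the swap

perms = list(unique_permutations(list("AAABB")))
print(len(perms))  # 10
```

As with the C++ version, the output is not sorted unless you sort the results yourself.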

All of the options to replace an unknown number of characters

I am trying to find an algorithm that for an unknown number of characters in a string, produces all of the options for replacing some characters with stars.
For example, for the string "abc", the output should be:
It is simple enough with a known number of stars: just run through all of the options with for loops. But I'm having difficulty generating all of the options for an arbitrary number of stars.
Every star combination corresponds to a binary number, so you can use a simple loop
for i = 1 to 2^n-1
where n is the string length,
and put stars at the positions of the 1-bits in the binary representation of i.
For example: i=5=101b => * b *
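A minimal Python sketch of this loop follows. The function name star_variants is made up for the example, and reading the bits from the left end of the string is an assumption chosen so that i = 5 = 101b gives "*b*" as in the example above.

```python
def star_variants(s):
    """Yield s with every non-empty subset of its characters replaced by '*'."""
    n = len(s)
    for i in range(1, 2 ** n):
        # Bit (n - 1 - j) of i decides whether position j becomes a star.
        yield "".join("*" if (i >> (n - 1 - j)) & 1 else ch
                      for j, ch in enumerate(s))

print(list(star_variants("abc")))
# ['ab*', 'a*c', 'a**', '*bc', '*b*', '**c', '***']
```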
This is basically a binary increment problem.
You can create a vector of integer variables to represent a binary array isStar and for each iteration you "add one" to the vector.
bool AddOne(int* isStar, int size) {
    isStar[size - 1] += 1;
    for (int i = size - 1; i >= 0; i--) {
        if (isStar[i] > 1) {
            if (i == 0) { return true; }  // overflow: every combination has been visited
            isStar[i] = 0;
            isStar[i - 1] += 1;           // carry into the next position
        }
    }
    return false;
}
That way you still have the original string while replacing the characters.
This is a simple binary counting problem, where * corresponds to a 1 and the original letter to a 0. So you could do it with a counter, applying a bit mask to the string, but it's just as easy to do the "counting" in place.
Here's a simple implementation in C++:
(Edit: The original question seems to imply that at least one character must be replaced with a star, so the count should start at 1 instead of 0. Or, in the following, the post-test do should be replaced with a pre-test for.)
#include <iostream>
#include <string>

// A cleverer implementation would implement C++'s iterator protocol.
// But that would cloud the simple logic of the algorithm.
class StarReplacer {
  public:
    StarReplacer(const std::string& s): original_(s), current_(s) {}
    const std::string& current() const { return current_; }

    // returns true unless we're at the last possibility (all stars),
    // in which case it returns false but still resets current to the
    // original configuration.
    bool advance() {
      for (int i = current_.size()-1; i >= 0; --i) {
        if (current_[i] == '*') current_[i] = original_[i];
        else {
          current_[i] = '*';
          return true;
        }
      }
      return false;
    }

  private:
    std::string original_;
    std::string current_;
};

int main(int argc, const char** argv) {
  for (int a = 1; a < argc; ++a) {
    StarReplacer r(argv[a]);
    do {
      std::cout << r.current() << std::endl;
    } while (r.advance());
    std::cout << std::endl;
  }
  return 0;
}

Reorder a string by half the character

This is an interview question.
Given a string such as 123456abcdef, consisting of n/2 integers followed by n/2 characters, reorder the string to 1a2b3c4d5e6f. The algorithm should be in-place.
The solution I gave was trivial - O(n^2). Just shift the characters by n/2 places to the left.
I tried using recursion as follows:
a. Swap the second half of the first part with the first half of the second part, e.g.
123 456 abc def
123 abc 456 def
b. Recurse on the two halves.
The problem I am stuck on is that the swapping varies with the number of elements, e.g.
What to do next?
123 abc
12ab 3c
And what to do for : 12345 abcde
123abc 45ab
This is a pretty old question and may be a duplicate. Please let me know.. :)
Another example:
Input: 38726zfgsa
Output: 3z8f7g2s6a
Here's how I would approach the problem:
1) Divide the string into two partitions, number part and letter part
2) Divide each of those partitions into two more (equal sized)
3) Swap the second and third partitions (inner number and inner letter)
4) Recurse on the original two partitions (with their newly swapped bits)
5) Stop when the partition has a size of 2
For example:
123456abcdef -> 123456 abcdef -> 123 456 abc def -> 123 abc 456 def
123abc -> 123 abc -> 12 3 ab c -> 12 ab 3 c
12 ab -> 1 2 a b -> 1 a 2 b
... etc
And the same for the other half of the recursion..
All can be done in place with the only gotcha being swapping partitions that aren't the same size (but it'll be off by one, so not difficult to handle).
It is easy to permute an array in place by chasing elements round cycles if you have a bit-map to mark which elements have been moved. We don't have a separate bit-map, but IF your characters are letters (or at least have the high order bit clear) then we can use the top bit of each character to mark this. This produces the following program, which is not recursive and so does not use stack space.
class XX
/** new position given old position */
static int newFromOld(int x, int n)
if (x < n / 2)
return x * 2;
return (x - n / 2) * 2 + 1;
private static int HIGH_ORDER_BIT = 1 << 15; // 16-bit chars
public static void main(String[] s)
// input data - create an array so we can modify
// characters in place
char[] x = s[0].toCharArray();
if ((x.length & 1) != 0)
System.err.println("Only works with even length strings");
// Character we have read but not yet written, if any
char holding = 0;
// where character in hand was read from
int holdingPos = 0;
// whether picked up a character in our hand
boolean isHolding = false;
int rpos = 0;
while (rpos < x.length)
{ // Here => moved out everything up to rpos
// and put in place with top bit set to mark new occupant
if (!isHolding)
{ // advance read pointer to read new character
char here = x[rpos];
holdingPos = rpos++;
if ((here & HIGH_ORDER_BIT) != 0)
// already dealt with
int targetPos = newFromOld(holdingPos, x.length);
// pick up char at target position
holding = x[targetPos];
// place new character, and mark as new
x[targetPos] = (char)(here | HIGH_ORDER_BIT);
// Now holding a character that needs to be put in its
// correct place
isHolding = true;
holdingPos = targetPos;
int targetPos = newFromOld(holdingPos, x.length);
char here = x[targetPos];
if ((here & HIGH_ORDER_BIT) != 0)
{ // back to where we picked up a character to hold
isHolding = false;
x[targetPos] = (char)(holding | HIGH_ORDER_BIT);
holding = here;
holdingPos = targetPos;
for (int i = 0; i < x.length; i++)
System.out.println("Result is " + new String(x));
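The cycle-chasing idea can also be sketched in Python. Note the assumption: this version uses a separate list of flags (moved) as the bit-map instead of borrowing the high-order bit of each character, so it needs O(n) extra space; the function names are made up for the example.

```python
def new_from_old(x, n):
    """New position of the element that starts at position x."""
    return 2 * x if x < n // 2 else 2 * (x - n // 2) + 1

def interleave_in_place(items):
    """Rearrange a list like 123456abcdef into 1a2b3c4d5e6f by following cycles."""
    n = len(items)
    if n % 2:
        raise ValueError("only works with even length input")
    moved = [False] * n  # separate bit-map: which slots already hold their final value
    for start in range(n):
        if moved[start]:
            continue
        pos = start
        holding = items[pos]  # element picked up but not yet written
        while True:
            target = new_from_old(pos, n)
            # Put the held element down and pick up the one it displaces.
            holding, items[target] = items[target], holding
            moved[target] = True
            pos = target
            if pos == start:  # back where this cycle began
                break
    return items

print("".join(interleave_in_place(list("123456abcdef"))))  # 1a2b3c4d5e6f
```

Starting a new cycle at every slot that has not been written yet is what makes this work for lengths whose positions split into several cycles.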
These days, if I asked someone that question, what I'm looking for them to write on the whiteboard first is:
and then maybe ask for more examples.
(And then, depending, if the task is to interleave numbers & letters, I think you can do it with two walking-pointers, indexLetter and indexDigit, and advance them across swapping as needed til you reach the end.)
In your recursive solution why don't you just make a test if n/2 % 2 == 0 (n%4 ==0 ) and treat the 2 situations differently
As templatetypedef commented your recursion cannot be in-place.
But here is a solution (not in place) using the way you wanted to make your recursion :
def f(s):
    n = len(s)
    if n == 2:  # base case
        return s
    elif n % 4 == 0:  # if n % 4 == 0 it's easy
        return f(s[:n//4] + s[n//2:3*n//4]) + f(s[n//4:n//2] + s[3*n//4:])
    else:  # otherwise, (n - 2) % 4 == 0
        return s[0] + s[n//2] + f(s[1:n//2] + s[n//2+1:])
Here we go. Recursive, cuts it in half each time, and in-place. Uses the approach outlined by @Chris Mennie. Getting the splitting right was tricky. A lot longer than Python, innit?
/* In-place, divide-and-conquer, recursive riffle-shuffle of strings;
* even length only. No wide characters or Unicode; old school. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void testrif(const char *s);
void riffle(char *s);
void rif_recur(char *s, size_t len);
void swap(char *s, size_t midpt, size_t len);
void flip(char *s, size_t len);
void if_odd_quit(const char *s);
int main(void)
return 0;
void testrif(const char *s)
char mutable[20];
strcpy(mutable, s);
printf("'%s'\n", mutable);
printf("'%s'\n\n", mutable);
void riffle(char *s)
rif_recur(s, strlen(s));
void rif_recur(char *s, size_t len)
/* Turn, e.g., "abcde12345" into "abc123de45", then recurse. */
size_t pivot = len / 2;
size_t half = (pivot + 1) / 2;
size_t twice = half * 2;
if (len < 4)
swap(s + half, pivot - half, pivot);
rif_recur(s, twice);
rif_recur(s + twice, len - twice);
void swap(char *s, size_t midpt, size_t len)
/* Swap s[0..midpt] with s[midpt..len], in place. Algorithm from
* Programming Pearls, Chapter 2. */
flip(s, midpt);
flip(s + midpt, len - midpt);
flip(s, len);
void flip(char *s, size_t len)
/* Reverse order of characters in s, in place. */
char *p, *q, tmp;
if (len < 2)
for (p = s, q = s + len - 1; p < q; p++, q--) {
tmp = *p;
*p = *q;
*q = tmp;
void if_odd_quit(const char *s)
if (strlen(s) % 2) {
fputs("String length is odd; aborting.\n", stderr);
By comparing 123456abcdef and 1a2b3c4d5e6f we can note that only the first and the last characters are in their correct positions. We can also note that for each of the remaining n-2 characters we can compute its correct position directly from its original position. Moving a character there displaces the element that was there, which was not in its correct position either, so it in turn has to replace another one. When the n-2 positions form a single cycle, as they do for n = 12, n-2 such steps put every element in its correct position (for some lengths, e.g. n = 8, the positions split into several shorter cycles, and each cycle then needs its own starting point):
void funny_sort(char* arr, int n){
    int pos = 1;                                   // first unordered element
    char aux = arr[pos];
    for (int iter = 0; iter < n-2; iter++) {       // n-2 unordered elements
        pos = (pos < n/2) ? pos*2 : (pos-n/2)*2+1; // correct pos for aux
        swap(&aux, arr + pos);                     // assumed helper: exchanges the two chars pointed to
    }
}
Score each digit as its numerical value. Score each letter as a = 1.5, b = 2.5 c = 3.5 etc. Run an insertion sort of the string based on the score of each character.
[ETA] Simple scoring won't work so use two pointers and reverse the piece of the string between the two pointers. One pointer starts at the front of the string and advances one step each cycle. The other pointer starts in the middle of the string and advances every second cycle.
^ ^
^ ^
^ ^
^ ^

Generate all combinations of arbitrary alphabet up to arbitrary length

Say I have an array of arbitrary size holding single characters. I want to compute all possible combinations of those characters up to an arbitrary length.
So let's say my array is [1, 2, 3]. The user-specified length is 2. Then the possible combinations are [11, 22, 33, 12, 13, 23, 21, 31, 32].
I'm having real trouble finding a suitable algorithm that allows arbitrary lengths and doesn't just permute the array. Oh, and while speed is not absolutely critical, it should be reasonably fast too.
Just do an add with carry.
Say your array contained 4 symbols and you want ones of length 3.
Start with 000 (i.e. each symbol on your word = alphabet[0])
Then add up:
The algorithm (given these indices) is just to increase the lowest number. If it reaches the number of symbols in your alphabet, increase the previous number (following the same rule) and set the current to 0.
C++ code:
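#include <string>
#include <vector>
int N_LETTERS = 4;
char alphabet[] = {'a', 'b', 'c', 'd'};
std::vector<std::string> get_all_words(int length)
{
    std::vector<int> index(length, 0);
    std::vector<std::string> words;
    while (true) {
        std::string word(length, ' ');
        for (int i = 0; i < length; ++i)
            word[i] = alphabet[index[i]];
        words.push_back(word);
        for (int i = length-1; ; --i) {
            if (i < 0) return words;   // carried past the first position: done
            index[i]++;
            if (index[i] == N_LETTERS)
                index[i] = 0;          // wrap around and carry to the left
            else
                break;
        }
    }
}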
Code is untested, but should do the trick.
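For readers who prefer Python, here is a minimal sketch of the same add-with-carry idea written as a generator, so the words are produced one at a time instead of being collected in a vector. The function name all_words is an assumption for this example.

```python
def all_words(alphabet, length):
    """Yield every word of the given length over alphabet, odometer style."""
    index = [0] * length  # one counter per position, all starting at alphabet[0]
    while True:
        yield "".join(alphabet[i] for i in index)
        pos = length - 1
        while pos >= 0:
            index[pos] += 1
            if index[pos] < len(alphabet):
                break           # no carry needed
            index[pos] = 0      # wrap around and carry into the position to the left
            pos -= 1
        if pos < 0:
            return              # carried past the first position: all words produced

print(list(all_words("abcd", 2))[:5])  # ['aa', 'ab', 'ac', 'ad', 'ba']
```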
Knuth covers combinations and permutations in some depth in The Art of Computer Programming, vol 1. Here is an implementation of one of his algorithms I wrote some years ago (don't hate on the style, it's ancient code):
#include <algorithm>
#include <vector>
#include <functional>
#include <iostream>
using namespace std;
template<class BidirectionalIterator, class Function, class Size>
Function _permute(BidirectionalIterator first, BidirectionalIterator last, Size k, Function f, Size n, Size level)
// This algorithm is adapted from Donald Knuth,
// "The Art of Computer Programming, vol. 1, p. 45, Method 1"
// Thanks, Donald.
for( Size x = 0; x < (n-level); ++x ) // rotate every possible value in to this level's slot
if( (level+1) < k )
// if not at max level, recurse down to twirl higher levels first
f = _permute(first,last,k,f,n,level+1);
// we are at highest level, this is a unique permutation
BidirectionalIterator permEnd = first;
advance(permEnd, k);
// rotate next element in to this level's position & continue
BidirectionalIterator rotbegin(first);
BidirectionalIterator rotmid(rotbegin);
return f;
template<class BidirectionalIterator, class Function, class Size>
Function for_each_permutation(BidirectionalIterator first, BidirectionalIterator last, Size k, Function fn)
return _permute<BidirectionalIterator,Function,Size>(first, last, k, fn, distance(first,last), 0);
template<class Elem>
struct DumpPermutation : public std::binary_function<bool, Elem* , Elem*>
bool operator()(Elem* begin, Elem* end) const
cout << "[";
copy(begin, end, ostream_iterator<Elem>(cout, " "));
cout << "]" << endl;
return true;
int main()
int ary[] = {1, 2, 3};
const size_t arySize = sizeof(ary)/sizeof(ary[0]);
for_each_permutation(&ary[0], &ary[arySize], 2, DumpPermutation<int>());
return 0;
Output of this program is:
[1 2 ]
[1 3 ]
[2 3 ]
[2 1 ]
[3 1 ]
[3 2 ]
If you want your combinations to include repeated elements like [11] [22] and [33], you can generate your list of combinations using the algorithm above, and then append to the generated list new elements, by doing something like this:
for( size_t i = 0; i < arySize; ++i )
{
    cout << "[";
    for( int j = 0; j < k; ++j )
        cout << ary[i] << " ";
    cout << "]" << endl;
}
...and the program output now becomes:
[1 2 ]
[1 3 ]
[2 3 ]
[2 1 ]
[3 1 ]
[3 2 ]
[1 1 ]
[2 2 ]
[3 3 ]
One way to do it would be with a simple counter that you internally interpret as base N, where N is the number of items in the array. You then extract each digit from the base N counter and use it as an index into your array. So if your array is [1,2] and the user specified length is 2, you have
Counter = 0, indexes are 0, 0
Counter = 1, indexes are 0, 1
Counter = 2, indexes are 1, 0
Counter = 3, indexes are 1, 1
The trick here will be your base-10 to base-N conversion code, which isn't terribly difficult.
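A small Python sketch of the counter approach, with the base-N digit extraction done by repeated divmod. The function name words_from_counter is made up for the example.

```python
def words_from_counter(alphabet, length):
    """Yield every sequence of `length` items by counting in base len(alphabet)."""
    n = len(alphabet)
    for counter in range(n ** length):
        digits = []
        value = counter
        for _ in range(length):
            value, digit = divmod(value, n)  # peel off the lowest base-n digit
            digits.append(digit)
        digits.reverse()                     # most significant digit first
        yield [alphabet[d] for d in digits]

for word in words_from_counter([1, 2], 2):
    print(word)
# [1, 1]
# [1, 2]
# [2, 1]
# [2, 2]
```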
If you know the length beforehand, all you need is some for loops. Say, for length = 3:
for ( i = 0; i < N; i++ )
    for ( j = 0; j < N; j++ )
        for ( k = 0; k < N; k++ )
            you now have ( i, j, k ), or a_i, a_j, a_k
Now to generalize it, just do it recursively, each step of the recursion with one of the for loops:
recurse( int[] a, int[] result, int index)
    if ( index == length ) { base case, process result; return }
    for ( i = 0; i < N; i++ ) {
        result[index] = a[i]
        recurse( a, result, index + 1 )
    }
Of course, if you simply want all combinations, you can just think of each combination as a k-digit number in base N, counting from 0 to N^k - 1, where k is the length.
Basically you would get, in base N (for k = 4):
0000 // take the first element four times
0001 // take the first element three times, then the second element
000(N-1) // take the first element three times, then take the N-th element
1000 // take the second element, then the first element three times
(N-1)(N-1)(N-1)(N-1) // take the last element four times
Using Peter's algorithm works great; however, if your letter set is too large or your string size too long, attempting to put all of the permutations in an array and returning the array won't work. The size of the array will be the size of the alphabet raised to the length of the string.
I created this in Perl to take care of the problem:
package Combiner;
#package used to grab all possible combinations of a set of letters. Gets one every call, allowing reduced memory usage and faster processing.
use strict;
use warnings;
#initiate to use nextWord
#arguments are an array reference for the list of letters and the number of characters to be in the generated strings.
sub new {
my ($class, $phoneList,$length) = @_;
my $self = bless {
phoneList => $phoneList,
length => $length,
N_LETTERS => scalar @$phoneList,
}, $class;
sub init {
my ($self) = shift;
$self->{lindex} = [(0) x $self->{length}];
$self->{end} = 0;
#returns all possible combinations of N phonemes, one at a time.
sub nextWord {
my $self = shift;
return 0 if $self->{end} == 1;
my $word = [('-') x $self->{length}];
$$word[$_] = ${$self->{phoneList}}[${$self->{lindex}}[$_]]
#treat the string like addition; loop through 000, 001, 002, 010, 020, etc.
for(my $i = $self->{length}-1;;$i--){
if($i < 0){
$self->{end} = 1;
return $word;
if (${$self->{lindex}}[$i] == $self->{N_LETTERS}){
${$self->{lindex}}[$i] = 0;
return $word;
Call it like this: my $c = Combiner->new(['a','b','c','d'],20);. Then call nextWord to grab the next word; if nextWord returns 0, it means it's done.
Here's my implementation in Haskell:
g :: [a] -> [[a]] -> [[a]]
g alphabet = concat . map (\xs -> [ xs ++ [s] | s <- alphabet])
allwords :: [a] -> [[a]]
allwords alphabet = concat $ iterate (g alphabet) [[]]
Load this script into GHCi. Suppose that we want to find all strings of length less than or equal to 2 over the alphabet {'a','b','c'}. The following GHCi session does that:
*Main> take 13 $ allwords ['a','b','c']
Or, if you want just the strings of length equal to 2:
*Main> filter (\xs -> length xs == 2) $ take 13 $ allwords ['a','b','c']
Be careful with allwords ['a','b','c'] for it is an infinite list!
I wrote this; it may be helpful to you.
#include <unistd.h>
void main()
FILE *file;
int i=0,f,l1,l2,l3=0;
char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890!@#$%&*.!@#$%^&*()";
int size=sizeof(set)-1;
char per[]="000";
//check urs all entered details here//
printf("Setlength=%d Comination are genrating\n",size);
// writing permutation here for length of 3//
//first for loop which control left most char printed in file//
// second for loop which control all intermediate char printed in file//
//third for loop which control right most char printed in file//
//apend file (add text to a file or create a file if it does not exist.//
file = fopen("file.txt","a+");
//writes array per to file named file.txt//
///Writing to file is completed//
printf("Genrating Combination %d\r",i);
printf("\n%d combination has been genrate out of entered data of length %d \n",i,size);
puts("No combination is left :) ");
puts("Press any butoon to exit");

Find the first un-repeated character in a string

What is the quickest way to find the first character which only appears once in a string?
It has to be at least O(n) because you don't know if a character will be repeated until you've read all characters.
So you can iterate over the characters and append each character to a list the first time you see it, and separately keep a count of how many times you've seen it (in fact the only values that matter for the count are "0", "1" or "more than 1").
When you reach the end of the string you just have to find the first character in the list that has a count of exactly one.
Example code in Python:
from collections import defaultdict

def first_non_repeated_character(s):
    counts = defaultdict(int)
    l = []  # characters in order of first appearance
    for c in s:
        counts[c] += 1
        if counts[c] == 1:
            l.append(c)

    for c in l:
        if counts[c] == 1:
            return c

    return None
This runs in O(n).
I see that people have posted some delightful answers below, so I'd like to offer something more in-depth.
An idiomatic solution in Ruby
We can find the first un-repeated character in a string like so:
def first_unrepeated_char string
  string.each_char.tally.find { |_, n| n == 1 }&.first
end
How does Ruby accomplish this?
Reading Ruby's source
Let's break down the solution and consider what algorithms Ruby uses for each step.
First we call each_char on the string. This creates an enumerator which allows us to visit the string one character at a time. This is complicated by the fact that Ruby handles Unicode characters, so each value we get from the enumerator can be a variable number of bytes. If we know our input is ASCII or similar, we could use each_byte instead.
The each_char method is implemented like so:
rb_str_each_char(VALUE str)
RETURN_SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
return rb_str_enumerate_chars(str, 0);
In turn, rb_str_enumerate_chars is implemented as:
rb_str_enumerate_chars(VALUE str, VALUE ary)
VALUE orig = str;
long i, len, n;
const char *ptr;
rb_encoding *enc;
str = rb_str_new_frozen(str);
ptr = RSTRING_PTR(str);
len = RSTRING_LEN(str);
enc = rb_enc_get(str);
for (i = 0; i < len; i += n) {
n = rb_enc_fast_mbclen(ptr + i, ptr + len, enc);
ENUM_ELEM(ary, rb_str_subseq(str, i, n));
else {
for (i = 0; i < len; i += n) {
n = rb_enc_mbclen(ptr + i, ptr + len, enc);
ENUM_ELEM(ary, rb_str_subseq(str, i, n));
if (ary)
return ary;
return orig;
From this we can see that it calls rb_enc_mbclen (or its fast version) to get the length (in bytes) of the next character in the string so that it can iterate the next step. By lazily iterating over a string, reading just one character at a time, we end up doing just one full pass over the input string as tally consumes the iterator.
Tally is then implemented like so:
static void
tally_up(VALUE hash, VALUE group)
VALUE tally = rb_hash_aref(hash, group);
if (NIL_P(tally)) {
tally = INT2FIX(1);
else if (FIXNUM_P(tally) && tally < INT2FIX(FIXNUM_MAX)) {
tally += INT2FIX(1) & ~FIXNUM_FLAG;
else {
tally = rb_big_plus(tally, INT2FIX(1));
rb_hash_aset(hash, group, tally);
static VALUE
tally_i(RB_BLOCK_CALL_FUNC_ARGLIST(i, hash))
tally_up(hash, i);
return Qnil;
Here, tally_i is the block callback (its parameters are declared with RB_BLOCK_CALL_FUNC_ARGLIST); it is invoked once per element and calls tally_up, which updates the tally hash on every iteration.
Rough time & memory analysis
The each_char method doesn't allocate an array to eagerly hold the characters of the string, so it has a small constant memory overhead. When we tally the characters, we allocate a hash and put our tally data into it which in the worst case scenario can take up as much memory as the input string times some constant factor.
Time-wise, tally does a full scan of the string, and calling find to locate the first non-repeated character then scans the hash, each of which carries O(n) worst-case complexity.
However, tally also updates a hash on every iteration. In the worst case a single hash update can be as slow as O(n), so the worst-case complexity of this Ruby solution is perhaps O(n^2).
Under reasonable assumptions, though, updating a hash is O(1), so we can expect the amortized average case to look like O(n).
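For comparison, here is a minimal Python sketch of the same two-step shape (tally, then find the first count of 1). It is not the Ruby implementation: Counter stands in for tally, next(...) stands in for find, and the function name is made up for illustration. It assumes Python 3.7+, where dictionaries (and therefore Counter) preserve insertion order.

```python
from collections import Counter


def first_non_repeated(s):
    # Pass 1: tally every character. Each update is O(1) on average,
    # so this pass is O(n) under the usual hashing assumptions.
    tally = Counter(s)
    # Pass 2: scan the tally (at most n entries) in insertion order,
    # which is the order in which each character first appeared.
    return next((ch for ch, count in tally.items() if count == 1), None)


if __name__ == "__main__":
    print(first_non_repeated("aaabbbcddd"))
```

The second pass walks the hash rather than the string, so it touches at most one entry per distinct character.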
My old accepted answer in Python
You can't know that the character is un-repeated until you've processed the whole string, so my suggestion would be this:
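def first_non_repeated_character(string):
    chars = []      # characters seen exactly once so far, in order
    repeated = []   # characters seen more than once
    for character in string:
        if character in chars:
            chars.remove(character)
            repeated.append(character)
        else:
            if not character in repeated:
                chars.append(character)
    if len(chars):
        return chars[0]
    else:
        return False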
Edit: originally posted code was bad, but this latest snippet is Certified To Work On Ryan's Computer™.
Why not use a heap-based data structure such as a minimum priority queue? As you read each character from the string, add it to the queue with a priority based on its location in the string and the number of occurrences so far. You could modify the queue to add priorities on collision, so that the priority of a character is the sum of the number of appearances of that character. At the end of the loop, the first element in the queue will be the least frequent character in the string, and if there are multiple characters with a count == 1, the first element is the first unique character added to the queue.
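A rough Python sketch of the priority-queue idea follows. It is a simplification rather than the exact scheme described: Python's heapq has no "update priority" operation, so this version counts first and builds the heap once at the end, with the priority taken to be the pair (count, index of first occurrence). The function name is hypothetical.

```python
import heapq


def first_non_repeated_heap(s):
    counts = {}       # character -> number of occurrences
    first_index = {}  # character -> index of its first occurrence
    for i, ch in enumerate(s):
        counts[ch] = counts.get(ch, 0) + 1
        first_index.setdefault(ch, i)

    # Min-heap ordered by (count, first index): the top element is the least
    # frequent character, with ties broken by the earliest position.
    heap = [(counts[ch], first_index[ch], ch) for ch in counts]
    heapq.heapify(heap)

    if heap and heap[0][0] == 1:
        return heap[0][2]
    return None  # no character occurs exactly once
```

Since only the top element is ever read, a plain min() over the same tuples would do equally well; the heap is kept here only to mirror the suggestion.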
Here is another fun way to do it. Counter requires Python2.7 or Python3.1
>>> from collections import Counter
>>> def first_non_repeated_character(s):
...     return min((k for k,v in Counter(s).items() if v<2), key=s.index)
...
>>> first_non_repeated_character("aaabbbcddd")
'c'
>>> first_non_repeated_character("aaaebbbcddd")
'e'
Lots of answers are attempting O(n) but are forgetting the actual costs of inserting and removing from the lists/associative arrays/sets they're using to track.
If you can assume that a char is a single byte, then you use a simple array indexed by the char and keep a count in it. This is truly O(n) because the array accesses are guaranteed O(1), and the final pass over the array to find the first element with 1 is constant time (because the array has a small, fixed size).
If you can't assume that a char is a single byte, then I would propose sorting the string and then doing a single pass checking adjacent values. This would be O(n log n) for the sort plus O(n) for the final pass. So it's effectively O(n log n), which is better than O(n^2). Also, it has virtually no space overhead, which is another problem with many of the answers that are attempting O(n).
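Both variants can be sketched in Python. These are illustrations under stated assumptions, not a definitive implementation: the function names are made up, the first version assumes byte input, and it scans the input a second time (still O(n)) rather than scanning the count array, so that it returns the first unique byte in string order. The second version sorts indices instead of the string itself so that original positions survive; that costs O(n) extra space, unlike an in-place sort.

```python
def first_unique_byte(data: bytes):
    # Fixed-size table: one counter per possible byte value.
    counts = [0] * 256
    for b in data:          # iterating over bytes yields ints in 0..255
        counts[b] += 1
    for b in data:          # second O(n) pass, in original order
        if counts[b] == 1:
            return b
    return None


def first_unique_sorted(s):
    # Sort positions by (character, position): O(n log n).
    order = sorted(range(len(s)), key=lambda i: (s[i], i))
    best = None
    k = 0
    while k < len(order):
        # Find the end of the run of equal characters starting at k.
        j = k
        while j + 1 < len(order) and s[order[j + 1]] == s[order[k]]:
            j += 1
        # A run of length 1 is a unique character; keep the earliest one.
        if j == k and (best is None or order[k] < best):
            best = order[k]
        k = j + 1
    return None if best is None else s[best]
```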
Counter requires Python2.7 or Python3.1
>>> from collections import Counter
>>> def first_non_repeated_character(s):
...     counts = Counter(s)
...     for c in s:
...         if counts[c]==1:
...             return c
...     return None
...
>>> first_non_repeated_character("aaabbbcddd")
'c'
>>> first_non_repeated_character("aaaebbbcddd")
'e'
Refactoring a solution proposed earlier (not having to use extra list/memory). This goes over the string twice. So this takes O(n) too like the original solution.
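from collections import defaultdict

def first_non_repeated_character(s):
    counts = defaultdict(int)
    # first pass: count every character
    for c in s:
        counts[c] += 1
    # second pass: return the first character that occurs exactly once
    for c in s:
        if counts[c] == 1:
            return c
    return None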
The following is a Ruby implementation of finding the first nonrepeated character of a string:
def first_non_repeated_character(string)
  string1 = string.split('')
  string2 = string.split('')
  string1.each do |let1|
    counter = 0
    # count how many times let1 occurs in the whole string
    string2.each do |let2|
      if let1 == let2
        counter += 1
      end
    end
    if counter == 1
      return let1
    end
  end
  nil # every character repeats
end
p first_non_repeated_character('dont doddle in the forest')
And here is a JavaScript implementation of the same style function:
var first_non_repeated_character = function (string) {
  var string1 = string.split('');
  var string2 = string.split('');
  var single_letters = [];
  for (var i = 0; i < string1.length; i++) {
    var count = 0;
    // count how many times string1[i] occurs in the whole string
    for (var x = 0; x < string2.length; x++) {
      if (string1[i] == string2[x]) {
        count++;
      }
    }
    if (count == 1) {
      return string1[i];
    }
  }
};
console.log(first_non_repeated_character('dont doddle in the forest'));
console.log(first_non_repeated_character('how are you today really?'));
In both cases I used a counter: if a letter matches nothing else in the string, it occurs exactly once, so I just count its occurrences.
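The same counting idea can be sketched in a few lines of Python. The function name is made up; str.count rescans the whole string for every character, so this is O(n^2) in the worst case, just like the nested loops above.

```python
def first_non_repeated_by_counting(s):
    for ch in s:
        # str.count walks the entire string, playing the role of the inner loop.
        if s.count(ch) == 1:
            return ch
    return None  # every character occurs more than once
```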
I think this should do it in C. This operates in O(n) time with no ambiguity about order of insertion and deletion operators. This is a counting sort (simplest form of a bucket sort, which itself is the simple form of a radix sort).
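#include <string.h>

unsigned char find_first_unique(unsigned char *string)
{
    int chars[256];
    int i;
    memset(chars, 0, sizeof(chars));
    /* first pass: count the occurrences of each byte value */
    for (i = 0; string[i]; i++)
        chars[string[i]]++;
    /* second pass: return the first byte whose count is exactly 1 */
    for (i = 0; string[i]; i++)
        if (chars[string[i]] == 1) return string[i];
    return 0;
}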
In Ruby:
(Original Credit: Andrew A. Smith)
x = "a huge string in which some characters repeat"
def first_unique_character(s)
  s.each_char.detect { |c| s.count(c) == 1 }
end
first_unique_character(x)
=> "u"
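def first_non_repeated_character(string):
    chars = []      # characters seen exactly once so far, in order
    repeated = []   # characters seen more than once
    for character in string:
        if character in repeated:
            continue  # already known to repeat: discard it
        elif character in chars:
            chars.remove(character)
            repeated.append(character)
        else:
            chars.append(character)
    if len(chars):
        return chars[0]
    return False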
The other JavaScript solutions here are quite C-style; here is a more JavaScript-style solution.
// The original function header was lost; the name firstNonRepeated is assumed.
var firstNonRepeated = function (string) {
  var arr = string.split("");
  var occurences = {};
  var tmp;
  var lowestindex = string.length + 1;
  arr.forEach(function (c) {
    tmp = c;
    // each occurrence appends the character to its own entry
    if (typeof occurences[tmp] == "undefined")
      occurences[tmp] = tmp;
    else
      occurences[tmp] += tmp;
  });
  for (var p in occurences) {
    if (occurences[p].length == 1)
      lowestindex = Math.min(lowestindex, string.indexOf(p));
  }
  if (lowestindex > string.length)
    return null;
  return string[lowestindex];
};
In C, this is almost Shlemiel the Painter's Algorithm (not quite O(n!), but O(n^2) in the worst case).
But it will outperform "better" algorithms for reasonably sized strings because the constant factor is so small. It can also easily tell you the location of the first non-repeating character.
char FirstNonRepeatedChar(char * psz)
{
   for (int ii = 0; psz[ii] != 0; ++ii)
   {
      // scan the whole string, so a later copy of an earlier character is not mistaken for unique
      for (int jj = 0; ; ++jj)
      {
         // if we hit the end of string, then we found a non-repeat character.
         if (psz[jj] == 0)
            return psz[ii]; // this character doesn't repeat
         // if we found a repeat character, we can stop looking.
         if (jj != ii && psz[ii] == psz[jj])
            break;
      }
   }
   return 0; // there were no non-repeating characters.
}
edit: this code is assuming you don't mean consecutive repeating characters.
Here's an implementation in Perl (version >=5.10) that doesn't care whether the repeated characters are consecutive or not:
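use strict;
use warnings;
foreach my $word (@ARGV) {
    my @distinct_chars;
    my %char_counts;
    my @chars = split(//, $word);
    foreach (@chars) {
        push @distinct_chars, $_ unless $_ ~~ @distinct_chars;
        $char_counts{$_}++;
    }
    my $first_non_repeated = "";
    foreach (@distinct_chars) {
        if ($char_counts{$_} == 1) { $first_non_repeated = $_; last; }
    }
    if (length $first_non_repeated) {
        print "For \"$word\", the first non-repeated character is '$first_non_repeated'.\n";
    } else {
        print "All characters in \"$word\" are repeated.\n";
    }
}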
Storing this code in a script and running it on a few inputs produces:
jmaney> perl aabccd "a huge string in which some characters repeat" abcabc
For "aabccd", the first non-repeated character is 'b'.
For "a huge string in which some characters repeat", the first non-repeated character is 'u'.
All characters in "abcabc" are repeated.
Here's a possible solution in ruby without using Array#detect (as in this answer). Using Array#detect makes it too easy, I think.
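ALPHABET = %w(a b c d e f g h i j k l m n o p q r s t u v w x y z)
def fnr(s)
  unseen_chars = ALPHABET.dup
  seen_once_chars = []
  s.each_char do |c|
    if unseen_chars.include?(c)
      # first sighting: move c from "unseen" to "seen once"
      unseen_chars.delete(c)
      seen_once_chars << c
    elsif seen_once_chars.include?(c)
      # second sighting: c is repeated, so drop it
      seen_once_chars.delete(c)
    end
  end
  seen_once_chars.first
end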
Seems to work for some simple examples:
fnr "abcdabcegghh"
# => "d"
fnr "abababababababaqababa"
# => "q"
Suggestions and corrections are very much appreciated!
Try this code:
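public static String findFirstUnique(String str)
{
    String unique = "";
    String repeated = "";   // characters already seen at least twice
    foreach (char ch in str)
    {
        if (repeated.Contains(ch)) continue;   // a third occurrence must not be re-added
        if (unique.Contains(ch)) { unique = unique.Replace(ch.ToString(), ""); repeated += ch; }
        else unique += ch.ToString();
    }
    return unique.Length > 0 ? unique[0].ToString() : "";
}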
In Mathematica one might write this:
string = "conservationist deliberately treasures analytical";
Cases[Gather @ Characters @ string, {_}, 1, 1][[1]]
This snippet code in JavaScript
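var string = "tooth";
var hash = {};
// first pass: count the occurrences of each character
for (var i = 0, j = string.length; i < j; i++) {
    if (hash[string[i]] !== undefined) {
        hash[string[i]] = hash[string[i]] + 1;
    } else {
        hash[string[i]] = 1;
    }
}
// second pass: print the first character whose count is 1
// (console.log is assumed here; the original call name was lost)
for (i = 0, j = string.length; i < j; i++) {
    if (hash[string[i]] === 1) {
        console.log(string[i]);
        break;
    }
}
// prints "h"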
Different approach here.
Scan each element in the string and build a count array that stores the number of occurrences of each element.
Then start again from the first element of the string and print the first element whose count = 1.
C code
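#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    char t_c;
    char *t_p;
    int count[128]={0};
    if(argc < 2)
        return 1;   /* expects the input string as the first argument */
    /* first pass: count the occurrences of each (7-bit ASCII) character */
    t_p = argv[1];
    for(t_c = *t_p; t_c != '\0'; t_c = *(++t_p))
        count[t_c & 127]++;
    /* second pass: print the first character whose count is 1 */
    t_p = argv[1];
    for(t_c = *t_p; t_c != '\0'; t_c = *(++t_p))
    {
        if(count[t_c & 127] == 1)
        {
            printf("Element is %c\n",t_c);
            break;
        }
    }
    return 0;
}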
input is = aabbcddeef output is = c
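char FindUniqueChar(char *a)
{
    int i=0;
    bool repeat=false;   /* true inside a run of equal adjacent characters; only consecutive repeats are detected */
    while(a[i] != '\0')
    {
        if (a[i] == a[i+1])
            repeat = true;
        else if (!repeat)
            return a[i];
        else
            repeat = false;
        i++;
    }
    return a[i];
}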
Here is another approach: we could have an array that stores the count and the index of the first occurrence of each character. After filling the array, we just traverse it, find the minimum index whose count is 1, and return str[index].
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <climits>
using namespace std;
#define No_of_chars 256

//store the count and the index where the char first appear
typedef struct countarray
{
    int count;
    int index;
} countarray;

//returns the count array
countarray *getcountarray(char *str)
{
    countarray *count;
    count=new countarray[No_of_chars];
    for(int i=0;i<No_of_chars;i++)
    {
        count[i].count=0;
        count[i].index=-1;
    }
    for(int i=0;*(str+i);i++)
    {
        unsigned char c=*(str+i); //index with an unsigned value so bytes >= 128 stay in range
        count[c].count++;
        if(count[c].count==1) //if count==1 then update the index
            count[c].index=i;
    }
    return count;
}

char firstnonrepeatingchar(char *str)
{
    countarray *array;
    array = getcountarray(str);
    int result = INT_MAX;
    for(int i=0;i<No_of_chars;i++)
    {
        if(array[i].count==1 && result > array[i].index)
            result = array[i].index;
    }
    delete[] (array);
    //return '\0' when every character repeats
    return (result == INT_MAX) ? '\0' : str[result];
}

int main()
{
    char str[] = "geeksforgeeks";
    cout<<"First non repeating character is "<<firstnonrepeatingchar(str)<<endl;
    return 0;
}
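For readers who prefer Python, the count-plus-first-index idea might be sketched like this. A dict replaces the fixed 256-entry array so that non-ASCII characters work too; the function name and the [count, first index] layout are choices made for this illustration only.

```python
def first_non_repeating(s):
    info = {}  # character -> [count, index of first occurrence]
    for i, ch in enumerate(s):
        if ch in info:
            info[ch][0] += 1
        else:
            info[ch] = [1, i]

    # Among characters seen exactly once, pick the smallest first index.
    candidates = [index for count, index in info.values() if count == 1]
    return s[min(candidates)] if candidates else None


if __name__ == "__main__":
    print("First non repeating character is", first_non_repeating("geeksforgeeks"))
```

The final scan is over distinct characters only, so its cost is bounded by the alphabet size rather than the string length.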
This C# function uses a hash table (Dictionary) and has O(2n), i.e. O(n), worst-case performance.
private static string FirstNoRepeatingCharacter(string aword)
{
    Dictionary<string, int> dic = new Dictionary<string, int>();
    // first pass: count each one-character substring
    for (int i = 0; i < aword.Length; i++)
    {
        if (!dic.ContainsKey(aword.Substring(i, 1)))
            dic.Add(aword.Substring(i, 1), 1);
        else
            dic[aword.Substring(i, 1)]++;
    }
    // second pass: return the first key whose count is 1
    foreach (var item in dic)
    {
        if (item.Value == 1) return item.Key;
    }
    return string.Empty;
}

string aword = "TEETER";
Console.WriteLine(FirstNoRepeatingCharacter(aword)); //print: R
I have two strings i.e. 'unique' and 'repeated'. Every character appearing for the first time, gets added to 'unique'. If it is repeated for the second time, it gets removed from 'unique' and added to 'repeated'. This way, we will always have a string of unique characters in 'unique'.
Complexity big O(n)
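public void firstUniqueChar(String str){
    String unique= "";
    String repeated = "";
    str = str.toLowerCase();
    for(int i=0; i<str.length();i++){
        char ch = str.charAt(i);
        if(!(repeated.contains(str.subSequence(i, i+1)))) {
            if(unique.contains(str.subSequence(i, i+1))){
                // second occurrence: move ch from 'unique' to 'repeated'
                // (replace, not replaceAll, so regex metacharacters are taken literally)
                unique = unique.replace(Character.toString(ch), "");
                repeated = repeated+ch;
            } else {
                unique = unique+ch;
            }
        }
    }
    // output step assumed here; the original ending was lost
    System.out.println(unique.isEmpty() ? "No unique character" : unique.charAt(0));
}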
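One way to keep the 'unique'/'repeated' bookkeeping while making every membership test and removal O(1) on average is to swap the two strings for hash-based containers. The Python sketch below does that; it is a variation on the idea, not the code above, and it assumes Python 3.7+ so that a plain dict can serve as an insertion-ordered set. The function name is made up.

```python
def first_unique_char(s):
    unique = {}       # insertion-ordered dict used as an ordered set
    repeated = set()  # characters seen at least twice
    for ch in s.lower():
        if ch in repeated:
            continue              # already known to repeat
        if ch in unique:
            del unique[ch]        # second sighting: move to 'repeated'
            repeated.add(ch)
        else:
            unique[ch] = None     # first sighting
    # The first surviving key is the first character that never repeated.
    return next(iter(unique), None)
```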
The following code is in C# with O(n) complexity.
using System;
using System.Linq;
using System.Text;
namespace SomethingDigital
{
    class FirstNonRepeatingChar
    {
        public static void Main()
        {
            String input = "geeksforgeeksandgeeksquizfor";
            char[] str = input.ToCharArray();
            String unique1 = "";   // every distinct character seen so far
            String unique2 = "";   // characters seen exactly once so far
            foreach (char ch in str)
            {
                if (!unique1.Contains(ch))
                {
                    unique1 = unique1 + ch;
                    unique2 = unique2 + ch;
                }
                else
                    unique2 = unique2.Replace(ch.ToString(), "");
            }
            if (unique2 != "")
                Console.WriteLine(unique2[0].ToString());   // output line assumed; the original was lost
            else
                Console.WriteLine("No non repeated string");
        }
    }
}
The following solution is an elegant way to find the first unique character within a string using the new features introduced as part of Java 8. It first creates a map that counts the number of occurrences of each character, then uses this map to find the first character that occurs only once. This runs in O(N) time.
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
// Runs in O(N) time and uses lambdas and the stream API from Java 8
// Also, it is only three lines of code!
private static String findFirstUniqueCharacterPerformantWithLambda(String inputString) {
    // convert the input string into a list of characters
    final List<String> inputCharacters = Arrays.asList(inputString.split(""));
    // first, construct a map to count the number of occurrences of each character
    final Map<Object, Long> characterCounts = inputCharacters
        .stream()
        .collect(groupingBy(s -> s, counting()));
    // then, find the first unique character by consulting the count map
    // (the terminal findFirst/orElse step is assumed; the original ending was lost)
    return inputCharacters
        .stream()
        .filter(s -> characterCounts.get(s) == 1)
        .findFirst()
        .orElse(null);
}
Here is one more solution with O(n) time complexity.
public void findUnique(String string) {
    ArrayList<Character> uniqueList = new ArrayList<>();
    int[] chatArr = new int[128];
    for (int i = 0; i < string.length(); i++) {
        Character ch = string.charAt(i);
        if (chatArr[ch] != -1) {
            // first sighting: mark the character and remember it
            chatArr[ch] = -1;
            uniqueList.add(ch);
        } else {
            // seen before: it is no longer unique
            uniqueList.remove(ch);
        }
    }
    if (uniqueList.size() == 0) {
        System.out.println("No unique character found!");
    } else {
        System.out.println("First unique character is :" + uniqueList.get(0));
    }
}
I read through the answers, but did not see any like mine, I think this answer is very simple and fast, am I wrong?
def first_unique(s):
    repeated = []
    while s:
        if s[0] not in s[1:] and s[0] not in repeated:
            return s[0]
        repeated.append(s[0])  # remember it, so a later copy is not mistaken for unique
        s = s[1:]
    return None
(first_unique('abdcab') == 'd', first_unique('aabbccdad') == None, first_unique('') == None, first_unique('a') == 'a')
Question : First Unique Character of a String
This is the simplest solution.
public class Test4 {
    public static void main(String[] args) {
        String a = "GiniGinaProtijayi";
        firstUniqCharindex(a);
    }

    public static void firstUniqCharindex(String a) {
        int[] count = new int[256];
        // first pass: count each character
        for (int i = 0; i < a.length(); i++) {
            count[a.charAt(i)]++;
        }
        int index = -1;
        // second pass: stop at the first character whose count is 1
        for (int i = 0; i < a.length(); i++) {
            if (count[a.charAt(i)] == 1) {
                index = i;
                break;
            } // if
        }
        System.out.println(index);// output => 8
        System.out.println(a.charAt(index)); //output => P
    }// end1
}
In Python:
def firstUniqChar(a):
    count = [0] * 256
    for i in a: count[ord(i)] += 1
    element = ""
    for items in a:
        if(count[ord(items) ] == 1):
            element = items ;
            break  # stop at the first character that occurs once
    return element
a = "GiniGinaProtijayi";
print(firstUniqChar(a)) # output is P
Using Java 8 :
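import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Test2 {
    public static void main(String[] args) {
        String a = "GiniGinaProtijayi";
        // LinkedHashMap keeps the characters in order of first appearance
        Map<Character, Long> map = a.chars()
                .mapToObj(ch -> Character.valueOf((char) ch))
                .collect(Collectors.groupingBy(Function.identity(), LinkedHashMap::new, Collectors.counting()));
        System.out.println("MAP => " + map);
        // {G=2, i=5, n=2, a=2, P=1, r=1, o=1, t=1, j=1, y=1}
        Character chh = map
                .entrySet().stream()
                .filter(entry -> entry.getValue() == 1L)
                .map(entry -> entry.getKey())
                .findFirst().get();
        System.out.println("First Non Repeating Character => " + chh);// P
    }// main
}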
How about using a suffix tree for this case? The first unrepeated character will be the first character of the longest suffix string with the least depth in the tree.
Create Two list -
unique list - having only unique character .. UL
non-unique list - having only repeated character -NUL
for(char c in str) {
//do nothing
}else if(ul.contains(c)){
