Split string into words

Split string into words - algorithm

I am looking for the most efficient algorithm to form all possible combinations of words from a string. For example:
Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
(All words should be from a dictionary).
I can think of a brute-force approach (find all possible substrings and match them against the dictionary), but what would be better ways?

Use a prefix tree for your list of known words. Probably libs like myspell already do so. Try using a ready-made one.
Once you found a match (e.g. 'car'), split your computation: one branch starts to look for a new word ('rot'), another continues to explore variants of current beginning ('carrot').
Effectively you maintain a queue of pairs (start_position, current_position) of offsets into your string every time you split the computation. Several threads can pop from this queue in parallel and try to continue a word that starts from start_position and is already known up to current_position of the pair, but does not end there. When a word is found, it is reported and another pair is popped from the queue. When it's impossible, no result is generated. When a split occurs, a new pair is added to the end of the queue. Initially the queue contains a (0,0).
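A minimal single-threaded sketch of the trie-plus-queue idea, in Python. The names build_trie, split_all and END are made up for illustration, and the queue here holds (position, words so far) pairs rather than the (start_position, current_position) offsets described above, so treat it as one possible reading of the approach rather than the exact scheme:

```python
from collections import deque

END = ""  # end-of-word marker; never collides with a one-character key


def build_trie(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def split_all(text, trie):
    """Return every way to split text into words stored in the trie."""
    results = []
    queue = deque([(0, [])])  # (start position, words found so far)
    while queue:
        start, words = queue.popleft()
        if start == len(text):
            results.append(" ".join(words))
            continue
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no dictionary word continues with this character
            if END in node:
                # A word ends here: branch off a new search after it,
                # while this loop keeps extending the current word.
                queue.append((pos + 1, words + [text[start:pos + 1]]))
    return results
```

With the dictionary for, ever, forever, car, rot, carrot, calling split_all("forevercarrot", trie) yields the four splits from the question (in breadth-first order, not the order listed there).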

See this question which has even better answers. It's a standard dynamic programming problem:
How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

A pseudocode implementation that exploits the fact that every part of the string needs to be a word, so we can't skip anything. We work forward from the start of the string until the first bit is a word, and then generate all possible combinations of the rest of the string. Once we've done that, we keep going along until we find any other possibilities for the first word, and so on.
allPossibleWords(string s, int startPosition) {
    if startPosition == s.length
        return [""]  // exactly one way to split the empty remainder
    list ret
    for i in startPosition+1..s.length
        if isWord(s[startPosition, i])
            // '*' prefixes the word to every split of the rest
            ret += s[startPosition, i] * allPossibleWords(s, i)
    return ret
}
The bugbear in this code is that you'll end up repeating calculations - in your example, you'll end up having to calculate allPossibleWords("carrot") twice - once in ["forever", allPossibleWords["carrot"]] and once in ["for", "ever", allPossibleWords["carrot"]]. So memoizing this is something to consider.
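Here is a rough Python sketch of the memoized version. The names all_splits and from_pos are hypothetical, and the dictionary is assumed to be a plain set of words:

```python
from functools import lru_cache


def all_splits(s, dictionary):
    """Return every split of s into words from dictionary (a set)."""

    @lru_cache(maxsize=None)
    def from_pos(start):
        # All splits of s[start:], each as a tuple of words.
        if start == len(s):
            return ((),)  # one way to split the empty remainder
        found = []
        for end in range(start + 1, len(s) + 1):
            word = s[start:end]
            if word in dictionary:
                for rest in from_pos(end):
                    found.append((word,) + rest)
        return tuple(found)

    return [" ".join(words) for words in from_pos(0)]
```

Thanks to the cache, from_pos(7) (the splits of "carrot") is computed once, even though both "forever" and "for ever" lead to it.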

Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
program :
#include<iostream>
#include<string>
#include<vector>
#include<string.h>
void strsplit(std::string str)
{
int len=0,i,x,y,j,k;
len = str.size();
std::string s1,s2,s3,s4,s5,s6,s7;
std::vector<char> c(len+1), b(len+1), d(len+1); // zero-initialized and freed automatically
for(i =0 ;i< len-1;i++)
{
std::cout<<"\n";
for(j=0;j<=i;j++)
{
c[j] = str[j];
b[j] = str[j];
s3 += c[j];
y = j+1;
}
for( int h=i+1;h<len;h++){
s5 += str[h];
}
s6 = s3+" "+s5;
std::cout<<" "<<s6<<"\n";
s5 = "";
for(k = y;k<len-1;k++)
{
d[k] = str[k];
s1 += d[k];
s1 += " ";
for(int l = k+1;l<len;l++){
b[l] = str[l];
s2 += b[l];
}
s4 = s3+" "+s1+s2;
s7 = s4;
std::cout<<" "<<s4<<"\n";
s3 = "";s4 = "";
}
s1 = "";s3 = "";
}
}
int main(int argc, char* argv[])
{
std::string str;
if(argc < 2)
std::cout<<"Usage: "<<argv[0]<<" <InputString> "<<"\n";
else{
str = argv[1];
strsplit(str);
}
return 0;
}

Related

Find word in string buffer/paragraph/text

This was asked in an Amazon phone interview: "Can you write a program (in your preferred language, C/C++/etc.) to find a given word in a large string buffer, i.e. count its occurrences?"
I am still looking for the ideal answer that I should have given the interviewer. I tried to write a linear search (character-by-character comparison) and was rejected.
Given 40-45 minutes for a phone interview, what algorithm was the interviewer looking for?

The KMP Algorithm is a popular string matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP Algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where: When you begin (pass 1), the first 99 characters match, but the 100th character doesn't match. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of KMP pattern searching
// algorithm
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void computeLPSArray(char *pat, int M, int *lps);
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // length of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example
// AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char *txt = "ABABDABACDABABCABAB";
char *pat = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
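Since the question asked for the number of occurrences, here is a compact Python sketch of the same KMP idea that counts matches instead of printing them. The function names are my own, and overlapping matches are counted (an assumption about what the interviewer wanted):

```python
def failure_table(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = lps[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        lps[i] = k
    return lps


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    lps = failure_table(pattern)
    count = 0
    k = 0  # number of pattern characters currently matched
    for ch in text:
        while k and ch != pattern[k]:
            k = lps[k - 1]  # reuse what is known about the matched prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = lps[k - 1]  # keep going to catch overlapping matches
    return count
```

The text index never moves backwards, which is where the O(N + M) bound comes from.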

Amazon and similar companies expect knowledge of the Boyer–Moore string search and/or Knuth–Morris–Pratt algorithms.
Those are good if you want to show thorough knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
Did you ask about delimiters before you wrote anything? The interviewer may simplify your task by providing some extra information about the string buffer.
Even the code below could be OK (it's really not) if you provide enough information in advance and properly explain the runtime, the space requirements, and the choice of data containers.
#include <sstream>
#include <string>
#include <unordered_map>
int find( const std::string & the_word, const std::string & text )
{
    std::stringstream ss( text ); // !!! copies 'text'; could be a really bad idea if 'text' is really big
    std::string word;
    std::unordered_map< std::string, int > umap;
    while( ss >> word ) ++umap[word]; // assumes that words are separated by whitespace
    return umap[the_word];
}

algorithm in C++

I am writing a multi-way trie that will load in a dictionary that will take words and phrases. So first the dictionary will be loaded into the trie.

This is some (almost) C++ adapted from the following article:
http://www.toptal.com/java/the-trie-a-neglected-data-structure
That one is written in Java, so I've taken the courtesy of giving it to you in c++.
#include <vector>
struct Alphabet{
    const char* x = "abcdefghijklmnopqrstuvwxyz";
    // Returns the index of c in the alphabet, or -1 if c is not a lowercase letter.
    int findIndex(char c) const{
        for(int i = 0; i < 26; ++i){
            if(x[i] == c){
                return i;
            }
        }
        return -1;
    }
};
struct MWTrieNode{
    // One slot per letter; a null pointer means "no such child".
    std::vector<MWTrieNode*> children = std::vector<MWTrieNode*>(26, nullptr);
    bool isLast = false;
};
MWTrieNode* getWord(const char* s, int len, MWTrieNode* root){
    MWTrieNode* node = root;
    Alphabet a;
    for(int i = 0; i < len; i++){
        int index = a.findIndex(s[i]);
        if(index < 0){
            // Character outside the alphabet
            return nullptr;
        }
        MWTrieNode* child = node->children[index];
        if(!child){
            // No such word
            return nullptr;
        }
        // step into the MWTrieNode
        node = child;
    }
    return node;
}
You can modify the getWord function to take parameters that change how it returns your words.
But this should get you started.
For completions, you'll need to find all of the words below a certain prefix (I'd imagine). So you'd want to build several "subtrees" starting at the node for your search prefix (e.g. "House" --> "Housewife", "Housing", "Household", etc.).
If you pass in 'Ho', you will find a 'partial word' whose node is not marked as the end of a word. At this point, you can start at the 'o' node.
The part you haven't mentioned is how you store words that are both words by themselves and a portion of a longer word, for example Home, Homeowner, Homewrecker.
Those nodes must have isLast == true, but also have child nodes. It is this special case that should help you find multiple options for auto-complete. You're running the getWord method several times for a single search, with different prefixes and conditions. The result should be a list of words that all have the prefix you desire.
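To make the completion idea concrete, here is a small Python sketch (Python rather than C++ to keep it short). TrieNode, insert and completions are hypothetical names, and is_last plays the role of isLast above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to its child node
        self.is_last = False  # True if a word ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_last = True


def completions(root, prefix):
    """Return, sorted, every stored word that starts with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []  # nothing in the trie starts with this prefix
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.is_last:
            # A node may end a word and still have children ("home").
            found.append(text)
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(found)
```

After inserting home, homeowner, homewrecker and house, completions(root, "home") returns the first three, because the node for "home" is both the end of a word and the root of a subtree.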

I'm sure that professors Alvarado and Mirza will be highly interested in the contents of your post.

Parsing morse code

I am trying to solve this problem.
The goal is to determine the number of ways a morse string can be interpreted, given a dictionary of words.
What I did is that I first "translated" words from my dictionary into morse. Then, I used a naive algorithm, searching for all the ways it can be interpreted recursively.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <iterator>
using namespace std;
string morse_string;
int morse_string_size;
map<char, string> morse_table;
unsigned int sol;
void matches(int i, int factor, vector<string> &dictionary) {
int suffix_length = morse_string_size-i;
if (suffix_length <= 0) {
sol += factor;
return;
}
map<int, int> c;
for (vector<string>::iterator it = dictionary.begin() ; it != dictionary.end() ; it++) {
if (((*it).size() <= suffix_length) && (morse_string.substr(i, (*it).size()) == *it)) {
// operator[] value-initializes a missing count to 0, so the first match counts as 1
c[(*it).size()]++;
}
}
for (map<int, int>::iterator it = c.begin() ; it != c.end() ; it++) {
matches(i+it->first, factor*(it->second), dictionary);
}
}
string encode_morse(string s) {
string ret = "";
for (unsigned int i = 0 ; i < s.length() ; ++i) {
ret += morse_table[s[i]];
}
return ret;
}
int main() {
morse_table['A'] = ".-"; morse_table['B'] = "-..."; morse_table['C'] = "-.-."; morse_table['D'] = "-.."; morse_table['E'] = "."; morse_table['F'] = "..-."; morse_table['G'] = "--."; morse_table['H'] = "...."; morse_table['I'] = ".."; morse_table['J'] = ".---"; morse_table['K'] = "-.-"; morse_table['L'] = ".-.."; morse_table['M'] = "--"; morse_table['N'] = "-."; morse_table['O'] = "---"; morse_table['P'] = ".--."; morse_table['Q'] = "--.-"; morse_table['R'] = ".-."; morse_table['S'] = "..."; morse_table['T'] = "-"; morse_table['U'] = "..-"; morse_table['V'] = "...-"; morse_table['W'] = ".--"; morse_table['X'] = "-..-"; morse_table['Y'] = "-.--"; morse_table['Z'] = "--..";
int T, N;
string tmp;
vector<string> dictionary;
cin >> T;
while (T--) {
morse_string = "";
cin >> morse_string;
morse_string_size = morse_string.size();
cin >> N;
for (int j = 0 ; j < N ; j++) {
cin >> tmp;
dictionary.push_back(encode_morse(tmp));
}
sol = 0;
matches(0, 1, dictionary);
cout << sol;
if (T)
cout << endl << endl;
}
return 0;
}
The problem is that I am only allowed 3 seconds of execution time, and my algorithm does not finish within that limit.
Is this a good way to do it, and if so, what am I missing? Otherwise, can you give some hints about a good strategy?
EDIT:
There can be at most 10,000 words in the dictionary and at most 1,000 characters in the morse string.

A solution that combines dynamic programming with a rolling hash should work for this problem.
Let's start with a simple dynamic programming solution. We allocate a vector which we will use to store known counts for prefixes of morse_string. We then iterate through morse_string, and at each position i we iterate through all words and look back to see whether they fit into morse_string ending at i. If a word fits, we use the dynamic programming vector to look up how many ways we could have built the prefix of morse_string up to i-dictionaryWord.size().
vector<long> dp;
dp.push_back(1); // dp[0]: one way to build the empty prefix
for (size_t i = 1; i <= morse_string.size(); i++) {
    long count = 0;
    for (size_t j = 0; j < dictionary.size(); j++) {
        size_t len = dictionary[j].size();
        if (len > i) continue;
        // does dictionary[j] match the len characters ending at position i?
        if (morse_string.compare(i - len, len, dictionary[j]) == 0) {
            count += dp[i - len];
        }
    }
    dp.push_back(count); // dp[i]: ways to build the first i characters
}
long result = dp[morse_string.size()];
The problem with this solution is that it is too slow. Let's say that N is the length of morse_string and M is the size of the dictionary and K is the size of the largest word in the dictionary. It will do O(N*M*K) operations. If we assume K=1000 this is about 10^10 operations which is too slow on most machines.
The K cost came from the line dictionary[j] == morse_string.substring(i-dictionary[j].size(),i)
If we could speed up this string matching to constant or log complexity we would be okay. This is where rolling hashing comes in. If you build a rolling hash array of morse_string then the idea is that you can compute the hash of any substring of morse_string in O(1). So you could then do hash(dictionary[j]) == hash(morse_string.substring(i-dictionary[j].size(),i))
This is good but in the presence of imperfect hashing you could have multiple words from the dictionary with the same hash. That would mean that after getting a hash match you would still need to match the strings as well as the hashes. In programming contests, people often assume perfect hashing and skip the string matching. This is often a safe bet especially on a small dictionary. In case it doesn't produce a perfect hashing (which you can check in code) you can always adjust your hash function slightly and maybe the adjusted hash function will produce a perfect hashing.
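A rough Python sketch of the combination, under some assumptions of mine: BASE and MOD are arbitrary choices, collisions are assumed not to happen (as discussed above), and grouping the dictionary by word length is an extra shortcut that the description above does not require. The function names are hypothetical:

```python
from collections import defaultdict

MOD = (1 << 61) - 1  # large prime modulus (arbitrary choice)
BASE = 131           # arbitrary base


def poly_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def count_interpretations(morse, encoded_words):
    """Count the ways to write morse as a sequence of encoded words."""
    n = len(morse)
    # prefix[i] is the hash of morse[:i]; power[i] is BASE**i mod MOD.
    prefix = [0] * (n + 1)
    power = [1] * (n + 1)
    for i, ch in enumerate(morse):
        prefix[i + 1] = (prefix[i] * BASE + ord(ch)) % MOD
        power[i + 1] = (power[i] * BASE) % MOD

    def sub_hash(lo, hi):
        # Hash of morse[lo:hi] in O(1).
        return (prefix[hi] - prefix[lo] * power[hi - lo]) % MOD

    # by_length[L][h] = number of dictionary words of length L with hash h.
    by_length = defaultdict(lambda: defaultdict(int))
    for word in encoded_words:
        if 0 < len(word) <= n:
            by_length[len(word)][poly_hash(word)] += 1

    dp = [0] * (n + 1)
    dp[0] = 1  # one way to build the empty prefix
    for i in range(1, n + 1):
        for length, bucket in by_length.items():
            if length <= i and dp[i - length]:
                matches = bucket.get(sub_hash(i - length, i), 0)
                dp[i] += dp[i - length] * matches
    return dp[n]
```

For example, with the encodings of A (".-"), E (".") and T ("-"), the string ".-" can be read as "A" or as "E" followed by "T", so the count is 2.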

Reorder a string by half the character

This is an interview question.
Given a string such as 123456abcdef, consisting of n/2 digits followed by n/2 letters, reorder it as 1a2b3c4d5e6f. The algorithm should be in-place.
The solution I gave was trivial and O(n^2): just shift the characters by n/2 places to the left.
I tried using recursion as follows:
a. Swap the second half of the first part with the first half of the second part, e.g.
123 456 abc def
123 abc 456 def
b. Recurse on the two halves.
The problem I am stuck on is that the swapping varies with the number of elements, e.g.
What to do next?
123 abc
12ab 3c
And what to do for : 12345 abcde
123abc 45ab
This is a pretty old question and may be a duplicate. Please let me know.. :)
Another example:
Input: 38726zfgsa
Output: 3z8f7g2s6a

Here's how I would approach the problem:
1) Divide the string into two partitions, number part and letter part
2) Divide each of those partitions into two more (equal sized)
3) Swap the second and the third partition (inner number and inner letter)
4) Recurse on the original two partitions (with their newly swapped bits)
5) Stop when the partition has a size of 2
For example:
123456abcdef -> 123456 abcdef -> 123 456 abc def -> 123 abc 456 def
123abc -> 123 abc -> 12 3 ab c -> 12 ab 3 c
12 ab -> 1 2 a b -> 1 a 2 b
... etc
And the same for the other half of the recursion..
All can be done in place with the only gotcha being swapping partitions that aren't the same size (but it'll be off by one, so not difficult to handle).
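The steps above can be sketched in Python on a list of characters. The helper names are hypothetical, and the unequal-partition case is handled by rotating with three reversals, which is one possible way to deal with the off-by-one swap (an assumption, not necessarily what was meant above):

```python
def _reverse(a, lo, hi):
    """Reverse a[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        a[lo], a[hi] = a[hi], a[lo]
        lo += 1
        hi -= 1


def _rotate(a, lo, mid, hi):
    """Move a[lo:mid] behind a[mid:hi] in place, using three reversals."""
    _reverse(a, lo, mid)
    _reverse(a, mid, hi)
    _reverse(a, lo, hi)


def interleave(a, lo=0, hi=None):
    """Turn the slice [1..k, a..k] into [1, a, 2, b, ...] in place."""
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n % 2:
        raise ValueError("only works with even lengths")
    if n < 4:
        return  # a partition of size 2 (or 0) is already in order
    pivot = n // 2           # size of the number part
    half = (pivot + 1) // 2  # size of the outer number partition
    # Swap the inner number partition with the first `half` letters.
    _rotate(a, lo + half, lo + pivot, lo + pivot + half)
    interleave(a, lo, lo + 2 * half)
    interleave(a, lo + 2 * half, hi)
```

For 123456abcdef the first rotation gives 123abc456def, and the two recursive calls then handle 123abc and 456def independently.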

It is easy to permute an array in place by chasing elements round cycles if you have a bit-map to mark which elements have been moved. We don't have a separate bit-map, but IF your characters are letters (or at least have the high order bit clear) then we can use the top bit of each character to mark this. This produces the following program, which is not recursive and so does not use stack space.
class XX
{
/** new position given old position */
static int newFromOld(int x, int n)
{
if (x < n / 2)
{
return x * 2;
}
return (x - n / 2) * 2 + 1;
}
private static int HIGH_ORDER_BIT = 1 << 15; // 16-bit chars
public static void main(String[] s)
{
// input data - create an array so we can modify
// characters in place
char[] x = s[0].toCharArray();
if ((x.length & 1) != 0)
{
System.err.println("Only works with even length strings");
return;
}
// Character we have read but not yet written, if any
char holding = 0;
// where character in hand was read from
int holdingPos = 0;
// whether picked up a character in our hand
boolean isHolding = false;
int rpos = 0;
while (rpos < x.length)
{ // Here => moved out everything up to rpos
// and put in place with top bit set to mark new occupant
if (!isHolding)
{ // advance read pointer to read new character
char here = x[rpos];
holdingPos = rpos++;
if ((here & HIGH_ORDER_BIT) != 0)
{
// already dealt with
continue;
}
int targetPos = newFromOld(holdingPos, x.length);
// pick up char at target position
holding = x[targetPos];
// place new character, and mark as new
x[targetPos] = (char)(here | HIGH_ORDER_BIT);
// Now holding a character that needs to be put in its
// correct place
isHolding = true;
holdingPos = targetPos;
}
int targetPos = newFromOld(holdingPos, x.length);
char here = x[targetPos];
if ((here & HIGH_ORDER_BIT) != 0)
{ // back to where we picked up a character to hold
isHolding = false;
continue;
}
x[targetPos] = (char)(holding | HIGH_ORDER_BIT);
holding = here;
holdingPos = targetPos;
}
for (int i = 0; i < x.length; i++)
{
x[i] ^= HIGH_ORDER_BIT;
}
System.out.println("Result is " + new String(x));
}
}

These days, if I asked someone that question, what I'm looking for them to write on the whiteboard first is:
assertEquals("1a2b3c4d5e6f",funnySort("123456abcdef"));
...
and then maybe ask for more examples.
(And then, depending, if the task is to interleave numbers & letters, I think you can do it with two walking-pointers, indexLetter and indexDigit, and advance them across swapping as needed til you reach the end.)

In your recursive solution, why don't you just test whether n/2 % 2 == 0 (i.e. n % 4 == 0) and treat the two situations differently?
As templatetypedef commented, your recursion cannot be in-place.
But here is a solution (not in place) using the way you wanted to structure your recursion:
def f(s):
    n = len(s)
    if n <= 2:  # base case
        return s
    elif n % 4 == 0:  # both halves split evenly
        return f(s[:n//4] + s[n//2:3*n//4]) + f(s[n//4:n//2] + s[3*n//4:])
    else:  # otherwise (n - 2) % 4 == 0
        return s[0] + s[n//2] + f(s[1:n//2] + s[n//2+1:])

Here we go. Recursive, cuts it in half each time, and in-place. Uses the approach outlined by @Chris Mennie. Getting the splitting right was tricky. A lot longer than Python, innit?
/* In-place, divide-and-conquer, recursive riffle-shuffle of strings;
* even length only. No wide characters or Unicode; old school. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void testrif(const char *s);
void riffle(char *s);
void rif_recur(char *s, size_t len);
void swap(char *s, size_t midpt, size_t len);
void flip(char *s, size_t len);
void if_odd_quit(const char *s);
int main(void)
{
testrif("");
testrif("a1");
testrif("ab12");
testrif("abc123");
testrif("abcd1234");
testrif("abcde12345");
testrif("abcdef123456");
return 0;
}
void testrif(const char *s)
{
char mutable[20];
strcpy(mutable, s);
printf("'%s'\n", mutable);
riffle(mutable);
printf("'%s'\n\n", mutable);
}
void riffle(char *s)
{
if_odd_quit(s);
rif_recur(s, strlen(s));
}
void rif_recur(char *s, size_t len)
{
/* Turn, e.g., "abcde12345" into "abc123de45", then recurse. */
size_t pivot = len / 2;
size_t half = (pivot + 1) / 2;
size_t twice = half * 2;
if (len < 4)
return;
swap(s + half, pivot - half, pivot);
rif_recur(s, twice);
rif_recur(s + twice, len - twice);
}
void swap(char *s, size_t midpt, size_t len)
{
/* Swap s[0..midpt] with s[midpt..len], in place. Algorithm from
* Programming Pearls, Chapter 2. */
flip(s, midpt);
flip(s + midpt, len - midpt);
flip(s, len);
}
void flip(char *s, size_t len)
{
/* Reverse order of characters in s, in place. */
char *p, *q, tmp;
if (len < 2)
return;
for (p = s, q = s + len - 1; p < q; p++, q--) {
tmp = *p;
*p = *q;
*q = tmp;
}
}
void if_odd_quit(const char *s)
{
if (strlen(s) % 2) {
fputs("String length is odd; aborting.\n", stderr);
exit(1);
}
}

By comparing 123456abcdef and 1a2b3c4d5e6f we can note that only the first and the last characters are in their correct position. We can also note that for each of the remaining n-2 characters we can compute the correct position directly from the original position. Each character moves there, and the element that was there was surely not in its correct position either, so it in turn has to replace another one. By doing n-2 such steps all the elements get to their correct positions. Note that this only works when the n-2 misplaced positions form a single cycle. That holds for n = 12, but not for every n: for n = 8 the walk returns to position 1 after three steps and leaves positions 3, 5 and 6 untouched.
void funny_sort(char* arr, int n){
    if (n < 3) return;       // nothing to reorder
    int pos = 1;             // first unordered element
    char aux = arr[pos];
    for (int iter = 0; iter < n-2; iter++) { // n-2 unordered elements
        pos = (pos < n/2) ? pos*2 : (pos-n/2)*2+1; // correct pos for aux
        char tmp = arr[pos]; // swap aux with the element at its target position
        arr[pos] = aux;
        aux = tmp;
    }
}

Score each digit as its numerical value. Score each letter as a = 1.5, b = 2.5 c = 3.5 etc. Run an insertion sort of the string based on the score of each character.
[ETA] Simple scoring won't work so use two pointers and reverse the piece of the string between the two pointers. One pointer starts at the front of the string and advances one step each cycle. The other pointer starts in the middle of the string and advances every second cycle.
123456abcdef
 ^    ^
1a65432bcdef
  ^   ^
1a23456bcdef
   ^   ^
1a2b6543cdef
    ^   ^

Find the first un-repeated character in a string

What is the quickest way to find the first character which only appears once in a string?

It has to be at least O(n) because you don't know if a character will be repeated until you've read all characters.
So you can iterate over the characters and append each character to a list the first time you see it, and separately keep a count of how many times you've seen it (in fact the only values that matter for the count is "0", "1" or "more than 1").
When you reach the end of the string you just have to find the first character in the list that has a count of exactly one.
Example code in Python:
from collections import defaultdict
def first_non_repeated_character(s):
    counts = defaultdict(int)
    l = []
    for c in s:
        counts[c] += 1
        if counts[c] == 1:
            l.append(c)
    for c in l:
        if counts[c] == 1:
            return c
    return None
This runs in O(n).

I see that people have posted some delightful answers below, so I'd like to offer something more in-depth.
An idiomatic solution in Ruby
We can find the first un-repeated character in a string like so:
def first_unrepeated_char string
  # &. returns nil instead of raising when every character repeats
  string.each_char.tally.find { |_, n| n == 1 }&.first
end
How does Ruby accomplish this?
Reading Ruby's source
Let's break down the solution and consider what algorithms Ruby uses for each step.
First we call each_char on the string. This creates an enumerator which allows us to visit the string one character at a time. This is complicated by the fact that Ruby handles Unicode characters, so each value we get from the enumerator can be a variable number of bytes. If we know our input is ASCII or similar, we could use each_byte instead.
The each_char method is implemented like so:
rb_str_each_char(VALUE str)
{
RETURN_SIZED_ENUMERATOR(str, 0, 0, rb_str_each_char_size);
return rb_str_enumerate_chars(str, 0);
}
In turn, rb_str_enumerate_chars is implemented as:
rb_str_enumerate_chars(VALUE str, VALUE ary)
{
VALUE orig = str;
long i, len, n;
const char *ptr;
rb_encoding *enc;
str = rb_str_new_frozen(str);
ptr = RSTRING_PTR(str);
len = RSTRING_LEN(str);
enc = rb_enc_get(str);
if (ENC_CODERANGE_CLEAN_P(ENC_CODERANGE(str))) {
for (i = 0; i < len; i += n) {
n = rb_enc_fast_mbclen(ptr + i, ptr + len, enc);
ENUM_ELEM(ary, rb_str_subseq(str, i, n));
}
}
else {
for (i = 0; i < len; i += n) {
n = rb_enc_mbclen(ptr + i, ptr + len, enc);
ENUM_ELEM(ary, rb_str_subseq(str, i, n));
}
}
RB_GC_GUARD(str);
if (ary)
return ary;
else
return orig;
}
From this we can see that it calls rb_enc_mbclen (or its fast version) to get the length (in bytes) of the next character in the string so that it can iterate the next step. By lazily iterating over a string, reading just one character at a time, we end up doing just one full pass over the input string as tally consumes the iterator.
Tally is then implemented like so:
static void
tally_up(VALUE hash, VALUE group)
{
VALUE tally = rb_hash_aref(hash, group);
if (NIL_P(tally)) {
tally = INT2FIX(1);
}
else if (FIXNUM_P(tally) && tally < INT2FIX(FIXNUM_MAX)) {
tally += INT2FIX(1) & ~FIXNUM_FLAG;
}
else {
tally = rb_big_plus(tally, INT2FIX(1));
}
rb_hash_aset(hash, group, tally);
}
static VALUE
tally_i(RB_BLOCK_CALL_FUNC_ARGLIST(i, hash))
{
ENUM_WANT_SVALUE();
tally_up(hash, i);
return Qnil;
}
Here, tally_i is declared with RB_BLOCK_CALL_FUNC_ARGLIST and is called once per element; each call invokes tally_up, which updates the tally hash.
Rough time & memory analysis
The each_char method doesn't allocate an array to eagerly hold the characters of the string, so it has a small constant memory overhead. When we tally the characters, we allocate a hash and put our tally data into it which in the worst case scenario can take up as much memory as the input string times some constant factor.
Time-wise, tally does a full scan of the string, and calling find to locate the first non-repeated character scans the hash again; each of these carries O(n) worst-case complexity.
However, tally also updates a hash on every iteration. Updating the hash on every character can be as slow as O(n) again, so the worst case complexity of this Ruby solution is perhaps O(n^2).
However, under reasonable assumptions, updating a hash has an O(1) complexity, so we can expect the average case amortized to look like O(n).
My old accepted answer in Python
You can't know that the character is un-repeated until you've processed the whole string, so my suggestion would be this:
def first_non_repeated_character(string):
    chars = []
    repeated = []
    for character in string:
        if character in chars:
            chars.remove(character)
            repeated.append(character)
        else:
            if character not in repeated:
                chars.append(character)
    if len(chars):
        return chars[0]
    else:
        return False
Edit: originally posted code was bad, but this latest snippet is Certified To Work On Ryan's Computer™.

Why not use a heap based data structure such as a minimum priority queue. As you read each character from the string, add it to the queue with a priority based on the location in the string and the number of occurrences so far. You could modify the queue to add priorities on collision so that the priority of a character is the sum of the number appearances of that character. At the end of the loop, the first element in the queue will be the least frequent character in the string and if there are multiple characters with a count == 1, the first element was the first unique character added to the queue.

Here is another fun way to do it. Counter requires Python2.7 or Python3.1
>>> from collections import Counter
>>> def first_non_repeated_character(s):
...     return min((k for k,v in Counter(s).items() if v<2), key=s.index)
...
>>> first_non_repeated_character("aaabbbcddd")
'c'
>>> first_non_repeated_character("aaaebbbcddd")
'e'

Lots of answers are attempting O(n) but are forgetting the actual costs of inserting and removing from the lists/associative arrays/sets they're using to track.
If you can assume that a char is a single byte, then you use a simple array indexed by the char and keep a count in it. This is truly O(n) because the array accesses are guaranteed O(1), and the final pass over the array to find the first element with 1 is constant time (because the array has a small, fixed size).
If you can't assume that a char is a single byte, then I would propose sorting the string and then doing a single pass checking adjacent values. This would be O(n log n) for the sort plus O(n) for the final pass. So it's effectively O(n log n), which is better than O(n^2). Also, it has virtually no space overhead, which is another problem with many of the answers that are attempting O(n).
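A small Python sketch of the single-byte case, assuming the input is a bytes object. The function name and the extra first_index table are my own additions; the table is there so that the final pass really does run over the fixed 256-entry array and not over the string again:

```python
def first_unique_byte(data):
    """Return the first byte value (as a 1-char str) that occurs exactly
    once in data, or None if there is none. data must be bytes."""
    counts = [0] * 256        # occurrences per byte value
    first_index = [-1] * 256  # position of the first occurrence
    for i, b in enumerate(data):
        if counts[b] == 0:
            first_index[b] = i
        counts[b] += 1
    best = None
    for value in range(256):  # constant-size pass over the table
        if counts[value] == 1:
            if best is None or first_index[value] < first_index[best]:
                best = value
    return None if best is None else chr(best)
```

For example, first_unique_byte(b"aaabbbcddd") gives "c", and an input such as b"aabb" gives None.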

Counter requires Python2.7 or Python3.1
>>> from collections import Counter
>>> def first_non_repeated_character(s):
...     counts = Counter(s)
...     for c in s:
...         if counts[c]==1:
...             return c
...     return None
...
>>> first_non_repeated_character("aaabbbcddd")
'c'
>>> first_non_repeated_character("aaaebbbcddd")
'e'

Refactoring a solution proposed earlier, so that it does not need an extra list. This goes over the string twice, so it also takes O(n), like the original solution.
from collections import defaultdict
def first_non_repeated_character(s):
    counts = defaultdict(int)
    for c in s:
        counts[c] += 1
    for c in s:
        if counts[c] == 1:
            return c
    return None

The following is a Ruby implementation of finding the first nonrepeated character of a string:
def first_non_repeated_character(string)
string1 = string.split('')
string2 = string.split('')
string1.each do |let1|
counter = 0
string2.each do |let2|
if let1 == let2
counter+=1
end
end
if counter == 1
return let1
break
end
end
end
p first_non_repeated_character('dont doddle in the forest')
And here is a JavaScript implementation of the same style function:
var first_non_repeated_character = function (string) {
var string1 = string.split('');
var string2 = string.split('');
var single_letters = [];
for (var i = 0; i < string1.length; i++) {
var count = 0;
for (var x = 0; x < string2.length; x++) {
if (string1[i] == string2[x]) {
count++
}
}
if (count == 1) {
return string1[i];
}
}
}
console.log(first_non_repeated_character('dont doddle in the forest'));
console.log(first_non_repeated_character('how are you today really?'));
In both cases I used a counter: if a letter is not matched anywhere else in the string, it occurs only once, so I just count its occurrences.

I think this should do it in C. This operates in O(n) time with no ambiguity about order of insertion and deletion operators. This is a counting sort (simplest form of a bucket sort, which itself is the simple form of a radix sort).
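#include <string.h>
unsigned char find_first_unique(const unsigned char *string)
{
    int chars[256];
    size_t i;
    memset(chars, 0, sizeof(chars));
    /* First pass: count how often each byte value occurs. */
    for (i = 0; string[i]; i++)
    {
        chars[string[i]]++;
    }
    /* Second pass: return the first byte whose count is exactly one. */
    for (i = 0; string[i]; i++)
    {
        if (chars[string[i]] == 1) return string[i];
    }
    return 0;
}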

In Ruby:
(Original Credit: Andrew A. Smith)
x = "a huge string in which some characters repeat"
def first_unique_character(s)
s.each_char.detect { |c| s.count(c) == 1 }
end
first_unique_character(x)
=> "u"

def first_non_repeated_character(string):
    chars = []
    repeated = []
    for character in string:
        if character in repeated:
            continue  # already known to repeat; discard it
        elif character in chars:
            chars.remove(character)
            repeated.append(character)
        else:
            chars.append(character)
    if len(chars):
        return chars[0]
    else:
        return False

The other JavaScript solutions here are quite C-style; here is a more JavaScript-style solution.
function firstNonRepeatedCharacter(string) {
var arr = string.split("");
var occurences = {};
var tmp;
var lowestindex = string.length+1;
arr.forEach( function(c){
tmp = c;
if( typeof occurences[tmp] == "undefined")
occurences[tmp] = tmp;
else
occurences[tmp] += tmp;
});
for(var p in occurences) {
if(occurences[p].length == 1)
lowestindex = Math.min(lowestindex, string.indexOf(p));
}
if(lowestindex > string.length)
return null;
return string[lowestindex];
}

In C, this is almost Shlemiel the Painter's algorithm: O(n^2) in the worst case.
But it will outperform "better" algorithms for reasonably sized strings because the constant factor is so small. It can also easily be adapted to tell you the location of the first non-repeating character.
char FirstNonRepeatedChar(char * psz)
{
for (int ii = 0; psz[ii] != 0; ++ii)
{
for (int jj = 0; ; ++jj)
{
// if we hit the end of string, then we found a non-repeat character.
//
if (psz[jj] == 0)
return psz[ii]; // this character doesn't repeat
// if we found a repeat character (before or after ii), we can stop looking.
//
if (jj != ii && psz[ii] == psz[jj])
break;
}
}
return 0; // there were no non-repeating characters.
}
edit: this code is assuming you don't mean consecutive repeating characters.

Here's an implementation in Perl (version >=5.10) that doesn't care whether the repeated characters are consecutive or not:
use strict;
use warnings;
foreach my $word(@ARGV)
{
my @distinct_chars;
my %char_counts;
my @chars=split(//,$word);
foreach (@chars)
{
push @distinct_chars,$_ unless $_~~@distinct_chars;
$char_counts{$_}++;
}
my $first_non_repeated="";
foreach(@distinct_chars)
{
if($char_counts{$_}==1)
{
$first_non_repeated=$_;
last;
}
}
if(length($first_non_repeated))
{
print "For \"$word\", the first non-repeated character is '$first_non_repeated'.\n";
}
else
{
print "All characters in \"$word\" are repeated.\n";
}
}
Storing this code in a script (which I named non_repeated.pl) and running it on a few inputs produces:
jmaney> perl non_repeated.pl aabccd "a huge string in which some characters repeat" abcabc
For "aabccd", the first non-repeated character is 'b'.
For "a huge string in which some characters repeat", the first non-repeated character is 'u'.
All characters in "abcabc" are repeated.

Here's a possible solution in ruby without using Array#detect (as in this answer). Using Array#detect makes it too easy, I think.
ALPHABET = %w(a b c d e f g h i j k l m n o p q r s t u v w x y z)
def fnr(s)
unseen_chars = ALPHABET.dup
seen_once_chars = []
s.each_char do |c|
if unseen_chars.include?(c)
unseen_chars.delete(c)
seen_once_chars << c
elsif seen_once_chars.include?(c)
seen_once_chars.delete(c)
end
end
seen_once_chars.first
end
Seems to work for some simple examples:
fnr "abcdabcegghh"
# => "d"
fnr "abababababababaqababa"
# => "q"
Suggestions and corrections are very much appreciated!

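Try this code (C#). A second string tracks characters already known to repeat, so a character that occurs three times is not added back:
public static String findFirstUnique(String str)
{
    String unique = "";    // characters seen exactly once, in order
    String repeated = "";  // characters seen more than once
    foreach (char ch in str)
    {
        String s = ch.ToString();
        if (repeated.Contains(s)) continue;
        if (unique.Contains(s))
        {
            unique = unique.Replace(s, "");
            repeated += s;
        }
        else unique += s;
    }
    // return an empty string when every character repeats
    return unique.Length > 0 ? unique[0].ToString() : "";
}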

In Mathematica one might write this:
string = "conservationist deliberately treasures analytical";
Cases[Gather @ Characters @ string, {_}, 1, 1][[1]]
{"v"}

This snippet is in JavaScript:
var string = "tooth";
var hash = {};
// First pass: count each character.
for (var i = 0; i < string.length; i++) {
  if (hash[string[i]] !== undefined) {
    hash[string[i]] = hash[string[i]] + 1;
  } else {
    hash[string[i]] = 1;
  }
}
// Second pass: print the first character counted once, then stop.
for (i = 0; i < string.length; i++) {
  if (hash[string[i]] === 1) {
    console.info(string[i]);
    break;
  }
}
// prints "h"

A different approach:
Scan each character of the string and build a count array that stores how many times each character occurs.
Then scan the string again from the first character and print the first one whose count is 1.
C code
-----
#include <stdio.h>
int main(int argc, char *argv[])
{
    const unsigned char *t_p;
    int count[256] = {0}; /* int counters indexed by unsigned char: no overflow, no negative index */
    if (argc < 2)
        return 1; /* no input string given */
    for (t_p = (const unsigned char *)argv[1]; *t_p != '\0'; t_p++)
        count[*t_p]++;
    for (t_p = (const unsigned char *)argv[1]; *t_p != '\0'; t_p++)
    {
        if (count[*t_p] == 1)
        {
            printf("Element is %c\n", *t_p);
            break;
        }
    }
    return 0;
}

This version only detects characters that repeat consecutively. For the input aabbcddeef the output is c.
char FindUniqueChar(char *a)
{
int i=0;
bool repeat=false;
while(a[i] != '\0')
{
if (a[i] == a[i+1])
{
repeat = true;
}
else
{
if(!repeat)
{
cout<<a[i];
return a[i];
}
repeat=false;
}
i++;
}
return a[i];
}

Here is another approach: we could have an array that stores, for each character, its count and the index of its first occurrence. After filling the array, we just traverse it, find the minimum index whose count is 1, and return str[index].
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <climits>
using namespace std;
#define No_of_chars 256
//store the count and the index where the char first appear
typedef struct countarray
{
int count;
int index;
}countarray;
//returns the count array
countarray *getcountarray(char *str)
{
countarray *count;
count=new countarray[No_of_chars];
for(int i=0;i<No_of_chars;i++)
{
count[i].count=0;
count[i].index=-1;
}
for(int i=0;*(str+i);i++)
{
(count[*(str+i)].count)++;
if(count[*(str+i)].count==1) //if count==1 then update the index
count[*(str+i)].index=i;
}
return count;
}
char firstnonrepeatingchar(char *str)
{
countarray *array;
array = getcountarray(str);
int result = INT_MAX;
for(int i=0;i<No_of_chars;i++)
{
if(array[i].count==1 && result > array[i].index)
result = array[i].index;
}
delete[] (array);
//INT_MAX means no character occurred exactly once; do not index the string with it
if(result==INT_MAX)
return '\0';
return (str[result]);
}
int main()
{
char str[] = "geeksforgeeks";
cout<<"First non repeating character is "<<firstnonrepeatingchar(str)<<endl;
return 0;
}
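
The count-plus-first-index idea above can also be sketched in Python. This is a rough illustration under assumptions: the function name first_unique_by_index is invented for the example, and a dict replaces the fixed 256-entry array.

```python
def first_unique_by_index(s):
    """Record (count, first index) per character, then pick the smallest index with count 1."""
    info = {}  # character -> [count, index of first occurrence]
    for i, ch in enumerate(s):
        if ch in info:
            info[ch][0] += 1
        else:
            info[ch] = [1, i]
    # Traverse the table (not the string) and collect first indices of characters seen once.
    candidates = [index for count, index in info.values() if count == 1]
    if not candidates:
        return None  # no character occurs exactly once
    return s[min(candidates)]

print(first_unique_by_index("geeksforgeeks"))
```

The second step looks only at the table, so its cost depends on the number of distinct characters rather than on the length of the string.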

Function:
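This C# function uses a hash table (Dictionary) and runs in O(n): one pass over the word to count, then one more pass to find the first character with a count of 1 (about 2n steps in the worst case).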
private static string FirstNoRepeatingCharacter(string aword)
{
Dictionary<string, int> dic = new Dictionary<string, int>();
for (int i = 0; i < aword.Length; i++)
{
if (!dic.ContainsKey(aword.Substring(i, 1)))
dic.Add(aword.Substring(i, 1), 1);
else
dic[aword.Substring(i, 1)]++;
}
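// Scan the word again instead of the dictionary: Dictionary does not guarantee enumeration order.
for (int i = 0; i < aword.Length; i++)
{
string key = aword.Substring(i, 1);
if (dic[key] == 1) return key;
}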
return string.Empty;
}
Example:
string aword = "TEETER";
Console.WriteLine(FirstNoRepeatingCharacter(aword)); //print: R

I have two strings i.e. 'unique' and 'repeated'. Every character appearing for the first time, gets added to 'unique'. If it is repeated for the second time, it gets removed from 'unique' and added to 'repeated'. This way, we will always have a string of unique characters in 'unique'.
Complexity: O(n) for a fixed alphabet, since 'unique' and 'repeated' never grow beyond the alphabet size.
public void firstUniqueChar(String str){
    String unique = "";
    String repeated = "";
    str = str.toLowerCase();
    for(int i=0; i<str.length(); i++){
        String ch = str.substring(i, i+1);
        if(repeated.contains(ch))
            continue; // already known to repeat
        if(unique.contains(ch)){
            // replace() treats ch literally; replaceAll() would interpret it as a regex
            unique = unique.replace(ch, "");
            repeated = repeated+ch;
        }
        else
            unique = unique+ch;
    }
    if(unique.isEmpty())
        System.out.println("No unique character found");
    else
        System.out.println(unique.charAt(0));
}
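
The 'unique' / 'repeated' bookkeeping described above can be sketched in Python with constant-time lookups. This is only a sketch under assumptions: the name first_unique_single_pass is made up, it relies on dicts keeping insertion order (Python 3.7+), and unlike the Java version it does not lowercase the input.

```python
def first_unique_single_pass(s):
    """Single pass: keep characters seen once in order, and a set of known repeats."""
    unique = {}       # dict used as an ordered set of characters seen exactly once
    repeated = set()  # characters seen more than once
    for ch in s:
        if ch in repeated:
            continue              # already known to repeat
        if ch in unique:
            del unique[ch]        # second occurrence: move it to 'repeated'
            repeated.add(ch)
        else:
            unique[ch] = None     # first occurrence
    # The first remaining key, if any, is the first non-repeated character.
    return next(iter(unique), None)

print(first_unique_single_pass("aabbcddeef"))
```

Keeping a separate set of repeats is what prevents a character that occurs three times from being added back to 'unique'.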

The following code is in C# and runs in O(n) time for a fixed-size alphabet.
using System;
using System.Linq;
using System.Text;
namespace SomethingDigital
{
class FirstNonRepeatingChar
{
public static void Main()
{
String input = "geeksforgeeksandgeeksquizfor";
char[] str = input.ToCharArray();
bool[] b = new bool[256];
String unique1 = "";
String unique2 = "";
foreach (char ch in str)
{
if (!unique1.Contains(ch))
{
unique1 = unique1 + ch;
unique2 = unique2 + ch;
}
else
{
unique2 = unique2.Replace(ch.ToString(), "");
}
}
if (unique2 != "")
{
Console.WriteLine(unique2[0].ToString());
Console.ReadLine();
}
else
{
Console.WriteLine("No non repeated string");
Console.ReadLine();
}
}
}
}

The following solution is an elegant way to find the first unique character within a string using the features introduced as part of Java 8. It first creates a map that counts the number of occurrences of each character, then uses this map to find the first character that occurs only once. This runs in O(N) time.
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
// Runs in O(N) time and uses lambdas and the stream API from Java 8
// Also, it is only three lines of code!
private static String findFirstUniqueCharacterPerformantWithLambda(String inputString) {
// convert the input string into a list of characters
final List<String> inputCharacters = Arrays.asList(inputString.split(""));
// first, construct a map to count the number of occurrences of each character
final Map<Object, Long> characterCounts = inputCharacters
.stream()
.collect(groupingBy(s -> s, counting()));
// then, find the first unique character by consulting the count map
return inputCharacters
.stream()
.filter(s -> characterCounts.get(s) == 1)
.findFirst()
.orElse(null);
}

Here is one more solution with O(n) time complexity. It assumes ASCII input, since the array has 128 entries.
public void findUnique(String string) {
ArrayList<Character> uniqueList = new ArrayList<>();
int[] chatArr = new int[128];
for (int i = 0; i < string.length(); i++) {
Character ch = string.charAt(i);
if (chatArr[ch] != -1) {
chatArr[ch] = -1;
uniqueList.add(ch);
} else {
uniqueList.remove(ch);
}
}
if (uniqueList.size() == 0) {
System.out.println("No unique character found!");
} else {
System.out.println("First unique character is :" + uniqueList.get(0));
}
}

I read through the answers, but did not see any like mine, I think this answer is very simple and fast, am I wrong?
def first_unique(s):
    repeated = []
    while s:
        if s[0] not in s[1:] and s[0] not in repeated:
            return s[0]
        else:
            repeated.append(s[0])
            s = s[1:]
    return None
Test:
(first_unique('abdcab') == 'd', first_unique('aabbccdad') == None, first_unique('') == None, first_unique('a') == 'a')

Question : First Unique Character of a String
This is the simplest solution.
public class Test4 {
public static void main(String[] args) {
String a = "GiniGinaProtijayi";
firstUniqCharindex(a);
}
public static void firstUniqCharindex(String a) {
int[] count = new int[256];
for (int i = 0; i < a.length(); i++) {
count[a.charAt(i)]++;
}
int index = -1;
for (int i = 0; i < a.length(); i++) {
if (count[a.charAt(i)] == 1) {
index = i;
break;
} // if
}
System.out.println(index);// output => 8
System.out.println(a.charAt(index)); //output => P
}// end1
}
In Python:
def firstUniqChar(a):
    count = [0] * 256
    for i in a:
        count[ord(i)] += 1
    element = ""
    for items in a:
        if count[ord(items)] == 1:
            element = items
            break
    return element
a = "GiniGinaProtijayi"
print(firstUniqChar(a))  # output is P
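Using Java 8:
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
public class Test2 {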
public static void main(String[] args) {
String a = "GiniGinaProtijayi";
Map<Character, Long> map = a.chars()
.mapToObj(
ch -> Character.valueOf((char) ch)
).collect(
Collectors.groupingBy(
Function.identity(),
LinkedHashMap::new,
Collectors.counting()));
System.out.println("MAP => " + map);
// {G=2, i=5, n=2, a=2, P=1, r=1, o=1, t=1, j=1, y=1}
Character chh = map
.entrySet()
.stream()
.filter(entry -> entry.getValue() == 1L)
.map(entry -> entry.getKey())
.findFirst()
.get();
System.out.println("First Non Repeating Character => " + chh);// P
}// main
}

How about using a suffix tree for this case? The first unrepeated character will be the first character of the longest suffix string with the least depth in the tree.

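Create two lists:
unique list (UL) - holds only characters seen exactly once
non-unique list (NUL) - holds only repeated characters
for(char c in str) {
    if(nul.contains(c)){
        //do nothing
    }else if(ul.contains(c)){
        ul.remove(c);
        nul.add(c);
    }else{
        ul.add(c); //first occurrence goes into the unique list
    }
}
//the first element of ul, if any, is the first non-repeated character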
