Finding every possible word out of a bigger word [closed] - algorithm

Closed 7 years ago.
Hi, I'm looking for an algorithm to extract every possible word from a single word in C++.
For example, from the word "overflow" I can get these: "love", "flow", "for", "row", "over"...
So how can I efficiently get only the valid English words?
Note: I have a dictionary, a big word list.

I can't think of a way to do this without brute-forcing it with all the permutations.
Something like this:
#include <string>
#include <algorithm>
int main()
{
using size_type = std::string::size_type;
    std::string word = "overflow";
    // next_permutation only visits every permutation when it starts from sorted order
    std::sort(word.begin(), word.end());
    // examine every permutation of the letters contained in word
    do
    {
        // examine each non-empty prefix of this permutation
        for(size_type s = 1; s <= word.size(); ++s)
        {
            std::string sub = word.substr(0, s);
            // look up sub in a dictionary here...
        }
    } while(std::next_permutation(word.begin(), word.end()));
return 0;
}
I can think of 2 ways to speed this up.
1) Keep a check on substrings of a given permutation already tried to avoid unnecessary dictionary lookups (std::set or std::unordered_set maybe).
2) Cache popular results, keeping the most frequently requested words (std::map or std::unordered_map perhaps).
NOTE:
It turns out that even after adding caching at various levels, this is indeed a very slow algorithm for larger words.
However, the following program uses a much faster algorithm:
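#include <set>
#include <string>
#include <cctype>
#include <cstring>
#include <fstream>
#include <iostream>
#include <algorithm>
#define con(m) std::cout << m << '\n'
std::string& lower(std::string& s)
{
// std::tolower must be given an unsigned char value to avoid undefined behaviour
std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c){ return static_cast<char>(std::tolower(c)); });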
return s;
}
std::string& trim(std::string& s)
{
static const char* t = " \t\n\r";
s.erase(s.find_last_not_of(t) + 1);
s.erase(0, s.find_first_not_of(t));
return s;
}
void usage()
{
con("usage: anagram [-p] -d <word-file> -w <word>");
con(" -p - (optional) find only perfect anagrams.");
con(" -d <word-file> - (required) A file containing a list of possible words.");
con(" -w <word> - (required) The word to find anagrams of in the <word-file>.");
}
int main(int argc, char* argv[])
{
std::string word;
std::string wordfile;
bool perfect_anagram = false;
for(int i = 1; i < argc; ++i)
{
if(!strcmp(argv[i], "-p"))
perfect_anagram = true;
else if(!strcmp(argv[i], "-d"))
{
if(!(++i < argc))
{
usage();
return 1;
}
wordfile = argv[i];
}
else if(!strcmp(argv[i], "-w"))
{
if(!(++i < argc))
{
usage();
return 1;
}
word = argv[i];
}
}
if(wordfile.empty() || word.empty())
{
usage();
return 1;
}
std::ifstream ifs(wordfile);
if(!ifs)
{
con("ERROR: opening dictionary: " << wordfile);
return 1;
}
// for analyzing the relevant characters and their
// relative abundance
std::string sorted_word = lower(word);
std::sort(sorted_word.begin(), sorted_word.end());
std::string unique_word = sorted_word;
unique_word.erase(std::unique(unique_word.begin(), unique_word.end()), unique_word.end());
// This is where the successful words will go
// using a set to ensure uniqueness
std::set<std::string> found;
// plow through the dictionary
// (storing it in memory would increase performance)
std::string line;
while(std::getline(ifs, line))
{
// quick rejects
if(trim(line).size() < 2)
continue;
if(perfect_anagram && line.size() != word.size())
continue;
if(line.size() > word.size())
continue;
// This may be needed if dictionary file contains
// upper-case words you want to match against
// such as acronyms and proper nouns
// lower(line);
// for analyzing the relevant characters and their
// relative abundance
std::string sorted_line = line;
std::sort(sorted_line.begin(), sorted_line.end());
std::string unique_line = sorted_line;
unique_line.erase(std::unique(unique_line.begin(), unique_line.end()), unique_line.end());
// closer rejects
if(unique_line.find_first_not_of(unique_word) != std::string::npos)
continue;
if(perfect_anagram && sorted_word != sorted_line)
continue;
// final check if candidate line from the dictionary
// contains only the letters (in the right quantity)
// needed to be an anagram
bool match = true;
for(auto c: unique_line)
{
auto n1 = std::count(sorted_word.begin(), sorted_word.end(), c);
auto n2 = std::count(sorted_line.begin(), sorted_line.end(), c);
if(n1 < n2)
{
match = false;
break;
}
}
if(!match)
continue;
// we found a good one
found.insert(std::move(line));
}
con("Found: " << found.size() << " word" << (found.size() == 1?"":"s"));
for(auto&& word: found)
con(word);
}
Explanation:
This algorithm works by concentrating on known good patterns (dictionary words) rather than the vast number of bad patterns generated by the permutation solution.
So it trundles through the dictionary looking for words to match the search term. It rejects words through a series of tests that become more precise as the obvious mismatches are eliminated.
The crux of the logic is to check that each surviving dictionary word uses only letters that occur in the search term. To do this, the code uses std::sort and std::unique to reduce both the search term and the dictionary word to strings containing exactly one of each of their letters, and rejects the word if its reduced string contains a letter that is missing from the search term's. If the word survives this test, the code goes on to check that the search term contains at least as many of each letter as the dictionary word does. This uses std::count().
A perfect_anagram is detected only if all the letters match in the dictionary word and the search term. Otherwise it is sufficient that the search term contains at least enough of the correct letters.
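The letter-quantity test described above can also be expressed with a table of letter counts instead of sorted strings. The following is only a sketch of that idea, not the code from the answer; the function name canSpell is made up for this example, and it assumes single-byte characters.

```cpp
#include <array>
#include <cctype>
#include <string>

// Hypothetical helper: true if `candidate` can be spelled with the letters of
// `source`, using each letter of `source` at most once (case-insensitive).
bool canSpell(const std::string& source, const std::string& candidate) {
    std::array<int, 256> available{};  // one counter per possible byte value
    for (unsigned char c : source) {
        ++available[static_cast<unsigned char>(std::tolower(c))];
    }
    for (unsigned char c : candidate) {
        // Use up one letter; going below zero means the source ran out of it.
        if (--available[static_cast<unsigned char>(std::tolower(c))] < 0) {
            return false;
        }
    }
    return true;
}
```

Calling canSpell("overflow", line) for every dictionary line would play the same role as the unique/count checks in the program above, with one pass over each string.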

Related

Boost R tree node remove

I want to remove the nearest point node, but only if it lies within a distance limit.
I think my code is not efficient.
How can I improve it?
for (int j = 0; j < 3; j++) {
bgi::rtree< value, bgi::quadratic<16> > nextRT;
// search for nearest neighbours
std::vector<value> matchPoints;
vector<pair<float, float>> pointList;
for (unsigned i = 0; i < keypoints[j + 1].size(); ++i) {
point p = point(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y);
nextRT.insert(std::make_pair(p, i));
RT.query(bgi::nearest(p, 1), std::back_inserter(matchPoints));
if (bg::distance(p, matchPoints.back().first) > 3) matchPoints.pop_back();
else {
pointList.push_back(make_pair(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y));
RT.remove(matchPoints.back());
}
}
}
I am also curious about the contents of matchPoints.
After the query function runs, there are values in matchPoints.
The first element is a point, and the second looks like some kind of index number.
I don't know what the second one means.
Q. I am also curious about the contents of matchPoints. After the query function runs, there are values in matchPoints. The first element is a point, and the second looks like some kind of index number. I don't know what the second one means.
Well, that has to be a data member of your value type. What is in it depends solely on what you inserted into the rtree. It wouldn't surprise me if it was an ID that describes the geometry.
Since you do not even show the type of RT, we can only assume it is the same as nextRT. If so, we can assume that value is likely a pair like pair<point, unsigned> (because of what you insert). So, look at what got inserted for the unsigned value of the pair in RT...
Q.
if (bg::distance(p, matchPoints.back().first) > 3) matchPoints.pop_back();
else {
pointList.push_back(make_pair(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y));
RT.remove(matchPoints.back());
}
Simplify your code! Distilling the requirements:
It looks to me that for 4 sets of "key points", you want to create 4 rtrees containing all those key points with sequentially increasing ids.
Also, for those 4 sets of "key points", you want to create a list of the key points for which a geometry can be found within a radius of 3.
As a side effect, you want to remove those closely matching geometries from the original rtree RT.
DECISION: Because these tasks are independent, let's do them separately:
// making up types that match the usage in your code:
struct keypoint_t { point pt; };
std::array<std::vector<keypoint_t>, 4> keypoints;
Now, let's do the tasks:
Note how RT is not used here:
for (auto const& current_key_set : keypoints) {
bgi::rtree< value, bgi::quadratic<16> > nextRT; // use a better name...
int i = 0;
for (auto const& kpd : current_key_set)
nextRT.insert(std::make_pair(kpd.pt, i++));
}
Creating the vector containing matched key-points (those with near geometries in RT):
for (auto const& current_key_set : keypoints) {
std::vector<point> matched_key_points;
for (auto const& kpd : current_key_set) {
point p = kpd.pt;
value match;
if (!RT.query(bgi::nearest(p, 1), &match))
continue;
if (bg::distance(p, match.first) <= 3) {
matched_key_points.push_back(p);
RT.remove(match);
}
}
}
Ironically, removing the matching geometries from RT became a bit of a minor issue in this: you can either delete by iterator or by a value. In this case, we use the overload that takes a value.
Summary
It was hard to understand the code enough to see what it did. I have shown how to clean up the code, and make it work. Maybe these aren't the things you need, but hopefully using the better separated code, you should be able to get further.
Note that the algorithms have side effects. This makes it hard to understand what really will happen. E.g.:
removing points from the original RT affects what the subsequent key points (even from subsequent sets (next j)) can match with
if you have the same key point multiple times, they may match more than 1 source RT point (because after removal of the first match, there might be a second match within radius 3)
key points are checked strictly sequentially. This means that if the first keypoint roughly matches a point X, this might cause a later keypoint to fail to match, even though the point X might be closer to that keypoint...
I'd suggest you THINK about the requirements really hard before implementing things with these side-effects. Study the sample cases in the live demo below. If all these side-effects are exactly what you wanted, be sure to use much better naming and proper comments to describe what the code is doing.
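The order dependence described above can be reproduced without Boost. The sketch below swaps the rtree for a brute-force nearest-neighbour scan over a vector, purely to keep the example self-contained; Pt and greedyMatch are names made up for this illustration and are not part of the code in this answer.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };  // hypothetical stand-in for the Boost point type

// For each key, in order, find the nearest remaining source point within
// `radius`, record the key as matched and remove that source point.
// `source` is taken by value because matches are erased from it.
std::vector<Pt> greedyMatch(std::vector<Pt> source,
                            const std::vector<Pt>& keys, double radius) {
    std::vector<Pt> matched;
    for (const Pt& k : keys) {
        std::size_t best = source.size();  // "nothing found" marker
        double bestDist = radius;
        for (std::size_t i = 0; i < source.size(); ++i) {
            const double d = std::hypot(source[i].x - k.x, source[i].y - k.y);
            if (d <= bestDist) { best = i; bestDist = d; }
        }
        if (best == source.size()) continue;  // no point within the radius
        matched.push_back(k);
        source.erase(source.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return matched;
}
```

With a single source point at (0, 0) and radius 3, the keys (2, 0) and (0.5, 0) compete for the same point: whichever key comes first in the input takes it, even when the other key is closer.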
Live Demo
#include <boost/geometry.hpp>
#include <boost/geometry/io/io.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iostream>
namespace bg = boost::geometry;
namespace bgi = bg::index;
typedef bg::model::point<float, 2, bg::cs::cartesian> point;
typedef std::pair<point, unsigned> pvalue;
typedef pvalue value;
int main() {
bgi::rtree< value, bgi::quadratic<16> > RT;
{
int i = 0;
for (auto p : { point(2.0f, 2.0f), point(2.5f, 2.5f) })
RT.insert(std::make_pair(p, i++));
}
struct keypoint_t { point pt; };
using keypoints_t = std::vector<keypoint_t>;
keypoints_t const keypoints[] = {
keypoints_t{ keypoint_t { point(-2, 2) } }, // should not match anything
keypoints_t{ keypoint_t { point(-1, 2) } }, // should match (2,2)
keypoints_t{ keypoint_t { point(2.0, 2.0) }, // matches (2.5,2.5)
{ point(2.5, 2.5) }, // nothing anymore...
},
};
for (auto const& current_key_set : keypoints) {
bgi::rtree< pvalue, bgi::quadratic<16> > nextRT; // use a better name...
int i = 0;
for (auto const& kpd : current_key_set)
nextRT.insert(std::make_pair(kpd.pt, i++));
}
for (auto const& current_key_set : keypoints) {
std::cout << "-----------\n";
std::vector<point> matched_key_points;
for (auto const& kpd : current_key_set) {
point p = kpd.pt;
std::cout << "Key: " << bg::wkt(p) << "\n";
value match;
if (!RT.query(bgi::nearest(p, 1), &match))
continue;
if (bg::distance(p, match.first) <= 3) {
matched_key_points.push_back(p);
std::cout << "\tRemoving close point: " << bg::wkt(match.first) << "\n";
RT.remove(match);
}
}
std::cout << "\nMatched keys: ";
for (auto& p : matched_key_points)
std::cout << bg::wkt(p) << " ";
std::cout << "\n\tElements remaining: " << RT.size() << "\n";
}
}
Prints
-----------
Key: POINT(-2 2)
Matched keys:
Elements remaining: 2
-----------
Key: POINT(-1 2)
Removing close point: POINT(2 2)
Matched keys: POINT(-1 2)
Elements remaining: 1
-----------
Key: POINT(2 2)
Removing close point: POINT(2.5 2.5)
Key: POINT(2.5 2.5)
Matched keys: POINT(2 2)
Elements remaining: 0

Find word in string buffer/paragraph/text

This was asked in an Amazon telephone interview: "Can you write a program (in your preferred language, C/C++/etc.) to find a given word in a large string buffer, i.e. count its occurrences?"
I am still looking for the ideal answer I should have given the interviewer. I tried to write a linear search (char-by-char comparison) and, obviously, I was rejected.
Given 40-45 minutes for a telephone interview, what algorithm was he/she looking for?
The KMP Algorithm is a popular string matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP Algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where: When you begin (pass 1), the first 99 characters match, but the 100th character doesn't match. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of KMP pattern searching
// algorithm
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void computeLPSArray(char *pat, int M, int *lps);
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // length of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example
// AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char *txt = "ABABDABACDABABCABAB";
char *pat = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
Amazon and similar companies expect knowledge of the Boyer–Moore and/or Knuth–Morris–Pratt string search algorithms.
Those are good if you want to show thorough knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
Did you ask about delimiters before you wrote anything? The interviewer might have simplified your task by providing some extra information about the string buffer.
Even the code below could be OK (it's really not) if you provide enough information in advance and properly explain the runtime, the space requirements, and the choice of data containers.
#include <sstream>
#include <string>
#include <unordered_map>
int find( const std::string & the_word, const std::string & text )
{
    std::stringstream ss( text ); // !!! could be a really bad idea if 'text' is really big
    std::string word;
    std::unordered_map< std::string, int > umap;
    while( ss >> word ) ++umap[word]; // assumes that words are separated by white-space
    return umap[the_word];
}
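If the task is simply to count occurrences of a word anywhere in the buffer (not only whitespace-delimited tokens), a compact alternative is to loop over std::string::find. This is a sketch rather than what the interviewer necessarily wanted; countOccurrences is a name chosen for this example, and the standard does not promise linear time for find, so KMP or Boyer–Moore remain the answers with guaranteed bounds.

```cpp
#include <cstddef>
#include <string>

// Counts how often `word` occurs in `text`; overlapping matches are counted.
std::size_t countOccurrences(const std::string& text, const std::string& word) {
    if (word.empty()) return 0;  // assumption: an empty word matches nothing
    std::size_t count = 0;
    // Restart the search one character after each hit so overlaps are found.
    for (std::size_t pos = text.find(word); pos != std::string::npos;
         pos = text.find(word, pos + 1)) {
        ++count;
    }
    return count;
}
```

To count non-overlapping matches instead, the restart position would be pos + word.size() rather than pos + 1.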

algorithm in C++

I am writing a multi-way trie that will load in a dictionary that will take words and phrases. So first the dictionary will be loaded into the trie.
This is some (almost) C++ adapted from the following article:
http://www.toptal.com/java/the-trie-a-neglected-data-structure
That one is written in Java, so I've taken the liberty of translating it to C++ for you.
#include <vector>
struct Alphabet{
    const char* x = "abcdefghijklmnopqrstuvwxyz";
    // returns the index of c in the alphabet, or -1 if c is not a lowercase letter
    int findIndex(char c) const {
        for(int i = 0; i < 26; ++i){
            if(x[i] == c){
                return i;
            }
        }
        return -1;
    }
};
struct MWTrieNode{
    // one slot per letter; nullptr means "no child for this letter"
    std::vector<MWTrieNode*> children = std::vector<MWTrieNode*>(26, nullptr);
    bool isLast = false;
};
MWTrieNode* getWord(const char* s, int len, MWTrieNode* root){
    MWTrieNode* node = root;
    Alphabet a;
    for(int i = 0; i < len; i++){
        char currChar = s[i];
        int index = a.findIndex(currChar);
        if(index < 0){
            // character outside the alphabet: no such word
            return nullptr;
        }
        MWTrieNode* child = node->children[index];
        if(!child){
            // No such word
            return nullptr;
        }
        // step into the MWTrieNode
        node = child;
    }
    return node;
}
You can modify the getWord function to take in some parameters to modify how you return your words.
But this should get you started.
For completions, you'll need to find all of the words below a certain prefix (I'd imagine). So you'd want to build several "sub trees" starting with the root of your search prefix (e.g. "House" --> "Housewife", "Housing", "Household", etc.).
If you pass in 'Ho', you will find a 'partial word' with no node saying it's at the end. At this point, you can start at the 'o' node.
The part you haven't mentioned is how you store words that are both a word by themselves and a portion of a longer word, for example, Home, Homeowner, Homewrecker.
Those nodes must have isLast == true, but also have child nodes. It is this special case that should help you find multiple options for auto-complete. You're running the getWord method several times for a single search, with different prefixes and conditions. The result should be a list of words that all have the prefix you desire.
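One way to collect the completions is to walk down to the node for the prefix and then do a depth-first traversal below it, emitting a word at every node marked as the end of a word. The sketch below uses its own small node type (Node, insert, collect, and complete are names invented for this example) and assumes words consist only of the lowercase letters a-z.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie node for this sketch; lowercase a-z only.
struct Node {
    std::unique_ptr<Node> child[26];
    bool isLast = false;  // true if a word ends exactly at this node
};

// Assumes `word` contains only 'a'..'z'.
void insert(Node& root, const std::string& word) {
    Node* node = &root;
    for (char c : word) {
        std::unique_ptr<Node>& next = node->child[c - 'a'];
        if (!next) next.reset(new Node());
        node = next.get();
    }
    node->isLast = true;
}

// Depth-first walk: `prefix` always spells the path from the root to `node`.
void collect(const Node& node, std::string& prefix, std::vector<std::string>& out) {
    if (node.isLast) out.push_back(prefix);  // a word may end here and still have children
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            collect(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}

// Returns every stored word that starts with `prefix`, in alphabetical order.
std::vector<std::string> complete(const Node& root, const std::string& prefix) {
    const Node* node = &root;
    for (char c : prefix) {
        const int idx = c - 'a';
        if (idx < 0 || idx >= 26 || !node->child[idx]) return {};
        node = node->child[idx].get();
    }
    std::vector<std::string> out;
    std::string current = prefix;
    collect(*node, current, out);
    return out;
}
```

Because "home" is both a word and a prefix of "homeowner", its node has isLast set and also has children, which is exactly the special case described above.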
I'm sure that professors Alvarado and Mirza will be highly interested in the contents of your post.

Parsing morse code

I am trying to solve this problem.
The goal is to determine the number of ways a Morse string can be interpreted, given a dictionary of words.
What I did is that I first "translated" words from my dictionary into morse. Then, I used a naive algorithm, searching for all the ways it can be interpreted recursively.
#include <iostream>
#include <vector>
#include <map>
#include <string>
#include <iterator>
using namespace std;
string morse_string;
int morse_string_size;
map<char, string> morse_table;
unsigned int sol;
void matches(int i, int factor, vector<string> &dictionary) {
int suffix_length = morse_string_size-i;
if (suffix_length <= 0) {
sol += factor;
return;
}
map<int, int> c;
for (vector<string>::iterator it = dictionary.begin() ; it != dictionary.end() ; it++) {
if (((*it).size() <= suffix_length) && (morse_string.substr(i, (*it).size()) == *it)) {
// operator[] value-initializes a missing count to 0, so the first match counts as 1
c[(*it).size()]++;
}
}
for (map<int, int>::iterator it = c.begin() ; it != c.end() ; it++) {
matches(i+it->first, factor*(it->second), dictionary);
}
}
string encode_morse(string s) {
string ret = "";
for (unsigned int i = 0 ; i < s.length() ; ++i) {
ret += morse_table[s[i]];
}
return ret;
}
int main() {
morse_table['A'] = ".-"; morse_table['B'] = "-..."; morse_table['C'] = "-.-."; morse_table['D'] = "-.."; morse_table['E'] = "."; morse_table['F'] = "..-."; morse_table['G'] = "--."; morse_table['H'] = "...."; morse_table['I'] = ".."; morse_table['J'] = ".---"; morse_table['K'] = "-.-"; morse_table['L'] = ".-.."; morse_table['M'] = "--"; morse_table['N'] = "-."; morse_table['O'] = "---"; morse_table['P'] = ".--."; morse_table['Q'] = "--.-"; morse_table['R'] = ".-."; morse_table['S'] = "..."; morse_table['T'] = "-"; morse_table['U'] = "..-"; morse_table['V'] = "...-"; morse_table['W'] = ".--"; morse_table['X'] = "-..-"; morse_table['Y'] = "-.--"; morse_table['Z'] = "--..";
int T, N;
string tmp;
vector<string> dictionary;
cin >> T;
while (T--) {
morse_string = "";
dictionary.clear(); // each test case comes with its own dictionary
cin >> morse_string;
morse_string_size = morse_string.size();
cin >> N;
for (int j = 0 ; j < N ; j++) {
cin >> tmp;
dictionary.push_back(encode_morse(tmp));
}
sol = 0;
matches(0, 1, dictionary);
cout << sol;
if (T)
cout << endl << endl;
}
return 0;
}
Now the thing is that I am only allowed 3 seconds of execution time, and my algorithm does not finish within this limit.
Is this a good way to do this, and if so, what am I missing? Otherwise, can you give some hints about a good strategy?
EDIT :
There can be at most 10 000 words in the dictionary and at most 1000 characters in the morse string.
A solution that combines dynamic programming with a rolling hash should work for this problem.
Let's start with a simple dynamic programming solution. We allocate a vector which we will use to store known counts for prefixes of morse_string. We then iterate through morse_string, and at each position we iterate through all words and look back to see if they fit at the end of the current prefix. If a word fits, we use the dynamic programming vector to look up how many ways we could have built the prefix of morse_string up to i-dictionaryWord.size().
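vector<long long> dp;
dp.push_back(1); // dp[0]: the empty prefix can be built in exactly one way
for (size_t i = 1; i <= morse_string.size(); i++) {
    long long count = 0;
    for (size_t j = 0; j < dictionary.size(); j++) {
        const size_t len = dictionary[j].size();
        if (len > i) continue; // word is longer than the current prefix
        // does the word match the last len characters of the prefix of length i?
        if (dictionary[j] == morse_string.substr(i - len, len)) {
            count += dp[i - len];
        }
    }
    dp.push_back(count); // dp[i]: number of ways to build the first i characters
}
long long result = dp[morse_string.size()];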
The problem with this solution is that it is too slow. Let's say that N is the length of morse_string and M is the size of the dictionary and K is the size of the largest word in the dictionary. It will do O(N*M*K) operations. If we assume K=1000 this is about 10^10 operations which is too slow on most machines.
The K factor comes from comparing dictionary[j] against a substring of morse_string, which takes time proportional to the length of the word.
If we could speed up this string matching to constant or logarithmic complexity, we would be okay. This is where rolling hashing comes in. If you build a rolling hash array of morse_string, you can compute the hash of any substring of morse_string in O(1). So you could then compare hash(dictionary[j]) with the hash of the substring of morse_string that ends at position i and has the same length as dictionary[j].
This is good but in the presence of imperfect hashing you could have multiple words from the dictionary with the same hash. That would mean that after getting a hash match you would still need to match the strings as well as the hashes. In programming contests, people often assume perfect hashing and skip the string matching. This is often a safe bet especially on a small dictionary. In case it doesn't produce a perfect hashing (which you can check in code) you can always adjust your hash function slightly and maybe the adjusted hash function will produce a perfect hashing.
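A rough sketch of how the prefix-hash array and the dynamic programming can fit together is shown below. It is an illustration under stated assumptions, not a tuned contest solution: countSplits, MOD, and BASE are choices made for this example, hash matches are trusted without re-checking the strings, and the dictionary words are grouped by length and hash (an extra step not described above) so that the inner loop runs over distinct lengths rather than over all words.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Counts the ways `text` can be split into words from `words` (already in Morse).
// Assumption: one polynomial hash modulo a prime is treated as collision-free.
// The result wraps around on overflow of a 64-bit unsigned integer.
std::uint64_t countSplits(const std::string& text,
                          const std::vector<std::string>& words) {
    const std::uint64_t MOD = 1000000007ULL, BASE = 131ULL;
    const std::size_t n = text.size();

    // prefix[i] = hash of text[0..i), power[i] = BASE^i mod MOD
    std::vector<std::uint64_t> prefix(n + 1, 0), power(n + 1, 1);
    for (std::size_t i = 0; i < n; ++i) {
        prefix[i + 1] = (prefix[i] * BASE + static_cast<unsigned char>(text[i])) % MOD;
        power[i + 1] = (power[i] * BASE) % MOD;
    }
    // Hash of text[pos..pos+len) in O(1), derived from the prefix hashes.
    auto sub = [&](std::size_t pos, std::size_t len) {
        return (prefix[pos + len] + MOD - (prefix[pos] * power[len]) % MOD) % MOD;
    };

    // byLength[len][hash] = number of dictionary words with that length and hash
    std::map<std::size_t, std::unordered_map<std::uint64_t, std::uint64_t>> byLength;
    for (const std::string& w : words) {
        if (w.empty() || w.size() > n) continue;  // cannot fit anywhere
        std::uint64_t h = 0;
        for (unsigned char c : w) h = (h * BASE + c) % MOD;
        ++byLength[w.size()][h];
    }

    std::vector<std::uint64_t> dp(n + 1, 0);
    dp[0] = 1;  // the empty prefix can be built in exactly one way
    for (std::size_t i = 1; i <= n; ++i) {
        for (const auto& entry : byLength) {
            const std::size_t len = entry.first;
            if (len > i) break;  // std::map iterates in increasing length
            const auto it = entry.second.find(sub(i - len, len));
            if (it != entry.second.end()) dp[i] += dp[i - len] * it->second;
        }
    }
    return dp[n];
}
```

For example, the Morse string ".-" with the encoded dictionary {".-", ".", "-"} (A, E, T) can be read as "A" or as "E" followed by "T", so countSplits returns 2 for that input.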

All of the option to replace an unknown number of characters

I am trying to find an algorithm that, given a string, produces all of the options for replacing some of its characters with stars, where the number of replaced characters is not known in advance.
For example, for the string "abc", the output should be:
*bc
a*c
ab*
**c
*b*
a**
***
It is simple enough with a known number of stars (just run through all of the options with for loops), but I'm having difficulty generating all of the options.
Every star combination corresponds to a binary number, so you can use a simple loop
for i = 1 to 2^n-1
where n is the string length,
and put stars at the positions of the 1-bits in the binary representation of i.
For example: i=5=101b => * b *
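The loop above might be sketched in C++ as follows. The function name starVariants is made up for this example, and it assumes the string is shorter than 64 characters so that the mask fits in a 64-bit integer (in practice the 2^n outputs limit the usable length long before that).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns every variant of `s` with at least one character replaced by '*'.
std::vector<std::string> starVariants(const std::string& s) {
    std::vector<std::string> out;
    const std::size_t n = s.size();
    if (n == 0 || n >= 64) return out;  // assumption: mask must fit in 64 bits
    for (std::uint64_t mask = 1; mask < (std::uint64_t(1) << n); ++mask) {
        std::string variant = s;
        for (std::size_t pos = 0; pos < n; ++pos) {
            // The highest of the n bits maps to the first character,
            // so mask 5 = 101b on "abc" yields "*b*".
            if (mask & (std::uint64_t(1) << (n - 1 - pos))) variant[pos] = '*';
        }
        out.push_back(variant);
    }
    return out;
}
```

For "abc" this produces seven strings, starting with "ab*" (mask 1) and ending with "***" (mask 7); the order differs from the listing in the question, but the set of results is the same.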
This is basically a binary increment problem.
You can create a vector of integer variables to represent a binary array isStar and for each iteration you "add one" to the vector.
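// Adds one to the binary number stored in isStar (most significant digit first).
// Returns true once the counter overflows, i.e. every combination has been visited.
bool AddOne (int* isStar, int size) {
    isStar[size - 1] += 1;
    for (int i = size - 1; i >= 0; i--) {
        if (isStar[i] > 1) {
            if (i == 0) { return true; }
            isStar[i] = 0;       // reset this digit...
            isStar[i - 1] += 1;  // ...and carry into the next one
        }
    }
    return false;
}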
That way you still have the original string available while replacing the characters.
This is a simple binary counting problem, where * corresponds to a 1 and the original letter to a 0. So you could do it with a counter, applying a bit mask to the string, but it's just as easy to do the "counting" in place.
Here's a simple implementation in C++:
(Edit: The original question seems to imply that at least one character must be replaced with a star, so the count should start at 1 instead of 0. Or, in the following, the post-test do should be replaced with a pre-test for.)
#include <iostream>
#include <string>
// A cleverer implementation would implement C++'s iterator protocol.
// But that would cloud the simple logic of the algorithm.
class StarReplacer {
public:
StarReplacer(const std::string& s): original_(s), current_(s) {}
const std::string& current() const { return current_; }
// returns true unless we're at the last possibility (all stars),
// in which case it returns false but still resets current to the
// original configuration.
bool advance() {
for (int i = current_.size()-1; i >= 0; --i) {
if (current_[i] == '*') current_[i] = original_[i];
else {
current_[i] = '*';
return true;
}
}
return false;
}
private:
std::string original_;
std::string current_;
};
int main(int argc, const char** argv) {
for (int a = 1; a < argc; ++a) {
StarReplacer r(argv[a]);
do {
std::cout << r.current() << std::endl;
} while (r.advance());
std::cout << std::endl;
}
return 0;
}

Resources