Algorithm to find all possible results using text search

I am currently making a web crawler to crawl all the possible characters on a video game site (Final Fantasy XIV Lodestone).
My interface for doing this is using the site's search. http://lodestone.finalfantasyxiv.com/rc/search/characterForm
If the search finds more than 1000 characters, it only returns the first 1000. The text search does not seem to support wildcards such as *, ? or _.
If I search for the letter a, I get all the characters that have an a anywhere in their names, rather than all the characters whose names start with a.
I'm guessing I could run searches for every two-letter combination: aa, ab, ba, etc. But that approach has two problems:
it doesn't guarantee that I will never get more than 1000 results;
it doesn't seem very efficient, as many characters would appear multiple times and would need to be filtered out.
I'm looking for an algorithm for constructing my search texts.

Considered as a practical problem, have you asked Square Enix for some kind of API access or database dump? They might prefer this to having you scrape their search results.
Considered purely in the abstract, it's not clear that any search strategy can succeed in finding all the results. Suppose there were a character called "Ar": how would you find it? If you search for "ar", there may be so many matches that the first 1000 results end at names beginning with "Ak", before "Ar" is ever reached. If you search for "a" or "r", the situation is even worse. Any other search fails to find this character. (In practice you might be able to find "Ar" by guessing its world and/or main skill, but in theory there might be so many characters with that skill on that world that this remains ineffective.)

The main question here is what you are planning to do with all those characters. What is the purpose of your program? Putting that aside: you can search for a single letter and filter by both main skill and world (using a double loop). It is highly unlikely that you will ever get more than 1000 hits that way for any consonant. If you want to search for names starting with a vowel, use the digraph vowel + other_letter in a loop that iterates other_letter from A to Z.
An additional optimization is to try to guess at which page the list for the needed letter will start. If you have the total number of pages (TNOP), then your list will start somewhere near page TNOP * LETTER / 27, where LETTER is the position of the letter in the alphabet.
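A sketch of that strategy, with everything hedged as a placeholder: the world and skill lists would come from the Lodestone form, and search() stands in for the HTTP request against the character search page, not any real API:

import string

WORLDS = ["Excalibur", "Sargatanas"]   # placeholder; use the full world list
SKILLS = ["Gladiator", "Conjurer"]     # placeholder; use the full skill list

def search(text, world, skill):
    # Hypothetical wrapper around the site's character search form;
    # would return up to 1000 character records for the query.
    return []

# Single-letter queries for consonants; vowels match too many names,
# so split them into vowel + letter pairs.
queries = []
for letter in string.ascii_lowercase:
    if letter in "aeiou":
        queries.extend(letter + other for other in string.ascii_lowercase)
    else:
        queries.append(letter)

seen = set()  # a set silently drops the duplicate hits
for query in queries:
    for world in WORLDS:
        for skill in SKILLS:
            seen.update(search(query, world, skill))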

Related

Matching words and valid sub-words in elasticsearch

I've been working with ElasticSearch within an existing code base for a few days, so I expect that the answer is easy once I know what I'm doing. I want to extend a search to yield the same results when I search with a compound word, like "eyewitness", or its component words separated by a whitespace, like "eye witness".
For example, I have a catalog of toy cars that includes both "firetruck" toys and "fire truck" toys. I would like to ensure that if someone searched on either of these terms, the results would include both the "firetruck" and the "fire truck" entries.
I attempted to do this at first with the "fuzziness" of a match, hoping that "fire truck" would be considered one transform away from "firetruck", but that does not work: ES fuzziness is per-word and will not add or remove whitespace characters as a valid transformation.
I know that I could do some brute-forcing before generating the query by trying to come up with additional search terms by breaking big words into smaller words and also joining smaller words into bigger words and checking all of them against a dictionary, but that falls apart pretty quickly when "fuzziness" and proper names are part of the task.
It seems like this is exactly the kind of thing that ES should do well, and that I simply don't have the right vocabulary yet for searching for the solution.
Thanks, everyone.
There are two things you could do:
You could split words into their compounds, i.e. firetruck would be split into the two tokens fire and truck; see here.
You could use n-grams, i.e. for 4-grams the original firetruck gets split into the tokens fire, iret, retr, etru, truc, ruck. In queries, the scoring function helps you end up with pretty decent results. Check out this.
Always remember to do the same tokenization on both the analysis and the query side.
I would start with the n-grams, and if that is not good enough, go with the compounds and split them yourself. But that's a lot of work depending on the vocabulary you have under consideration.
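For illustration, a minimal index setup for the 4-gram route, assuming the official elasticsearch-py client (the index, field, and filter names here are made up; older client versions take a single body argument instead of settings/mappings):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="toys",
    settings={
        "analysis": {
            "filter": {
                "my_ngram": {"type": "ngram", "min_gram": 4, "max_gram": 4}
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_ngram"],
                }
            },
        }
    },
    mappings={
        # The analyzer on the indexed field is used at query time too,
        # which keeps analysis and query tokenization identical.
        "properties": {"name": {"type": "text", "analyzer": "ngram_analyzer"}}
    },
)
# "firetruck" and "fire truck" now share the tokens fire, truc, ruck,
# so a match query on "name" finds both.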
hope the concepts and the links help, fricke

Search a string as you type each character

I have contacts stored in my mobile. Let's say my contacts are
Ram
Hello
Hi
Feat
Eat
At
When I type the letter 'A', I should get all the matching contacts, say "Ram, Feat, Eat, At".
Now I type one more letter, 'T', so the full string is "AT". My program should reuse the results of the previous search for "A" and now return "Feat, Eat, At".
Design and develop a program for this.
This is an interview question from Samsung mobile development.
I tried solving it with a Trie data structure, but could not get a good solution for reusing the results of the already-searched string. I also tried a solution with a dictionary data structure; it has the same disadvantage as the Trie.
The question is: how do I search the contacts for each letter typed, reusing the search results of the earlier searched string? What data structure and algorithm should be used to solve the problem efficiently?
I am not asking for a program, so the programming language is immaterial to me.
A state machine appears to be a good solution. Does anyone have a suggestion?
The solution should be fast enough for a million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a substring-containment check on everything. So when the user types "A", you search the entire list:
results = []
for contact in contacts:
    if "A" in contact.name.upper():  # case-insensitive containment check
        results.append(contact)
Then the user types "T", and you sequentially search the previous returned results:
new_results = []
for contact in results:
    if "AT" in contact.name.upper():
        new_results.append(contact)
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start, although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action, and you can almost certainly do that before the user presses the second key, especially if you have a background thread doing the search while the foreground thread handles input and writes the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, at
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are only 676 possible bigrams (26 × 26), plus the 26 single letters, so at most your dictionary is going to have 702 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first, anyway. This is going to cost you an extra 2 million references to the names, in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to look up and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
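A minimal sketch of that index, using the names from the question (names standing in for full contact records):

from collections import defaultdict

def build_bigram_index(names):
    # Map every single character and every bigram to the set of
    # names containing it.
    index = defaultdict(set)
    for name in names:
        key = name.lower()
        for ch in key:
            index[ch].add(name)
        for i in range(len(key) - 1):
            index[key[i:i + 2]].add(name)
    return index

index = build_bigram_index(["Ram", "Hello", "Hi", "Feat", "Eat", "At"])
candidates = index["at"]  # {'Feat', 'Eat', 'At'}: instant 2-character lookup
# For longer queries, fall back to the sequential scan over the candidates:
matches = [name for name in candidates if "eat" in name.lower()]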
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size, this data structure gives an asymptotically optimal solution for finding all z matches of a substring of length m in a set of strings, in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. The structure also has optimal O(n) space and build time, where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably won't have too many matches per matching contact, so this may not even be necessary.
What I'm thinking of doing is keeping track of the string matched so far. In the first step, we identify the strings that have "A" in them and keep track of the positions of "A". In the next step we iterate only over these strings, and instead of searching them in full we only check whether "T" occurs as the character following the "A" positions recorded in the previous step, and so on.
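A small sketch of that position-tracking idea (the function names are mine, and it only handles extending the match by the next typed character):

def initial_matches(contacts, ch):
    # Record, for each contact, every index where ch occurs.
    matches = {}
    for name in contacts:
        positions = [i for i, c in enumerate(name.upper()) if c == ch.upper()]
        if positions:
            matches[name] = positions
    return matches

def extend_matches(matches, next_ch, offset):
    # Keep only contacts where next_ch appears offset characters after
    # a recorded starting position; no full rescan of the name.
    kept = {}
    for name, positions in matches.items():
        good = [p for p in positions
                if p + offset < len(name)
                and name[p + offset].upper() == next_ch.upper()]
        if good:
            kept[name] = good
    return kept

m = initial_matches(["Ram", "Hello", "Hi", "Feat", "Eat", "At"], "A")
m = extend_matches(m, "T", 1)  # user typed "T" after "A"
print(sorted(m))               # ['At', 'Eat', 'Feat']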

Algorithm to reconstruct a sentence whose spaces were removed and whose words' letters were shuffled?

I was looking around some puzzles online to improve my knowledge of algorithms...
I came upon the question below:
"You have a sentence of several words, with the spaces removed and the characters of each word shuffled. You have a dictionary. Write an algorithm to produce the sentence back, with spaces and the words in normal character order."
I do not know a good way to solve this.
I am new to algorithms, but just looking at the problem I think I would make the program do what a thinking mind would do.
Here is something I can think of:
-First, manually find common short English words from the dictionary, like "is", "the", "if" etc., and put them in dataset 1.
-Then find the permutations of the words in dataset 1 (e.g. "si", "eht", "eth", "fi") and put them in dataset 2.
-Then find which character sequences in the input sentence match the words of dataset 2, put them in dataset 3, and insert spaces into the input sentence where they were found.
-For the rest of the words I would generate permutations to find matching dictionary words.
I am a newbie to algorithms... is it a bad solution?
This seems like a perfectly fine solution.
In general there are 2 parameters for judging an algorithm:
correctness - does the algorithm provide the correct answer?
resources - the time or storage needed to provide an answer.
Usually there is a tradeoff between these two parameters.
So, for example, the size of your dictionary dictates which scrambled sentences you can reconstruct: a bigger dictionary gives you a correct answer for more inputs, but the whole searching process takes longer and requires more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them, so checking them all is expensive. A good approach is to do what you suggested: create a small subset of commonly used words and check them first, so that the average case is better.
Note: just saying that you check the permutations is fine as an idea, but in the end you need to specify the exact way of doing that. Currently what you wrote is an idea for an algorithm, but it would not allow you to take a given input and mechanically work out the output.
Actually, it might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters available, instead of finding the smallest ones. Short words are more common and thus harder to narrow down: is it really "if", or part of "fig"?
Then, for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations: simply because you found a valid word doesn't mean it's the right word. Going through all the substrings should take something like O(c^4 * d) checks, where d is the number of words in the dictionary and c is the number of characters in the sentence; practically speaking, if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words and figure out an ordering that works, so that all characters are used. There might be multiple solutions.
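One mechanical way to do the "can this chunk form a dictionary word" check without enumerating permutations (my own suggestion, not part of the answer above) is to compare sorted letters: two strings are anagrams exactly when their sorted letters match. A sketch combining that with the longest-words-first idea:

from collections import defaultdict

def build_signature_index(dictionary):
    # Group words by their sorted letters, so a shuffled chunk can be
    # looked up without enumerating its permutations.
    index = defaultdict(list)
    for word in dictionary:
        index["".join(sorted(word))].append(word)
    return index

def unscramble(text, index, max_len=20):
    # Return one segmentation of text into dictionary words, or None.
    # Tries longer chunks first, as suggested above.
    if not text:
        return []
    for w in range(min(max_len, len(text)), 0, -1):
        signature = "".join(sorted(text[:w]))
        for word in index.get(signature, []):
            rest = unscramble(text[w:], index, max_len)
            if rest is not None:
                return [word] + rest
    return None

index = build_signature_index(["the", "quick", "brown", "fox"])
print(unscramble("ehtkciuqnworbxof", index))  # ['the', 'quick', 'brown', 'fox']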

Creating a suggested words algorithm

I'm designing a cool spell checker (I know, I know, modern browsers already have this). Anyway, I am wondering what kind of effort it would take to develop a fairly simple but decent suggest-a-word algorithm.
My idea is that I would first look through the misspelled word's characters and count how many characters it has in common with each word in the dictionary (sounds resource-intensive), and then pick the top 5 matches (so if the misspelled word has the most characters in common with 7 words from the dictionary, it will randomly display 5 of those words as suggested spellings).
Obviously, to get more advanced, we would look at "common words" and have a dictionary file ranked by each word's frequency of use in the English language. I think that's maybe taking it a bit overboard.
What do you think? Anyone have ideas for this?
First of all you will have to consider the complexity of finding the words "nearest" to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and coolest solution here is to go for a TRIE data structure. Finding these so-called nearer words takes time linear in the word length, and it is very easy to exhaust the tree.
A small example
Take the word "njce". This is a level-1 example where one character is clearly misspelled. The obvious expected suggestion would be "nice". The first step is to see whether this word is present in the dictionary. Using the search function of a TRIE, this can be done in time proportional to the word's length, similar to a dictionary lookup. The cooler part is finding the suggestions. You would have to try all the variants that substitute 'a' to 'z' at the first position: ajce, bjce, cjce, up to zjce. Finding occurrences of this type is again linear in the character count. You should not get carried away by multiplying this number by 26 times the length of the word, since the TRIE's branching diminishes rapidly as the length grows. Coming back to the problem: once that search is done and no result is found, you go to the next character position, searching for nace, nbce, ncce, up to nzce. In fact you won't have to explore all the combinations, as the TRIE by itself will not contain most of the intermediate characters. Perhaps it only has na, ni, ne, no, nu as children of n, and the search space becomes insanely simple. So do the further positions. You can develop this concept further for second- and third-order matches. Hope this helped.
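A minimal sketch of that search, handling single-character substitutions only (insertions and deletions would need extra recursive branches); the class and function names are mine:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def suggest(node, word, edits_left=1, prefix=""):
    # Collect dictionary words within edits_left substitutions of word.
    # Only branches that actually exist in the TRIE are explored, which
    # is why the search space "diminishes" as described above.
    if not word:
        return [prefix] if node.is_word else []
    results = []
    for ch, child in node.children.items():
        if ch == word[0]:
            results += suggest(child, word[1:], edits_left, prefix + ch)
        elif edits_left > 0:
            results += suggest(child, word[1:], edits_left - 1, prefix + ch)
    return results

root = TrieNode()
for w in ["nice", "mice", "note", "ace"]:
    insert(root, w)
print(suggest(root, "njce"))  # ['nice']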
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to extract words and other interesting tokens from them (even with spelling errors or combinations of digits and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How can I extract words and discard tokens that are encoding errors, parts of pgp signatures, or whatever else we get in mails and know we will never be interested in?
You need to decide on a good-enough criterion for a word and write a regular expression or a manual check to enforce it.
A few rules that can be extrapolated from your examples:
A word can start with a capital letter or be all capitals, but if it has more than, say, 2 uppercase letters and more than 2 lowercase letters inside it, it's not a word.
If it has digits inside, it's not a word.
If it's longer than, say, 20 characters, it's not a word.
There's no magic trick. You need to decide what you want the rules to be and make them happen.
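A sketch of those three rules as a token filter (the thresholds are the arbitrary knobs mentioned above):

import re

MAX_LEN = 20  # arbitrary threshold from the rules above

def looks_like_word(token):
    if len(token) > MAX_LEN:
        return False                    # too long to be a word
    if re.search(r"\d", token):
        return False                    # digits inside: not a word
    upper = sum(1 for c in token if c.isupper())
    lower = sum(1 for c in token if c.islower())
    if upper > 2 and lower > 2:
        return False                    # mixed-case soup like "lCnUieIqu"
    return True

for t in ["Hello", "zebra21", "0013lCnUieIquYjSuIA", "anr5Brru2lLngOiEAVk1BTjN"]:
    print(t, looks_like_word(t))        # only "Hello" passes these rules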
An alternative way is to train some kind of Hidden Markov Model system to recognize things that look like words, but I think this is overkill for what you want to do.
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
You can make rules that reject anything with these 'uncommon properties' to build a system that accepts most actual words.
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but at low precision, i.e. you would get almost all real words but also a lot of non-words. If your definition of a word is too restrictive, it's the other way around, but that's also not what you want, since then you would miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwhile thinking about whether you can determine which parts of an email belong to the main text and which are footers like pgp signatures. I'm sure it's possible to find simple heuristics that match most cases, e.g. cut off everything below a line consisting only of '-' characters.
Depending on your performance criteria, you may want to check whether a token is, or contains, a real word by matching against a simple word list. It's easy to find quite exhaustive lists of English words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter out every token marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens which occur with high frequency most probably are words; tokens which appear only once, or whose count is below a certain threshold, very probably are not. Common spelling errors should appear more than once, and uncommon ones may be ignored.
Some of these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens which certainly are words (using the suggestions above), then identify tokens which certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics to handle these separately.
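That three-way split might look like this sketch (the word list and the non-word regular expression are illustrative stand-ins, not tuned rules):

import re

def triage(tokens, wordlist, nonword_re):
    # Certain words, certain non-words, and a leftover pile to inspect
    # by eye, as described above.
    words, nonwords, unsure = set(), set(), set()
    for token in tokens:
        if token.lower() in wordlist:
            words.add(token)
        elif nonword_re.search(token):
            nonwords.add(token)
        else:
            unsure.add(token)
    return words, nonwords, unsure

wordlist = {"hello", "customer", "zebra"}            # stand-in word list
nonword_re = re.compile(r"\d{3,}|[A-Za-z0-9]{21,}")  # digit runs, overlong blobs
print(triage(["Hello", "zebra21", "0013lCnUieIquYjSuIA"], wordlist, nonword_re))
# ({'Hello'}, {'0013lCnUieIquYjSuIA'}, {'zebra21'})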
