How do you autocomplete names containing spaces? - algorithm

I am working on implementing an autocompletion script in javascript. However, some of the names are two word names with a space in the middle. What kind of algorithm can you use to deal with it. I am using a trie to store the names.
The only solutions I could come up with were just saying that two word names cannot be used (either run them together or put a dash in the middle). The other idea was to create a list of these kind of names and have a separate loop to check the input. The other and possibly best idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. I was wondering if there was a better solution out there?
Edit: I realized I wasn't very clear on what I was asking. My problem isn't adding two word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell if they are typing in the rest of the first name or if they are now typing in the last name.

Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...

Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)

If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.

If you using multiline editor, i guess the best choice autocomplete items will be a word. So firstname, middlename and lastname must be parsed and add a lookup item.
For (one line) textbox use you can add whitespaces (and firstname + space + middlename + space + lastname pattern) in search criteria.

Related

Abbreviating a UUID

What would be a good way to abbreviate UUID for use in a button in a user interface when the id is all we know about the target?
GitHub seems to abbreviate commit ids by taking 7 characters from the beginning. For example b1310ce6bc3cc932ce5cdbe552712b5a3bdcb9e5 would appear in a button as b1310ce. While not perfect this shorter version is sufficient to look unique in the context where it is displayed. I'm looking for a similar solution that would work for UUIDs. I'm wondering is some part of the UUID is more random than another.
The most straight forward option would be splitting at dash and using the first part. The UUID 42e9992a-8324-471d-b7f3-109f6c7df99d would then be abbreviated as 42e9992a. All of the solutions I can come up with seem equally arbitrary. Perhaps there is some outside the box user interface design solution that I didn't think of.
Entropy of a UUID is highest in the first bits for UUID V1 and V2, and evenly distributed for V3, V4 and V5. So, the first N characters are no worse than any other N characters subset.
For N=8, i.e. the group before the first dash, the odds of there being a collision within a list you could reasonably display within a single GUI screen is vanishingly small.
The question is whether you want to show part of the UUID or only ensure that unique strings are presented as shorter unique strings. If you want to focus on the latter, which appears to be the goal you are suggesting in your opening paragraph:
(...) While not perfect this shorter version is sufficient to look unique in
the context where it is displayed. (...)
you can make use of hashing.
Hashing:
Hashing is the transformation of a string of characters into a usually
shorter fixed-length value or key that represents the original string.
Hashing is used to index and retrieve items in a database because it
is faster to find the item using the shorter hashed key than to find
it using the original value.
Hashing is very common and easy to use across many of popular languages; simple approach in Python:
import hashlib
import uuid
encoded_str = uuid.UUID('42e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid = hashlib.sha1(encoded_str).hexdigest()
hash_uuid[:10]
'b6e2a1c885'
Expectedly, a small change in string will result in a different string correctly showing uniqueness.
# Second digit is replaced with 3, rest of the string remains untouched
encoded_str_two = uuid.UUID('43e9992a-8324-471d-b7f3-109f6c7df99d').bytes
hash_uuid_two = hashlib.sha1(encoded_str_two).hexdigest()
hash_uuid_two[:10]
'406ec3f5ae'
After thinking about this for a while I realised that the short git commit hash is used as part of command line commands. Since this requirement does not exist for UUIDs and graphical user interfaces I simply decided to use ellipsis for the abbreviation. Like so 42e9992...

Is there a way to change the way Google Sheets Query Group sorts? By both capitals and letter case?

I have a simple query function that returns a range of names and sums, grouped by the names.
=QUERY('Mamut inklipp'!C:R;"select F, sum(R) group by F";0)
This sorts by the names, but case sensitive. A-Z all comes before a-z. Therefore "Eve" comes before "adam". To me that is just plain wrong.
Is there a way to change the the sorting method?
You should be able to work around that. Pre-processing the data ('before the query') might be an option. Here's a little example.
I hope that works for you?
Note: Depending on your locale, you may have to use commas instead of semi-colons as argument separators (in the formula).

Is there a way to remove a word from a KeyedVectors vocab?

I need to remove an invalid word from the vocab of a "gensim.models.keyedvectors.Word2VecKeyedVectors".
I tried to remove it using del model.vocab[word], if I print the model.vocab the word disappeared, but when I run model.most_similar using other words the word that I deleted is still appearing as similar.
So how can I delete a word from model.vocab in a way that affect the model.most_similar to not bring it?
There's no existing method supporting the removal of individual words.
A quick-and-dirty workaround might be to, at the same time as removing the vocab entry, noting the index of the existing vector (in the underlying large vector array), and also changing the string in the kv_model.index2entity list at that index to some plug value (like say, '***DELETED***').
Then, after performing any most_similar(), discard any entries matching '***DELETED***'.
Refer to:
How to remove a word completely from a Word2Vec model in gensim?
Possible method 1: I solve it by editing the text model file itself.
Possible method 2: Refer to #zsozso's answer. (Though I didn't get
it to
work).

Parse.com search performance

I have a Parse class, that I want to make searchable on a String attribute . According to Parse's blog the most efficient way to do this is by using a list of lowercased words as your searchable String attribute. For example, instead of "My Search Term", use ["my", "search", "term"].
The problem is, that I want to be able to search substrings as well. So, an object with description "abc" should be returned for any of the following search strings: a, b, c, ab, bc or abc.
I thought about further breaking down the ["my", "search", "term"] tags into more words, so turn "my" into ["m", "y", "my"]. This list might become huge, though, in case of a slightly longer string (consider "search" for example).
Using query.matches would solve this, but is not recommended for large datasets.
So, what is best way to approach this, considering, that most of the tags will be a lot longer than only 2 characters?
You identified correctly the only two reasonable alternatives in parse. And for nontrivial keywords, I think query.matches will be the only workable choice.
Consider relaxing the requirement that the user can search any substring of a string. Isn't it unreasonable to suppose that I might search for the word "requirement" with the string "ire"? If search is limited to matching prefixes, the query can use startsWith.

What Ruby Regex code can I use for obtaining "out of sight" from the input "outofsight"?

I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie is there a way I can still take the input and return the correct data? For example "outofsight" will still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could do auto-complete in the GUI (if you have one) on the input of the user, and this way mitigate some of the common errors users can end up doing.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter see How can I do fuzzy substring matching in Ruby?.
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.

Resources