Parse.com search performance

I have a Parse class that I want to make searchable on a String attribute. According to Parse's blog, the most efficient way to do this is to use a list of lowercased words as your searchable String attribute. For example, instead of "My Search Term", use ["my", "search", "term"].
The problem is that I want to be able to search substrings as well. So an object with the description "abc" should be returned for any of the following search strings: a, b, c, ab, bc, or abc.
I thought about breaking the ["my", "search", "term"] tags down further, turning "my" into ["m", "y", "my"]. This list could become huge, though, for even a slightly longer string (consider "search", for example).
Using query.matches would solve this, but it is not recommended for large datasets.
So, what is the best way to approach this, considering that most of the tags will be a lot longer than two characters?

You have correctly identified the only two reasonable alternatives in Parse, and for nontrivial keywords I think query.matches will be the only workable choice.
Consider relaxing the requirement that the user can search for any substring of a string. Is it really likely that someone would search for the word "requirement" with the string "ire"? If search is limited to matching prefixes, the query can use startsWith.
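As a minimal sketch of the prefix approach, assuming the Parse JavaScript SDK and an array column of lowercased words (the class name "Item" and column name "searchTerms" below are placeholders, and whether startsWith is applied element-wise to array columns is worth verifying against your Parse version):

import Parse from 'parse/node';

// Sketch: prefix search over a lowercased, tokenized array column.
// "Item" and "searchTerms" are placeholder names, not from the question.
async function searchByPrefix(prefix: string): Promise<Parse.Object[]> {
  const query = new Parse.Query('Item');
  // startsWith compiles to an anchored regex (^prefix), which the backing
  // store can usually serve from an index, unlike an unanchored matches().
  query.startsWith('searchTerms', prefix.toLowerCase());
  return query.find();
}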

Related

Need a search algorithm that works with acronyms and is not fuzzy (working with Elm, but anything helps)

I'm working with Elm, but any kind of pseudocode is welcome. This is not homework - it's a personal project. I'm looking to implement a more advanced search, without using "fuzzy" searching. I will give a few examples of what I would like.
Searching "coa" should find "Cathedral of Aachen", but NOT "Charcoal"
Searching "braek" should not find "break" (though if it fit the other requirements, I'm okay if it does)
Given the string "This is a test for a string search", the algorithm should find a match for "tistestase", picking out the letters shown capitalized in "This IS a TEST for A string SEarch". Note that "is" did not get highlighted in "thIS", because it is not at the beginning of a word. Also note that the second-to-last character of the search entry, "s", is not matched in "String"; instead, it is combined with the last character to match "SEarch". This last point is the part I'm mainly having trouble figuring out: how do I know to ignore a valid letter so that the match doesn't fail further along? If the S from "string" were taken, the final "e" would cause the search to fail.
The search should be able to find full words as well. If I were to search "special edition", it would return successful if it found "SPECIAL limited EDITION".
If there's a solution, or a term I can search for to help me with my issue, that would also be appreciated. If it is also fuzzy but fits my other criteria, I'll be happy.
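Not part of the thread, but the behaviour described above (each piece of the query must match the start of a later and later word, with the ability to back off when a greedy match strands the rest of the query) can be sketched as a small backtracking matcher. The asker mentions Elm, but since any pseudocode was welcome, here is a rough TypeScript version:

// Sketch of the matching rule described in the question: the query must
// split into consecutive pieces, each a prefix of a later word in the text.
// Backtracking handles the "SEarch" case, where greedily matching the "s"
// of "string" would leave the final "e" unmatchable.
function prefixAcronymMatch(query: string, text: string): boolean {
  const words = text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
  const q = query.toLowerCase().replace(/\s+/g, ''); // spaces in the query are ignored

  // Try to consume q[qi..] using words[wi..].
  function go(qi: number, wi: number): boolean {
    if (qi === q.length) return true;      // whole query consumed: match
    if (wi === words.length) return false; // query left over, no words left
    const word = words[wi];
    if (go(qi, wi + 1)) return true;       // option 1: skip this word
    // option 2: consume a non-empty prefix of this word, then move on
    for (let len = 1; len <= word.length && qi + len <= q.length; len++) {
      if (q.slice(qi, qi + len) !== word.slice(0, len)) break;
      if (go(qi + len, wi + 1)) return true; // backtrack if the rest fails
    }
    return false;
  }
  return go(0, 0);
}

// prefixAcronymMatch("coa", "Cathedral of Aachen")  -> true
// prefixAcronymMatch("coa", "Charcoal")             -> false
// prefixAcronymMatch("tistestase", "This is a test for a string search") -> true
// prefixAcronymMatch("special edition", "SPECIAL limited EDITION")       -> true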

How do I analyze text that doesn't have a separator (e.g. a domain name)?

I have a bunch of domain names (without the TLD) that I'd like to search, but they don't always have a natural break between words (like a "-"). For instance:
techtarget
americanexpress
theamericanexpress // a non-existent site
thefacebook
What is the best analyzer to use? E.g. if a user types in "american ex", I'd like to prioritize "americanexpress" over "theamericanexpress". A simple prefix query would work in this particular case, but then a user types in "facebook" and that doesn't return anything. ;(
In most cases, including yours, the Standard Analyzer is sufficient. It is also the default analyzer in Elasticsearch and provides grammar-based tokenization. For example:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." will be tokenized into [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
In your case, the domain names are tokenized into the list of terms [techtarget, americanexpress, theamericanexpress, thefacebook].
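To see exactly which terms an analyzer produces for a given string, you can call Elasticsearch's _analyze API directly; a small sketch (http://localhost:9200 is a placeholder for your cluster's endpoint):

// Sketch: inspect what the standard analyzer produces for a given string.
async function analyzeWithStandard(text: string): Promise<string[]> {
  const res = await fetch('http://localhost:9200/_analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ analyzer: 'standard', text }),
  });
  const { tokens } = await res.json();
  return tokens.map((t: { token: string }) => t.token);
}

// analyzeWithStandard("theamericanexpress") -> ["theamericanexpress"]
// i.e. the whole domain comes back as a single term, which is why a
// search for "facebook" cannot find "thefacebook".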
Why does a search for facebook not return anything?
Because there is no facebook term stored in the dictionary, the search returns no data. What's going on is that ES tries to find the search term facebook in the dictionary, but the dictionary only contains thefacebook, so the search returns no results.
Solution:
To match the search term facebook against thefacebook, you need to wrap wildcards around your search term, i.e. .*facebook will match thefacebook. However, you should know that using regexes carries a performance overhead.
The other workaround is to use synonyms. Synonyms let you specify a list of alternative search terms for a term, e.g. "facebook, thefacebook, facebooksocial, fb, fbook". With these synonyms in place, providing any one of the terms will match any of the others, i.e. if your search term is facebook and the domain is stored as thefacebook, the search will match.
Also, for prioritization you first need to understand how scoring works in ES, and then you can use boosting.
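As a rough illustration of the synonym approach (the index name, field name, and synonym list below are placeholders that would need to be curated for your data), the index could be created with a custom analyzer that folds the known variants together:

// Sketch: create an index whose analyzer folds variants of a domain name
// into one another via a synonym token filter. Names are placeholders.
async function createDomainsIndex(): Promise<void> {
  await fetch('http://localhost:9200/domains', {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      settings: {
        analysis: {
          filter: {
            domain_synonyms: {
              type: 'synonym',
              // terms on one line are treated as equivalent
              synonyms: ['facebook, thefacebook, facebooksocial, fb, fbook'],
            },
          },
          analyzer: {
            domain_analyzer: {
              type: 'custom',
              tokenizer: 'standard',
              filter: ['lowercase', 'domain_synonyms'],
            },
          },
        },
      },
      mappings: {
        properties: {
          name: { type: 'text', analyzer: 'domain_analyzer' },
        },
      },
    }),
  });
}

With such a mapping, facebook and thefacebook end up indexed under the same terms, so a match query for facebook will also find the thefacebook document.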

Searching within text fields in CloudKit

How are people searching within a string (a substring) field using CloudKit?
For making predicates for use with CloudKit, from what I gather, you can only do BEGINSWITH and TOKENMATCHES, which search a single text field (by prefix) or all fields (by exact token match) respectively. CONTAINS only works on collections, despite these examples. I can't find a way to match, for example, roses in the string "Red roses are pretty".
I was thinking of making a tokenized version of certain string fields; for example the following fields on a hypothetical record:
description: 'Red roses are pretty'
descriptionTokenized: ['Red', 'roses', 'are', 'pretty']
Testing this out makes CONTAINS somewhat useful when searching for distinct, space-separated substrings, but it's still not as good as SQL LIKE would be.
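Not from the thread, but as a tiny sketch of the tokenization step itself (independent of the CloudKit API; lowercasing and punctuation stripping are assumptions added so the comparison isn't case- or punctuation-sensitive):

// Sketch: build the tokenized companion field before saving the record.
// Lowercasing and splitting on non-alphanumerics are assumptions on top of
// the ['Red', 'roses', 'are', 'pretty'] example above.
function tokenizeForSearch(description: string): string[] {
  return description
    .toLowerCase()
    .split(/[^a-z0-9]+/)          // split on anything that isn't a letter or digit
    .filter(word => word.length > 0);
}

// tokenizeForSearch("Red roses are pretty") -> ["red", "roses", "are", "pretty"]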

Amazon CloudSearch not searching with partial string

I'm testing Amazon CloudSearch for my web application and I'm running into some strange issues.
I have the following domain indexes: name, email, id.
For example, I have data such as: John Doe, john@example.com, 1
When I search for jo I get nothing. If I search for joh I still get nothing, but if I search for john then I get the above document as a hit. Why is it not matching when I enter partial strings? I even put suggesters on name and email with fuzzy matching enabled. Is there something else I'm missing? I read the following on this:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-compound-queries.html
I'm doing the searches using boto as well as with the form on AWS page.
What you're trying to do -- finding "john" by searching "jo" -- is called a prefix search.
You can accomplish this either by searching
(prefix field=name 'jo')
or
q=jo*
Note that if you use the q=jo* method of appending * to all your queries, you may want to do something like q=jo* |jo because john* will not match john.
This can seem a little confusing, but imagine if Google gave back results for prefix matches: if you searched for tort and got back a mess of results about tortoises and torture instead of tort (a legal term), you would be very confused (and frustrated).
A suggester is also a viable approach, but it's going to give you back suggestions (like john, jordan, and jostle) rather than results, which you would then need to search for; it does not return matching documents to you.
See "Searching for Prefixes in Amazon CloudSearch" at http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
Are your index field types "Text"? If they are just "Literals", they have to be an exact match.
I think you must have your name and email fields set as the literal type instead of the text type, otherwise a simple text search of 'jo' or 'Joh' should've found the example document.
While using a prefix search may have solved your problem (and that makes sense if the fields are set as the literal type), the accepted answer isn't really correct. The notion that it's "like a Google search" isn't based on anything in the documentation. It actually contradicts the example they use, and in general muddies up what's possible with the service. From the docs:
When you search text and text-array fields for individual terms, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, in the sample movie data, the title field is configured as a text field. If you search the title field for star, you will find all of the movies that contain star anywhere in the title field, such as star, star wars, and a star is born. This differs from searching literal fields, where the field value must be identical to the search string to be considered a match.

How do you autocomplete names containing spaces?

I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can you use to deal with this? I am using a trie to store the names.
The only solutions I could come up with were just saying that two-word names cannot be used (either run them together or put a dash in the middle). The other idea was to create a list of these kinds of names and have a separate loop to check the input. The other, and possibly best, idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. I was wondering if there is a better solution out there?
Edit: I realized I wasn't very clear on what I was asking. My problem isn't adding two-word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell whether they are typing the rest of the first name or starting the last name?
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
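A bare-bones sketch of that idea, without the branch-pruning heuristic: store the full name, space included, so typing "john " simply continues down the same branch, and both "rest of the first name" and "start of the last name" hang off the space node:

// Bare-bones trie that stores full names, spaces included. The prefix
// lookup never has to decide whether the user is still typing the first
// name or has started the last name; both continuations live under the
// same space character.
class TrieNode {
  children = new Map<string, TrieNode>();
  isEnd = false;
}

class NameTrie {
  private root = new TrieNode();

  insert(name: string): void {
    let node = this.root;
    for (const ch of name.toLowerCase()) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch)!;
    }
    node.isEnd = true;
  }

  // Every stored name that starts with the typed prefix, spaces and all.
  complete(prefix: string): string[] {
    const p = prefix.toLowerCase();
    let node = this.root;
    for (const ch of p) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    const results: string[] = [];
    const walk = (n: TrieNode, acc: string) => {
      if (n.isEnd) results.push(p + acc);
      for (const [ch, child] of n.children) walk(child, acc + ch);
    };
    walk(node, '');
    return results;
  }
}

// const trie = new NameTrie();
// ['mary', 'mary ann', 'mark'].forEach(n => trie.insert(n));
// trie.complete('mary ') -> ['mary ann']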
If you are using a multiline editor, I guess the best choice for autocomplete items would be single words. So firstname, middlename, and lastname must be parsed and added as separate lookup items.
For a (single-line) textbox you can include whitespace (and the firstname + space + middlename + space + lastname pattern) in the search criteria.
