ElasticSearch first/last name match - elasticsearch

I have two fields: first and last
I'm trying to use multi_match to fuzzy match full names:
"multi_match": {
"query": name,
"fields": [
"first",
"last",
],
"fuzziness": 0.1
}
This search only matches when the query is exactly first + ' ' + last. This is undesirable.
What would be a more effective first-last name search technique with ElasticSearch? (assume the two fields must be separate)
e.g. typing Dan Smi should match Danny Smith

It sounds like you're looking for Phonetic Analysis, which can be used to create new tokens that represent what the original tokens sound like.
I created a runnable example with your example data here, which shows a search for "Dan Smi" matching the first and last name fields using a double metaphone filter.
The github page of the Phonetic Analysis plugin contains the name of all the other implemented phonetic token filters that you might want to try out as well.
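For reference, a minimal mapping sketch along those lines (index, analyzer and filter names are made up here, and it assumes the analysis-phonetic plugin is installed) could look like:

PUT /people
{
  "settings": {
    "analysis": {
      "filter": {
        "name_metaphone": {
          "type": "phonetic",
          "encoder": "double_metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "name_phonetic": {
          "tokenizer": "standard",
          "filter": ["lowercase", "name_metaphone"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "first": { "type": "text", "analyzer": "name_phonetic" },
      "last":  { "type": "text", "analyzer": "name_phonetic" }
    }
  }
}

With replace set to false the original tokens are kept alongside the phonetic ones, so exact matches still work, and the multi_match from the question can simply be pointed at these fields. (On older Elasticsearch versions the mapping needs a type name and "string" instead of "text".)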

Oh true, re-reading your question, this is more about analysis. You can play with analyzers/stemmers online here ==> http://es.subitolabs.com/#/testr/20061741
Another thing: did you look at something called "Suggesters"? Quite new, but very powerful ==> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters.html
In such a case (I mean cross_field), you may need to do some extra work around ES: tokenize your input string first (use the ES analysis API to get the token pieces), then run a suggester for each token and re-assemble the results.
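A rough sketch of that flow, assuming an index with separate first and last fields (syntax for recent Elasticsearch versions): get the tokens from the _analyze API, then run a suggester per token:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Dan Smi"
}

POST /people/_search
{
  "suggest": {
    "first-name": { "text": "dan", "term": { "field": "first" } },
    "last-name":  { "text": "smi", "term": { "field": "last" } }
  }
}

Whether a term suggester or a completion suggester fits best depends on your data; the completion suggester needs a dedicated completion field in the mapping.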

Related

How do I analyze text that doesn't have a separator (eg a domain name)?

I have a bunch of domain names without the tld I'd like to search but they don't always have a natural break in between words (like a "-"). For instance:
techtarget
americanexpress
theamericanexpress // a non-existent site
thefacebook
What is the best analyzer to use? e.g. if a user types in "american ex" I'd like to prioritize "americanexpress" over "theamericanexpress". A simple prefix query would work in this particular case, but then a user types in "facebook" and that doesn't return anything. ;(
In most cases, including yours, the Standard Analyzer is sufficient. It is also the default analyzer in ElasticSearch and provides grammar-based tokenization. For example:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." will be tokenized into [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
In your case, the domain names are tokenized into the list of terms [techtarget, americanexpress, theamericanexpress, thefacebook].
Why does a search for facebook not return anything?
Because there is no facebook term stored in the dictionary, so the search returns no data. What's going on is that ES tries to find the search term facebook in the dictionary, but the dictionary only contains thefacebook, so the search returns no result.
Solution:
In order to match the search term facebook with thefacebook, you need to wrap wildcards around your search term, i.e. a wildcard pattern like *facebook* (or a regexp like .*facebook) will match thefacebook. However, you should know that wildcard and regexp queries have a performance overhead.
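For example (index and field names are assumptions), either of these would match thefacebook:

GET /domains/_search
{
  "query": {
    "wildcard": { "domain": "*facebook*" }
  }
}

GET /domains/_search
{
  "query": {
    "regexp": { "domain": ".*facebook" }
  }
}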
Another workaround is to use synonyms. A synonym filter lets you specify a list of alternative search terms for a term, e.g. "facebook, thefacebook, facebooksocial, fb, fbook". With these synonyms in place, any one of those terms will match any of the others, i.e. if your search term is facebook and the domain is stored as thefacebook, then the search will match.
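A sketch of such a synonym filter (index, analyzer and field names are made up; mapping syntax is for recent Elasticsearch versions without mapping types):

PUT /domains
{
  "settings": {
    "analysis": {
      "filter": {
        "domain_synonyms": {
          "type": "synonym",
          "synonyms": ["facebook, thefacebook, facebooksocial, fb, fbook"]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "domain_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": { "type": "text", "analyzer": "domain_analyzer" }
    }
  }
}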
Also, for prioritization you first need to understand how scoring works in ES, and then you can use Boosting.
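As a rough illustration of boosting (again with assumed names): combine a boosted exact clause with a looser clause so that americanexpress ranks above theamericanexpress:

GET /domains/_search
{
  "query": {
    "bool": {
      "should": [
        { "term":     { "domain": { "value": "americanexpress", "boost": 2.0 } } },
        { "wildcard": { "domain": { "value": "*american*" } } }
      ]
    }
  }
}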

How to search emoticon/emoji in elasticsearch?

I am trying to search emoticon/emoji-containing text in elasticsearch. Earlier, I inserted tweets into ES. Now I want to search, for example, for tweets with smiling or sad faces. I tried the following:
1) Used the unicode equivalent of a smile, but it didn't work. No results were returned.
GET /myindex/twitter_stream/_search
{
"query": {
"match": {
"text": "\u1f603"
}
}
}
How do I set up emoji search in elasticsearch? Do I have to encode the raw tweets before ingesting them into elasticsearch? What would the query be? Any experienced approaches? Thanks.
The specification explains how to search for emoji:
Searching includes both searching for emoji characters in queries, and
finding emoji characters in the target. These are most useful when
they include the annotations as synonyms or hints. For example, when
someone searches for ⛽︎ on yelp.com, they see matches for “gas
station”. Conversely, searching for “gas pump” in a search engine
could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would
expect a search for ⛽︎ to result in matches for “Tankstelle”.
You can keep the real unicode character and expand it to its annotations in each language you aim to support.
This can be done with a synonym filter. But Elasticsearch's standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifier, clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
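Leaving the cleanup steps aside, the core of it is a whitespace-based analyzer plus a synonym filter that maps each emoji to its annotation. A minimal sketch, reusing the index and field names from the question (analyzer names and the synonym list are just examples, mapping syntax for recent Elasticsearch versions):

PUT /myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "emoji_synonyms": {
          "type": "synonym",
          "synonyms": [
            "⛽ => fuel, gas station",
            "😃 => smile, happy face"
          ]
        }
      },
      "analyzer": {
        "emoji_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["emoji_synonyms", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "emoji_analyzer" }
    }
  }
}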
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
The way I have seen emoticons handled is that a string is actually stored in place of the image counterpart when they are stored in a database, e.g. a smile is stored as :smile:. You can verify whether that is the case for you. If so, you can add a custom tokenizer that does not tokenize on colons, so that an exact match for the emoticons can be made. Then, while searching, you just need to convert the emoticon image in the search to the appropriate string and elasticsearch will be able to find it. Hope it helps
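If your tweets really do store :smile:-style strings, a minimal sketch (reusing the index and field from the question; analyzer name assumed, mapping syntax for recent versions) is a whitespace-based analyzer so the colons survive tokenization, plus an ordinary match query:

PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "emoticon_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "emoticon_analyzer" }
    }
  }
}

GET /myindex/_search
{
  "query": { "match": { "text": ":smile:" } }
}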

ElasticSearch Nest AutoComplete based on words split by whitespace

I have AutoComplete working with ElasticSearch (Nest) and it's fine when the user types in the letters from the beginning of the phrase, but I would like to be able to use a specialized type of auto complete, if possible, that caters for words within a sentence.
To clarify further, my requirement is to be able to "auto complete" like such:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
    body => body
        .Index(indexName)
        .SuggestCompletion("content-suggest" + Guid.NewGuid(),
            descriptor =>
                descriptor
                    .OnField(t => t.Title.Suffix("completion"))
                    .Text(searchTerm)
                    .Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer here, but I'll leave this one up since the title/description is probably a little easier for search engines to find.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
#Kha I think it's better to use the NGram Tokenizer.
So you should use this tokenizer when you create the mapping.
If you want more info, and maybe an example, write back.
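For what it's worth, a rough sketch of such a mapping (names and gram sizes are purely illustrative; on recent Elasticsearch versions index.max_ngram_diff must be raised when max_gram - min_gram exceeds 1):

PUT /content
{
  "settings": {
    "index": { "max_ngram_diff": 13 },
    "analysis": {
      "tokenizer": {
        "title_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "title_ngram_analyzer": {
          "tokenizer": "title_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "title_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

If you only need matches from the start of each word (as in the "som"/"title" example), an edge_ngram token filter over a standard tokenizer is a lighter alternative.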

Amazon Cloudsearch not searching with partial string

I'm testing Amazon Cloudsearch for my web application and I'm running into some strange issues.
I have the following domain indexes: name, email, id.
For example, I have data such as: John Doe, John@example.com, 1
When I search for jo I get nothing. If I search for joh I still get nothing, but if I search for john then I get the above document as a hit. Why is it not matching when I put in partial strings? I even put suggesters on name and email with fuzzy matching enabled. Is there something else I'm missing? I read the below on this:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching.html
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-compound-queries.html
I'm doing the searches using boto as well as with the form on AWS page.
What you're trying to do -- finding "john" by searching "jo" -- is called a prefix search.
You can accomplish this either by searching
(prefix field=name 'jo')
or
q=jo*
Note that if you use the q=jo* method of appending * to all your queries, you may want to do something like q=jo* |jo because john* will not match john.
This can seem a little confusing but imagine if google gave back results for prefix matches: if you searched for tort and got back a mess of results about tortoises and torture instead of tort (a legal term), you would be very confused (and frustrated).
A suggester is also a viable approach but that's going to give you back suggestions (like john, jordan and jostle rather than results) that you would then need to search for; it does not return matching documents to you.
See "Searching for Prefixes in Amazon CloudSearch" at http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-text.html
Are your index field types "Text"? If they are just "Literals", they have to be an exact match.
I think you must have your name and email fields set as the literal type instead of the text type, otherwise a simple text search of 'jo' or 'Joh' should've found the example document.
While using a prefix search may have solved your problem (and that makes sense if the fields are set as the literal type), the accepted answer isn't really correct. The notion that it's "like a google search" isn't based on anything in the documentation. It actually contradicts the example they use, and in general muddies up what's possible with the service. From the docs:
When you search text and text-array fields for individual terms, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, in the sample movie data, the title field is configured as a text field. If you search the title field for star, you will find all of the movies that contain star anywhere in the title field, such as star, star wars, and a star is born. This differs from searching literal fields, where the field value must be identical to the search string to be considered a match.

How to autocomplete and perform contains for same field

I'm trying to use the autocomplete functionality which I have in place using the following mappings:
analysis: filter: placename_ngram: type = edge_ngram, min_gram = 2, max_gram = 15
analyzer: index: tokenizer = keyword, filter = [lowercase, placename_ngram]
analyzer: placename_search: tokenizer = keyword, filter = [lowercase]
This works great for type-ahead, but when I'm trying to do a "contains"-style search it doesn't return the record.
Such as
If I'm doing a text query on "Lake".
I will only get
Lake...
Lake Wood,
But will not get
Smithtown Lake
I have the field set up as a multi-field and can use a wildcard to find the values, but I'm not sure whether this is efficient.
I believe I could use NGram, but that seems like a lot of overhead considering I only need to index terms split by whitespace (or by word), not every permutation.
Any thoughts?
When I change the tokenizer on both to "standard", it will then find these records, but my autocomplete gets messed up and brings back Smithtown Lake when typing Lak... (which in this case I don't want).
Thanks for your help
Have a look at this question, where they were doing 2 different queries on the same field.
Basically you are there, but you need to write 2 different queries: one for autocomplete time and the other for full-blown-search time.
You even describe wanting to have "Smithtown Lake" returned during search but not during autocomplete; you need different queries if you want different results!
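As an illustration (the field and sub-field names here are assumptions, for a multi-field where one sub-field uses your edge_ngram index analyzer and the other the standard analyzer):

At autocomplete time:

GET /places/_search
{
  "query": { "match": { "placename.autocomplete": "lak" } }
}

At full-search time:

GET /places/_search
{
  "query": { "match": { "placename.full": "lake" } }
}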
I think Shingles are exactly what you are looking for. You could say they are NGrams for terms. Check out this.
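A minimal sketch of a shingle filter (names and shingle sizes are illustrative): it emits word n-grams such as "smithtown lake" alongside the single terms:

PUT /places
{
  "settings": {
    "analysis": {
      "filter": {
        "placename_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "placename_shingles"]
        }
      }
    }
  }
}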

Resources