How to autocomplete and perform contains for same field - elasticsearch

I'm trying to use the autocomplete functionality I have in place, built with the following analysis settings:

analysis:
  filter:
    placename_ngram:
      type: edge_ngram
      min_gram: 2
      max_gram: 15
  analyzer:
    index:
      tokenizer: keyword
      filter: [lowercase, placename_ngram]
    placename_search:
      tokenizer: keyword
      filter: [lowercase]
This works great for type-ahead, but when I try a "contains" style search it doesn't return the record.
For example, if I'm doing a text query on "Lake", I will only get
Lake...
Lake Wood,
But will not get
Smithtown Lake
I have the field set up as a multi-field and can use a wildcard query to find the values, but I'm not sure this is efficient.
I believe I could use NGram, but that seems like a lot of overhead considering I only need to index terms by whitespace (i.e. by word), not every permutation.
Any thoughts?
When I change the tokenizer on both to "standard", it then finds these records, but my autocomplete gets messed up: it brings back "Smithtown Lake" when typing "Lak", which in this case I don't want.
Thanks for your help

Have a look at this question, where they run two different queries against the same field.
Basically you are there, but you need to write two different queries: one for autocomplete time and the other for full search time.
You even describe wanting "Smithtown Lake" returned during search but not during autocomplete; if you want different results, you need different queries!
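As a sketch of that idea (the sub-field names here are illustrative, and the mapping uses the older index_analyzer/search_analyzer setting names, so adjust for your Elasticsearch version): keep two sub-fields on the same multi-field, one analyzed with your edge-ngram analyzer for autocomplete and one with the standard analyzer for full search, then query a different sub-field per use case.

```json
{
  "placename": {
    "type": "string",
    "fields": {
      "autocomplete": {
        "type": "string",
        "index_analyzer": "index",
        "search_analyzer": "placename_search"
      },
      "fulltext": {
        "type": "string",
        "analyzer": "standard"
      }
    }
  }
}
```

At autocomplete time, run a match query against placename.autocomplete; at search time, run it against placename.fulltext, which tokenizes on whitespace so "Lake" matches "Smithtown Lake".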

I think shingles are exactly what you are looking for; you can think of them as NGrams over terms. Check this out.
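To build intuition for what a shingle filter emits (a rough sketch, not Elasticsearch's actual implementation), here is the word-n-gram idea in Python: "Smithtown Lake" with shingle sizes 1-2 yields the tokens "smithtown", "lake", and "smithtown lake", so a term query for "lake" matches.

```python
def shingles(text, min_size=1, max_size=2):
    """Emit word n-grams ("shingles") from lowercased, whitespace-split text.

    A toy model of Elasticsearch's shingle token filter, for illustration only.
    """
    words = text.lower().split()
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(words) - size + 1):
            out.append(" ".join(words[i:i + size]))
    return out
```

Unlike character NGrams, this only multiplies the index by the number of shingle sizes, not by every character permutation.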

Related

How do I analyze text that doesn't have a separator (e.g. a domain name)?

I have a bunch of domain names (without the TLD) I'd like to search, but they don't always have a natural break between words (like a "-"). For instance:
techtarget
americanexpress
theamericanexpress // a non-existent site
thefacebook
What is the best analyzer to use? E.g. if a user types in "american ex", I'd like to prioritize "americanexpress" over "theamericanexpress". A simple prefix query would work in this particular case, but then a user types in "facebook" and that doesn't return anything. ;(
In most cases, including yours, the Standard Analyzer is sufficient. It is also the default analyzer in Elasticsearch, and it provides grammar-based tokenization. For example:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." is tokenized into [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ].
In your case, the domain names are tokenized into the terms [techtarget, americanexpress, theamericanexpress, thefacebook].
Why does a search for facebook not return anything?
Because there is no facebook term stored in the dictionary, so the search returns no data. What's going on is that ES tries to find the search term facebook in the dictionary, but the dictionary only contains thefacebook, hence the search returns no result.
Solution:
In order to match the search term facebook with thefacebook, you need to wrap wildcards around your search term, e.g. the regexp .*facebook will match thefacebook. However, you should know that using regex carries a performance overhead.
Another workaround is synonyms. With synonyms you specify a list of alternative search terms for a given term, e.g. "facebook, thefacebook, facebooksocial, fb, fbook". With these in place, any one of the terms will match any of the others, i.e. if your search term is facebook and your domain is stored as thefacebook, the search will match.
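A minimal sketch of such a synonym setup (filter and analyzer names here are illustrative, and the synonym list would need to be curated per domain):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "domain_synonyms": {
          "type": "synonym",
          "synonyms": [
            "facebook, thefacebook, facebooksocial, fb, fbook"
          ]
        }
      },
      "analyzer": {
        "domain_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "domain_synonyms"]
        }
      }
    }
  }
}
```

With domain_analyzer applied to the field, a match query for "facebook" will also hit documents containing "thefacebook".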
Also, for prioritization you first need to understand how scoring works in ES; then you can use boosting.

elasticsearch - fulltext search for words with special/reserved characters

I am indexing documents that may contain any special/reserved characters in their fulltext body. For example
"PDF/A is an ISO-standardized version of the Portable Document Format..."
I would like to be able to search for pdf/a without having to escape the forward slash.
How should I analyze my query string, and what type of query should I use?
The default standard analyzer will tokenize a string like that so that "PDF" and "A" become separate tokens. The "A" token might then be removed by the stop token filter (see Standard Analyzer). So without any custom analyzers, you will typically match any document containing just "PDF".
You can try creating your own analyzer, modeled on the standard analyzer, that includes a Mapping Char Filter. The idea is that "PDF/A" gets transformed into something like "pdf_a" at both index and query time; a simple match query then works just fine. But this is a very simplistic approach, and you might want to consider how '/' characters are used in your content and use slightly more complex regex filters, which are also not perfect solutions.
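A sketch of that analyzer (the char filter and analyzer names are illustrative): the mapping char filter rewrites '/' to '_' before tokenization, and the standard tokenizer keeps "pdf_a" as a single token.

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "slash_to_underscore": {
          "type": "mapping",
          "mappings": ["/ => _"]
        }
      },
      "analyzer": {
        "slash_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["slash_to_underscore"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Because the same analyzer runs on the query string, a match query for "pdf/a" is rewritten to "pdf_a" and matches without any escaping.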
Sorry, I completely missed your point about having to escape the character. Can you elaborate on your use case if this turns out to not be helpful at all?
To support queries containing reserved characters, I now use the Simple Query String Query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html)
Since it does not use a query parser it is a bit limited (e.g. no field queries like id:5), but it serves the purpose.
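For reference, such a query might look like this (index and field names are illustrative):

```json
GET /myindex/_search
{
  "query": {
    "simple_query_string": {
      "query": "pdf/a",
      "fields": ["body"],
      "default_operator": "and"
    }
  }
}
```

Unlike query_string, simple_query_string never throws a parse error on reserved characters; it simply treats them as text or ignores them.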

How to search emoticon/emoji in elasticsearch?

I am trying to search text containing emoticons/emoji in Elasticsearch. I have previously inserted tweets into ES, and now I want to search for tweets containing, for example, smiling or sad faces. I tried the following:
1) I used the unicode value of a smile, but it didn't work; no results were returned.
GET /myindex/twitter_stream/_search
{
  "query": {
    "match": {
      "text": "\u1f603"
    }
  }
}
How do I set up emoji search in Elasticsearch? Do I have to encode the raw tweets before ingesting them into Elasticsearch? What would the query be? Any experienced approaches? Thanks.
The specification explains how to search for emoji:

"Searching includes both searching for emoji characters in queries, and finding emoji characters in the target. These are most useful when they include the annotations as synonyms or hints. For example, when someone searches for ⛽︎ on yelp.com, they see matches for “gas station”. Conversely, searching for “gas pump” in a search engine could find pages containing ⛽︎. Annotations are language-specific: searching on yelp.de, someone would expect a search for ⛽︎ to result in matches for “Tankstelle”."
You can keep the real unicode char and expand it to its annotation in each language you aim to support.
This can be done with a synonym filter, but Elasticsearch's standard tokenizer will remove the emoji, so there is quite a lot of work to do:
remove emoji modifier, clean everything up;
tokenize via whitespace;
remove undesired punctuation;
expand the emoji to their synonyms.
The whole process is described here: http://jolicode.com/blog/search-for-emoji-with-elasticsearch (disclaimer: I'm the author).
The way I have seen emoticons handled is that a string is stored in place of the image counterpart when storing them in a database, e.g. a smile is stored as :smile:. You can verify whether that's the case for you. If it is, you can add a custom tokenizer that does not split on colons, so an exact match on the emoticon string can be made. Then, while searching, you just need to convert the emoticon image in the search into the appropriate string, and Elasticsearch will be able to find it. Hope it helps.

ElasticSearch Nest AutoComplete based on words split by whitespace

I have autocomplete working with Elasticsearch (NEST), and it's fine when the user types in the letters from the beginning of the phrase, but I would like a specialized type of autocomplete, if possible, that caters for words within a sentence.
To clarify further, my requirement is to be able to "auto complete" like such:
Imagine the full indexed string is "this is some title". When the user types in "th", this comes back as a suggestion with my current code.
I would also like the same thing to be returned if the user types in "som" or "title" or any letters that form a word (word being classified as a string between two spaces or the start/end of the string).
The code I have is:
var result = _client.Search<ContentIndexable>(
    body => body
        .Index(indexName)
        .SuggestCompletion("content-suggest" + Guid.NewGuid(),
            descriptor =>
                descriptor
                    .OnField(t => t.Title.Suffix("completion"))
                    .Text(searchTerm)
                    .Size(size)));
And I would like to see if it would be possible to write something that matches my requirement using SuggestCompletion (and not by doing a match query).
Many thanks,
Update:
This question already has an answer here, but I leave this one up since its title/description is probably easier for search engines to find.
The correct solution to this problem can be found here:
Elasticsearch NEST client creating multi-field fields with completion
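The gist of the linked solution (a sketch; the field name and the older completion-suggester "output" option are illustrative, so check your ES version): feed every word suffix of the title as a separate input to the completion field, so the suggester can match from any word boundary, not just the start of the phrase.

```json
{
  "title": "this is some title",
  "title_suggest": {
    "input": ["this is some title", "is some title", "some title", "title"],
    "output": "this is some title"
  }
}
```

Typing "som" or "title" then matches one of the inputs, while the suggester still returns the full original phrase.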
@Kha, I think it's better to use the NGram tokenizer.
So you should use this tokenizer when you create the mapping.
If you want more info, or perhaps an example, write back.

Elastic Search - Exact phrase search with wildcards

I am looking for help on exact phrase search with wild card.
QueryBuilders.multiMatchQuery("Java Se", "title", "subtitle")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX);
The above query returns the following results:
1) Java Search
2) Elastic Java Search
Trailing wildcard works.
But when I search with the below query,
QueryBuilders.multiMatchQuery("ava Se", "title", "subtitle")
.type(MatchQueryBuilder.Type.PHRASE_PREFIX);
it does not return anything, as nothing matches "ava Se" exactly.
I was expecting the same results as above.
A leading wildcard does not work.
Is there any way to achieve this?
Thanks,
Baskar.S
If you have a look at the javadoc for Type.PHRASE_PREFIX, you will see that only the last term in the string is used as a prefix, thus only "Se" in your case.
I tried this query in my index and it worked:
.setQuery(QueryBuilders.matchQuery("body", "(.*?)ing the").type(MatchQueryBuilder.Type.PHRASE_PREFIX))
It returned documents that contain phrases like "We are strengthening the proposals..", "By using the.."
You need to use an nGram analyzer (an edgeNGram only covers prefixes, so for infix matches like "ava Se" you need full nGrams).
Once you have done that, your index may be somewhat heavier, but infix search will work fine without wildcards.
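A sketch of such a setup (filter and analyzer names are illustrative; tune the gram sizes to your data, since small min_gram values inflate the index quickly). Note that the nGram filter should only run at index time; the search side should use a plain analyzer, or every query term would itself be exploded into grams.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "infix_ngram": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "infix_index": {
          "tokenizer": "standard",
          "filter": ["lowercase", "infix_ngram"]
        },
        "infix_search": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

With "Java Search" indexed this way, the grams include "ava" and "sea", so a plain match for "ava se" can hit the document without any wildcards.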
