Elasticsearch Suggest+Synonyms+fuzziness - elasticsearch

I am looking for a way to implement auto-suggest with synonyms & fuzziness.
For example, when the user tries to search for "replce ar":
My synonym list has ar => audio record
So, the result should include the items matching
changing audio record
replacing audio record
etc.
Here we need fuzziness because there is a typo in "replace" (in the user's search text),
synonyms to match ar => audio record,
and auto-suggest with a regex pattern.
Is it possible to implement all three features in a single field?
Edit:
A regex + fuzzy query just throws an error.
I haven't explained my need for a regex pattern well:
I need a regex for doing a partial-word lookup ('encyclopedic' contains 'cyclo').
After investigating the options I have for this purpose, which pointed me to the NGram tokenizer, and looking into the other suggesters, I found that the phrase suggester may really be what I'm looking for, so I'll try it and report back.

Yes, you can use synonyms as well as fuzziness for suggestions. Synonyms are handled by defining a synonym token filter and including it in a custom analyzer. Then, when you create the mapping for the field(s) you want to use for suggestions, you assign that analyzer to the field.
As for fuzziness, that happens at query time. Most text-based queries support a fuzziness option, which lets you specify how many corrections (edits) to allow. The default auto value adjusts the number of allowed edits to the length of the term, so that's usually the best choice.
Notional analysis setup (synonym_graph reference)
{
  "analysis": {
    "filter": {
      "synonyms": {
        "type": "synonym_graph",
        "expand": false,
        "synonyms": [
          "ar => audio record"
        ]
      }
    },
    "analyzer": {
      "synonyms": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonyms"
        ]
      }
    }
  }
}
Notional Field Mapping (Analyzer + Mapping reference)
(Note that the analyzer name matches the analyzer defined in the settings above.)
{
  "properties": {
    "suggestion": {
      "type": "text",
      "analyzer": "synonyms"
    }
  }
}
Notional Query
{
  "query": {
    "match": {
      "suggestion": {
        "query": "replce ar",
        "fuzziness": "auto",
        "operator": "and"
      }
    }
  }
}
Keep in mind that there are several different options for suggestions, so depending on which one you use, you may need to adjust the way the field is mapped, or even add another token filter to the analyzer. An analyzer is essentially a tokenizer plus a chain of token filters, so you can usually combine whatever token filters you need to achieve your goal. Just make sure you understand what each filter does so that you apply them in the correct order. For the partial-word lookup mentioned in your edit, an ngram-analyzed sub-field is one option, as sketched below.
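For example, a notional way to cover the 'encyclopedic' contains 'cyclo' case is an ngram sub-field alongside the synonym-analyzed field. This is only a sketch: the sub-field name, the gram sizes, and the max_ngram_diff value are illustrative assumptions, and the synonyms analyzer is assumed to be the one defined in the settings above, created in the same index.
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "filter": {
        "partial_words": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "partial": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "partial_words"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "suggestion": {
        "type": "text",
        "analyzer": "synonyms",
        "fields": {
          "partial": {
            "type": "text",
            "analyzer": "partial",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}
Only substrings between min_gram and max_gram characters long will match the suggestion.partial sub-field, so those sizes need tuning for your data, and ngrams noticeably inflate the index.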
If you get stuck in part of this process, just submit another question with the specific issue you're running into. Good luck!

Related

ElasticSearch: when to use multi-field

We have an index with a keyword field that is very often an ip address, but not always. We'd like to be able to search this index on that field using not just keywords but also CIDR notation, which is supported only for fields of type 'ip'. On the surface, this looks like a use case for multi-fields.
From https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html:
It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields
So it seems like the following mapping would make sense for us:
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "keyword",
        "fields": {
          "ip": {
            "type": "ip",
            "ignore_malformed": true
          }
        }
      }
    }
  }
}
So, when our application has a mixed set of non-ip values, ip addresses, and CIDR-notation blocks/ranges and needs to query by them, I assume it would split that set into one list of non-ip values and another of ip addresses/CIDR blocks, and build two separate terms filters from them in the query, like so:
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "my_field.ip": [
              "123.123.123.0/24",
              "192.168.0.1",
              "192.168.16.255",
              "192.169.1.0/24"
            ]
          }
        },
        {
          "terms": {
            "my_field": [
              "someDomain.com",
              "notAnIp.net"
            ]
          }
        }
      ]
    }
  }
}
Is this a proper use of multi-fields? Should we be achieving this some other way? It's unlike the examples given for using multi-fields in that it's really a subset of the values for the field, not all, because I'm using ignore_malformed to discard the non-ip addresses from the sub-field. If there's a better way, what is it?
Yes, your understanding of multi-fields is correct; you just need to define each sub-field explicitly (its data type and, where relevant, its analyzer) in the mapping so that it is indexed the way you intend.
Once the data is indexed in the format you want, you can include or exclude the sub-fields in your queries depending on the use case.
Multi-fields with multiple analyzers, which is a very common way to implement multilingual search, is another good example you can refer to; a sketch follows.
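For illustration, a notional multilingual variant of the same idea (the field name and the choice of the built-in english and french analyzers here are just examples, not something from your index):
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          },
          "french": {
            "type": "text",
            "analyzer": "french"
          }
        }
      }
    }
  }
}
A query can then target title, title.english, or title.french depending on the language of the search text, just as your query targets my_field or my_field.ip depending on whether the value is an ip.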

In Elasticsearch, how do I search for an arbitrary substring?

In Elasticsearch, how do I search for an arbitrary substring, perhaps including spaces? (Searching for part of a word isn't quite enough; I want to search any substring of an entire field.)
I imagine it has to be in a keyword field, rather than a text field.
Suppose I have only a few thousand documents in my Elasticsearch index, and I try:
"query": {
"wildcard" : { "description" : "*plan*" }
}
That works as expected--I get every item where "plan" is in the description, even ones like "supplantation".
Now, I'd like to do
"query": {
"wildcard" : { "description" : "*plan is*" }
}
...so that I might match documents with "Kaplan isn't" among many other possibilities.
It seems this isn't possible with wildcard, match prefix, or any other query type I might see. How do I simply search on any substring? (In SQL, I would just do description LIKE '%plan is%')
(I am aware any such query would be slow or perhaps even impossible for large data sets.)
Have you tried the regexp query in Elasticsearch? It sure sounds like something you might be interested in.
I was hoping there might be something built into Elasticsearch for this, given that a simple substring search seems like a very basic capability (thinking about it, it is implemented as strstr() in C, LIKE '%%' in SQL, Ctrl+F in most text editors, String.IndexOf in C#, etc.), but this seems not to be the case. Note that the regexp query doesn't support case insensitivity, so I also needed to pair it with the custom analyzer below, so that the index is all-lowercase. Then I can convert my search string to lowercase as well.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    ...
    "description": { "type": "text", "analyzer": "lowercase_keyword" }
  }
}
Example query:
"query": {
"regexp" : { "description" : ".*plan is.*" }
}
Thanks to Jai Sharma for leading me; I just wanted to provide more detail.
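As a quick sanity check, the _analyze API can confirm that the custom analyzer collapses the whole value to a single lowercased token (my_index here is just a placeholder for your index name):
GET my_index/_analyze
{
  "analyzer": "lowercase_keyword",
  "text": "Kaplan isn't the only option"
}
This should return one token, "kaplan isn't the only option", which is why the regexp pattern can match across word boundaries in the lowercased field value.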

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find the missing word in a phrase? For example, I want to find all documents which contain the pattern make * great again. I tried using a wildcard query, but it returned no results:
{
  "fields": [
    "file_name",
    "mime_type",
    "id",
    "sha1",
    "added_at",
    "content.title",
    "content.keywords",
    "content.author"
  ],
  "highlight": {
    "encoder": "html",
    "fields": {
      "content.content": {
        "number_of_fragments": 5
      }
    },
    "order": "score",
    "tags_schema": "styled"
  },
  "query": {
    "wildcard": {
      "content.content": "make * great again"
    }
  }
}
If I put in a word and use a match_phrase query I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some kind of custom analyzer to the field?
Wildcard queries operate on terms, so if you use it on an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that. Or just map the actual field to not be analyzed, if you don't need to query it in other ways.
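A notional version of that sub-field approach, using the keyword type (the modern equivalent of not_analyzed). The raw sub-field name mirrors the answer, and the wrapping * are needed because a wildcard query must match the entire term:
{
  "mappings": {
    "properties": {
      "content": {
        "properties": {
          "content": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}
and then the wildcard query runs against the raw sub-field:
{
  "query": {
    "wildcard": {
      "content.content.raw": "*make * great again*"
    }
  }
}
Bear in mind that a keyword sub-field stores the whole field value as a single term, so this only works for reasonably short content fields.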

Is it possible to return the analyzed fields in an ElasticSearch >2.0 search?

This question feels very similar to an old question posted here: Retrieve analyzed tokens from ElasticSearch documents, but to see if there are any changes I thought it would make sense to post it again for the latest version of ElasticSearch.
We are trying to search bodies of text in ElasticSearch, with the search query and field mapping using the snowball stemmer built into ElasticSearch. The performance and results are great, but because we need the stemmed text body for post-analysis, we would like the search results to return the actual stemmed tokens of the text field for each document.
The mapping for the field currently looks like:
"TitleEnglish": {
"type": "string",
"analyzer": "standard",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
},
"stemming": {
"type": "string",
"analyzer": "snowball"
}
}
}
and the search query is performed specifically on TitleEnglish.stemming. Ideally I would like that field to be returned, but returning it gives back the original value, not the analyzed (stemmed) tokens.
Does anybody know of any way to do this? We have looked at Term Vectors, but they only seem to be returnable for individual documents or a body of documents, not for a search result?
Or perhaps other solutions like Solr or Sphinx do offer this option?
To add some extra information. If we run the following query:
GET /_analyze?analyzer=snowball&text=Eight issue of Industrial Lorestan eliminate barriers to facilitate the Committees review of
It returns the stemmed words: eight, issu, industri, etc. This is exactly the result we would like back for each matching document for all of the words in the text (so not just the matches).
Unless I'm missing something evident, why not simply return a terms aggregation on the TitleEnglish.stemming field?
{
  "query": { ... },
  "aggs": {
    "stems": {
      "terms": {
        "field": "TitleEnglish.stemming",
        "size": 50
      }
    }
  }
}
Adding that aggregation to your query, you'd get a breakdown of all the stemmed terms in the TitleEnglish.stemming sub-field from the documents that matched your query.
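For concreteness, a notional full request body (the match query text is just borrowed from the sample sentence above; any query works here):
{
  "query": {
    "match": {
      "TitleEnglish.stemming": "Industrial Lorestan eliminate barriers"
    }
  },
  "aggs": {
    "stems": {
      "terms": {
        "field": "TitleEnglish.stemming",
        "size": 50
      }
    }
  }
}
One caveat: the aggregation gives the stemmed vocabulary across the whole result set rather than a per-document token list, and aggregating on an analyzed string field relies on fielddata, which can be heavy on heap for large text fields.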

problems with phrase matching in elasticsearch

I'm trying to perform Phrase matching using elasticsearch.
Here is what I'm trying to accomplish:
data -
1: {
  "test": {
    "title": "text1 text2"
  }
}
2: {
  "test": {
    "title": "text3 text4"
  }
}
3: {
  "test": {
    "title": "text5"
  }
}
4: {
  "test": {
    "title": "text6"
  }
}
Search terms:
If I look up "text0 text1 text2 text3" - it should return #1 (its full title appears in the search string).
If I look up "text6 text5 text4 text3" - it should return #4 and #3, but not #2, since its words are not in the same order.
Here is what I've tried:
set the index_analyzer to keyword and the search_analyzer to standard
also tried creating custom tokens
but none of my solutions lets me match a substring of the search query against a keyword in the document.
If anyone has written similar queries, can you share how the mappings are configured and what kind of query is used?
What I see here is this: you want your search to match on any of the tokens sent in the query, and when a token does match, it must match the title exactly.
This means that indexing your title field as a keyword would get you that mandatory exact match. However, the standard analyzer at search time would never match multi-word titles, because your index token would be "text1 text2" while your search tokens would be ["text1", "text2"]. You can't use a phrase match with a slop value either, or your token-order requirement will be ignored.
So what you really need is to generate keyword tokens at index time, but generate shingles at search time. Shingles maintain word order, and if one of them matches the indexed title, consider it a hit. I would configure the shingle filter not to output unigrams, but to allow unigrams when no shingles can be formed. That way, if the query is just one word, that single token is emitted; but if the query words can be combined into shingled tokens of various lengths, no single-word tokens are emitted.
PUT
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "max_shingle_size": 50,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "my_shingler": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_shingle"
          ]
        }
      }
    }
  }
}
Then you just want to set your type mapping to use the keyword analyzer for index and the `my_shingler` analyzer for search.
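A notional mapping along those lines (the field name is illustrative, and in current Elasticsearch versions the mapping parameters are analyzer and search_analyzer rather than index_analyzer):
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "keyword",
        "search_analyzer": "my_shingler"
      }
    }
  }
}
and a query that exercises the first example from the question:
{
  "query": {
    "match": {
      "title": "text0 text1 text2 text3"
    }
  }
}
Here the search text is shingled into tokens such as "text1 text2", which matches the keyword-indexed title of #1. Since my_shingler lowercases and ascii-folds while the built-in keyword analyzer does not, you may want a custom keyword-tokenizer analyzer with those same filters on the index side so both sides produce comparable tokens.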
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html
