Elasticsearch: query for multiple words across multiple fields (with prefix)

I'm trying to implement an auto-suggest control powered by an ES index. The index has multiple fields and I want to be able to query across multiple fields using the AND operator and allowing for partial matches (prefix only).
Just as an example, let's say I have 2 fields I want to query on: "colour" and "animal".
I would like to be able to fulfil queries like "duc", "duck", "purpl", "purple", "purple duck".
I managed to get all of these working using multi_match() with the AND operator.
What I don't seem to be able to do is match on queries like "purple duc", as multi_match doesn't allow for wildcards.
I've looked into match_phrase_prefix(), but as I understand it, it doesn't span multiple fields.
It feels like the solution may lie in implementing a tokeniser, so ultimately my questions are:
1) can someone confirm there's no out-of-the-box function to do what I want to do? It feels like a common enough pattern that there could be something ready to use.
2) can someone suggest any solution? Are tokenizers part of the solution?
I'm more than happy to be pointed in the right direction and do more research myself.
Obviously if someone has working solutions to share that would be awesome.
Thanks in advance
- F

I actually wrote a blog post about this a while back for Qbox, which you can find here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. (Unfortunately some of the links in the post are broken and can't easily be fixed at this point, but hopefully you'll get the idea.)
I'll refer you to the post for the details, but here is some code you can use to test it out quickly. Note that I'm using edge ngrams instead of full ngrams.
Also note in particular the use of the _all field, and the match query operator.
Okay, so here is the mapping:
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edgeNGram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edgeNGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "edgeNGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_all": {
        "enabled": true,
        "index_analyzer": "edgeNGram_analyzer",
        "search_analyzer": "standard"
      },
      "properties": {
        "field1": {
          "type": "string",
          "include_in_all": true
        },
        "field2": {
          "type": "string",
          "include_in_all": true
        }
      }
    }
  }
}
Now add a few documents:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"field1":"purple duck","field2":"brown fox"}
{"index":{"_id":2}}
{"field1":"slow purple duck","field2":"quick brown fox"}
{"index":{"_id":3}}
{"field1":"red turtle","field2":"quick rabbit"}
And this query seems to illustrate what you're wanting:
POST /test_index/_search
{
  "query": {
    "match": {
      "_all": {
        "query": "purp fo slo",
        "operator": "and"
      }
    }
  }
}
returning:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19930676,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 0.19930676,
        "_source": {
          "field1": "slow purple duck",
          "field2": "quick brown fox"
        }
      }
    ]
  }
}
Here is the code I used to test it out:
http://sense.qbox.io/gist/b87e426062f453d946d643c7fa3d5480cd8e26ec
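By the way, if you want to see exactly what the edge ngram analyzer produces, the _analyze API is useful. A minimal sketch (the request-body form below works on ES 5+; on older versions, pass the analyzer and text as query-string parameters instead):
POST /test_index/_analyze
{
  "analyzer": "edgeNGram_analyzer",
  "text": "purple duck"
}
You should see tokens like pu, pur, purp, purpl, purple, du, duc, duck, which is what makes the partial-word matches above work.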

Related

Some Elasticsearch fields searchable via DSL query and some not

I'm using Elasticsearch 6.8.1 and dynamic mapping. I have one document in the index now, and am testing out searching on various fields. I make a POST to http://localhost:9200/documents/_search and send this DSL query:
{
  "query": {
    "bool": { "must": { "term": { "name": "item2" } } }
  }
}
and I get the document I expect:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "documents",
        "_type": "document",
        "_id": "nRMOs5DZg",
        "_score": 0.2876821,
        "_source": {
          "freeform": "DEF",
          "name": "item2",
          "url": "s3://mybucket/key",
          "visible": true
        }
      }
    ]
  }
}
Now, I want to make sure that I can search on the "freeform" field by changing the query to
{
  "query": {
    "bool": { "must": { "term": { "freeform": "DEF" } } }
  }
}
This results in no hits and I can't understand why.
[EDIT]
Here is the dynamic mapping
{
  "documents": {
    "aliases": {},
    "mappings": {
      "document": {
        "properties": {
          "freeform": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "url": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "visible": {
            "type": "boolean"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1564776393764",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "2er2TF-ySEKgk6gd32K6Ig",
        "version": {
          "created": "6080199"
        },
        "provided_name": "documents"
      }
    }
  }
}
It's hard to answer without seeing your mapping, but my guess would be this:
The dynamic mapping tries to guess the data type to assign to your fields; the default for string fields is the "text" data type, which means their value is analyzed and stored as a list of normalized terms, which is useful for free-text search. The string "item2" happens to survive this analysis unchanged, but "DEF" would be analyzed to "def".
Since you're using a term query, the queried term doesn't go through the same analysis process, so you have to query using the analyzed term in order to match the document.
Try searching for "def" instead of "DEF" to test this hypothesis. Also, take a look at the automatically-generated mapping for your index and you'll see which data type each field was mapped to.
If this is indeed the case, you can do one of several things; both options are sketched below.
If you want exact-string matching: change the mapping from text to keyword (you can control dynamic mapping using Dynamic Templates); or, alternatively, search against the keyword sub-field that is created automatically for you, by querying freeform.keyword instead of freeform.
If you want "free-text" matching: use a match query instead of a term query, so that both the input and the document value undergo the same analysis (but make sure you understand how analysis and match queries work).

case insensitive elasticsearch with uppercase or lowercase

I am working with Elasticsearch and I am facing a problem. If anybody can give me a hint, I will be really thankful.
I want to analyze a field "name" or "description" which consists of different entries, e.g. someone wants to search for Sara: whether they enter SARA, SAra or sara, they should be able to get Sara.
Elasticsearch uses an analyzer which makes everything lowercase.
I want to make it case insensitive: regardless of whether the user inputs an uppercase or lowercase name, they should get results.
I am using an ngram filter to search names, plus a lowercase filter, which makes it case insensitive. But I want to make sure that a person gets results even if they enter uppercase or lowercase.
Is there any way to do this in Elasticsearch?
{"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 80
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
},
I have attached an example.js file, which includes a JSON example, and a search.txt file to explain my problem. I hope my problem is clearer now.
Here is the link to the OneDrive folder where I keep both files:
https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc
Is there any specific reason you are using ngram? Elasticsearch uses the same analyzer on the query as well as on the text you index, unless a search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.
I created an index with the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "typehere": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
Indexed two documents
Doc 1
PUT /test_index/test_mapping/1
{
  "name": "Sara Connor",
  "Description": "My real name is Sarah Connor."
}
Doc 2
PUT /test_index/test_mapping/2
{
  "name": "John Connor",
  "Description": "I might save humanity someday."
}
Do a simple search
POST /test_index/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
And I get back only the first document. I tried with "sara" and "Sara" as well, with the same results.
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
The analysis process is executed twice for full-text (analyzed) fields: first when data is indexed, and again when you search. It's worth noting that input JSON is returned unchanged as the output of a search query; analysis is only used to create tokens for the inverted index. The keys to your solution are the following steps:
1) Create two analysers: one with the ngram filter, and a second one without it. You don't need to analyse the input search query using ngram, because it already contains the exact value you want to search for.
2) Define the mappings for your fields correctly. There are two mapping parameters that let you specify analysers: one is used at index time (analyzer) and the second at search time (search_analyzer). If you specify only the analyzer parameter, it is used at both index and search time.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
And your code should look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "index_store_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "index_store_ngram",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
POST /my_index/my_type/1
{
  "name": "Sara_11_01"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SaRa"
    }
  }
}
Edit 1: updated code for a new example provided in the question
This answer is in the context of Elasticsearch 7.14. So, let me reformulate the question another way:
Irrespective of the case used in the match query, you would like to be able to get documents that have been analysed with:
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
Now, coming to the answer part:
It will not be possible to get the match query to return docs that have been analysed with the lowercase filter when the match query contains uppercase letters. The analysis that you have applied in the settings is applied both while indexing and while searching data. Although it is also possible to apply different analysers for indexing and searching, I do not see that helping your case. You would have to convert the match query value to lowercase before making the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra; the match param should be all lowercase, just as it is in your analyser.

elasticsearch - number of searches affects relevance?

I have the following mapping:
POST music
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "song": {
      "properties": {
        "song_field": {
          "type": "string",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        }
      }
    }
  }
}
I've inserted two docs:
POST music/song
{
  "song_field": "Premeditiated murder"
}
POST music/song
{
  "song_field": "Premeditiated"
}
Here is the query:
POST music/song/_search
{
  "size": 10,
  "query": {
    "match": {
      "song_field": {
        "query": "Premeditiated murd",
        "fuzziness": 2
      }
    }
  }
}
Response:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.78730416,
    "hits": [
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUf6XK1ancUpEdFLdz8",
        "_score": 0.78730416,
        "_source": {
          "song_field": "Premeditiated"
        }
      },
      {
        "_index": "music",
        "_type": "song",
        "_id": "AVUfUbocancUpEdFLdUf",
        "_score": 0.668494,
        "_source": {
          "song_field": "Premeditiated murder"
        }
      }
    ]
  }
}
I have two questions:
Why is the score of "Premeditiated" higher? How can I get a reasonable correction + auto-complete?
Does searching the same document over and over again affect the default ES score?
You get this unexpected response because sorting by relevance is broken for very small data sets when you have multiple shards. Relevance is calculated per shard, and the results from each shard are then merged and returned, so your "Premeditiated" simply has a higher relevance within its own shard. This is a common issue and is well described here: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
There are two ways to solve this issue:
1. Set the number_of_shards option to 1 when defining the index mapping.
2. Add search_type=dfs_query_then_fetch to your search request.
After using one of the above options you will get the result you want.
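For example, the second option is just a query-string parameter on the search request; a sketch using the query from the question:
POST music/song/_search?search_type=dfs_query_then_fetch
{
  "size": 10,
  "query": {
    "match": {
      "song_field": {
        "query": "Premeditiated murd",
        "fuzziness": 2
      }
    }
  }
}
With dfs_query_then_fetch, Elasticsearch first collects term statistics from all shards and then scores using those global statistics, so the two documents are ranked consistently.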
Regarding your second question: scoring is calculated every time you search. Even if you search the same document over and over again, the scoring is recalculated and the _score result is always the same. If you want to read more about how scoring works, read the "Controlling relevance" chapter: https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-relevance.html. You can also add the explain property to your query to see how the scoring was calculated: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-explain.html
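For instance, a sketch of a request with explain enabled:
POST music/song/_search
{
  "explain": true,
  "query": {
    "match": {
      "song_field": "Premeditiated murd"
    }
  }
}
Each hit then carries an _explanation tree showing the term frequencies and weights that produced its _score.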
P.S.
It's great that you provided your JSONs, but there is a wrong field inside the query: it should be song_field instead of song_field_1. Additionally, your response doesn't match the data stored inside the type (look at the _source field in the response), but that doesn't matter here :P

Find concatenated words in Elasticsearch

Say I have indexed this data
song: {
  title: "laser game"
}
but the user is searching for
lasergame
How would you go about mapping/indexing/querying for this?
This is kind of a tricky problem.
1) I guess the most effective way might be to use the compound token filter, with a word list made up of words you think users might concatenate.
"settings": {
"analysis": {
"analyzer": {
"concatenate_split": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"myFilter"
]
}
},
"filter": {
"myFilter": {
"type": "dictionary_decompounder",
"word_list": [
"laser",
"game",
"lean",
"on",
"die",
"hard"
]
}
}
}
}
After applying the analyzer, lasergame will be split into laser and game (along with lasergame itself), so this will give you results that contain any of those words.
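You can verify this with the _analyze API; a sketch assuming the settings above are applied to an index called my_index (the request-body form works on ES 5+):
POST /my_index/_analyze
{
  "analyzer": "concatenate_split",
  "text": "lasergame"
}
This should return the tokens lasergame, laser and game.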
2) Another approach could be to concatenate the whole title using a pattern_replace char filter that strips all the spaces.
{
  "index": {
    "analysis": {
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "custom_with_char_filter": {
          "tokenizer": "standard",
          "char_filter": ["my_pattern"]
        }
      }
    }
  }
}
You need to use multi-fields with this approach: with this pattern, laser game will be indexed as lasergame and your query will work.
The problem here is that laser game play will be indexed as lasergameplay, so a search for lasergame won't return anything; you might want to consider using a prefix query or wildcard query for this.
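As a sketch of that workaround, a prefix query against the concatenated field might look like this (title.concatenated is a hypothetical multi-field mapped with the custom_with_char_filter analyzer above):
GET /my_index/_search
{
  "query": {
    "prefix": {
      "title.concatenated": "lasergame"
    }
  }
}
Since prefix queries are not analyzed, this matches any indexed token starting with lasergame, including lasergameplay.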
3) This might not make sense, but you could also use a synonym filter, if you think users often concatenate certain words.
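A sketch of what that could look like (the index, analyzer and filter names are placeholders, and the synonym list would hold the concatenations you expect):
PUT /synonym_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "lasergame => laser game"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonyms"
          ]
        }
      }
    }
  }
}
A query analyzed with synonym_analyzer would then rewrite lasergame into the tokens laser and game.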
Hope this helps!
The easiest solution would be to use nGrams. That would be the base to start working with and could be tweaked to meet your needs. But here you go:
Mappings
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "nGram",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "sample": {
      "properties": {
        "myField": {
          "type": "string",
          "analyzer": "myAnalyzer"
        }
      }
    }
  }
}
Test document
PUT /test/sample/1
{
  "myField": "laser game"
}
Query
GET /test/_search
{
  "query": {
    "match": {
      "myField": "lasergame"
    }
  }
}
Results
{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2161999,
    "hits": [
      {
        "_index": "test",
        "_type": "sample",
        "_id": "1",
        "_score": 0.2161999,
        "_source": {
          "myField": "laser game"
        }
      }
    ]
  }
}
This analyzer will create lots of ngrams in your index, such as la, las, lase, ..., gam, game, etc. Both lasergame and laser game will produce a lot of the same tokens and will find your document as you'd expect.
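Again, _analyze is handy if you want to see the overlap for yourself (request-body form for ES 5+; use query-string parameters on older versions):
POST /test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "lasergame"
}
Running it for both lasergame and laser game shows which ngram tokens the two strings share.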

ElasticSearch: Attempting to get spelling suggestion on proper names

Before I begin, let me just say that I'm no ElasticSearch expert, but I am currently tasked with tweaking some analyzers to get spelling suggestions working better in a couple of different situations. I've seen examples of people who are doing spelling suggestions on proper names, so I know it must be possible, but I've been at this for a couple days now, and I must be missing something, because ElasticSearch doesn't seem to recognize the name I'm looking for. Can you please help me figure this out? Thanks in advance!
Here's the analyzer I'm using for index as well as search:
"full_text": {
"filter": [
"lowercase",
"asciifolding",
],
"type": "custom",
"tokenizer": "keyword"
},
The following query should demonstrate that the field is being tokenized into one long keyword, which is what I want.
{
  "query": {
    "match": {
      "_all": "combine 5"
    }
  },
  "script_fields": {
    "terms": {
      "script": "doc[field].values",
      "params": {
        "field": "my_field"
      }
    }
  }
}
...and it outputs something like this, which shows how the field is being tokenized. Looks good:
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 75,
"max_score": 0.58574116,
"hits": [
{
"_index": "my_index",
"_type": "thing",
"_id": "1",
"_score": 0.58574116,
"fields": {
"terms": [
[
"combine 5"
]
]
}
}
}
}
... but when I do a suggest query, it doesn't suggest the field, even though it's just off by a space.
{
  "query": {
    "match": {
      "_all": "combine 5"
    }
  },
  "suggest": {
    "suggest-0": {
      "term": {
        "field": "_all",
        "size": 5
      },
      "text": "combine5"
    }
  }
}
Which returns a bunch of documents and this suggestion:
"suggest": {
"suggest-0": [
{
"text": "combine5",
"offset": 0,
"length": 8,
"options": [
{
"text": "combined",
"score": 0.875,
"freq": 15
},
{
"text": "combine",
"score": 0.85714287,
"freq": 17
}
]
}
]
}
Note that if I change the spelling suggestion to work just on the field that contains the text, it does suggest it, but not when I'm using _all. Is there a way to get the words in a specific field to be suggested when suggesting against _all?
I'm not sure this qualifies as exactly the answer I was looking for, but I ended up solving this by adding a field to the document containing the keyword value I was looking for ("combine5"), so now it is registered as a word, and if I suggest on that field, or on _all, the word is suggested. It's also found in queries against _all.
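A sketch of that workaround (my_field_joined is a hypothetical name for the extra field holding the space-stripped value):
PUT /my_index/thing/1
{
  "my_field": "combine 5",
  "my_field_joined": "combine5"
}
Because fields are included in _all by default, combine5 now exists as a real term in the index, so it can be matched by queries and offered by the term suggester.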
