Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field - elasticsearch

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
Returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, both the cat and dog have the same relevance score. I want to be able to take into consideration the prefix of that word (and maybe a few other words inside that field), and run cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get the cat/ID 1, but I did not.
I found that using cross_fields achieves multi-word phrases, but not multi-incomplete phrases. And phrase_prefix achieves incomplete phrases, but not multiple incomplete phrases...
Sifting through the documentation really isn't helping me discover how to combine these two.

Yeah, I had to apply an analyzer...
The analyzer is applied to the fields when creating the index before you add any data. I couldn't find an easier way to do this after you add the data.
The solution I have found is exploding all of the phrases into each individual prefixes so cross_fields can do it's magic. You can learn more about the use of edge-ngram here.
So instead of cross_field just searching the cats phrase, it's now going to search: c, ca, cat, and cats and every phrase after... So the text1 field will look like this to elastic: c ca cat cats m me meo meow.
~~~
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
I changed the text1 to match the field I was applying this to.
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two when you search cat toy, where as before the score was the same. But now, the cat hit has a higher score, as it should. This is achieved by taking into consideration every prefix (max 20 characters in this case/phrase) for each phrase and then seeing how relevant the data is with cross_fields.

Related

Elastic/Kibana: support for plurals in query searches

I'll simplify my issue. Let's say I have an index with 3 documents I've created with Kibana:
PUT /test/vendors/1
{
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
PUT /test/vendors/2
{
"type": "lawyer",
"name": "John",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New Jersey"
}
]
}
PUT /test/vendors/3
{
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
Now I'm running a search:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in chicago",
"fields": [ "type", "place" ]
}
}
}
And I'm getting a good response:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.2876821,
"hits": [
{
"_index": "test",
"_type": "vendors",
"_id": "1",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "3",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
}
]
}
}
Now things begin to get problematic...
Changed the doctor to doctors
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctors in chicago",
"fields": [ "type", "place" ]
}
}
}
Zero results as doctors not found. Elastic doesn't know about plural vs. singular.
Change the query to New York
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in new york",
"fields": [ "type", "place" ]
}
}
}
But the response result set gives me the doctor in Chicago in addition to the doctor in New York. the fields are matched with OR...
Another interesting question is, what happens if someone usesdocs or physicians or health professionals but means doctor. IS there a provision where I can teach Elasticsearch to funnel those into "doctor"?
Is there any pattern, way around these using elasticsearch alone? where I won't have to analyze the string for meaning in my own application which will then construct a complex exact elasticsearch query to match it?
I would appreciate any pointer in the right direction
I'm assuming that the fields type and place are of Text type with Standard Analyzers.
To manage singular/plurals, what you are looking for is called Snowball Token Filter which you would need to add to the mapping.
Another requirement that you've mentioned saying for e.g. physicians should also be equated as doctor, you need to make use of Synonym Token Filter
Below is how your mapping should be. Note that I've just added analyzer to type. You can make similar changes to the mapping to the other fields.
Mapping
PUT <your_index_name>
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_snow",
"my_synonym"
]
}
},
"filter":{
"my_snow":{
"type":"snowball",
"language":"English"
},
"my_synonym":{
"type":"synonym",
"synonyms":[
"docs, physicians, health professionals, doctor"
]
}
}
}
},
"mappings":{
"mydocs":{
"properties":{
"type":{
"type":"text",
"analyzer":"my_analyzer"
},
"place":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
Notice how I've added the synonyms in the mapping itself, instead of that I'd suggest you to have the synonyms added in text file like below
{
"type":"synonym",
"synonyms_path" : "analysis/synonym.txt"
}
According to the link I've shared, it mentions that the above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location).
Hope it helps!

Elasticsearch advanced autocomplete

I want to autocomplete user input with Elasticsearch. Now There are tons of tutorials out there how to do so, but none go into the really detailed stuff.
The last issue I'm having with my query is that it should score Results that are not real "autocompletions" lower. Example:
IS:
I type: "Bed"
I find: "Bed", "Bigbed", "Fancy Bed", "Bed Frame"
WANT:
I type: "Bed"
I find: "Bed", "Bed Frame", [other "Bed XXX" results], "Fancy Bed", "Bigbed"
So i want Elasticsearch to first complete "to the right" if that makes sense. And then use results that have words in front of it.
I've tried the completion suggester I doesn't do other stuff I want but also has the same issue.
In German there are lots of examples of words like Bigbed (which isn't a real word in English, I know. But I don't want those words as high results. But since they match closer than Bed Frame (because that is 2 Tokens) they show up so high.
This is my query currently:
POST autocompletion/_search?pretty
{
"query": {
"function_score": {
"query": {
"match": {
"keyword": {
"query": "Bed",
"fuzziness": 1,
"minimum_should_match": "100%"
}
}
},
"field_value_factor": {
"field": "bias",
"factor": 1
}
}
}
}
If you use elasticsearch completion suggester, as explained at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html, when querying like:
{
"suggest": {
"song-suggest" : {
"prefix" : "bed",
"completion" : {
"field" : "suggest"
}
}
}
}
You will get:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0.0,
"hits": []
},
"suggest": {
"song-suggest": [
{
"text": "bed",
"offset": 0,
"length": 3,
"options": [
{
"text": "Bed",
"_index": "autocomplete",
"_type": "_doc",
"_id": "1",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed"
],
"weight": 34
}
}
},
{
"text": "Bed Frame",
"_index": "autocomplete",
"_type": "_doc",
"_id": "3",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed Frame"
],
"weight": 34
}
}
}
]
}
]
}
}
If you want to use the search API instead, you can use 2 queries:
prefix query "bed ****"
with a term starting by "bed"
Here the mapping:
{
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
Here the search query:
{
"query" : {
"bool" : {
"must" : [
{
"match" : {
"suggest" : "Bed"
}
}
],
"should" : [
{
"prefix" : {
"suggest.keyword" : "Bed"
}
}
]
}
}
}
The should clause will boost document starting by "Bed". Et voilĂ !

How Elasticsearch relevance score gets calculated?

I am using multi_match with phrase_prefix for full text search in Elasticsearch 5.5. ES query looks like
{
query: {
bool: {
must: {
multi_match: {
query: "butt",
type: "phrase_prefix",
fields: ["item.name", "item.keywords"],
max_expansions: 10
}
}
}
}
}
I am getting following response
[
{
"_index": "items_index",
"_type": "item",
"_id": "2",
"_score": 0.61426216,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk, flavoured",
"name": "Flavoured Butter"
}
}
},
{
"_index": "items_index",
"_type": "item",
"_id": "1",
"_score": 0.39063013,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk",
"name": "Butter Milk"
}
}
}
]
Mappings is as follows(I am using default mappings)
{
"items_index" : {
"mappings" : {
"parent_doc": {
...
"properties": {
"item" : {
"properties" : {
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
How item with "name": "Flavoured Butter" getting higher score of 0.61426216 against the document with "name": "Butter Milk" and score 0.39063013?
I tried applying boost to "item.name" and removing "item.keywords" form search fields getting same results.
How scores in Elasticsearch works? Are above results correct in terms of relavance?
The scoring for phrase_prefix is similar to that of best_fields, meaning that score of a document is the score obtained from the best_field, which here is item.keywords.
So, item.name isn't adding to score
Refer: multi-match-types
You can use 2 multi_match queries to combine the score from keywords and name.
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.keywords"
],
"max_expansions": 10
}
},{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.name"
],
"max_expansions": 10
}
}]
}
}
}

Find concatenate words in Elasticsearch

Say I have indexed this data
song:{
title:"laser game"
}
but the user is searching for
lasergame
How would you go about mapping/indexing/querying for this?
This is kind of tricky problem.
1) I guess the most effective way might be to use compound token filter, with word list made up of some words you think user might concatenate.
"settings": {
"analysis": {
"analyzer": {
"concatenate_split": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"myFilter"
]
}
},
"filter": {
"myFilter": {
"type": "dictionary_decompounder",
"word_list": [
"laser",
"game",
"lean",
"on",
"die",
"hard"
]
}
}
}
}
After applying analyzer, lasergame will split into laser and game along with lasergame, now this will give you results that has any of those words.
2) Another approach could be concatenating whole title with pattern replace char filter replacing all the spaces.
{
"index" : {
"analysis" : {
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\\s+",
"replacement":""
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_pattern"]
}
}
}
}
}
You need to use multi fields with this approach, with this pattern, laser game will be indexed as lasergame and your query will work.
Here the problem is laser game play will be indexed as lasegameplay and search for lasergame wont return anything so you might want to consider using prefix query or wildcard query for this.
3) This might not make sense but you could also use synonym filter, if you think users are often concatenating some words.
Hope this helps!
Easiest solution would be using nGrams. That would be the base to start working with and could be tweaked to meet your needs. But here you go:
Mappings
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "nGram",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"sample": {
"properties": {
"myField": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
Test document
PUT /test/sample/1
{
"myField": "laser game"
}
Query
GET /test/_search
{
"query": {
"match": {
"myField": "lasergame"
}
}
}
Results
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2161999,
"hits": [
{
"_index": "test",
"_type": "sample",
"_id": "1",
"_score": 0.2161999,
"_source": {
"myField": "laser game"
}
}
]
}
}
This analyzer will create lots of ngrams in your index, such as la, las, lase...gam, game and etc. Both lasergame and laser game will produce a lot of similar tokens and will find your document as you'd expect.

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents have same scores. I want the one with casa appearing earlier (i.e. document 1 here) and to rank first in my query output.
I am using an edgeNGram Analyzer. Also I am using aggregations so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
},
}
}
I'm assuming the values you gave is in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and, also, the number of terms from the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that is using term_vector set to with_positions, and edgeNGram and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}

Resources