Synonym search in ElasticSearch

I want to retrieve data from the index using synonyms. When I search for title A, I also want to retrieve documents whose title contains B. For that I set up the following mapping:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "A=>A,B"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "keyword",
            "filter": ["synonym_filter"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}
I then added 3 documents to my index:
{
  "title": "C"
}
{
  "title": "B"
}
{
  "title": "A"
}
I then used the Analyze API to check whether it works (everything looks OK):
curl -X GET "localhost:9200/my_custom_index_title/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "synonym_analyzer",
  "text": "A"
}
'
{
  "tokens": [
    {
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "B",
      "start_offset": 0,
      "end_offset": 1,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
curl -X GET "localhost:9200/my_custom_index_title/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "synonym_analyzer",
  "text": "B"
}
'
{
  "tokens": [
    {
      "token": "B",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    }
  ]
}
When I search for title A, the results are correct:
{
  "query": {
    "match": {
      "title": {
        "query": "A"
      }
    }
  }
}
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.6951314,
    "hits": [
      {
        "_index": "my_custom_index_title",
        "_id": "i5bb_4IBqFAXxSLAgrDj",
        "_score": 0.6951314,
        "_source": {
          "title": "A"
        }
      },
      {
        "_index": "my_custom_index_title",
        "_id": "jJbb_4IBqFAXxSLAlLBj",
        "_score": 0.52354836,
        "_source": {
          "title": "B"
        }
      }
    ]
  }
}
But when I search for B, the results are not correct: I want only documents that contain B, not A.
{
  "query": {
    "match": {
      "title": {
        "query": "B"
      }
    }
  }
}
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.52354836,
    "hits": [
      {
        "_index": "my_custom_index_title",
        "_id": "i5bb_4IBqFAXxSLAgrDj",
        "_score": 0.52354836,
        "_source": {
          "title": "A"
        }
      },
      {
        "_index": "my_custom_index_title",
        "_id": "jJbb_4IBqFAXxSLAlLBj",
        "_score": 0.52354836,
        "_source": {
          "title": "B"
        }
      }
    ]
  }
}
For example, when I search for computer I want to get laptop, computer, and mac. But when I search for mac I only want results for mac (not laptop and computer).
I do not understand why the search for B does not return only one result.

I understand. In this case, because you applied synonym_analyzer as the field's (index-time) analyzer, you indexed the synonyms as well.
To solve it, apply synonyms only at search time by adding the "search_analyzer" parameter. Note that I added the lowercase filter to synonym_analyzer because the standard analyzer applies lowercasing by default.
To get the expected tokens when searching for term B, do this:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "expand": "false",
            "synonyms": [
              "A=>A,B"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "synonym_filter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}
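A quick way to sanity-check this setup (a sketch, assuming the same index name as in the question) is to run both terms through the search analyzer with the Analyze API:
curl -X GET "localhost:9200/my_custom_index_title/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "synonym_analyzer",
  "text": "A"
}
'
With these settings, "A" should produce the tokens a and b (lowercased), while "B" should produce only b. Since titles are now indexed with the standard analyzer, a match query for B should hit only the document whose title is B, which is the behaviour you wanted.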

Related

How to apply custom analyser?

I just discovered an issue with our Elasticsearch: it is not returning anything for '&' in a field. I did some googling, and I think I need a custom analyser. I've never worked with ES before, so my assumption is that I'm missing something basic here.
This is what I have so far, and it is not working as expected.
PUT custom_analyser
{
  "settings": {
    "analysis": {
      "analyzer": {
        "suggest_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "my_synonym_filter" ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "&, and",
            "foo, bar"
          ]
        }
      }
    }
  }
}
And I am trying to use it like this:
GET custom_analyser/_search
{
  "aggs": {
    "section": {
      "terms": {
        "field": "section",
        "size": 10,
        "shard_size": 500,
        "include": "jill & jerry" // Not returning anything back for this field using the default analyser
      }
    }
  }
}
Output:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "section": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}
Mappings:
"_doc": {
  "dynamic": "false",
  "date_detection": false,
  "properties": {
    "section": {
      "type": "keyword"
    }
  }
}
GET custom_analyser:
{
  "custom_analyser": {
    "aliases": {},
    "mappings": {},
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "custom_analyser",
        "creation_date": "1565971369814",
        "analysis": {
          "filter": {
            "my_synonym_filter": {
              "type": "synonym",
              "synonyms": [
                "&, and",
                "foo, bar"
              ]
            }
          },
          "analyzer": {
            "suggest_analyzer": {
              "filter": [
                "lowercase",
                "my_synonym_filter"
              ],
              "type": "custom",
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "oVMOU5wPQ--vKhE3dDFG2Q",
        "version": {
          "created": "6030199"
        }
      }
    }
  }
}
I think there is a slight confusion here: An analyzer won't help you, because you are (correctly) using a keyword field for the aggregation, but those are not analyzed. You could only use a normalizer on those fields.
For your specific problem: the include (and exclude) parameters are regular expressions, so you'll need to escape the & to make this work as expected.
Full example
Mapping and sample data:
PUT test
{
  "mappings": {
    "properties": {
      "section": {
        "type": "keyword"
      }
    }
  }
}
PUT test/_doc/1
{
  "section": "jill & jerry"
}
PUT test/_doc/2
{
  "section": "jill jerry"
}
PUT test/_doc/3
{
  "section": "jill"
}
PUT test/_doc/4
{
  "section": "jill & jerry"
}
Query — you need a double backslash for the escape to work here (and I'm also excluding the actual documents with "size": 0 to keep the response shorter):
GET test/_search
{
  "size": 0,
  "aggs": {
    "section": {
      "terms": {
        "field": "section",
        "include": "jill \\& jerry"
      }
    }
  }
}
Response:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "section": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "jill & jerry",
          "doc_count": 2
        }
      ]
    }
  }
}
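As an aside, if you wanted '&' and 'and' folded together inside the keyword terms themselves, the normalizer route mentioned above would look roughly like this (a sketch using 7.x-style typeless mappings and a hypothetical index name; note that normalizers only accept per-term components such as the lowercase filter and mapping char filters, not synonym token filters):
PUT test_normalized
{
  "settings": {
    "analysis": {
      "char_filter": {
        "amp_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "normalizer": {
        "section_normalizer": {
          "type": "custom",
          "char_filter": ["amp_to_and"],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "section": {
        "type": "keyword",
        "normalizer": "section_normalizer"
      }
    }
  }
}
With that in place, both "jill & jerry" and "jill and jerry" would be indexed as the single term "jill and jerry", and the include regex would not need any escaping.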

Elasticsearch: Restricting the search result in an array

My index metadata:
{
  "never": {
    "aliases": {},
    "mappings": {
      "userDetails": {
        "properties": {
          "Residence_address": {
            "type": "nested",
            "include_in_parent": true,
            "properties": {
              "Address_type": {
                "type": "string",
                "analyzer": "standard"
              },
              "Pincode": {
                "type": "string",
                "analyzer": "standard"
              },
              "address": {
                "type": "string",
                "analyzer": "standard"
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1468850158519",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "version": {
          "created": "1060099"
        },
        "uuid": "v2njuC2-QwSau4DiwzfQ-g"
      }
    },
    "warmers": {}
  }
}
My settings:
POST never
{
  "settings": {
    "number_of_shards": 5,
    "analysis": {
      "analyzer": {
        "standard": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  }
}
My data:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.375,
    "hits": [
      {
        "_index": "never",
        "_type": "userDetails",
        "_id": "1",
        "_score": 0.375,
        "_source": {
          "Residence_address": [
            {
              "address": "Omega Residency",
              "Address_type": "Owned",
              "Pincode": "500004"
            },
            {
              "address": "Collage of Engineering",
              "Address_type": "Rented",
              "Pincode": "411005"
            }
          ]
        }
      }
    ]
  }
}
My query:
POST /never/_search?pretty
{
  "query": {
    "match": {
      "Residence_address.address": "Omega"
    }
  }
}
My result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.375,
    "hits": [
      {
        "_index": "never",
        "_type": "userDetails",
        "_id": "1",
        "_score": 0.375,
        "_source": {
          "Residence_address": [
            {
              "address": "Omega Residency",
              "Address_type": "Owned",
              "Pincode": "500004"
            },
            {
              "address": "Collage of Engineering",
              "Address_type": "Rented",
              "Pincode": "411005"
            }
          ]
        }
      }
    ]
  }
}
Is there any way to restrict my result to only the object containing address = Omega Residency, and NOT the other object with address = Collage of Engineering?
You can only do it with a nested query and inner_hits. I see that you have include_in_parent: true but are not using nested queries. If you only want to get the matched nested objects, you need to use inner_hits with a nested query:
GET /never/_search?pretty
{
  "_source": false,
  "query": {
    "nested": {
      "path": "Residence_address",
      "query": {
        "match": {
          "Residence_address.address": "Omega Residency"
        }
      },
      "inner_hits": {}
    }
  }
}
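The matched objects then come back under inner_hits instead of _source. An abbreviated sketch of the response shape (scores and other metadata trimmed):
{
  "hits": {
    "hits": [
      {
        "_index": "never",
        "_type": "userDetails",
        "_id": "1",
        "inner_hits": {
          "Residence_address": {
            "hits": {
              "hits": [
                {
                  "_source": {
                    "address": "Omega Residency",
                    "Address_type": "Owned",
                    "Pincode": "500004"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}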

Elasticsearch generate suggestion fields

I've been reading up on suggesters in Elasticsearch in blog posts like this one: https://www.elastic.co/blog/you-complete-me
But there you have to put the name_suggest data in yourself. Isn't there a way to automatically add the data to name_suggest when you map the object?
So, update this mapping:
curl -X PUT localhost:9200/hotels -d '
{
  "mappings": {
    "hotel": {
      "properties": {
        "name": { "type": "string" },
        "city": { "type": "string" },
        "name_suggest": {
          "type": "completion"
        }
      }
    }
  }
}'
and these PUTs:
curl -X PUT localhost:9200/hotels/hotel/1 -d '
{
  "name": "Mercure Hotel Munich",
  "city": "Munich",
  "name_suggest": "Mercure Hotel Munich"
}'
curl -X PUT localhost:9200/hotels/hotel/2 -d '
{
  "name": "Hotel Monaco",
  "city": "Munich",
  "name_suggest": "Hotel Monaco"
}'
curl -X PUT localhost:9200/hotels/hotel/3 -d '
{
  "name": "Courtyard by Marriot Munich City",
  "city": "Munich",
  "name_suggest": "Courtyard by Marriot Munich City"
}'
so that we can drop the name_suggest field.
The ultimate goal is that when you start typing "Ho", the first result would be "Hotel".
You can do it with ngrams if you want partial matches within words, or edge ngrams if you just want to match from the beginning of words.
Here's an example. I set up an index like this:
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "edge_ngram_analyzer",
          "search_analyzer": "standard"
        },
        "city": {
          "type": "string"
        }
      }
    }
  }
}
Then added your docs:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"name":"Mercure Hotel Munich","city":"Munich"}
{"index":{"_id":2}}
{"name":"Hotel Monaco","city":"Munich"}
{"index":{"_id":3}}
{"name":"Courtyard by Marriot Munich City","city":"Munich"}
Now I can query for documents with "hot" in the name like this:
POST /test_index/_search
{
  "query": {
    "match": {
      "name": "hot"
    }
  }
}
and I get back the correct docs:
{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.625,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 0.625,
        "_source": {
          "name": "Hotel Monaco",
          "city": "Munich"
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.5,
        "_source": {
          "name": "Mercure Hotel Munich",
          "city": "Munich"
        }
      }
    ]
  }
}
There are various ways this can be tweaked or generalized. For example, you can apply the ngram analyzer to the _all field if you want to match on more than one field.
Here is the code I used to test it:
http://sense.qbox.io/gist/3583de02c4f7d33e07ba4c2def9badf90692a290
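For reference, the _all variant mentioned above would look roughly like this in the same 1.x-era syntax (a sketch, untested; the _all field was deprecated and removed in later Elasticsearch versions):
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_all": {
        "index_analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      },
      "properties": {
        "name": { "type": "string" },
        "city": { "type": "string" }
      }
    }
  }
}
A match query on _all (for example {"match": {"_all": "hot"}}) would then find partial-word matches in either field.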

elasticsearch not attaching a default _timestamp

I have tried a lot of iterations of this, but it simply won't work. I am looking to have _timestamp (or @timestamp) attached to each document automatically, but I can't seem to get this to work, although the documents are being ingested properly.
curl -XPOST 'http://X.X.X.X:9200/associations' -d '{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_default_": {
      "_timestamp": {
        "enabled": true,
        "store": true,
        "path": "post_date"
      }
    }
  }
}'
I have also tried setting this directly within my index:
curl -XPUT 'http://X.X.X.X:9200/associations' -d '{
  "mappings": {
    "_timestamp": {
      "enabled": true,
      "store": "yes",
      "path": "post_date",
      "type": "date",
      "format": "yyyy-MM-dd HH:mm:ss"
    }
  }
}'
The first code block you have seems to work for me. If I use it to define an index:
DELETE /test_index
PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_default_": {
      "_timestamp": {
        "enabled": true,
        "store": true,
        "path": "post_date"
      }
    }
  }
}
Then add a couple of docs to a new type:
PUT /test_index/doc/1
{
  "post_date": "2015-1-25"
}
PUT /test_index/doc/2
{
  "post_date": "2015-1-15"
}
I can see the timestamp in the new mapping now:
GET /test_index/_mapping
...
{
  "test_index": {
    "mappings": {
      "_default_": {
        "_timestamp": {
          "enabled": true,
          "store": true,
          "path": "post_date"
        },
        "properties": {}
      },
      "doc": {
        "_timestamp": {
          "enabled": true,
          "store": true,
          "path": "post_date"
        },
        "properties": {
          "post_date": {
            "type": "date",
            "format": "dateOptionalTime"
          }
        }
      }
    }
  }
}
and when I search against the type and ask for "_timestamp" in my fields, I get the timestamps back in the results:
POST /test_index/doc/_search
{
  "fields": [
    "_timestamp",
    "post_date"
  ]
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "fields": {
          "post_date": [
            "2015-1-25"
          ],
          "_timestamp": 1422144000000
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "fields": {
          "post_date": [
            "2015-1-15"
          ],
          "_timestamp": 1421280000000
        }
      }
    ]
  }
}
Here is the code I used:
http://sense.qbox.io/gist/1ab1ecb73d3e87cffe0052ce1706e7985d197fad
I'm running Elasticsearch version 1.3.4, by the way.
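Once the timestamp is populated, you can also filter on it directly, for instance with a range filter (a sketch against the same test index; on this 1.x setup a filtered query wraps the filter):
POST /test_index/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "_timestamp": {
            "gte": "2015-01-20"
          }
        }
      }
    }
  }
}
which should return only doc 1 here (its post_date, 2015-1-25, is after the cutoff).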

Not able to search for string within a string in elasticsearch index

I'm trying to set up the mapping for my Elasticsearch instance with full name matching and partial name matching:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '{
  "mappings": {
    "venue": {
      "properties": {
        "location": {
          "type": "geo_point"
        },
        "name": {
          "fields": {
            "name": {
              "type": "string",
              "analyzer": "full_name"
            },
            "partial": {
              "search_analyzer": "full_name",
              "index_analyzer": "partial_name",
              "type": "string"
            }
          },
          "type": "multi_field"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "swedish_snow": {
          "type": "snowball",
          "language": "Swedish"
        },
        "name_synonyms": {
          "type": "synonym",
          "synonyms_path": "name_synonyms.txt"
        },
        "name_ngrams": {
          "side": "front",
          "min_gram": 2,
          "max_gram": 50,
          "type": "edgeNGram"
        }
      },
      "analyzer": {
        "full_name": {
          "filter": [
            "standard",
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "partial_name": {
          "filter": [
            "swedish_snow",
            "lowercase",
            "name_synonyms",
            "name_ngrams",
            "standard"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}'
I fill it with some data:
curl -XPOST 'http://127.0.0.1:9200/_bulk?pretty=1' -d '
{"index" : {"_index" : "test", "_type" : "venue"}}
{"location" : [59.3366, 18.0315], "name" : "johnssons"}
{"index" : {"_index" : "test", "_type" : "venue"}}
{"location" : [59.3366, 18.0315], "name" : "johnsson"}
{"index" : {"_index" : "test", "_type" : "venue"}}
{"location" : [59.3366, 18.0315], "name" : "jöhnsson"}
'
I perform some searches to test it.
Full name:
curl -XGET 'http://127.0.0.1:9200/test/venue/_search?pretty=1' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "text": {
            "name": {
              "boost": 1,
              "query": "johnsson"
            }
          }
        },
        {
          "text": {
            "name.partial": "johnsson"
          }
        }
      ]
    }
  }
}'
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.29834434,
    "hits": [
      {
        "_index": "test",
        "_type": "venue",
        "_id": "CAO-dDr2TFOuCM4pFfNDSw",
        "_score": 0.29834434,
        "_source": {
          "location": [
            59.3366,
            18.0315
          ],
          "name": "johnsson"
        }
      },
      {
        "_index": "test",
        "_type": "venue",
        "_id": "UQWGn8L9Squ5RYDMd4jqKA",
        "_score": 0.14663845,
        "_source": {
          "location": [
            59.3366,
            18.0315
          ],
          "name": "johnssons"
        }
      }
    ]
  }
}
Partial name:
curl -XGET 'http://127.0.0.1:9200/test/venue/_search?pretty=1' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "text": {
            "name": {
              "boost": 1,
              "query": "johns"
            }
          }
        },
        {
          "text": {
            "name.partial": "johns"
          }
        }
      ]
    }
  }
}'
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.14663845,
    "hits": [
      {
        "_index": "test",
        "_type": "venue",
        "_id": "UQWGn8L9Squ5RYDMd4jqKA",
        "_score": 0.14663845,
        "_source": {
          "location": [
            59.3366,
            18.0315
          ],
          "name": "johnssons"
        }
      },
      {
        "_index": "test",
        "_type": "venue",
        "_id": "CAO-dDr2TFOuCM4pFfNDSw",
        "_score": 0.016878016,
        "_source": {
          "location": [
            59.3366,
            18.0315
          ],
          "name": "johnsson"
        }
      }
    ]
  }
}
Name within name:
curl -XGET 'http://127.0.0.1:9200/test/venue/_search?pretty=1' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "text": {
            "name": {
              "boost": 1,
              "query": "johnssons"
            }
          }
        },
        {
          "text": {
            "name.partial": "johnssons"
          }
        }
      ]
    }
  }
}'
Result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.39103588,
    "hits": [
      {
        "_index": "test",
        "_type": "venue",
        "_id": "UQWGn8L9Squ5RYDMd4jqKA",
        "_score": 0.39103588,
        "_source": {
          "location": [
            59.3366,
            18.0315
          ],
          "name": "johnssons"
        }
      }
    ]
  }
}
As you can see, I'm only getting one venue back, which is johnssons. Shouldn't I get both johnssons and johnsson back? What am I doing wrong in my settings?
You are using full_name as the search analyzer for the name.partial field. As a result, your query is translated into a query for the term johnssons, which doesn't match anything in that field.
You can use the Analyze API to see how your records are indexed. For example, this command
curl -XGET 'http://127.0.0.1:9200/test/_analyze?analyzer=partial_name&pretty=1' -d 'johnssons'
will show you that during indexing the string "johnssons" is getting translated into the following terms: "jo", "joh", "john", "johns", "johnss", "johnsso", "johnsson". While this command
curl -XGET 'http://127.0.0.1:9200/test/_analyze?analyzer=full_name&pretty=1' -d 'johnssons'
will show you that during searching the string "johnssons" is getting translated into term "johnssons". As you can see there is no match between your search term and your data here.
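Not part of the original setup, but one plausible fix given the above: add a search-side analyzer that applies the same lowercasing and Swedish stemming as partial_name (without the ngrams), and use it as the search_analyzer for name.partial, so that "johnssons" is reduced to "johnsson" before lookup. A sketch of the additional analyzer (the name partial_name_search is hypothetical):
"partial_name_search": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [
    "swedish_snow",
    "lowercase"
  ]
}
Then set "search_analyzer": "partial_name_search" on the name.partial field. Assuming the Swedish snowball stemmer reduces "johnssons" to "johnsson" (which you can verify with the same Analyze API calls), the last query should then return both johnssons and johnsson.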
