Elasticsearch Automatic Synonyms

I am looking at Elasticsearch to handle search queries made by users on my website.
Say that I have a document person with a field vehicles_owned, which is a list of strings. For example:
{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}
I would like to query which people own a certain vehicle. I understand it's possible to configure ES so that boat is treated as a synonym of ship, so that if I query for boat I get back the user james, who owns a ship.
What I don't understand is whether this is done automatically, or if I have to import lists of synonyms.

The idea is to create a custom analyzer for the vehicles_owned field which leverages the synonym token filter.
So you first need to define your index like this:
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt" <-- your synonym file
}
}
}
}
},
"mappings": {
"syn": {
"properties": {
"name": {
"type": "string"
},
"surname": {
"type": "string"
},
"vehicles_owned": {
"type": "string",
"index_analyzer": "synonym" <-- use the synonym analyzer here
}
}
}
}
}'
Then you can add all the synonyms you want to handle in the $ES_HOME/config/synonyms.txt file using the supported formats, for instance:
boat, ship
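The synonyms file uses the Solr synonym format, so as far as I know you can also mix comma-separated groups of equivalent terms with explicit one-way mappings, for example:
# comma-separated terms are treated as equivalent
boat, ship, vessel
# explicit mappings replace the left-hand side with the right-hand side
lorry, truck => truck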
Next, you can index your documents:
curl -XPUT localhost:9200/your_index/your_type/1 -d '{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}'
And finally searching for either ship or boat will get you the above document we just indexed:
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:boat
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:ship
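If you want to verify the expansion itself, the _analyze API can be used to run a term through the synonym analyzer (a quick sanity check; the query-string form below works on the 1.x/2.x API):
curl -XGET 'localhost:9200/your_index/_analyze?analyzer=synonym&text=boat'
You should see both boat and ship emitted as tokens at the same position, which is why both queries above return the same document.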

Related

Elasticsearch not using synonyms from synonym file

I am new to elasticsearch so before downvoting or marking as duplicate, please read the question first.
I am testing synonyms in Elasticsearch (v 2.4.6), which I have installed on Ubuntu 16.04. I am providing synonyms through a file named synonym.txt, which I have placed in the config directory. I have created an index synonym_test as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
The index contains two fields: id and some_text. I configured the field some_text with the custom analyzer as follows:
curl -XPUT localhost:9200/synonym_test/rulers/_mapping -d '{
"properties": {
"id": {
"type": "double"
},
"some_text": {
"type": "string",
"search_analyzer": "my_synonyms"
}
}
}'
Then I inserted some data:
curl -XPUT localhost:9200/synonym_test/external/5 -d '{
"id" : "5",
"some_text":"apple is a fruit"
}'
curl -XPUT localhost:9200/synonym_test/external/7 -d '{
"id" : "7",
"some_text":"english is spoken in england"
}'
curl -XPUT localhost:9200/synonym_test/external/8 -d '{
"id" : "8",
"some_text":"Scotland Yard is a popular game."
}'
curl -XPUT localhost:9200/synonym_test/external/9 -d '{
"id" : "9",
"some_text":"bananas contain potassium"
}'
The synonym.txt file contains the following:
"britain,england,scotland"
"fruit,bananas"
After doing all this, when I run the query for the term fruit (which should also return the text containing bananas, as they are synonyms in the file), I get the text containing fruit only.
{
"took":117,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":0.8465736,
"hits":[
{
"_index":"synonym_test",
"_type":"external",
"_id":"5",
"_score":0.8465736,
"_source":{
"id":"5",
"some_text":"apple is a fruit"
}
}
]
}
}
I have also tried the following links, but none seem to have helped me:
Synonym analyzer not working, Elasticsearch synonym analyzer not working, How to apply synonyms at query time instead of index time in Elasticsearch, how to configure the synonyms_path in elasticsearch, and many other links.
So, can anyone please tell me if I am doing anything wrong? Is there anything wrong with the settings or synonym file? I want the synonyms to work (query time) so that when I search for a term, I get all documents related to that term.
Please refer to the following documentation on how custom analyzers should be configured: Custom Analyzer.
If we follow the guidance from the above documentation, the settings will be as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
"settings": {
"analysis": {
"analyzer": {
"my_synonyms": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase","my_synonym_filter"]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"ignore_case": true,
"synonyms_path" : "synonym.txt"
}
}
}
}
}'
This currently works on my Elasticsearch instance.
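One more thing I would double-check (this part is my assumption, since your mapping only sets search_analyzer): make sure the field is actually wired to the analyzer, either for search only or simply for both index and search time, and that the mapping is put on the type you actually index into (external in your example). Something like this, done before indexing, since the analyzer of an existing field can't be changed in place:
curl -XPUT localhost:9200/synonym_test/external/_mapping -d '{
  "properties": {
    "id": {
      "type": "double"
    },
    "some_text": {
      "type": "string",
      "analyzer": "my_synonyms"
    }
  }
}'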

Can't query an edge_ngram field in _all

So I'm setting up an index and I'd like to have a single search that would do a partial-word edge_ngram search for one field and a more normal search of the rest of the fields. From what I understand this should be easy to do by just matching on _all. However I just can't seem to make it work.
I have been able to get the desired results from a bool query that searches _all and the specific ngram field separately, but that seems hacky and I'm guessing there's just something simple that I'm missing.
Here is just a minimal example to show what I'm doing and how it's not working for me.
Here is the index setup:
curl -XPUT "http://localhost:9200/test_index?pretty=true" -d'
{
"settings": {
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"edge_ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"edge_ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"text_field": {
"type": "string",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard"
}
}
}
}
}'
And add a simple document:
curl -XPUT "http://localhost:9200/test_index/doc/1?pretty=true" -d'
{
"text_field": "Hello, World!"
}'
_all partial search doesn't work. It returns an empty result.
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"_all": "hell"
}
}
}'
_all whole word search works though
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"_all": "hello"
}
}
}'
And a partial search on the specific field works
curl -XPOST "http://localhost:9200/test_index/_search?pretty=true" -d'
{
"query": {
"match": {
"text_field": "hell"
}
}
}'
The term vector looks fine too
curl -XGET "http://localhost:9200/test_index/doc/1/_termvector?fields=text_field&pretty=true"
I really can't figure out what I'm doing wrong here. Any help would be appreciated.
Here are some details about my environment.
Elasticsearch version: Version: 2.3.3, Build: 218bdf1/2016-05-17T15:40:04Z, JVM: 1.8.0_92
Linux OS: Arch Linux
Kernel version: 4.4.3-1-custom
The _all field combines the original values of all fields into one string, not the terms produced for each field. So in your case, it doesn't contain the terms produced by the edge_ngram_analyzer, just the text from the text_field field. It's just like any other text field: you can specify analyzers for it, etc. In your example, it's using the default analyzer.
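If you do want partial matches to work through _all, one option is to give _all its own edge-ngram analyzer. This is only a sketch of the idea, assuming your ES 2.3 accepts analyzer/search_analyzer on the _all field (the index name test_index_all is made up so it doesn't clash with your existing index):
curl -XPUT "http://localhost:9200/test_index_all?pretty=true" -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_all": {
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      },
      "properties": {
        "text_field": {
          "type": "string"
        }
      }
    }
  }
}'
After reindexing the document, the match query on "_all": "hell" should find it, because _all now produces edge ngrams of its own.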

How to implement case sensitive search in elasticsearch?

I have a field in my indexed documents where I need the search to be case-sensitive. I am using the match query to fetch the results.
An example of my document is:
{
"name" : "binoy",
"age" : 26,
"country": "India"
}
Now when I give the following query:
{
"query" : {
"match" : {
"name" : "Binoy"
}
}
}
It gives me a match for "binoy" against "Binoy". I want the search to be case-sensitive. It seems that, by default, Elasticsearch is case-insensitive. How do I make the search case-sensitive in Elasticsearch?
In the mapping you can define the field as not_analyzed.
curl -X PUT "http://localhost:9200/sample" -d '{
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}'
echo
curl -X PUT "http://localhost:9200/sample/data/_mapping" -d '{
"data": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}'
Now if you index and search as normal, the field won't be analyzed, so matching is exact and therefore case-sensitive.
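For example, a term query against the not_analyzed field only matches the exact, case-preserved value (a quick sketch against the sample index above):
curl -XGET "http://localhost:9200/sample/data/_search" -d '{
  "query": {
    "term": {
      "name": "binoy"
    }
  }
}'
The same query with "Binoy" would return nothing, because the value was indexed verbatim as "binoy".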
It depends on the mapping you have defined for your field name. If you haven't defined any mapping, then Elasticsearch will treat it as a string and use the standard analyzer (which lower-cases the tokens) to generate tokens. Your query will also use the same analyzer for search, hence matching is done by lower-casing the input. That's why "Binoy" matches "binoy".
To solve it, you can define a custom analyzer without the lowercase filter and use it for your field name. You can define the analyzer as below:
"analyzer": {
"casesensitive_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
You can define the mapping for name as below:
"name": {
"type": "string",
"analyzer": "casesensitive_text"
}
Now you can do the search on name.
Note: the analyzer above is for example purposes; you may need to change it as per your needs.
Have your mapping like:
PUT /whatever
{
"settings": {
"analysis": {
"analyzer": {
"mine": {
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"type": {
"properties": {
"name": {
"type": "string",
"analyzer": "mine"
}
}
}
}
}
meaning, no lowercase filter for that custom analyzer.
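You can confirm that the analyzer keeps the case with the _analyze API (assuming the 1.x/2.x query-string form):
curl -XGET 'localhost:9200/whatever/_analyze?analyzer=mine&text=Binoy'
It should emit the single token Binoy, so a match query for "binoy" will no longer hit documents containing "Binoy".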
Here is the full index template which worked for my Elasticsearch 5.6:
{
"template": "logstash-*",
"settings": {
"analysis" : {
"analyzer" : {
"case_sensitive" : {
"type" : "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
},
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"fluentd": {
"properties": {
"message": {
"type": "text",
"fields": {
"case_sensitive": {
"type": "text",
"analyzer": "case_sensitive"
}
}
}
}
}
}
}
As you see, the logs are coming from FluentD and are saved into a time-based index logstash-*. To make sure I can still execute wildcard queries on the message field, I put a multi-field mapping on that field. Wildcard/analyzed queries can be done on the message field and case-sensitive ones on the message.case_sensitive field.
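A case-sensitive query then goes against the sub-field, something like this (a sketch; the index pattern and field names follow the template above):
curl -XGET 'localhost:9200/logstash-*/_search' -d '{
  "query": {
    "match": {
      "message.case_sensitive": "Error"
    }
  }
}'
The plain message field keeps the default (lower-casing) analyzer for the usual analyzed and wildcard queries.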

Searching synonyms in elasticsearch

I'm trying to create synonym search over languages indexed in ES.
For example,
Indexed document -> name: German
Synonyms: German, Deutsch, XYZ
What I want is that when I type either German, Deutsch or XYZ, ES returns me German...
Is that possible at all?
Yes, very much so. Elasticsearch handles synonyms very well. Here is an example of how I configured synonyms on my cluster:
curl -XPOST localhost:9200/**new-index** -d '{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 0,
"analysis": {
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms/synonyms.txt"
}
},
"analyzer": {
"synonym": {
"tokenizer": "lowercase",
"filter": [
"synonym"
]
}
}
}
},
"mappings": {
"**new-type**": {
"_all": {
"enabled": false
},
"properties": {
"Title": {
"type": "multi_field",
"store": "yes",
"fields": {
"Title": {
"type": "string",
"analyzer": "synonym"
}
}
}
}
}
}
}'
The synonyms_path is resolved relative to the config folder, so in this example the file lives at config/synonyms/synonyms.txt. An example of the contents of synonyms.txt for your requirements would be:
German, Deutsch, XYZ
REMEMBER: if you lower-case at index time (as the lowercase tokenizer above does), the synonyms need to be in lower case. Restart your nodes if it's not working.
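With a lower-cased synonyms.txt entry such as german, deutsch, xyz, a query for any of the three terms should then return the document whose name is German. A sketch (replace the placeholder index, type and field with your real ones):
curl -XGET localhost:9200/new-index/new-type/_search -d '{
  "query": {
    "match": {
      "Title": "deutsch"
    }
  }
}'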

Using ngrams instead of wildcards without a predefined schema

I recently discovered that I shouldn't be using wildcards for elasticsearch queries. Instead, I've been told I should use ngrams. In my experimentation, this has worked really well. What I'd like to do is be able to tell Elasticsearch to use ngrams for all mapped fields (or mapped properties that fit a specific pattern).
For example:
curl -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"mappings": {
"person": {
"properties": {
"name": {
"type": "string",
"analyzer": "partial"
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"partial": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
Now, when I add this document:
curl -XPUT 'http://localhost:9200/test-ngram-7/person/1' -d '{
"name" : "Cobb",
"age" : 31
}'
I can easily query for "obb" and get a partial result. In my app, I don't know in advance what fields people will be mapping. I could obviously short circuit this on the client side and declare my mapping before posting the document, but it would be really cool if I could do something like this:
curl -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"mappings": {
"person": {
"properties": {
"_default_": {
"type": "string",
"analyzer": "partial"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"partial": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
Note that I'm using "_default_". It would also be cool if I could use something like "name.*" so that all properties starting with name would get filtered this way. I know Elasticsearch supports _default_ mappings and wildcard matching, so I'm hoping that I'm just doing it wrong.
In short, I'd like for new properties to get run through ngram filters when mappings are created automatically, not using the mapping API.
You could set up a dynamic_template, see http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html for info.
Using this, you can create mapping templates for your not-yet-known fields, based on a match, pattern-matching etc., and apply analyzers etc. for these templates. This will give you more fine-grained control of the behavior compared to setting the default analyzer. The default analyzer should typically be used for basic stuff like "lowercase" and "asciifolding", but if you are certain that you wish to apply the nGram to ALL fields, it is certainly a valid way to go.
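A minimal sketch of such a dynamic template is below (the index name test-ngram-8, the template name strings_as_partial and the match pattern are made up for illustration; the partial analyzer is the one from the question). Any newly seen string field whose name matches the pattern gets the ngram analyzer applied automatically:
curl -XPUT 'http://localhost:9200/test-ngram-8/' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "lb_ngram": {
          "max_gram": 10,
          "min_gram": 1,
          "type": "nGram"
        }
      },
      "analyzer": {
        "partial": {
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "lb_ngram"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "person": {
      "dynamic_templates": [
        {
          "strings_as_partial": {
            "match_mapping_type": "string",
            "match": "name*",
            "mapping": {
              "type": "string",
              "analyzer": "partial"
            }
          }
        }
      ]
    }
  }
}'
Use "match": "*" instead of "name*" if every string field should be treated this way.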
So, one solution I've found is to set up a "default" analyzer. The docs say:
Default Analyzers
An analyzer is registered under a logical name. It
can then be referenced from mapping definitions or certain APIs. When
none are defined, defaults are used. There is an option to define
which analyzers will be used by default when none can be derived.
The default logical name allows one to configure an analyzer that will
be used both for indexing and for searching APIs. The default_index
logical name can be used to configure a default analyzer that will be
used just when indexing, and the default_search can be used to
configure a default analyzer that will be used just when searching.
Here is an example:
curl -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"default": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
And then this query will work:
curl -XGET 'http://localhost:9200/test-ngram-7/person/_search' -d '{
"query":
{
"match" : {
"name" : "obb"
}
}
}'
Answering my own question because I am still interested in whether this is the "correct" way to do this.
