How do synonyms work internally in Elasticsearch?

I came across Elasticsearch some time ago and started exploring it. I learned about the synonyms feature, which is amazing! Can someone explain how the synonyms process works internally? How do index-time and search-time synonym analysis differ in terms of the internal structure?
Thanks :)

Elastic Doc:
Typically, the same analyzer should be applied at both index time and search time, to ensure that the query terms are in the same format as the inverted index terms.
When you use synonyms in a search_analyzer, the synonym tokens are generated for the search terms only at search time.
When you use synonyms at index time, each term is expanded to the other terms of its synonym set, so everything is already in the inverted index. This can increase your storage requirements, since you are indexing more terms.
Index-time example:
PUT synonym_index_time
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        },
        "filter": {
          "synonyms_filter": {
            "type": "synonym",
            "lenient": true,
            "synonyms": [
              "laptop, notebook"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}
Test:
GET synonym_index_time/_analyze
{
  "field": "name",
  "text": ["laptop"]
}
Results:
{
  "tokens" : [
    {
      "token" : "laptop",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "notebook",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
As you can see, both laptop and notebook have been indexed: notebook was added as a SYNONYM token at the same position.
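For contrast, here is a minimal sketch of the search-time variant (the index name synonym_search_time is made up for this example): the field is indexed with the plain standard analyzer, and the synonym filter is applied only through search_analyzer, so the inverted index contains just laptop and the expansion to notebook happens when the query text is analyzed. synonym_graph is generally preferred for search-time synonyms because it handles multi-word synonyms correctly.
PUT synonym_search_time
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        },
        "filter": {
          "synonyms_filter": {
            "type": "synonym_graph",
            "lenient": true,
            "synonyms": [
              "laptop, notebook"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}
With this setup a match query for notebook will also find documents that only contain laptop, even though only laptop is stored in the index.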

Related

Does Elasticsearch support nested or object fields in MultiMatch?

I have an object field named "FullTitleFts" with a field "text" inside it. This query works fine (and returns some entries):
GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "fullTitleFts.text": "Ivan"
          }
        }
      ]
    }
  }
}
But this query returns nothing:
GET index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Ivan",
            "fields": [
              "fullTitleFts.text"
            ]
          }
        }
      ]
    }
  }
}
Mapping of the field:
"fullTitleFts": {
"copy_to": [
"text"
],
"type": "keyword",
"fields": {
"text": {
"analyzer": "analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "text"
}
}
}
"analyzer": {
"filter": [
"lowercase",
"hypocorisms",
"protect_kw"
],
"char_filter": [
"replace_char_filter",
"e_char_filter"
],
"expand": "true",
"type": "custom",
"tokenizer": "standard"
}
e_char_filter replaces the Cyrillic character "ё" with "е", replace_char_filter removes "�" from the text, protect_kw is a keyword_marker for some Russian conjunctions, and hypocorisms is a synonym_graph filter that generates other forms of names.
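For reference, the filters described above could be defined roughly like this. This is only a sketch: the synonym entries, the keyword list and the index name are placeholders, not the asker's actual configuration.
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "e_char_filter": {
          "type": "mapping",
          "mappings": [ "ё => е", "Ё => Е" ]
        },
        "replace_char_filter": {
          "type": "pattern_replace",
          "pattern": "�",
          "replacement": ""
        }
      },
      "filter": {
        "protect_kw": {
          "type": "keyword_marker",
          "keywords": [ "и", "или", "но" ]
        },
        "hypocorisms": {
          "type": "synonym_graph",
          "synonyms": [ "алена, аленка, аленушка" ]
        }
      },
      "analyzer": {
        "analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "replace_char_filter", "e_char_filter" ],
          "filter": [ "lowercase", "hypocorisms", "protect_kw" ]
        }
      }
    }
  }
}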
Example of analyzer output:
GET index/_analyze
{
  "analyzer": "analyzer",
  "text": "Алёна�"
}
{
  "tokens" : [
    {
      "token" : "аленка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "аленушка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "алена",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
I've also found this question, and it seems the answer there didn't really work: the author had to add the "include_in_root" option to the mapping. So I'm wondering whether multi_match supports nested or object fields at all. I also can't find anything about it in the docs.
Based on the index mapping you have provided, your field is defined as a multi-field, not as a nested or object field. So both match and multi_match should work without providing a path: just use fullTitleFts.text when you need to search the text sub-field and fullTitleFts when you need to search the keyword field.
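For example, a query along these lines should work against the multi-field. This is just a sketch using the field names from the mapping above, querying the keyword field and the text sub-field together:
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "Ivan",
      "fields": [
        "fullTitleFts",
        "fullTitleFts.text"
      ]
    }
  }
}
If this still returns nothing while the plain match query does, compare how the query string is analyzed for each field with the _analyze API.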

Elastic Search - Apply appropriate analyser to get accurate results

I am new to Elastic Search. I would like to apply an analyser that satisfies the search below.
Let's take an example. Suppose I have entered the text below in a document:
I am walking now
I walked to Ahmedabad
Everyday I walk in the morning
Anil walks in the evening.
I am hiring candidates
I hired candidates
Everyday I hire candidates
He hires candidates
Now when I search with
text "walking"
result should be [walking, walked, walk, walks]
text "walked"
result should be [walking, walked, walk, walks]
text "walk"
result should be [walking, walked, walk, walks]
text "walks"
result should be [walking, walked, walk, walks]
The same should also apply for hire.
text "hiring"
result should be [hiring, hired, hire, hires]
text "hired"
result should be [hiring, hired, hire, hires]
text "hire"
result should be [hiring, hired, hire, hires]
text "hires"
result should be [hiring, hired, hire, hires]
Thank You,
You need to use the stemmer token filter.
Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.
For example, walking and walked can be stemmed to the same root word:
walk. Once stemmed, an occurrence of either word would match the other
in a search.
Mapping
PUT index36
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stemmer", "lowercase" ]
        }
      }
    }
  }
}
Analyze
GET index36/_analyze
{
  "text": ["walking", "walked", "walk", "walks"],
  "analyzer": "my_analyzer"
}
Result
{
  "tokens" : [
    {
      "token" : "walk",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "walk",
      "start_offset" : 8,
      "end_offset" : 14,
      "type" : "word",
      "position" : 101
    },
    {
      "token" : "walk",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 202
    },
    {
      "token" : "walk",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 303
    }
  ]
}
All four words produce the same token, "walk", so any of these words will match the others in a search.
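As a quick sketch against the index36 mapping above (the sample document is taken from the question), indexing one of the phrases and searching for any other form should match, because both the indexed text and the query string go through my_analyzer:
POST index36/_doc
{
  "title": "Everyday I walk in the morning"
}

GET index36/_search
{
  "query": {
    "match": {
      "title": "walking"
    }
  }
}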
What you are looking for is a language analyzer; see the documentation here.
An analyzer always consists of a tokenizer and token filters, as the example below shows.
PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
You can now use the analyzer in your index-mapping like this:
{
  "mappings": {
    "myindex": {
      "properties": {
        "myField": {
          "type": "text",
          "analyzer": "rebuilt_english"
        }
      }
    }
  }
}
Remember to use a match query in order to query full-text.
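For instance, assuming myField is mapped with the rebuilt_english analyzer as above (the index and field names are just examples), a match query for any inflected form should find the documents containing the other forms:
GET english_example/_search
{
  "query": {
    "match": {
      "myField": "walking"
    }
  }
}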

Elastic Search Analyzer for Dynamically Defined Regular Expression Searches

We have lots of documents in an Elasticsearch index and are doing full-text searches at the moment. My next requirement in the project is to find all credit card data in the documents. The user will also be able to define regular-expression search rules dynamically in the future. But with the standard analyzer it is not possible to search for credit card info or any user-defined rule.
For instance, let's say a document contains credit card info such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as 4 separate tokens, as seen below:
"tokens" : [
{
"token" : "4321",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "4321",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "4321",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "4321",
"start_offset" : 15,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 3
}
]
I am not taking the Luhn algorithm into account for now. If I do a basic regular-expression search for a credit card with the regex "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing because the data is not analyzed and indexed for that. I thought that for this purpose I need to define a custom analyzer for regular-expression searches and store another version of the data in a separate field or index.
But, as I said before, in the future the user will define his/her own custom rule patterns for searching. How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that? With an ngram tokenizer I think I can search for all defined regular-expression rules, but is that reasonable? The project has to work with huge amounts of data without performance problems (a company's whole file system will be indexed). Do you have any other suggestion for this type of data-discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
Ok, here is a pair of custom analyzers that can help you detect credit card numbers and social security numbers. Feel free to adapt the regular expression as you see fit (by adding/removing other character separators that you will find in your data).
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "card_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "card_number"
          ]
        },
        "ssn_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "social_number"
          ]
        }
      },
      "filter": {
        "card_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3$4"
        },
        "social_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{3})[\s\.\-]+(\d{2})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "card": {
            "type": "text",
            "analyzer": "card_analyzer"
          },
          "ssn": {
            "type": "text",
            "analyzer": "ssn_analyzer"
          }
        }
      }
    }
  }
}
Let's test this.
POST test/_analyze
{
  "analyzer": "card_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only credit card number:
{
  "tokens" : [
    {
      "token" : "3526472847236374",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}
Similarly for SSN:
POST test/_analyze
{
  "analyzer": "ssn_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only social security number:
{
  "tokens" : [
    {
      "token" : "442231452",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}
And now we can search for either a credit card or an SSN. Let's say we have the following two documents. The SSN and credit card numbers are the same, yet they use different character separators:
POST test/_doc
{ "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374" }
POST test/_doc
{ "text": "SSN is 442.23.1452 belongs to Mr. XYZ. He paid $20 via credit card number 3526-4728-4723-6374" }
You can now find both documents by looking for the credit card number and/or SSN in any format:
POST test/_search
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723 6374"
    }
  }
}
POST test/_search
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723-6374"
    }
  }
}
POST test/_search
{
  "query": {
    "match": {
      "text.ssn": "442 23-1452"
    }
  }
}
All the above queries will match and return both documents.
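The reason any separator format works at query time is that the match query runs the query string through the field's analyzer as well, so text.card normalizes the query to the same digit-only token that was indexed. You can verify this with _analyze:
POST test/_analyze
{
  "analyzer": "card_analyzer",
  "text": "3526 4728 4723-6374"
}
This returns the single token 3526472847236374, which is exactly what is stored in text.card for both documents.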

In a span_first query, can we specify the "end" parameter based on the actual string stored in ES, or do I have to specify it in terms of the tokens stored in ES?

I asked a previous question here: Query in Elasticsearch for retrieving strings that start with a particular word. My problem was solved by using a span_first query, but now my problem has changed a bit. My mapping has changed because I now want to store words ending with apostrophe 's' as "word", "words" and "word's"; for example, see the case below:
"joseph's -> "joseph's", "josephs", "joseph"
My mapping is given below
curl -X PUT "http://localhost:9200/colleges/" -d
'{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "apostrophe_comma": {
            "type": "pattern_replace",
            "pattern": "\\b((\\w+)\\u0027S)\\b",
            "replacement": "$1 $2s $2"
          }
        },
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter" : ["apostrophe_comma"],
            "filter": ["lowercase", "unique"]
          }
        }
      }
    }
  },
  "mappings" : {
    "college": {
      "properties": {
        "college_name" : { "type" : "string", "index": "analyzed", "analyzer": "simple_wildcard" }
      }
    }
  }
}'
The span_first query I was using:
"span_first" : {
"match" : {
"span_term" : { "college_name" : first_string[0] }
},
"end" : 1
}
Now consider the example below to see the problem I am facing.
Suppose I have "Donald Duck's". If anyone searches for "Donald Duck", "Donald Duck's", "Donald Ducks", etc., I want them to get "Donald Duck's" back, but with the span_first query that is not happening: due to the mapping I now have four tokens, "Donald", "Duck", "Ducks" and "Duck's". For "Donald" the "end" used in the span_first query is 1, but for the other three I used 2, and since "end" differs between tokens of the same word I am not getting the desired result.
In short, the span_first query uses the "end" parameter to describe the position from the beginning at which my token must be present. Because my mapping breaks the single word "Duck's" into "Duck's", "Ducks" and "Duck", each of these has a different "end" value, but in the query I can only use one "end" parameter, so I don't know how to get my desired output.
If any of you have worked with the span_first query, please help me.
You can use the English possessive stemmer to remove 's, and an English stemmer (which maps to the Porter stemming algorithm) to handle plurals.
POST colleges
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "unique",
              "english_possessive_stemmer",
              "light_english_stemmer"
            ]
          }
        },
        "filter": {
          "light_english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english"
          }
        }
      }
    }
  },
  "mappings": {
    "college": {
      "properties": {
        "college_name": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "simple_wildcard"
        }
      }
    }
  }
}
After that, you will have to make two queries to get the right result. First, run the user's query through the _analyze API to get the tokens, which you will then pass to the span queries.
GET colleges/_analyze
{
  "text" : "donald ducks duck's",
  "analyzer" : "simple_wildcard"
}
The output is the set of tokens that will be passed to the next phase, i.e. the span query.
{
  "tokens": [
    {
      "token": "donald",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "duck",
      "start_offset": 7,
      "end_offset": 12,
      "type": "word",
      "position": 1
    },
    {
      "token": "duck",
      "start_offset": 13,
      "end_offset": 19,
      "type": "word",
      "position": 2
    }
  ]
}
The tokens donald, duck and duck will then be passed with end positions 1, 2 and 3, respectively.
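Putting it together, the span_first queries could look roughly like this (a sketch; the "end" value comes from each token's position in the _analyze response, so for the first duck you would use "duck" with "end": 2, and so on):
GET colleges/_search
{
  "query": {
    "span_first": {
      "match": {
        "span_term": { "college_name": "donald" }
      },
      "end": 1
    }
  }
}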
NOTE: No stemming algorithm is 100% accurate, so you might miss some singular/plural combinations. To handle these, you could log your queries and then use either a synonym token filter or a mapping char filter.
Hope this solves the problem.

Elasticsearch exact matches on analyzed fields

Is there a way to have ElasticSearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem and perhaps even phoneticize my docs, then have queries pull "exact" matches out.
What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.
I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal or so?
I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.
Use the shingle token filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.
At search time, you need to add an additional filter that matches the number of tokens in the index against the number of tokens in the search text. That means an extra step when you perform the actual search: counting the tokens in the search string. This is necessary because shingles create multiple permutations of tokens, and you need to make sure the count matches the size of your search text.
An attempt for this, just to give you an idea:
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "filter_stemmer": {
          "type": "porter_stem",
          "language": "_english_"
        }
      },
      "analyzer": {
        "ShingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "filter_stemmer",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ShingleAnalyzer",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "ShingleAnalyzer"
            }
          }
        }
      }
    }
  }
}
And the query:
{
  "query": {
    "filtered": {
      "query": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": "2"
        }
      }
    }
  }
}
The shingle filter is important here because it can create combinations of tokens and, more than that, combinations that keep the order of the tokens. Imo, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing etc.) and also to assemble back the original text. Unless you define your own "concatenation" filter, I don't think there is any other way than using the shingle filter.
But with shingles there is another issue: it creates combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:
"angeles",
"buns",
"buns in",
"buns in los",
"buns in los angeles",
"hamburgers",
"hamburgers buns",
"hamburgers buns in",
"hamburgers buns in los",
"hamburgers buns in los angeles",
"in",
"in los",
"in los angeles",
"los",
"los angeles"
If you are interested only in documents that match exactly (meaning the document above matches only when you search for "hamburgers buns in los angeles", and doesn't match something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.
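To obtain that count for the search string at query time, one practical option is to run the search text through the _analyze API with the same analyzer first (a sketch; the index name yourindex is a placeholder), count the tokens in the response, and plug that number into the term filter on text.word_count:
GET yourindex/_analyze
{
  "analyzer": "ShingleAnalyzer",
  "text": "HaMbUrGeRs BUN"
}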
You can use multi-fields for that purpose and have a not_analyzed sub-field within your analyzed field (let's call it item in this example). Your mapping would have to look like this:
{
  "yourtype": {
    "properties": {
      "item": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
With this kind of mapping, you can check how each of the values Hamburger and Hamburger Buns is "viewed" by the analyzer with respect to your multi-field item and item.raw.
For Hamburger:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger'
{
  "tokens" : [ {
    "token" : "Hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}
For Hamburger Buns:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "hamburger",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "buns",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger Buns'
{
  "tokens" : [ {
    "token" : "Hamburger Buns",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  } ]
}
As you can see, the not_analyzed field is going to be indexed untouched exactly as it was input.
Now, let's index two sample documents to illustrate this:
curl -XPOST localhost:9200/yourtypes/_bulk -d '
{"index": {"_type": "yourtype", "_id": 1}}
{"item": "Hamburger"}
{"index": {"_type": "yourtype", "_id": 2}}
{"item": "Hamburger Buns"}
'
And finally, to answer your question, if you want to have an exact match on Hamburger, you can search within your sub-field item.raw like this (note that the case has to match, too):
curl -XPOST localhost:9200/yourtypes/yourtype/_search -d '{
  "query": {
    "term": {
      "item.raw": "Hamburger"
    }
  }
}'
And you'll get:
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{"item": "Hamburger"}
    } ]
  }
}
UPDATE (see comments/discussion below and question re-edit)
Taking your example from the comments and trying to have HaMbUrGeR BuNs match Hamburger Buns, you can simply achieve it with a match query like this:
curl -XPOST localhost:9200/yourtypes/yourtype/_search?pretty -d '{
  "query": {
    "match": {
      "item": {
        "query": "HaMbUrGeR BuNs",
        "operator": "and"
      }
    }
  }
}'
Which, based on the same two indexed documents above, will yield:
{
  ...
  "hits" : {
    "total" : 1,
    "max_score" : 0.2712221,
    "hits" : [ {
      "_index" : "yourtypes",
      "_type" : "yourtype",
      "_id" : "2",
      "_score" : 0.2712221,
      "_source":{"item": "Hamburger Buns"}
    } ]
  }
}
You can keep the analyzer as you intended (lowercase, tokenize, stem, ...) and use query_string as the main query with match_phrase as a boosting query. Something like this:
{
  "bool" : {
    "should" : [
      {
        "query_string" : {
          "default_field" : "your_field",
          "default_operator" : "OR",
          "phrase_slop" : 1,
          "query" : "Hamburger"
        }
      },
      {
        "match_phrase": {
          "your_field": {
            "query": "Hamburger"
          }
        }
      }
    ]
  }
}
It will match both documents, and the exact match (match_phrase) will rank on top, since that document matches both should clauses (and gets a higher score).
default_operator is set to OR, so the query "Hamburger Buns" (hamburger OR bun) will also match the document "Hamburger".
phrase_slop is set to 1 to match terms at a distance of 1 only, e.g. a search for Hamburger Buns will not match the document Hamburger Big Buns. You can adjust this depending on your requirements.
You can refer to Closer is better and Query string for more details.
