Elasticsearch - Applying the appropriate analyzer for accurate results - elasticsearch

I am new to Elasticsearch. I would like to apply an analyzer that satisfies the search behaviour described below.
Let's take an example.
Suppose I have entered the text below in a document:
I am walking now
I walked to Ahmedabad
Everyday I walk in the morning
Anil walks in the evening.
I am hiring candidates
I hired candidates
Everyday I hire candidates
He hires candidates
Now when I search with
text "walking"
result should be [walking, walked, walk, walks]
text "walked"
result should be [walking, walked, walk, walks]
text "walk"
result should be [walking, walked, walk, walks]
text "walks"
result should be [walking, walked, walk, walks]
The same results should also apply for hire.
text "hiring"
result should be [hiring, hired, hire, hires]
text "hired"
result should be [hiring, hired, hire, hires]
text "hire"
result should be [hiring, hired, hire, hires]
text "hires"
result should be [hiring, hired, hire, hires]
Thank You,

You need to use the stemmer token filter.
Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.
For example, walking and walked can be stemmed to the same root word: walk. Once stemmed, an occurrence of either word would match the other in a search.
Mapping
PUT index36
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stemmer" ]
        }
      }
    }
  }
}
Analyze
GET index36/_analyze
{
  "text": ["walking", "walked", "walk", "walks"],
  "analyzer": "my_analyzer"
}
Result
{
  "tokens" : [
    {
      "token" : "walk",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "walk",
      "start_offset" : 8,
      "end_offset" : 14,
      "type" : "word",
      "position" : 101
    },
    {
      "token" : "walk",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 202
    },
    {
      "token" : "walk",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 303
    }
  ]
}
All four words produce the same token, "walk", so searching for any one of them will match the others.
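As a quick check, here is a minimal search sketch (assuming documents were indexed into the title field of index36 above). The match query analyzes the query string with the same my_analyzer, so searching for "walking" also matches documents containing walked, walk or walks:
GET index36/_search
{
  "query": {
    "match": {
      "title": "walking"
    }
  }
}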

What you are searching for is a language analyzer; see the documentation here.
An analyzer always consists of a tokenizer and one or more token filters, as the example below shows.
PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
You can now use the analyzer in your index-mapping like this:
{
  "mappings": {
    "myindex": {
      "properties": {
        "myField": {
          "type": "text",
          "analyzer": "rebuilt_english"
        }
      }
    }
  }
}
Remember to use a match query in order to query full-text.
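For illustration, a minimal sketch of such a full-text query (index and field names follow the snippets above and are only placeholders); the query string is run through the same analyzer, so stemming and stop-word removal apply to it as well:
GET /english_example/_search
{
  "query": {
    "match": {
      "myField": "the foxes jumping"
    }
  }
}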

Related

How synonyms work internally in Elasticsearch?

I came across Elasticsearch some time ago and started exploring it. I got to know about the synonyms feature, which is amazing! Can someone explain how this whole synonyms process works internally? How do index-time synonym analysis and search-time synonym analysis differ in terms of internal structure?
Thanks :)
Elastic Doc:
Typically, the same analyzer should be applied at both index time and search time, to ensure that the query terms are in the same format as the inverted index terms.
When you use synonyms in a search_analyzer, the synonym tokens for the search term are generated only at search time.
When you use synonyms at indexing time, you expand each term to the other terms of its synonym set, that is, everything is already there in the inverted index. This can increase your storage, as you are indexing more terms.
Index-time example:
PUT synonym_index_time
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        },
        "filter": {
          "synonyms_filter": {
            "type": "synonym",
            "lenient": true,
            "synonyms": [
              "laptop, notebook"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "synonym_analyzer"
      }
    }
  }
}
Test:
GET synonym_index_time/_analyze
{
  "field": "name",
  "text": ["laptop"]
}
Results:
{
  "tokens" : [
    {
      "token" : "laptop",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "notebook",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
As you can see, both laptop and notebook have been indexed; notebook was added as a SYNONYM token.
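For contrast, here is a minimal sketch of the search-time variant (the index name synonym_search_time and field name are hypothetical). The synonym filter is referenced only from a search_analyzer, so the inverted index stores just the original term and the expansion happens when the query string is analyzed. The synonym_graph filter type is commonly used at search time; a plain synonym filter would also work here.
PUT synonym_search_time
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        },
        "filter": {
          "synonyms_filter": {
            "type": "synonym_graph",
            "lenient": true,
            "synonyms": [
              "laptop, notebook"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search_analyzer"
      }
    }
  }
}
With this mapping, indexing "laptop" stores only the laptop token, but a match query for either "laptop" or "notebook" finds the document, because the expansion happens on the query side.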

Does Elasticsearch support nested or object fields in MultiMatch?

I have an object field named "FullTitleFts". It has a field "text" inside. This query works fine (and returns some entries):
GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "fullTitleFts.text": "Ivan"
          }
        }
      ]
    }
  }
}
But this query returns nothing:
GET index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Ivan",
            "fields": [
              "fullTitleFts.text"
            ]
          }
        }
      ]
    }
  }
}
Mapping of the field:
"fullTitleFts": {
"copy_to": [
"text"
],
"type": "keyword",
"fields": {
"text": {
"analyzer": "analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "text"
}
}
}
"analyzer": {
"filter": [
"lowercase",
"hypocorisms",
"protect_kw"
],
"char_filter": [
"replace_char_filter",
"e_char_filter"
],
"expand": "true",
"type": "custom",
"tokenizer": "standard"
}
e_char_filter replaces the Cyrillic character "ё" with "е", and replace_char_filter removes "�" from the text. protect_kw is a keyword_marker for some Russian conjunctions. hypocorisms is a synonym_graph filter for producing other forms of names.
Example of analyzer output:
GET index/_analyze
{
  "analyzer": "analyzer",
  "text": "Алёна�"
}
{
  "tokens" : [
    {
      "token" : "аленка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "аленушка",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "алена",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
I've also found this question, and it seems that the answer didn't really work - the author had to add the "include_in_root" option to the mapping. So I'm wondering whether multi_match supports nested or object fields at all. I also can't find anything about it in the docs.
Per the index mapping you have provided, your field is defined as a multi-field, not as a nested or object field. So both match and multi_match should work without providing a path: you can just use the field name fullTitleFts.text when you need to search the text sub-field and fullTitleFts when you need to search the keyword field.
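For example, a minimal sketch of the multi_match form (the index name is a placeholder) that searches both the keyword field and its text sub-field at once:
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "Ivan",
      "fields": [
        "fullTitleFts",
        "fullTitleFts.text"
      ]
    }
  }
}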

Elastic Search Analyzer for Dynamically Defined Regular Expression Searches

We have lots of documents in an Elasticsearch index and are doing full-text searches at the moment. My next requirement in the project is finding all credit card data in documents. Users will also be able to define some regular-expression search rules dynamically in the future. But with the standard analyzer it is not possible to search for credit card info or any user-defined rule.
For instance, let's say a document contains credit card info such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as 4 separate tokens, as seen below:
"tokens" : [
{
"token" : "4321",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "4321",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "4321",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "4321",
"start_offset" : 15,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 3
}
]
I am not taking the Luhn algorithm into account for now. If I do a basic regular-expression search to find a credit card with the regexp "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing because the data is not analyzed and indexed for that.
I thought that for this purpose I need to define a custom analyzer for regular-expression searches and store another version of the data in another field or index. But, as I said before, in the future the user will define his/her own custom rule patterns for searching. How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that? With an ngram tokenizer I think I could search for all defined regular-expression rules. But is it reasonable? The project has to work with huge amounts of data without any performance problems (a company's whole file system will be indexed).
Do you have any other suggestions for this type of data-discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
Ok, here is a pair of custom analyzers that can help you detect credit card numbers and social security numbers. Feel free to adapt the regular expression as you see fit (by adding/removing other character separators that you will find in your data).
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "card_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "card_number"
          ]
        },
        "ssn_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "social_number"
          ]
        }
      },
      "filter": {
        "card_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3$4"
        },
        "social_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{3})[\s\.\-]+(\d{2})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "card": {
            "type": "text",
            "analyzer": "card_analyzer"
          },
          "ssn": {
            "type": "text",
            "analyzer": "ssn_analyzer"
          }
        }
      }
    }
  }
}
Let's test this.
POST test/_analyze
{
  "analyzer": "card_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only credit card number:
{
  "tokens" : [
    {
      "token" : "3526472847236374",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}
Similarly for SSN:
POST test/_analyze
{
  "analyzer": "ssn_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}
Will yield a nice digit-only social security number:
{
  "tokens" : [
    {
      "token" : "442231452",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}
And now we can search for either a credit card or an SSN. Let's say we have the following two documents. The SSN and credit card numbers are the same, yet they use different character separators:
POST test/_doc
{ "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374" }
POST test/_doc
{ "text": "SSN is 442.23.1452 belongs to Mr. XYZ. He paid $20 via credit card number 3526-4728-4723-6374" }
You can now find both documents by looking for the credit card number and/or SSN in any format:
POST test/_search
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723 6374"
    }
  }
}
POST test/_search
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723-6374"
    }
  }
}
POST test/_search
{
  "query": {
    "match": {
      "text.ssn": "442 23-1452"
    }
  }
}
All the above queries will match and return both documents.

In a span_first query, can we specify the "end" parameter based on the actual string that is stored in ES, or do I have to specify it in terms of the tokens stored in ES

I asked a previous question here, Query in Elasticsearch for retrieving strings that start with a particular word on elasticsearch, and my problem was solved by using the span_first query. But now my problem has changed a bit: my mapping has changed, because now I want to store words ending with an apostrophe-s as "word", "words" and "word's". For example, see the case below:
"joseph's" -> "joseph's", "josephs", "joseph"
My mapping is given below
curl -X PUT "http://localhost:9200/colleges/" -d
'{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "apostrophe_comma": {
            "type": "pattern_replace",
            "pattern": "\\b((\\w+)\\u0027S)\\b",
            "replacement": "$1 $2s $2"
          }
        },
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter" : ["apostrophe_comma"],
            "filter": ["lowercase", "unique"]
          }
        }
      }
    }
  },
  "mappings" : {
    "college": {
      "properties": {
        "college_name" : { "type" : "string", "index": "analyzed", "analyzer": "simple_wildcard" }
      }
    }
  }
}'
The span_first query I was using:
"span_first" : {
"match" : {
"span_term" : { "college_name" : first_string[0] }
},
"end" : 1
}
Now, the problem I am facing - consider the example below.
Suppose I have "Donald Duck's". If anyone searches for "Donald Duck", "Donald Duck's", "Donald Ducks", etc., I want them to get "Donald Duck's". But with the span_first query this is not happening, because due to the mapping I now have 4 tokens: "Donald", "Duck", "Ducks" and "Duck's". For "Donald" the "end" used in the span_first query will be 1, but for the other three I used 2, and since "end" differs between tokens of the same word, I am not getting the desired result.
In short, my problem is that the span_first query uses the "end" parameter to describe the position from the beginning within which my token must be present. Due to my mapping I have broken one word, "Duck's", into "Duck's", "Ducks" and "Duck", which all end up with different "end" values, but while querying I can only use one "end" parameter, which is why I don't know how to get my desired output.
If anyone of you have worked on span_first query please help me.
You can use the english possessive stemmer to remove 's and the english stemmer (which maps to the Porter stemming algorithm) to handle plurals.
POST colleges
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "simple_wildcard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "unique",
              "english_possessive_stemmer",
              "light_english_stemmer"
            ]
          }
        },
        "filter": {
          "light_english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english"
          }
        }
      }
    }
  },
  "mappings": {
    "college": {
      "properties": {
        "college_name": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "simple_wildcard"
        }
      }
    }
  }
}
After that you will have to make two calls to get the right result. First, run the user's query through the _analyze API to get the tokens, which you will then pass to the span queries.
GET colleges/_analyze
{
  "text" : "donald ducks duck's",
  "analyzer" : "simple_wildcard"
}
The output is the set of tokens that will be passed to the next phase, i.e. the span query.
{
  "tokens": [
    {
      "token": "donald",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "duck",
      "start_offset": 7,
      "end_offset": 12,
      "type": "word",
      "position": 1
    },
    {
      "token": "duck",
      "start_offset": 13,
      "end_offset": 19,
      "type": "word",
      "position": 2
    }
  ]
}
The tokens donald, duck and duck will then be passed with "end" values of 1, 2 and 3 respectively.
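A rough sketch of what the resulting query could look like, combining the three span_first clauses in a bool query (the field name follows the mapping above; whether to use must or should, and how large each "end" should be, depends on how strict the prefix match needs to be):
GET colleges/_search
{
  "query": {
    "bool": {
      "must": [
        { "span_first": { "match": { "span_term": { "college_name": "donald" } }, "end": 1 } },
        { "span_first": { "match": { "span_term": { "college_name": "duck" } }, "end": 2 } },
        { "span_first": { "match": { "span_term": { "college_name": "duck" } }, "end": 3 } }
      ]
    }
  }
}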
NOTE: No stemming algorithm is 100% accurate; you might miss some singular/plural combinations. For this you could log your queries and then use either a synonym token filter or a mapping char filter.
Hope this solves the problem.

elasticsearch 1.6 field norm calculation with shingle filter

I am trying to understand the fieldnorm calculation in Elasticsearch (1.6) for documents indexed with a shingle analyzer - it does not seem to include the shingled terms. If that is the case, is it possible to configure the calculation to include them? Specifically, this is the analyzer I used:
{
  "index" : {
    "analysis" : {
      "filter" : {
        "shingle_filter" : {
          "type" : "shingle",
          "max_shingle_size" : 3
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
        }
      }
    }
  }
}
This is the mapping used:
{
  "docs": {
    "properties": {
      "text" : {"type": "string", "analyzer" : "my_analyzer"}
    }
  }
}
And I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When using the following query with the explain API,
{
  "query": {
    "match": {
      "text" : "the"
    }
  }
}
I get the following fieldnorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
The values seem to suggest that ES sees 2 terms for the 1st document ("the quick") and 7 terms for the 2nd document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate field norm with the shingled terms too (ie. all terms returned by the analyzer)?
You would need to customize the default similarity by disabling the discount overlap flag.
Example:
{
  "index" : {
    "similarity" : {
      "no_overlap" : {
        "type" : "default",
        "discount_overlaps" : false
      }
    },
    "analysis" : {
      "filter" : {
        "shingle_filter" : {
          "type" : "shingle",
          "max_shingle_size" : 3
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
        }
      }
    }
  }
}
Mapping:
{
  "docs": {
    "properties": {
      "text" : {"type": "string", "analyzer" : "my_analyzer", "similarity" : "no_overlap"}
    }
  }
}
To expand further:
By default, overlaps (i.e. tokens with a position increment of 0) are ignored when computing the norm.
The example below shows the positions of the tokens generated by the "my_analyzer" described in the OP:
get <index_name>/_analyze?field=text&text=the quick
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the quick",
      "start_offset": 0,
      "end_offset": 9,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
According to the Lucene documentation, the length norm calculation for the default similarity is implemented as follows:
state.getBoost()*lengthNorm(numTerms)
where numTerms is
if setDiscountOverlaps(boolean) is false
FieldInvertState.getLength()
else
FieldInvertState.getLength() - FieldInvertState.getNumOverlap()
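As a rough sanity check against the numbers in the question (assuming the default lengthNorm of 1/sqrt(numTerms) and Lucene's lossy single-byte norm encoding): with overlaps discounted, "the quick" counts 2 terms, so 1/sqrt(2) ≈ 0.707, which the encoding stores as the observed 0.625, and the seven-term document gives 1/sqrt(7) ≈ 0.378, stored as 0.375. With discount_overlaps set to false, the shingle "the quick" is counted too, so numTerms becomes 3 and the norm drops to roughly 1/sqrt(3) ≈ 0.577, stored as about 0.5.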
