I have the following settings and analyzer:
PUT /tests
{
"settings": {
"analysis": {
"analyzer": {
"standardWithEdgeNGram": {
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"tokenizer": {
"standard": {
"type": "standard"
}
},
"filter": {
"lowercase": {
"type": "lowercase"
},
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"test": {
"_all": {
"analyzer": "standardWithEdgeNGram"
},
"properties": {
"Name": {
"type": "string",
"analyzer": "standardWithEdgeNGram"
}
}
}
}
}
And I posted the following data into it:
POST /tests/test
{
"Name": "JACKSON v. FRENKEL"
}
And here is my query:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax"
}
}
}
And I got this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.19178301,
"hits": [
{
"_index": "tests",
"_type": "test",
"_id": "lfOxb_5bS86_CMumo_ZLoA",
"_score": 0.19178301,
"_source": {
"Name": "JACKSON v. FRENKEL"
}
}
]
}
}
Can someone explain why this matches, even though "jax" appears nowhere in "Name"?
Thanks in advance
A match query performs analysis on its given value. By default, "jax" is analyzed with standardWithEdgeNGram, whose edge n-gram filter expands it into ["ja", "jax"]; the first of these matches the "ja" produced when "JACKSON v. FRENKEL" was analyzed at index time.
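A quick way to verify is the _analyze API (shown here in the ES 1.x query-string form); it should return the tokens ja and jax:
GET /tests/_analyze?analyzer=standardWithEdgeNGram&text=jax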
If you don't want this behavior, you can specify a different analyzer for the match query using the analyzer option, for example keyword:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax",
"analyzer" : "keyword"
}
}
}
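With keyword the query text is kept as the single token jax, which never appears among the indexed edge n-grams of "JACKSON v. FRENKEL", so the document should no longer match.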
In ES 1.3.2, the same query produced this error:
Error : query parsed in simplified form, with direct field name, but included more options than just the field name, possibly use its 'options' form, with 'query' element?]; }]
status: 400
I fixed the issue as follows:
{
"query": {
"query_string": {
"fields": [
"Name"
],
"query": "jax",
"analyzer": "simple"
}
}
}
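The match query's full options form should also avoid the parse error, which is what the message hints at with "use its 'options' form, with 'query' element"; a sketch:
GET /tests/test/_search
{
"query": {
"match": {
"Name": {
"query": "jax",
"analyzer": "keyword"
}
}
}
}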
Related
I have a field ManufacturerName
"ManufacturerName": {
"type": "keyword",
"normalizer" : "keyword_lowercase"
},
And a normalizer
"normalizer": {
"keyword_lowercase": {
"type": "custom",
"filter": ["lowercase"]
}
}
When searching for 'ripcurl' it matches. However, when searching for 'rip curl' it doesn't.
How would I get certain words concatenated, i.e. 'rip curl' -> 'ripcurl'?
Apologies if this is a duplicate, I've spent some time seeking a solution to this.
You would want to use a text field here and handle this kind of requirement with the ngram tokenizer.
Below is a sample mapping, query and response:
Mapping:
PUT mysomeindex
{
"mappings": {
"mydocs":{
"properties": {
"ManufacturerName":{
"type": "text",
"analyzer": "my_analyzer",
"fields":{
"keyword":{
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
},
"settings": {
"analysis": {
"normalizer": {
"my_normalizer":{
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [ "synonyms" ]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"synonyms":{
"type": "synonym",
"synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
}
}
}
}
}
Notice that the field ManufacturerName is a multi-field with both a text type and a sibling keyword type. That way you can use the keyword field for exact matches and aggregation queries, while the text field covers this requirement.
Sample Document:
POST mysomeindex/mydocs/1
{
"ManufacturerName": "ripcurl"
}
POST mysomeindex/mydocs/2
{
"ManufacturerName": "henri lloyd"
}
When you ingest the above documents, Elasticsearch creates tokens of length 3 to 5 and stores them in the inverted index, e.g. rip, ipc, pcu, and so on.
You can execute the query below to see which tokens get created:
POST mysomeindex/_analyze
{
"text": "ripcurl",
"analyzer": "my_analyzer"
}
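For "ripcurl" this should produce the 3-to-5 character grams rip, ipc, pcu, cur, url, ripc, ipcu, pcur, curl, ripcu, ipcur and pcurl. Running the query text "rip curl" through the same analyzer should yield rip, cur, url and curl (the space is dropped since token_chars only keeps letters and digits), and that overlap is what makes the match query below succeed.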
Also, I'd suggest looking into the edge ngram tokenizer to see if that fits your requirement better.
Query:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "rip curl"
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "1",
"_score": 0.25316024,
"_source": {
"ManufacturerName": "ripcurl"
}
}
]
}
}
Query for Synonyms:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "henri lloyd"
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.2784421,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "2",
"_score": 2.2784421,
"_source": {
"ManufacturerName": "henry lloyd"
}
}
]
}
}
Note: If you intend to use synonyms, the best way is to keep them in a text file added relative to the config folder location, as mentioned here.
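For illustration, the filter would then point at the file instead of listing synonyms inline (the file name below is just an example):
"filter": {
"synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
}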
Hope this helps!
I am basically trying to write a query that should return the document where
school is "holy international" AND grade is "second".
The issue with the current query is that it doesn't consider the must match part: even though I don't specify the school, it still gives me this document when it isn't a match.
The query is giving me all documents where the grade is "second".
I want only the document where school is "holy international" AND grade is "second".
Also, I have not specified anything in the match query for "schools.school", yet it still returns results.
mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase1": {
"tokenizer": "keyword",
"filter": ["lowercase", "my_pattern_replace1", "trim"]
},
"my_keyword_lowercase2": {
"tokenizer": "standard",
"filter": ["lowercase", "trim"]
}
},
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
}
},
"mappings": {
"test_data": {
"properties": {
"schools": {
"type": "nested",
"properties": {
"school": {
"type": "string",
"analyzer": "my_keyword_lowercase1"
},
"grade": {
"type": "string",
"analyzer": "my_keyword_lowercase2"
}
}
}
}
}
}
}
data
{
"_index": "data_index",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_version": 1,
"found": true,
"_source": {
"summary": null,
"schools": [{
"school": "little flower",
"grade": "first",
"date": "2007-06-01",
},
{
"school": "holy international",
"grade": "second",
"date": "2007-06-01",
},
],
"first_name": "Adam",
"location": "Kansas City",
"last_name": "Roger",
"country": "US",
"name": "Adam Roger",
}
}
query
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "" <-----X didnt specify anything
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second",
"operator": "and",
"minimum_should_match": "100%"
}
}
}
}
}
}
}
}
result
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "data_test",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_score": 0.2876821,
"_source": {
"first_name": "Adam"
},
"inner_hits": {
"schools": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_nested": {
"field": "schools",
"offset": 0
},
"_score": 0.2876821,
"_source": {
"schools": {
"school": "holy international",
"grade": "second"
}
}
}
]
}
}
}
}
]
}
}
So, basically your problem is the analysis step; when I loaded everything up and checked, it became very clear:
This filter completely wipes every string in the schools.school field:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
That's happening because . is a regex that matches any character, so every character gets replaced with an empty string. When I checked it:
POST /data_index/_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
That's why you always get a match: every string you pass at index time and at search time becomes "". Some additional info from the Elastic docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-pattern_replace-tokenfilter.html
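If the intent was to strip literal dots only, the dot has to be escaped in the pattern so it stops matching every character; a minimal sketch:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": ""
}
}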
After I removed the pattern_replace filter, this query returned everything as expected:
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "holy international"
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second"
}
}
}
}
}
}
}
}
Context
I am trying to support smart-case search within our application, which uses Elasticsearch. The use case I want to support is partial matching on any blob of text with smart-case semantics. I managed to configure my index in such a way that I can simulate smart-case search. It uses ngrams of maximum length 8 to keep storage requirements in check.
The way it works is that each document has both a generated case-sensitive and a case-insensitive field, populated via copy_to, each with its own indexing strategy. When searching on a given input, I split the input into parts; this depends on the ngram length, whitespace, and double-quote escaping. Each part is checked for capital letters: when a capital letter is found, a match filter is generated for that specific part against the case-sensitive field, otherwise the case-insensitive field is used.
This has proven to work very nicely, however I am having difficulties with getting highlighting to work the way I would like. To better explain the issue, I added an overview of my test setup below.
Settings
curl -X DELETE localhost:9200/custom
curl -X PUT localhost:9200/custom -d '
{
"settings": {
"analysis": {
"filter": {
"default_min_length": {
"type": "length",
"min": 1
},
"squash_spaces": {
"type": "pattern_replace",
"pattern": "\\s{2,}",
"replacement": " "
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "8"
}
},
"analyzer": {
"index_raw": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "keyword"
},
"index_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim"],
"tokenizer": "keyword"
},
"index_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim"],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_default_": {
"_all": { "enabled": false },
"date_detection": false,
"dynamic_templates": [
{
"case_insensitive": {
"match_mapping_type": "string",
"match": "case_insensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive"
}
}
},
{
"case_sensitive": {
"match_mapping_type": "string",
"match": "case_sensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive"
}
}
},
{
"text": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer": "index_raw",
"copy_to": ["case_insensitive","case_sensitive"],
"fields": {
"case_insensitive": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive",
"term_vector": "with_positions_offsets"
},
"case_sensitive": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive",
"term_vector": "with_positions_offsets"
}
}
}
}
}
]
}
}
}
'
Data
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'
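To inspect what each side actually indexes, both analyzers can be exercised directly (same curl style, using the old query-string form of _analyze); the case-sensitive output should keep grams like tHis while the case-insensitive output is all lower-cased:
curl -X GET "http://localhost:9200/custom/_analyze?analyzer=index_case_sensitive" -d 'tHis .is a! Test'
curl -X GET "http://localhost:9200/custom/_analyze?analyzer=index_case_insensitive" -d 'tHis .is a! Test'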
Query
The user searches for: tHis test, which gets split into two parts since ngrams are at most 8 in length: (1) tHis and (2) test. For (1) the case-sensitive field is used, while (2) uses the case-insensitive field.
curl -X POST "http://localhost:9200/_search" -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*": {}
}
}
}
'
Response
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534896,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.057534896,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
}
]
}
}
Problem: highlighting
As you can see, the response shows that the smart-case search works very well. However, I also want to give feedback to the user using highlighting. My current setup uses "term_vector": "with_positions_offsets" to generate highlights. This indeed gives back correct highlights. However, the highlights are returned as both case-sensitive and case-insensitive independently.
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights become more complicated and can overlap.
Question
Is there an alternative setup to actually get back the combined highlight? I.e., I would like to have this as part of my response:
"highlight": {
"text": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
Attempt
Make use of a highlight_query to get a merged result:
curl -XPOST 'http://localhost:9200/_search' -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*.case_insensitive": {
"highlight_query": {
"bool": {
"must": [
{
"match": {
"*.case_insensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"*.case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
}
}
}
}
}
'
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
}
]
}
}
Warning
When ingesting the following, note the additional lower-case this:
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'
The response to the same query becomes:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis this .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em><em> this</em> .is a!<em> Test</em>"
]
}
}
]
}
}
As you can see, the highlight now also includes the lower-case this. For a test example like this we do not mind. However, for complicated queries the user might (and probably will) get confused about when and how the smart-case takes effect, especially when the lower-case match involves a field that only matches on lower-case.
Conclusion
This solution will give you all highlights merged as one, but might include unwanted results.
I have the following Elasticsearch configuration:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"snow_filter" : {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"snow_filter",
"autocomplete_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "snowball"
},
"not": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
{ "index": { "_id": 3 }}
{ "name": "my discovery" }
{ "index": { "_id": 4 }}
{ "name": "myself is fun" }
{ "index": { "_id": 5 }}
{ "name": ["foxy", "foo"] }
{ "index": { "_id": 6 }}
{ "name": ["foo bar", "baz"] }
I am trying to get a search to only return item 6, which has a name of "foo bar", and I am not quite sure how. This is what I am doing right now:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b"
}
}
}
}
I know it's a combination of how the tokenizer splits the words, but I'm not sure how to be both flexible and strict enough to match this. I'm guessing I need a multi-field on my name mapping, but I'm not sure. How can I fix the query and/or my mapping to satisfy my needs?
You're already close. Since your edge_ngram analyzer generates tokens of a minimum length of 1, and your query gets tokenized into "foo" and "b", and the default match query operator is "or", your query matches each document that has a term starting with "b" (or "foo"), three of the docs.
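You can confirm the query-side tokens by running the search analyzer over the input (Sense-style request; it should come back with foo and b):
GET /my_index/_analyze?analyzer=snowball&text=foo b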
Using the "and" operator seems to do what you want:
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b",
"operator": "and"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4451914,
"hits": [
{
"_index": "test_index",
"_type": "my_type",
"_id": "6",
"_score": 1.4451914,
"_source": {
"name": [
"foo bar",
"baz"
]
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/4f6fb7c1fdc6942023091ee1433d7490e04e7dea
I wonder why a search for a specific term returns all documents of an index and not just the documents containing the requested term.
Here's the index and how I set it up
(using the elasticsearch head plugin browser interface):
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"filter": {
"dutch_stemmer": {
"type": "dictionary_decompounder",
"word_list": [
"koud",
"plaat",
"staal",
"fabriek"
]
},
"snowball_nl": {
"type": "snowball",
"language": "dutch"
}
},
"analyzer": {
"dutch": {
"tokenizer": "standard",
"filter": [
"length",
"lowercase",
"asciifolding",
"dutch_stemmer",
"snowball_nl"
]
}
}
}
}
}
And the mapping:
{
"properties": {
"test": {
"type": "string",
"fields": {
"dutch": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
Then I added some docs:
{"test": "ijskoud"}
{"test": "plaatstaal"}
{"test": "kristalfabriek"}
So now, when firing a search for "plaat", one would expect the search to come back with the document containing "plaatstaal":
{
"match": {
"test": "plaat"
}
}
However, Elasticsearch returns all documents regardless of their text content.
Is there anything I am missing here?
Funnily enough, there is a difference between using GET and POST: while the latter brings back no hits, GET returns all documents.
Any help is much appreciated.
When you use GET you do not pass the request body, so the search is performed without any query and all documents are returned.
When you use POST your search query does get passed on. It probably returns nothing because your document is not being analyzed the way you intended.
You need to configure your index to use your custom analyzer:
PUT /some_index
{
"settings": {
...
},
"mappings": {
"doc": {
"properties": {
"test": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
If more fields use this analyzer and you don't want to specify the analyzer on each one, you can set it for a specific type in that index:
"mappings": {
"doc": {
"analyzer": "dutch"
}
}
If you want ALL your types in that index to use your custom analyzer:
"mappings": {
"_default_": {
"analyzer": "dutch"
}
}
To test your analyzer in a simple way:
GET /some_index/_analyze?text=plaatstaal&analyzer=dutch
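Given the word_list above, the decompounder should emit the original token plus the dictionary parts it finds, so the output ought to include plaatstaal, plaat and staal (the exact surface forms depend on the snowball stemming applied afterwards), which is why a match on plaat can succeed once the analyzer is actually applied to the field.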
This would be the full list of steps to perform:
DELETE /some_index
PUT /some_index
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"filter": {
"dutch_stemmer": {
"type": "dictionary_decompounder",
"word_list": [
"koud",
"plaat",
"staal",
"fabriek"
]
},
"snowball_nl": {
"type": "snowball",
"language": "dutch"
}
},
"analyzer": {
"dutch": {
"tokenizer": "standard",
"filter": [
"length",
"lowercase",
"asciifolding",
"dutch_stemmer",
"snowball_nl"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"test": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
POST /some_index/doc/_bulk
{"index":{}}
{"test": "ijskoud"}
{"index":{}}
{"test": "plaatstaal"}
{"index":{}}
{"test": "kristalfabriek"}
GET /some_index/doc/_search
{
"query": {
"match": {
"test": "plaat"
}
}
}
And the result of the search:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.987628,
"hits": [
{
"_index": "some_index",
"_type": "doc",
"_id": "jlGkoJWoQfiVGiuT_TUCpg",
"_score": 1.987628,
"_source": {
"test": "plaatstaal"
}
}
]
}
}