Wildcard and term search returning different results based on case

I am using OpenSearch version 1.3.1 via the Docker image.
Here is my index and a document:
PUT index_0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_lower": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "index": true
      },
      "name": {
        "type": "text",
        "index": true,
        "analyzer": "keyword_lower"
      }
    }
  }
}
PUT index_0/_doc/1
{
  "id": "123",
  "name": "FooBar"
}
If I run this query, I get results (notice the difference in case, lowercase b):
GET index_0/_search?pretty
{"query":{"wildcard":{"name":"Foobar"}}}
But if I run this query, I do not:
GET index_0/_search?pretty
{"query":{"term":{"name":"Foobar"}}}
Why does a term search seem to be case sensitive whereas a wildcard one is not, given the same field?
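For what it's worth, a quick check (not part of the original post): term queries skip analysis and compare the supplied value verbatim against the indexed tokens, whereas the wildcard pattern is normalized through the field's analyzer (so "Foobar" is lowercased before matching). Since keyword_lower indexed "FooBar" as the single token "foobar", a term query for the lowercased form should match:
GET index_0/_search?pretty
{"query":{"term":{"name":"foobar"}}}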

Related

Elasticsearch - How to update mapping field from keyword to text

{
  "properties": {
    "device": {
      "type": "object",
      "properties": {
        "id": {
          "type": "keyword"
        },
        "value": {
          "type": "keyword"
        }
      }
    }
  }
}
I want to update the value field's type to text, but when I try to update it using the put mapping API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html) it throws an error:
"mapper [device.value] of different type, current_type [keyword], merged_type [text]"
This is the mapping I am trying to apply:
{
  "properties": {
    "device": {
      "type": "object",
      "properties": {
        "id": {
          "type": "keyword"
        },
        "value": {
          "type": "text"
        }
      }
    }
  }
}
Can someone help me update the index field from keyword to text?
Changing a field's type is a breaking change, so you need to:
1. Create a new index with the new required mapping.
2. Use the reindex API to move data from the old index to the new one (this step is optional only if you are OK with losing the data).
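A minimal sketch of those two steps (the index names old_index and new_index are placeholders):
PUT new_index
{
  "mappings": {
    "properties": {
      "device": {
        "properties": {
          "id": { "type": "keyword" },
          "value": { "type": "text" }
        }
      }
    }
  }
}
POST _reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "new_index" }
}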

Rank Elasticsearch results by the shortest hit

I am building an ngram search example with ES. Is it possible to take into account the shortest length of all the hits?
Here's an example:
Documents:
{"aliases": ["ElonMuskTesla", "MuskTesla"]}
{"aliases": ["ElonMusk"]}
Default Result:
When searching for "Musk" against the field "aliases", the first document will have the highest score, because it has two hits matching "Musk".
What I want:
But I want the second document to appear on top, because in my case it's more relevant to the search term (shortest means most similar).
I guess this might be achieved with a script score query, but I don't know exactly how, even after browsing a bunch of seemingly related questions.
[Appendix] Mapping & Settings:
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 40
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram",
          "filter": ["lowercase"]
        },
        "lower_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "aliases": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "term_vector": "with_positions_offsets",
        "search_analyzer": "lower_analyzer"
      }
    }
  }
}
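For the script score idea mentioned above, one rough direction is to divide the relevance score by the length of the matched alias. This sketch assumes a keyword sub-field (called aliases.raw here, which is not in the mapping above) so the raw string length can be read from doc values; note that doc['aliases.raw'].value returns only one value for multi-valued fields, so this is a heuristic rather than a true shortest-hit ranking:
GET my_index/_search
{
  "query": {
    "script_score": {
      "query": { "match": { "aliases": "musk" } },
      "script": {
        "source": "_score / doc['aliases.raw'].value.length()"
      }
    }
  }
}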

Exact Sub-String Match | ElasticSearch

We are migrating our search strategy from a database to Elasticsearch. During this migration we need to preserve the existing functionality of partially searching a field, similar to the SQL query below (including whitespaces):
SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';
Having said that, I've gone through multiple articles related to ES and achieving said functionality. The majority of the articles I read recommended using an nGram analyzer/filter, so here is what the mapping & settings look like:
Note:
The max length of the customer_id field is VARCHAR2(100).
{
  "customer-index": {
    "aliases": {},
    "mappings": {
      "customer": {
        "properties": {
          "customerName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "customerId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "substring_analyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "3",
        "provided_name": "customer-index",
        "creation_date": "1573333835055",
        "analysis": {
          "filter": {
            "substring": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "100"
            }
          },
          "analyzer": {
            "substring_analyzer": {
              "filter": ["lowercase", "substring"],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "XXXXXXXXXXXXXXXXX",
        "version": {
          "created": "5061699"
        }
      }
    }
  }
}
The request to query the data looks like this:
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": ["customer_id"],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}
With that being said, here are a couple of questions/issues:
Let's say there are 3 records with customer_id:
0009950011214,
0009900011214,
0009920011214
When I search for "0995", ideally I would expect to get only customer_id 0009950011214.
But I get all three records in the result set, and I believe it's due to the nGram analyzer and the way it splits the string (note: minGram: 3 and maxGram: 100). Setting maxGram to 100 was for exact match.
How should I fix this?
This brings me to my second point. Is using an nGram analyzer for this kind of requirement the most effective strategy? My concern is the memory utilization of having minGram = 3 and maxGram = 100. Is there a better way to implement the same?
P.S: I'm on NEST 5.5.
In your customerId field you can pass "search_analyzer": "standard". Then, in your search query, remove the line "analyzer": "substring_analyzer".
This will ensure that the searched customerId is not tokenized into nGrams and is searched as-is, while the indexed customerIds are tokenized into nGrams.
I believe that's the functionality you were trying to replicate from your SQL query.
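Applied to the mapping in the question, the relevant field would look roughly like this (only customerId is shown; everything else stays the same):
"customerId": {
  "type": "text",
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}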
From the mapping I can see that the field customerId is a text/keyword field (see: Difference between keyword and text in ElasticSearch).
So you can use a regexp filter as shown below to run searches like the SQL query you gave as an example. Try this:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "regexp": {
                "customerId": {
                  "value": ".*0995.*",
                  "flags": "ALL"
                }
              }
            }
          ]
        }
      }
    }
  }
}
Notice the ".*" in the value of the regexp expression:
".*term.*" is the same as a contains search
"~(term)" is the same as not-contains
You can also append ".*" at the start or the end of the search term to do ends-with and starts-with types of searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html
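For instance, a starts-with version of the same search term would be:
{
  "regexp": {
    "customerId": {
      "value": "0995.*"
    }
  }
}
and ".*0995" would behave like ends-with.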

Simplest lowercase example for es keyword type

I would like to ignore case on searches. What would be the least verbose way to do this? For example, something like:
"mappings": {
"_doc": {
"properties": {
"name": {"type": "keyword", "analyzer": "ignore_case"}
}
}
}
The above is pseudo-code, but what would be the best way to do this? Basically I want to have a word like "Hello" and have "Hello", "HELLO", "hELlo", or "hello" all match it.
The keyword datatype doesn't use analyzers; you need to make use of a normalizer instead.
If you intend to use the keyword type, you need to create a custom normalizer with its filter configured as lowercase, and your mapping should be as follows:
PUT <your_index_name>
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_custom_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "mytext": {
          "type": "keyword",
          "normalizer": "my_custom_normalizer"
        }
      }
    }
  }
}
I think that is the least verbose way! Hope it helps!
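For example, with the mapping above (a quick sanity check; the normalizer is applied to both the stored value and the term query input):
PUT <your_index_name>/mydocs/1
{
  "mytext": "Hello"
}
GET <your_index_name>/_search
{
  "query": {
    "term": {
      "mytext": "hELlo"
    }
  }
}
Both the indexed "Hello" and the queried "hELlo" normalize to "hello", so the document matches.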

Elasticsearch support for traditional Chinese

I am trying to index and search Chinese text in Elasticsearch. By using the Smart Chinese Analysis plugin (elasticsearch-analysis-smartcn) I have managed to search characters and words in both simplified and traditional Chinese. I have tried inserting the same text in both simplified and traditional Chinese, but a search returns only one result (depending on how the search is performed); since the text is the same, I would expect both results to be returned. I have read here that in order to support traditional Chinese I must also install the STConvert Analysis plugin (elasticsearch-analysis-stconvert). Can anyone provide a working example that uses these two plugins (or an alternative method that achieves the same result)?
The test index is created as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "chinese": {
          "type": "smartcn"
        }
      }
    }
  },
  "mappings": {
    "testType": {
      "properties": {
        "message": {
          "store": "yes",
          "type": "string",
          "index": "analyzed",
          "analyzer": "chinese"
        },
        "documentText": {
          "store": "compress",
          "type": "string",
          "index": "analyzed",
          "analyzer": "chinese",
          "termVector": "with_positions_offsets"
        }
      }
    }
  }
}
and the two indexing requests, with the same text in traditional and simplified Chinese respectively, are:
{
  "message": "汉字",
  "documentText": "制造器官的噴墨打印機 這是一種制造人體器官的裝置。這種裝置是利用打印機噴射生物 細胞、 生長激素、凝膠體,形成三維的生物活體組織。凝膠體主要是為細胞提供生長的平台,之后逐步形成所想要的器官或組織。這項技術可以人工方式制造心臟、肝臟、腎臟。這項研究已經取得了一定進展,目前正在研究如何將供應營養的血管印出來。這個創意目前已經得到了佳能等大公司的贊助"
}
{
  "message": "汉字",
  "documentText": "制造器官的喷墨打印机 这是一种制造人体器官的装置。这种装置是利用打印机喷射生物 细胞、 生长激素、凝胶体,形成叁维的生物活体组织。凝胶体主要是为细胞提供生长的平台,之后逐步形成所想要的器官或组织。这项技术可以人工方式制造心脏、肝脏、肾脏。这项研究已经取得了一定进展,目前正在研究如何将供应营养的血管印出来。这个创意目前已经得到了佳能等大公司的赞助"
}
Finally, a sample search that I want to return both documents is:
{
  "query": {
    "query_string": {
      "query": "documentText : 制造器官的喷墨打印机",
      "default_operator": "AND"
    }
  }
}
After many attempts I found a configuration that works. I did not manage to make smartcn work with the stconvert plugin, so I used Elasticsearch's cjk analyzer with the icu_tokenizer instead. By using t2s and s2t as filters, each character is stored in both forms, traditional and simplified.
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "t2s_convert": {
          "type": "stconvert",
          "delimiter": ",",
          "convert_type": "t2s"
        },
        "s2t_convert": {
          "type": "stconvert",
          "delimiter": ",",
          "convert_type": "s2t"
        }
      },
      "analyzer": {
        "my_cjk": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop",
            "t2s_convert",
            "s2t_convert"
          ]
        }
      }
    }
  },
  "mappings": {
    "testType": {
      "properties": {
        "message": {
          "store": "yes",
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_cjk"
        },
        "documentText": {
          "store": "compress",
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_cjk",
          "termVector": "with_positions_offsets"
        }
      }
    }
  }
}
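To check what this analyzer actually emits, the _analyze API can be run against the index (the index name test_index is a placeholder; on older ES versions the parameters may need to be passed in the query string rather than the body):
GET test_index/_analyze
{
  "analyzer": "my_cjk",
  "text": "噴墨打印機"
}
The returned token list should show the forms produced by the t2s_convert/s2t_convert filters.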
