Elasticsearch support for traditional Chinese

I am trying to index and search Chinese text in Elasticsearch. Using the Smart Chinese Analysis (elasticsearch-analysis-smartcn) plugin I have managed to search characters and words in both simplified and traditional Chinese. I have tried to insert the same text in both simplified and traditional Chinese, but the search returns only one result (depending on how the search is performed); since the text is the same, I would expect both results to be returned. I have read that in order to support traditional Chinese I must also install the STConvert Analysis (elasticsearch-analysis-stconvert) plugin. Can anyone provide a working example that uses these two plugins (or an alternative method that achieves the same result)?
The test index is created as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "chinese": {
          "type": "smartcn"
        }
      }
    }
  },
  "mappings": {
    "testType": {
      "properties": {
        "message": {
          "store": "yes",
          "type": "string",
          "index": "analyzed",
          "analyzer": "chinese"
        },
        "documentText": {
          "store": "compress",
          "type": "string",
          "index": "analyzed",
          "analyzer": "chinese",
          "termVector": "with_positions_offsets"
        }
      }
    }
  }
}
and the two indexing requests, with the same text in traditional and simplified Chinese respectively, are
{
"message": "汉字",
"documentText": "制造器官的噴墨打印機 這是一種制造人體器官的裝置。這種裝置是利用打印機噴射生物 細胞、 生長激素、凝膠體,形成三維的生物活體組織。凝膠體主要是為細胞提供生長的平台,之后逐步形成所想要的器官或組織。這項技術可以人工方式制造心臟、肝臟、腎臟。這項研究已經取得了一定進展,目前正在研究如何將供應營養的血管印出來。這個創意目前已經得到了佳能等大公司的贊助"
}
{
"message": "汉字",
"documentText": "制造器官的喷墨打印机 这是一种制造人体器官的装置。这种装置是利用打印机喷射生物 细胞、 生长激素、凝胶体,形成叁维的生物活体组织。凝胶体主要是为细胞提供生长的平台,之后逐步形成所想要的器官或组织。这项技术可以人工方式制造心脏、肝脏、肾脏。这项研究已经取得了一定进展,目前正在研究如何将供应营养的血管印出来。这个创意目前已经得到了佳能等大公司的赞助"
}
Finally, a sample search that I expect to return both documents is
{
  "query": {
    "query_string": {
      "query": "documentText : 制造器官的喷墨打印机",
      "default_operator": "AND"
    }
  }
}

After many attempts I found a configuration that works. I did not manage to make smartcn work with the stconvert plugin, so instead I rebuilt a CJK-style analyzer from Elasticsearch's cjk filters (cjk_width, cjk_bigram) on top of the icu_tokenizer. By adding both t2s and s2t as token filters, each term is stored in both forms, traditional and simplified.
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "t2s_convert": {
          "type": "stconvert",
          "delimiter": ",",
          "convert_type": "t2s"
        },
        "s2t_convert": {
          "type": "stconvert",
          "delimiter": ",",
          "convert_type": "s2t"
        }
      },
      "analyzer": {
        "my_cjk": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop",
            "t2s_convert",
            "s2t_convert"
          ]
        }
      }
    }
  },
  "mappings": {
    "testType": {
      "properties": {
        "message": {
          "store": "yes",
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_cjk"
        },
        "documentText": {
          "store": "compress",
          "type": "string",
          "index": "analyzed",
          "analyzer": "my_cjk",
          "termVector": "with_positions_offsets"
        }
      }
    }
  }
}
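To check that the conversion filters really emit both script variants, the analyzer can be inspected with the _analyze API. A quick sanity check, assuming the index above was created under the name test_index (substitute your actual index name):
GET test_index/_analyze
{
  "analyzer": "my_cjk",
  "text": "噴墨打印機"
}
If the setup behaves as described, the returned tokens should contain both the traditional and the simplified form of each bigram, which is why the same query matches documents stored in either script.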

Related

Elasticsearch Text with Path Hierarchy vs KeyWord using Prefix query performance

I'm trying to find the best way to filter results based on folder hierarchies. We will use this to cover the situation where we want to get all assets/documents in a provided folder and all of its subfolders (recursive search).
So for example for such a structure
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
/someFolder/someSubfolder/2
/someFolder/someSubfolder/2/subFolder
If we search for /someFolder/someSubfolder/1
We want to get as results
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
Now I've found two ways to do this, and I'm not sure which one would be better from a performance perspective:
1. Use a text field with the path_hierarchy tokenizer.
2. Use a keyword field and a prefix query.
Both of the above seem to work as I want them to (unless I missed something), but I'm not sure which one is better. On one hand I've read that filtering should be done on keyword fields; on the other hand the path_hierarchy tokenizer seems to be made exactly for these scenarios, but it can only be used with text fields.
Below I prepared some sample code.
Create the index and push some test data into it.
PUT test-index-2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "folderPath": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
POST test-index-2/_doc/
{
"folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces"
}
POST test-index-2/_doc/
{
"folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
Now both of the queries below return two results when matching a partial path hierarchy.
1.
GET test-index-2/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" }}
      ]
    }
  }
}
2.
GET test-index-2/_search
{
  "query": {
    "prefix": { "folderPath.keyword": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" }
  }
}
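To see why the term filter on the analyzed folderPath field matches the shorter path, it helps to look at the tokens the path_hierarchy tokenizer emits (an illustrative check against the index created above):
GET test-index-2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
The tokenizer produces every prefix of the path (the root folder, the path up to "Folder with Spaces", the path up to "SomeTestValue", and the full path), so a term filter for any ancestor path matches the document.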
Now the question would be: which solution is better if we want to get a subset of results?

Cannot retrieve data which includes specific symbols in Kibana

I am trying to use Kibana to retrieve comment data which includes some specific symbols like ? and 。. They are not regular symbols.
I tried using the escape character \ for them, with KQL like comment:\? or comment:\\?, but it doesn't work. Can anyone help?
When you create a sample doc and let ES auto-generate the mapping for you,
POST comments/_doc
{
"comment": "?"
}
running
GET comments/_mapping
will get you
"comment": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
Now, the analyzer of a text field is standard by default.
When we check how our non-standard characters got analyzed
GET comments/_analyze
{
"text": "?",
"analyzer": "standard"
}
the result is
{
"tokens" : [ ]
}
meaning we cannot search for its contents using the standard-analyzed text field. We need to either
define a different default analyzer (a minimal sketch of this option follows below), or
define this analyzer in one of the comment's fields.
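For the 1st option, a minimal sketch could look like the request below; the index name comments_default is only a placeholder, and whitespace is just one of several analyzers that happen to keep these symbols:
PUT comments_default
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "whitespace"
        }
      }
    }
  }
}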
Going with the 2nd approach (since it's good practice to keep differently-analyzed fields separate),
PUT comments2
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "fields": {
          "whitespace_analyzed": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }
  }
}
POST comments2/_doc
{
"comment": "?"
}
After verifying
GET comments2/_analyze
{
"text": "?",
"analyzer": "whitespace"
}
we can do the following in KQL
comment.whitespace_analyzed:"?"
Note that there are a bunch of built-in analyzers to choose from but you're more than welcome to create your own.
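If you do want to roll your own, a minimal custom analyzer could look like the sketch below; the index name comments3, the analyzer name my_symbol_friendly and the sub-field comment.symbols are all made up for illustration:
PUT comments3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_symbol_friendly": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "fields": {
          "symbols": {
            "type": "text",
            "analyzer": "my_symbol_friendly"
          }
        }
      }
    }
  }
}
This keeps the default standard analysis on comment while exposing a symbol-preserving view of the same text under comment.symbols.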

Exact Sub-String Match | ElasticSearch

We are migrating our search strategy from a database to Elasticsearch. During this we need to preserve the existing functionality of partially searching a field, similar to the SQL query below (including whitespace):
SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';
Having said that, I've gone through multiple articles related to ES and achieving said functionality, and the following is what I've come up with.
The majority of the articles I read recommended using an nGram analyzer/filter; hence the following is how the mapping & settings look:
Note:
The max length of the customer_id field is VARCHAR2(100).
{
  "customer-index": {
    "aliases": {},
    "mappings": {
      "customer": {
        "properties": {
          "customerName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "customerId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "substring_analyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "3",
        "provided_name": "customer-index",
        "creation_date": "1573333835055",
        "analysis": {
          "filter": {
            "substring": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "100"
            }
          },
          "analyzer": {
            "substring_analyzer": {
              "filter": [
                "lowercase",
                "substring"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "XXXXXXXXXXXXXXXXX",
        "version": {
          "created": "5061699"
        }
      }
    }
  }
}
The request to query the data looks like this:
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customer_id"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}
With that being said, here are a couple of questions/issues:
Let's say there are 3 records with customer_id:
0009950011214,
0009900011214,
0009920011214
When I search for "0995", ideally I expect to get only customer_id 0009950011214.
But I get all three records in the result set, and I believe it's due to the nGram analyzer and the way it splits the string (note: min_gram: 3 and max_gram: 100). Setting max_gram to 100 was meant to allow exact matches.
How should I fix this?
This brings me to my second point. Is using an nGram analyzer the most effective strategy for this kind of requirement? My concern is the memory utilization of having min_gram = 3 and max_gram = 100. Is there a better way to implement the same?
P.S: I'm on NEST 5.5.
On your customerId field you can set "search_analyzer": "standard". Then in your search query remove the line "analyzer": "substring_analyzer".
This will ensure that the searched customer ID is not tokenized into nGrams and is searched as is, while the indexed customer IDs are stored as nGrams.
I believe that's the functionality that you were trying to replicate from your SQL query.
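In mapping terms, that suggestion would look roughly like the fragment below (only the customerId field is shown; substring_analyzer is the analyzer already defined in the settings above):
"customerId": {
  "type": "text",
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
With this in place the indexed values are split into nGrams while the query text "0995" stays a single token, so only customer IDs that actually contain 0995 should match.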
From the mapping I can see that the field customerId is a text/keyword field (see: Difference between keyword and text in ElasticSearch).
So you can use a regexp filter as shown below to run searches like the SQL query you have given as an example. Try this:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "regexp": {
                "customerId": {
                  "value": ".*0995.*",
                  "flags": "ALL"
                }
              }
            }
          ]
        }
      }
    }
  }
}
Notice the ".*" around the value of the regexp expression:
".*<term>.*" is the same as a contains search,
"~(<term>)" is the same as a not-contains search.
You can also put ".*" only at the start or only at the end of the search term to do ends-with and starts-with type searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html
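For example, a starts-with style search for customer IDs beginning with 0995 would just change the regexp value (a minimal sketch using the regexp query directly):
{
  "query": {
    "regexp": {
      "customerId": {
        "value": "0995.*"
      }
    }
  }
}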

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches on its keyword value, but I'm having some issues with:
Some records say UK and some others say GB
Some records say US and some others say USA
And the list goes on...
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms in my query filter?
What you are looking for is a way to have your tokens match similar tokens which may or may not share the same characters. This is possible using synonyms.
Elasticsearch lets you configure synonyms and have your queries use those synonyms and return the results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. I have created a sample mapping and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
"name": "uk is pretty cool country"
}
And when you make use of the below query, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
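To see the synonym expansion in action you can inspect the analyzer directly (a quick check against the index defined above):
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}
The returned tokens should include both gb and uk at the same position, which is why the match query above finds the sample document.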
Refer to their official documentation to understand more on this. Hope this helps!
To handle this within ES itself, without using Logstash, I'd suggest using a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
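For completeness, a full pipeline built around that processor might look like the sketch below; the pipeline id normalize-country-code, the anchored patterns, and the target index name are my own choices for illustration:
PUT _ingest/pipeline/normalize-country-code
{
  "description": "Normalize alternative country codes",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "^GB$",
        "replacement": "UK"
      }
    },
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "^USA$",
        "replacement": "US"
      }
    }
  ]
}
PUT my_index/_doc/1?pipeline=normalize-country-code
{
  "countryCode": "GB"
}
Alternatively, the pipeline can be set as the index's default via the index.default_pipeline setting so that it runs on every indexed document.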

Elasticsearch completion - generating input list with analyzers

I've had a look at this article: https://www.elastic.co/blog/you-complete-me
However, it requires writing some logic in the client to create the multiple "input" values. Is there a way to define an analyzer (maybe using shingle or ngram/edge-ngram) that will generate the multiple terms for the input?
Here's what I tried (and it obviously doesn't work):
DELETE /products/
PUT /products/
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "shingle",
          "max_shingle_size": 5,
          "min_shingle_size": 2
        }
      },
      "analyzer": {
        "autocomplete": {
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "copy_to": ["name_suggest"]
        },
        "name_suggest": {
          "type": "completion",
          "payloads": false,
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
PUT /products/product/1
{
"name": "Apple iPhone 5"
}
PUT /products/product/2
{
"name": "iPhone 4 16GB"
}
PUT /products/product/3
{
"name": "iPhone 3 GS 16GB black"
}
PUT /products/product/4
{
"name": "Apple iPhone 4 S 16 GB white"
}
PUT /products/product/5
{
"name": "Apple iPhone case"
}
POST /products/_suggest
{
  "suggestions": {
    "text": "i",
    "completion": {
      "field": "name_suggest"
    }
  }
}
I don't think there's a direct way to achieve this.
I'm not sure why you would need to store ngrammed tokens, considering Elasticsearch already stores the 'input' text as an FST structure. Newer releases also allow for fuzziness in the suggest query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html#fuzzy
I can understand the need for something like a shingle analyzer to generate the inputs for you, but there doesn't seem to be a way yet. Having said that, the _analyze endpoint can be used to generate tokens from the analyzer of your choice, and those tokens can be passed to the 'input' field (with or without any other added logic). This way you won't have to replicate your analyzer logic in your application code. That's the only way I can think of to achieve the desired input field.
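As a rough sketch of that idea, reusing the autocomplete analyzer from the question (the body form of _analyze shown here is the newer syntax; older versions take analyzer and text as query-string parameters):
POST /products/_analyze
{
  "analyzer": "autocomplete",
  "text": "Apple iPhone 5"
}
The returned tokens (roughly apple, apple iphone, apple iphone 5, iphone, iphone 5, 5 for this analyzer) can then be passed back as the completion input when indexing; the document id 6 below is made up:
PUT /products/product/6
{
  "name": "Apple iPhone 5",
  "name_suggest": {
    "input": ["apple", "apple iphone", "apple iphone 5", "iphone", "iphone 5", "5"]
  }
}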
Hope it helps.
