How do I configure Elasticsearch to use the icu_tokenizer?

I'm trying to search text indexed by Elasticsearch with the icu_tokenizer, but I can't get it working.
My test case is to tokenize the sentence “Hello. I am from Bangkok”, in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized into the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ. (Sample from Elasticsearch: The Definitive Guide.)
Searching for any of the last four words fails for me. Searching for either of the space-separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
"icu" : {
"settings" : {
"index" : {
"creation_date" : "1474010824865",
"analysis" : {
"analyzer" : {
"nfkc_cf_normalized" : [ "icu_normalizer" ],
"tokenizer" : "icu_tokenizer"
}
}
},
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "tALRehqIRA6FGPu8iptzww",
"version" : {
"created" : "2040099"
}
}
}
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
"first_name" : "John",
"last_name" : "Doe",
"about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
"query" : {
"match" : {
"about" : "กรุงเทพฯ"
}
}
}'
returns nothing ("hits" : [ ]).
Performing the same search with either สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index. How should it be done?

The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, each document field has to specify the analyzer it should use.
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
"settings": {
"analysis": {
"analyzer": {
"icu_analyzer" : {
"char_filter": [
"icu_normalizer"
],
"tokenizer" : "icu_tokenizer"
}
}
}
},
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
}
Test the custom analyzer with the following request:
POST /icu/_analyze
{
"text": "สวัสดี ผมมาจากกรุงเทพฯ",
"analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
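To see why the analyzer has to be attached to the field, here is a small Python sketch. It is a toy model, not Elasticsearch: `whitespace_analyzer` stands in for the analysis that left ผมมาจากกรุงเทพฯ as a single token, and `dictionary_analyzer` is a hypothetical stand-in for icu_tokenizer that segments text with a hard-coded word list. All names are made up for illustration.

```python
# Toy model of index-time vs. query-time analysis (not the Elasticsearch API).
# Both analyzers below are illustrative stand-ins, not real ICU segmentation.

WORDS = ["สวัสดี", "ผม", "มา", "จาก", "กรุงเทพฯ"]  # hard-coded dictionary (assumption)


def whitespace_analyzer(text):
    """Stand-in for an analyzer that only splits on spaces."""
    return text.split()


def dictionary_analyzer(text):
    """Stand-in for icu_tokenizer: greedy longest-match over WORDS."""
    tokens = []
    for chunk in text.split():
        pos = 0
        while pos < len(chunk):
            for word in sorted(WORDS, key=len, reverse=True):
                if chunk.startswith(word, pos):
                    tokens.append(word)
                    pos += len(word)
                    break
            else:
                # Unknown character: emit it as its own token and move on.
                tokens.append(chunk[pos])
                pos += 1
    return tokens


def matches(doc_text, query_text, analyzer):
    """A match-style query hits when document and query share at least one token."""
    return bool(set(analyzer(doc_text)) & set(analyzer(query_text)))


doc = "สวัสดี ผมมาจากกรุงเทพฯ"
print(matches(doc, "กรุงเทพฯ", whitespace_analyzer))  # False: field not using the ICU-like analyzer
print(matches(doc, "กรุงเทพฯ", dictionary_analyzer))  # True: field mapped to the ICU-like analyzer
```

The point of the sketch is only that the tokens stored at index time decide what a later query can find, which is why the analyzer must be named in the field mapping and not just defined in the settings.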
My suggestion: the Dev Tools console in Kibana can help you craft queries effectively.


How to use nested in Elasticsearch 7.10

These are the steps I followed to use a nested field in Elasticsearch.
First step:
curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -d'
{
"group" : "fans",
"user" : [ // 1
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}'
Second step:
curl -XPUT 'localhost:9200/my_index?pretty' -d'
{
"mappings": {
"my_type": {
"properties": {
"user": {
"type": "nested" // 1
}
}
}
}
}'
Before copying the code, I deleted all the indices on my machine.
However, after running step 2, something went wrong, as shown below.
{
"error" : {
"root_cause" : [
{
"type" : "resource_already_exists_exception",
"reason" : "index [my_index/yHhgr8iEQqGnHo5Ugex2dA] already exists",
"index_uuid" : "yHhgr8iEQqGnHo5Ugex2dA",
"index" : "my_index"
}
],
"type" : "resource_already_exists_exception",
"reason" : "index [my_index/yHhgr8iEQqGnHo5Ugex2dA] already exists",
"index_uuid" : "yHhgr8iEQqGnHo5Ugex2dA",
"index" : "my_index"
},
"status" : 400
}
I really don't know what to do about this. (I have also tried creating the nested field first. It also went wrong.)
I'm new to Elasticsearch and really need help. Thank you very much!
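Since you are using Elasticsearch version 7.10, you cannot include a mapping type (my_type in your request) in the index mapping definition. Refer to the Elasticsearch documentation on the removal of mapping types to know more.
Your first step indexed a document into my_index, which created the index automatically with a dynamically generated mapping. That is why the second step fails with resource_already_exists_exception. You cannot change the mapping of an existing field in an index that already exists. You need to delete the index and index the data again with the new mapping, or reindex into a new index with the new mapping.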
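The ordering problem can be illustrated with a small in-memory model in Python. This is only a sketch: `ToyCluster`, `ResourceAlreadyExists`, and the `{"dynamic": True}` placeholder mapping are hypothetical names invented here and are not part of any Elasticsearch client.

```python
# Toy in-memory model of index auto-creation (not the Elasticsearch client API).


class ResourceAlreadyExists(Exception):
    pass


class ToyCluster:
    def __init__(self):
        self.indices = {}  # index name -> {"mappings": ..., "docs": {...}}

    def create_index(self, name, mappings):
        # Mirrors PUT /<index>: fails if the index is already there.
        if name in self.indices:
            raise ResourceAlreadyExists(f"index [{name}] already exists")
        self.indices[name] = {"mappings": mappings, "docs": {}}

    def index_document(self, name, doc_id, doc):
        # Mirrors indexing a document: the index is created on the fly
        # (with a dynamic, non-nested mapping) when it does not exist yet.
        if name not in self.indices:
            self.indices[name] = {"mappings": {"dynamic": True}, "docs": {}}
        self.indices[name]["docs"][doc_id] = doc


nested_mapping = {"properties": {"user": {"type": "nested"}}}
doc = {"group": "fans", "user": [{"first": "John", "last": "Smith"}]}

# Order used in the question: document first, mapping second.
wrong = ToyCluster()
wrong.index_document("my_index", "1", doc)
try:
    wrong.create_index("my_index", nested_mapping)
except ResourceAlreadyExists as err:
    print("failed:", err)

# Order from the answer: mapping first, document second.
right = ToyCluster()
right.create_index("my_index", nested_mapping)
right.index_document("my_index", "1", doc)
print(right.indices["my_index"]["mappings"])
```

In the first run the index ends up with the automatically generated mapping, so `user` is not nested. Only the second ordering leaves the nested mapping in place.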
You need to first create the index with the following mapping:
PUT /my_index
{
"mappings": {
"properties": {
"user": {
"type": "nested" // 1
}
}
}
}
Then index the documents into the index. Refer to the official documentation to learn more about the nested type.
PUT /my_index/_doc/1
{
"group" : "fans",
"user" : [ // 1
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
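As background on why the mapping matters for this document, the following Python sketch imitates how an array of objects is matched with and without the nested type, following the behaviour described in the Elasticsearch documentation for nested fields. It is a toy model, and `flattened_match` and `nested_match` are names invented for this illustration.

```python
# Toy model of how an array of objects is matched with and without "nested".
# This imitates documented behaviour; it is not Elasticsearch code.

users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flattened_match(objects, first, last):
    """Default object mapping: values are flattened per field, so pairing is lost."""
    firsts = {o["first"] for o in objects}
    lasts = {o["last"] for o in objects}
    return first in firsts and last in lasts


def nested_match(objects, first, last):
    """Nested mapping: each object is matched on its own."""
    return any(o["first"] == first and o["last"] == last for o in objects)


print(flattened_match(users, "Alice", "Smith"))  # True: a false positive
print(nested_match(users, "Alice", "Smith"))     # False
print(nested_match(users, "Alice", "White"))     # True
```

If the index is auto-created before the nested mapping is applied, `user` behaves like the flattened case.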

Elasticsearch does not find an existing document using the query DSL

I don't know why, but searching for a document with URI Search returns the right document, while the same document is not found when I use the query DSL.
To reproduce the issue:
Without any index created, I insert this document:
curl http://localhost:9299/integrationtest-index/searchable/ID_XXXX2 -d '{ "ref" : "XXXX2", "field1" : "value1" }'
So the index is created automatically with the default mapping (type searchable):
curl http://localhost:9299/integrationtest-index?pretty
{
"integrationtest-index" : {
"aliases" : { },
"mappings" : {
"searchable" : {
"properties" : {
"field1" : {
"type" : "string"
},
"ref" : {
"type" : "string"
}
}
}
},
"settings" : {
"index" : {
"field1" : "value1",
"ref" : "XXXX2",
"number_of_shards" : "5",
"creation_date" : "1466780216631",
"number_of_replicas" : "1",
"uuid" : "GBj2VF-wQy6JP74AqoIn5g",
"version" : {
"created" : "2020099"
}
}
},
"warmers" : { }
}
}
This query returns one document:
curl http://localhost:9299/integrationtest-index/searchable/_search?q=ref:XXXX2
But this other query responds that the document does not exist:
curl -XPOST http://localhost:9299/integrationtest-index/searchable/_search/exists -d '
{
"query": {
"term" : {
"ref" : "XXXX2"
}
}
}'
Why does the last query say that the document does not exist?
Environment:
ElasticSearch 2.2.0
Ubuntu 16.04 LTS
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14)
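I run into the same problem every few months, so I decided to answer my own question and share my silly mistake.
By default, Elasticsearch indexes string fields as analyzed (index: analyzed), so the term stored in the index is the lowercased token xxxx2. A term query is not analyzed, so it looks for the exact term XXXX2 and does not find any document.
If you use URI Search, Elasticsearch executes a query_string query, which is analyzed, and not a term query.
This query works: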
curl -XPOST http://localhost:9299/integrationtest-index/searchable/_search/exists -d '
{
"query": {
"match" : {
"ref" : "XXXX2"
}
}
}'
More information in the documentation, in the section Why doesn’t the term query match my document?
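The difference between the two queries can be sketched in a few lines of Python. This is a toy inverted index, not the Elasticsearch API; `standard_like_analyzer` is a rough, hypothetical stand-in for the default analyzer (split on whitespace, then lowercase).

```python
# Toy model of an analyzed string field (not the Elasticsearch API).


def standard_like_analyzer(text):
    """Rough stand-in for the default analyzer: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


# Index time: the field value is analyzed before it goes into the inverted index.
inverted_index = {}
for token in standard_like_analyzer("XXXX2"):
    inverted_index.setdefault(token, set()).add("ID_XXXX2")


def term_query(value):
    """term: looks up the value exactly as given, with no analysis."""
    return inverted_index.get(value, set())


def match_query(value):
    """match: analyzes the value first, then looks up each resulting token."""
    hits = set()
    for token in standard_like_analyzer(value):
        hits |= inverted_index.get(token, set())
    return hits


print(term_query("XXXX2"))   # set(): the index holds "xxxx2", not "XXXX2"
print(term_query("xxxx2"))   # {'ID_XXXX2'}
print(match_query("XXXX2"))  # {'ID_XXXX2'}
```

In this model, a term query only succeeds when it is given the already-analyzed form of the value. On a real cluster the alternative is to map the field as not analyzed, which is not shown here.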

Elasticsearch: how do I get metadata using the Image Plugin?

I defined metadata in the mapping of the Elasticsearch Image Plugin.
Mapping:
"photo" : {
"mappings" : {
"scenery" : {
"properties" : {
"my_img" : {
"type" : "image",
"feature" : {"FCTH" : { }, ... },
"metadata" : {
"jpeg.image_height" : {"type" : "string","store" : true},
"jpeg.image_width" : {"type" : "string","store" : true}
}
}
}
}
}
}
After indexing, the metadata is not returned when I search.
How do I get the metadata?
I tried:
curl -XPOST 'localhost:9200/photo/scenery/_search' -d '{
"query":{
"image":{
"my_img":{
"feature":"CEDD",
"index":"photo",
"type":"scenery",
"id":"0",
"path":"my_img",
"hash":"BIT_SAMPLING"
}
}
}
}'
Result:
{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":5,"max_score":1.0,"hits":[{"_index":"photo","_type":"scenery","_id":"0","_score":1.0, "_source" : {"file_name": "376423.jpg", "my_img": "/9j/4AAQSkZJRgABAQ...
The original data (base64-encoded image) is probably what is being returned in the _source field. You can use the fields option instead to get the stored metadata.
Try this query:
curl -XPOST 'localhost:9200/photo/scenery/_search' -d '{
"query":{
...
},
"fields": ["my_img.metadata.jpeg.image_height","my_img.metadata.jpeg.image_width" ]
}'
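Once the fields option is used, the metadata has to be read from each hit's fields entry instead of _source. The Python sketch below assumes that stored field values come back as lists under a `fields` key on each hit. The sample response is hand-written for illustration and is not real plugin output, and `extract_metadata` is a hypothetical helper.

```python
# Sketch: read stored metadata out of a search response that used "fields".
# The sample response below is hand-written for illustration, not real plugin output.

HEIGHT = "my_img.metadata.jpeg.image_height"
WIDTH = "my_img.metadata.jpeg.image_width"


def extract_metadata(response, names=(HEIGHT, WIDTH)):
    """Return {doc id: {field name: first value}} for the requested stored fields."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        fields = hit.get("fields", {})  # absent when nothing was stored for this hit
        result[hit["_id"]] = {
            name: fields[name][0] for name in names if fields.get(name)
        }
    return result


sample_response = {
    "hits": {
        "hits": [
            {"_id": "0", "fields": {HEIGHT: ["480"], WIDTH: ["640"]}},  # made-up values
            {"_id": "1"},  # a hit with no stored metadata
        ]
    }
}

print(extract_metadata(sample_response))
```

Hits without stored metadata yield an empty dictionary, so the caller can tell "no metadata" apart from a missing hit.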

Elasticsearch: search for words containing the '#' character

For example, I am right now searching like this:
http://localhost:9200/posts/post/_search?q=content:%23sachin
But, I am getting all the results with 'sachin' and not '#sachin'. Also, I am writing a regular expression for getting the count of terms. The facet looks like this:
"facets": {
"content": {
"terms": {
"field": "content",
"size": 1000,
"all_terms": false,
"regex": "#sachin",
"regex_flags": [
"DOTALL",
"CASE_INSENSITIVE"
]
}
}
}
This is not returning any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried to escape it with \ and \\, but it did not work. Can anyone help me in this regard?
This article explains how to preserve # and @ using custom analyzers:
https://web.archive.org/web/20160304014858/http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html
curl -XPUT 'http://localhost:9200/twitter' -d '{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"tweet_filter" : {
"type" : "word_delimiter",
"type_table": ["# => ALPHA", "@ => ALPHA"]
}
},
"analyzer" : {
"tweet_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "tweet_filter"]
}
}
}
},
"mappings" : {
"tweet" : {
"properties" : {
"msg" : {
"type" : "string",
"analyzer" : "tweet_analyzer"
}
}
}
}
}'
This doesn't deal with facets, but redefining the type of those special characters in the analyzer could help.
Another approach worth considering is to index a special (e.g. "reserved") word in place of the hash symbol, for example HASHSYMBOLCHAR. Make sure that you replace '#' characters in the query as well.
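The reserved-word approach can be sketched in Python as follows. `standard_like_tokens` is a rough, hypothetical stand-in for the default analyzer (runs of word characters, lowercased), and `encode_hash` is an illustrative helper that would have to be applied to both the indexed text and the query text.

```python
import re

PLACEHOLDER = "HASHSYMBOLCHAR"  # reserved word from the suggestion above


def standard_like_tokens(text):
    """Rough stand-in for the default analyzer: runs of word characters, lowercased."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def encode_hash(text):
    """Replace '#' with the reserved word so it survives tokenization."""
    return text.replace("#", PLACEHOLDER)


# Without encoding, '#' is dropped and "#sachin" is indistinguishable from "sachin".
print(standard_like_tokens("#sachin scored"))               # ['sachin', 'scored']

# With encoding applied first, the hashtag becomes a distinct token.
print(standard_like_tokens(encode_hash("#sachin scored")))  # ['hashsymbolcharsachin', 'scored']
```

The trade-off is that the placeholder must never occur in real content, and every client that writes or queries the field has to apply the same replacement.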

Index fields with hyphens in Elasticsearch

I'm trying to work out how to configure elasticsearch so that I can make query string searches with wildcards on fields that include hyphens.
I have documents that look like this:
{
"tags":[
"deck-clothing-blue",
"crew-clothing",
"medium"
],
"name":"Crew t-shirt navy large",
"description":"This is a t-shirt",
"images":[
{
"id":"ba4a024c96aa6846f289486dfd0223b1",
"type":"Image"
},
{
"id":"ba4a024c96aa6846f289486dfd022503",
"type":"Image"
}
],
"type":"InventoryType",
"header":{
}
}
I have tried to use a word_delimiter filter and a whitespace tokenizer:
{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"tags_filter" : {
"type" : "word_delimiter",
"type_table": ["- => ALPHA"]
}
},
"analyzer" : {
"tags_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["tags_filter"]
}
}
}
},
"mappings" : {
"yacht1" : {
"properties" : {
"tags" : {
"type" : "string",
"analyzer" : "tags_analyzer"
}
}
}
}
}
But these are the searches (for tags) and their results:
deck* -> match
deck-* -> no match
deck-clo* -> no match
Can anyone see where I'm going wrong?
Thanks :)
The analyzer is fine (though I'd lose the filter), but your search analyzer isn't specified, so the standard analyzer is used to search the tags field. It strips out the hyphen and then tries to query against the result (run curl "localhost:9200/_analyze?analyzer=standard" -d "deck-*" to see what I mean).
Basically, "deck-*" is being searched for as "deck *". There is no word that is just "deck", so it fails.
"deck-clo*" is being searched for as "deck clo*". Again, there is no word that is just "deck" or that starts with "clo", so the query fails.
I'd make the following modifications:
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "whitespace",
"filter" : ["lowercase"] // optional: you don't need this, just thought it was a nice touch
}
}
}
Then get rid of the special analyzer on the tags field:
"mappings" : {
"yacht1" : {
"properties" : {
"tags" : {
"type" : "string"
}
}
}
}
Let me know how it goes.
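The explanation above can be reproduced with a toy model in Python. It assumes the tags are indexed whole (hyphens kept) and compares two ways of analyzing the query string. `standard_like_query_terms` and `whitespace_query_terms` are hypothetical stand-ins, not the real analyzers, and the matching uses `fnmatchcase` from the standard library in place of Elasticsearch's wildcard handling.

```python
import re
from fnmatch import fnmatchcase

tags = ["deck-clothing-blue", "crew-clothing", "medium"]
indexed = [t.lower() for t in tags]  # whole tags kept, hyphens included


def standard_like_query_terms(query):
    """Stand-in for standard analysis of the query: hyphens split it into terms."""
    return re.findall(r"\w+\*?", query.lower())


def whitespace_query_terms(query):
    """Stand-in for whitespace analysis of the query: hyphens are preserved."""
    return query.lower().split()


def search(query, query_terms):
    """Return indexed tokens matched by at least one (possibly wildcard) query term."""
    terms = query_terms(query)
    return [tok for tok in indexed if any(fnmatchcase(tok, term) for term in terms)]


for query in ("deck*", "deck-*", "deck-clo*"):
    print(query, search(query, standard_like_query_terms), search(query, whitespace_query_terms))
```

With the standard-like query analysis only `deck*` finds the tag, which mirrors the results in the question. With whitespace analysis on both sides all three queries find `deck-clothing-blue`.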
