Create Custom Analyzer after index has been created - elasticsearch

I am trying to add a custom analyzer.
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_filter" : {
          "type" : "word_delimiter",
          "type_table": [": => ALPHA", "/ => ALPHA"]
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "my_filter"]
        }
      }
    }
  }
}'
It works on my local environment, where I can recreate the index whenever I want. The problem comes when I try to do the same on other environments such as QA or prod, where the index has already been created:
{
"error": "IndexAlreadyExistsException[[my_index] already exists]",
"status": 400
}
How can I add my custom analyzer through the HTTP API?

In the documentation I found that to update index settings I can do this:
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
"index" : {
"number_of_replicas" : 4
}
}'
And to update analyzer settings the documentation says:
"...it is required to close the index first and open it after the changes are made."
So I ended up doing this:
curl -XPOST 'http://localhost:9200/my_index/_close'
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_filter" : {
          "type" : "word_delimiter",
          "type_table": [": => ALPHA", "/ => ALPHA"]
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "my_filter"]
        }
      }
    }
  }
}'
curl -XPOST 'http://localhost:9200/my_index/_open'
Which fixed everything for me.
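As a quick sanity check that the analyzer was registered, you can run it through the _analyze API. This is a sketch assuming a 1.x/2.x cluster, where the analyzer name can be passed as a query parameter and the text as a plain request body (newer versions expect a JSON body); the sample text is just an illustration:

curl -XGET 'http://localhost:9200/my_index/_analyze?analyzer=my_analyzer&pretty' -d 'PowerShot Foo:Bar/Baz'

With the type_table above, : and / are treated as alphabetic characters, so Foo:Bar/Baz should come back as a single lowercased token instead of being split on those characters.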

For folks using the AWS Elasticsearch service, closing and opening an index is not allowed, so they need to reindex instead, as mentioned here.
Basically: create a temp index with all the mappings of the original index, add/modify those mappings and settings (where the analyzers sit), delete the original index, create a new index with that name, and copy all mappings and settings back from the temp index.
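A rough sketch of that flow, assuming a version that supports the _reindex API (2.3+; on older versions the copy has to be done client-side with scroll + bulk) and reusing the analysis settings from above; my_index_tmp is just an example name:

# New analysis settings, used both for the temp index and for the recreated index
SETTINGS='{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_filter" : {
          "type" : "word_delimiter",
          "type_table" : [": => ALPHA", "/ => ALPHA"]
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "my_filter"]
        }
      }
    }
  }
}'

# 1. Create a temp index with the new settings (add your existing mappings here as well)
curl -XPUT 'http://localhost:9200/my_index_tmp' -d "$SETTINGS"

# 2. Copy the documents from the original index into the temp index
curl -XPOST 'http://localhost:9200/_reindex' -d '{
  "source" : { "index" : "my_index" },
  "dest"   : { "index" : "my_index_tmp" }
}'

# 3. Delete and recreate the original index with the same new settings,
#    copy the documents back, then drop the temp index
curl -XDELETE 'http://localhost:9200/my_index'
curl -XPUT 'http://localhost:9200/my_index' -d "$SETTINGS"
curl -XPOST 'http://localhost:9200/_reindex' -d '{
  "source" : { "index" : "my_index_tmp" },
  "dest"   : { "index" : "my_index" }
}'
curl -XDELETE 'http://localhost:9200/my_index_tmp'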

Related

How do I configure Elasticsearch to use the icu_tokenizer?

I'm trying to search text indexed with Elasticsearch and the icu_tokenizer, but can't get it working.
My test case is to tokenize the sentence “Hello. I am from Bangkok”, in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized into the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ. (Sample from Elasticsearch - The Definitive Guide)
Searching using any of the last four words fails for me. Searching using either of the space-separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
  "icu" : {
    "settings" : {
      "index" : {
        "creation_date" : "1474010824865",
        "analysis" : {
          "analyzer" : {
            "nfkc_cf_normalized" : [ "icu_normalizer" ],
            "tokenizer" : "icu_tokenizer"
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "tALRehqIRA6FGPu8iptzww",
        "version" : {
          "created" : "2040099"
        }
      }
    }
  }
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
"first_name" : "John",
"last_name" : "Doe",
"about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
"query" : {
"match" : {
"about" : "กรุงเทพฯ"
}
}
}'
Returns nothing ("hits" : [ ]).
Performing the same search with one of สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index; how should it be done?
The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, the analyzer to use has to be specified on the document field:
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_analyzer" : {
          "char_filter": [
            "icu_normalizer"
          ],
          "tokenizer" : "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "employee" : {
      "properties": {
        "about": {
          "type": "text",
          "analyzer": "icu_analyzer"
        }
      }
    }
  }
}
Test the custom analyzer with the following request:
POST /icu/_analyze
{
"text": "สวัสดี ผมมาจากกรุงเทพฯ",
"analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
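Note that documents indexed before the mapping change are not re-analyzed, so the existing employee document has to be indexed again for the new analyzer to take effect. After reindexing it, the search from the question should return the hit (repeating the original query as a check):

curl -XGET 'http://localhost:9200/_search' -d'
{
  "query" : {
    "match" : {
      "about" : "กรุงเทพฯ"
    }
  }
}'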
My suggestion: the Kibana Dev Tools console can help you craft these queries effectively.

After using the Elasticsearch JDBC Importer, 'asciifolding' is not working as expected

Using the Elasticsearch JDBC importer with this configuration:
bin=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/bin
lib=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/lib
echo '{
"type" : "jdbc",
"jdbc" : {
"url" : "ip/db",
"user" : "myuser",
"password" : "a7sdf7hsdf8hn78df",
"sql" : "SELECT title, body, source_id, time_order, type, blablabla...",
"index" : "importeditems",
"type" : "item",
"elasticsearch.host": "_eth0_",
"detect_json" : false
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
I've indexed some documents correctly with the form:
{
  "title" : "Tiempo de Opinión: Puede comenzar un ciclo",
  "body" : "Sebas Álvaro nos trae cada lunes historias y anécdotas de la montaña<!-- com -->",
  "source_id" : 21188,
  "time_order" : "1438638043:55c2c6bb96d4c",
  "type" : "rss"
}
I'm trying to ignore the accents (for example, opinión in title has an ó), so that if a user searches "tiempo de opinión" or "tiempo de opinion" with a match_phrase, it matches the documents with or without the accent.
So after using the importer and indexing everything, I changed my index settings to a default analyzer with an asciifolding filter:
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
"analysis": {
"analyzer": {
"default": {
"tokenizer" : "standard",
"filter": [ "lowercase", "asciifolding"]
}}}}'
curl -XPOST 'localhost:9200/importeditems/_open'
Then I run a match_phrase to match "tiempo de opinion" (no accent) and "tiempo de opinión" (with accent):
# No accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
"query": {
"match_phrase" : {
"title" : "tiempo de opinion"
}}}'
# With accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
"query": {
"match_phrase" : {
"title" : "tiempo de opinión"
}}}'
But no match is returned, even though matching documents exist (if I match_phrase the phrase tiempo de, it returns some hits containing tiempo de opinión).
I think the problem is due to the JDBC Importer, because when I tried the same thing without the importer, adding another index and entries by hand and changing the index settings to asciifolding as well, everything worked as expected. You can see this working example right here.
If I check the settings of the index created after using the importer (importeditems)
curl -XGET 'localhost:9200/importeditems/_settings?pretty=true'
This outputs:
{
  "importeditems" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457533907278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
... and if I check the settings of the manually created index (test):
curl -XGET 'localhost:9200/test/_settings?pretty=true'
I get the same exact output:
{
  "test" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457603253278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
Can someone please tell me why this is not working when I use the Elasticsearch JDBC Importer, and why it works when I add the raw data myself?
I finally solved the issue by first changing the settings to add the analysis configuration:
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
"analysis": {
"analyzer": {
"default": {
"tokenizer" : "standard",
"filter": [ "lowercase", "asciifolding"]
}}}}'
curl -XPOST 'localhost:9200/importeditems/_open'
... and then importing all the data again.
It's strange, because as I stated in the post, I did exactly the same thing in both cases (with the JDBC Importer and with the raw data):
Index data
Change index settings
Make the query with match_phrase
And it worked with the raw data (test) but not with the index I used the importer with (importeditems). The only explanation I can think of is that importeditems held more than 12 GB of data, and documents that were already indexed are not re-analyzed when the analyzer changes, so the asciifolding filter was not reflected on the existing content right after it was activated.
Anyway, if someone is having the same issue, especially those working with a huge amount of data, remember to set the analyzer first and then index all the data.
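Before re-importing, a quick way to confirm the new default analyzer actually folds accents is the _analyze API. This is a sketch assuming a 1.x/2.x cluster, where the analyzer can be passed as a query parameter and the text as a plain request body; the sample text is just the title from the question:

curl -XGET 'localhost:9200/importeditems/_analyze?analyzer=default&pretty' -d 'Tiempo de Opinión'

If the settings took effect, the returned tokens should be tiempo, de and opinion, without the accent.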
According to the docs:
Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time, and to the query string at search time so that the terms in the query match the terms in the inverted index.

set default analyzer of index

First I wanted to set the default analyzer of ES, and failed. Then, following other questions and websites, I'm trying to set the default analyzer of a single index, but there are some problems there too.
I have configured the ik analyzer, and I can set the analyzer on individual fields; here is my command:
curl -XPUT localhost:9200/test
curl -XPUT localhost:9200/test/test/_mapping -d'{
"test":{
"properties":{
"name":{
"type":"string",
"analyzer":"ik"
}
}
}
}'
and get the message:
{"acknowledged":true}
Also, it works as I expect.
But if I try to set the default analyzer of the index:
curl -XPOST localhost:9200/test1?pretty -d '{ "index":{
"analysis" : {
"analyzer" : {
"default" : {
"type" : "ik"
}
}
}
}
}'
I will get error message:
{
"error" : {
"root_cause" : [ {
"type" : "index_creation_exception",
"reason" : "failed to create index"
} ],
"type" : "illegal_argument_exception",
"reason" : "no default analyzer configured"
},
"status" : 400
}
So strange, isn't it?
Looking forward to your opinions about this problem. Thanks! :)
You're almost there; you're simply missing /_settings in your path. Do it like this instead. Also note that you need to close the index first and then reopen it after updating the analyzers.
// close index
curl -XPOST 'localhost:9200/test1/_close'
add this to the path
|
v
curl -XPUT localhost:9200/test1/_settings?pretty -d '{ "index":{
"analysis" : {
"analyzer" : {
"default" : {
"type" : "ik"
}
}
}
}
}'
// re-open index
curl -XPOST 'localhost:9200/test1/_open'

WordNet integrated with ElasticSearch - How to add new synonyms

I work with ElasticSearch version 1.2.3
I've integrated WordNet 3.0 as a Synonym database for ElasticSearch Synonyms Analyzer. (Full WordNet install: configure, make, make install)
I've added the following code to the ElasticSearch index settings (the index name is local_es)
curl -XPUT 'localhost:9200/local_es/_settings' -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "synonym" : {
          "tokenizer" : "lowercase",
          "filter" : ["synonym"]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "format": "wordnet",
          "synonyms_path": "analysis/wn_s.pl"
        }
      }
    }
  }
}'
I've also updated the mapping with the following code:
curl -XPUT 'localhost:9200/local_es/shadowpage/_mapping' -d '{
"shadowpage" : {
"shadowPageName" : {
"enabled" : true,
"analyzer" : "synonym"
},
"properties" : {
"name" : { "type" : "string", "index" : "analyzed", "analyzer" : "synonym" }
}
}
}'
All is working as expected.
As you can see, ElasticSearch takes its data from the file path of analysis/wn_s.pl
The wn_s.pl file is a WordNet Prolog file that contains all the database synonyms.
How can I add new synonyms to the database?
Do I add them directly to the WordNet database, or to the wn_s.pl file?
If you are going to be actively modifying your synonym database, you should probably just transform the synsets in the WordNet database into a basic comma-delimited file in this format:
"british,english",
"queen,monarch"
Then use and edit this file as your synonym resource.
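As a sketch of what that looks like, assuming a hypothetical analysis/synonyms.txt file in that plain format (the index has to be closed while the filter definition is changed, and reopened afterwards):

curl -XPOST 'localhost:9200/local_es/_close'
curl -XPUT 'localhost:9200/local_es/_settings' -d '{
  "analysis" : {
    "filter" : {
      "synonym" : {
        "type" : "synonym",
        "synonyms_path" : "analysis/synonyms.txt"
      }
    }
  }
}'
curl -XPOST 'localhost:9200/local_es/_open'

Omitting "format" makes Elasticsearch treat the file as the plain comma-delimited synonym format. Since the synonym filter is applied at index time here, documents indexed before a synonym change need to be reindexed to pick up the new synonyms.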

elasticsearch percolator stemmer

I'm attempting to use the percolation function in elasticsearch. It works great but out of the box there is no stemming to handle singular/plurals etc. The documentation is rather thin on this topic so I was wondering if anyone has gotten this working and what settings are required. At the moment I'm not indexing my documents since I'm not searching them, just passing them through the percolator to trigger notifications.
You can use the percolate API to test documents against percolators without indexing them. However, the percolate API requires an index and a type for your doc. This is so that it knows how each field in your document is defined (or mapped).
Analyzers belong to an index, and the fields in a mapping/type definition can use either globally defined analyzers, or custom analyzers defined for your index.
For instance, we could define a mapping for index test, type test using a globally defined analyzer as follows:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "mappings" : {
    "test" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "analyzer" : "english"
        }
      }
    }
  }
}
'
Or alternatively, you could set up a custom analyzer that belongs just to the test index:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "mappings" : {
    "test" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "analyzer" : "my_english"
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_english" : {
          "stopwords" : [],
          "type" : "english"
        }
      }
    }
  }
}
'
Now we can create our percolator, specifying which index it belongs to:
curl -XPUT 'http://127.0.0.1:9200/_percolator/test/english?pretty=1' -d '
{
  "query" : {
    "match" : {
      "title" : "singular"
    }
  }
}
'
And test it out with the percolate API, again specifying the index and the type:
curl -XGET 'http://127.0.0.1:9200/test/test/_percolate?pretty=1' -d '
{
  "doc" : {
    "title" : "singulars"
  }
}
'
# {
#   "ok" : true,
#   "matches" : [
#     "english"
#   ]
# }
