Copy analyzed text into another field - elasticsearch

We are trying to build a word cloud over four text fields. Each field has its own stop analyzer.
For example, TextFr has a French stop analyzer and TextDe a German stop analyzer. The analyzed result should be copied into another field called WordCloudText, on which the aggregation takes place.
Do you have any advice on how to do this? Is this even possible?
Thanks for your help

I don't think there is a way to copy the analyzed output of a field, only the original (unanalyzed) values. Probably the easiest way to achieve this is to define your own analyzer that filters the stop words of all four languages. Something like this:
PUT stackoverflow
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
}
},
"analyzer": {
"eng_stop": {
"type": "stop",
"stopwords": "_english_"
},
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"all_lang_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop",
"dutch_stop"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"field": {
"type": "keyword",
"fields": {
"english": {"type": "text", "analyzer": "eng_stop" },
"dutch": {"type": "text", "analyzer": "dutch_stop" },
"word_cloud": {"type": "text", "analyzer": "all_lang_stop"}
}
}
}
}
}
}
The key is the custom analyzer called all_lang_stop, which combines multiple stop filters. Then you use a multi-field to have your data automatically copied into each field with its own stop analyzer.
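If you want to sanity-check that analyzer, the _analyze API shows the tokens it actually emits. A minimal sketch, assuming the stackoverflow index from the request above has been created; the expected token list is my assumption based on the standard _english_ and _dutch_ stop word sets:
# Inspect what all_lang_stop produces for mixed English/Dutch input.
# "the" and "and" should be dropped by english_stop, "de" by dutch_stop.
GET stackoverflow/_analyze
{
  "analyzer": "all_lang_stop",
  "text": "the quick fox and de snelle vos"
}
If everything is wired up correctly, only quick, fox, snelle and vos should come back as tokens.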
Alternatively, if your text is already separated into different fields by language, you can use the copy_to directive on each individual language field to copy it into the word_cloud field. Note that copy_to copies the input value, not the output value of the analyzer, so you still need the combined analyzer. Something like this:
"mappings": {
"record": {
"properties": {
"english": {"type": "text", "analyzer": "eng_stop", copy_to: "word_cloud"},
"dutch": {"type": "text", "analyzer": "dutch_stop", copy_to: "word_cloud"},
"word_cloud": {"type": "text", "analyzer": "all_lang_stop"}
}
}
}
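For the word cloud itself you would then run a terms aggregation on the word_cloud field. A rough sketch, assuming the copy_to mapping above; note that a terms aggregation on a text field only works if fielddata is enabled on it (it is disabled by default and has a memory cost), so treat this as an illustration rather than a production recipe:
# Word-cloud aggregation over the analyzed tokens.
# Requires "fielddata": true on the word_cloud text field.
GET stackoverflow/_search
{
  "size": 0,
  "aggs": {
    "word_cloud": {
      "terms": {
        "field": "word_cloud",
        "size": 100
      }
    }
  }
}
With the multi-field mapping from the first example, the field name would be field.word_cloud instead.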

Related

Achieving literal text search combined with subword matching in Elasticsearch

I have populated an Elasticsearch database using the following settings:
mapping = {
"properties": {
"location": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"description": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"commentaar": {
"type": "text",
"analyzer": "ngram_analyzer"
},
}
}
settings = {
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {"custom_tool": mapping}
}
I used the ngram analyzer because I wanted to be able to have subword matching. So a search for "ackoverfl" would return the entries containing "stackoverflow".
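To see where the subword matching comes from, you can run the analyzer by hand. A small sketch, assuming the index is called my_index (the question does not name it) and a recent Elasticsearch version where _analyze accepts a JSON body:
# Hypothetical index name; shows the grams ngram_analyzer emits.
# For "ap" the filter (min_gram 1, max_gram 20) produces: a, ap, p
GET my_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "ap"
}
A longer token like stackoverflow is indexed as every substring of up to 20 characters, which is why "ackoverfl" matches it.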
My search queries are made as follows:
q = {
"simple_query_string": {
"query": needle,
"default_operator": "and",
"analyzer": "whitespace"
}
}
Where needle is the text from my search bar.
Sometimes I would also like to do literal phrase searching. For example:
If my search term is:
"the ap hangs in the tree"
(Notice that I use quotation marks here with the intention of searching for a literal piece of text.)
Then in my results I get a document containing:
the apple hangs in the tree
This result is unwanted.
How could I implement subword matching while also having the option to search for literal phrases (for example by using quotation marks)?

Custom stopword analyzer is not working properly

I have created an index with a custom analyzer for stop words. I want Elasticsearch to ignore these words at search time. Then I added one document to the index.
But when I query in Kibana for the keyword "the", it should not return any match, because in my_analyzer I have put "the" in the my_stop_word section. Yet it returns a match. I have read that if you specify an analyzer for a field at indexing time in the mapping, it is also used by default at query time.
Please help!
PUT /pandey
{
"settings":
{
"analysis":
{
"analyzer":
{
"my_analyzer":
{
"tokenizer": "standard",
"filter": [
"my_stemmer",
"english_stop",
"my_stop_word",
"lowercase"
]
}
},
"filter": {
"my_stemmer": {
"type": "stemmer",
"name": "english"
},
"english_stop":{
"type": "stop",
"stopwords": "_english_"
},
"my_stop_word": {
"type": "stop",
"stopwords": ["robot", "love", "affection", "play", "the"]
}
}
}
},
"mappings": {
"properties": {
"dialog": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
PUT pandey/_doc/1
{
"dailog" : "the boy is a robot. he is in love. i play cricket"
}
GET pandey/_search
{
"query": {
"match": {
"dailog": "the"
}
}
}
A small spelling mistake can lead to this.
You defined the mapping for dialog but added the document with the field name dailog. Elasticsearch's dynamic field mapping behaviour will index it without error (we can disable it, though, as shown below).
So the query "dailog": "the" returns a match because the dailog field was indexed with the default analyzer, not my_analyzer.
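The "we can disable it" part refers to the dynamic mapping setting. A sketch of one way to do that on the same index, so a misspelled field name is rejected instead of silently indexed:
# With "dynamic": "strict", indexing a document with an unmapped field
# such as "dailog" fails with a strict_dynamic_mapping_exception.
PUT pandey/_mapping
{
  "dynamic": "strict"
}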

Language analyzer doesn't find singular results

I have a bunch of categories with translations in my category field. I have defined language analyzers for the fields in my index so I can search for them. But it doesn't find the singular version of my words: wasmachine in titles.title-nl is the singular of wasmachines but is not found. What am I missing?
Demo document
"_source" : {
"google_id" : 2706,
"titles" : [
{
"title-en" : "laundry appliances",
"title-de" : "waschen & trocknen",
"title-fr" : "appareils de blanchisserie",
"title-nl" : "wasmachines"
}
]
}
The way I mapped them
PUT categories/_mapping/category
{
"dynamic": false,
"properties": {
"titles.title-nl": {
"type": "text",
"analyzer": "dutch"
},
"titles.title-en": {
"type": "text",
"analyzer": "english"
},
"titles.title-de": {
"type": "text",
"analyzer": "german"
},
"titles.title-fr": {
"type": "text",
"analyzer": "french"
}
}
}
The way I search for them
GET categories/_search
{
"size": 4,
"query": {
"multi_match": {
"query": "wasmachines",
"fields": ["titles.title-de","titles.title-en", "titles.title-fr", "titles.title-nl"]
}
}
}
The problem is that the default dutch analyzer doesn't know how to stem the word wasmachines; you will need to recreate your index with a custom analyzer that uses a stemmer_override.
Following the Elasticsearch documentation, you can rebuild the dutch analyzer and add an override so that the singular and the plural end up as the same term: put wasmachine => wasmachines inside the rules for the stemmer_override.
PUT categories/
{
"settings": {
"analysis": {
"filter": {
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"dutch_keywords": {
"type": "keyword_marker",
"keywords": ["voorbeeld"]
},
"dutch_stemmer": {
"type": "stemmer",
"language": "dutch"
},
"dutch_override": {
"type": "stemmer_override",
"rules": [
"fiets=>fiets",
"bromfiets=>bromfiets",
"wasmachine=>wasmachines",
"ei=>eier",
"kind=>kinder"
]
}
},
"analyzer": {
"rebuilt_dutch": {
"tokenizer": "standard",
"filter": [
"lowercase",
"dutch_stop",
"dutch_keywords",
"dutch_override",
"dutch_stemmer"
]
}
}
}
}
}
You will also need to use that new analyzer in your mapping:
PUT categories/_mapping/category
{
"dynamic": false,
"properties": {
"titles.title-nl": {
"type": "text",
"analyzer": "rebuilt_dutch"
},
"titles.title-en": {
"type": "text",
"analyzer": "english"
},
"titles.title-de": {
"type": "text",
"analyzer": "german"
},
"titles.title-fr": {
"type": "text",
"analyzer": "french"
}
}
}
After that you will be able to search for wasmachine and get the documents that have wasmachines.
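To verify the override, you can analyze both forms and compare the output. A quick sketch, assuming the index was recreated as above; the expected behaviour is my reading of how stemmer_override works (it rewrites the token and protects it from the stemmer), not something stated in the original answer:
# The override maps "wasmachine" to "wasmachines" and shields it from the
# stemmer, so the singular query and the plural in the document should
# end up as the same term.
GET categories/_analyze
{
  "analyzer": "rebuilt_dutch",
  "text": "wasmachine wasmachines"
}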

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" using only the Latin alphabet ("Klaipeda"), in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" - works, but "Klaipeda", "Klaipedos", "Klaipedoje" - not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What am I doing wrong? Thanks for your help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't search against it - you can only sort and aggregate on name.folded.
To work around this I've come up with the following set-up:
Separate fields (to eliminate duplicates, just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded",
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom as described here, because otherwise the asciifolding filter does not make it into the config. More importantly, asciifolding should come after the Lithuanian stop-word and stemming filters, because after folding a word can lose its intended sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}
Sorry, I've left out lithuanian_keywords - it requires additional set-up, which I've omitted here - but I hope you get the idea.
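A convenient way to verify the folding is again the _analyze API. A sketch, assuming the my_cities index from the request above and a version where _analyze accepts a JSON body (on older releases the analyzer and text go in the query string):
# Compare the tokens for the accented and the plain spelling; if the
# set-up works as intended, both requests should end in the same
# folded token.
GET my_cities/_analyze
{
  "analyzer": "md_folded_analyzer",
  "text": "Klaipėda"
}
GET my_cities/_analyze
{
  "analyzer": "md_folded_analyzer",
  "text": "Klaipeda"
}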

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any result. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz, and tried to search by sub-words again, and it worked.
To try to understand what is going on, I checked the terms generated for my documents using the _termvector service, and the result was identical for both the underscore-separated and the dash-separated sub-words, so I would expect the search results to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
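Besides _termvector, the _analyze API is a quick way to compare what the analyzer does with each separator. A sketch, assuming the index is called alerts (the question does not name it) and a version where _analyze accepts a JSON body:
# The letter tokenizer splits on any non-letter character, so both
# "abc_xyz" and "abc-xyz" are expected to produce the tokens abc, xyz.
GET alerts/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc_xyz"
}
GET alerts/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc-xyz"
}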
