Stemming in Elasticsearch replacing the original string

I used the following settings to create my ES index.
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
I noticed that, while analyzing, the stemmer replaces the original token with the stemmed word. Is there a way to index both the original string and the stemmed token?

Your question is essentially about a "preserve_original" parameter for the stemmer token filter.
You will find "preserve_original" on, e.g., the Word Delimiter Token Filter, but not on the stemmer token filter.
If you need the original word, e.g. for aggregations, you can copy the field to another one with a suitable analyzer (see the sketch below).
If you need the original at the same position in your index, you have to wrap the stemmer and build your own analyzer as a plugin.
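For the copy-the-field approach, a minimal sketch of a multi-field mapping could look like the following. The index, type and field names (my_index, my_type, title, title.stemmed) are hypothetical, and the syntax follows the older string-type mappings used elsewhere on this page; the main field keeps the unstemmed tokens and the stemmed sub-field uses your stemming analyzer:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "stemmed": {
              "type": "string",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
Aggregations and exact matches can then run against title, while full-text queries that should match stemmed forms go against title.stemmed.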

Related

How to build an Elasticsearch phrase query that matches text with special characters?

During the last few days I've been playing around with Elasticsearch indexing and searching, and I've been able to build the different queries that I intended to. My problem right now is building a query that can match text with special characters even if I don't type them in the "search bar". I'll give an example to easily explain what I mean.
Imagine you have a document indexed that contains a field called page content. Inside this field, you can have a part of the text such as
"O carro do João é preto." (means João's car is black in portuguese)
What I want to be able to do is type something like:
O carro do joao e preto
and still be able to get the proper match.
What I've tried so far:
I've been using the match phrase query provided in the documentation of elasticsearch (here) such as the example below:
GET _search
{
  "query": {
    "match_phrase": {
      "page content": {
        "query": "o carro do joao e preto"
      }
    }
  }
}
The result of this query gives me 0 hits. Which is perfectly acceptable given that the provided content of the query is different from what has been stored in that document.
I've tried setting up the ASCII Folding Token Filter (here) but I'm not sure how to use it. So what I've basically done is create a new index with this query:
PUT /newindex '
{
  "page content": "O carro do João é preto",
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["standard", "my_ascii_folding"]
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}'
Then if I try to query, using the match_phrase query provided above, like this:
O carro do joao e preto
it should show me the correct result, as I wanted. But the thing is, it isn't working for me. Am I forgetting something? I've been at this for the last two days without success and I feel like there's something I'm missing.
So question: What do I have to do to get the desired matching?
I managed to find the answer to my own question. I had to change the analyzer a little bit when I created the index. Further details in this previous answer:
My code now:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "asciifolding"]
        },
        "text": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"],
          "char_filter": "html_strip"
        },
        "sortable": {
          "tokenizer": "keyword",
          "filter": ["lowercase"],
          "char_filter": "html_strip"
        }
      }
    }
  }
}
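With that default analyzer (lowercase plus asciifolding) applied at both index and search time, the original match_phrase query for "o carro do joao e preto" should now match. A quick way to verify the analysis, assuming the index is called newindex, is the _analyze API (newer ES versions accept this request-body form; older ones take analyzer and text as query-string parameters):
GET /newindex/_analyze
{
  "analyzer": "default",
  "text": "O carro do João é preto"
}
This should return the folded, lowercased tokens o, carro, do, joao, e, preto, which is exactly what the unaccented query produces as well.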

How can I index a field using two different analyzers in Elasticsearch

Say that I have a field "productTitle" which I want to use for my users to search for products.
I also want to apply autocomplete functionality. So I'm using an autocomplete_analyzer with the following filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
However, when users actually run a search I don't want the edge_ngram to be applied, since it produces a lot of irrelevant results.
For example, when users want to search for "mi" and start typing "m", "mi"..., they should get results starting with "m", "mi" as autocomplete options. However, when they actually submit the query, they should only get results containing the word "mi". Currently they also see results with "mini", etc.
Therefore, is it possible to have "productTitle" indexed using two different analyzers? Is multi-field type an option for me?
EDIT: Mapping for productTitle
"productTitle" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
,
"second" analyzer
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
}
So when I'm querying with:
"filtered" : {
"query" : {
"match" : {
"productTitle" : {
"query" : "mi",
"type" : "boolean",
"minimum_should_match" : "2<75%"
}
}
}
}
I also get results like "mini", but I need to get only results containing just "mi".
Thank you
Hmm... as far as I know, there is no way to apply multiple analyzers to the same field directly. What you can do is use "multi-fields".
Here is an example of how to apply different analyzers to "sub-fields":
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers
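A minimal sketch of that multi-fields approach for your case could look like the following; the sub-field name autocomplete is hypothetical, and the "second" analyzer is the one from your question:
"productTitle": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "autocomplete": {
      "type": "string",
      "index_analyzer": "second",
      "search_analyzer": "standard"
    },
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
You would then query productTitle.autocomplete while suggesting completions as the user types, and plain productTitle for the final search.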
The correct way of preventing what you describe in your question is to specify both an index-time analyzer and a search_analyzer in your field mapping, like this:
"productTitle": {
"type": "string",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
}
The autocomplete analyzer will kick in at indexing time and tokenize your title according to your edge_ngram configuration, and the standard analyzer will kick in at search time without applying the edge_ngram stuff.
In this context, there is no need for multi-fields unless you need to tokenize the productTitle field in different ways.
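For completeness, here is a sketch of how the autocomplete_analyzer referenced above could be defined together with that mapping. The index and type names (products, product) are hypothetical, and the syntax is the older string-type style used in the question:
PUT products
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "productTitle": {
          "type": "string",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}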

Elasticsearch: How to find phrases if the query has no spaces

For example, I have a document with a phrase "Star wars" in the name field.
I would like to make a search with DSL and query "starwars" and get this document.
I am trying to get something like this
GET _search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "starwars"
      }
    }
  }
}
How can I do this with Elasticsearch?
I think you would need to update the analyzer on that name field with a custom analyzer that includes the synonym token filter with a synonym for starwars.
The docs on creating a custom analyzer should help you out. Additionally, the standard analyzer is applied by default if you did not specify any analyzer for the name field in your mapping. You can base your custom analyzer on that and add the synonym token filter to its array of filters. Give some thought, too, to how you want the content to be analyzed for your other requirements as well as this one.
With this analyzer update you should be able to use that query and get the result you expect.
Example:
{
  "filter": {
    "my_synonym": {
      "type": "synonym",
      "synonyms": [
        "star wars => starwars"
      ]
    }
  },
  "analyzer": {
    "standard_with_synonym": {
      "tokenizer": "standard",
      "filter": ["standard", "lowercase", "my_synonym", "stop"]
    }
  }
}
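A sketch of how that snippet could be wired into an index and applied to the name field follows; the index and type names (films, film) and the explicit mapping are assumptions, not part of the original answer:
PUT films
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["star wars => starwars"]
        }
      },
      "analyzer": {
        "standard_with_synonym": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_synonym", "stop"]
        }
      }
    }
  },
  "mappings": {
    "film": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard_with_synonym"
        }
      }
    }
  }
}
After reindexing, the match_phrase query for "starwars" should find the "Star wars" document, because the synonym filter rewrites the two tokens star wars into the single token starwars at index time.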

Indexing a comma-separated value field in Elasticsearch

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta-tags, some of them containing comma-separated lists of IDs (which I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer which splits the indexed text into tokens on commas.
Then you can try the search. If you don't care about relevance, you can use a filter to search through your documents; my example shows how to search with a term filter.
Below you can find how to do this with the Sense plugin.
DELETE testindex

PUT testindex
{
  "index": {
    "analysis": {
      "tokenizer": {
        "comma": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma": {
          "type": "custom",
          "tokenizer": "comma"
        }
      }
    }
  }
}
PUT /testindex/_mapping/yourtype
{
  "properties": {
    "contentType": {
      "type": "string",
      "analyzer": "comma"
    }
  }
}
PUT /testindex/yourtype/1
{
  "contentType": "1,2,3"
}

PUT /testindex/yourtype/2
{
  "contentType": "3,4"
}

PUT /testindex/yourtype/3
{
  "contentType": "1,6"
}
GET /testindex/_search
{
  "query": {"match_all": {}}
}

GET /testindex/_search
{
  "filter": {
    "term": {
      "contentType": "6"
    }
  }
}
Hope it helps.
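In more recent Elasticsearch versions, the char_group tokenizer is another way to split on commas without a regex-based pattern tokenizer; the _analyze call below shows it splitting on whitespace, hyphens, newlines and commas: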
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n",
      ","
    ]
  },
  "text": "QUICK,brown, fox"
}

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and PUT an analyzer that reverses a string (say, reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like this (create a type and append the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field which stores the string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be two fields: phone_no: 911220 and phone_no.reverse: 022119, so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
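For instance, a query against both fields could look like the sketch below; the multi_match query, the index name my_index and the sample value are my assumptions, not part of the original answer:
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": ["phone_number", "phone_number.reverse"]
    }
  }
}
Each field analyzes the query text with its own search analyzer, so the reverse sub-field compares reversed digits against reversed digits while the main field matches the original form.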
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin, supported by the Elasticsearch team, that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it captures the numeric characters in groups, then explicitly rebuilds the string followed by its own (numeric-only!) reverse. The space in the middle of the replacement pattern then causes the tokenizer to split the result into two tokens: the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
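A rough way to see the effect is an _analyze call with the char_filters defined inline (supported in newer ES versions). This is only a sketch: it assumes a whitespace tokenizer, since the keyword tokenizer from the earlier analyzer would not split on the space the replacement introduces:
POST _analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[^0-9]",
      "replacement": ""
    },
    {
      "type": "pattern_replace",
      "pattern": "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)",
      "replacement": "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
    }
  ],
  "tokenizer": "whitespace",
  "text": "abc123"
}
With that setup the input abc123 should come out as the two tokens 123 and 321.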
