How to remove a stop word from the default _english_ stop-words list in Elasticsearch? - elasticsearch

I am filtering text using the default English stop words. I found that 'and' is an English stop word, but I need to search for results containing 'and'. I just want to remove the word 'and' from the default English stop-words filter and keep using the other stop words as usual. My Elasticsearch schema looks similar to the one below.
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect the _search API to return the docs containing the word 'and'.

You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you manually set that same list of stop words minus "and" in an index and reindex your data into the newly configured index, you should be good to go!
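For illustration, here is a hedged sketch of that idea: the index and filter names are made up, and the word list below is the Lucene/Elasticsearch English stop set with "and" removed (double-check it against the list linked above for your version):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": ["stop_english_minus_and"]
        }
      },
      "filter": {
        "stop_english_minus_and": {
          "type": "stop",
          "stopwords": [
            "a", "an", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it", "no", "not",
            "of", "on", "or", "such", "that", "the", "their",
            "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}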
Regarding reindexing your data, you should check out the Reindex API. I believe it is required, since the tokenization of your data happens at index time, so you need to redo the ingestion by reindexing into the newly configured index. This is needed most of the time when changing analysis settings or certain mappings (not 100% sure, but I think it makes sense).
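A minimal sketch of that reindex call, assuming the old index is called old_index and the newly configured one my_index (both names are placeholders):
POST /_reindex
{
  "source": { "index": "old_index" },
  "dest": { "index": "my_index" }
}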

Related

What keywords are used by the Swedish analyzer?

On this part of the elasticsearch docs, it says that the Swedish analyzer can be reimplemented like this:
PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type": "stop",
          "stopwords": "_swedish_"
        },
        "swedish_keywords": {
          "type": "keyword_marker",
          "keywords": ["exempel"]
        },
        "swedish_stemmer": {
          "type": "stemmer",
          "language": "swedish"
        }
      },
      "analyzer": {
        "swedish": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
My question is, how does this analyser recognise keywords? Sure, the keywords can be defined in the settings.analysis.filter.swedish_keywords.keywords field, but what if I'm too lazy to do that? Does Elasticsearch look at some other keywords list of pre-defined Swedish keywords? Because in the example above it looks like there is no such list provided in the settings.
In other words, is it solely up to me to define keywords or does Elasticsearch look at some other list to find keywords by default?
Yes, you need to specify this list yourself. Otherwise, this filter won't do anything.
As per documentation of Elasticsearch:
Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before
any stemming filters.
Alternatively, you could specify:
keywords_path
A path (either relative to config location, or absolute) to a list of
words.
keywords_pattern
A regular expression pattern to match against words in the text.
More information about this filter - https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
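For illustration, hedged sketches of both alternatives applied to the swedish_keywords filter above (the file path and the regex pattern are made up):
"swedish_keywords": {
  "type": "keyword_marker",
  "keywords_path": "analysis/swedish_keywords.txt"
}
or
"swedish_keywords": {
  "type": "keyword_marker",
  "keywords_pattern": "^exempel.*$"
}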

How to handle auto completion on multi word text?

My input is a multi-word English text and I have the requirement to implement an autocompletion feature for that text.
I initially looked at the completion suggester, only to figure out that it can only match the first characters of the input. This is fine for autocompletion of product names or addresses, but not very useful when you need autocompletion on any word in the input text.
After that I set up an edge_ngram analyzer and a query to locate the documents which contain the input string. That works just fine, but I don't know how to use this information to provide options for my autocompletion.
I could use a highlighter to show the words which match the query. That data could in turn be used to build a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved?
I'm unfortunately not able to maintain another field which could include the auto completion options for the documents.
I'm currently using highlight information of the query in order to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": ["uuid", "language"]
}
My auto field uses the autocomplete analyzer.
"auto": {
"type": "string",
"analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all found tokens. These tokens are then used to construct the potential autocomplete phrases, also comparing them against the phrase that the user has already entered. The neat thing is that a stop-word filter can be applied, so stop words will never be highlighted and in turn will never be used for autocomplete suggestions.
Proof-of-concept Java code for this processor can be found here.
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
I think your best option is to create a dedicated index for storing just the suggestions using the edge_ngram analyzer. If you use the completion suggesters you need to explicitly define your actual suggestions anyway. The completion suggester is also document centric in ES 5.x so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but that has only just been released.
If you have a dedicated suggestion index you can use a hash of the suggestion as a document ID to avoid duplicates. You can start indexing document titles and other useful meta data as suggestions. Later on you could include historical searches entered by users that are seen as successful due to the user ultimately clicking on or purchasing the returned results.
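A rough sketch of such a dedicated suggestion index (all names are illustrative, the mapping syntax assumes ES 5.x, and the document ID is meant to be a hash of the suggestion text computed by your application):
PUT /suggestions
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "suggestion": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

PUT /suggestions/suggestion/<hash-of-suggestion-text>
{ "text": "some document title" }
Indexing the same suggestion twice then simply overwrites the existing document instead of creating a duplicate.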

Tokenize a big word into combination of words

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter filter, but neither seems to solve my problem. Basically I want to be able to break a concatenated word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html
I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
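A minimal sketch of that idea, with made-up index, filter and analyzer names (check the synonym token filter docs for the exact syntax your version supports):
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "superbowl, super bowl, s bowl"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
A field analyzed with synonym_analyzer would then treat superbowl, super bowl and s bowl as equivalent terms.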
There are ways to do this without changing what you actually index. For example, if you are on at least 5.2 (where normalizers were introduced; earlier versions can also work, but 5.x makes it easier), you can define a normalizer that lowercases your text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. My solution, though, is specific to the example you have given. As is usually the case with Elasticsearch, you need to think about what kind of data goes into Elasticsearch and what is required at search time.
In any case, if you are interested in an approach here it is:
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}
POST test/test/1
{"title":"Super Bowl"}
GET /test/_search
{
  "query": {
    "fuzzy": {
      "title.keyword": "superbowl"
    }
  }
}

Is Simple Query Search compatible with shingles?

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
  "text_2": {
    "type": "string",
    "analyzer": "shingle_analyzer"
  }
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": True
}
}
I am performing the following search:
{
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "analyzer": "shingle_analyzer",
            "fields": [
              "text_2"
            ],
            "lenient": "false",
            "default_operator": "and",
            "query": "porsches small red"
          }
        }
      ]
    }
  }
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document NOT to match, since the above query should produce a shingle of "porsches small red", which is in a different order. However, when I look at the match explanation I only see the single-word tokens "red", "small", "porsches", which of course match.
Is SQS incompatible with shingles?
The answer is "Yes, but...".
What you're seeing is normal given the fact that the text_2 field probably has the standard index analyzer in your mapping (according to the explanation you're seeing), i.e. the only tokens that have been produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (default), which means that the unigram tokens will also be produced in addition to the bigrams (again according to the explanation you're seeing). Those unigrams are the only reason why you get matches at all. If you want to match on bigrams, then one solution is to use the shingle analyzer at indexing time, too, so that bigrams small red and red porsches can be produced and indexed as well in addition to the unigrams small, red and porsches.
Then at query time, the unigrams would match as well but small red bigram would definitely match, too. In order to only match on the bigrams, you can have another shingle analyzer just for query time whose output_unigrams is set to false, so that only bigrams get generated out of your search input. And in case your query only contains one single word (e.g. porsches), then that shingle analyzer would only generate a single unigram (because output_unigrams_if_no_shingles is true) and the query would still match your document. If that's not desired you can simply set output_unigrams_if_no_shingles to false in your shingle search analyzer.
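For illustration, a hedged sketch of that setup with separate index-time and search-time shingle analyzers (index, type and analyzer names are made up, and the mapping follows the "string"-field style used in the question):
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": true
        },
        "filter_shingle_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 5,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "shingle_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "filter_shingle"]
        },
        "shingle_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "filter_shingle_search"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text_2": {
          "type": "string",
          "analyzer": "shingle_index_analyzer",
          "search_analyzer": "shingle_search_analyzer"
        }
      }
    }
  }
}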

Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for the Brazilian Portuguese language) when I index documents in Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, as I've copied this by hand from my code on the server):
{
  "analysis": {
    "filter": {
      "my_asciifolding": {
        "type": "asciifolding",
        "preserve_original": true
      },
      "stop_pt": {
        "type": "stop",
        "ignore_case": true,
        "stopwords": "_brazilian_"
      },
      "stemmer_pt": {
        "type": "stemmer",
        "language": "brazilian"
      }
    },
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_asciifolding",
          "stop_pt",
          "stemmer_pt"
        ]
      }
    }
  }
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared as "type": "long"), so I expect most fields to be using the default analyzer specified above.
This works as expected, but the thing is that some users are frustrated because (since tokens are being stemmed) the query "vulnerabilities" and the query "vulnerable" return the same results, which is misleading, because they expect the results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.
I ended up using the "fields" property (multi-fields) to index my attributes in different ways. Not sure whether this is optimal, but this is how I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters that the "default" analyzer has, minus "stemmer", as sketched below.
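A sketch of what that could look like, reusing the filters from the default analyzer shown in the question but leaving out stemmer_pt:
"analyzer": {
  "no_stem_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "my_asciifolding",
      "stop_pt"
    ]
  }
}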
For each attribute where I want to keep both the non-stemmed and stemmed variants, I did the following (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using a query_string query) I also have to indicate, via the "fields" parameter, that I want to search all sub-fields (e.g. "DESCRIPTION.*").
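For illustration, a query along those lines ("vulnerable" is just an example search term):
{
  "query": {
    "query_string": {
      "query": "vulnerable",
      "fields": ["DESCRIPTION.*"]
    }
  }
}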
I also based my approach on this answer: elasticsearch customize score for synonyms/stemming.
