How to use a custom analyser on specific elasticsearch documents - elasticsearch

Suppose I have a custom analyser that I want to use only on documents whose entity_type is table. How would I go about that?
Document I want to match:
{
  ... other keys
  "_source": {
    "entity_type": "table" // <-- I want to match this and use the custom analyser on this entire document
  }
}
Custom analyser (currently just set as the default, but I want it to affect only tables):
elasticsearch.indices.create(
    index="myIndex",
    body={
        "settings": {
            "analysis": {
                "char_filter": {
                    "underscore_to_dash": {
                        "type": "mapping",
                        "mappings": ["_ => -"],
                    }
                },
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                        "char_filter": ["underscore_to_dash"],
                    }
                },
            },
        },
    },
)

An analyzer can only be applied to a specific field, not conditionally per document. So in your case it might make sense to have two fields: one used when "entity_type": "table" and another for other docs.
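A minimal sketch of that two-field approach in Python (field and analyzer names here are illustrative assumptions, not from the question): route the value into a field analyzed with the custom analyzer when entity_type is "table", and into a standard-analyzed field otherwise.

```python
# Sketch: two differently-analyzed fields, chosen per document at index
# time. Field and analyzer names are illustrative assumptions.

mappings = {
    "properties": {
        "entity_type": {"type": "keyword"},
        # analyzed with the custom analyzer (e.g. the underscore_to_dash chain)
        "name_table": {"type": "text", "analyzer": "table_analyzer"},
        # plain standard analysis for every other entity type
        "name_default": {"type": "text"},
    }
}

def build_doc(entity_type, name):
    """Put the name into the field whose analyzer fits the entity type."""
    doc = {"entity_type": entity_type}
    field = "name_table" if entity_type == "table" else "name_default"
    doc[field] = name
    return doc
```

At search time you would then query name_table and name_default together (e.g. with a multi_match), since each field carries its own analyzer.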

Related

Elasticsearch autocomplete on large text field

I am using ES 7.x, and the requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.
I have explored multiple approaches, and all of them return the entire source document (or a specific field, if I restrict it using _source). I have tried the edge n-gram and n-gram tokenizers, the Prefix query, and the Completion suggester.
Below is a sample document (the content field might have thousands of sentences):
{
"content":"Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana),
the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}
Below is the expected output:
Search query: "el"
Output: ["elasticsearch", "elastic", "elk"]
Search query: "analytics e"
Output: ["analytics engine"]
Currently I am not able to achieve the above output using out-of-the-box functionality, so I have used Elasticsearch's highlighting, applied a regex to the result, and built a unique list of suggestions in Java.
Below is my current implementation using the highlight functionality.
Index Mapping:
PUT index
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ],
          "min_gram": "1",
          "type": "edge_ngram",
          "max_gram": "12"
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ],
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        },
        "analyzer": "text_english"
      }
    }
  }
}
Below is the Elasticsearch query, which is executed from Java:
POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}
We then apply a regex pattern to the result of the above query.
Please let me know if there is any way to achieve this without the above workaround.
The completion suggester is not the right tool for your use case, and ngrams and edge ngrams might be overkill depending on how much content you have.
Have you tried the match_phrase_prefix query, which matches full tokens next to one another and treats the last one as a prefix?
The query below is very simple and should work the way you expect.
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "analytics e"
    }
  }
}
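Issued from a client, the same request is just a small body. A sketch in Python of building it (max_expansions is the standard match_phrase_prefix parameter capping how many prefix completions are considered; its Elasticsearch default is 50):

```python
# Build a match_phrase_prefix request body equivalent to the query above.

def phrase_prefix_query(field, text, max_expansions=50):
    """Full tokens must match in order; the last token matches as a prefix."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {"query": text, "max_expansions": max_expansions}
            }
        }
    }

body = phrase_prefix_query("content", "analytics e")
```

The body can then be passed as-is to a client search call, e.g. elasticsearch-py's search(index=..., body=...).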

Custom stopword analyzer is not working properly

I have created an index with a custom analyzer for stop words. I want Elasticsearch to ignore these words at search time. I then added one document to the mapping.
But when I query Kibana for the keyword "the", it should not return a successful match, because I have put "the" in the my_stop_word section of my_analyzer. Yet it returns a match. I have read that if you specify an analyzer on a mapping field at index time, it is used by default at query time as well.
Please help!
PUT /pandey
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stemmer",
            "english_stop",
            "my_stop_word",
            "lowercase"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "my_stop_word": {
          "type": "stop",
          "stopwords": ["robot", "love", "affection", "play", "the"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "dialog": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT pandey/_doc/1
{
  "dailog" : "the boy is a robot. he is in love. i play cricket"
}
GET pandey/_search
{
  "query": {
    "match": {
      "dailog": "the"
    }
  }
}
A small spelling mistake can lead to this.
You defined the mapping for dialog but added a document with the field name dailog. Elasticsearch's dynamic field mapping behavior will index it without error (this can be disabled).
So the query on "dailog": "the" matches, because that field was indexed with the default analyzer.
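To catch such typos up front, dynamic mapping can be set to strict so an unmapped field like dailog is rejected instead of silently indexed. A sketch (the client-side checker below only illustrates what strict mode enforces server-side):

```python
# "dynamic": "strict" makes Elasticsearch reject documents containing
# fields that are not in the mapping (strict_dynamic_mapping_exception).
mappings = {
    "dynamic": "strict",
    "properties": {
        "dialog": {"type": "text", "analyzer": "my_analyzer"}
    }
}

def unmapped_fields(doc, mapping):
    """Client-side preview: which fields would strict mode reject?"""
    known = set(mapping["properties"])
    return sorted(f for f in doc if f not in known)

rejected = unmapped_fields({"dailog": "the boy is a robot"}, mappings)
```

With this mapping in place, the PUT pandey/_doc/1 above would fail fast instead of creating a stray dailog field.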

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer, so I assume it cannot be relevant to my problem, but I would like to apply it to the Name field as well)
My Attempted Solution
I will be using a bool/should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the prefix query: it appears not to lowercase the search term, despite my search analyzer having the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
  "myIndex": {
    "mappings": {
      "pages": {
        "properties": {
          "Id": {},
          "Name": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "keywordAnalyzer",
                "search_analyzer": "pageSearchAnalyzer"
              }
            },
            "analyzer": "pageSearchAnalyzer"
          },
          "Tokens": {} // Other fields not important for this question
        }
      }
    }
  }
}
ES Index Settings
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "ngram": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "analyzer": {
            "keywordAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "pageSearchAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "pageIndexAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding",
                "ngram"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "l2AXoENGRqafm42OSWWTAg",
        "version": {}
      }
    }
  }
}
Prefix queries don't analyze the search term. The text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer), so Harry is evaluated as-is directly against the keyword-tokenized, custom-filtered harry potter that the keywordAnalyzer produced at index time.
In your case, you'll need to do one of a few things:
Since you're using a lowercase filter on the field, you could simply always use lowercase terms in your prefix query (lowercasing application-side if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
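The first option can be done entirely application-side. A sketch that mirrors the index-time filters (trim, lowercase, asciifolding, the last one approximated here with Unicode NFKD folding) before building the prefix query:

```python
import unicodedata

def build_prefix_query(field, term):
    """Normalize the term the way the index-time keywordAnalyzer does
    (trim, lowercase, asciifolding), since prefix queries bypass the
    search analyzer entirely."""
    folded = unicodedata.normalize("NFKD", term)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    return {"query": {"prefix": {field: folded.strip().lower()}}}

q = build_prefix_query("Name.raw", "Harry")
```

NFKD folding only approximates Lucene's asciifolding filter; for exact parity you could instead run the term through the _analyze API and use the resulting token.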
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
I think it is better to use the match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

Empty value generates mapper_parsing_exception for Elasticsearch completion suggester field

I have a name field which is a completion suggester, and indexing generates a mapper_parsing_exception error stating that the value must have a length > 0.
There are indeed some empty values in this field. How do I accommodate them?
ignore_malformed had no effect, either at the properties level or the index level.
I tried filtering out empty strings in the analyzer by setting a minimum length:
PUT /genes
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "remove_empty"
          ]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "name": {
          "type": "completion",
          "analyzer": "keyword_lowercase"
        }
      }
    }
  }
}
Or filter empty strings as a stopword:
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
Attempting to apply a filter to the name mapping generates an unsupported parameter error:
"mappings": {
  "gene": {
    "properties": {
      "name": {
        "type": "completion",
        "analyzer": "keyword_lowercase",
        "filter": "remove_empty"
      }
    }
  }
}
This sure feels like it ought to be simple. Is there a way to do this?
Thanks!
I have faced the same issue. After some research, it seems that currently the only option is to change the data (e.g. replace empty values with a dummy non-empty value) before indexing.
But there is also good news: this issue exists on GitHub and was resolved about a month ago. It is planned to be released in version 6.4.0.
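Until that release, the data-side workaround can be a small transform before indexing. A sketch that simply omits the completion field for empty values (rather than substituting a dummy string):

```python
def prepare_doc(doc, field="name"):
    """Drop the completion field when its value is empty or whitespace,
    so the completion mapper never sees a zero-length input."""
    out = dict(doc)
    value = out.get(field)
    if value is None or not str(value).strip():
        out.pop(field, None)
    return out

docs = [{"name": "BRCA1"}, {"name": ""}, {"name": "   "}]
prepared = [prepare_doc(d) for d in docs]
```

Omitting the field (instead of inserting a dummy value) keeps the suggester index free of placeholder suggestions.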

Multi-field search for synonym in the query string

It looks like Elasticsearch does not take field analyzers into account for a multi-field search using a query string without a field specified.
Can this be configured for the index, or specified in the query?
Here is a hands-on example.
Given the files from this commit (spring-data-elasticsearch), there is a test, SynonymRepositoryTests, which passes with the QueryBuilders.queryStringQuery("text:british") and QueryBuilders.queryStringQuery("british").analyzer("synonym_analyzer") queries.
Is it possible to make it pass with the QueryBuilders.queryStringQuery("british") query, without specifying a field or analyzer?
You can query without specifying fields or analyzers. By default, the query string query runs against the _all field, which is a combination of all fields and uses the standard analyzer, so QueryBuilders.queryStringQuery("british") will work.
You can exclude some fields from _all when creating the index, and you can also create a custom all-field with the help of the copy_to functionality.
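A sketch of the copy_to variant (field names are illustrative assumptions): values of the individual fields are copied into a custom catch-all field, which can then carry the synonym analyzer and serve as the default search field.

```python
# copy_to gathers several fields into one custom "full_text" field;
# that field can be given the synonym analyzer. Names are illustrative.
mappings = {
    "properties": {
        "full_text": {"type": "text", "analyzer": "synonym_analyzer"},
        "name": {"type": "text", "copy_to": "full_text"},
        "tag": {"type": "text", "copy_to": "full_text"},
    }
}
```

Unlike _all (which was removed in later Elasticsearch versions), a copy_to field is an ordinary mapped field, so it works on modern clusters too.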
UPDATE
You would have to use your custom analyzer on the _all field when creating the index.
PUT text_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trim",
            "edge_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "_all": {
        "enabled": true,
        "analyzer": "prefix_analyzer" <---- your synonym analyzer
      },
      "properties": {
        "name": {
          "type": "string"
        },
        "tag": {
          "type": "string",
          "analyzer": "simple"
        }
      }
    }
  }
}
You can replace prefix_analyzer with your synonym_analyzer and then it should work.
