We are implementing company-list search using Elasticsearch, but the results are not what we expected.
**Example companies:**
Infosys technologies
Infosys technologies ltd
Infosys technologies pvt ltd
Infosys technologies Limited
Infosys technologies Private Limited
BAC Infosys ltd
Scenarios:
When searching for the keyword "Infosys", it should return the "Infosys technologies" entries.
When searching for the keyword "Infosys ltd", it should return the "Infosys technologies" entries.
When searching for the keyword "BAC Infosys ltd", it should return the "BAC Infosys ltd" entry.
The settings and mapping below are used:
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "keyword_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "companies": {
      "properties": {
        "company_name": {
          "type": "string",
          "store": "true",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "keyword_analyzer",
          "null_value": "null"
        }
      }
    }
  }
}
Query:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "company_name": "Infosys technologies" }}
      ],
      "minimum_should_match": "80%"
    }
  }
}
Please help me figure out how to achieve this.
You are missing a few things, both in the search queries and in the mappings. Looking at your scenarios and your current mappings and settings:
1) The results will also include the "BAC" entry. You could switch to edge n-grams, but that will not allow matching from the middle of a term.
2) It also depends on what kind of search you are building; you may be able to avoid the arrangement suggested in 1). For all your scenarios, let's assume your result list may also contain the "BAC" entry, but ranked lower. For this you can use proximity queries, with fuzziness enabled for spell checking.
The three scenarios above don't explain the whole functionality and use cases of your search feature to me, but I think the proximity search offered by Elasticsearch can give you more flexibility to meet your cases.
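As a sketch of that idea (the index and field names are taken from your mapping; the query text is just an example), a bool query that requires all terms with fuzziness enabled, and boosts documents where the terms also appear as a phrase, might look like:

```json
POST companies/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "company_name": {
            "query": "Infosys ltd",
            "operator": "and",
            "fuzziness": "AUTO"
          }
        }
      },
      "should": {
        "match_phrase": {
          "company_name": {
            "query": "Infosys ltd",
            "slop": 2
          }
        }
      }
    }
  }
}
```

The must clause tolerates minor misspellings, while the should clause ranks documents containing the terms close together as a phrase above looser matches.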
Shingles could help:
https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html
For your case the nGram analyzer is not pertinent; it will hurt both performance and relevance scoring. Create a shingle filter and a custom analyzer with the standard tokenizer and a lowercase filter.
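A minimal sketch of such an analyzer (the filter and analyzer names are my own; tune the shingle sizes to your data):

```json
PUT companies
{
  "settings": {
    "analysis": {
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingle_filter"
          ]
        }
      }
    }
  }
}
```

This indexes single words plus word pairs and triples (e.g. "infosys technologies"), so multi-word matches score higher without the token explosion of character n-grams.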
HTH,
Related
I am using ES 7.x, and the basic requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.
I have explored multiple approaches, and all of them return the entire source document, or a specific field if I restrict it using _source. I have tried the edge n-gram and n-gram tokenizers, the prefix query, and the completion suggester.
Below is a sample document (the content field might contain thousands of sentences):
{
"content":"Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic).
Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana),
the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}
Below is the expected output:
Search query: el
Output: ["elasticsearch","elastic","elk"]
Search Query: analytics e
Output: ["analytics engine"]
Currently I am not able to achieve the above output with the OOTB functionality, so I have used the highlighting functionality of Elasticsearch, applied a regex to the result, and built a unique list of suggestions in Java.
Below is my current implementation using the highlight functionality.
Index Mapping:
PUT index
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ],
          "min_gram": "1",
          "type": "edge_ngram",
          "max_gram": "12"
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ],
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        },
        "analyzer": "text_english"
      }
    }
  }
}
Below is the Elasticsearch query, which is executed from Java:
POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}
We then apply a regex pattern to the result of the above query.
Please let me know if there is a way to achieve this without the above workaround.
The completion suggester is not the right tool for your use case. Likewise, n-grams and edge n-grams might be overkill depending on how much content you have.
Have you tried the match_phrase_prefix query, which matches full tokens next to one another and treats the last one as a prefix?
The query below is very simple and should work the way you expect.
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "analytics e"
    }
  }
}
I am currently working on typeahead support (with contains, not just starts-with) for over 100,000,000 entries (and that number could grow arbitrarily) using Elasticsearch.
The current setup works, but I was wondering if there is a better approach to it.
I'm using AWS Elasticsearch, so I don't have full control over the cluster.
My index is defined as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": "lowercase"
        },
        "search_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        },
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        }
      }
    }
  },
  "mappings": {
    "account": {
      "properties": {
        "tags": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tags_prefix": {
          "type": "text",
          "analyzer": "edge_ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tenantId": {
          "type": "text",
          "analyzer": "keyword"
        },
        "referenceId": {
          "type": "text",
          "analyzer": "keyword"
        }
      }
    }
  }
}
The structure of the documents is:
{
  "tenantId": "1234",
  "name": "A NAME",
  "referenceId": "1234567",
  "tags": [
    "1234567",
    "A NAME"
  ],
  "tags_prefix": [
    "1234567",
    "A NAME"
  ]
}
The point behind the structure is that documents have searchable fields over which typeahead works; it is not over everything in the document, so the fields could even contain values not present in the document itself.
The search query is:
{
  "from": 0,
  "size": 10,
  "highlight": {
    "fields": {
      "tags": {}
    }
  },
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "a nam",
          "fields": ["tags_prefix^100", "tags"]
        }
      },
      "filter": {
        "term": {
          "tenantId": "1234"
        }
      }
    }
  }
}
I'm using a multi_match because, while I need typeahead, results that match at the start need to come back first, so I followed the recommendation here.
The current setup is 10 shards, 3 master nodes (t2.mediums), 2 data/ingestion nodes (t2.mediums) with 35GB EBS disk on each, which I know is tiny given the final needs of the system, but useful enough for experimenting.
I have ~6,000,000 records inserted, and the response time with a cold cache is around 300 ms.
I was wondering: is this the right approach, or are there optimizations I could apply to the index or query to make this more performant?
First, I think the solution you built is good, and the optimisations you are looking for should only be considered if you have an issue with the current solution, i.e. the queries are too slow. No need for premature optimisation.
Second, I think you don't need to provide tags_prefix in your documents. All you need is to use the edge_ngram_tokenizer on the tags field, which will create the desired prefix tokens for the search to work. You can use multi-fields in order to have multiple tokenizers for the same tags field.
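For illustration, a multi-field version of the tags mapping along those lines (the sub-field name prefix is my choice; the analyzer names come from your settings):

```json
"tags": {
  "type": "text",
  "analyzer": "ngram_analyzer",
  "search_analyzer": "search_analyzer",
  "fields": {
    "prefix": {
      "type": "text",
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "search_analyzer"
    }
  }
}
```

The query would then target ["tags.prefix^100", "tags"] instead of the separate tags_prefix field, and the documents no longer need to duplicate the values.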
Third, use the edge_ngram_tokenizer settings carefully, especially min_gram and max_gram. The reason is that too high a max_gram will:
a. create too many prefix tokens and use too much space,
b. decrease the indexing rate, as indexing takes longer,
c. not be useful: you don't expect autocomplete to take 300 prefix characters into account. A better maximum prefix token length would be (in my opinion) in the range of 10-20 characters at most (or even less).
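For example, a trimmed-down tokenizer definition along those lines (the max_gram of 15 is an arbitrary illustration, not a recommendation tuned to your data):

```json
"tokenizer": {
  "edge_ngram_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 15,
    "token_chars": [
      "letter",
      "digit",
      "symbol",
      "punctuation",
      "whitespace"
    ]
  }
}
```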
Good luck!
I am trying to implement an "autosuggestion" feature in my application: when a user types a set of letters, she should see a list of suggestions for the given input. I would like the feature to work similarly to how Alibaba and similar sites work.
I am using Elasticsearch and Java. Can anyone help me or give any suggestions on how to implement this functionality?
Suggestions in Elasticsearch can be implemented using 1) prefix matching, 2) completion suggesters, or 3) edge n-grams.
In this approach I chose an n-gram analyzer for autosuggestion: start by defining an nGram filter, then link the filter to a custom analyzer, and apply the analyzer to the fields we choose to provide suggestions for.
"settings": {
  "analysis": {
    "filter": {
      "nGram_filter": {
        "type": "nGram",
        "min_gram": 2,
        "max_gram": 20,
        "token_chars": [
          "letter",
          "digit",
          "punctuation",
          "symbol"
        ]
      }
    },
    "analyzer": {
      "nGram_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "lowercase",
          "asciifolding",
          "nGram_filter"
        ]
      },
      "whitespace_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "lowercase",
          "asciifolding"
        ]
      }
    }
  }
}
Once we have the mapping set, we can assign the analyzer (nGram_analyzer) to the fields that take part in suggestions.
The next step is to write a query to extract suggestions, which would look as follows.
"query": {
  "match": {
    "_all": {
      "query": "gates",
      "operator": "and"
    }
  }
}
The "operator": "and" setting is optional; it is needed when the input contains multiple words or a sentence.
The material below will help you understand the ngram tokenizer in depth:
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/analysis-ngram-tokenizer.html
PS: each approach listed above has its own advantages and drawbacks.
I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with this mapping:
"surname": {
  "type": "string",
  "analyzer": "polish",
  "copy_to": [
    "full_name"
  ]
}
and this definition for the index (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns one result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values as high as 1200, and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the Polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess in the case of Polish names some kind of prefix query on a non-analyzed field would suffice...
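A sketch of that approach, assuming a raw sub-field added to your surname mapping (the sub-field name is my choice; note that a prefix query is not analyzed, so the case of the query text must match what was indexed):

```json
"surname": {
  "type": "string",
  "analyzer": "polish",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
```

You would then query with a prefix query on surname.raw, e.g. { "prefix": { "surname.raw": "Grab" } }, which bypasses stemming entirely.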
I am new to Elasticsearch and I am trying to configure synonyms, but they are not working as expected.
I have the following data in my fields:
1) Technical Lead, Module Lead, Software Engineer, Senior Software Engineer
I want that if I search for tl, it should return "Technical Lead" or "tl".
However, it returns both "Technical Lead" and "Module Lead", because "lead" is tokenized at index time.
Could you please help me resolve this issue with the exact settings?
I have read about index-time versus search-time tokenization, but I am unable to understand it.
synonyms.txt:
tl,TL => Technical Lead
se,SE => Software Engineer
sse => Senior Software Engineer
Mapping file:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms.txt"
          }
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "properties": {
        "Domain": {
          "type": "string",
          "analyzer": "synonym"
        },
        "Designation": {
          "analyzer": "synonym",
          "type": "string"
        },
        "City": {
          "type": "string",
          "analyzer": "synonym"
        }
      }
    }
  }
}
Your tokens are identical here, so you have that part down. What you need to do is ensure that you are doing an "AND" match instead of an "OR", as the query appears to be matching on any word rather than all of them.
Check out your tokens:
localhost:9200/test/_analyze?analyzer=synonym&text=technical lead
localhost:9200/test/_analyze?analyzer=synonym&text=tl
And the query:
{
  "query": {
    "match": {
      "domain": {
        "query": "tl",
        "operator": "and"
      }
    }
  }
}
Usually you want your search and index analyzers to be the same, although there are many advanced cases where this is not preferable. In the case of synonyms, you often do not want to apply them at both index and search time when you have expansions turned on,
i.e. tl,technical lead
However, since you are using the => type of synonyms, this really doesn't matter, because every word on the left will be converted into the words on the right rather than a bunch of tokens being created for every word between the commas.