ElasticSearch autocomplete doesn't work with middle words

Using Python elasticsearch-dsl:
from elasticsearch_dsl import Completion, Document, Keyword

class Record(Document):
    tags = Keyword()
    tags_suggest = Completion(preserve_position_increments=False)

    def clean(self):
        self.tags_suggest = {
            "input": self.tags
        }

    class Index:
        name = 'my-index'
        settings = {
            "number_of_shards": 2,
        }
When I index
r1 = Record(tags=['my favourite tag', 'my hated tag'])
r1.save()
r2 = Record(tags=['my good tag', 'my bad tag'])
r2.save()
And when I try to use autocomplete with a word from the middle of the phrase:
dsl = Record.search()
dsl = dsl.suggest("auto_complete", "favo", completion={"field": "tags_suggest"})
search_response = dsl.execute()
for option in search_response.suggest.auto_complete[0].options:
    print(option.to_dict())
It won't return anything, but it will when I search "my favo". Are there any good practices to fix that (i.e. make it return 'my favourite tag' when I request suggestions for "favo")?

Check Mapping
Search in Elasticsearch also depends on how you index your data. I would suggest having a look at the index mapping with the query below:
curl -X GET "elasticsearch.url:port/index_name/_mapping?pretty"
You need to check how the data is being indexed, i.e. whether any analyzer or tokenizer is applied when it is saved. If you have not specified an analyzer, Elasticsearch uses the standard analyzer by default, and it will produce the terms accordingly.
For your use case you need to apply the appropriate analyzer, tokenizer, and filters. As an example, I once had to implement a like-style query with an ngram token filter.
Solution
As I can see, you are using a suggester. The suggest feature suggests similar-looking terms based on the provided text.
If you want to achieve autocomplete, I would suggest using the search_as_you_type field type.
I tried to reproduce your use case, and below is what worked for me.
Create Index
PUT /test1?pretty
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "search_as_you_type"
      }
    }
  }
}
Indexing data
POST test1/_doc?pretty
{
  "tags": "my favourite tag"
}
POST test1/_doc?pretty
{
  "tags": "my hated tag"
}
POST test1/_doc?pretty
{
  "tags": "my good tag"
}
POST test1/_doc?pretty
{
  "tags": "my bad tag"
}
Query with your keyword
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "my",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "bad",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "fav",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
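If you want to stay with elasticsearch-dsl, a rough Python equivalent of the mapping and bool_prefix query above could look like the following. This is a minimal sketch, assuming elasticsearch-dsl 7.x (where the SearchAsYouType field is available), a cluster on localhost, and the test1 index from the example; TagRecord is just an illustrative name:

from elasticsearch_dsl import Document, SearchAsYouType, connections
from elasticsearch_dsl.query import MultiMatch

connections.create_connection(hosts=["localhost"])

class TagRecord(Document):
    # search_as_you_type creates the ._2gram and ._3gram subfields automatically
    tags = SearchAsYouType()

    class Index:
        name = "test1"

# bool_prefix matches terms in any order, treating the last one as a prefix
query = MultiMatch(
    query="favo",
    type="bool_prefix",
    fields=["tags", "tags._2gram", "tags._3gram"],
)
for hit in TagRecord.search().query(query):
    print(hit.tags)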

You can achieve this by setting the preserve_position_increments parameter to false in your mapping.
"tags_completion": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": false,
"preserve_position_increments": false,
"max_input_length": 50
}
You can query it in the console like this:
GET /_search
{
  "suggest": {
    "my-suggester": {
      "prefix": "favou",
      "completion": {
        "field": "tags_completion",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}
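With elasticsearch-dsl the same options can be passed straight to the Completion field, and the suggest call from the question stays unchanged. A minimal sketch, assuming the Record document from the question (the mapping above names the field tags_completion; this keeps the question's name tags_suggest):

from elasticsearch_dsl import Completion

# extra keyword arguments are passed through into the field mapping
tags_suggest = Completion(
    analyzer="simple",
    preserve_separators=False,
    preserve_position_increments=False,
    max_input_length=50,
)

s = Record.search()
s = s.suggest(
    "auto_complete",
    "favou",
    completion={
        "field": "tags_suggest",
        "skip_duplicates": True,
        "fuzzy": {"fuzziness": 1},
    },
)
response = s.execute()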

Related

elasticsearch query string (full text search) which can't be searched

We have the document below. It cannot be found by searching for financialmarkets, but it can be found with industry_icon_financialmarkets.png. Can anyone tell me the reason?
content is a text type field.
document:
{
  "title": "test",
  "content": "industry_icon_financialmarkets.png"
}
Query:
{
  "from": 0,
  "size": 2,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "\"industry_icon_financialmarkets.png\""
          }
        }
      ]
    }
  }
}
The default analyzer for text fields is standard, which won't break industry_icon_financialmarkets.png into tokens using _ as a delimiter. I would suggest using the simple analyzer instead, which breaks text into terms whenever it encounters a character that is not a letter.
You can also add a sub-field of type keyword to retain the original value.
So the mapping of the field should be:
{
  "content": {
    "type": "text",
    "analyzer": "simple",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  }
}
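You can check what each analyzer does to the value with the _analyze API. A quick sketch using the low-level Python client (assuming a cluster on localhost):

from elasticsearch import Elasticsearch

es = Elasticsearch()

for analyzer in ("standard", "simple"):
    resp = es.indices.analyze(
        body={"analyzer": analyzer, "text": "industry_icon_financialmarkets.png"}
    )
    print(analyzer, [t["token"] for t in resp["tokens"]])

# standard keeps the whole string as a single term, while simple
# splits it into ['industry', 'icon', 'financialmarkets', 'png']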
At the time of creating the index, we should define our own mapping for each field based on its type to get the expected results.
Mapping
PUT relevance
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 30,
          "token_chars": ["letter", "digit"]
        }
      }
    },
    "number_of_shards": 5,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "ID": { "type": "long" },
      "title": { "type": "keyword" },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"
      }
    }
  }
}
Then start inserting documents:
POST relevance/_doc/1
{
  "name": "1elastic",
  "content": "working fine" // replace special characters with spaces in your program before inserting into the ES index
}
Query
GET relevance/_search
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        { "match": { "content": { "query": "fine", "fuzziness": 1 } } }
      ]
    }
  }
}

Sort similar data by property

I have the following data:
[
  {
    "DocumentId": "85",
    "figureText": "General Seat Assembly - DBL",
    "descriptionShort": "Seat Assembly - DBL",
    "partNumber": "1012626-001FG05",
    "itemNumeric": "5"
  },
  {
    "DocumentId": "85",
    "figureText": "General Seat Assembly - DBL",
    "descriptionShort": "Seat Assembly - DBL",
    "partNumber": "1012626-001FG05",
    "itemNumeric": "45"
  }
]
I use the following query to get data:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "DocumentId": "85"
        }
      },
      "should": [
        {
          "match": {
            "figureText": {
              "boost": 5,
              "query": "General Seat Assembly - DBL",
              "operator": "or"
            }
          }
        },
        {
          "match": {
            "descriptionShort": {
              "boost": 4,
              "query": "Seat Assembly - DBL",
              "operator": "or"
            }
          }
        },
        {
          "term": {
            "partNumber": {
              "boost": 1,
              "value": "1012626-001FG05"
            }
          }
        }
      ]
    }
  }
}
Currently, it returns the item with "itemNumeric" = 45, and I would like to get itemNumeric = "5" (the lowest).
Is there a trick to do that? I tried with "sort":[{"itemNumeric":"desc"}].
Thanks
Looking at your comment, you can resolve the issue in two ways.
Solution 1: Update your mapping so that your current query works as expected:
PUT my_index/_mapping/_doc
{
  "properties": {
    "itemNumeric": {
      "type": "text",
      "fielddata": true
    }
  }
}
Solution 2: Check the mapping of your itemNumeric field. If your mapping has been created dynamically, the field itemNumeric will be a multi-field:
"itemNumeric": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
In this case you can apply your sorting logic to the itemNumeric.keyword field:
"sort":[{"itemNumeric.keyword":"desc"}]
In Elasticsearch, whenever you have text data, it is recommended to create two fields for it: one of type text, so that you can apply full-text queries, and one of type keyword, so that you can use it for sorting or aggregation operations.
Solution 1 is not recommended; the official ES documentation gives this reason:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
I'd suggest reading about multi-fields and fielddata so that you have more clarity on what's happening.
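If you are querying from Python, the keyword sub-field sort can be expressed with elasticsearch-dsl like this (a sketch; the index name my_index is assumed):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()

# sort on the keyword sub-field, not the analyzed text field
s = (
    Search(using=es, index="my_index")
    .query("match", DocumentId="85")
    .sort({"itemNumeric.keyword": {"order": "desc"}})
)
response = s.execute()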

Proper way to query documents using an 'uppercase' token filter

We have an ElasticSearch index with some fields that use custom analyzers. One of the analyzers includes an uppercase token filter in order to get rid of case sensitivity while making queries (e.g. we want "ball" to also match "Ball" or "BALL")
The issue arises when doing regular expressions: the pattern is matched against the term in the index, which is all uppercase. So "app*" won't match "Apple" in our index, because behind the scenes it's really indexed as "APPLE".
Is there a way to get this to work without doing some hacky things outside of ES?
I might play around with "query_string" instead and see if that has any different results.
This all depends on the type of query you are using. If that query type uses the analyzer of the field itself to analyze the input string, then it should be fine.
If you are using the regexp query, it will NOT analyze the input string, so if you pass app.* to it, it stays unchanged and that is exactly what will be used for the search.
But if you use the query_string query properly, it should work:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "uppercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "some_field": {
          "type": "text",
          "analyzer": "my"
        }
      }
    }
  }
}
And the query itself:
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
To make sure it's doing what I think it is, I always use the _validate API:
GET /_validate/query?explain&index=test
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
which will show what ES is doing to the input string:
"explanations": [
{
"index": "test",
"valid": true,
"explanation": "some_field:APP*"
}
]
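The same check can be scripted. A small sketch with the low-level Python client, assuming the test index above and a cluster on localhost:

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.indices.validate_query(
    index="test",
    body={"query": {"query_string": {"query": "some_field:app*"}}},
    explain=True,
)
# prints e.g. "some_field:APP*", showing the uppercase filter was applied
print(resp["explanations"][0]["explanation"])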

How do I specify a different analyzer at query time with Elasticsearch?

I would like to use a different analyzer at query time to compose my query.
I read that it is possible in the documentation "Controlling Analysis", which gives the full sequence at search time:
The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer
But I don't know how to compose the query in order to specify different analyzers for different clauses:
"query" => [
    "bool" => [
        "must" => [
            {
                "match": ["my_field": "My query"],
                "<ANALYZER>": <ANALYZER_1>
            }
        ],
        "should" => [
            {
                "match": ["my_field": "My query"],
                "<ANALYZER>": <ANALYZER_2>
            }
        ]
    ]
]
I know that I can index two or more different fields, but I have strong secondary-memory constraints and I can't index the same information N times.
Thank you
If you haven't yet, you first need to add the custom analyzers to your index settings.
Note: if the index exists and is running, make sure to close it first.
POST /my_index/_close
Then add the custom analyzers through the settings endpoint.
PUT /my_index/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer1": {
          "type": "standard",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "custom_analyzer2": {
          "type": "standard",
          "stopwords": ["stop", "words"]
        }
      }
    }
  }
}
Open the index again.
POST /my_index/_open
Now you can query your index with the new analyzers.
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [{
        "match": {
          "field_1": {
            "query": "Hello world",
            "analyzer": "custom_analyzer1"
          }
        }
      }],
      "must": [{
        "match": {
          "field_2": {
            "query": "Stop words can be tough",
            "analyzer": "custom_analyzer2"
          }
        }
      }]
    }
  }
}
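For completeness, the same per-clause analyzers can be set from elasticsearch-dsl. A minimal sketch, assuming the index and analyzers above and a default connection already configured:

from elasticsearch_dsl import Q, Search

# the analyzer is just another option of the match clause
query = Q(
    "bool",
    should=[Q("match", field_1={"query": "Hello world",
                                "analyzer": "custom_analyzer1"})],
    must=[Q("match", field_2={"query": "Stop words can be tough",
                              "analyzer": "custom_analyzer2"})],
)
s = Search(index="my_index").query(query)
response = s.execute()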

Elastic search multiple terms in a dictionary

I have a mapping like:
"profile": {
"properties": {
"educations": {
"properties": {
"university": {
"type": "string"
},
"graduation_year": {
"type": "string"
}
}
}
}
}
which obviously holds the education history of people. Each person can have multiple educations. What I want to do is search for people who graduated from "SFU" in "2012". To do that I am using a filtered search:
"filtered": {
"filter": {
"and": [
{
"term": {
"educations.university": "SFU"
}
},
{
"term": {
"educations.graduation_year": "2012"
}
}
]
}
But what this query does is find documents that have "SFU" and "2012" anywhere in their educations, so this document would match, which is wrong:
educations[0] = {"university": "SFU", "graduation_year": 2000}
educations[1] = {"university": "UBC", "graduation_year": 2012}
Is there any way I could filter both terms on each education?
You need to define the nested type for educations and use a nested filter to filter on it; otherwise Elasticsearch internally flattens the inner objects into a single object and returns wrong results.
You can refer to these for detailed explanations and samples:
http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/
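A sketch of the nested approach with elasticsearch-dsl (class and index names are illustrative, and a default connection is assumed; keyword fields are used here so exact term queries match as expected):

from elasticsearch_dsl import Document, InnerDoc, Keyword, Nested, Q

class Education(InnerDoc):
    university = Keyword()
    graduation_year = Keyword()

class Profile(Document):
    # nested keeps each education object separate instead of
    # flattening all values into one bag per field
    educations = Nested(Education)

    class Index:
        name = "profiles"

# both conditions must hold within the SAME education entry
q = Q("nested", path="educations", query=Q("bool", must=[
    Q("term", educations__university="SFU"),
    Q("term", educations__graduation_year="2012"),
]))
results = Profile.search().query(q).execute()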
