I am using hosted Elasticsearch (cloud) with 2 indexes of 50,000 documents each: one is around 300MB (we run search against it) and the other is around 50MB (we run suggestions against it). Query caching is enabled on both indexes.
Index of around 300MB (used for search)
Queries against this index take 1.2s to 1.5s.
Settings JSON:
{
"index.blocks.read_only_allow_delete": "false",
"index.priority": "1",
"index.query.default_field": [
"*"
],
"index.write.wait_for_active_shards": "1",
"index.refresh_interval": "9000s",
"index.requests.cache.enable": "true",
"index.analysis.analyzer.edge_ngram_analyzer.filter": [
"lowercase"
],
"index.analysis.analyzer.edge_ngram_analyzer.tokenizer": "edge_ngram_tokenizer",
"index.analysis.analyzer.keyword_analyzer.filter": [
"lowercase",
"asciifolding",
"trim"
],
"index.analysis.analyzer.keyword_analyzer.char_filter": [],
"index.analysis.analyzer.keyword_analyzer.type": "custom",
"index.analysis.analyzer.keyword_analyzer.tokenizer": "keyword",
"index.analysis.analyzer.singular_plural_analyzer.type": "snowball",
"index.analysis.analyzer.edge_ngram_search_analyzer.tokenizer": "lowercase",
"index.analysis.tokenizer.edge_ngram_tokenizer.token_chars": [
"letter"
],
"index.analysis.tokenizer.edge_ngram_tokenizer.min_gram": "2",
"index.analysis.tokenizer.edge_ngram_tokenizer.type": "edge_ngram",
"index.analysis.tokenizer.edge_ngram_tokenizer.max_gram": "5",
"index.number_of_replicas": "1"
}
Index of around 50MB (used for suggestions)
Queries against this index take 0.5s to 0.6s.
Settings JSON:
{
"index.blocks.read_only_allow_delete": "false",
"index.priority": "1",
"index.query.default_field": [
"*"
],
"index.write.wait_for_active_shards": "1",
"index.refresh_interval": "90000s",
"index.requests.cache.enable": "true",
"index.analysis.analyzer.edge_ngram_analyzer.filter": [
"lowercase"
],
"index.analysis.analyzer.edge_ngram_analyzer.tokenizer": "edge_ngram_tokenizer",
"index.analysis.analyzer.keyword_analyzer.filter": [
"lowercase",
"asciifolding",
"trim"
],
"index.analysis.analyzer.keyword_analyzer.char_filter": [],
"index.analysis.analyzer.keyword_analyzer.type": "custom",
"index.analysis.analyzer.keyword_analyzer.tokenizer": "keyword",
"index.analysis.analyzer.singular_plural_analyzer.type": "snowball",
"index.analysis.analyzer.edge_ngram_search_analyzer.tokenizer": "lowercase",
"index.analysis.tokenizer.edge_ngram_tokenizer.token_chars": [
"letter"
],
"index.analysis.tokenizer.edge_ngram_tokenizer.min_gram": "2",
"index.analysis.tokenizer.edge_ngram_tokenizer.type": "edge_ngram",
"index.analysis.tokenizer.edge_ngram_tokenizer.max_gram": "5",
"index.number_of_replicas": "0"
}
I want to reduce the response time of both queries.
Images of the current system/Elasticsearch Cloud configuration are linked below.
Can you please help me improve the performance of these queries?
I figured it out and solved my issue by installing Elasticsearch on my local server, which removed the network dependency.
The issue was the remote Elasticsearch cloud and a slow network. Now both queries return a response within 15ms to 20ms.
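For anyone debugging a similar gap, a rough way to confirm network overhead (the host and index name below are placeholders) is to compare the wall-clock time of a request with the "took" value Elasticsearch returns, since "took" only measures server-side execution:
# the difference between the time reported by `time` and the "took" field
# in the JSON response is mostly network and serialization overhead
time curl -s 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_all": {}}}'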
Please consider the scenario.
Existing System
I have an index named contacts_index with 100 documents.
Each document has a property named city with some text value in it.
The index has the following settings:
{
"analyzer": {
"city_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "city_tokenizer"
},
"search_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"tokenizer": {
"city_tokenizer": {
"token_chars": [
"letter"
],
"min_gram": "2",
"type": "ngram",
"max_gram": "30"
}
}
}
The index has the following mapping for the city field to support sub-text matching and keyword search:
{
"city" : {
"type" : "text",
"analyzer" : "city_analyzer",
"search_analyzer" : "search_analyzer"
}
}
Proposed System
Now we want to perform autocomplete on the city field. For example, for a city with the value Seattle, we want the document to be returned when the user types s, se, sea, seat, seatt, seattl, or seattle, but only for these prefix queries, not, for example, when they type eattle.
We plan to achieve this with an additional multi-field on the city property, of type text and with a different analyzer.
To achieve this we have done the following.
Updated the settings to support autocomplete
PUT /staging-contacts-index-v4.0/_settings?preserve_existing=true
{
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "autocomplete_tokenizer"
}
},
"tokenizer": {
"autocomplete_tokenizer": {
"token_chars": [
"letter"
],
"min_gram": "1",
"type": "edge_ngram",
"max_gram": "100"
}
}
}
}
Updated the mapping of the city field with an autocomplete multi-field to support autocomplete
{
"city" : {
"type" : "text",
"fields" : {
"autocomplete" : {
"type" : "text",
"analyzer" : "autocomplete_analyzer",
"search_analyzer" : "search_analyzer"
}
},
"analyzer" : "city_analyzer",
"search_analyzer" : "search_analyzer"
}
}
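An autocomplete query against this multi-field would look roughly like this (a sketch; the query text is just an example):
GET /contacts_index/_search
{
  "query": {
    "match": {
      "city.autocomplete": "sea"
    }
  }
}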
Findings
For any document newly created after updating the settings with the autocomplete multi-field, autocomplete search works as expected.
For existing documents, if the value of the city field changes, for example from seattle to chicago, the document is returned by the autocomplete search.
We are planning to use the Update API to fetch and update the existing 100 documents so that autocomplete works for existing documents as well. However, while trying to use the Update API, we are getting
{"result" : "noop"}
and the autocomplete search is not working.
I can infer that since the values are not changing, Elasticsearch is not creating tokens for the autocomplete field.
Question
From the research we have done, there are two options to make sure the existing 100 documents can be found via autocomplete search:
Use the Reindex API for the existing 100 documents (see the sketch after this list).
Fetch all 100 documents and use the document Index API to re-index them, which will create all the tokens in the process.
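A minimal sketch of option 1 (the Reindex API), assuming a new destination index that already has the updated settings and mapping (contacts_index_v2 is a hypothetical name):
POST /_reindex
{
  "source": { "index": "contacts_index" },
  "dest": { "index": "contacts_index_v2" }
}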
Which option is preferable and why?
Thanks for taking time to read through.
I am new to Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration.
{
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3,
"output_unigrams": true,
"token_separator": ""
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"shingle_search": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"shingle_index": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter",
"autocomplete_filter"
]
}
}
}
}
I have applied this over multiple fields and run a multi_match query.
Following is the Java code:
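// i holds the user's search input string (e.g. "ron")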
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(QueryBuilders.multiMatchQuery(i)
.field("title")
.field("alias")
.fuzziness(Fuzziness.ONE)
.type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
.build();
The problem is that it matches terms where the letters are preceded by other characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure that I only match terms with no leading characters?
Update-1
Turning off fuzzy transposition seems to improve search results. But I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match of the complete field value "ron" even higher, so the best option here is to add a few subfields with the standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with fuzziness. It might make sense to run two queries inside a should clause, one fuzzy and one boosted, so that exact matches are ranked higher.
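A minimal sketch of that idea, assuming the index is called my_index and has keyword subfields named title.keyword and alias.keyword (all of these names are illustrative):
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title", "alias"],
            "fuzziness": 1
          }
        },
        {
          "multi_match": {
            "query": "ron",
            "fields": ["title.keyword^3", "alias.keyword^3"]
          }
        }
      ]
    }
  }
}
The first clause keeps the fuzzy, analyzed matching; the second boosts documents whose keyword value matches exactly, so exact hits rank above fuzzy ones such as "iron".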
Let me jump straight to the code.
PUT /test_1
{
"settings": {
"analysis": {
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"university of tokyo => university_of_tokyo, u_tokyo",
"university" => "college, educational_institute, school"
],
"tokenizer": "whitespace"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [
"shingle",
"synonym"
]
}
}
}
}
}
Output:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Token filter [shingle] cannot be used to parse synonyms"
}
],
"type": "illegal_argument_exception",
"reason": "Token filter [shingle] cannot be used to parse synonyms"
},
"status": 400
}
Basically,
Let's say I have the following index-time synonyms:
"university => university, college, educational_institute, school"
"tokyo => tokyo, japan_capitol"
"university of tokyo => university_of_tokyo, u_tokyo"
If I search for "college" I expect to match "university of tokyo",
but since the index only contains university_of_tokyo and u_tokyo for "university of tokyo", the search fails.
I was expecting that with analyzer{'filter': ["shingle", "synonym"]} I would get
university of tokyo -shingle-> university -synonyms-> college, institute
How do I obtain the desired behaviour?
I was getting a similar error, though I was using a synonym graph.
I tried using lenient=true in the synonym graph definition and got rid of the error. I am not sure if there is a downside.
"graph_synonyms" : {
"lenient": "true",
"type" : "synonym_graph",
"synonyms_path" : "synonyms.txt"
},
According to the linked documentation, tokenizers should produce single tokens before a synonym filter.
But to answer your problem: first of all, your second rule should be modified as follows, to make all of the terms synonyms of each other:
university , college, educational_institute, school
Second, because of the underscores in the target of the first rule (university_of_tokyo), all occurrences of "university of tokyo" are indexed as university_of_tokyo, which is not aware of its single tokens. To overcome this problem I would suggest a char filter with a rule like this:
university of tokyo => university_of_tokyo university of tokyo
and then in your synonyms rule:
university_of_tokyo , u_tokyo
This is a way to handle the multi-term synonym problem as well.
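A minimal sketch of that suggestion using a mapping char filter (the filter and analyzer names are illustrative):
PUT /test_1
{
  "settings": {
    "analysis": {
      "char_filter": {
        "uot_char_filter": {
          "type": "mapping",
          "mappings": [
            "university of tokyo => university_of_tokyo university of tokyo"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": [
            "university_of_tokyo , u_tokyo",
            "university , college, educational_institute, school"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": ["uot_char_filter"],
          "tokenizer": "whitespace",
          "filter": ["lowercase", "synonym"]
        }
      }
    }
  }
}
The char filter expands "university of tokyo" before tokenization so both the combined token and the single tokens are indexed, and the synonym rules then only have to deal with single tokens.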
I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
}
},
"analyzer": {
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
},
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
}
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, query must need any one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper structure for your settings object. You cannot have two analyzer keys or two filter keys, as there can only be one value per key in this settings map object. Defining all of your filters under a single filter key (and likewise for analyzers) works just fine. When you were creating your index, the second key was overriding the first.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
"analyzer":"autocomplete_analyzer",
"text":"Jonnie Smythe"
}
I am trying to use Elasticsearch for live data filtering. Right now I use a single machine which constantly gets pushed new data (every 3 seconds via _bulk). Even though I set up a TTL, the index gets quite big after a day or so and then Elasticsearch hangs. My current mapping:
curl -XPOST localhost:9200/live -d '{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"lowercase_keyword": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"no_keyword": {
"type": "custom",
"tokenizer": "whitespace",
"filter": []
}
}
}
},
"mappings": {
"log": {
"_timestamp": {
"enabled": true,
"path": "datetime"
},
"_ttl":{
"enabled":true,
"default":"8h"
},
"properties": {
"url": {
"type": "string",
"search_analyzer": "lowercase_keyword",
"index_analyzer": "lowercase_keyword"
},
"q": {
"type": "string",
"search_analyzer": "no_keyword",
"index_analyzer": "no_keyword"
},
"datetime" : {
"type" : "date"
}
}
}
}
}'
I think one problem is purging the old documents, but I could be wrong. Any ideas on how to optimize my setup?
To avoid Elasticsearch hanging, you might want to increase the amount of memory available to the Java process.
If all your documents have the same 8-hour life span, it might be more efficient to use rolling aliases instead of TTL. The basic idea is to create a new index periodically (every hour, for example) and use aliases to keep track of the current indices. As time goes on, you update the list of indices in the alias that you search, and simply delete indices that are more than 8 hours old. Deleting an index is much quicker than removing documents using TTL. Sample code that demonstrates how to create a rolling-aliases setup can be found here.
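A rough sketch of the hourly rotation, assuming hourly indices and a search alias (all names here are illustrative):
POST /_aliases
{
  "actions": [
    { "add":    { "index": "live-2013-01-02-09", "alias": "live_search" } },
    { "remove": { "index": "live-2013-01-02-01", "alias": "live_search" } }
  ]
}
DELETE /live-2013-01-02-01
Searches always go through the alias, so dropping an expired hourly index is a single cheap delete instead of a document-by-document TTL purge.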
I am not quite sure how much live data you are trying to keep, but if you are just testing incoming data against a set of queries, you might also consider using Percolate API instead of indexing data.