ElasticSearch Edge NGram vs Prefix query - elasticsearch

Let's say we have a text field that is relatively short, let's say maximum 10 characters and is saved as a keyword.
I want my users to be able to prefix-search this field (not autocomplete / search-as-you-type).
I have read on Elastic's documentation that the prefix query scales poorly, and they give a couple of examples to demonstrate it.
When is it ok to use prefix search, and when should I use index-time edge-ngrams? Building and storing index-time edge-ngrams of this field sounds excessive, but maybe I'm missing something.

First thing first:
Keyword fields are only searchable by their exact value
Thus you can't prefix-search a field that is defined as keyword type . prefix-search is only work for analyzed fields.
In general prefix-search has relatively poor performance. You can improve it's functionality by index_prefixes but even then it is relatively intense operation so if you plan to scale your business, it is better to use N-Grams because it is more resource efficient.

Related

Are there downsides of indexing every field in Elasticsearch index? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I found information about how number of shards, number of fields in mapping, affect performance of Elasticsearch. But I could not find any information about how indexing or not indexing a field affect cluster performance.
Imagine I have document like:
{
"name":"John Doe",
"photoPath":"/var/www/public/images/i1/j/d/123456.jpg",
"passwordHash":"$12$3b$abc...354",
"bio":"..."
}
I need to put 10 to 100 such documents to the cluster each second. When I put such document in index I am pretty sure I'd need to fulltext search for name and fulltext search for bio. I will never search for photoPath and I will never need fulltext search for password hash.
When I do my mapping I have several options:
make all fields text and analyze them with simple analyzer (i.e. tokenize by any not-character) - in that case I will have terms like "i1", "3b" or "123456" in my index
make name and bio text, make password hash keyword and make photoPath non-indexed
So my questions are:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types?
Am I correct in my assumption that having less fields indexed helps performance?
Am I correct in my assumption that indexing fewer fields will improve indexing performance?
Am I correct in my assumption that actual search will be faster if I index only what I need?
Here we go with the answers:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types? --> see detailed explanation below
Am I correct in my assumption that having less fields indexed helps performance? --> Yes
Am I correct in my assumption that indexing fewer fields will improve indexing performance? --> Yes
Am I correct in my assumption that actual search will be faster if I index only what I need? --> Most likely
Detailed explanation:
Every index comes with a mapping in which you not just specify what data should get indexed but also in how many fields your data is stored and how to process the data before storing it. In its default configuration Elasticsearch will dynamically create this mapping for you based on the type of data you sent to it.
Every single entry in your mapping consumes some bytes which will add to the size of the cluster state (the data structure that contains all the meta-information about your cluster such as information about nodes, indices, fields, shards, etc. and lives in RAM). For some users the cluster state simply got too big which severely affected performance. As a safety measurement Elasticsearch by default does not allow you to have more than 1000 fields in a single index.
Depending on mapping type and optional mapping parameters Elasticsearch will create one or more data structures for every single field you store. There are less and more "expensive" types, e.g. keyword and boolean are rather "cheap" types, whereas "text" (for full text search) is a rather expensive type, as it also requires preprocessing (analysis) of your strings. By default Elasticsearch maps strings to a multifield made up of 2 fields: one that goes by <fieldname> which is of type text and supports full-text search, and one that goes by <fieldname>.keyword of type keyword which only supports exact match search. On top of keyword fields and some other field types allow you to do analytics and use them for sorting. If you don't need one or the other, then please customize your mapping by storing it only for the use case you need. It makes a huge difference if you only need to display a field (no need to create any additional data structures for that field), whether you need to be able to search in a field (requiring specific data structures like inverted indices), or whether you need to do analytics on a field or sort by that field (requiring specific data structures like doc_values). Besides the Elasticsearch fields you specify in your mapping with a type you also can control the data structures that should get created with the following mapping-parameters: index, doc_values, enabled (just to name a few)
At search time it also makes a difference over how many fields you are searching and how big your index is. The fewer fields, the smaller the index, the better for fast search requests.
Conclusion:
So, your idea to customize your mapping by only storing some fields as keyword fields, some as text fields, some as multifields makes perfect sense!
As the question has several parts, I would try to answer them with official elasticsearch(ES) resources. Before that let's break what OP has in the ES index and every field use case:
{
"name":"John Doe", //used for full text search
"photoPath":"/var/www/pub/images/i1/j/d/123.jpg", // not used for search or retrival.
"passwordHash":"$12$3b$abc...354", // not used for search or retrival.
"bio":"..." //used for full text search**
}
Now as OP just mentioned photoPath and passwordHash aren't used for full-text search, I am assuming that these fields will not be used even for retrieval purposes.
So first, we need to understand what's the difference b/w indexing a field and storing the field and this is explained very well in this and this article. In short, if you have _source disabled(default is enabled), you will not be able to retrieve a field if it's not stored.
Now coming to the optimization part and improving the performance part. it's very simple that if you (send/store) more data what you actually need, then you wasting resources(nertwork,CPU,memory, disk). And ES is no different here.
Now coming to OP assumptions/questions:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types? This option definitely better than first as you are not indexing the fields which you don't need for a search, but there is still room for optimization if you don't need to retrieve them, then it's better not to store them as well as remove from index mapping.
Am I correct in my assumption that having fewer fields indexed helps performance? Yes, as this way your inverted index would be smaller and you would be able to cache more data from your inverted index to file system cache and searching in small no of data is always faster. Apart from that, it helps to improve the relevance of your search by not indexing the unnecessary fields for your search.
Am I correct in my assumption that indexing fewer fields will improve indexing performance? Explained in the previous answer.
Am I correct in my assumption that the actual search will be faster if I index only what I need? It not only improves the search speed but improves indexing speed(as there will be lesser segments and merging them takes less time)
I can add a lot more information but I wanted to keep this short and simple. Let me know if anything isn't clear and would be happy to add more information.

What will be the affect of fielddata=true when querying a ~10M document index and more questions

I have an index of ~10M docs. In each document I have a 'text' field where I put a string in and in the end I want aggregate all the terms inside this field. When trying to do that I only get the entire string.
I heard only bad things about using fielddata=true.
For this amount of documents, is it really such a bad practice to use fielddata=true in terms of memory?
Is there a difference (in terms of performance) between using an analyzer in the indexing pipeline (just set an analyzer on a specific field) to using an analyzer as a function (run analyzer on a string, get the results and put them in a document)?
Synonyms - I have defined a list of synonyms, I believe I already know the answer but still I'll give it a try, Is it possible to simply update such list of synonyms and that's it? or it's a mandatory to re-index after updating the synonyms list?
yes the lack of memory is an issue but you should test it to findout how much memory do you need. 10M is not too much doc for 32G Heap memory limit.
I didn't understand the question
at the time of creating index you should point to list (file) of synonyms words. but after that you can update the list without need to re-index. of course not simple contraction (for that you should re-index). https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

Elasticsearch - Autocomplete return word/term/token suggestions instead of whole documents

I am trying to implement a simple auto completion for query terms.
There are many different approaches but most of them do return documents instead of terms
- or the authors simply stopped explaining from that point and i am not able to adapt.
A user is typing in a query - e.g. phil
What i want is to provide a list of term completion suggestions like philipp, philius, philadelphia, ...
I am able to get document matches via (edge)ngrams, phrase_prefix and so on but i am am stuck at retrieving matching terms (completion suggestions).
Can someone give me a hint?
I have documents like this {"title":"...", "description":"...", "content":"..."}
All fields have larger string values but especially the field content contains fulltext content.
I do not want to suggest the whole title of a document containing e.g. Philadelphia. Just the word "Philadelphia".
Looking for something like that, myself.
In SOLR it was relatively simple to configure (although a pain to build and keep up-to-date) using solr.SpellCheckComponent. Somehow the same underlying Lucene functionality is used differently between SOLR and ElasticSearch, and in ElasticSearch it is geared towards finding whole documents (or whole field values, if you will) or so it seems...
Despite the profusion of "elasticsearch autocomplete" articles, none appears to deal with this particular issue. Like it doesn't exist. Maybe their use case is different and ElasticSearch works for them just fine, who knows?
At this point I think that preparing the exact field values to use with ElasticSearch autocomplete (yes, that's the input field values, not analyzer tokens) maybe the only way to solve the problem. Which is terrible, because the performance is going to be very low.
Try term suggester:
The term suggester suggests terms based on edit distance. The provided
suggest text is analyzed before terms are suggested. The suggested
terms are provided per analyzed suggest text token. The term suggester
doesn’t take the query into account that is part of request.

Elasticsearch questions: search, performance and caching

I'm new to elasticsearch, have been reading their API and some things are not clear to me
1) It is said that filters are cached. what does that mean? if i send a query with a filter on it, what gets cached? The results of that query? If i send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is ElasticSearch's documentation for this.
2) Is there a real performance difference between a query matching a term X to the "_all" field or to a specific field? As far i understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is in how many fields will be matched against X, in these documents. is that correct?
1) For your first question take a look at this link.
To quote from the post
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size."link
So if you know what fields you are going to query use specifics fields to search on.

How to write fast Elastic Search queries

Is there a guide to writing the ES queries - what to do, what to avoid, this sort of stuff. The official site describes all various ways to search, but provides little giudance as to when select what.
In my particular instance I have a list of providers, each one has a name an address and a number of IDs. I want to give the user a box he can type in anything he knows about the provider and run search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wild card matches, phonetic matches, synonyms (for names). Also some fuzziness should be included too.
The official site describes various ways to do that, but how to combine them together? For instance to support wild card search do I use wild card query, or do I index it with the NGram and do just text query?
With the SQL queries a certain way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use table scan against a table of considerable size, you know you should change your query, or, may be, add an index. AFAIK there is no equivalent for this powerful feature in ES and I am not even sure if it is possible to build it.
But at least some generic considerations...? Pretty please...
There is not a best way to go about doing things, because a lot of times it depends on what you are indexing, and how you map your data into variables within Elasticsearch.
Some rule of thumb that you should look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
"query": {
// data will be searched from this block first //
}, "facets": {
// after the data is received, it will be processed into facets //
}
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters do a subset of your queries, meaning it will take the entire result of what your query is, and then filter out what you do want or what you do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries
Again, all these are rule of thumbs and cannot be followed strictly, because what you index into specific variables will affect processing times. A string is different from long types, and strings with analyzers are different from non-analyzers. What you need to do is probably to experiment with your queries to get a better judgement.
One correction from the above - Filters are cacheable by ES, and not queries. Queries does the extra step of relevance scoring & full text search. So, where ever full text search is not needed using filter is advised.
Also, design your mappings with correct index values (not_analyzed, no, analyzed)

Resources