I'm new to Elasticsearch and have been reading through the API documentation, but some things are not clear to me.
1) It is said that filters are cached. What does that mean? If I send a query with a filter on it, what gets cached? The results of that query? If I send a different query with the same filter, will the cache help me somehow?
I know the question is somewhat vague, but so is Elasticsearch's documentation on this.
2) Is there a real performance difference between a query matching a term X against the "_all" field or against a specific field? As far as I understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is how many fields will be matched against X in those documents. Is that correct?
1) For your first question, take a look at this link.
To quote from the post:
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
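As a rough sketch (the index and field names here are made up), this is what a cacheable filter looks like inside a bool query; the term and range clauses are the parts Elasticsearch can keep as BitSets and reuse, even when the scoring part of the query changes:

POST /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch caching" } }
      ],
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "created": { "gte": "2015-01-01" } } }
      ]
    }
  }
}

If a later query keeps the same term and range filters but swaps out the match clause, the cached BitSets for those filters can be reused; only the full-text part is re-evaluated.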
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size."link
So if you know which fields you are going to query, search on those specific fields.
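For example (index and field names are hypothetical), targeting the fields you actually care about instead of _all could look like this:

POST /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "X",
      "fields": ["name", "description"]
    }
  }
}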
Related
I am trying to adapt an existing boolean query that runs query-string searches against multiple fields so that it returns the number of hits for each matched term, for each search result.
This seems like it should be a straightforward request, since the default relevancy scoring takes matched term frequencies into account on a doc-by-doc basis, and highlighting must already parse the fields to identify the positions of matched terms. But from scouring the docs, there doesn't seem to be an easy way to do this that doesn't require some additional parsing of the results returned by Elasticsearch.
I would really like to avoid making more than one call to Elasticsearch per query for performance reasons, so I would like to adapt the existing search queries if possible.
I know that the search API has an "explain" option that, when set to "true", causes a nested "_explanation" object to be returned for each search result, but most of the information in these objects is irrelevant to what I want to know (the matched term frequencies), and I haven't found a way to exclude any of that information from the search results. I am reluctant to use this option because a) I've seen advice that it should only be used for debugging purposes and not in production, and b) the queries I'm running are not for individual terms but for query strings that can contain an arbitrary number of matched terms per query, which makes the explanation objects much larger in some cases (increasing the response payload) and more complex to parse. It's also not clear whether the "_explanation" object has a well-defined structure anyway.
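For reference, the option I mean is just a flag on the search body, something like this (the index and field names are only placeholders):

POST /my_index/_search
{
  "explain": true,
  "query": {
    "query_string": {
      "query": "some arbitrary user input",
      "fields": ["title", "body"]
    }
  }
}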
I've also considered parsing highlighted text fragments to determine matched term frequencies, since I'm already returning highlighted fields as part of the query. However, again this would require some additional parsing of the API response which would uncomfortably couple the method for obtaining matched term frequencies with the custom pre- and post-tags of the highlighted fields.
Edit: Just to clarify, I would be open to a separate Elasticsearch call per search if necessary, but not any that would require submitting a set of document IDs matched from the first query, because this would mean the API calls couldn't be done in parallel and because the number of returned results in the first call could be quite high, which I presume would impact performance of the second call.
The documentation and recommendations for using the stored_fields feature in Elasticsearch have been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance on using this feature? Is _source filtering a better option? I ask because in another doc, _source filtering is said to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields without killing performance in Elasticsearch?
_source filtering is the recommended way to fetch fields; the blog post is what is confusing you, but you are missing the very important concept and use case where its warning applies. Please read the statement below carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits/search results, which can be changed with the size parameter. If you only want to fetch a few field values in your search results, then using _source filtering makes perfect sense, as it is applied to the final result set (not to all documents matching the search).
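A minimal sketch of what that looks like (the field names are illustrative); the _source filter is applied only to the hits actually returned, which is bounded by size:

POST /my_index/_search
{
  "size": 10,
  "_source": ["title", "author"],
  "query": {
    "match": { "title": "elasticsearch" }
  }
}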
Whereas if you use a script that reads field values from _source to filter the search results, the query will have to scan the whole index, which is what the second part of the statement above refers to (not when processing millions of documents).
Apart from the above, since all field values are already stored as part of the _source field (which is enabled by default), you do not need to spend extra space by explicitly marking fields as stored (disabled by default to save index size) just to retrieve field values.
I found information about how the number of shards and the number of fields in a mapping affect the performance of Elasticsearch, but I could not find any information about how indexing or not indexing a field affects cluster performance.
Imagine I have a document like:
{
  "name": "John Doe",
  "photoPath": "/var/www/public/images/i1/j/d/123456.jpg",
  "passwordHash": "$12$3b$abc...354",
  "bio": "..."
}
I need to put 10 to 100 such documents into the cluster each second. When I index such a document, I am pretty sure I will need full-text search on name and full-text search on bio. I will never search on photoPath, and I will never need full-text search on passwordHash.
When I do my mapping I have several options:
make all fields text and analyze them with a simple analyzer (i.e. tokenize on any non-alphanumeric character) - in that case I will have terms like "i1", "3b" or "123456" in my index
make name and bio text, make password hash keyword and make photoPath non-indexed
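For illustration, the second option could look roughly like this in a 7.x mapping (the index name is made up and analyzer choices are left at their defaults):

PUT /people
{
  "mappings": {
    "properties": {
      "name":         { "type": "text" },
      "bio":          { "type": "text" },
      "passwordHash": { "type": "keyword" },
      "photoPath":    { "type": "keyword", "index": false }
    }
  }
}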
So my questions are:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types?
Am I correct in my assumption that having fewer fields indexed helps performance?
Am I correct in my assumption that indexing fewer fields will improve indexing performance?
Am I correct in my assumption that actual search will be faster if I index only what I need?
Here we go with the answers:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types? --> see detailed explanation below
Am I correct in my assumption that having fewer fields indexed helps performance? --> Yes
Am I correct in my assumption that indexing fewer fields will improve indexing performance? --> Yes
Am I correct in my assumption that actual search will be faster if I index only what I need? --> Most likely
Detailed explanation:
Every index comes with a mapping in which you specify not just what data should get indexed but also how many fields your data is stored in and how to process the data before storing it. In its default configuration, Elasticsearch will dynamically create this mapping for you based on the type of data you send to it.
Every single entry in your mapping consumes some bytes, which add to the size of the cluster state (the data structure that contains all the meta-information about your cluster, such as information about nodes, indices, fields, shards, etc., and lives in RAM). For some users the cluster state simply got too big, which severely affected performance. As a safety measure, Elasticsearch by default does not allow you to have more than 1000 fields in a single index.
Depending on the mapping type and optional mapping parameters, Elasticsearch will create one or more data structures for every single field you store. There are cheaper and more "expensive" types: e.g. keyword and boolean are rather "cheap" types, whereas text (for full-text search) is a rather expensive type, as it also requires preprocessing (analysis) of your strings. By default, Elasticsearch maps strings to a multi-field made up of 2 fields: one named <fieldname>, of type text, which supports full-text search, and one named <fieldname>.keyword, of type keyword, which only supports exact-match search.

On top of that, keyword fields and some other field types allow you to do analytics and to sort on them. If you don't need one or the other, customize your mapping so that each field is stored only for the use case you need. It makes a huge difference whether you only need to display a field (no need to create any additional data structures for that field), whether you need to be able to search in a field (requiring specific data structures like inverted indices), or whether you need to do analytics on a field or sort by it (requiring specific data structures like doc_values). Besides the type you specify for each field in your mapping, you can also control which data structures get created with mapping parameters such as index, doc_values, and enabled (just to name a few).
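A hedged sketch of those parameters (the index and field names are made up; the comments just annotate the intent):

PUT /my_index
{
  "mappings": {
    "properties": {
      "title":   { "type": "text" },                         // analyzed, full-text searchable
      "status":  { "type": "keyword", "doc_values": false }, // searchable, but no sorting/aggregations
      "path":    { "type": "keyword", "index": false },      // not searchable, still sortable via doc_values
      "payload": { "type": "object", "enabled": false }      // kept in _source only, not indexed at all
    }
  }
}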
At search time it also makes a difference how many fields you are searching over and how big your index is. The fewer the fields and the smaller the index, the faster the search requests.
Conclusion:
So, your idea to customize your mapping by only storing some fields as keyword fields, some as text fields, some as multifields makes perfect sense!
As the question has several parts, I will try to answer them with official Elasticsearch (ES) resources. Before that, let's break down what the OP has in the ES index and each field's use case:
{
  "name": "John Doe",                                 // used for full-text search
  "photoPath": "/var/www/pub/images/i1/j/d/123.jpg",  // not used for search or retrieval
  "passwordHash": "$12$3b$abc...354",                 // not used for search or retrieval
  "bio": "..."                                        // used for full-text search
}
Now, as the OP mentioned that photoPath and passwordHash aren't used for full-text search, I am assuming that these fields will not be used even for retrieval purposes.
So first, we need to understand the difference between indexing a field and storing a field, which is explained very well in this and this article. In short, if you have _source disabled (it is enabled by default), you will not be able to retrieve a field unless it is stored.
Now coming to the optimization and performance part: it's very simple, if you send/store more data than you actually need, then you are wasting resources (network, CPU, memory, disk), and ES is no different here.
Now coming to the OP's assumptions/questions:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types? This option is definitely better than the first, as you are not indexing fields you don't need for search. But there is still room for optimization: if you don't need to retrieve those fields at all, it's better not to store them either and to remove them from the index mapping (see the sketch after this list).
Am I correct in my assumption that having fewer fields indexed helps performance? Yes, this way your inverted index will be smaller, more of it will fit in the filesystem cache, and searching over less data is always faster. Apart from that, it improves the relevance of your search by not indexing fields that are unnecessary for your search.
Am I correct in my assumption that indexing fewer fields will improve indexing performance? Explained in the previous answer.
Am I correct in my assumption that the actual search will be faster if I index only what I need? It not only improves search speed but also indexing speed (as there will be fewer and smaller segments, and merging them takes less time).
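A sketch of that first point (assuming, as above, that photoPath and passwordHash are never searched and never need to be retrieved): leave them out of the mapping, turn off dynamic mapping so they don't sneak back in, and exclude them from _source so they take up no space at all:

PUT /people
{
  "mappings": {
    "dynamic": false,
    "_source": {
      "excludes": ["photoPath", "passwordHash"]
    },
    "properties": {
      "name": { "type": "text" },
      "bio":  { "type": "text" }
    }
  }
}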
I can add a lot more information but I wanted to keep this short and simple. Let me know if anything isn't clear and would be happy to add more information.
In Elasticsearch, can I associate each document with a (different) function that must be satisfied by parameters I supply on a search, in order to be returned on that search?
The functions I would particularly like to use involve a loop, some kind of simple branching (if-statement or switch-statement), an array-like data structure, string comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script_score query.
- If you're only looking to filter out unwanted documents, you could use a script query that allows you to do just that.
Both of those solutions enable you to compute a score comparing incoming query parameters against previously indexed values.
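A rough sketch of both (the field name, the parameters, and the formulas are placeholders):

POST /my_index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "doc['rating'].value * params.boost",
        "params": { "boost": 2 }
      }
    }
  }
}

POST /my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['rating'].value >= params.min_rating",
            "params": { "min_rating": 3 }
          }
        }
      }
    }
  }
}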
Take note that usage of scripts at query time can lead to increased memory usage and performance issues.
Elasticsearch can also apply a second batch of filtering rules to the actual query results in the form of a post_filter. It can come in handy sometimes if you're not in a position to post-process the output at the API view level.
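For example (the category field is hypothetical), a post_filter is applied after the main query has produced its hits:

POST /my_index/_search
{
  "query": {
    "match": { "title": "shoes" }
  },
  "post_filter": {
    "term": { "category": "clearance" }
  }
}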
At index time:
There is such a thing as script fields, which let you define a script that computes a result per document based on other field values and incoming query parameters. They can be really powerful, though note they are evaluated for each hit at search time rather than being precomputed at index time. I think they might be what you are looking for.
I would not use these unless I needed field values compared against query params. The reason is that I like my indexing process to be lean and fast, so I tend to compute those kinds of values at the stream level, upstream of the actual bulk indexing request.
Although convenient, the results of those custom scripts are often achievable with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries without the risk of blowing up your memory. A good example of this is the rank_feature datatype introduced in the 7.x releases.
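As a hedged illustration of that last point (the pagerank field is made up), a rank_feature field is mapped once and then boosted at query time without any script:

PUT /my_index
{
  "mappings": {
    "properties": {
      "pagerank": { "type": "rank_feature" }
    }
  }
}

POST /my_index/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "elasticsearch" } },
      "should": { "rank_feature": { "field": "pagerank" } }
    }
  }
}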
A piece of advice: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the results it returns. Only if you really can't fit your business rules in there should scripts be your last resort.
Feel free to contact me if you still have any questions. All the best.
Is there a guide to writing ES queries - what to do, what to avoid, that sort of thing? The official site describes all the various ways to search, but provides little guidance as to when to select which.
In my particular instance I have a list of providers, each of which has a name, an address, and a number of IDs. I want to give the user a box into which they can type anything they know about the provider and run a search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wildcard matches, phonetic matches, and synonyms (for names). Some fuzziness should be included too.
The official site describes various ways to do that, but how do I combine them? For instance, to support wildcard search, do I use a wildcard query, or do I index with an nGram tokenizer and do just a text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use a table scan against a table of considerable size, you know you should change your query or maybe add an index. AFAIK there is no equivalent of this powerful feature in ES, and I am not even sure whether it would be possible to build one.
But at least some generic considerations...? Pretty please...
There is no single best way to go about doing things, because a lot of the time it depends on what you are indexing and how you map your data into fields within Elasticsearch.
Some rules of thumb that you should look out for:
a. Faceted queries in Elasticsearch work in sequence:
{
  "query": {
    // data will be searched from this block first //
  },
  "facets": {
    // after the data is received, it will be processed into facets //
  }
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on a subset of your query results, meaning they take the entire result of your query and then filter out what you do or do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries.
Again, these are all rules of thumb and cannot be followed strictly, because what you index into specific fields will affect processing times. A string is different from a long, and analyzed strings are different from non-analyzed ones. What you probably need to do is experiment with your queries to form a better judgement.
One correction to the above - filters are cacheable by ES, not queries. Queries do the extra steps of relevance scoring and full-text search. So wherever full-text search is not needed, using a filter is advised.
Also, design your mappings with the correct index values (analyzed, not_analyzed, no).