Is there a way to limit a field to a certain number of characters when getting results from Elasticsearch? I know how to limit my results to a specific set of fields, but I don't see how to get just a piece of the data. I would like to receive just the first 100 characters to display a preview of data and limit I/O.
I have seen that highlighting gives the option of setting a fragment size, but I am not necessarily querying for anything from the field I want a substring of.
Elasticsearch doesn't provide such an option. The ideal way to achieve this kind of scenario is to change the way you are indexing data and may be store a snippet along with the long length fields. ( #foresightyj has provided good links to exclude/include fields at indexing/querying time)
Related
I have a requirement where I need to page through a big set of results. Now, I understand how primitive paging is not a good idea here considering that if I have to search let's say 10th Page, elastic search would have to load all 10 pages in memory, sort them and then aggregate to give the results which is not an ideal situation.
However, when using search after, we provide the last sorted value from first page and we basically tell elastic search - "Give me results which are beyond this field". My question here is how is this performant? As I understand it is the same as adding a new filter to your original query which says that next set of results should have the sort value greater than last page. Is that all that's there to it or is there something more that I am missing?
To make it short, normal pagination with from/size will always need to keep track of all hits from the from parameter of the very first request (i.e. increasing with each new pagination).
Whereas, using search_after it is not necessary to do so as the amount of data to keep track of is only as big as the size parameter (i.e. constant with each pagination).
If you want to dig into more details, I suggest you have a look at the following tickets:
#4940: Improve scroll search by using Lucene's IndexSearcher#searchAfter
#8192: Search: Expose Lucene's searchAfter in the search API
#16125: Add search_after parameter in the SearchAPI
The documentation and recommendation for using stored_fields feature in ElasticSearch has been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Where as in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance in using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?
source filtering is the recommended way to fetch the fields and you are getting confused due to the blog, but you seem to miss the very important concept and use-case where it is applicable. Please read the below statement carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, elasticsearch returns only 10 hits/search results which can be changed based on the size parameter and if in your search results, you want to fetch few fields value than using source_filter makes perfect sense as it's done on the final result set(not all the documents matching search results),
While if you use the script, and using source value try to read field-value and filter the search result, this will cause queries to scan all the index which is the second part of the above-mentioned statement(not when processing millions of documents.)
Apart from the above, as all the field values are already stored as part of _source field which is enabled by default, you need not allocate extra space if you explicitly mark few fields as stored(disabled by default to save the index size) to retrieve field-values.
I'm confused between source filtering (i.e. using the _source_include parameter) and the fields option of the GET API in elasticsearch. How are they different in terms of performance? When are they supposed to be used?
Update: re: fields
Note that this is the 1.x documentation if you just arrived here from the future.
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
-- https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-fields.html#search-request-fields
AFAICT:
_source tells elasticsearch whether to include the source of matched documents in the response. The "source" is the data in the document as it was inserted.
fields tells elasticsearch to include source, but only include the defined fields.
Permformance: Unless you have low bandwidth to the Elasticsearch server, it might be negligible.
I had the same doubt, here I found what can be the answer.
fields restricts the fields whose contents are parsed and returned
_source_filtering restricts the fields which are returned
Another way of seeing it is to think that fields is used to optimize data transfer and CPU usage while _source_filtering only optimizes data transfer
Source filtering allows us to control which parts of the original JSON document are returned for each hit[...]It's worth keeping in mind that this only saves us on bandwidth costs between the nodes participating in the search as well as the client, not CPU or Disk, as was the case when using fields.
In addition:
One feature about fields that's not commonly known is the ability to select metadata-fields as well. Of particular note is its ability to select the _ttl-field, which actually returns the number of milliseconds until the document expires, not the original lifespan of the document. A very handy feature indeed.
The fields parameter applies only to stored fields. From the 2.3 documentation:
Besides indexing the values of a field, you can also choose to store
the original field value for later retrieval. Users with a Lucene
background use stored fields to choose which fields they would like to
be able to return in their search results. In fact the _source field
is a stored field. In Elasticsearch, setting individual document
fields to be stored is usually a false optimization. The whole
document is already stored as the _source field. It is almost always
better to just extract the fields that you need using the _source
parameter.
See source filetring for how to limit the fields returned from _source
I'm new to elasticsearch, have been reading their API and some things are not clear to me
1) It is said that filters are cached. what does that mean? if i send a query with a filter on it, what gets cached? The results of that query? If i send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is ElasticSearch's documentation for this.
2) Is there a real performance difference between a query matching a term X to the "_all" field or to a specific field? As far i understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is in how many fields will be matched against X, in these documents. is that correct?
1) For your first question take a look at this link.
To quote from the post
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size."link
So if you know what fields you are going to query use specifics fields to search on.
We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most of the cases the _source field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
In both cases (either you disable the _source completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.
Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.