Recommended way to store data in elasticsearch - elasticsearch

I want to use elasticsearch on my backend and I have few questions:
My DB contains semi-structured data of products, i.e. each product may have different attributes inside it.
I want to be able to search a text on most of the fields and also search a text on one specific field.
What is the recommended way to store the document in ES ? to store all text in on field (maybe using _all feature) or leave it in different fields.
My concern of different fields is that I might have a lot of indexes (because I have many different product attributes)
I'm using couchbase as my main DB.
What is the recommended way to move the documents from it to ES, assuming I need to make some modifications on the document ?
To update the index from my code explicitly or use external tool ?
10x,

It depends on how many docs you are indexing at a time. If the number of docs are like >2million. Then it's better to store everything in one field , which will save time while indexing.
If the docs indexed are very less, then index them field by field and then search on _all field. This will give a clear view on the data and will be really helpful for what to display and what not to display.

Related

why did elasticsearch designed "store" field?

By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
I am curious how does the implementation work on Elasticsearch backend works. How can they make a value not retrievable but searchable? (I would imagine it would need to be stored somewhere in order for you to search it right?) Why is Elasticsearch designed this way? what efficiency did it achieve for designing it this way?
The source document is actually "stored" in the _source field (but it is not indexed) and all fields of the source documents are indexed (but not stored). All field values can usually be retrieved from the _source field using source filtering. This is how ES is configured by default, but you're free to change that.
You can, for instance, decide to not store the _source document at all and store only certain fields of your document. This might be a good idea if for instance your document has a field which contains a huge blob of text. It might not be wise to store the _source because that would take a lot of space for nothing. That huge blob of text might only be useful for full-text search and so would only need to be indexed, while all other fields might need to be indexed and stored as well because they need to be retrieved in order to be displayed.
So the bottom line is:
if a field can be searched, it doesn't need to be stored, it only needs to be indexed
if a field can be retrieved, it can either be configured to be stored or retrieved/filtered from the _source field (which is stored by default)

which is the best way to create types in terms of performance in elasticsearch

i have a RDBMS tables having multiple columns and its hetrogenous and need to create an index in elasticsearch from these tables. So which is the best practise intems of creation of types in elasticsearch. i was thinking about the multiple option
1) create types as same as rdbms tables and add documents as same as records in table
2) create a type with two fileds, in which one of the field for identification of that document and other field will be the concatenation of tables columns vales. So in this way only two fileds will be there across the all tables and search on the one field.
So could you let me know, which is the best way to create the types. please let me know, if need more info.
Index the data in the form that best facilitates your search requirements.
If it makes sense to combine everything into one big searchable everything field, do that (by default, elasticsearch already does this, in addition to the separate fields you index). If you are going to regret not being able to separate them when you need to search for data from one particular column, go the other route. If you need to know whether a document has been matched based on it's title or it's body, for instance, they should be indexed in different fields.
The best way to cripple your performance is trying to kludge together queries for things your index's structure doesn't support well.

What's the difference between source filtering and the fields option in the elasticsearch get API?

I'm confused between source filtering (i.e. using the _source_include parameter) and the fields option of the GET API in elasticsearch. How are they different in terms of performance? When are they supposed to be used?
Update: re: fields
Note that this is the 1.x documentation if you just arrived here from the future.
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to false), it will load the _source and extract it from it. This functionality has been replaced by the source filtering parameter.
-- https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-fields.html#search-request-fields
AFAICT:
_source tells elasticsearch whether to include the source of matched documents in the response. The "source" is the data in the document as it was inserted.
fields tells elasticsearch to include source, but only include the defined fields.
Permformance: Unless you have low bandwidth to the Elasticsearch server, it might be negligible.
I had the same doubt, here I found what can be the answer.
fields restricts the fields whose contents are parsed and returned
_source_filtering restricts the fields which are returned
Another way of seeing it is to think that fields is used to optimize data transfer and CPU usage while _source_filtering only optimizes data transfer
Source filtering allows us to control which parts of the original JSON document are returned for each hit[...]It's worth keeping in mind that this only saves us on bandwidth costs between the nodes participating in the search as well as the client, not CPU or Disk, as was the case when using fields.
In addition:
One feature about fields that's not commonly known is the ability to select metadata-fields as well. Of particular note is its ability to select the _ttl-field, which actually returns the number of milliseconds until the document expires, not the original lifespan of the document. A very handy feature indeed.
The fields parameter applies only to stored fields. From the 2.3 documentation:
Besides indexing the values of a field, you can also choose to store
the original field value for later retrieval. Users with a Lucene
background use stored fields to choose which fields they would like to
be able to return in their search results. In fact the _source field
is a stored field. In Elasticsearch, setting individual document
fields to be stored is usually a false optimization. The whole
document is already stored as the _source field. It is almost always
better to just extract the fields that you need using the _source
parameter.
See source filetring for how to limit the fields returned from _source

In Elasticsearch, what happens if I set 'store' to yes on a few fields, but _source to false?

We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most of the cases the _source field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your usecase (how many fields you retrieve) it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which allows you to prevent parts (e.g. the sensitive parts of your documents) of the source field from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, but you do want to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
In both cases (either you disable the _source completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.
Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.

ElasticSearch: Impact of setting a "not_analyzed" field as "store":"yes"?

Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an Analyzer, indexed as is, but a client is able to match against it. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.
My question:
If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?
I hope that's clear enough. Thanks!
You're mixing up the concept of indexed field and stored field in lucene, the library that elasticsearch is built on top of.
A field is indexed when it goes within the inverted index, the data structure that lucene uses to provide its great and fast full text search capabilities. If you want to search on a field, you do have to index it. When you index a field you can decide whether you want to index it as it is, or you want to analyze it, which means deciding a tokenizer to apply to it, which will generate a list of tokens (words) and a list of token filters that can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only searching for that exact specific text, whitespaces included.
A field is stored when you want to be able to retrieve it. Let's say Lucene provides some kind of storage too, which doesn't have anything to do with the inverted index itself.
When you search using lucene you get back a list of document ids that match. Then you can retrieve some text from their stored fields, which is what you literally show as search results. If you don't store a field you'll never be able to get it back from lucene (this is not true for elasticsearch though, as I'm going to explain below).
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but you do want to retrieve to show them.
Therefore the two data structures are not related to each other. If you both index and store a field in lucene, its content will not be present twice in the same form. Stored fields are stored as they are, as you send them to lucene, while indexed fields might be analyzed and will be part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by lucene document id), while indexed fields are made to search, in such a structure that literally inverts the text having as a result each term as key, together with a list of document ids that contain it (the postings list).
When it comes to elasticsearch things change a little though. When you don't configure a field as stored in your mapping (default is store:no) you are able to retrieve it anyway by default. This happens because elasticsearch always stores in lucene the whole source document that you send to it (unless you disable this feature) within a special lucene field, called _source.
When you search using elasticsearch you get back by default the whole source field, but you can also ask for specific fields. What happens in that case is that elasticsearch checks whether those specific fields are stored or not in lucene. If they are the content will be retrieved from lucene, otherwise the _source stored field will be retrieved from lucene, parsed as json (pull parsing) and those specific fields will be extracted. In the first case it might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in lucene would probably make the loading process faster; on the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which would lead to a single disk seek, parse it etc. In most of the cases using the _source field works just fine.
To answer your question: inverted index and lucene storage are two completely different things. You end up having two copies of the same data in lucene only if you decide to store a field (store:yes in the mapping), since elasticsearch keeps that same content within the json _source, but this doesn't have anything to do with the fact that you're indexing or analyzing the field.

Resources