When we talk about an inverted index, we usually mean indexing unstructured text documents. But documents in ElasticSearch are in JSON format, so they are "key"-"value" pairs. I want to know what the inverted index of JSON documents looks like. In other words, when we do a search like "select * from table where name = john", what does ES do?
An inverted index basically stores a relationship between terms and the documents/fields they were found in. Those terms can come from unstructured text, but not only from there. A JSON document also contains text, which ES analyzes and indexes.
Basically, from a 30,000-foot perspective, the way it works is that ES parses the JSON documents it receives, iterates over all fields and analyzes/tokenizes the value of each field. The tokens that come out of this analysis process are then indexed into the inverted index.
Long story short, it doesn't have to be unstructured text that gets indexed into an inverted index. It can also be a JSON document, which may contain structured and unstructured text as well as numbers, dates, etc.
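To make the parse, iterate and tokenize flow above more concrete, here is a minimal Python sketch. It is a toy model, not how Elasticsearch or Lucene is implemented: the `analyze` and `build_inverted_index` helpers and the sample documents are made up for illustration, and every value is treated as text, which real Elasticsearch does not do for numbers and dates.

```python
import re
from collections import defaultdict

def analyze(value):
    """Toy analyzer (assumption): lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", str(value).lower()) if t]

def build_inverted_index(docs):
    """Map (field, term) -> sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():      # iterate over all fields
            for term in analyze(value):       # tokenize each field value
                postings[(field, term)].add(doc_id)
    return {key: sorted(ids) for key, ids in postings.items()}

# Hypothetical JSON documents, keyed by document id.
docs = {
    1: {"name": "John", "bio": "John likes search engines"},
    2: {"name": "Jane", "bio": "Jane met John in 2015"},
}
index = build_inverted_index(docs)

# Rough equivalent of "select * from table where name = john":
# look up the term "john" in the postings of the "name" field only.
matches = index.get(("name", "john"), [])
print(matches)
```

Keeping the field name as part of the key is what lets a search on "name" ignore the occurrence of "john" in the "bio" field of document 2.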
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
I am curious how the implementation works on the Elasticsearch backend. How can a value be searchable but not retrievable? (I would imagine it needs to be stored somewhere in order for you to search it, right?) Why is Elasticsearch designed this way, and what efficiency does this design achieve?
The source document is actually "stored" in the _source field (but it is not indexed) and all fields of the source documents are indexed (but not stored). All field values can usually be retrieved from the _source field using source filtering. This is how ES is configured by default, but you're free to change that.
You can, for instance, decide to not store the _source document at all and store only certain fields of your document. This might be a good idea if for instance your document has a field which contains a huge blob of text. It might not be wise to store the _source because that would take a lot of space for nothing. That huge blob of text might only be useful for full-text search and so would only need to be indexed, while all other fields might need to be indexed and stored as well because they need to be retrieved in order to be displayed.
So the bottom line is:
if a field only needs to be searched, it doesn't need to be stored, it only needs to be indexed
if a field needs to be retrieved, it can either be configured as stored or be retrieved/filtered from the _source field (which is stored by default)
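As an illustration of the "huge blob of text" case described above, the following sketch builds an index mapping as a Python dict and prints it as JSON. The field names `title` and `body` are hypothetical; `_source.enabled` and `store` are the mapping settings I am assuming apply here, so check them against the documentation of your Elasticsearch version before relying on this.

```python
import json

# Hypothetical mapping: field names are made up for illustration.
mapping = {
    "mappings": {
        # Do not keep the original JSON document at all.
        "_source": {"enabled": False},
        "properties": {
            # Indexed and stored: searchable and retrievable on its own.
            "title": {"type": "text", "store": True},
            # Indexed only (store defaults to false): searchable,
            # but its value cannot be returned.
            "body": {"type": "text"},
        },
    }
}

print(json.dumps(mapping, indent=2))
```

With a mapping like this, the large `body` text only costs space in the inverted index, while `title` can still be displayed in results.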
In Lucene I want to store the full document as well, stored only and not analyzed. What I want is something like _source in Elasticsearch.
But I'm confused about what the best data type in Lucene would be for such data.
What field type should I use in Lucene to store such data? Should it be a StringField or something else?
I think Elasticsearch stores _source as hex data. Not sure though.
Which data type would take less space and still be fast enough to retrieve?
As per this part of the doc, it seems that Lucene treats each and every data type as:
opaque bytes
which suggests that it doesn't really matter which field type you pick as long as it fits your requirement, since Lucene converts the values anyway.
So deciding which data type the field should be depends entirely on how you want your fields to be represented in the index and also on how you are going to use them to visualize graphs in Kibana. Hope it helps!
I am new to Solr and below is my requirement.
I have loads of emails stored in text format (semi-structured).
Using Solr, I have to index these documents so that when I search for a particular string (it could be a name), Solr returns the entire matching document(s) as a response.
Kindly let me know how to do this in Solr. Is it advisable to store indexes in HDFS?
Solr can store the original representation of a field with the stored flag. So, you could store your text in one field and index it, or split it and index it in multiple fields.
However, you may be better off storing those documents outside of Solr and structuring the content in Solr specifically for searching. Then, your middleware combines the results returned from Solr with the original documents stored elsewhere.
The bigger the emails are, the more it pays off to store them outside of Solr.
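The "store outside, search inside" pattern can be sketched as follows. This is only a toy model with no Solr involved: `email_store` stands in for whatever external storage you choose, `search_index` stands in for the search engine, and `search` plays the role of the middleware. All three names and the sample emails are hypothetical.

```python
from collections import defaultdict

# Hypothetical external store: message id -> raw email text.
email_store = {
    "msg-1": "From: alice@example.com\nSubject: Lunch\n\nHi John, lunch at noon?",
    "msg-2": "From: bob@example.com\nSubject: Report\n\nThe report is attached.",
}

# Stand-in for the search index: it holds only terms and ids, not the emails.
search_index = defaultdict(set)
for msg_id, raw in email_store.items():
    for word in raw.lower().split():
        search_index[word.strip(".,?!:")].add(msg_id)

def search(term):
    """Middleware step: get matching ids, then fetch originals from the store."""
    ids = sorted(search_index.get(term.lower(), set()))
    return [email_store[i] for i in ids]

results = search("John")
print(results)
```

The index stays small because it never holds the full email bodies, and the originals are fetched only for the ids that actually matched.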
The difference between _source and _all, which both seem to hold all of the fields, eludes me.
If my document has:
{"mydoc":
  {"properties":
    {"name":{"type":"string","store":"true"},
     "number":{"type":"long","store":"false"},
     "title":{"type":"string","include_in_all":"false","store":"true"}
    }
  }
}
I understand that _source is a field that has all the fields. But so does _all?
Does this mean that "name" is saved several times (twice? in _source and in _all), increasing the disk space the document takes?
Is "name" stored once for the field, once for _source, and once for _all?
What about "number", is it stored in _all, even though not in _source?
When should I use _source in my query, and when _all?
What is the use case where I can disable _all, and what functionality would then be denied?
It's pretty much the same as the difference between indexed fields and stored fields in lucene.
You use indexed fields when you want to search on them, while you store fields that you want to return as search results.
The _source field is meant to store the whole source document that was originally sent to elasticsearch. It's used as the search result, to be retrieved. You can't search on it. In fact it is a stored field in lucene and is not indexed.
The _all field is meant to index all the content that comes from all the fields that your documents are composed of. You can search on it but never return it, since it's indexed but not stored in lucene.
There's no redundancy: the two fields are meant for different use cases and are stored in different places within the lucene index. The _all field becomes part of what we call the inverted index, used to index text and to execute full-text search against it, while the _source field is just stored as part of the lucene documents.
You would never use the _source field in your queries, only when you get back results, since that's what elasticsearch returns by default. There are a few features that depend on the _source field and that you lose if you disable it. One of them is the update API. Also, if you disable it, you need to remember to configure all the fields that you want to return as search results as store:yes in your mapping. I would rather say don't disable it unless it bothers you, since it's really helpful in a lot of cases. One other common use case is when you need to reindex your data: you can just retrieve all your documents from elasticsearch itself and resend them to another index.
On the other hand, the _all field is just a default catch-all field that you can use when you want to search on all available fields and don't want to specify them all in your queries. It's handy, but I wouldn't rely on it too much in production, where it's better to run more complex queries on different fields, each with a different weight. You might want to disable it if you don't use it; this will have a smaller impact than disabling the _source, in my opinion.
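The split between the two fields can be modeled with a small Python sketch. It is a simplification under stated assumptions: `index_document` and the `include_in_all` table are hypothetical helpers that mirror the mapping from the question, and the real _all field goes through a proper analyzer rather than a plain lowercase split.

```python
import json

# Hypothetical per-field setting mirroring the mapping in the question.
include_in_all = {"name": True, "number": True, "title": False}

def index_document(doc):
    """Return (source, all_terms) for one document."""
    source = json.dumps(doc)   # kept verbatim for retrieval, never searched
    all_terms = set()          # searched, never returned
    for field, value in doc.items():
        if include_in_all.get(field, True):
            all_terms.update(str(value).lower().split())
    return source, all_terms

source, all_terms = index_document(
    {"name": "John Smith", "number": 42, "title": "Engineer"}
)
print(sorted(all_terms))
print(source)
```

In this model "title" is missing from the catch-all terms because of include_in_all, yet it is still present in the source, since the source is simply the original document.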
Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an Analyzer, indexed as is, but a client is able to match against it. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.
My question:
If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?
I hope that's clear enough. Thanks!
You're mixing up the concepts of indexed fields and stored fields in lucene, the library that elasticsearch is built on top of.
A field is indexed when it goes into the inverted index, the data structure that lucene uses to provide its fast full-text search capabilities. If you want to search on a field, you have to index it. When you index a field you can decide whether to index it as it is or to analyze it. Analyzing means choosing a tokenizer, which generates a list of tokens (words), and a list of token filters, which can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only by searching for that exact text, whitespace included.
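The difference between analyzing a value and indexing it as is can be sketched in a few lines of Python. The `index_terms` helper is hypothetical, and its lowercase-and-split logic is only a stand-in for a real tokenizer and token filters.

```python
def index_terms(value, analyzed):
    """Return the terms that would go into the inverted index for one value."""
    if analyzed:
        # Toy analysis (assumption): lowercase filter plus whitespace tokenizer.
        return value.lower().split()
    # Not analyzed: the whole value becomes a single term, unchanged.
    return [value]

title = "Quick Brown Fox"
analyzed_terms = index_terms(title, analyzed=True)
exact_terms = index_terms(title, analyzed=False)

print(analyzed_terms)
print(exact_terms)
```

A search for "brown" can match the analyzed version because that term exists in the index, while the not-analyzed version matches only the exact string "Quick Brown Fox".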
A field is stored when you want to be able to retrieve it. Let's say Lucene provides some kind of storage too, which doesn't have anything to do with the inverted index itself.
When you search using lucene you get back a list of document ids that match. Then you can retrieve some text from their stored fields, which is what you literally show as search results. If you don't store a field you'll never be able to get it back from lucene (this is not true for elasticsearch though, as I'm going to explain below).
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but do want to retrieve in order to show them: stored and not indexed.
Therefore the two data structures are not related to each other. If you both index and store a field in lucene, its content will not be present twice in the same form. Stored fields are stored as they are, as you send them to lucene, while indexed fields might be analyzed and will be part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by lucene document id), while indexed fields are made to search, in such a structure that literally inverts the text having as a result each term as key, together with a list of document ids that contain it (the postings list).
When it comes to elasticsearch things change a little though. When you don't configure a field as stored in your mapping (default is store:no) you are able to retrieve it anyway by default. This happens because elasticsearch always stores in lucene the whole source document that you send to it (unless you disable this feature) within a special lucene field, called _source.
When you search using elasticsearch you get back the whole source field by default, but you can also ask for specific fields. What happens in that case is that elasticsearch checks whether those specific fields are stored in lucene or not. If they are, the content will be retrieved from lucene; otherwise the _source stored field will be retrieved from lucene, parsed as JSON (pull parsing), and those specific fields will be extracted. The first case might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in lucene would probably make the loading process faster. On the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which leads to a single disk seek, and parse it. In most cases using the _source field works just fine.
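The lookup order described above (stored field first, otherwise extract from _source) can be sketched as follows. This is a simplified model and not Elasticsearch's actual code: `load_fields` is a hypothetical helper, a plain dict stands in for the stored fields of one lucene document, and the per-field fallback is my own simplification.

```python
import json

def load_fields(stored_fields, wanted):
    """Return the wanted fields, preferring individually stored ones."""
    # First take whatever is available as its own stored field.
    result = {name: stored_fields[name] for name in wanted if name in stored_fields}
    missing = [name for name in wanted if name not in result]
    # Otherwise fall back to the stored _source and extract from the JSON.
    if missing and "_source" in stored_fields:
        source = json.loads(stored_fields["_source"])  # parse the JSON once
        for name in missing:
            if name in source:
                result[name] = source[name]
    return result

# Hypothetical stored fields of one document: "title" was mapped as stored.
doc = {
    "_source": json.dumps({"title": "Hello", "views": 10, "body": "long text"}),
    "title": "Hello",
}
fields = load_fields(doc, ["title", "views"])
print(fields)
```

Here "title" is read directly, while "views" requires loading and parsing the whole _source, which is the cost the answer refers to when the source is large.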
To answer your question: inverted index and lucene storage are two completely different things. You end up having two copies of the same data in lucene only if you decide to store a field (store:yes in the mapping), since elasticsearch keeps that same content within the json _source, but this doesn't have anything to do with the fact that you're indexing or analyzing the field.