How does Elasticsearch keep the index - elasticsearch

I'm wondering how Elasticsearch searches so fast. Does it use an inverted index, and how is that index represented in memory? How is it stored on disk? How is it loaded from disk into memory? And how does it merge indexes so fast (I mean, when searching, how does it combine two postings lists so quickly)?

Elasticsearch uses Lucene to store its inverted document indexes. Lucene in turn stores the inverted index data in read-only files called segments. Each segment contains a subset of the documents. Segments are never changed once written; to delete or update documents, Elasticsearch maintains a delete/update list which overrides results coming from the read-only segments.
With this approach some segments eventually become obsolete altogether or contain only a small amount of up-to-date data. Such segments are rewritten (merged) or deleted.
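In recent Elasticsearch versions you can also trigger this rewriting by hand with the force-merge API (older versions exposed it as _optimize); a minimal sketch, where my_index is just a placeholder:
$ curl -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1'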
There is an interesting elasticsearch plugin which visualizes the segments and the rewriting process:
https://github.com/polyfractal/elasticsearch-segmentspy
To see it in action, start indexing a lot of data and watch the segment information change.
With the Segment API you can retrieve information about the segments:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-segments.html
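For example, assuming an index named my_index, either of the following should return per-segment details (the _cat variant prints a terser table):
# full JSON view of every segment in the index
$ curl 'localhost:9200/my_index/_segments?pretty'
# one line per segment: size, doc count, whether it is searchable, etc.
$ curl 'localhost:9200/_cat/segments/my_index?v'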

I'll share what I know of Elasticsearch (ES). Yes, ES uses an inverted index. Here is how it would be structured if we had a whitespace analyzer on these documents:
{
  "_id": 1,
  "text": "Hello, John"
}
AND
{
  "_id": 2,
  "text": "Bonjour, John"
}
INVERTED INDEX
Word    | Docs
--------|------
Hello   | 1
Bonjour | 2
John    | 1 & 2
This index is built at index time, and the document is allocated to a shard based on a hash of the document ID. Whenever a search request is made, a lookup is performed on all shards; the results are then merged and returned to the requester. The results are returned and merged blazingly fast thanks to the performance of the inverted index.
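As a rough sketch of the above (the index name and query are only illustrative, and recent ES versions also require the Content-Type header):
$ curl -XPUT 'localhost:9200/my_index/_doc/1' -H 'Content-Type: application/json' -d '{"text": "Hello, John"}'
$ curl -XPUT 'localhost:9200/my_index/_doc/2' -H 'Content-Type: application/json' -d '{"text": "Bonjour, John"}'
# a search for "John" resolves the term in the inverted index and returns both documents
$ curl 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d '{"query": {"match": {"text": "John"}}}'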
ES stores data within the data folder created once you have launched ES and created an index. The file structure resembles /data/clustername/nodes/...; if you look into this directory you will understand how it's organised. You can also configure how ES stores index data, for instance entirely in memory or on disk (see the sketch below).
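A minimal sketch of such a setting, assuming the index.store.type setting (the exact set of allowed values, such as niofs or mmapfs, depends on the ES version):
$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings": { "index.store.type": "niofs" }
}'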
There is plenty of information on the ES website, and there are also several published books on ES; you can see these here.

Related

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large number of documents. Right now I have the documents as RTF files plus their metadata, and all of this will be indexed in Elasticsearch. These documents are unchangeable (they can only be deleted) and I don't really expect many new documents per day. So is it a good idea to use Elasticsearch as the primary DB in this case?
Maybe I'll store the RTF files separately, but I really don't see the point of storing all this data somewhere else.
This question was answered here, so it's a good case for Elasticsearch as the primary DB.
Elasticsearch is better known as a distributed full-text search engine than as a database...
If you preserve the document _source it can be used as a database, since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index (the equivalent of a table in the relational world); there is no way to update parts of the Lucene inverted index, you have to re-index the whole document...
Elasticsearch's index survival mechanism is one of the best, meaning that if you lose a node, the index's lost replicas are automatically replicated to some of the other available nodes, so you don't need to do any manual operations...
If you do regular backups and have no requirement for the data to be available 24/7, it is completely acceptable to hold the data and the full-text index in Elasticsearch, just like in a database...
But if you need a highly available combination, I would recommend keeping the documents in MongoDB (known as one of the best distributed document stores), for example, and using Elasticsearch only for its original purpose as a full-text search engine...
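On that note, the whole-document re-indexing mentioned above can be done server-side with the _reindex API, provided _source is preserved; a minimal sketch with placeholder index names:
$ curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": "documents_v1" },
  "dest":   { "index": "documents_v2" }
}'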

Confusion about Elasticsearch

I have some confusion about Elasticsearch's index.
In some places I read that it's the equivalent of an RDBMS database, and in other places that an index is like what we have at the end of books: a list of words with the documents that contain each word.
Could someone clarify?
Thanks
An Elasticsearch cluster can contain multiple indices (databases). These indices hold multiple documents (rows), and each document has properties or fields (columns).
You can check the list of your available indices with http://localhost:9200/_cat/indices?v .
But in general (in computer science and databases) indexing means what you said:
a list of words with corresponding documents that contain the word
This structure improves the speed of data-retrieval operations on a database table, and the concept is used in many databases such as MySQL or Oracle. In Elasticsearch, by default, every field of a document is indexed (you can change this setting so that some columns/fields are not indexed; see the sketch below).
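A hedged sketch of turning indexing off for one field (ES 7+ mapping syntax; the field names are made up):
$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "title":          { "type": "text" },
      "internal_notes": { "type": "text", "index": false }
    }
  }
}'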

Elasticsearch lucene, understand code path for search

I want to understand how each of the Lucene index files (nvd, dvd, tim, doc; mainly these four) is used in an ES query.
E.g. say my index has ten docs and I am running an aggregation query. I would like to understand how ES/Lucene accesses these four files for a single query.
I am trying to see if I can make some optimizations in my system, which is mostly disk-heavy, to speed up query performance.
I looked at the ES code and understand that the QueryPhase is the most expensive part, and it seems to be doing a lot of random access to disk for the log-oriented data I have.
I now want to dive deeper at the Lucene level as well, and possibly debug the code and see it in action. The Lucene code has zero log messages in the IndexReader-related classes. Also, debugging the Lucene code directly seems unhelpful, since the unit tests don't create indexes with tim, doc, nvd, dvd files.
Any pointers?
As far as I know, ES doesn't let you tune much of the search internals; if you want to optimize search, my experience is to optimize your data layout. Here is a description of the important Lucene files
(see http://lucene.apache.org/core/7_2_1/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description):
Term Index (.tip) # IN MEMORY
Term Dictionary (.tim) # ON DISK
Frequencies (.doc) # ON DISK
Per-Document Values (.dvd, .dvm), very useful for aggregations # ON DISK
Field Index (.fdx) # IN MEMORY
Field Data (.fdt), where the data is finally fetched from disk # ON DISK
And there are some points that can optimize performance (a hedged mapping sketch follows this list):
try to use small data types, for example INTEGER or LONG values instead of STRING.
disable DocValues on unnecessary fields, and enable DocValues on the fields you want to sort/aggregate on.
include only the necessary fields in the source, like "_source": { "includes": ["some_necessary_field"] }.
only index the fields you need, using explicitly defined ES mappings.
split your data over multiple indices.
add SSDs.
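The mapping sketch mentioned above, combining a few of these points (ES 7+ syntax assumed; the field names are illustrative only):
# keep only one field in _source and disable indexing / doc values on a payload field
$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "_source": { "includes": ["some_necessary_field"] },
    "properties": {
      "some_necessary_field": { "type": "keyword" },
      "raw_payload":          { "type": "keyword", "index": false, "doc_values": false }
    }
  }
}'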

Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear on how the _type meta field differs from any regular field of an index in terms of storage/implementation. I do understand avoiding_type_gotchas.
For example, if I have 1 million records (say posts) and each post has a creation_date, how will things play out if one of my index types is creation_date itself (leading to ~1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way would my Elasticsearch query performance be affected if I used creation_date as the index type instead of a generic type such as 'post'?
I got the answer on the Elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood

elasticsearch - routing VS. indexing for query performance

I'm planning a strategy for querying millions of docs along the date and user dimensions.
Option 1 - indexing by user. routing by date.
Option 2 - indexing by date. routing by user.
What are the differences or advantages when using routing or indexing?
One of the design patterns that Shay Banon @ Elasticsearch recommends is: index by time range, route by user and use aliasing.
Create an index for each day (or a date range) and route documents on user field, so you could 'retire' older logs and you don't need queries to execute on all shards:
$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
  "mappings": {
    "user_log": {
      "_routing": {
        "required": true,
        "path": "user"
      },
      "properties": {
        "user": { "type": "string" },
        "log_time": { "type": "date" }
      }
    }
  }
}'
Create an alias to filter and route on users, so you could query for documents of user_foo:
$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "alias": "user_foo",
      "filter": { "term": { "user": "foo" } },
      "routing": "foo"
    }
  }]
}'
Create aliases for time windows, so you could query for documents this_week:
$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [
    {
      "add": {
        "index": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
        "alias": "this_week"
      }
    },
    {
      "remove": {
        "index": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
        "alias": "this_week"
      }
    }
  ]
}'
Some of the advantages of this approach:
if you search using aliases for users, you hit only shards where the users' data resides
if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
no performance implications over allocation of shards
you could 'retire' older logs by simply closing (when you close indices, they consume practically no resources) or deleting an entire index (deleting an index is simpler than deleting documents within an index)
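For example, retiring the oldest daily index from the alias example above:
# a closed index consumes practically no resources but can be re-opened later
$ curl -XPOST 'localhost:9200/user_logs_20140407/_close'
# or drop it entirely
$ curl -XDELETE 'localhost:9200/user_logs_20140407'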
Indexing is the process of parsing (tokenizing and filtering) the documents that you index and building the inverted index from them. It's like the appendix of a textbook.
When the indexed data exceeds one server's limits, instead of upgrading the server's configuration you add another server and share the data between them. This process is called sharding.
A search then runs on all shards, the results are map-reduced, and they are returned. If we group similar data together and search for specific data within that group, it reduces the processing power needed and increases speed.
Routing is used to store a group of data on particular shards. To select a field for routing: the field should be present in all docs, and it should not contain a different value for every document.
Note: routing should be used in a multi-shard environment, not on a single node. If you use routing on a single node, there is no benefit from it.
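A hedged example of indexing with an explicit routing value (the index name and document body are made up); every document routed with the same value lands on the same shard:
$ curl -XPOST 'localhost:9200/user_logs/_doc?routing=foo' -H 'Content-Type: application/json' -d '{"user": "foo", "message": "logged in"}'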
Let's define the terms first.
Indexing, in the context of Elasticsearch, can mean many things:
indexing a document: writing a new document to Elasticsearch
indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client
Judging by the context, "indexing by user" and "indexing by date" refers to having one index per user or one index per date interval (e.g. day).
Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its _id. The hash falls into the hash range of one of the shards ==> that's where the document goes.
You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).
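For example, a routed query such as the following (index and user name are placeholders) only touches the shard that the routing value foo hashes to, instead of fanning out to all shards:
$ curl 'localhost:9200/user_logs/_search?routing=foo&pretty' -H 'Content-Type: application/json' -d '{
  "query": { "term": { "user": "foo" } }
}'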
Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"
To answer, the strategy should account for:
how indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster
In practice, I've seen the following designs:
one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)
one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.
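As a hedged sketch of that last design, a legacy index template can give every time-interval index the same shard count before documents are routed by user (the names and numbers are illustrative only):
$ curl -XPUT 'localhost:9200/_template/user_logs' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["user_logs_*"],
  "settings": { "number_of_shards": 6, "number_of_replicas": 1 }
}'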
I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals.
BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)
