Elasticsearch / Lucene: Determine total bytes used by field in index

I have a Lucene index (generated by ES 6.X.X) and I want to understand how much storage space (in bytes) each of my fields is consuming.
Does anyone know how I can figure out how many bytes are required to store all of the values for a given field across all documents? I have the index opened up in Luke, but I am new to that tool and it's not clear to me how I can answer this question with it. Is there a better tool for this?
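(For anyone landing here on a newer version: Elasticsearch 7.15+ ships an "analyze index disk usage" API that reports per-field storage directly; on 6.x you are limited to Lucene-level tools such as Luke. The index name below is a placeholder.)
POST /my_index/_disk_usage?run_expensive_tasks=true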

Related

Advice on efficient ElasticSearch document design

I'm working on a project that deals with listings (think: Craigslist, eBay, Trulia, etc.).
The basic unit of information is a "Listing", something like this:
{
"id": 1,
"title": "Awesome apartment!",
"price": 1000000,
// other stuff
}
Some fields can be searched on (e.g. price, location), others are just for display purposes in the application (e.g. title, or description, which contains lots of HTML).
My question is: should I store all the data in one document, or split it into two (one for searching, e.g. 'ListingSearchIndex', and one for display, e.g. 'ListingIndex')?
I also have to do some pretty hefty aggregations across the documents.
I guess the question is: would searching across smaller documents and then making another call to fetch the results by ID be faster than searching across the full documents?
The main factor is obviously speed, but if I split the documents then maintenance would be a factor too.
Any suggestions on best practices?
Thanks :)
In my experience with Elasticsearch, shard configuration has been a significant factor in cluster performance and speed when querying, aggregating, etc. Since every shard consumes cluster resources (memory/CPU) and adds to cluster overhead, it is important to get the shard count right so the cluster is not overloaded. Our cluster was over-sharded and it impacted loading search results, visualizations, heavy aggregations, etc. Once we fixed our shard count it worked flawlessly!
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
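A quick way to sanity-check shard counts and sizes against those guidelines is the cat APIs, for example (the column selection is just one option):
GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size
GET /_cat/shards?v&h=index,shard,prirep,store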
Besides performance, I think there are other aspects to consider here.
ElasticSearch offers weaker guarantees in terms of correctness and robustness than other databases (on this topic see their blog post Elasticsearch as a NoSQL Database). Its focus is on search, and search performance.
For those reasons, as they mention in the blog post above:
Elasticsearch is commonly used in addition to another database
One way to go about following that pattern:
Store your data in a primary database (e.g. a relational DB)
Index only what you need for your search and aggregations, and to link search results back to items in your primary DB
Get what you need from the primary DB before displaying - i.e. the data for display should mostly come from the primary DB.
The gist of this approach is to not treat ElasticSearch as a source of truth; and instead have another source of truth that you index data from.
Another advantage of doing things that way is that you can easily reindex from your primary DB when you change your index mapping for a new search use case (or when changing index-time processing such as analyzers).
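To make the pattern concrete, a search-only index for the listings above might map just the fields used for search and aggregation plus the primary-DB key used to fetch the full record for display. Index name, field names, and types here are assumptions, using 6.x mapping syntax:
PUT /listing_search
{
  "mappings": {
    "_doc": {
      "properties": {
        "listing_id": { "type": "keyword" },
        "price":      { "type": "integer" },
        "location":   { "type": "geo_point" }
      }
    }
  }
}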
I think you can't answer this question without knowing all your queries in advance. For example, suppose you split into two documents and later decide that you need to filter on a field stored in one index and sort by a field stored in the other. That would be a big problem!
So my advice: if you are not sure where you are heading, just put everything in one index. You can always reindex and remodel later.
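Reindexing later can be done with the reindex API in recent versions; the index names here are just placeholders:
POST /_reindex
{
  "source": { "index": "listings" },
  "dest":   { "index": "listings_v2" }
}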

Reducing the size of elasticsearch index

I've just taken over a project that maintains some Elasticsearch indexes, and I'm new to the area. The indexes have grown massive in size. When I run:
GET /_cat/indices?v
I can see multiple multi-terabyte indexes. This is costing a lot in AWS fees. The last engineer who worked on this (no longer with the company) left some notes saying the n-gram configuration was causing the index to grow to a massive size. How can I find out how n-grams are set up on the index? When I run:
GET /my_index/_mapping
and get the mapping information, I don't see any mention of n-grams for any of the fields. How can I see this information? I can see in the indexer code that it's set up to have a min ngram of 2 and a max ngram of 12. Shouldn't this be in the mapping data returned from the call above?
Also, what other analysis can I perform on the indexes to get better insight into their size, and what can be done to help reduce their footprint?
Thanks.
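One place worth checking: ngram tokenizers and filters are normally defined in the index analysis settings rather than in the field mappings (the mapping only references the analyzer by name), so they show up under:
GET /my_index/_settings
with the filter definition looking roughly like this (the filter name is hypothetical; the 2/12 values are the ones mentioned above):
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 12
    }
  }
}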

ElasticSearch - Setting readonly to an index improves performance?

I have this scenario where a system generates an index every day inside a node. Each index has 1 primary shard.
So all the documents indexed in a given day go to a certain index, and after the day has passed, a new index is created.
I keep the indices of the last 60 days (which means I always have 60 shards in the node). I can't close the old indices because I want them to support searches. After the 60th day has passed, I delete them.
As I was reading the following article I noticed this about the index buffer:
It defaults to 10%, meaning that 10% of the total memory allocated to a node will be used as the indexing buffer size. This amount is then divided between all the different shards
This means that the current day's index only gets 10% / 60 of the indexing buffer memory, so I am not really using the full 10%.
The question is, what happens if I set the older indices to read only:
index.blocks.read_only
Set to true to make the index and index metadata read only, false to allow writes and metadata changes.
Will I see a benefit from doing this? Like having the entire 10% of the index buffer available to my only writable index? Or an improvement in searches on the other indices, since they can be merged into a single segment as they no longer receive writes?
Thanks!
Pablo
PS: I'm using Elasticsearch 1.7.3, but I'm looking to migrate to 2.2 in the future. If possible, I would also like to know whether the benefit would carry over to 2.2.
No, you won't get a benefit from adding the read_only block. The only way you would free up the buffer is to close the index entirely.
Also, you may be interested to know about https://github.com/elastic/elasticsearch/pull/14121 which gives "actively indexing" shards a bigger share of the buffer. Coming in Elasticsearch 5.0.
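For reference, on 1.x the operations discussed here would look roughly like the following (the index name is made up). The optimize call is what merges an old, no-longer-written index down to a single segment, and per the answer above only closing actually frees its share of the indexing buffer:
PUT /logs-2016-01-10/_settings
{ "index.blocks.read_only": true }

POST /logs-2016-01-10/_optimize?max_num_segments=1

POST /logs-2016-01-10/_close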

Efficient way to search and sort data with elasticsearch as a datastore

We are using Elasticsearch as a primary data store, and our indexing strategy is time-based (for example, we create an index every 6 hours - configurable). The search-sort queries that come to our application contain a time range, and based on the input time range we calculate which indices need to be used for searching the data.
Now, if the input time range is large - let's say 6 months - and we delegate the search-sort query to Elasticsearch, then Elasticsearch will load all the documents into memory, which could drastically increase heap usage (we have a limit on the heap size).
One way to deal with the above problem is to get the data index by index and sort it in our application; indices are opened/closed accordingly; for example, only the latest 4 indices are open all the time and the remaining indices are opened/closed based on need. I'm wondering if there is a better way to handle the problem at hand.
UPDATE
Option 1
Instead of opening and closing indexes you could experiment with limiting the field data cache size.
You could limit the field data cache to a percentage of the JVM heap size or to a specific size, for example 10GB. Once field data is loaded into the cache it is not removed unless you specifically limit the cache size. Setting a limit will evict the oldest data in the cache and so avoid an OutOfMemoryException.
You might not get great performance but then it might not be worse than opening and closing indexes and would remove a lot of complexity.
Take into account that Elasticsearch loads the field data for all of the documents in the index when it performs a sort, so whatever limit you set should be big enough to load that data into memory.
See limiting field data cache size
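That limit is a static node-level setting in elasticsearch.yml, for example (the value here is just an illustration):
indices.fielddata.cache.size: 10%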
Option 2
Doc Values
This means writing the necessary metadata to disk at index time, so the "fielddata" required for sorting lives on disk and not in memory. It is not a huge amount slower than using in-memory fielddata, and in fact can alleviate problems with garbage collection as less data is loaded into memory. There are some limitations, such as string fields needing to be not_analyzed.
You could use a mixed approach and enable doc values on your older indexes and use faster and more flexible fielddata on current indexes (if you could classify your indexes in that way). That way you don't penalize the queries on "active" data.
See Doc Values documentation
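On the 1.x/2.x mapping syntax referenced here, enabling doc values on a sortable string field would look something like this (the field name is hypothetical):
"status": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true
}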

Elasticsearch - implications of splitting documents into separate indexes

Let's say I have 100,000 documents from different customer groups, which are formatted the same with the same type of information.
Documents from individual customer groups get refreshed at different times of the day. I've been recommended to give each customer group its own index, so that when an individual customer's data is refreshed locally I can create a new index for that customer and delete the old one.
What are the implications for splitting the data into multiple indexes and querying using an alias? Specifically:
Will it increase my server HDD requirements?
Will it increase my server RAM requirements?
Will elasticsearch be slower to search by querying the alias to query all the indexes?
Thank you for any help or advice.
Every index has some overhead on all levels but it's usually small. For 100,000 documents I would question the need for splitting unless these documents are very large. In general each added index will:
Require some amount of RAM for insert buffers and other per-index related tasks
Have its own merge overhead on disk relative to a larger single index
Provide some latency increase at query time due to result merging if a query spans multiple indexes
There are a lot of factors that go into determining if any of these are significant. If you have lots of RAM and several CPUs and SSDs then you may be fine.
I would advise you to build a solution that uses the minimum number of shards as possible. That probably means one (or at least only a few) index(es).
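For reference, the per-customer refresh described in the question is typically done with an atomic alias swap, roughly like this (index and alias names are made up):
POST /_aliases
{
  "actions": [
    { "remove": { "index": "customer_a_v1", "alias": "customer_a" } },
    { "add":    { "index": "customer_a_v2", "alias": "customer_a" } }
  ]
}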
