Multiple Elasticsearch indexes with a single node - elasticsearch

I'm using Elasticsearch as a centralized logging platform. As most examples show, I've been logging to multiple indexes time-stamped by day (e.g. logmessages-2017-04-14).
However, I only have a single-node setup that contains all these daily indexes. Would I be better off just logging to a single logmessages index on this single node?
Since I only have a single node, I have replicas set to 0 and shards set to 1 for each daily index. I'm indexing about 100,000 documents per month.

The answer is "no".
A logging use case always has a retention period: after some time you don't need those logs anymore and you delete them. The same applies to Elasticsearch indices: once the retention period is reached, the logs are deleted.
With time-based indices, you delete one day's index and that's it. Deleting an entire index is far cheaper than deleting individual documents out of an index.
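The point above can be sketched in plain Python. This is a hypothetical model, not an Elasticsearch API: it only shows how index-per-day naming (the `logmessages-YYYY-MM-DD` pattern from the question) turns retention into "pick the expired names and issue one DELETE per index", which tools like Curator automate.

```python
from datetime import date, timedelta

def indices_to_delete(existing, today, retention_days):
    """Return the daily indices that fall outside the retention window.

    Index names are assumed to follow the question's pattern,
    e.g. "logmessages-2017-04-14".
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        day = date.fromisoformat(name.removeprefix("logmessages-"))
        if day < cutoff:
            expired.append(name)
    return expired

# Two weeks of daily indices, 7-day retention: the first six expire.
existing = [f"logmessages-{date(2017, 4, d):%Y-%m-%d}" for d in range(1, 15)]
print(indices_to_delete(existing, date(2017, 4, 14), retention_days=7))
```

Each name returned would then be dropped with a single index-deletion request, instead of running a delete-by-query over millions of documents.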

Related

In an Elasticsearch cluster, is there a way to allocate shards to a particular node at creation time?

I have a multi-node Elasticsearch cluster. On that cluster, I want to divide the shards of the same index across different nodes.
Suppose a document that has various key-value pairs is to be ingested into the index. Based on a key-value, I want my master node to allocate a specific data node that holds the list of documents having the same key-value.
My approach is to have a single index across the nodes available in the cluster, with the shards of this index distributed so that documents with similar key-value pairs end up on the same node. Is there a way to do this?
Also, I want to increase the number of shards in an index, but I am getting the error "index <index_name> must be read-only to resize index". How do I increase the number of shards?
There is the _routing field, which can group documents in a particular shard, but you cannot automatically assign the shard holding a given routing value to a specific node. The closest you could get would be to handle it manually via the reroute API.
However, why you would want to do that is not clear, and it's definitely not recommended: it's a lot of manual control over something that Elasticsearch is pretty good at handling itself.
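As a rough illustration of how `_routing` groups documents, here is a simplified model. Elasticsearch actually uses a murmur3 hash (plus a routing factor) internally; crc32 merely stands in for it here, and all names are illustrative. Documents that share a routing value always land on the same shard, but which node hosts that shard remains up to the cluster's allocator.

```python
import zlib

def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Simplified shard routing: hash the _routing value (the _id by
    default) and take it modulo the primary shard count. crc32 stands
    in for the murmur3 hash Elasticsearch really uses."""
    return zlib.crc32(routing_value.encode()) % num_primary_shards

# Documents sharing a routing value are co-located on one shard.
docs = [("order-1", "customer-42"),
        ("order-2", "customer-42"),
        ("order-3", "customer-7")]
shards = {doc_id: shard_for(routing, 5) for doc_id, routing in docs}
print(shards)
```

Note what the model cannot express: there is no input anywhere for "node", which is exactly why routing alone can't pin a key-value group to a chosen machine.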

Overhead of empty elastic search indices on performance

We use Elasticsearch for full-text search use cases. The data is metadata collected across different objects and stored as ES documents. We also update the document in ES whenever the master data gets updated, so it is not a logging use case.
We create one ES index (one primary and one replica shard) as soon as a tenant onboards to our application. This ensures the ES index is ready when the first object gets created.
We do not anticipate a large volume of data in the index; it could range up to a few hundred MB per index. So these are relatively empty indices.
Also, full-text search is an optional add-on feature in the application, so not all tenants may opt for it; however, our technical team suggested creating the index upfront.
What is the overhead of such indices on the ES performance? Are we doing anything different from best practices of ES?
Any input is appreciated.
Empty Elasticsearch indices don't have much overhead, as there is no actual data in them. The only place they take up space is the cluster state (index mappings, settings, etc.), which every node in the cluster holds; any change to index metadata updates the cluster state and is propagated to all nodes in the ES cluster.
If you have sufficient memory and ES heap size, you don't have to worry about these empty indices, which IMO makes sense considering your use case.

logstash output twitter to elasticsearch - how many indexes to have

Given that Logstash configs can have multiple inputs and outputs:
What considerations drive the decision about how many indexes to use as outputs stored in Elasticsearch if I'm using the twitter input on Logstash?
Should I have one index per monitored account, one per tag or keyword, or are there other considerations that would affect the design?
There is overhead in Elasticsearch for each open index, so each one will consume heap.
It's common to put more than one type of document in an index (that's what the [type] field is for). Note that, in elasticsearch v2, any identically-named fields must have the same mapping ("myField", if a string in one type, must always be a string).
Shards have a recommended upper limit on size, about 60GB IIRC.
Finally, arrange your index so that enforcing your retention policy is easy. If everything is kept for 7 days, then a daily index would work well. Use 'curator' to delete old indexes.
I prefer to make a smaller number of large indexes.

elasticsearch: is creating one index for each log good?

I am using Elasticsearch to index logs from automation runs of test cases. I am creating an index for each run (a run can have from a thousand to a million events), which comes to about 200 indices per day. Is it good methodology to create an index per run, or should I have just one index and put the logs from multiple runs into it?
The amount of data is huge, which is why I chose separate indices: I'm expecting 200 runs every day, each with up to a million events.
It depends on how long you want to retain your data and the size of your cluster. At 200 indices per day, each with lots of associated files, you're looking at a lot of file handles. That doesn't sound like it would scale beyond a few weeks or months on a very small cluster, since you'll run out of file handles.
A better strategy might be to do what logstash does by default which is to create a new index every day. Then your next choice will be to play with the number of shards and nodes in the cluster. Assuming you want to store a worst case of 200M log entries per day on a 3 or 5 node cluster, probably the default of 5 shards is fine. If you go for more nodes, you'll probably want more shards so that each shard is smaller. Also consider using elasticsearch curator to e.g. close older indices and optimize them.
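The sizing reasoning above can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not measurements: 500 bytes average on-disk size per event, and a commonly cited ceiling of roughly 50 GB per shard; measure your own data before sizing anything.

```python
import math

def shards_needed(docs_per_day: int, avg_doc_bytes: int, max_shard_gb: int = 50) -> int:
    """Rough shard count for one daily index: total daily bytes divided
    by an assumed per-shard size ceiling, rounded up."""
    daily_gb = docs_per_day * avg_doc_bytes / 1024**3
    return max(1, math.ceil(daily_gb / max_shard_gb))

# 200M events/day at an assumed 500 bytes each ≈ 93 GB/day.
print(shards_needed(200_000_000, 500))  # → 2
```

With numbers like these, one daily index with a handful of shards is comfortable; 200 separate per-run indices buy nothing except file-handle and cluster-state overhead.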

Load Balancing Between Two elasticsearch servers

I have two Elasticsearch servers:
http://12.13.54.333:9200
and
http://65.98.54.10:9200
On the first server I have 100k documents (id=1 to id=100k) and on the second server I have 100k documents (id=100k+1 to id=200k).
I want to have a text search for the keyword obama in one request on both servers. Is this possible?
Your question is a little generic...I'll try not to give an "it depends" kind of answer, but in order to do so I have to make a couple of assumptions.
Are those two servers actually two nodes on the same elasticsearch cluster? I suppose so.
Did you index data on an elasticsearch index composed of more than one shard? I suppose so. The default in elasticsearch is five shards, which in your case would lead to having two shards on one node and three on the other.
Then you can just send your query to one of those nodes via the REST API. The query will be executed on all the shards that the index you are querying (it can even be more than one index) is composed of. If you have replicas, the replica shards might be used at query time too. The node that received your query will then reduce the search results obtained from all the shards, returning the most relevant ones.
To be more specific, the search phase on every shard will most likely only collect the document ids and their scores. Once the node that you hit has reduced the results, it can fetch all the needed fields (usually the _source field) only for the documents that it's supposed to return.
What's nice about elasticsearch is that even if you indexed data on different indexes you can query multiple indices and everything is going to work the same as I described. At the end of the day every index is composed of shards, and querying ten indices with one shard each is the same as querying one index with ten shards.
What I described applies to the default search_type that Elasticsearch uses, called query_then_fetch. There are other search types you can use when needed, for example count, which doesn't do any reduce or fetch but just executes the query on all shards and returns the sum of the hits from each shard.
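The two-phase flow described above can be modeled with a toy sketch. This is not the real implementation, only the query_then_fetch idea: in the query phase every shard contributes just (score, id) pairs, and in the fetch phase the coordinator retrieves full documents solely for the global top hits. The shard contents below are made-up examples.

```python
import heapq

def query_then_fetch(shards, size):
    """Toy model of query_then_fetch.
    shards maps shard id -> {doc_id: (score, source)}."""
    # Query phase: each shard returns only its hits as (score, doc_id).
    local_tops = []
    for docs in shards.values():
        local_tops.extend((score, doc_id) for doc_id, (score, _) in docs.items())
    top = heapq.nlargest(size, local_tops)  # coordinator-side reduce
    # Fetch phase: retrieve the _source only for the winning ids.
    lookup = {doc_id: src for docs in shards.values()
              for doc_id, (_, src) in docs.items()}
    return [(doc_id, lookup[doc_id]) for _, doc_id in top]

shards = {
    0: {"a": (3.2, "obama speech"), "b": (1.1, "weather")},
    1: {"c": (2.7, "obama visit"), "d": (0.4, "sports")},
}
print(query_then_fetch(shards, size=2))
```

The key property the sketch preserves: full documents cross the wire only for the `size` winners, not for every shard-local candidate.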
Revendra Kumar,
Elasticsearch should handle that for you. Elasticsearch was built from scratch to be distributed and to do distributed search.
Basically, if those servers are in the same cluster, you will have two shards (in your setup, the first holds ids 1 to 100k and the second holds ids 100k+1 to 200k). When you search for something, it doesn't matter which server the request hits: the search runs on both servers and the combined result is returned to the client. The internal behavior of Elasticsearch is too extensive to explain here.
