Elasticsearch: one storage for all nodes

In Oracle we use RMAN with one shared storage for all cluster nodes. Can we similarly have multiple Elasticsearch data nodes with only one storage disk?
Can all the storage nodes act like RAID 5?

You can always mount the same disk on different nodes and store data there, but that defeats the philosophy of replicas and sharding in Elasticsearch. The idea of replicas is to keep the cluster available even if a hard disk or other piece of hardware goes down; here, if the single disk goes down, we lose the entire cluster's data. With sharding we try to apply parallel computation on independent hardware to improve performance, but here everything sits on the same disk. So I don't think this is a good idea.
If you are still going ahead with this plan, make sure you configure forced shard allocation awareness so that unwanted replica copies are not created.
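As a rough illustration, forced awareness is applied through the cluster settings API. Below is a minimal sketch using the Python requests library; the attribute name disk_group, the group values and the host/port are assumptions, and each node would also need a matching node attribute in its elasticsearch.yml.
import requests
# Minimal sketch (assumed attribute name and endpoint): spread shard copies
# across the "disk_group" attribute and withhold extra replicas when only
# one group is present (forced awareness).
settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "disk_group",
        "cluster.routing.allocation.awareness.force.disk_group.values": "a,b"
    }
}
resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
resp.raise_for_status()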

Related

Impact of reducing HDFS replication factor to 2 (or just one) on HBase map/reduce performance

What is the impact of reducing the HDFS replication factor to 2 (or even just 1) on HBase MapReduce performance? I have an HBase cluster hosted on Azure VMs with data stored on Azure managed disks. Azure managed disks already keep 3 copies of the data for fault tolerance, so I am thinking of reducing the HDFS replication factor to save on storage overhead. Given that MapReduce jobs make use of local availability of the data to avoid transferring it over the network, does anyone have information on the impact on MapReduce performance if there is just one replica of the data available?
This is a difficult question to answer as it depends greatly on what workloads you run.
By decreasing the replication factor, you can speed up the performance of write operations, since the data is written to fewer DataNodes. However, as you noted, you may have decreased locality since it can be more difficult to find a node which has a replica and has free space to execute a task.
Keeping only a single replica can have strong implications on the impact of a single node failure. If a single node dies, all of its data will be unavailable until you restart a new node with the same Azure managed disks. If there are multiple HDFS replicas, data availability is maintained throughout.
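For reference, lowering the replication factor of data that already exists in HDFS is a separate step from changing the cluster default (dfs.replication). A minimal sketch, assuming the standard hdfs CLI is on the PATH and that /hbase is the directory in question (both assumptions):
import subprocess
# Illustrative only: set the replication factor of existing files under
# /hbase to 2 and wait for the change to complete. New files still follow
# the dfs.replication default, which must be lowered separately.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", "/hbase"], check=True)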
Running HDFS DataNodes on top of Azure managed disks sounds like a bit of a bad idea. In addition to breaking some of the core HDFS assumptions ("my disk might fail at any time"), it seems unlikely that you have true data locality if your data is stored in three replicas. I wonder if you have considered:
Using a non-managed disk service. Does Azure provide a way to use a disk which is not replicated? This is much closer to how HDFS is intended to be used.
Storing data in Azure storage (WASB or ADLS) instead of HDFS. This is a more "cloud native" way of running things. If you find that performance is lacking, you can use HDFS for intermediate data and only store final data in Azure. HDFS also provides a way to cache data from external storage systems by using Provided Storage.

Elasticsearch path.data multiple disks, adding more

When I originally set up my Elasticsearch cluster, it was recommended to "stripe" the data across multiple disks thusly:
path.data: [ /disk1, /disk2, /disk3 ]
Which I did previously, and it has worked fine, but now I need to add more space (more disks), which I plan to do like this:
path.data: [ /disk1, /disk2, /disk3, /disk4, /disk5 ]
I have not been able to find any authoritative reference that indicates how the data will be re-balanced (or not). It seems that the behavior has changed somewhat over the years/versions, so googling has been difficult.
All the docs say about it is: "path.data settings can be set to multiple paths, in which case all paths will be used to store data" which is rather vague.
I am running Elasticsearch 5.6.
I would like to understand what will happen when disks 1,2,3 are above the 85% "low watermark" (but not yet at the high 90% mark), and I introduce 2 new disks to the mix. Will new indices go to the 2 new disks only?
The docs say: "ES will not allocate new shards to nodes once they have more than 85% disk used". Does this mean the whole node, or just the disks that are at 85% on that node?
My indices are daily logging data, and are pruned with Curator every N days, so I imagine at some point, things will even out but may take a while. Is there any way to proactively relocate shards to a different disk or should I just let it self-balance over time?
Using multiple disks (via multiple data paths) is not striping. Data is distributed across the paths by shard count, not by disk space usage. Even if a single disk goes past the watermark, the whole node is affected. So adding new disks to path.data will not redistribute existing data onto the new disks.
For real data striping use at least RAID 0, or another RAID level depending on your data-safety requirements.
See: Data storage architecture
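To see how full each data path actually is, before and after adding disks, the nodes stats API reports per-path filesystem usage. A minimal sketch with the Python requests library (host and port are assumptions):
import requests
# Illustrative: print used space for every data path on every node.
stats = requests.get("http://localhost:9200/_nodes/stats/fs").json()
for node in stats["nodes"].values():
    for path in node["fs"]["data"]:
        total = path["total_in_bytes"]
        free = path["free_in_bytes"]
        print(node["name"], path["path"], f"{(total - free) / total:.0%} used")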

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to set the replication factor to 6 so all the data is replicated across all my nodes, letting the cluster split up queries (say, in Hive) without having to move data around? I believe that with a replication factor of 3, a 300GB file put into HDFS is split across just 3 of the data nodes, and when all 6 nodes are needed for a query, data has to be moved to the other 3 nodes that don't hold it, causing slower responses. Is that accurate?
I understand what you mean; you are talking about data locality. Generally speaking, data locality can reduce run time because it saves the time spent transmitting blocks over the network. But in fact, if you don't enable "HDFS Short-Circuit Local Reads" (it is off by default, please visit here), the MapTask will still read the block via the TCP protocol, i.e. over the network, even when the block and the MapTask are on the same node.
Recently I optimized Hadoop and HDFS: we replaced the HDD disks with SSDs, but we found the effect was not good and the run time was no shorter, because the disk was not the bottleneck and the network load was not heavy. From the results we concluded that the CPU was the heaviest resource. If you want to understand your Hadoop cluster clearly, I advise you to use Ganglia to monitor it; it can help you analyse your cluster's bottleneck. Please see here.
Finally, Hadoop is a very large and complicated system: disk performance, CPU performance, network bandwidth and parameter values are only some of the many factors to consider. If you want to save time, you have much more to work on than just the replication factor.
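To put rough numbers on the trade-off described in the question, here is a small illustrative calculation; the 300 GB file and 6-node cluster come from the question, while uniform block placement is a simplifying assumption:
# Illustrative arithmetic only: raw storage cost and the chance that an
# arbitrary node holds a local copy of a given block, assuming blocks are
# placed uniformly across the cluster.
file_size_gb = 300
nodes = 6
for replication in (3, 6):
    raw_storage = file_size_gb * replication
    local_read_chance = replication / nodes
    print(f"replication={replication}: {raw_storage} GB stored, "
          f"{local_read_chance:.0%} chance a given node holds a given block")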

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

There is a great tutorial elasticsearch on ec2 about configuring ES on Amazon EC2. I studied it and applied all recommendations.
Now I have AMI and can run any number of nodes in the cluster from this AMI. Auto-discovery is configured and the nodes join the cluster as they really should.
The question is: how do I configure the cluster so that I can automatically launch/terminate nodes depending on cluster load?
For example, I want to have only 1 node running when there is no load and 12 nodes running at peak load. But wait, if I terminate 11 nodes in the cluster, what happens to the shards and replicas? How do I make sure I don't lose any data in the cluster if I terminate 11 nodes out of 12?
I might want to configure S3 Gateway for this. But all the gateways except for local are deprecated.
There is an article in the manual about shards allocation. Maybe I'm missing something very basic, but I should admit I failed to figure out whether it is possible to configure one node to always hold copies of all the shards. My goal is to make sure that if this is the only node running in the cluster, we still don't lose any data.
The only solution I can imagine now is to configure the index to have 12 shards and 12 replicas. Then, when up to 12 nodes are launched, every node would have a copy of every shard. But I don't like this solution because I would have to reconfigure the cluster if I ever want more than 12 nodes at peak load.
Auto scaling doesn't make a lot of sense with ElasticSearch.
Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can badly degrade the performance of ElasticSearch. (If you want to limit the effect you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams and indices.recovery.max_size_per_sec. This will limit the impact, but will also slow the re-balancing and recovery.)
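For illustration, such throttling can be applied dynamically through the cluster settings API. A minimal sketch with the Python requests library; the host/port and values are assumptions, and the exact setting names vary between Elasticsearch versions (newer releases use indices.recovery.max_bytes_per_sec, for example):
import requests
# Illustrative: throttle rebalancing and recovery as transient cluster
# settings. Values and setting names depend on the Elasticsearch version.
throttle = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 1,
        "indices.recovery.max_bytes_per_sec": "20mb"
    }
}
resp = requests.put("http://localhost:9200/_cluster/settings", json=throttle)
resp.raise_for_status()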
Also, if you care about your data you don't want to have only 1 node ever. You need your data to be replicated, so you will need at least 2 nodes (or more if you feel safer with a higher replication level).
Another thing to remember is that while you can change the number of replicas, you can't change the number of shards. This is configured when you create your index and cannot be changed (if you want more shards you need to create another index and reindex all your data). So your number of shards should take into account the data size and the cluster size, considering the higher number of nodes you want but also your minimal setup (can fewer nodes hold all the shards and serve the estimated traffic?).
So theoretically, if you want to have 2 nodes at low time and 12 nodes on peak, you can set your index to have 6 shards with 1 replica. So on low times you have 2 nodes that hold 6 shards each, and on peak you have 12 nodes that hold 1 shard each.
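As a concrete illustration of that 6-shard / 1-replica layout, the index could be created as follows (a sketch with the Python requests library; the index name, host and port are assumptions):
import requests
# Illustrative: create an index with 6 primary shards and 1 replica,
# i.e. 12 shard copies in total for the 2-to-12-node scenario above.
body = {"settings": {"number_of_shards": 6, "number_of_replicas": 1}}
resp = requests.put("http://localhost:9200/my-index", json=body)
resp.raise_for_status()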
But again, I strongly suggest rethinking this and testing the impact of shard moving on your cluster performance.
In cases where the elasticity of your application is driven by a variable query load, you could set up ES nodes configured not to store any data (node.data = false, http.enabled = true) and then put those in the auto scaling group. These nodes could offload all the HTTP and result conflation processing from your main data nodes (freeing them up for more indexing and searching).
Since these nodes wouldn't have shards allocated to them bringing them up and down dynamically shouldn't be a problem and the auto-discovery should allow them to join the cluster.
I think this is a concern in general when it comes to employing an auto-scalable architecture to meet temporary demand while the data still needs to be preserved. I think there is a solution that leverages EBS:
Map shards to specific EBS volumes. Let's say we need 15 shards; we will need 15 EBS volumes.
Amazon allows you to mount multiple volumes on one instance, so when we start we can run a few instances that each have several volumes attached.
As load increases, we can spin up additional instances, up to 15.
The above solution is only advised if you know your maximum capacity requirements; a sketch of the volume re-attachment step follows below.
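A minimal sketch of that re-attachment step with boto3; the volume and instance IDs, device name and region are placeholders, and detaching the volume from the old instance is omitted:
import boto3
# Illustrative only: attach a pre-existing EBS volume (holding one shard's
# data path) to a newly launched instance. All identifiers are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.attach_volume(
    VolumeId="vol-0123456789abcdef0",
    InstanceId="i-0123456789abcdef0",
    Device="/dev/xvdf",
)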
I can give you an alternative approach using the AWS Elasticsearch Service (it costs a little more than running Elasticsearch on plain EC2). Write a simple script which continuously monitors the load on the service (through the API/CLI) and, if the load goes beyond the threshold, programmatically increases the number of nodes in your AWS Elasticsearch Service cluster. The advantage here is that AWS takes care of the scaling (as per the documentation, they take a snapshot and launch a completely new cluster). This works for scaling down as well.
Regarding the auto-scaling approach, there are some challenges: shard movement has an impact on the existing cluster, and you also need to be more vigilant while scaling down. You can find a good article on scaling down here, which I have tested. If you can build some intelligent automation of the steps in that link through scripting (Python, shell) or an automation tool like Ansible, then scaling in/out is achievable. But again, you need to start scaling up well before you reach your normal limits, since the scale-up activity itself has an impact on the existing cluster.
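For illustration, the "increase the node count programmatically" step might look roughly like this with boto3; the domain name, region and instance count are placeholders, and this assumes the older Elasticsearch Service ("es") API rather than the newer OpenSearch one:
import boto3
# Illustrative: increase the data-node count of an AWS Elasticsearch Service
# domain. Domain name, region and count are placeholders.
es = boto3.client("es", region_name="us-east-1")
es.update_elasticsearch_domain_config(
    DomainName="my-domain",
    ElasticsearchClusterConfig={"InstanceCount": 4},
)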
Question: is it possible to configure one node to always hold copies of all the shards?
Answer: yes, it is possible with explicit shard allocation routing. More details here.
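A minimal sketch of what such explicit allocation routing can look like at the index level (node name, index name and host are placeholders; note that a replica is never allocated to the same node as its primary, so "all copies" here means at least one copy of every shard):
import requests
# Illustrative: require every shard of "my-index" to live on the node named
# "node-1" using index-level shard allocation filtering.
settings = {"index.routing.allocation.require._name": "node-1"}
resp = requests.put("http://localhost:9200/my-index/_settings", json=settings)
resp.raise_for_status()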
I would be tempted to suggest solving this a different way in AWS. I don't know what ES data this is or how it's updated, so I am making a lot of assumptions. I would put the ES instance behind an ALB (application load balancer) and have a scheduled process that regularly creates an updated AMI (if you do it often it will be quick to do). Then, based on the load of your single server, I would trigger more instances to be created from the latest AMI you have available and add the new instances to the ALB to share some of the load (see the sketch after this list). As things quiet down, I would trigger termination of the temporary instances. If you go this route, here are a couple more things to consider:
Use spot instances, since they are cheaper, if that fits your use case.
The "T" instance types don't fit well here since they need time to build up CPU credits.
Use Lambdas for the task of turning things on and off; if you want to be fancy you can trigger them based on a webhook through the AWS API Gateway.
Making more assumptions about your use case, consider putting a Varnish server in front of your ES machine so that you can provide scale more cheaply with a caching strategy (lots of assumptions here); based on the load you see, you can dial in the right TTL for cache eviction. Check out the soft-purge feature; for our ES setup we have gotten a lot of good value from it.
If you do any of what I suggest here, make sure your spawned ES instances report their logs back to a central, addressable place on the persistent ES machine so you don't lose logs when the machines die.
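A minimal sketch of the "launch from the latest AMI and register with the ALB" step with boto3; the AMI ID, instance type, target group ARN and region are all placeholders:
import boto3
# Illustrative only: launch one instance from a pre-built ES AMI and
# register it with an ALB target group. All identifiers are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
elbv2 = boto3.client("elbv2", region_name="us-east-1")
run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
instance_id = run["Instances"][0]["InstanceId"]
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/es/abc123",
    Targets=[{"Id": instance_id, "Port": 9200}],
)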

MongoDB capacity planning

I have an Oracle database with around 7 million records/day (~300 GB) and I want to switch to MongoDB.
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I'd like to know your thoughts on it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to; this is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing that, in your case, it is scalability,
but know your data well before doing so.
For your POC, keeping two shards means roughly 150 GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards, maybe two or three big, deep shards if your data can easily be segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (eg. will your application requirements outgrow the resources of a single machine; how much of your data set will be active working set for queries, etc).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
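For a benchmarking/POC setup, a three-node replica set can be initiated with a single admin command once the mongod processes are running. A minimal sketch with pymongo; the hostnames and the replica-set name rs0 are placeholders:
from pymongo import MongoClient
# Illustrative: initiate a 3-node replica set named "rs0". Hostnames are
# placeholders; the mongod processes must already be running with --replSet rs0,
# and directConnection lets us talk to a member that is not yet initiated.
client = MongoClient("mongo1", 27017, directConnection=True)
client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo1:27017"},
        {"_id": 1, "host": "mongo2:27017"},
        {"_id": 2, "host": "mongo3:27017"},
    ],
})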
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage: for example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
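If and when you do move from a replica set to a sharded cluster, sharding is enabled per database and collection, and the shard key is declared at that point. A minimal pymongo sketch run against a mongos router; the database, collection and key are placeholders, and a hashed key (MongoDB 2.4+) is just one way to avoid write hot spots:
from pymongo import MongoClient
# Illustrative: enable sharding for a database and shard a collection on a
# hashed key via a mongos router. Names and the key choice are placeholders.
client = MongoClient("mongos-host", 27017)
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.records", key={"_id": "hashed"})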
