How to set up clusters and sharding in ArangoDB?

I want to use sharding in ArangoDB. I have set up coordinators and DBservers as described in the 2.8.5 documentation. But can someone still explain it in detail, and also how I can check the performance of my queries before and after sharding?

Testing your application can be done with a local cluster, where all instances run on one machine - which is what you already did, if I understand correctly?
An ArangoDB cluster consists of coordinator and DBserver nodes. Coordinators don't have their own user-specific collections on disk; their role is to handle the I/O with the clients and to parse, optimize and distribute the queries and the user data to the DBserver nodes. Foxx services also run on the coordinators. DBservers are the storage nodes in this setup; they keep the user data.
To compare the performance between clustered and non-clustered mode, import the same dataset into a clustered instance and a single-server instance and compare the query execution times. Since the cluster setup may involve more network communication (e.g. if you do a join) than the single-server case, the performance can differ. On a physically distributed cluster you may achieve higher throughput, since the cluster nodes are separate machines with their own I/O paths ending on separate physical hard disks.
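A quick way to compare is to time the same query in arangosh on both setups (a minimal sketch; the collection name and filter are placeholders for your own data):

var start = Date.now();
db._query("FOR d IN dataset FILTER d.value > 100 RETURN d").toArray();
print("query took " + (Date.now() - start) + " ms");

Run it a few times and compare averages, since caches warm up after the first run.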
In the cluster case you create collections specifying the number of shards using the numberOfShards parameter; the shardKeys parameter controls the distribution of your documents across the shards. You should choose a key whose values distribute well across the shards (i.e. are not skewed towards just one shard). The numberOfShards can be an arbitrary value and doesn't have to correspond to the number of DBserver nodes - it can even be bigger, so you can more easily move a shard from one DBserver to a new one when scaling your cluster up to more nodes later to adapt to higher loads.
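For example (a sketch - the collection name and shard key are placeholders), a collection sharded four ways on a country attribute would be created like this:

db._create("users", { numberOfShards: 4, shardKeys: ["country"] })

All documents with the same country value then land in the same shard, which is why an evenly distributed key matters.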
When you're developing AQL queries with cluster use in mind, it's essential to use the explain command to inspect how the query is distributed across the cluster, and where filters can be deployed:
db._create("sharded", {numberOfShards: 2})
db._explain("FOR x IN sharded RETURN x")
Query string:
FOR x IN sharded RETURN x
Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      1     - FOR x IN sharded   /* full collection scan */
  6   RemoteNode                   1       - REMOTE
  7   GatherNode                   1       - GATHER
  3   ReturnNode                   1       - RETURN x
Indexes used:
none
Optimization rules applied:
 Id   RuleName
  1   scatter-in-cluster
  2   remove-unnecessary-remote-scatter
In this simple query the RETURN and GATHER nodes run on the coordinator; the nodes above them in the plan, up to and including the REMOTE node, are deployed to the DBserver.
In general, fewer REMOTE / SCATTER -> GATHER pairs mean less cluster communication, and the closer FILTER nodes can be deployed to the *CollectionNodes, the fewer documents have to be sent through the REMOTE nodes - and the better the performance.
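To check where a FILTER ends up, explain a filtered query (a sketch; the attribute name is a placeholder) and look at whether the FilterNode sits above or below the REMOTE node in the plan:

db._explain("FOR x IN sharded FILTER x.value == 42 RETURN x")

If the optimizer can move the filter to the DBservers (rules such as distribute-filtercalc-to-cluster do this), only the matching documents are shipped through REMOTE.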

Related

Configuring an Elasticsearch cluster with machines of different capacity (CPU, RAM) for rolling upgrades

Due to cost restrictions, I only have the following types of machines at disposal for setting up an ES cluster.
Node A: Lean (w.r.t. CPU, RAM) Instance
Node B: Beefy (w.r.t. CPU, RAM) Instance
Node M: "Leaner than A" (w.r.t. CPU, RAM) Instance
Disk-wise, both A and B have the same size.
My plan is to set up Node A and Node B as master-eligible data nodes, and Node M as a master-eligible-only node (no data storage).
Because the two data nodes are NOT identical, what would be the implications?
I am going to make it a cluster of 3 machines only, for the possibility of rolling upgrades (the current volume of data and its expected growth for a few years can be managed with vertical scaling, and leaving the default number of shards and replicas would let me scale horizontally if the need arises).
There is absolutely no need for your machines to have the same specs. You will need 3 master-eligible nodes not just for rolling upgrades, but for high availability in general.
If you want to scale horizontally you can do so by either creating more indices to hold your data, or configuring your index to have multiple primary and/or replica shards. Since version 7 the default for new indices is to be created with 1 primary and 1 replica shard. A single index like this does not really allow you to scale horizontally.
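As an illustration, creating an index with more primary shards could look like this with the @elastic/elasticsearch Node.js client (a sketch; the index name and shard counts are placeholders, and the request shape may differ slightly between client versions):

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createIndex() {
  // 3 primaries allow the index to spread over up to 3 data nodes;
  // each primary gets 1 replica for redundancy
  await client.indices.create({
    index: 'my-index',
    body: {
      settings: { number_of_shards: 3, number_of_replicas: 1 }
    }
  });
}

Keep in mind that the primary shard count is fixed once the index exists; only the replica count can be changed afterwards (short of reindexing or the split/shrink APIs).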
Update:
With respect to load and shard allocation (where to put data), Elasticsearch by default will simply consider the amount of storage available. When you start up an instance of Elasticsearch, it introspects the hardware and configures its thread pools (number of threads & size of queue) for the various tasks accordingly, so the number of available threads to process tasks can vary. If I'm not mistaken, the coordinating node (the node receiving the external request) will distribute indexing/write requests in a round-robin fashion, not taking load into consideration. Depending on your version of Elasticsearch, this is different for search/read requests, where the coordinating node will leverage adaptive replica selection, taking into account the load/response time of the various replicas when distributing requests.
Besides this, sizing and scaling is too complex a topic to be answered comprehensively in a simple response. It typically also involves testing to figure out the limits/boundaries of a single node.
BTW: the default number of primary shards was changed in v7.x of Elasticsearch, as oversharding was one of the most common issues Elasticsearch users were facing. A "reasonable" shard size is in the tens of gigabytes.

Cassandra multiple nodes in different data centers on same server

Just want to know if I can configure multiple nodes from different data centers on the same physical server. Example - I want to have 2 data centers with 3 nodes each, with 1 node from each data center on each server.
Total of 2 data centers, 6 nodes on 3 physical servers.
You can technically configure it as you describe; however, a data center is typically thought of as a physical location, so configuring nodes that share one physical server into two different logical data centers is confusing (especially for anyone who has to troubleshoot the environment later).
A best practice would be a topology of 3 nodes in each data center (actually physically located in each data center). Then you could configure the cluster to keep your data in both data centers for availability, and also have appropriate latency within a single data center for all reads, writes, etc.
For example, using RF: 3 in each data center and a consistency level of LOCAL_QUORUM would balance data availability while reducing the latency of your requests. This configuration ensures that each read/write completes within a single data center (lower latency than across data centers) while the data is still saved in both data centers (an eventually consistent design).
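As a sketch with the DataStax Node.js driver (cassandra-driver) - the keyspace name, contact point and DC names are placeholders - that setup could be expressed like this:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['10.0.0.1'],   // any reachable node
  localDataCenter: 'DC1'         // this client's "local" DC for LOCAL_* consistency
});

async function setup() {
  await client.connect();
  // keep 3 replicas of every row in each of the two data centers
  await client.execute(
    "CREATE KEYSPACE IF NOT EXISTS app WITH replication = " +
    "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}"
  );
}

Individual queries can then pass { consistency: cassandra.types.consistencies.localQuorum } in the execute options, so each request only waits for 2 of the 3 replicas in the local data center.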
Yes, it is possible to follow the topology you have listed, but think about the following scenario:
With two nodes from different DCs on a single machine, there is a high chance that a given piece of data will be replicated on that single machine, in the two co-located nodes belonging to different data centers. If that single machine fails, you lose two copies of the data at once.
Assuming you have an RF of DC1:2, DC2:2 and use a consistency level of QUORUM, 3 replicas (floor(4/2) + 1) must respond to each read. With one physical server down, a piece of data can lose 2 of its replicas at once, so your reads will fail - and writes at the same consistency level will fail too.

Using multiple node clients in elasticsearch

I'm trying to think of ways to scale our Elasticsearch setup. Do people use multiple node clients on an Elasticsearch cluster and put them in front of a load balancer/reverse proxy like Nginx? Other ideas would be great.
So I'd start with re-capping the three different kinds of nodes you can configure in Elasticsearch:
Data Node - node.data set to true and node.master set to false - these are your core nodes of an Elasticsearch cluster, where the data is stored.
Dedicated Master Node - node.data set to false and node.master set to true - these are responsible for managing the cluster state.
Client Node - node.data set to false and node.master set to false - these respond to client data requests, querying for results from the data nodes and gathering the data to return to the client.
By splitting the functions into 3 different base node types you have a great degree of granularity and control in managing the scale of your cluster. As each node type handles a more isolated set of responsibilities you are better able to tune each one and to scale appropriately.
For data nodes, it's a function of handling indexing and query responses, along with making certain you have enough storage allocated to each node. You'll want to monitor storage usage and disk throughput for each node, along with CPU and memory usage. You want to avoid configurations where you run out of disk or saturate disk throughput while still having substantial excess CPU and memory, or the reverse, where memory and CPU max out but you have lots of disk available. The best way to determine this is through some benchmarking of typical indexing and querying activities.
For master nodes, you should always have at least 3, and always an odd number. The quorum should be set to N/2 + 1, where N is the number of master-eligible nodes (with three masters that is discovery.zen.minimum_master_nodes: 2). This way you don't run into split-brain issues with your cluster. Dedicated master nodes tend not to be heavily loaded, so they can be quite small.
For client nodes you can indeed put them behind a load balancer, or use DNS entries to point to them. They are easily scaled up and down by just adding more to the cluster, and should be added both for redundancy and as you see CPU and memory usage climb. There's not much need for a lot of disk.
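If you'd rather not run a separate load balancer, most clients can round-robin over several client nodes themselves; with the @elastic/elasticsearch Node.js client that might look like this (host names are placeholders):

const { Client } = require('@elastic/elasticsearch');

// the client's connection pool rotates requests across all listed nodes
const client = new Client({
  nodes: ['http://client-node-1:9200', 'http://client-node-2:9200']
});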
No matter what your configuration, in addition to benchmarking likely loads ahead of time I'd strongly advise close monitoring of CPU, memory and disk - ES is easy to start rolling out, but it does need watching as you scale into larger numbers of transactions and more nodes. Dealing with a yellow or red cluster status due to node failures from memory or disk exhaustion is not a lot of fun.
I'd take a close read of this article for some background:
http://elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
Plus this series of articles:
http://elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html

elasticsearch cluster setup information

I'm a newbie to search and Elasticsearch. I have gone through some online docs and developed an app against an Elasticsearch setup in our test environment. So far it's been smooth in development and testing. Now, to go to production and set up the cluster, I need some expert advice on:
Number of shards
Number of replicas
Should I separate out master and data nodes?
Can all the nodes be data nodes?
I don't have any advanced search use case, but I at least need plural matching: (phone) should match all docs with "phones" and vice versa. Is any special stemming needed in this case?
My usecase and traffic patterns are,
Upto 100M read per day
Upto 1M write/update per day
Initial data size 10GB, grow rate 1 GB every 6 months
Cluster info
1. Initial cluster size: 14 machines, 28 GB RAM / 120 GB spinning hard disk / 12 cores each
2. A load balancer with DNS would distribute the traffic to any of the 14 machines.
I have used unicast discovery, and I have bootstrap.mlockall: true and index.routing.allocation.disable_allocation: false set.
Please advise.
Thanks
1. Number of shards
The number of shards in Elasticsearch is a one-time setting: once an index is created, its number of primary shards cannot be changed (short of reindexing). So you need to plan how many shards are required for your cluster, taking into consideration your current dataset size plus any index growth. To do this, set up one Elasticsearch node with one shard and zero replicas on a box that has the same specifications as your production boxes.
The capacity of a single shard will depend on a number of factors:
The size of your documents
The size of your fields
The amount of RAM you assign to the JVM that runs Elasticsearch. If you have lots of aggregations, sorting and parent/child documents, you will need to make sure that you have assigned enough RAM to Elasticsearch so it can cache the results.
Your required number of queries per second.
The maximum search request response time allowed.
Index documents into your single-shard node in iterations of x million (or fewer); at each iteration, perform benchmarks by executing x queries per second using a testing tool like JMeter. When the queries in your tests return response times that reach your maximum allowed search request time, you have found the number of documents a single shard can hold. Once you have this value you can calculate the number of shards required for your full dataset, plus how many you will need to accommodate index growth.
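To make that last step concrete, here is the arithmetic as a sketch (every figure is a placeholder for your own benchmark results and growth estimates):

// docsPerShard: the per-shard capacity found in the benchmark above
const docsPerShard    = 20e6;
const currentDocs     = 15e6;  // documents in the initial dataset
const projectedGrowth = 5e6;   // expected new documents over the planning horizon

const shards = Math.ceil((currentDocs + projectedGrowth) / docsPerShard);
console.log('primary shards needed:', shards);  // -> 1 with these numbers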
2. Number of replicas
Start with 1 replica. A replica shard will be placed on a different node from its primary shard, so if one node goes down you still have the full dataset available. One replica is usually sufficient; if you find you need more, you can always add them later on.
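Unlike the primary shard count, the replica count can be changed on a live index; with the @elastic/elasticsearch Node.js client (the index name is a placeholder, and client is the same Client instance as in the earlier sketches) it is a single call from inside an async function:

// raise the replica count of an existing index from 1 to 2
await client.indices.putSettings({
  index: 'my-index',
  body: { number_of_replicas: 2 }
});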
3. Should I separate out master and data nodes?
It depends on the size of your cluster. If you have more than 5 nodes, it is advisable to have dedicated master-only nodes whose sole job is to maintain the cluster state.
4. Can all the nodes be data nodes?
There must always be at least one master node in your cluster; the master node maintains the cluster state. If you have a small cluster (< 5 nodes), you can make every node in your cluster both a data node and a master-eligible node. One of the nodes will be elected as the master; if the master node goes down, another node in the cluster will be elected master. If you have master-only nodes as described in point 3, the rest of the nodes in the cluster can be data-only nodes.
5. I don't have any advanced search use case, but at least need plural matching ((phone) should match all docs with "phones" and vice versa). Is any special stemming needed in this case?
Yes, stemming will handle your use case.
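For example (a sketch in the current mappings syntax; index and field names are placeholders), mapping a text field with the built-in english analyzer makes "phone" and "phones" stem to the same token at both index and search time:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createIndex() {
  await client.indices.create({
    index: 'products',
    body: {
      mappings: {
        properties: {
          // "phones" in a document will match a query for "phone"
          name: { type: 'text', analyzer: 'english' }
        }
      }
    }
  });
}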
Also, Elasticsearch comes with very good configuration OOTB; you should start out by only changing the configurations listed in the link below.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_important_configuration_changes.html

MapReduce performance speed-up check on a simple MongoDB installation with 2 secondary and a primary node

I have a simple MongoDB installation with two secondary nodes and one primary node. When I run a mapreduce query on a data size of 5 GB, it takes the same time it took on a standalone single-node MongoDB installation. I am using the command line. Do I have to use any specific command to exploit the extra replica-set members for mapreduce?
Thank you in advance.
You can speed up your job if you can use the aggregation framework instead of mapReduce - the aggregation framework is a lot faster.
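As an illustration (a sketch; collection and field names are placeholders), here is a typical sum-by-key job in both styles, run from the mongo shell:

// mapReduce: total amount per category
db.orders.mapReduce(
  function () { emit(this.category, this.amount); },
  function (key, values) { return Array.sum(values); },
  { out: { inline: 1 } }
);

// the same job as an aggregation pipeline - usually much faster
db.orders.aggregate([
  { $group: { _id: "$category", total: { $sum: "$amount" } } }
]);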
You can't really scale your operations using replica sets, since replica sets are for high availability and failover (plus redundancy of data), not for scaling. You can run mapReduce or aggregation on a secondary - just connect to the secondary, run rs.slaveOk(), and then run your mapReduce/aggregate - but you cannot output the results to a collection then, since you cannot write to a secondary, so it has to return results inline.
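From a shell connected to a secondary, that looks roughly like this (the map/reduce functions are the placeholders from the sketch above):

rs.slaveOk();  // allow reads on this secondary
// inline output is mandatory here - a secondary cannot write to a collection
db.orders.mapReduce(map, reduce, { out: { inline: 1 } });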
This will move the extra load off the primary, but it won't make the job faster per se. If you want to utilize multiple servers, you need to shard your database - by distributing the data over multiple shards/hosts, your mapReduce and/or aggregation queries automatically run over multiple servers. Even though a small penalty exists for managing the results (they still have to be merged), the parallelization of the longest part of the job will likely more than offset that extra overhead.
