Distribution of key/value pairs in JBoss Data Grid - caching

I am loading 20 million non-expiring entries into JBoss Data Grid using Hot Rod clients. The Hot Rod clients run on 5 different machines to load the data. The entries were added successfully. We configured a replication factor of 2, so there are 40 million entries in the grid in total. We found a variation of more than 10% in the number of entries stored on each node. For example, one node has 7.8 million entries while another has 12 million.
So I was wondering why the entries are not equally distributed; ideally each node should have about 10 million entries. The objective of this test was to check whether the load/requests are distributed equally across all the nodes.
Any pointers on how the key/value pairs are distributed in JDG would be appreciated.

In Infinispan the hash space is divided into segments which then get mapped to the nodes in the cluster.
Entries are hashed by key using the MurmurHash3 function, and the hash determines which segment owns the key. It is possible that your keys are causing a somewhat uneven distribution. You could try increasing the number of segments in your configuration. With your cluster, use at least 100 segments.
Also, I had to look up the meaning of "crore" and "lakh", as I had no idea what they were. You should probably use the 10M and 100K notations instead to make it easier to understand.
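To make the effect of the segment count concrete, here is a toy simulation. It is only a sketch under loud assumptions: MD5 stands in for MurmurHash3 (which is not in the Python standard library), the key-to-segment mapping is a plain modulo, and segments are placed on consecutive nodes round-robin. None of this is Infinispan's actual consistent-hash implementation; the names `segment_of`, `entries_per_node` and `imbalance` are made up for the illustration.

```python
import hashlib


def segment_of(key: str, num_segments: int) -> int:
    # Assumption: MD5 is only a stand-in for MurmurHash3.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def entries_per_node(keys, num_segments, num_nodes, num_owners=2):
    counts = {node: 0 for node in range(num_nodes)}
    for key in keys:
        seg = segment_of(key, num_segments)
        # Assumed placement: each segment is owned by consecutive nodes,
        # starting at (segment number mod number of nodes).
        for replica in range(num_owners):
            counts[(seg + replica) % num_nodes] += 1
    return counts


def imbalance(counts) -> float:
    # Ratio between the fullest and the emptiest node (1.0 = perfectly even).
    return max(counts.values()) / min(counts.values())


keys = [f"key-{i}" for i in range(20_000)]
for segments in (10, 100):
    counts = entries_per_node(keys, segments, num_nodes=4)
    print(segments, counts, round(imbalance(counts), 3))
```

In this toy model, 10 segments cannot be split evenly over 4 nodes, so some nodes own more segment copies than others; with 100 segments each segment is a smaller share of the data and the per-node totals come out much closer together.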

Related

SpatialHadoop: no scaling with multiple computing nodes

I am using SpatialHadoop to store and index a dataset with 87 million points. I then apply various range queries.
I tested on 3 different cluster configurations: 1, 2 and 4 nodes.
Unfortunately, I don't see a decrease in runtime as the number of nodes grows.
Any ideas why there is no horizontal-scaling effect?
How big is your file in megabytes? While it has 87 million points, it can still be small enough that Hadoop decides to create one or two splits only out of it.
If this is the case, you can try reducing the block size in your HDFS configuration so that the file will be split into several blocks.
Another possibility is that you might be running virtual nodes on the same machine which means that you do not get a real distributed environment.
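As a rough back-of-the-envelope check of the first point, the number of splits can be estimated from the file size and the HDFS block size. This is a simplified sketch: the 200 MB file size is a made-up example, 128 MB is assumed as the block size, and real split computation (and SpatialHadoop's own index partitioning) has more rules than this.

```python
import math


def estimated_splits(file_size_mb: float, block_size_mb: float = 128) -> int:
    # Simplification: one split per (started) block, at least one split.
    return max(1, math.ceil(file_size_mb / block_size_mb))


# Hypothetical 200 MB file: only 2 splits with 128 MB blocks,
# so at most 2 tasks can work on it in parallel.
print(estimated_splits(200))       # 2
# The same file with a 32 MB block size yields 7 splits.
print(estimated_splits(200, 32))   # 7
```

If the estimate is no larger than the number of nodes in the smallest cluster, adding nodes cannot be expected to reduce the runtime.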

Create sets of minimal cardinality from set of pairs

I have a set of pairs of IDs like
(123;1765)
(1212;8977)...
I need to separate those pairs into n groups, each with an individual size (number of pairs). Those groups should have minimum cardinality (i.e., there should be as few different IDs as possible in each group).
Are there any existing algorithms which solve this problem? I'm not sure where/how to search for it.
This is necessary, because I currently work on the load balancing of one of my projects and each node should have to load as few IDs as possible because of limited RAM (each ID is connected to a larger dataset).
Edit:
Some background:
Different nodes in a cluster have to compare datasets identified by IDs. Each comparison is a pair of IDs (compare dataset of ID1 with ID2). Each node gets a bunch of pairs to know which IDs it has to compare and loads the corresponding datasets into RAM. A master node divides a big bunch of pairs into smaller bunches and distributes them to the slave nodes. Because each node can only store a limited number of datasets, those smaller bunches need to contain as few different IDs as possible. But the nodes have different amounts of RAM, so the groups with minimal cardinality should have different sizes.
The comparison is symmetric, so compare(ID1, ID2) is the same as compare(ID2, ID1), so each pair is unique. Which datasets need to be compared is defined by a client which sends those jobs to the master as a bunch of pairs of IDs.
An example:
A client wants the comparison of datasets (1;2), (7;9), (9;105), (7;105), (2;4), (4;1) (usually there would be many more comparisons, typically millions).
The client sends those pairs to the master, which has two registered slaves. Now the master needs to divide that stack of work into two groups, but the more different IDs are part of each group the more datasets need to be loaded by the slaves (ID corresponds to specific dataset, remember?).
So ideally the master would create a group like ((1;2), (2;4), (4;1)) (only contains 3 different IDs, so the slave only has to load 3 datasets) and ((7;9), (9;105), (7; 105)) (again just three IDs) instead of:
((1;2), (9;105)...) and ((2;4), (7;105)...). Here both slaves need to load 4 IDs and more, and e.g. both slaves need to load the datasets no. 2 and 105.
This needs to be optimized somehow..
My first instinct is to say that perhaps this could be resolved with a special cluster analysis where you customize the aggregation and distance functions.
The cluster members would be pairs.
The cluster aggregate would be the set-theoretical union of all pairs in the cluster (this is instead of an average or median in the standard approach).
The distance function of any pair in comparison to the cluster would be the number of elements in the pair that are not found in the cluster aggregate (so the cardinality of the set difference; this replaces the Euclidean distance in the standard approach).
Some cluster algorithms have you set the number of desired clusters in advance, so you would set it to two.
And finally, because you need to balance things so that the cluster aggregates have the same number of elements, further tweaking would be needed, but that is still doable.
But, you say you will have millions of points to compare. The processing required for cluster analysis grows steeply as you put in more input. In this situation, it is worth researching whether your problem is NP-hard or NP-complete. I'm not well versed in that, but I suspect it is, in which case a true optimum will always escape you.
But, if you discover that your problem is in fact NP-complete, then you can still optimize; you just won't be able to guarantee arrival at the global optimum in a reasonable amount of time. So, for instance, you can break your set of pairs into subsets and run an algorithm such as the one above on the subsets. That may still be an improvement.
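A minimal greedy sketch of the idea above, not a full cluster analysis: each pair goes to the group where it adds the fewest new IDs (the set-difference distance), and ties go to the group that currently holds fewer pairs. The function name `greedy_groups`, the per-group capacities and the tie-break rule are assumptions made for this illustration, and the result depends on the input order, so it is a heuristic with no optimality guarantee.

```python
def greedy_groups(pairs, capacities):
    """Assign each pair to the open group where it adds the fewest new IDs."""
    groups = [[] for _ in capacities]
    loaded_ids = [set() for _ in capacities]
    for pair in pairs:
        # Only groups that still have room (capacity = max number of pairs).
        candidates = [g for g, cap in enumerate(capacities) if len(groups[g]) < cap]
        if not candidates:
            raise ValueError("total capacity is smaller than the number of pairs")
        # Distance = IDs of the pair not yet in the group's union of IDs.
        # Assumed tie-break: prefer the group with fewer pairs so far.
        best = min(
            candidates,
            key=lambda g: (len(set(pair) - loaded_ids[g]), len(groups[g])),
        )
        groups[best].append(pair)
        loaded_ids[best].update(pair)
    return groups, loaded_ids


pairs = [(1, 2), (7, 9), (9, 105), (7, 105), (2, 4), (4, 1)]
groups, loaded_ids = greedy_groups(pairs, capacities=[3, 3])
print(groups)      # [[(1, 2), (2, 4), (4, 1)], [(7, 9), (9, 105), (7, 105)]]
print(loaded_ids)  # each group needs only 3 distinct IDs
```

On the six pairs from the question this reproduces the desired split. Nodes with different amounts of RAM can be modelled by passing different capacities.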

Hbase table duplication

Is there a way to duplicate table data on every node of a cluster?
I need to run a performance test with the maximum degree of data locality.
By default, HBase distributes the data over a small fraction of the cluster nodes (1 or 2 nodes), maybe because my data isn't very big (~2 GB).
I know that Hbase is designed for much larger data sets, but in this case, it is a requirement for me.
There are a lot of nice reads* about it (see the end of the post) but I'll try to explain it with my own words ;)
HBase is not responsible for data replication; Hadoop HDFS is. By default HDFS is configured with a replication factor of 3, which means all data will be stored on at least 3 nodes.
Data locality is a key aspect of getting good performance, but achieving maximum data locality is easy: you only need to colocate your HBase Regionservers (RS) with the Hadoop Datanodes (DN), so all your DN should also have the RS role. Once you have that, HBase will automatically move the data where it's needed (on major compactions) to achieve data locality, and that's all: as long as each RS has the data of the regions it serves locally, you'll have data locality.
Even when you have the data replicated to multiple DN, each region (and the rows it contains) will be served by just one RS; it doesn't matter whether you have a replication factor of 3, 10 or 100... Reading a row belonging to region #1 will always hit the same RS, and that will be the one that hosts the region (which will read the data locally from HDFS because of data locality). If the RS hosting that region goes down, the region will be assigned to another RS automatically (because the data is also replicated to other DN).
What you can do is split your table in a way that each RS has even buckets of rows (regions) assigned to it, so that as many different RS as possible work simultaneously when you read or write data, increasing your overall throughput as long as you don't always hit the same regions (called regionserver hotspotting**).
Therefore, you should always start by ensuring that all the regions of your table are assigned to different RS and that they receive the same volume of R/W requests. Once you've done that, you can split your table into more regions until you have an even number of regions on all the RS of your cluster (you may need to assign them manually if you're not happy with the load balancer).
Just remember that even when you seem to have a perfect distribution of regions, you can still have poor performance if your data access pattern is not right (or is uneven) and doesn't reach all regions evenly. In the end it all depends on your application.
(*) Recommended reads:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
(**) To avoid RS hotspotting we always design our tables so that row keys do not increase monotonically, so rows 1, 2, 3 ... N are hosted in different regions. The common approach is to use MD5(id) + id as the rowkey. This approach has its own set of drawbacks: you cannot scan the first 10 rows because they're salted.
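The MD5(id) + id rowkey from the footnote can be sketched like this. It is an illustration only: the function name `salted_rowkey` and the choice of the hex digest as the prefix are assumptions, and the original does not say how the hash is encoded.

```python
import hashlib


def salted_rowkey(record_id: str) -> str:
    # Assumption: the 32-character hex digest is used as the prefix.
    prefix = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return prefix + record_id


# Sequential IDs no longer sort next to each other, so consecutive writes
# land in different parts of the key space (and therefore different regions).
for key in sorted(salted_rowkey(str(i)) for i in range(1, 6)):
    print(key)
```

Point lookups still work because the key can be recomputed from the id, but a range scan over ids 1 to 10 is no longer a contiguous scan, which is the drawback mentioned above.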

Determining the number of buckets in Hive Table

I have a question regarding the number of buckets to be used. I understand the uses of bucketing and how it positively impacts SMB joins and sampling. But what if the data volume spikes exponentially?
Let's say that, looking at the initial data volume, I decide to use 4 buckets and partition by day. When I insert into this table it would take 4 reducers at some point (the last job in the insertion query). This is fine. But let's say the data volume suddenly spikes a whole lot for some partitions. It would still take 4 reducers, which is not optimal, and it could also fail with OOM.
I could decide on using more buckets initially but that would start creating too many small files until I reach the high volume, as each bucket goes into a file.
Is it possible to have more than one file for a bucket value?
Your inputs are appreciated.
K
Focusing on 'your data volume suddenly spikes a whole lot for some partitions', you could consider using list bucketing, which allows you to put bucketed column values with low volume into one directory.

ElasticSearch - Optimal number of Shards per node

I would appreciate it if someone could suggest the optimal number of shards per ES node for optimal performance, or provide a recommended way to arrive at the number of shards one should use, given the number of cores and memory footprint.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
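The per-shard query plus merge step described above can be pictured with a small scatter-gather sketch. This is a toy model, not Elasticsearch code: the shards are plain lists of made-up (score, document id) tuples, and `search_shard` and `search_index` are hypothetical names.

```python
import heapq


def search_shard(shard, k):
    # Each shard computes its own top-k hits independently.
    return heapq.nlargest(k, shard)


def search_index(shards, k):
    # The coordinating step gathers every shard's top-k and merges them.
    partial_hits = [hit for shard in shards for hit in search_shard(shard, k)]
    return heapq.nlargest(k, partial_hits)


shards = [
    [(0.9, "a"), (0.2, "b")],
    [(0.7, "c"), (0.95, "d")],
    [(0.1, "e")],
]
print(search_index(shards, k=2))  # [(0.95, 'd'), (0.9, 'a')]
```

Every additional shard adds one more per-shard search and more partial results to merge, which is the inherent cost the answer refers to.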
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why would you want to have more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). That's kind of a big deal to recreate all that data, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, that means that you can scale out to up to 6 nodes at a later point easily without resharding.
There are three situations to consider before sharding.
Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding.
In this case, you need to select the number of shards according to the number of nodes (ES instances) you want to use in production.
Suppose you want to run 3 nodes in production. Then you need to choose 1 primary shard and 2 replicas for every index. If you choose more shards than you need.
Situation 2) Your current server can hold the current data, but because the data keeps growing you may end up with no space on disk, or your server may not be able to handle that much data. Then you need to configure more shards, such as 2 or 3 (it's up to your requirements), for each index. But there shouldn't be any replicas.
Situation 3) This is the combination of situations 1 and 2, so you need to combine both configurations. Suppose your data keeps growing and you also need high availability and failover. Then you configure an index with 2 shards and 1 replica. That way you can share data among nodes and get optimal performance.
Note: A query is processed on each shard, and a map-reduce step then combines the results from all shards before the result is returned. That map-reduce step is expensive, so the minimum number of shards gives the best performance.
If you are using only one node in production, then one primary shard is the optimal number of shards for each index.
Hope it helps..!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: Elasticsearch: The Definitive Guide
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory is dedicated to the heap, the less is available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Fewer shards is better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers: https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Fewer servers is better (the ideal is 1 server with 1 shard):
The load on an index can only be split across nodes by sharding (a single shard is enough to use all resources on a node). More shards allow the use of more servers, but more servers bring more overhead for data aggregation... There is no free lunch.
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards.
Note: Most people tend to overestimate their growth, try to be realistic. For instance, this 111GB example is already a BIG index. For comparison the stackoverflow index is 430 GB (2016) and it's a top 50 site worldwide, made entirely of written texts by millions of people.
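The answer does not spell out how it gets from 111 GB to 5 shards. One reading that matches its numbers is to keep each shard no larger than the heap of one node, which can be sketched as follows. The function name `shards_needed` and the sizing rule itself are assumptions based on that reading.

```python
import math


def shards_needed(index_size_gb: float, heap_gb: float) -> int:
    # Assumed rule of thumb: each shard should fit within one node's heap size.
    return max(1, math.ceil(index_size_gb / heap_gb))


# 111 GB planned index size, 25 GB heap per server (50 GB servers).
print(shards_needed(111, 25))  # 5
```

With 5 shards of roughly 22 GB each, the index can be spread over up to 5 such servers.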
Usage: Index by time
When there is too much data for a single index or it's getting too annoying to manage, the next thing is to split the index by time period.
The most extreme example is logging applications (Logstash and Graylog), which use a new index every day.
The ideal configuration of 1-single-shard-per-index makes perfect sense in this scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: Let's imagine a popular internet forum with monthly indices. 99% of requests are hitting the last index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: It's probably unnecessary optimization. A 99% hitrate is unlikely in the real world and the shard replica could distribute part of the read-only load anyway).
Usage: Going Exascale (just for the record)
ElasticSearch is magic. It's the easiest database to set up in a cluster and it's one of the very few able to scale to many nodes (excluding Spanner).
It's possible to go exascale with hundreds of elasticsearch nodes. There must be many indices and shards to spread the load on that many machines and that takes an appropriate sharding configuration (eventually adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might also be a good idea to have more than one primary shard per node, depending on the use case. I found that bulk indexing was pretty slow and only one CPU core was used, so we had idle CPU power and very low I/O; the hardware was definitely not the bottleneck. Thread pool stats showed that only one bulk thread was active during indexing. We have a lot of analyzers and a complex tokenizer (decomposed analysis of German words). Increasing the number of shards per node resulted in more bulk threads being active (one per shard on the node) and dramatically improved the indexing speed.
The number of primary shards and replicas depends on the following parameters:
Number of data nodes: The replica shards for a given primary shard are meant to be placed on different data nodes. This means that if there are 3 data nodes (DN1, DN2, DN3) and the primary shard is on DN1, then the replica shards should be on DN2 and/or DN3. Hence the number of replicas should be less than the total number of data nodes.
Capacity of each data node: A shard cannot be larger than the data node's hard disk, so the number of primary shards should be defined according to the expected size of the given index.
Recovery mechanism in case of failure: If the data in the given index can be recovered quickly, then 1 replica should be enough.
Performance requirements for the given index: Sharding helps direct the client node to the appropriate shard, which improves performance, so the query parameter and the size of the data belonging to that query parameter should be considered when defining the number of primary shards.
These are ideal, basic guidelines; they should be tuned to the actual use cases.
I have not tested this yet, but AWS has a good article about ES best practices. Look at the "Choosing Instance Types and Testing" part.
Elastic.co recommends to:
[…] keep the number of shards per node below 20 per GB heap it has configured
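Read as arithmetic, the quoted guideline gives an upper bound on the shard count per node for a given heap size. The sketch below only restates that rule; the function name `max_shards_per_node` and the example heap sizes are made up for the illustration.

```python
def max_shards_per_node(heap_gb: float, shards_per_gb: int = 20) -> int:
    # Upper bound from the quoted guideline: 20 shards per GB of configured heap.
    return int(heap_gb * shards_per_gb)


print(max_shards_per_node(25))  # 500
print(max_shards_per_node(4))   # 80
```

This is a ceiling to stay below, not a target; the other answers argue for far fewer shards per node.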

Resources