Load-balancing with a parameter in a highly-concurrent scenario - algorithm

Let's say there are two service clusters A and B; each has tens or hundreds of hosts, and of course hosts may occasionally restart, be removed, or be added. Services in A make RPC calls to services in B through a method doRemoteCall(String shopId, ..). The scenario is highly concurrent: the cluster QPS can be 100k or more.
Now I hope that the load balancing of RPC calls from A to B follows the three rules below:
RPC requests with the same shopId are routed to the same host (ideally) or to the same group of hosts on B with high probability (of course, the higher the better; ideally it would be 100%)
the RPC calls are relatively evenly distributed among the hosts on B.
the routing decisions on hosts in A are made independently by each host, without knowledge of the other hosts (because it could be complex for each host to get information from other hosts, especially in a highly concurrent and dynamic scenario where hosts occasionally leave or join)

The magic google words are "consistent hashing".
In the classic consistent hashing scheme, each A host would do this (the constants 256 and 64-bit are arbitrary, but appropriate for this problem size):
For each host in B and each k in 1...256, calculate a 64-bit server hash Hs = hash(host, k)
For each user id, calculate a 64-bit hash Hu
Assign the user id to the host with the smallest Hs such that Hs >= Hu. If there is none, use the highest Hs.
Of course, the set of server hashes only needs to be modified when B hosts go up or down.
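As a rough illustration, here is a minimal sketch of that scheme in Go. The FNV-1a hash and the way virtual-node keys are derived (host name plus "#k") are my assumptions; any stable 64-bit hash works:

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// hash64 is a stand-in for any stable 64-bit hash.
func hash64(s string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(s))
    return h.Sum64()
}

// Ring holds the sorted server hashes Hs and which host owns each.
type Ring struct {
    points []uint64
    owner  map[uint64]string
}

// NewRing computes Hs = hash(host, k) for k in 1..256. It only needs
// to be rebuilt when hosts in B go up or down.
func NewRing(hosts []string) *Ring {
    r := &Ring{owner: make(map[uint64]string)}
    for _, host := range hosts {
        for k := 1; k <= 256; k++ {
            p := hash64(fmt.Sprintf("%s#%d", host, k))
            r.points = append(r.points, p)
            r.owner[p] = host
        }
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// Lookup picks the host with the smallest Hs >= Hu, falling back to
// the highest Hs if there is none, exactly as described above.
func (r *Ring) Lookup(shopID string) string {
    hu := hash64(shopID)
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= hu })
    if i == len(r.points) {
        i = len(r.points) - 1 // no Hs >= Hu: use the highest Hs
    }
    return r.owner[r.points[i]]
}

func main() {
    ring := NewRing([]string{"b1", "b2", "b3"})
    // The same shopId always resolves to the same host.
    fmt.Println(ring.Lookup("shop-42"), ring.Lookup("shop-42"))
}

Each A host can build this ring independently from its current view of B, which satisfies the third rule.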
I usually prefer rendezvous hashing, but with hundreds of hosts to balance over, it gets slow:
For each user id, for each host, calculate a 64-bit hash(userid,host)
Assign the userid to the host with the smallest hash.
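A matching sketch of rendezvous hashing, reusing hash64 from the block above; the O(hosts) loop per lookup is what makes it slow with hundreds of hosts:

// rendezvousPick computes hash(userid, host) for every host and
// assigns the id to the host with the smallest hash.
func rendezvousPick(hosts []string, shopID string) string {
    var best string
    var bestHash uint64
    for i, host := range hosts {
        v := hash64(shopID + "|" + host)
        if i == 0 || v < bestHash {
            best, bestHash = host, v
        }
    }
    return best
}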

Related

AnyLogic - How can I prioritize a pool of employees with Quality Matrix?

I have a pool of 10 employees. These employees are to operate two machines. 7 of these employees can operate machine A better than the remaining 3, who can operate machine B better.
Example: Worker XYZ works on machine A, on which he is better. On machine B, however, one worker is missing, because all 3 workers there are overloaded. Now worker XYZ should leave his prioritized work on machine A and go to machine B. As soon as the workload there becomes less again, worker XYZ should return to the prioritized machine A.
Does anyone have an idea how to implement this kind of prioritization? Ideally it would be based on a quality matrix, so that the skill values could be doubles rather than just booleans.
I had two approaches so far:
1.) Create an individual resource for each worker and query a priority function whenever a Seize block requests one, or
2.) Use the maintenance block with different priorities for different tasks and try to assign them to the individual employees.
Unfortunately, I'm not really getting anywhere with either approach. Can anyone help me? Thank you very much in advance.

Choosing a safe number of members for a CP Subsystem

Tried scouring the documentation, but I'm still uncertain about the CP Subsystem setup for my current situation.
We have a Hazelcast cluster spread across 2 data centers, each data center having an even number of members, say 4, though it can have as many as double that during a rollout.
The boxes in each data center are configured to be part of a separate partition group => 2 data centers - 2 partition groups, with 4-8 members each at a snapshot in time.
What would be the best number to set as CP Subsystem member count, considering that one data center might be decoupled as part of BAU?
I initially thought of setting the count to 5, to enforce having at least one box from each data center in the Raft consensus in the general case (rollover happens only for a short time during redeployment, so maybe it is not that big of a deal), but that might mean consensus will not be possible when one data center is decoupled. On the other hand, if I set a value smaller than the box count in one DC, say 3, what would happen if all the boxes in the consensus group were assigned to the same DC and that DC went away abruptly due to network conditions? These are mostly assumptions, since CP is a relatively new topic for me, so please correct me if I am wrong.
We prefer three datacenters, but sometimes a third is not available.
My team was faced with this same decision several years ago when expanding into a new jurisdiction. There were a lot of options; here are some. In all of these scenarios we did extensive testing of how the system behaved with network partitions.
Make a primary datacenter and a secondary datacenter
This is the option we ended up going with. We put 2/3 of the hosts in one datacenter and 1/3 in the secondary data-center. As much as possible, we weighted client traffic towards the primary datacenter. We also communicated with our customers about this preference so they could do the same if they wanted.
If the datacenter had multiple rooms, we made sure to spread hosts across the different rooms to help mitigate power/network outages within the datacenter. At a minimum, we ensured the hosts were on different racks.
We also had multiple clusters and for each cluster we usually switched which datacenter was the primary and which was the secondary. We didn't do this in some jurisdictions with notorious power troubles.
Split half and half
It's up to the gods what happens when a datacenter goes down. This is why we chose the first option: we wanted the choice of what happens when each datacenter goes down.
Have a tie-breaker in a different region
Put a host in an entirely different region from the two datacenters. Most of the time the latency will be too high for this host to fully participate in making consensus decisions, but in the case of a network partition it can help move the majority to one of the partitions.
The tie-breaker host must be a part of the quorum and cannot be kicked out because of latency delays.
Build a new datacenter
These things are very expensive, but it makes the durability story much nicer. Not always an option.
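Whichever layout you pick, the underlying arithmetic is the same: a Raft group of n members needs a majority of floor(n/2)+1 to make progress. A tiny sanity-check helper in Go (a sketch; survivesDCLoss is a hypothetical name):

// survivesDCLoss reports whether a CP group of n members can still
// reach a Raft majority after losing the k members hosted in a
// failed datacenter.
func survivesDCLoss(n, k int) bool {
    return n-k >= n/2+1
}

For the question above: with a CP member count of 5, losing a datacenter that holds 3 of the members stalls consensus (2 < 3), while losing one that holds 2 does not. That asymmetry is the whole point of the primary/secondary option.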

How can I pre-split a table in HBase

I am storing data in an HBase cluster with 5 region servers, using the md5 hash of a URL as the row key. Currently all the data is getting stored in one region server only, so I want to pre-split the regions so that the data goes uniformly across all region servers.
I want to split the table into five regions by the first character of the rowkey, so that rowkeys starting with 0-2 go to the 1st region, 3-5 to the 2nd, 6-8 to the 3rd, 9-c to the 4th, and d-f to the 5th. How can I do it?
You can provide a SPLITS property when creating the table.
create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}
The 4 split points will generate 5 regions.
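Since md5 rowkeys are uniformly distributed over the hex keyspace, you can also compute the split points instead of hand-picking them. A small sketch in Go (splitPoints is a hypothetical helper; with 16 hex characters over 5 regions, one region is unavoidably one character wider):

// splitPoints divides the 16-character hex keyspace into n roughly
// equal ranges and returns the n-1 single-character split points,
// e.g. n=5 -> ["3", "6", "9", "c"].
func splitPoints(n int) []string {
    const hexDigits = "0123456789abcdef"
    points := make([]string, 0, n-1)
    for i := 1; i < n; i++ {
        points = append(points, string(hexDigits[i*16/n]))
    }
    return points
}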
Note that HBase's DefaultLoadBalancer doesn't guarantee a 100% even distribution between regionservers; it can happen that a regionserver hosts multiple regions from the same table.
For more information about how it works, take a look at this:
public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)
Generate a global load balancing plan according to the specified map of server information to the most loaded regions of each server. The load balancing invariant is that all servers are within 1 region of the average number of regions per server. If the average is an integer number, all servers will be balanced to the average. Otherwise, all servers will have either floor(average) or ceiling(average) regions.
HBASE-3609 modeled regionsToMove using Guava's MinMaxPriorityQueue so that we can fetch from both ends of the queue. At the beginning, we check whether there is an empty region server just discovered by the Master. If so, we alternately choose new / old regions from the head / tail of regionsToMove, respectively. This alternation avoids clustering young regions on the newly discovered region server. Otherwise, we choose new regions from the head of regionsToMove. Another improvement from HBASE-3609 is that we assign regions from regionsToMove to underloaded servers in round-robin fashion. Previously one underloaded server would be filled before we moved on to the next underloaded server, leading to clustering of young regions. Finally, we randomly shuffle underloaded servers so that they receive offloaded regions relatively evenly across calls to balanceCluster(). The algorithm is currently implemented as such:
1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.
3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN; if so, we end up with a number of regions still required, neededRegions. It is also possible that we were able to fill each underloaded server but ended up with regions that were shed from overloaded servers and still have no assignment. If neither of these conditions holds (no regions needed to fill the underloaded servers, no regions left over from overloaded servers), we are done and return. Otherwise we handle these cases below.
4. If neededRegions is non-zero (we still have underloaded servers), we iterate down the most loaded servers again, shedding a single region from each (this brings them from having MAX regions to having MIN regions).
5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding off the overloaded servers. Iterate down the least loaded servers, filling each to MIN. If we still have regions that need assignment, iterate down the least loaded servers again, this time filling each to MAX, until we run out.
6. All servers will now host either MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of balancing. This ensures the minimal possible number of regions is moved.
TODO: We can at most reassign, away from a particular server, the number of regions it reports as most loaded. Should we just keep all assignments in memory? Any objections? Does this mean we need HeapSize on HMaster? Or just careful monitoring? (Current thinking is we will hold all assignments in memory.)
If all the data has already been stored, I recommend just moving some regions to other region servers manually using the hbase shell:
hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'
Move a region. Optionally specify the target regionserver, else we choose one at random. NOTE: You pass the encoded region name, not the region name, so this command is a little different from the others. The encoded region name is the hash suffix on region names: e.g. if the region name were TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396. then the encoded region name portion is 527db22f95c8a9e0116f0cc13c680396. A server name is its host, port plus startcode. For example: host187.example.com,60020,1289493121758
In case you are using Apache Phoenix to create tables in HBase, you can specify SALT_BUCKETS in the CREATE statement. The table will split into as many regions as the number of buckets specified. Phoenix calculates a hash of the rowkey (most probably a numeric hash % SALT_BUCKETS) and assigns the row to the appropriate region.
CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2) NOT NULL,
city VARCHAR NOT NULL,
population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city)) SALT_BUCKETS=3;
This will pre-split the table into 3 regions.
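Conceptually, the salting is just a deterministic byte prepended to the rowkey. A sketch of the idea in Go (the hash here is a simple stand-in, not Phoenix's actual function):

// saltedKey prepends hash(rowkey) % buckets as a single salt byte,
// which spreads sequential rowkeys across `buckets` key ranges.
func saltedKey(rowkey []byte, buckets int) []byte {
    var h uint32
    for _, b := range rowkey {
        h = h*31 + uint32(b) // stand-in hash
    }
    salt := byte(h % uint32(buckets))
    return append([]byte{salt}, rowkey...)
}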
Alternatively, the default HBase web UI allows you to split regions manually.

Determine Request Latency

I'm working on creating a version of Pastry natively in Go. From the design [PDF]:
It is assumed that the application provides a function that allows each Pastry node to determine the "distance" of a node with a given IP address to itself. A node with a lower distance value is assumed to be more desirable. An application is expected to implement this function depending on its choice of a proximity metric, using network services like traceroute or Internet subnet maps, and appropriate caching and approximation techniques to minimize overhead.
I'm trying to figure out the best way to determine the "proximity" (i.e., network latency) between two EC2 instances programmatically from Go. Unfortunately, I'm not familiar enough with low-level networking to differentiate between the different types of requests I could use. Googling did not turn up any suggestions for measuring latency from Go, and general latency techniques always seem to be Linux binaries, which I'm hoping to avoid in the name of fewer dependencies. Any help?
Also, I note that the latency should be on the scale of 1ms between two EC2 instances. While I plan to use the implementation on EC2, it could hypothetically be used anywhere. Is latency generally so bad that I should expend the effort to ensure the network proximity of two nodes? Keep in mind that most Pastry requests can be served in log base 16 of the number of servers in the cluster (so for 10,000 servers, it would take approximately 3 requests, on average, to find the key being searched for). Is the latency from, for example, EC2's Asia-Pacific region to EC2's US-East region enough to justify the increased complexity and the overhead introduced by the latency checks when adding nodes?
A common distance metric in networking is to count the number of hops (node-hops in-between) a packet needs to reach its destination. This metric was also mentioned in the text you quoted. This could give you adequate distance values even for the low-latency environment you mentioned (EC2 “local”).
For the Go logic itself, the net package is what you are looking for. And indeed, for latency tests (ICMP ping) you can use it to create an IP connection; note that the network string must include the protocol:
conn, err := net.Dial("ip4:icmp", "127.0.0.1")
Then create your ICMP packet structure and data, and send it. (See the Wikipedia page on ICMP; IPv6 needs a different format.) Unfortunately you can't create an ICMP connection directly, like you can with TCP and UDP, so you will have to handle the packet structure yourself.
As conn is of type Conn, which is a Writer, you can then pass it the ICMP data you defined.
In the ICMP Type field you specify the message type. Values 8, 0 and 30 are the ones you are looking for: 8 for your echo request, and the reply will be of type 0. Maybe 30 (Traceroute) gives you some more information.
Unfortunately, for counting the network hops you will need the IP packet header fields. This means you will have to construct your own IP packets, which net does not seem to allow.
Checking the source of Dial(), it uses internetSocket, which is not exported/public. I'm not really sure if I'm missing something, but it seems there is no simple way to construct your own IP packets with customizable header values. You'd have to check further how DialIP sends packets with internetSocket and duplicate and adapt that code/concept. Alternatively, you could use cgo and a system library to construct your own packets (this would add yet more complexity, though).
If you are planning on using IPv6, you will (also) have to look into ICMPv6; its packets have a different structure from their v4 counterparts.
So, I'd suggest using simple latency (a timed ping) as the simpler implementation and adding node-hop counting later if you need it. If you have both in place, you may also want to combine the two (fewer hops does not automatically mean better; think of long overseas cables, etc.).
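If a plain round-trip estimate is good enough to start with, you can sidestep raw sockets entirely by timing a TCP dial. A minimal sketch (the peer address is a placeholder, and a TCP handshake is roughly one round trip, not a true ICMP echo):

package main

import (
    "fmt"
    "net"
    "time"
)

// tcpLatency times how long a TCP handshake to addr takes. A dial is
// roughly one network round trip and needs no raw-socket privileges,
// unlike an ICMP echo.
func tcpLatency(addr string, timeout time.Duration) (time.Duration, error) {
    start := time.Now()
    conn, err := net.DialTimeout("tcp", addr, timeout)
    if err != nil {
        return 0, err
    }
    conn.Close()
    return time.Since(start), nil
}

func main() {
    // Placeholder peer; in a Pastry node this would be a candidate
    // neighbor's address and a port it is known to listen on.
    d, err := tcpLatency("example.com:80", 2*time.Second)
    if err != nil {
        fmt.Println("probe failed:", err)
        return
    }
    fmt.Println("dial round-trip:", d)
}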

Algorithms behind load-balancers?

I need to study load balancers, such as Network Load Balancing, Linux Virtual Server, HAProxy, etc. There are some under-the-hood things I need to know:
What algorithms/technologies are used in these load-balancers? Which is the most popular? most effective?
I expect that these algorithms/technologies will not be too complicated. Are there some resources written about them?
Load balancing in Apache, for example, is taken care of by the module called mod_proxy_balancer. This module supports 3 load balancing algorithms:
Request counting
Weighted traffic counting
Pending request counting
For more details, take a look here: mod_proxy_balancer
Not sure if this belongs on serverfault or not, but some load balancing techniques are:
Round Robin
Least Connections
I used least connections. It just made the most sense to send the person to the machine which had the least amount of load.
In general, load balancing is all about sending new client requests to the servers that are least busy. Based on the application you are running, assign a 'busy factor' to each server: a number reflecting one or several points of interest for your load-balancing algorithm (connected clients, CPU/memory usage, etc.). Then, at runtime, choose the server with the lowest such score. Basically any load-balancing technique is based on something like this:
Round robin does not implement a 'busy score' per se, but assigns each consecutive request to the next server in a circular queue.
Least connections has its score = number_of_open_connections to the server. Obviously, a server with fewer connections is a better choice.
Random assignment is a special case: you make an uninformed decision about the server's load, but assume the random choice is statistically evenly distributed.
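To make these concrete, here are minimal sketches of all three in Go (the Server type and its counters are stand-ins; a real balancer would update Conns as connections open and close):

package main

import (
    "fmt"
    "math/rand"
    "sync/atomic"
)

// Server is a stand-in type for a backend.
type Server struct {
    Addr  string
    Conns int64 // open connections; the busy score for least-connections
}

var rrCounter uint64

// roundRobin keeps no busy score at all: it hands each new request
// to the next server in a circular order.
func roundRobin(servers []*Server) *Server {
    n := atomic.AddUint64(&rrCounter, 1)
    return servers[n%uint64(len(servers))]
}

// leastConnections picks the server with the fewest open connections.
func leastConnections(servers []*Server) *Server {
    best := servers[0]
    for _, s := range servers[1:] {
        if atomic.LoadInt64(&s.Conns) < atomic.LoadInt64(&best.Conns) {
            best = s
        }
    }
    return best
}

// randomPick makes an uninformed but statistically even choice.
func randomPick(servers []*Server) *Server {
    return servers[rand.Intn(len(servers))]
}

func main() {
    servers := []*Server{{Addr: "a"}, {Addr: "b", Conns: 3}, {Addr: "c", Conns: 1}}
    fmt.Println(roundRobin(servers).Addr)       // "b", then "c", "a", "b", ...
    fmt.Println(leastConnections(servers).Addr) // "a" (0 open connections)
    fmt.Println(randomPick(servers).Addr)       // any of the three
}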
In addition to those already mentioned, a simple random assignment can be a good enough algorithm for load balancing, especially with a large number of servers.
Here's one link from Oracle: http://download-llnw.oracle.com/docs/cd/E11035_01/wls100/cluster/load_balancing.html
