How does Elasticsearch manage its shards? - elasticsearch

I have three nodes working together. On those nodes I have five indices, each with 5 primary shards, and each primary shard has 2 replicas. It looks like this (I cropped the picture to show only two of the five indices):
![IMG](http://i59.tinypic.com/2ez1wjt.png)
As you can see on the picture:
- node 1 has primary shards 0 and 3 (and replicas 1, 2 and 4)
- node 2 has primary shard 2 (and replicas 0, 1, 3 and 4)
- node 3 has primary shards 1 and 4 (and replicas 0, 2 and 3)
and this is the case for each index (the 5 of them).
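For reference, the same layout can also be listed with the _cat/shards API; a minimal sketch using Python's requests (the host and index name are placeholders for my setup):
```python
import requests

# List every shard of one index: the node it lives on and whether it is
# the primary ("p") or a replica ("r"). Host and index name are placeholders.
resp = requests.get(
    "http://localhost:9200/_cat/shards/index1",
    params={"v": "true", "h": "index,shard,prirep,state,node"},
)
print(resp.text)
# index1 0 p STARTED node1   (illustrative output)
# index1 0 r STARTED node2
# ...
```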
I understand that this "organisation" will change if I restart my nodes, but the layout of index 1 will still match that of indices 2, 3, 4 and 5. For example, after restarting, I might have:
- node 1 has primary shards 1 and 2 (and replicas 0, 3 and 4)
- node 2 has primary shard 3 (and replicas 0, 1, 2 and 4)
- node 3 has primary shards 0 and 4 (and replicas 1, 2 and 3)
and this would be the case for each index (the 5 of them).
Is there a reason why I always find the same pattern for each of my indices?
Thanks

Related

Reverse 0/1 knapsack: how is this row in the table possible?

I'm solving a reverse 0/1 knapsack problem, i.e. I'm trying to recreate the list of weights and values of all the items using only the DP-table.
I have this table:
      0  1  2  3  4  5   6   7   8   9   10  11  12
[0]   0  0  0  0  0  0   0   0   0   0   0   0   0
[1]   0  4  4  4  4  4   4   4   4   4   4   4   4
[2]   0  4  4  4  6  10  10  10  10  10  10  10  10
I don't understand how row [2] is possible.
[0] - it is clear that if we do not put anything in the knapsack, the total value is 0.
[1] - in row [1] I see that [1][1] = 4, so I conclude (I hope correctly) that the first item has weight = 1 and value = 4. Since this row considers only one item, that is the only value we can hope for in this row.
[2] - when we reach [2][4] we have 6, and 6 > [1][4], so I assume that we use 2 items here: the old one with weight = 1 and value = 4, and a new one with weight = 4 - 1 = 3 and value = 6 - 4 = 2.
Question: How is it possible to have [2][5] = 10? We can't add more than one new item per row, as I understand this chart. If we are using two items here, shouldn't we have 6 for all the entries in row [2] from [2][4] to the end of the row?
This seems possible if you have two items, one with weight 1 and value 4 and one with weight 4, value 6.
How? When you're at index (2, 4) you have a weight capacity of 4 for the first time in the row that considers item 2 (weight 4, value 6). This lets you take the item with value 6 instead of the weight 1, value 4 item you previously took at index (2, 3), effectively building on the subproblem at index (1, 0) (the previous row, with a remaining capacity of 0).
Now, when you're at index (2, 5) with a weight capacity of 5, the total value of 10 is possible because you can take both items. That's the best you can do for the rest of the row.
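To make this concrete, here is a minimal sketch of how such a table is built for the two items inferred above, (weight 1, value 4) and (weight 4, value 6):
```python
def knapsack_table(items, capacity):
    # Classic 0/1 knapsack DP: row i considers the first i items,
    # column c is the weight capacity.
    dp = [[0] * (capacity + 1) for _ in range(len(items) + 1)]
    for i, (weight, value) in enumerate(items, start=1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]                  # skip item i
            if weight <= c:                          # or take it, building on the
                dp[i][c] = max(dp[i][c],             # previous row with the
                               dp[i - 1][c - weight] + value)  # remaining capacity
    return dp

# Items inferred from the table: (weight=1, value=4) and (weight=4, value=6)
for row in knapsack_table([(1, 4), (4, 6)], 12):
    print(row)
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# [0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
# [0, 4, 4, 4, 6, 10, 10, 10, 10, 10, 10, 10, 10]
```
Row [2] reproduces the 6 at capacity 4 and the 10 from capacity 5 onwards.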
See also How to find which elements are in the bag, using Knapsack Algorithm [and not only the bag's value]?

Binary Heap - inserting order

Does insertion order affect the structure of a binary heap? I mean, is it possible to get slightly different parent-children relations when inserting the same elements in different orders,
for example:
20 6 3 5 7 8 16 10 (insertion order #1) and 6 3 20 10 16 3 7 5 (insertion order #2)
or should the final result always be the same?
A heap can store a given collection of values in different ways. For instance, if the heap happens to be a perfect binary tree, then you can swap any subtree with its sibling subtree without violating the heap property.
For example, if the data collection has the values 1, 2 and 3, there are 2 possible heaps that can represent that data set:
  1         1
 / \       / \
2   3     3   2
For example, inserting in the order 1, 2, 3 produces the first heap, while inserting 1, 3, 2 produces the second.
If we look at an input with four values (e.g. 1, 2, 3 and 4), there are three possible heaps that can represent that data set:
    1        1        1
   / \      / \      / \
  2   3    2   4    3   2
 /        /        /
4        3        4
Again, the order of insertion will determine which of those three heaps is the end result.
If you imagine the sequences 1 1 1 1 2 2 and 1 1 2 1 1 2, you should end up with different heaps.
    1                 1
   / \               / \
  1   1    versus   1   2
 / \  |            / \  |
1   2 2           1   1 2
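You can also see this directly by pushing the same values in the two orders above and comparing the underlying arrays; a small sketch using Python's heapq (the list is the level-order reading of the tree):
```python
import heapq

def build_heap(order):
    heap = []
    for value in order:
        heapq.heappush(heap, value)  # sift-up insertion into a binary min-heap
    return heap                      # the list is the level-order of the tree

print(build_heap([1, 1, 1, 1, 2, 2]))  # [1, 1, 1, 1, 2, 2]
print(build_heap([1, 1, 2, 1, 1, 2]))  # [1, 1, 2, 1, 1, 2]
```
Both heaps hold the same values, yet the parent-child relations differ, so the insertion order does affect the structure.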

Elasticsearch shard allocation

I have 1 ES cluster with 3 nodes, 1 index, 3 shards and 2 replicas per shard.
For some reason, all my primary shards are located on the same node:
Node 1: replica 0, replica 1, replica 2
Node 2: replica 0, replica 1, replica 2
Node 3: primary 0, primary 1, primary 2.
What should I do to rebalance the shards? I want to have 1 primary shard per node, for example:
Node 1: primary 0, replica 1, replica 2
Node 2: replica 0, primary 1, replica 2
Node 3: replica 0, replica 1, primary 2.
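One commonly suggested approach is to cancel the current allocation of a primary with the cluster reroute API, so that one of its replicas on another node is promoted; a hedged sketch with Python's requests (host, index name and node name are assumptions):
```python
import requests

# Cancel the allocation of primary shard 1 that currently sits on "node3".
# allow_primary must be set to cancel a primary; one of that shard's replicas
# (on node1 or node2) is then promoted to primary in its place.
reroute = {
    "commands": [
        {"cancel": {"index": "my_index", "shard": 1,
                    "node": "node3", "allow_primary": True}}
    ]
}
resp = requests.post("http://localhost:9200/_cluster/reroute", json=reroute)
print(resp.json())
```
Repeating this for one shard per node spreads the primaries as in the desired layout.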

Having a bidirectional graph, what is the best way to remove paths that join certain nodes?

Imagine I have a bidirectional graph with 4 nodes and the following connections:
0 <-> 2 ; 0 <-> 3 ; 1 <-> 2 ; 1 <-> 3
Now imagine I have a group of nodes K (0 and 1), and I want to calculate the minimum number of connections I have to remove so that those nodes aren't ALL connected. In this example it suffices to remove:
0 <-> 3 ; 1 <-> 2
This way there's no path connecting 0 and 1. In fact, even if the group K contained something like 10 nodes, 9 of them could stay connected as long as at least 1 isn't (that's why I wrote "ALL" in upper case above).
Another example would be:
0 <-> 2 ; 0 <-> 3 ; 0 <-> 4 ; 1 <-> 2 ; 1<->3
With a group of nodes K (0, 1, 4), I would only need to remove 1 connection to stop them ALL from being connected:
0 <-> 4
I've tried a lot of things on my own, like calculating all paths within the K group, checking for repeated paths and removing those, but it doesn't work for all cases (like the first one I posted above).
Is there an algorithm that can help me with this? I've tried Google but I can't find documentation for this type of problem; maybe it's not very common.
Thanks in advance.
Example 1:
From your graph:
(0,2),(0,3),(1,2),(1,3)
  2
 / \
0   1
 \ /
  3
K(0, 1)
Create a tree like this:
    0
   / \
  2   3
 /     \
1       1
Each branch begins at 0 and ends at 1. If a branch does not reach 1, it's not included. Remove the topmost edges (in case of branching below that point). It doesn't matter if you build the tree from 0 to 1 or from 1 to 0 since the graph is bidirectional.
Example 2:
Graph:
(0,1),(1,2),(2,3)
0 -- 1 -- 2 -- 3
K(1, 2)
Tree:
1
|
2
Remove:
(1,2)
Example 3:
Graph:
(0,2),(0,3),(0,4),(1,2),(1,3)
    0
  / | \
 2  3  4
  \ /
   1
K(0, 1, 4)
Tree:
    0
  / | \    <-- 2 edges leading to 1; 1 edge leading to 4
 2  3  4
 |  |
 1  1
Remove:
(0,4)
You can count the number of edges that each node has. If you disconnect all the edges from a node, you disconnect it from the graph. So the minimum number of connections you have to remove is the number of edges of the vertex with the fewest edges.
Say your graph has bidirectional connections 0-1, 0-2, 0-3, 2-3, 3-1.
0 has 3 edges connecting it.
3 has 3 edges connecting it.
1 has 2 edges connecting it.
2 has 2 edges connecting it.
So you should remove 0-2 and 2-3 to disconnect 2 from the graph. Now you can no longer go from 2 to any of the other points.
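For graphs as small as those in the question, a proposed answer can be sanity-checked by brute force: remove ever larger subsets of edges until the K nodes are no longer all connected. A sketch (function names are just illustrative):
```python
from itertools import combinations

def all_k_connected(edges, k_nodes):
    """True if every node in k_nodes lies in the same connected component."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    k_nodes = set(k_nodes)
    start = next(iter(k_nodes))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return k_nodes <= seen

def min_edges_to_separate(edges, k_nodes):
    """Fewest edges whose removal leaves the K nodes not all connected."""
    for size in range(len(edges) + 1):
        for removed in combinations(edges, size):
            remaining = [e for e in edges if e not in removed]
            if not all_k_connected(remaining, k_nodes):
                return size, removed
    return None

# First example: prints a minimum of 2 (the question removes (0,3) and (1,2);
# other minimum sets exist).
print(min_edges_to_separate([(0, 2), (0, 3), (1, 2), (1, 3)], [0, 1]))
# Third example: prints a minimum of 1, removing (0,4).
print(min_edges_to_separate([(0, 2), (0, 3), (0, 4), (1, 2), (1, 3)], [0, 1, 4]))
```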

File sharding between servers algorithm

I want to distribute files across multiple servers and have them available with very little overhead. So I was thinking of the following naive algorithm:
Provided that each file has a unique ID number (e.g. 120151), I'm thinking of segmenting the files using the modulo (%) operator. This works if I know the number of servers in advance:
Example with 2 servers (the same idea applies to n servers):
server 1 : ID % 2 = 0 (contains even IDs)
server 2 : ID % 2 = 1 (contains odd IDs)
However, when I need to scale and add more servers, I will have to re-shuffle the files to obey the new rules, and we don't want that.
Example:
Say I add server 3 into the mix because I cannot handle the load. Server 3 will contain files that respect the following criteria:
server 3 : ID%3 = 2
Step 1 is to move the files from server 1 and server 2 where ID%3 = 2.
However, I'll have to move some files between server 1 and server 2 so that the following occurs:
server 1 : ID%3 = 0
server 2 : ID%3 = 1
What's the optimal way to achieve this?
My approach would be to use consistent hashing. From Wikipedia:
Consistent hashing is a special kind of hashing such that when a hash table is resized and consistent hashing is used, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots.
The general idea is this:
- Each server is assigned a uniformly distributed (random) id, e.g. server_id = SHA(node_name).
- Think of your servers as arranged on a ring, ordered by their server_id.
- Each file is likewise assigned a uniformly distributed id, e.g. file_id = SHA(ID), where ID is as given in your example.
- Choose the server that is 'closest' to the file_id, i.e. the first one where server_id > file_id (start checking from the smallest server_id).
- If there is no such server, wrap around the ring and use the one with the smallest server_id.
Note: you can use any hash function that generates uniformly distributed hashes, so long as you use the same hash function for both servers and files.
This way, you keep O(1) access, and adding/removing servers is straightforward and does not require reshuffling all files:
a) when adding a new server, the new node takes over from the next node on the ring all files whose ids are lower than the new server's id
b) when removing a server, all of its files are handed to the next node on the ring
Tom White's graphically illustrated overview explains in more detail.
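A minimal sketch of such a ring (one token per server for simplicity; real setups usually give each server several virtual nodes, and the hash choice is only an example):
```python
import bisect
import hashlib

def stable_hash(key):
    # Same uniform hash for server names and file IDs.
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, server_names):
        # server_id = SHA(node_name); keep the ring sorted for binary search.
        self.ring = sorted((stable_hash(name), name) for name in server_names)
        self.ids = [server_id for server_id, _ in self.ring]

    def server_for(self, file_id):
        file_hash = stable_hash(file_id)                 # file_id = SHA(ID)
        index = bisect.bisect_left(self.ids, file_hash)  # first server_id >= file hash
        if index == len(self.ids):                       # past the largest id: wrap
            index = 0
        return self.ring[index][1]

ring = ConsistentHashRing(["server1", "server2"])
print(ring.server_for(120151))
# Adding a server only remaps the files that now hash to the new server:
ring = ConsistentHashRing(["server1", "server2", "server3"])
print(ring.server_for(120151))
```
Only the files whose hashes fall in the new server's arc change owners, which matches the K/n figure from the quote.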
To summarize your requirements:
Each server should store an (almost) equal amount of files.
You should be able to determine which server holds a given file - based only on the file's ID, in O(1).
When adding a file, requirements 1 and 2 should hold.
When adding a server, you want to move some files to it from all existing servers, such that requirements 1 and 2 would hold.
Your strategy when adding a 3rd server (x is the file's ID):
x%6 Old New
0 0 0
1 1 1
2 0 --> 2
3 1 --> 0
4 0 --> 1
5 1 --> 2
Alternative strategy:
x%6 Old New
0 0 0
1 1 1
2 0 0
3 1 1
4 0 --> 2
5 1 --> 2
To locate a server after the change:
0: x%6 in [0,2]
1: x%6 in [1,3]
2: x%6 in [4,5]
Adding a 4th server:
x%12 Old New
0 0 0
1 1 1
2 0 0
3 1 1
4 2 2
5 2 2
6 0 0
7 1 1
8 0 --> 3
9 1 --> 3
10 2 2
11 2 --> 3
To locate a server after the change:
0: x%12 in [0, 2, 6]
1: x%12 in [1, 3, 7]
2: x%12 in [4, 5, 10]
3: x%12 in [8, 9, 11]
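Lookups stay O(1) if you keep the residue-to-server mapping as a small table; a sketch of the 4-server mapping above (the table is exactly the one listed; the function name is illustrative):
```python
# Residue of ID % 12 -> server, exactly the 4-server table above.
RESIDUE_TO_SERVER = {
    0: 0, 2: 0, 6: 0,
    1: 1, 3: 1, 7: 1,
    4: 2, 5: 2, 10: 2,
    8: 3, 9: 3, 11: 3,
}

def locate_server(file_id):
    return RESIDUE_TO_SERVER[file_id % 12]

print(locate_server(120151))  # 120151 % 12 == 7 -> server 1
```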
When you add a server, you can always build a new function like this (actually several alternative functions). The value of the divisor for n servers equals lcm(1, 2, ..., n), so it grows very fast.
Note that you didn't mention whether files are ever removed, and whether you plan to handle that.
