File sharding between servers algorithm

I want to distribute files across multiple servers and have them available with very little overhead. So I was thinking of the following naive algorithm:
Provided that each file has a unique ID number (e.g. 120151), I'm thinking of segmenting the files using the modulo (%) operator. This works if I know the number of servers in advance:
Example with 2 servers (the same idea generalizes to n servers):
server 1 : ID % 2 = 0 (contains even IDs)
server 2 : ID % 2 = 1 (contains odd IDs)
However, when I need to scale this and add more servers, I will have to re-shuffle the files to obey the new algorithm's rules, and we don't want that.
Example:
Say I add server 3 into the mix because I cannot handle the load. Server 3 will contain files that respect the following criteria:
server 3 : ID%3 = 2
Step 1 is to move the files from server 1 and server 2 where ID%3 = 2.
However, I'll have to move some files between server 1 and server 2 so that the following occurs:
server 1 : ID%3 = 0
server 2 : ID%3 = 1
What's the optimal way to achieve this?

My approach would be to use consistent hashing. From Wikipedia:
Consistent hashing is a special kind of hashing such that when a hash
table is resized and consistent hashing is used, only K/n keys need to
be remapped on average, where K is the number of keys, and n is the
number of slots.
The general idea is this:
Think of your servers as arranged on a ring, ordered by their server_id
Each server is assigned a uniformly distributed (random) id, e.g. server_id = SHA(node_name).
Each file is likewise assigned a uniformly distributed id, e.g. file_id = SHA(ID), where ID is as given in your example.
Choose the server that is 'closest' to the file_id, i.e. the first server with server_id > file_id (scanning from the smallest server_id).
If there is no such server, wrap around to the start of the ring (the smallest server_id).
Note: you can use any hash function that generates uniformly distributed hashes, so long as you use the same hash function for both servers and files.
This way, you get to keep O(1) access, and adding/removing servers is straightforward and does not require reshuffling all files:
a) when adding a new server, the new node takes over, from the next node on the ring, all files whose ids are lower than the new server's id
b) when removing a server, all of its files are handed to the next node on the ring
Tom White's graphically illustrated overview explains in more detail.
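A minimal Python sketch of the ring lookup just described, assuming SHA-1 as the hash (any uniformly distributing hash works, as noted above); the class and function names are illustrative, not from any particular library:

import hashlib
from bisect import bisect_right

def ring_position(key):
    # Uniformly distributed position on the ring, e.g. SHA(key).
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, node_names):
        # Each server gets server_id = SHA(node_name); keep the ring sorted.
        self.nodes = sorted(node_names, key=ring_position)
        self.positions = [ring_position(name) for name in self.nodes]

    def server_for(self, file_id):
        # First server whose server_id is greater than the file's position,
        # wrapping around to the smallest server_id if there is none.
        pos = ring_position(file_id)
        idx = bisect_right(self.positions, pos)
        return self.nodes[idx % len(self.nodes)]

ring = ConsistentHashRing(["server1", "server2", "server3"])
print(ring.server_for(120151))   # picks one of the three servers deterministically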

To summarize your requirements:
Each server should store an (almost) equal amount of files.
You should be able to determine which server holds a given file - based only on the file's ID, in O(1).
When adding a file, requirements 1 and 2 should hold.
When adding a server, you want to move some files to it from all existing servers, such that requirements 1 and 2 would hold.
Your strategy when adding a 3rd server (x is the file's ID):
x%6   Old   New
0     0     0
1     1     1
2     0     --> 2
3     1     --> 0
4     0     --> 1
5     1     --> 2
Alternative strategy:
x%6   Old   New
0     0     0
1     1     1
2     0     0
3     1     1
4     0     --> 2
5     1     --> 2
To locate a server after the change:
0: x%6 in [0,2]
1: x%6 in [1,3]
2: x%6 in [4,5]
Adding a 4th server:
x%12   Old   New
0      0     0
1      1     1
2      0     0
3      1     1
4      2     2
5      2     2
6      0     0
7      1     1
8      0     --> 3
9      1     --> 3
10     2     2
11     2     --> 3
To locate a server after the change:
0: x%12 in [0, 2, 6]
1: x%12 in [1, 3, 7]
2: x%12 in [4, 5, 10]
3: x%12 in [8, 9, 11]
When you add a server, you can always build a new function (actually several alternative functions). The divisor for n servers equals lcm(1, 2, ..., n), so it grows very fast.
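To make the lookup concrete, here is a minimal Python sketch of the table-based scheme above (Python 3.9+ for math.lcm; the residue tables are copied from the alternative strategy and the 4-server example, and the names are illustrative):

from math import lcm

# Residue-to-server tables from the tables above.
TABLE_3 = {0: 0, 2: 0, 1: 1, 3: 1, 4: 2, 5: 2}
TABLE_4 = {0: 0, 2: 0, 6: 0, 1: 1, 3: 1, 7: 1,
           4: 2, 5: 2, 10: 2, 8: 3, 9: 3, 11: 3}

def locate(file_id, table, n_servers):
    # O(1) lookup: which server holds the file with this ID.
    return table[file_id % lcm(*range(1, n_servers + 1))]

print(locate(120151, TABLE_3, 3))   # 120151 % 6 = 1  -> server 1
print(locate(120151, TABLE_4, 4))   # 120151 % 12 = 7 -> server 1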
Note that you didn't mention whether files are removed, or whether you plan to handle that.

Related

Segment tree built on "light bulbs"

I have encountered the following problem:
There are n numbers (0 or 1) and there are 2 operations. You can set all numbers in a specific range to 0 or 1 (note that setting 001 to 0 gives 000, not 110), and you can also ask how many elements are turned on in a specific range.
Example:
->Our array is 0100101
We set elements from 1 to 3 to 1:
->Our array is 1110101 now
We set elements from 2 to 5 to 0:
->Our array is 1000001 now
We are asking about sum from 2nd to 7th element
-> The answer is 1
The brute-force solution is too slow (O(n*q), where q is the number of queries), so I assume there has to be a faster one. Probably using a segment tree, but I cannot find it...
You could build a subsampling binary tree in the fashion of mipmaps used in computer graphics.
Each node of the tree contains the sum of its children's values.
E.g.:
0100010011110100
1 0 1 0 2 2 1 0
1 1 4 1
2 5
7
This will bring down complexity for a single query to O(log₂n).
For an editing operation, you also get O(log₂n) by implementing a shortcut: Instead of applying changes recursively, you stop at a node that is fully covered by the input range; thus creating a sparse representation of the sequence. Each node representing M light bulbs either
has value 0 and no children, or
has value M and no children, or
has a value in the range 1..M-1 and 2 children.
The tree above would actually be stored like this:
7
2 5
1 1 4 1
1 0 1 0 1 0
01 01 01
You end up with O(q*log₂n).
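A minimal Python sketch of the tree described above, with lazy range assignment implementing the "stop at a fully covered node" shortcut; the class and method names are illustrative:

class LightBulbTree:
    # Segment tree with range assignment (all 0 / all 1) and range sum.
    def __init__(self, n):
        self.n = n
        self.sum = [0] * (4 * n)      # bulbs turned on within the node's range
        self.lazy = [None] * (4 * n)  # pending assignment (0 or 1) for the whole range

    def _push(self, v, length):
        if self.lazy[v] is not None:
            for child, half in ((2 * v, length - length // 2), (2 * v + 1, length // 2)):
                self.lazy[child] = self.lazy[v]
                self.sum[child] = self.lazy[v] * half
            self.lazy[v] = None

    def _assign(self, v, lo, hi, l, r, value):
        if r < lo or hi < l:
            return
        if l <= lo and hi <= r:       # node fully covered: stop here (the shortcut)
            self.lazy[v] = value
            self.sum[v] = value * (hi - lo + 1)
            return
        self._push(v, hi - lo + 1)
        mid = (lo + hi) // 2
        self._assign(2 * v, lo, mid, l, r, value)
        self._assign(2 * v + 1, mid + 1, hi, l, r, value)
        self.sum[v] = self.sum[2 * v] + self.sum[2 * v + 1]

    def _query(self, v, lo, hi, l, r):
        if r < lo or hi < l:
            return 0
        if l <= lo and hi <= r:
            return self.sum[v]
        self._push(v, hi - lo + 1)
        mid = (lo + hi) // 2
        return self._query(2 * v, lo, mid, l, r) + self._query(2 * v + 1, mid + 1, hi, l, r)

    def assign(self, l, r, value):    # set bulbs l..r (1-based, inclusive) to value
        self._assign(1, 1, self.n, l, r, value)

    def count_on(self, l, r):         # how many bulbs are on in l..r
        return self._query(1, 1, self.n, l, r)

Replaying the example from the question:

t = LightBulbTree(7)
for i, bit in enumerate("0100101", start=1):
    t.assign(i, i, int(bit))
t.assign(1, 3, 1)        # -> 1110101
t.assign(2, 5, 0)        # -> 1000001
print(t.count_on(2, 7))  # -> 1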

Algorithm strategy to prevent values from bouncing between 2 values when on edge of two 'buckets'

I'm tracking various colored balls in OpenCV (Python) in real-time. The tracking is very stable, i.e. when stationary the values do not change by more than 1-2 pixels for the center of the circle.
However I'm running into what must surely be a well-researched issue: I now need to place the positions of the balls into a rougher grid - essentially simply dividing (and rounding) the x,y positions.
e.g.
input range is 0 -> 9
target range is 0 -> 1 (two buckets)
so I do: floor(input / 5)
input: [0 1 2 3 4 5 6 7 8 9]
output: [0 0 0 0 0 1 1 1 1 1]
This is fine, but the problem occurs when a small change in the input value causes rapid changes in the output, i.e. when the value sits at the 'edge' of a division - a 'sensitive' area.
input: [4 5 4 5 4 5 5 4 ...]
output:[0 1 0 1 0 1 1 0 ...]
i.e. values 4 and 5 (which fall within the 1-pixel error/'noisy' margin) cause rapid changes in the output.
What are some strategies / algorithms that deal with this?
I searched, but it seems I do not know how to express the issue correctly for Google (or StackOverflow).
I tried adding 'deadzones', i.e. rather than purely dividing, I leave 'gaps' in my output range, which means a value sometimes has no output (i.e. it falls between 'buckets'). This somewhat works, but it means a lot of the screen (i.e. the range of the fluctuation) is not used...
i.e.
input = [0 1 2 3 4 5 6 7 8 9]
output = [0 0 0 0 x x 1 1 1 1]
Temporal averaging is not ideal (and doesn't work too well either) - and increases the latency.
I just have a 'hunch' that there is a whole body of computer/signal-processing science about this.
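As a reference point, a minimal Python sketch of the deadzone idea described above. Returning the previously reported bucket while the value sits in the gap is a small extension, not something stated in the question; all names and defaults (bucket size 5, two buckets, 1-pixel noise margin) are illustrative:

def bucket(x, bucket_size=5, n_buckets=2):
    # Plain bucketing: floor(x / bucket_size), clamped to the valid range.
    return max(0, min(x // bucket_size, n_buckets - 1))

def bucket_with_deadzone(x, noise=1, last=None):
    # If x +/- noise could land in two different buckets, we are in the gap:
    # report the previously seen bucket instead of flickering.
    if bucket(x - noise) != bucket(x + noise):
        return last
    return bucket(x)

print([bucket_with_deadzone(x) for x in range(10)])
# [0, 0, 0, 0, None, None, 1, 1, 1, 1]  -- None marks the gap shown as 'x' above

last = 0
for x in [4, 5, 4, 5, 4, 5, 5, 4]:
    last = bucket_with_deadzone(x, last=last)
    # last stays 0: the output no longer bounces between 0 and 1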

Algorithm for reading matrices

I need an algorithm that processes an n x m matrix and that scales.
E.g. I have a time series of 3 seconds containing the values: 2,1,4.
I need to decompose it into a 3 x 4 matrix, where 3 is the number of elements of the time series and 4 is the maximum value. The resulting matrix would look like this:
1 1 1
1 0 1
0 0 1
0 0 1
Is this a bad solution, or is it just a data-entry problem?
The question is:
do I need to distribute the information from each row of the matrix across the various elements without losing the values?
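If the goal is simply to build that 0/1 matrix from the series values, here is a minimal Python sketch, assuming ones are filled from the top of each column as in the example; the function name is illustrative:

def decompose(series, max_value=None):
    # Column j gets series[j] ones, filled from the top,
    # in a matrix with max(series) rows.
    if max_value is None:
        max_value = max(series)
    return [[1 if row < series[col] else 0 for col in range(len(series))]
            for row in range(max_value)]

for row in decompose([2, 1, 4]):
    print(*row)
# 1 1 1
# 1 0 1
# 0 0 1
# 0 0 1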

Re-sort a vector after a small number of elements have been modified

If we have a vector of size N that was previously sorted, and replace up to M elements with arbitrary values (where M is much smaller than N), is there an easy way to re-sort them at lower cost (i.e. generate a sorting network of reduced depth) than a full sort?
For example if N=10 and M=2 the input might be
10 20 30 40 999 60 70 80 90 -1
Note: the indices of the modified elements are not known (until we compare them with the surrounding elements.)
Here is an example where I know the solution because the input size is small and I was able to find it with a brute-force search:
if N = 5 and M is 1, these would be valid inputs:
0 0 0 0 0   0 0 1 0 0   0 1 0 0 0   0 1 1 1 0   1 0 0 1 1   1 1 1 1 0
0 0 0 0 1   0 0 1 0 1   0 1 0 0 1   0 1 1 1 1   1 0 1 1 1   1 1 1 1 1
0 0 0 1 0   0 0 1 1 0   0 1 0 1 1   1 0 0 0 0   1 1 0 1 1
0 0 0 1 1   0 0 1 1 1   0 1 1 0 1   1 0 0 0 1   1 1 1 0 1
For example the input may be 0 1 1 0 1 if the previously sorted vector was 0 1 1 1 1 and the 4th element was modified, but there is no way to form 0 1 0 1 0 as a valid input, because it differs in at least 2 elements from any sorted vector.
This would be a valid sorting network for re-sorting these inputs:
>--*---*-----*-------->
   |   |     |
>--*---|-----|-*---*-->
       |     | |   |
>--*---|-*---*-|---*-->
   |   | |     |
>--*---*-|-----*---*-->
         |         |
>--------*---------*-->
We do not care that this network fails to sort some invalid inputs (e.g. 0 1 0 1 0.)
And this network has depth 4, a saving of 1 compared with the general case (a depth of 5 is generally necessary to sort a 5-element vector).
Unfortunately the brute-force approach is not feasible for larger input sizes.
Is there a known method for constructing a network to re-sort a larger vector?
My N values will be in the order of a few hundred, with M not much more than √N.
Ok, I'm posting this as an answer since the comment restriction on length drives me nuts :)
You should try this out:
implement a simple sequential sort working on local memory (insertion sort or something similar). If you don't know how - I can help with that.
have only a single work-item perform the sorting on the chunk of N elements
calculate the maximum size of local memory per work-group (call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE) and derive the maximum number of work-items per work-group,
because with this approach your number of work-items will most likely be limited by the amount of local memory.
This will probably work rather well I suspect, because:
a simple sort may be perfectly fine, especially since the array is already sorted to a large degree
parallelizing for such a small number of items is not worth the trouble (using local memory however is!)
since you're processing billions of such small arrays, you will achieve a great occupancy even if only single work-items process such arrays
Let me know if you have problems with my ideas.
EDIT 1:
I just realized I used a technique that may be confusing to others:
My proposal for using local memory is not for synchronization or using multiple work items for a single input vector/array. I simply use it to get a low read/write memory latency. Since we use rather large chunks of memory I fear that using private memory may cause swapping to slow global memory without us realizing it. This also means you have to allocate local memory for each work-item. Each work-item will access its own chunk of local memory and use it for sorting (exclusively).
I'm not sure how good this idea is, but I've read that using too much private memory may cause swapping to global memory and the only way to notice is by looking at the performance (not sure if I'm right about this).
Here is an algorithm which should yield very good sorting networks. Probably not the absolute best network for all input sizes, but hopefully good enough for practical purposes.
store (or have available) pre-computed networks for n < 16
sort the largest 2^k elements with an optimal network, e.g. a bitonic sort for the largest power of 2 less than or equal to n.
for the remaining elements, repeat #2 until m < 16, where m is the number of unsorted elements
use a known optimal network from #1 to sort any remaining elements
merge sort the smallest and second-smallest sub-lists using a merge sorting network
repeat #5 until only one sorted list remains
All of these steps can be performed symbolically, with the comparisons stored into a master network instead of acting on the data.
It is worth pointing out that the (bitonic) networks from #2 can be run in parallel, and the smaller ones will finish first. This is good, because as they finish, the networks from #5-6 can begin to execute.
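As a concrete illustration of recording comparisons into a network instead of acting on the data, here is a minimal Python sketch that records a bitonic network for a power-of-two input size (step #2 above); the merging of recorded sub-networks from steps #5-6 is left out, and the function names are illustrative:

def bitonic_network(n):
    # Record the comparators of a bitonic sorting network for n a power of two.
    # Each comparator (a, b) means: ensure x[a] <= x[b].
    comparators = []
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if (i & k) == 0:
                        comparators.append((i, partner))   # ascending block
                    else:
                        comparators.append((partner, i))   # descending block
            j //= 2
        k *= 2
    return comparators

def apply_network(network, data):
    # Run a recorded network on concrete data.
    x = list(data)
    for a, b in network:
        if x[a] > x[b]:
            x[a], x[b] = x[b], x[a]
    return x

net = bitonic_network(8)
print(apply_network(net, [10, 20, 30, 40, 999, 60, 70, -1]))
# [-1, 10, 20, 30, 40, 60, 70, 999]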

ELKI - input distance matrix

I'm trying to use ELKI for outlier detection; I have my custom distance matrix and I'm trying to feed it to ELKI to perform LOF (as a first step, for example).
I'm trying to follow http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances but it is not very clear to me. What I do:
I don't want to load data from a database, so I use:
-dbc DBIDRangeDatabaseConnection -idgen.count 100
(where 100 is the number of objects I'll be analyzing)
I use the LOF algorithm and point it to the external distance file:
-algorithm outlier.LOF
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix testData.ascii -lof.k 3
My distance file is as follows (very simple for testing purposes)
0 0 0
0 1 1
0 2 0.2
0 3 0.1
1 1 0
1 2 0.9
1 3 0.9
2 2 0
2 3 0.2
3 3 0
4 0 0.23
4 1 0.97
4 2 0.15
4 3 0.07
4 4 0
5 0 0.1
5 1 0.85
5 2 0.02
5 3 0.15
5 4 0.1
5 5 0
6 0 1
6 1 1
6 2 1
6 3 1
etc
The results say: "all in one trivial clustering", while this is not a clustering task and there definitely are outliers in my data.
Am I doing this right? Or what am I missing?
When using DBIDRangeDatabaseConnection, and not giving ELKI any actual data, the visualization cannot produce a particularly useful result (because it doesn't have the actual data, after all). Nor can the data be evaluated automatically.
The "all in one trivial clustering" is an artifact from the automatic attempts to visualize the data, but for the reasons discussed above this cannot work. This clustering is automatically added for unlabeled data, to allow some visualizations to work.
There are two things for you to do:
set an output handler. For example -resulthandler ResultWriter, which will produce an output similar to this:
ID=0 lof-outlier=1.0
Where ID= is the object number, and lof-outlier= is the LOF outlier score.
Alternatively, you can implement your own output handler. An example is found here:
http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/tutorial/outlier/SimpleScoreDumper.java
fix DBIDRangeDatabaseConnection. You are however bitten by a bug in ELKI 0.6.0~beta1: the DBIDRangeDatabaseConnection actually doesn't initialize its parameters correctly.
The trivial bug fix (parameters not initialized correctly in the constructor) is here:
http://elki.dbs.ifi.lmu.de/changeset/11027/elki
Alternatively, you can create a dummy input file and use the regular text input. A file containing
0
1
2
...
should do the trick. Use -dbc.in numbers100.txt -dbc.filter FixedDBIDsFilter -dbc.startid 0. The latter arguments make your IDs start at 0 instead of 1 (the default).
This workaround will produce a slightly different output format:
ID=0 0.0 lof-outlier=1.0
where the additional column is from the dummy file. The dummy values will not affect the result of LOF when an external distance function is used, but this approach uses some additional memory.
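For completeness, a minimal Python sketch to generate such a dummy input file (the name and count simply match the numbers100.txt example above):

# Write one object id per line, 0..99, into the dummy input file.
with open("numbers100.txt", "w") as f:
    for i in range(100):
        f.write("%d\n" % i)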
