Replicas created on same node before being transferred - elasticsearch

I have an Elasticsearch cluster made up of 3 nodes.
Every day, I have a batch that feeds in a new index composed of 3 shards then scales the number of replicas to 1. So at the end of the day I'm expecting every node to carry 1 primary and 1 replica.
The figure below shows the disk space usage on each node during this operation.
On node 0 everything seems to be going smoothly during that operation.
However, node 2 is idle most of the time at the beginning, while node 1 seems to be taking care of its own replica plus node 2's replica, before transferring the latter to node 2 (this is my own understanding, I might be wrong). This puts a lot of pressure on node 1, whose disk usage almost reaches 100%.
Why this behaviour? Shouldn't every node take care of its own replica here to even out the load? Can I force it to do so somehow? This is worrying because when a disk reaches 100%, the entire node goes down, as has happened in the past.
UPDATE to Val's answer:
You will find the outputs below
GET _cat/shards/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index                           shard prirep state   docs    store  ip            node
xxxxxxxxxxxxxxxxxxxxxx_20210617 1     p      STARTED 8925915 13.4gb 172.23.13.255 es-master-0
xxxxxxxxxxxxxxxxxxxxxx_20210617 1     r      STARTED 8925915 13.4gb 172.23.10.76  es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 2     r      STARTED 8920172 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 2     p      STARTED 8920172 13.4gb 172.23.10.76  es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 0     p      STARTED 8923889 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 0     r      STARTED 8923889 13.5gb 172.23.13.255 es-master-0
GET _cat/recovery/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 382ms empty_store done n/a n/a 172.23.24.221 es-master-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 21.9m peer done 172.23.24.221 es-master-1 172.23.13.255 es-master-0 n/a n/a 188 188 100.0% 188 14467579393 14467579393 100.0% 14467579393 55835 55835 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 395ms empty_store done n/a n/a 172.23.13.255 es-master-0 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 9m peer done 172.23.13.255 es-master-0 172.23.10.76 es-master-2 n/a n/a 188 188 100.0% 188 14486949488 14486949488 100.0% 14486949488 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 17.8m peer done 172.23.10.76 es-master-2 172.23.24.221 es-master-1 n/a n/a 134 134 100.0% 134 14470475298 14470475298 100.0% 14470475298 1894 1894 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 409ms empty_store done n/a n/a 172.23.10.76 es-master-2 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%

First, if you have 3 nodes and your index has 3 primaries with each having 1 replica, there's absolutely no guarantee whatsoever that each node will hold one primary and one replica.
The only guarantees you have are that:
the shard count will be balanced over the nodes and
a primary and its replica will never land on the same node.
That being said, it's perfectly possible for one node to get two primaries, another to get two replicas, and the third to get one primary and one replica.
Looking at the chart, what I think happens in your case is that
node 2 gets two primaries and
node 0 gets one primary
Then, when you add the replica:
node 0 (which has only one primary) gets one replica (the curve is less steep)
node 1 (which has nothing so far) gets two replicas (the curve grows steeper)
node 2 stays flat because it already has two primaries
A little later, when node 1's disk approaches saturation, one shard is relocated away from it to node 2 (at 23:16 the curve starts to increase).
The end situation seems to be:
node 0 with one primary and one replica
node 1 with only one replica
node 2 with two primaries and one replica
I think it would be nice to confirm this with the following two commands:
# you can see where each shard is located now
GET _cat/shards/tax*?v
# you can see which shards went from which node to which node
GET _cat/recovery/indexname*?v
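Regarding the "can I force it?" part of the question: this is not from the answer above, but one hedged option is the per-index setting index.routing.allocation.total_shards_per_node, which caps how many shards of the index a single node may hold. A minimal sketch in Python (the cluster URL and index name are placeholders):

import requests

# Hypothetical values; adjust to your environment.
ES_URL = "http://localhost:9200"
INDEX = "myindex_20210617"

# Cap the shards of this index (primaries + replicas) that one node may hold.
# With 3 nodes and 6 shards total, a cap of 2 evens out the per-node footprint,
# although it does not guarantee a one-primary/one-replica split per node.
# Caveat: if a node is lost, the cap can leave shards unassigned.
resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"routing": {"allocation": {"total_shards_per_node": 2}}}},
)
print(resp.status_code, resp.json())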

Related

Token bucket vs Fixed window (Traffic Burst)

I was comparing the token bucket and fixed window rate limiting algorithms, but I'm a bit confused about traffic bursts in both.
Let's say I want to limit traffic to 10 requests/minute.
In Token bucket, tokens are added at the rate of 10 tokens per minute.
Time      Requests  AvailableTokens
10:00:00  0         10 (added 10 tokens)
10:00:58  10        0
10:01:00  0         10 (added 10 tokens)
10:01:01  10        0
Now if we look at timestamp 10:01:01, 20 requests were allowed in the last minute, more than our limit.
Similarly with the fixed window algorithm.
Window size: 1 minute.
Window    RequestCount  IncomingRequests
10:00:00  10            10 req at 10:00:58
10:01:00  10            10 req at 10:01:01
The same problem exists here as well.
Do both algorithms suffer from this problem, or is there a gap in my understanding?
I had the same confusion about those algorithms.
The trick with the token bucket is that the bucket size (b) and the refill rate (r) don't have to be equal.
For your particular example, you could set Bucket size to be b = 5 and refill rate r = 1/10 (1 token per 10 seconds).
With this example, the client is still able to make 11 requests per minute, but that's already less than the 20 in your example, and they are spread out over time. I also believe that if you play with the parameters, you can achieve a strategy where more than 10 requests/min is never allowed.
Time      Requests  AvailableTokens
10:00:00  0         5 (we have 5 tokens initially)
10:00:10  0         5 (refill attempt failed cause Bucket is full)
10:00:20  0         5 (refill attempt failed cause Bucket is full)
10:00:30  0         5 (refill attempt failed cause Bucket is full)
10:00:40  0         5 (refill attempt failed cause Bucket is full)
10:00:50  0         5 (refill attempt failed cause Bucket is full)
10:00:58  5         0
10:01:00  0         1 (refill 1 token)
10:01:10  0         2 (refill 1 token)
10:01:20  0         3 (refill 1 token)
10:01:30  0         4 (refill 1 token)
10:01:40  0         5 (refill 1 token)
10:01:49  5         0
10:01:50  0         1 (refill 1 token)
10:01:56  1         0
Other options:
b = 10 and r = 1/10
b = 9 and r = 1/10
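To make the roles of b and r concrete, here is a minimal token bucket sketch in Python (my own illustration, not from either post); capacity plays the role of b and refill_rate the role of r, in tokens per second:

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # b: maximum burst size
        self.refill_rate = refill_rate  # r: tokens added per second
        self.tokens = capacity          # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# b = 5, r = 1/10 (one token every 10 seconds), as in the table above.
bucket = TokenBucket(capacity=5, refill_rate=0.1)
print(bucket.allow())  # True while tokens remain, False once the bucket is empty

The burst size and the sustained rate are controlled independently, which is exactly what the fixed window algorithm cannot do.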

MQTT Keep Alive byte format

The MQTT 3.1.1 documentation is very clear and helpful; however, I am having trouble understanding the meaning of one section regarding the Keep Alive byte structure in the CONNECT message.
The documentation states:
The Keep Alive is a time interval measured in seconds. Expressed as a 16-bit word, it is the maximum time interval that is permitted to elapse between the point at which the Client finishes transmitting one Control Packet and the point it starts sending the next.
And gives an example of a keep alive payload:
Keep Alive MSB (0) 0 0 0 0 0 0 0 0
Keep Alive LSB (10) 0 0 0 0 1 0 1 0
I have interpreted this to represent a keep alive interval of 10 seconds, as the interval is given in seconds and that makes the most sense. However I'm not sure how you would represent longer intervals of, for example, 10 minutes.
Finally, would the maximum keep alive interval of 65535 seconds (~18 hours) be represented by these bytes?
Keep Alive MSB (255) 1 1 1 1 1 1 1 1
Keep Alive LSB (255) 1 1 1 1 1 1 1 1
Thank you for your help
The maximum value of a 16-bit field is 2^16 - 1 = 65535 seconds.
65535 / 3600 ≈ 18.2 hours; more precisely, 65535 seconds = 18 hours, 12 minutes, 15 seconds.
10 minutes = 600 seconds
600 in binary -> 0000 0010 0101 1000
And yes, 65535 is the largest number that can be represented by a 16-bit binary field, so the all-ones MSB/LSB pair above does represent it, but there are very few situations where an 18-hour keep alive interval would make sense.
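As an illustration (my own sketch, not from the spec), the Keep Alive field is just an unsigned 16-bit big-endian integer, so encoding a few values in Python shows the MSB/LSB split directly:

import struct

def keep_alive_bytes(seconds):
    # The Keep Alive field is a 16-bit unsigned big-endian integer (MSB then LSB).
    if not 0 <= seconds <= 0xFFFF:
        raise ValueError("Keep Alive must fit in 16 bits (0..65535 seconds)")
    return struct.pack(">H", seconds)

print(keep_alive_bytes(10).hex())     # 000a -> MSB 0, LSB 10
print(keep_alive_bytes(600).hex())    # 0258 -> 10 minutes
print(keep_alive_bytes(65535).hex())  # ffff -> maximum, about 18.2 hours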

DSE - Cassandra : Commit Log Disk Impact on Performances

I'm running a DSE 4.6.5 Cluster (Cassandra 2.0.14.352).
Following datastax's guidelines, on every machine, I separated the data directory from the commitlog/saved caches directories:
data is on blazing fast drives
commit log and saved caches are on the system drives: 2 HDDs in RAID1
Monitoring disks with OpsCenter while performing intensive writes, I see no issue with the first; however, the queue size on the second (commit log) averages around 300 to 400, with spikes up to 700 requests. Of course the latency is also fairly high on these drives...
Is this affecting the performance of my cluster?
Would you recommend putting the commit log and saved caches on an SSD, separate from the system disks?
Thanks.
Edit - Adding tpstats from one of the nodes:
[root@dbc4 ~]# nodetool tpstats
Pool Name                Active  Pending  Completed  Blocked  All time blocked
ReadStage                0       0        15938      0        0
RequestResponseStage     0       0        154745533  0        0
MutationStage            1       0        306973172  0        0
ReadRepairStage          0       0        253        0        0
ReplicateOnWriteStage    0       0        0          0        0
GossipStage              0       0        340298     0        0
CacheCleanupExecutor     0       0        0          0        0
MigrationStage           0       0        0          0        0
MemoryMeter              1       1        36284      0        0
FlushWriter              0       0        23419      0        996
ValidationExecutor       0       0        0          0        0
InternalResponseStage    0       0        0          0        0
AntiEntropyStage         0       0        0          0        0
MemtablePostFlusher      0       0        27007      0        0
MiscStage                0       0        0          0        0
PendingRangeCalculator   0       0        7          0        0
CompactionExecutor       8       10       7400       0        0
commitlog_archiver       0       0        0          0        0
HintedHandoff            0       1        222        0        0
Message type      Dropped
RANGE_SLICE       0
READ_REPAIR       0
PAGED_RANGE       0
BINARY            0
READ              0
MUTATION          49547
_TRACE            0
REQUEST_RESPONSE  0
COUNTER_MUTATION  0
Edit 2 - sar output:
04:10:02 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
04:10:02 PM  all  22.25  26.33  1.93     0.48     0.00    49.02
04:20:01 PM  all  23.23  26.19  1.90     0.49     0.00    48.19
04:30:01 PM  all  23.71  26.44  1.90     0.49     0.00    47.45
04:40:01 PM  all  23.89  26.22  1.86     0.47     0.00    47.55
04:50:01 PM  all  23.58  26.13  1.88     0.53     0.00    47.88
Average:     all  21.60  26.12  1.71     0.56     0.00    50.01
Monitoring disks with OpsCenter while performing intensive writes, I see no issue with the first,
Cassandra persists writes in memory (memtable) and on the commitlog (disk).
When the memtable size grows to a threshold, or when you manually trigger it, Cassandra will write everything to disk (flush the memtables).
To make sure your setup is capable of handling your workload, try to manually flush all your memtables on a node:
nodetool flush
Or flush just a specific keyspace/table with:
nodetool flush [keyspace] [columnfamily]
At the same time, monitor your disks' I/O.
If you have high I/O wait you can either share the workload by adding more nodes, or switch the data drives to better ones with higher throughput.
Keep an eye on dropped mutations (they can be other nodes sending the writes/hints) and on blocked flush writers.
I see the queue size from the later (commit log) averaging around 300 to 400 with spikes up to 700 requests.
This will probably be your writes to the commitlog.
Is your hardware serving anything else? Is it software RAID? Do you have swap disabled?
Cassandra works best alone :) So yes, put at least the commit log on a separate (it can be smaller) disk.
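If you want to watch the commit log disk from a script rather than from OpsCenter, a rough sketch (my own, assuming the third-party psutil package and that the commit log lives on a device named sda; adjust to your setup):

import time
import psutil  # third-party package: pip install psutil

DEVICE = "sda"  # hypothetical device holding the commit log

prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
while True:
    time.sleep(5)
    cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
    # Per-interval write throughput and write operation count for this device.
    mb_written = (cur.write_bytes - prev.write_bytes) / (1024 * 1024)
    writes = cur.write_count - prev.write_count
    print(f"{DEVICE}: {mb_written:.1f} MB written, {writes} write ops in the last 5s")
    prev = cur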

Re-sort a vector after a small number of elements have been modified

If we have a vector of size N that was previously sorted, and replace up to M elements with arbitrary values (where M is much smaller than N), is there an easy way to re-sort them at lower cost (i.e. generate a sorting network of reduced depth) than a full sort?
For example if N=10 and M=2 the input might be
10 20 30 40 999 60 70 80 90 -1
Note: the indices of the modified elements are not known (until we compare them with the surrounding elements.)
Here is an example where I know the solution because the input size is small and I was able to find it with a brute-force search:
if N = 5 and M is 1, these would be valid inputs:
0 0 0 0 0   0 0 1 0 0   0 1 0 0 0   0 1 1 1 0   1 0 0 1 1   1 1 1 1 0
0 0 0 0 1   0 0 1 0 1   0 1 0 0 1   0 1 1 1 1   1 0 1 1 1   1 1 1 1 1
0 0 0 1 0   0 0 1 1 0   0 1 0 1 1   1 0 0 0 0   1 1 0 1 1
0 0 0 1 1   0 0 1 1 1   0 1 1 0 1   1 0 0 0 1   1 1 1 0 1
For example the input may be 0 1 1 0 1 if the previously sorted vector was 0 1 1 1 1 and the 4th element was modified, but there is no way to form 0 1 0 1 0 as a valid input, because it differs in at least 2 elements from any sorted vector.
This would be a valid sorting network for re-sorting these inputs:
>--*---*-----*-------->
   |   |     |
>--*---|-----|-*---*-->
       |     | |   |
>--*---|-*---*-|---*-->
       | |     |   |
>--*---*-|-----*---*-->
         |         |
>--------*---------*-->
We do not care that this network fails to sort some invalid inputs (e.g. 0 1 0 1 0.)
And this network has depth 4, a saving of 1 compared with the general case (a depth of 5 is generally necessary to sort a 5-element vector).
Unfortunately the brute-force approach is not feasible for larger input sizes.
Is there a known method for constructing a network to re-sort a larger vector?
My N values will be in the order of a few hundred, with M not much more than √N.
Ok, I'm posting this as an answer since the comment restriction on length drives me nuts :)
You should try this out:
implement a simple sequential sort working on local memory (insertion sort or something similar). If you don't know how, I can help with that.
have only a single work-item perform the sorting on the chunk of N elements
calculate the maximum size of local memory per work-group (call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE) and derive the maximum number of work-items per work-group,
because with this approach your number of work-items will most likely be limited by the amount of local memory.
This will probably work rather well I suspect, because:
a simple sort may be perfectly fine, especially since the array is already sorted to a large degree
parallelizing for such a small number of items is not worth the trouble (using local memory however is!)
since you're processing billions of such small arrays, you will achieve a great occupancy even if only single work-items process such arrays
Let me know if you have problems with my ideas.
EDIT 1:
I just realized I used a technique that may be confusing to others:
My proposal for using local memory is not for synchronization or using multiple work items for a single input vector/array. I simply use it to get a low read/write memory latency. Since we use rather large chunks of memory I fear that using private memory may cause swapping to slow global memory without us realizing it. This also means you have to allocate local memory for each work-item. Each work-item will access its own chunk of local memory and use it for sorting (exclusively).
I'm not sure how good this idea is, but I've read that using too much private memory may cause swapping to global memory and the only way to notice is by looking at the performance (not sure if I'm right about this).
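For the "simple sequential sort" step, here is a plain Python sketch (my own illustration; the real thing would live inside the OpenCL kernel). Insertion sort suits this input because its cost is roughly O(N + number of inversions), so an almost-sorted array with only a few out-of-place elements is cheap to fix:

def insertion_sort(a):
    # In-place insertion sort; nearly-sorted input costs about O(N + inversions).
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

print(insertion_sort([10, 20, 30, 40, 999, 60, 70, 80, 90, -1]))
# -> [-1, 10, 20, 30, 40, 60, 70, 80, 90, 999]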
Here is an algorithm which should yield very good sorting networks. Probably not the absolute best network for all input sizes, but hopefully good enough for practical purposes.
store (or have available) pre-computed networks for n < 16
sort the largest 2^k elements with an optimal network, e.g. bitonic sort for the largest power of 2 less than or equal to n.
for the remaining elements, repeat #2 until m < 16, where m is the number of unsorted elements
use a known optimal network from #1 to sort any remaining elements
merge sort the smallest and second-smallest sub-lists using a merge sorting network
repeat #5 until only one sorted list remains
All of these steps can be done artificially, and the comparisons stored into a master network instead of acting on the data.
It is worth pointing out that the (bitonic) networks from #2 can be run in parallel, and the smaller ones will finish first. This is good, because as they finish, the networks from #5-6 can begin to execute.
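As a small illustration of step 2 (my own sketch, not part of the answer above), this is how the comparator pairs of a bitonic sorting network for a power-of-two size can be generated and then applied; a re-sorting network would reuse such pieces and merge them as described:

def bitonic_network(n):
    # n must be a power of two. Returns comparator pairs (lo, hi): after the
    # comparator, position lo holds the smaller value and hi the larger one.
    pairs = []
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    pairs.append((i, partner) if ascending else (partner, i))
            j //= 2
        k *= 2
    return pairs

def apply_network(values, pairs):
    values = list(values)
    for lo, hi in pairs:
        if values[lo] > values[hi]:
            values[lo], values[hi] = values[hi], values[lo]
    return values

net = bitonic_network(8)
print(apply_network([10, 20, 30, 999, 40, 60, 70, -1], net))
# -> [-1, 10, 20, 30, 40, 60, 70, 999]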

Multiple Inputs for Backpropagation Neural Network

I've been working on this for about a week. There are no errors in my code; I just need to get the algorithm and concept right. I've implemented a neural network consisting of 1 hidden layer. I use the backpropagation algorithm to correct the weights.
My problem is that the network can only learn one pattern. If I train it with the same training data over and over again, it produces the desired outputs when given input that is numerically close to the training data.
training_input:1, 2, 3
training_output: 0.6, 0.25
after 300 epochs....
input: 1, 2, 3
output: 0.6, 0.25
input 1, 1, 2
output: 0.5853, 0.213245
But if I use multiple varying training sets, it only learns the last pattern. Aren't neural networks supposed to learn multiple patterns? Is this a common beginner mistake? If so, please point me in the right direction. I've looked at many online guides, but I've never seen one that goes into detail about dealing with multiple inputs. I'm using sigmoid for the hidden layer and tanh for the output layer.
Example training arrays:
13 tcp telnet SF 118 2425 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 26 10 0.38 0.12 0.04 0 0 0 0.12 0.3 anomaly
0 udp private SF 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0.75 0.5 0 255 254 1 0.01 0.01 0 0 0 0 0 anomaly
0 tcp telnet S3 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 255 79 0.31 0.61 0 0 0.21 0.68 0.6 0 anomaly
The last column (anomaly/normal) is the expected output. I turn everything into numbers, so each word can be represented by a unique integer.
I give the network one array at a time, then I use the last column as the expected output to adjust the weights. I have around 300 arrays like these.
As for the hidden neurons, I tried 3, 6 and 20, but nothing changed.
To update the weights, I calculate the gradient for the output and hidden layers. Then I calculate the deltas and add them to their associated weights. I don't understand how that is ever going to learn to map multiple inputs to multiple outputs. It looks linear.
If you train a neural network too many times (in terms of iterations through the back-propagation algorithm) on one data set, the weights will eventually converge to a state where the network gives the best outcome for that specific training set (this is overtraining, in machine learning terms). It will only learn the relationships between input and target data for that specific training set, but not the broader, more general relationship you might be looking for. It's better to merge some distinctive sets and train your network on the full set.
Without seeing the code for your back-propagation algorithm I can't say whether it's working correctly. One problem I had when implementing back-propagation was not properly calculating the derivative of the activation function around the input value. This website was very helpful for me.
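To make the "merge the sets and train on the full set" point concrete, here is a minimal sketch of the training loop structure (my own illustration, not your code; network and update_weights are placeholders for your own objects and back-propagation step). It iterates over all examples every epoch, shuffling each time, instead of running many epochs on a single example:

import random

def train(network, dataset, epochs, update_weights):
    # dataset: a list of (inputs, targets) pairs, e.g. your ~300 arrays.
    # update_weights: one back-propagation step for a single example.
    for epoch in range(epochs):
        random.shuffle(dataset)          # present the examples in a new order each epoch
        for inputs, targets in dataset:  # every example is seen once per epoch
            update_weights(network, inputs, targets)

The key difference from training repeatedly on one array is that each weight update sees a different example, so the network is pulled towards a compromise that fits all of them.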
No, neural networks are not supposed to know multiple tricks.
You train them for a specific task.
Yes, they can be trained for other tasks as well,
but then they get optimized for that other task.
That's why you should create load and save functions for your network, so that you can easily switch "brains" and perform other tasks if required.
If you're not sure which task it is currently trained for, train a network to find the difference between the tasks.
