DSE - Cassandra : Commit Log Disk Impact on Performances - performance

I'm running a DSE 4.6.5 Cluster (Cassandra 2.0.14.352).
Following DataStax's guidelines, on every machine I separated the data directory from the commit log/saved caches directories:
data is on blazing fast drives
commit log and saved caches are on the system drives: 2 HDDs in RAID 1
Monitoring the disks with OpsCenter while performing intensive writes, I see no issue with the former; however, I see the queue size of the latter (commit log) averaging around 300 to 400 with spikes up to 700 requests. Of course, the latency is also fairly high on these drives...
Is this affecting the performance of my cluster?
Would you recommend putting the commit log and saved caches on an SSD, separate from the system disks?
Thanks.
Edit - Adding tpstats from one of the nodes:
[root@dbc4 ~]# nodetool tpstats
Pool Name                Active  Pending  Completed  Blocked  All time blocked
ReadStage                     0        0      15938        0                 0
RequestResponseStage          0        0  154745533        0                 0
MutationStage                 1        0  306973172        0                 0
ReadRepairStage               0        0        253        0                 0
ReplicateOnWriteStage         0        0          0        0                 0
GossipStage                   0        0     340298        0                 0
CacheCleanupExecutor          0        0          0        0                 0
MigrationStage                0        0          0        0                 0
MemoryMeter                   1        1      36284        0                 0
FlushWriter                   0        0      23419        0               996
ValidationExecutor            0        0          0        0                 0
InternalResponseStage         0        0          0        0                 0
AntiEntropyStage              0        0          0        0                 0
MemtablePostFlusher           0        0      27007        0                 0
MiscStage                     0        0          0        0                 0
PendingRangeCalculator        0        0          7        0                 0
CompactionExecutor            8       10       7400        0                 0
commitlog_archiver            0        0          0        0                 0
HintedHandoff                 0        1        222        0                 0

Message type      Dropped
RANGE_SLICE             0
READ_REPAIR             0
PAGED_RANGE             0
BINARY                  0
READ                    0
MUTATION            49547
_TRACE                  0
REQUEST_RESPONSE        0
COUNTER_MUTATION        0
Edit 2 - Adding sar output:
04:10:02 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
04:10:02 PM     all     22.25     26.33      1.93      0.48      0.00     49.02
04:20:01 PM     all     23.23     26.19      1.90      0.49      0.00     48.19
04:30:01 PM     all     23.71     26.44      1.90      0.49      0.00     47.45
04:40:01 PM     all     23.89     26.22      1.86      0.47      0.00     47.55
04:50:01 PM     all     23.58     26.13      1.88      0.53      0.00     47.88
Average:        all     21.60     26.12      1.71      0.56      0.00     50.01

Monitoring disks with OpsCenter while performing intensive writes, I see no issue with the first,
Cassandra persists writes in memory (the memtable) and in the commit log (on disk).
When the memtable grows past a threshold, or when you trigger it manually, Cassandra writes everything to disk (flushes the memtables).
To make sure your setup can handle your workload, try manually flushing all the memtables on a node:
nodetool flush
Or flush just a specific keyspace/table with:
nodetool flush [keyspace] [columnfamily]
At the same time, monitor your disks' I/O.
If you see high I/O wait, you can either spread the workload by adding more nodes, or switch the data drives to better ones with higher throughput.
Keep an eye on dropped mutations (these can also come from other nodes sending writes/hints) and on blocked FlushWriter tasks.
I see the queue size of the latter (commit log) averaging around 300 to 400 with spikes up to 700 requests.
Those will probably be your writes to the commit log.
Is your hardware serving anything else? Is it software RAID? Is swap disabled?
Cassandra works best alone :) So yes, at the very least put the commit log on a separate disk (it can be a smaller one).

Related

Replicas created on same node before being transferred

I have an Elasticsearch cluster made up of 3 nodes.
Every day, I have a batch that feeds in a new index composed of 3 shards then scales the number of replicas to 1. So at the end of the day I'm expecting every node to carry 1 primary and 1 replica.
The figure below shows the disk space usage on each node during this operation.
On node 0 everything seems to be going smoothly during that operation.
However, node 2 is idle most of the time at the beginning, while node 1 seems to be taking care of its own replica plus node 2's replica before transferring it to node 2 (this is my own understanding; I might be wrong). This is putting a lot of pressure on the disk usage of node 1, which almost reaches 100% of disk space.
Why this behaviour? Shouldn't every node take care of its own replica here to even out the load? Can I force it to do so somehow? This is worrying because when a disk reaches 100%, the entire node goes down, as has happened in the past.
UPDATE to Val's answer:
You will find the outputs below
GET _cat/shards/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index                           shard prirep state      docs  store ip            node
xxxxxxxxxxxxxxxxxxxxxx_20210617 1     p      STARTED 8925915 13.4gb 172.23.13.255 es-master-0
xxxxxxxxxxxxxxxxxxxxxx_20210617 1     r      STARTED 8925915 13.4gb 172.23.10.76  es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 2     r      STARTED 8920172 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 2     p      STARTED 8920172 13.4gb 172.23.10.76  es-master-2
xxxxxxxxxxxxxxxxxxxxxx_20210617 0     p      STARTED 8923889 13.4gb 172.23.24.221 es-master-1
xxxxxxxxxxxxxxxxxxxxxx_20210617 0     r      STARTED 8923889 13.5gb 172.23.13.255 es-master-0
GET _cat/recovery/xxxxxxxxxxxxxxxxxxxxxx_20210617?v
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 382ms empty_store done n/a n/a 172.23.24.221 es-master-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 0 21.9m peer done 172.23.24.221 es-master-1 172.23.13.255 es-master-0 n/a n/a 188 188 100.0% 188 14467579393 14467579393 100.0% 14467579393 55835 55835 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 395ms empty_store done n/a n/a 172.23.13.255 es-master-0 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 1 9m peer done 172.23.13.255 es-master-0 172.23.10.76 es-master-2 n/a n/a 188 188 100.0% 188 14486949488 14486949488 100.0% 14486949488 0 0 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 17.8m peer done 172.23.10.76 es-master-2 172.23.24.221 es-master-1 n/a n/a 134 134 100.0% 134 14470475298 14470475298 100.0% 14470475298 1894 1894 100.0%
xxxxxxxxxxxxxxxxxxxxxx_20210617 2 409ms empty_store done n/a n/a 172.23.10.76 es-master-2 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
First, if you have 3 nodes and your index has 3 primaries with each having 1 replica, there's absolutely no guarantee whatsoever that each node will hold one primary and one replica.
The only guarantees you have are that:
the shard count will be balanced over the nodes, and
a primary and its replica will never land on the same node.
That being said, it's perfectly possible for one node to get two primaries, another to get two replicas, and the third to get one primary and one replica.
Looking at the chart, what I think happens in your case is that
node 2 gets two primaries and
node 0 gets one primary
Then, when you add the replica:
node 0 (which has only one primary) gets one replica (the curve is less steep)
node 1 (which has nothing so far) gets two replicas (the curve grows steeper)
node 2 stays flat because it already has two primaries
A little later, when node 1's disk approaches saturation, one shard is relocated away from it to node 2 (at 23:16 the curve starts to increase).
The end situation seems to be:
node 0 with one primary and one replica
node 1 with only one replica
node 2 with two primaries and one replica
I think it would be nice to confirm this with the following two commands:
# you can see where each shard is located now
GET _cat/shards/tax*?v
# you can see which shards went from which node to which node
GET _cat/recovery/indexname*?v

HBase wal getting bigger and bigger

My HBase WAL keeps getting bigger and bigger. The details are as follows:
3.2 K 9.6 K /hbase/.hbase-snapshot
0 0 /hbase/.hbck
0 0 /hbase/.tmp
0 0 /hbase/MasterProcWALs
534.2 G 1.6 T /hbase/WALs
400.3 M 1.2 G /hbase/archive
0 0 /hbase/corrupt
267.0 G 796.5 G /hbase/data
42 126 /hbase/hbase.id
7 21 /hbase/hbase.version
0 0 /hbase/mobdir
1.7 M 5.1 M /hbase/oldWALs
0 0 /hbase/staging
HBase version 2.1.0. I set the following parameters:
hbase.master.logcleaner.ttl 60s
hbase.wal.regiongrouping.numgroups 2
hbase.regionserver.maxlogs 32
I calculated that my actual data size is equal to the size of the /hbase/data directory. I tried deleting the older WAL data manually, but then the program reports an exception.
My data is mainly written through Phoenix. Could that be the reason?
In RegionServer logs, I found the following:
2020-11-07 00:36:45,750 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=385, max=32; forcing flush of 10 regions(s): 12bcc8a5c4d5087f7c8607cc943250fc, 94fa37159fb219dd510d20eba243e5a3, b4c2ad111677e61758ada3af4c1e9dbb, d6c6928e0a400ce53c9d8ee614dc552c, d224956d3d3f6f7657d0e9c9c4c544cc, 02bda653c9a66d85b064c4070f3f9e9e, 9297d8fbbd3b535ac0807e3665517752, 5ffdd4261cb294a1922fdac8fbfeef2f, 27c13a71b1d90dc7640f7df2dfd2093b, 159931a1de7fab4a04cdc3bd967d77bc
ah....
https://issues.apache.org/jira/browse/PHOENIX-5250

Algorithm for iteratively testing 2d grid connectiveness

Let's say that I have a 2D grid where each index holds either a zero or a one. The grid starts off full of zeros, and ones are progressively added. At each step, I want to verify that adding the next one will not prevent the zeros from forming one connected component (using a 4-connected grid with north, east, south, and west neighbors).
What is a fast algorithm that will iteratively test a 2D grid for connectedness?
Currently I am using a flood fill at each iteration, but I feel there should be a faster algorithm that uses information from previous iterations.
Additionally, the method that places the ones will sometimes unplace the ones even if they don't disconnect the grid, so the algorithm I'm looking for needs to be able to handle that.
This is inspired by Kruskal's algorithm for maze generation.
I am defining the neighborhood of a square as its 8 surrounding squares, including the outside of the grid (the neighborhood of a corner square is its 3 surrounding squares plus the outside, so 4 "squares" total).
Put the 1s in sets so that any two neighboring 1s belong to the same set. Treat the outside of the grid as one big 1 (which means the first set contains it). When adding a 1, you only need to check its neighbors.
Below are all the possible cases. To make it easier to visualize, I'll number the sets starting from 1 and use the set number instead of the 1 in each square that contains a 1. The outside belongs to the set numbered 1. You can also use this to simplify the implementation. The brackets indicate the newly placed 1.
If the new 1 has no neighboring 1, then it belongs to a new set.
0 0 0 0 0
0 2 0 0 0
0 0 0[3]0
0 0 0 0 0
0 0 1 0 0
If it has one neighboring 1, then it belongs to the same set.
0 0 0 0 0
0 2 0 0 0
0 0[2]0 0
0 0 0 0 0
0 0 1 0 0
If it has multiple neighboring 1s, and all neighbors belonging to the same set are direct neighbors, then you can merge the sets and the new 1 belongs to the resulting set. You don't need to check for a disconnection.
0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 1 0 0 0
0 0[3]1 0 -> 0 0[1]1 0
0 0 1 1 0 0 0 1 1 0
0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 1 0 0 0
0 2 0 1 0 -> 0 1 0 1 0
[3]0 0 1 0 [1]0 0 1 0
1 1 1 0 0 1 1 1 0 0
If it has multiple neighboring 1s of the same set, but they are not all direct neighbors, then you have a disconnection.
0 0 0 0 0 0 0 0 0 0 <- first group of 0s
0 2 0 0 0 0 1 0 0 0
0 0[3]1 0 -> 0 0[1]1 0
0 1 0 1 1 0 1 0 1 1
1 0 0 0 0 1 0 0 0 0 <- second group of 0s
0 0 0 0 0 <- first group of 0s
0 0 1 0 0
0 1 0 1 1
[1]1 0 0 0
0 0 0 0 0 <- second group of 0s
0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 1 0 0 0
0 2 0 1 0 -> 0 1 0 1 0
[3]0 0 1 0 [1]0 0 1 0
0{1}1 0 0 -> 0{1}1 0 0 <- lone 0
In this last example, the 1 marked {1} and the outside technically are neighbors, but not from the point of view of the newly placed 1.
In the general case, when removing a 1 that has multiple neighboring 1s, you need to check whether they are still connected after the removal (for example, by running a pathfinder between them). If not, separate them into different sets.
If you know the 0s are all connected, then you can check locally: removing a 1 will not split the set it belongs to if its neighbors are all direct neighbors (be careful with the outside, though). It will if there are multiple "gaps" in its neighborhood.
In the special case where you only remove the 1s in the reverse order you added them, you can keep track of which newly added 1s join multiple sets (and even what the sets are at that moment, if you need). These will split their set when you remove them later on.
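The placement rule above can be sketched in code. This is a minimal sketch (not a full solution), assuming a representation where grid[y][x] holds 0 for an empty square or the id of the set its 1 belongs to, with the outside of the grid acting as set 1:

```python
def would_disconnect(grid, x, y):
    """True if placing a 1 at (x, y) would split the 0s into two groups."""
    h, w = len(grid), len(grid[0])
    # Walk the 8-neighborhood in clockwise cyclic order; cells outside
    # the grid are reported as the outside set (id 1).
    ring = []
    for dx, dy in [(-1, -1), (0, -1), (1, -1), (1, 0),
                   (1, 1), (0, 1), (-1, 1), (-1, 0)]:
        nx, ny = x + dx, y + dy
        if 0 <= nx < w and 0 <= ny < h:
            ring.append(grid[ny][nx])   # 0 or a set id
        else:
            ring.append(1)              # outside counts as set 1
    # A set whose cells do not form one contiguous arc around the new 1
    # has members that are "not all direct neighbors": disconnection.
    for s in set(ring) - {0}:
        arcs = sum(1 for i in range(8)
                   if ring[i] == s and ring[i - 1] != s)  # cyclic: ring[-1]
        if arcs > 1:
            return True
    return False
```

A full solution would also maintain the sets themselves (a union-find structure keeps the merges cheap); this sketch only covers the disconnection test performed when a 1 is placed.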

Multi-core CPU interrupts

How do multi-core processors handle interrupts?
I know how single-core processors handle interrupts.
I also know of the different types of interrupts.
I want to know how multi-core processors handle hardware, program, CPU timer, and input/output interrupts.
This should be considered a continuation of, or an expansion of, the other answer.
Most multiprocessors support programmable interrupt controllers such as Intel's APIC. These are complicated chips that consist of a number of components, some of which may be part of the chipset. At boot time, all I/O interrupts are delivered to core 0 (the bootstrap processor). Then, in an APIC system, the OS can specify for each interrupt which core(s) should handle it. If more than one core is specified, the APIC system decides which of those cores handles an incoming interrupt request. This is called interrupt affinity. Many scheduling algorithms have been proposed for both the OS and the hardware. One obvious technique is to load-balance the system by scheduling the interrupts in a round-robin fashion; another is this technique from Intel that attempts to balance performance and power.
On a Linux system, you can open /proc/interrupts to see how many interrupts of each type were handled by each core. The contents of that file may look something like this on a system with 8 logical cores:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 19 0 0 0 0 0 0 0 IR-IO-APIC 2-edge timer
1: 1 1 0 0 0 0 0 0 IR-IO-APIC 1-edge i8042
8: 0 0 1 0 0 0 0 0 IR-IO-APIC 8-edge rtc0
9: 0 0 0 0 1 0 0 2 IR-IO-APIC 9-fasteoi acpi
12: 3 0 0 0 0 0 1 0 IR-IO-APIC 12-edge i8042
16: 84 4187879 7 3 3 14044994 6 5 IR-IO-APIC 16-fasteoi ehci_hcd:usb1
19: 1 0 0 0 6 8 7 0 IR-IO-APIC 19-fasteoi
23: 50 2 0 3 273272 8 1 4 IR-IO-APIC 23-fasteoi ehci_hcd:usb2
24: 0 0 0 0 0 0 0 0 DMAR-MSI 0-edge dmar0
25: 0 0 0 0 0 0 0 0 DMAR-MSI 1-edge dmar1
26: 0 0 0 0 0 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
27: 11656 381 178 47851679 1170 481 593 104 IR-PCI-MSI 512000-edge 0000:00:1f.2
28: 5 59208205 0 1 3 3 0 1 IR-PCI-MSI 409600-edge eth0
29: 274 8 29 4 15 18 40 64478962 IR-PCI-MSI 32768-edge i915
30: 19 0 0 0 2 2 0 0 IR-PCI-MSI 360448-edge mei_me
31: 96 18 23 11 386 18 40 27 IR-PCI-MSI 442368-edge snd_hda_intel
32: 8 88 17 275 208 301 43 76 IR-PCI-MSI 49152-edge snd_hda_intel
NMI: 4 17 30 17 4 5 17 24 Non-maskable interrupts
LOC: 357688026 372212163 431750501 360923729 188688672 203021824 257050174 203510941 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 4 17 30 17 4 5 17 24 Performance monitoring interrupts
IWI: 2 0 0 0 0 0 0 140 IRQ work interrupts
RTR: 0 0 0 0 0 0 0 0 APIC ICR read retries
RES: 15122413 11566598 15149982 12360156 8538232 12428238 9265882 8192655 Rescheduling interrupts
CAL: 4086842476 4028729722 3961591824 3996615267 4065446828 4033019445 3994553904 4040202886 Function call interrupts
TLB: 2649827127 3201645276 3725606250 3581094963 3028395194 2952606298 3092015503 3024230859 TLB shootdowns
TRM: 169827 169827 169827 169827 169827 169827 169827 169827 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 7194 7194 7194 7194 7194 7194 7194 7194 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 0 0 0 0 Posted-interrupt wakeup event
The first column specifies the interrupt request (IRQ) number; every IRQ number in use appears in the list. The file /proc/irq/N/smp_affinity contains a single value that specifies the affinity of IRQ N. How this value should be interpreted depends on the current mode of operation of the APIC.
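As a small illustration, the smp_affinity value is a hexadecimal bitmask in which bit N set means logical CPU N may service the IRQ (large masks are printed in comma-separated 32-bit groups). The helper below is hypothetical, not part of any kernel API:

```python
# Hypothetical helper: decode the hex bitmask read from
# /proc/irq/N/smp_affinity into the list of allowed logical CPUs.
def affinity_cpus(mask_hex):
    mask = int(mask_hex.replace(",", ""), 16)  # drop 32-bit group commas
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

# Example: "0b" is binary 1011, so the IRQ may go to CPUs 0, 1 and 3.
```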
A logical core can receive multiple I/O and IPI interrupts. At that point, local interrupt scheduling takes place, which is also configurable by assigning priorities to interrupts.
Other programmable interrupt controllers are similar.
In general, this depends on the particular system you have under test.
The more general approach is to have a specific chip in each processor1 that is assigned, either statically or dynamically2, a unique ID and that can send and receive interrupts over a shared or dedicated bus.
The IDs allow specific processors to be targeted by interrupts.
Code running on processor A can ask its interrupt chip to raise an interrupt on processor B; when this happens, a message is sent along the above-mentioned bus and routed to processor B, where the local interrupt chip picks it up, decodes it, and raises the corresponding interrupt.
At the system level, one or more general interrupt controllers are present to route interrupt requests from the I/O devices (on any bus) to the processors.
These controllers are programmable; the OS can balance the interrupt load across all the processors (or implement any other convenient policy).
This is the most flexible approach; a hard-wired approach is also possible.
In that case, processor A's signals are wired directly to processor B's inputs and vice versa; asserting these signals gives rise to an interrupt on the target processor.
The general concept is called Inter-processor Interrupt (IPI).
The x86 architecture follows the first approach closely3 (beware of the nomenclature though, processor has a different meaning).
Other architectures may not, like the IBM OS/360 M65MP that uses a wired approach4.
Software-generated interrupts are just instructions in a program; each processor executes its own instruction stream, so if program X generates an exception while running on processor A, it is processor A that handles it.
Task scheduling is usually distributed across all the processors (that's what Linux does).
Time-keeping is usually done by a designated processor that serves a hardware timer interrupt.
This is not always the case, though; I haven't looked at the precise implementation details of modern OSes.
1 Usually an integrated chip, so we can think of it as a functional unit of the processor.
2 By a power-on protocol.
3 Actually, this is reverse causality.
4 I'm following the Wikipedia examples.

Multiple Inputs for Backpropagation Neural Network

I've been working on this for about a week. There are no errors in my coding, I just need to get algorithm and concept right. I've implemented a neural network consisting of 1 hidden layer. I use the backpropagation algorithm to correct the weights.
My problem is that the network can only learn one pattern. If I train it with the same training data over and over again, it produces the desired outputs when given input that is numerically close to the training data.
training_input:1, 2, 3
training_output: 0.6, 0.25
after 300 epochs....
input: 1, 2, 3
output: 0.6, 0.25
input 1, 1, 2
output: 0.5853, 0.213245
But if I use multiple varying training sets, it only learns the last pattern. Aren't neural networks supposed to learn multiple patterns? Is this a common beginner mistake? If yes, then point me in the right direction. I've looked at many online guides, but I've never seen one that goes into detail about dealing with multiple inputs. I'm using sigmoid for the hidden layer and tanh for the output layer.
Example training arrays:
13 tcp telnet SF 118 2425 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 26 10 0.38 0.12 0.04 0 0 0 0.12 0.3 anomaly
0 udp private SF 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 3 0 0 0 0 0.75 0.5 0 255 254 1 0.01 0.01 0 0 0 0 0 anomaly
0 tcp telnet S3 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 255 79 0.31 0.61 0 0 0.21 0.68 0.6 0 anomaly
The last column (anomaly/normal) is the expected output. I turn everything into numbers, so each word is represented by a unique integer.
I give the network one array at a time, then use the last column as the expected output to adjust the weights. I have around 300 arrays like these.
As for the hidden neurons, I tried 3, 6 and 20, but nothing changed.
To update the weights, I calculate the gradient for the output and hidden layers. Then I calculate the deltas and add them to their associated weights. I don't understand how that is ever going to learn to map multiple inputs to multiple outputs. It looks linear.
If you train a neural network too much on one data set (in terms of the number of iterations through the back-propagation algorithm), the weights will eventually converge to a state that gives the best outcome for that specific training set (overfitting, in machine-learning terms). It will learn only the relationships between the input and target data of that specific training set, not the broader, more general relationship you might be looking for. It's better to merge some distinct sets and train your network on the full set.
Without seeing the code for your back-propagation algorithm, I can't tell you whether it's working correctly. One problem I had when implementing back-propagation was not properly calculating the derivative of the activation function around the input value. This website was very helpful for me.
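To make that derivative pitfall concrete, here is a small illustrative sketch (not the asker's code): the sigmoid derivative can be written in terms of the pre-activation input or of the already-activated output, and feeding the wrong value to the wrong form silently produces bad gradients:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Form 1: takes the PRE-activation value (the raw weighted sum).
def dsigmoid_from_input(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Form 2: takes the POST-activation value y = sigmoid(x).
# Passing x here instead of y is a classic backprop bug.
def dsigmoid_from_output(y):
    return y * (1.0 - y)
```

A quick numerical check, comparing one of these against a central difference of sigmoid around the same point, catches this class of bug early.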
No, neural networks are not supposed to know multiple tricks.
You train them for a specific task.
Yes, they can be trained for other tasks as well,
but then they get optimized for another task.
That's why you should create load and save functions for your network, so that you can easily switch brains and perform other tasks if required.
If you're not sure which task it is currently trained for, train a neural network to find the difference between the tasks.
